Load-time benchmark
Measures how long testium takes to load a .tum test tree — template
rendering (jinja) + YAML parsing + test-tree construction — without executing
it. Purpose: get reproducible numbers before/after load-path optimisations, and
attribute any gain to a specific part of the pipeline.
It is meant for very long tests, the kind you can build with jinja loops and
!include, where load time becomes noticeable.
Files
| File | Role |
|---|---|
| gen_bench_test.py | Generates a synthetic .tum tree (the test input). |
| load_bench.py | Drives the real loader in-process and times it. |
| run.sh | Convenience: generate + time across profiles, using the project venv. |
| cases/ | Generated trees (git-ignored, recreated on demand). |
The benchmark .tum files are generated, not committed — the generator is
the artifact. They use only let leaves and group containers, so loading has
no runtime side effect (no subprocess, no <| |> eval) and the timing reflects
the parse/build pipeline alone.
Quick start
    # default matrix (all profiles), 5 repeats each
    ./test/benchmark/run.sh

    # one profile at one size
    ./test/benchmark/run.sh repeat 2000

    # more repeats for a tighter min
    REPEAT=10 ./test/benchmark/run.sh includes 1000
test/benchmark/run.sh uses the project venv at test/tmp/.venv, which is created
by the top-level ./run.sh. If the venv is missing, run ./run.sh once first.
To drive the harness directly on any .tum (not just generated ones):
    test/tmp/.venv/bin/python3 test/benchmark/load_bench.py --repeat 5 --quiet path/to/main.tum
Profiles
Each profile isolates one cost. --size is the profile-specific count.
| Profile | What it builds | Stresses |
|---|---|---|
| flat | one main file, N inline let steps | big YAML parse + linear object build |
| includes | main !includes N distinct sub-files | per-include template+YAML+tempfile, sequence splice |
| repeat | main !includes the same parametrised leaf N times | jinja recompilation of an identical template |
| jinja | one main file, {% for %} emitting N steps | single large render + single large parse |
| deep | nested includes, depth N | include recursion (see caveat) |
| mix | groups + jinja loop + distinct + repeated includes | realistic blend |
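
The "sequence splice" cost listed for the includes profile can be pictured with plain Python lists. This is only a sketch: how the loader actually splices included steps into the parent sequence is not shown in this README, and `expand`, `splice_by_rebuild` and `splice_single_pass` are hypothetical names. The first variant rebuilds the whole list for every include (quadratic in the worst case), the second appends into a fresh list once.

```python
def expand(step):
    # Hypothetical include marker: "include:x" expands to two sub-steps.
    if step.startswith("include:"):
        name = step.split(":", 1)[1]
        return [f"{name}.{k}" for k in range(2)]
    return None  # not an include


def splice_by_rebuild(steps):
    # Copies the whole list for each include: O(n) work per include.
    out = list(steps)
    i = 0
    while i < len(out):
        sub = expand(out[i])
        if sub is None:
            i += 1
        else:
            out = out[:i] + sub + out[i + 1:]
            i += len(sub)  # skip past the steps just spliced in
    return out


def splice_single_pass(steps):
    # Builds the result once: each step is appended exactly one time.
    out = []
    for step in steps:
        sub = expand(step)
        if sub is None:
            out.append(step)
        else:
            out.extend(sub)
    return out


steps = ["let1", "include:x", "let2", "include:y"]
flat_a = splice_by_rebuild(steps)
flat_b = splice_single_pass(steps)
```

Both variants produce the same flattened sequence; only the amount of copying differs, which is why a profile with many includes would be the one to expose it.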
Reading the output
    phase        min    median
    initial   0.1131    0.1285   <- pass 1: discover config files (no includes)
    loadtest  1.0724    1.0900   <- config fixpoint loop + full recursive include load
    build     0.1850    0.1976   <- TestSet: load_test_recursively tree build
    total     1.3886    1.4227
    counters (last run):
      templates : 1003 calls   0.5247s (exclusive: jinja compile+render+tempfile)
      yaml      : 1004 parses  1.4696s (inclusive of nested includes)
- min is the headline (least noisy); median is a sanity check.
- initial / loadtest / build map to the three pipeline stages in interpreter/process.py and interpreter/test_set.py. The main file is rendered+parsed across initial and loadtest (the loader does ~3 passes).
- templates = number of template_to_test() calls and their exclusive wall time (one file render each: pure jinja compile+render+tempfile I/O). A high count with the same source file = recompilation, the repeat case.
- yaml = number of yaml_load() parses. Its time is inclusive of nested includes, so use the count for attribution, not the seconds.
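
Why an inclusive counter should not be read as seconds spent can be shown with a small stand-alone sketch. It is not the harness's code: `InclusiveTimer` and `parse` are hypothetical, and `time.sleep` stands in for parse work. Each nested call is timed in full, so the accumulated seconds exceed the real wall time, which may be why the yaml seconds in the sample above are larger than the total.

```python
import time


class InclusiveTimer:
    """Counts calls and sums wall time per call, nested calls included."""

    def __init__(self):
        self.calls = 0
        self.seconds = 0.0

    def timed(self, func):
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                return func(*args, **kwargs)
            finally:
                self.calls += 1
                self.seconds += time.perf_counter() - start
        return wrapper


yaml_timer = InclusiveTimer()


@yaml_timer.timed
def parse(depth):
    # Hypothetical parse: some work, then one nested include.
    time.sleep(0.01)
    if depth > 0:
        parse(depth - 1)


start = time.perf_counter()
parse(3)  # one top-level parse with three nested levels
wall = time.perf_counter() - start

print(f"calls={yaml_timer.calls} inclusive={yaml_timer.seconds:.3f}s wall={wall:.3f}s")
```

The call count stays exact (four parses here), while the inclusive seconds count the inner levels several times over.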
Mapping to the optimisation axes
| Axis (see DESIGN / discussion) | Watch | Best profile to prove it |
|---|---|---|
| 1: cache compiled jinja templates | templates time drops, count unchanged | repeat |
| 2: drop the tempfile round-trip | templates time drops | includes, repeat, mix |
| 3: C YAML loader (libyaml) | yaml time / loadtest drops | flat, jinja |
| 6: O(n²) sequence splice | build drops | includes, mix |
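
The idea behind axis 1 can be sketched with the standard library alone. This is an illustration, not the loader's implementation: `string.Template` stands in for a jinja template, and `compile_template`, `get_template` and the leaf text are hypothetical. The point is that the render count stays at N while the compile step runs once per distinct source.

```python
from string import Template

compile_count = 0  # how many times the (stand-in) compile step ran
_cache = {}        # template source text -> compiled template


def compile_template(source):
    # Stand-in for the expensive compile step.
    global compile_count
    compile_count += 1
    return Template(source)


def get_template(source):
    # Compile on first sight of a source string, reuse afterwards.
    tpl = _cache.get(source)
    if tpl is None:
        tpl = _cache[source] = compile_template(source)
    return tpl


leaf = "step_$index: value_$index"  # same parametrised leaf, rendered N times
rendered = [get_template(leaf).substitute(index=i) for i in range(1000)]
```

With the cache in place the 1000 renders share a single compile, which matches the "time drops, count unchanged" signature expected in the templates counter for the repeat profile.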
How to compare before/after a change
- Run the matrix on the current code, keep the output.
- Apply one axis.
- Re-run the same profiles/sizes; compare min per phase and the counters.
Change one axis at a time so the attribution is clean. Run on an idle machine (and note the disk: on a USB stick the tempfile round-trip of axis 2 weighs more).
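A before/after comparison on min per phase can be scripted in a few lines. The sketch below is not part of the harness: `summarise` and `relative_change` are hypothetical helpers, and the timings are made-up placeholders, not measurements.

```python
import statistics


def summarise(samples):
    # samples: phase name -> list of per-repeat timings in seconds
    return {phase: (min(v), statistics.median(v)) for phase, v in samples.items()}


def relative_change(before, after):
    # Percentage change of the min per phase; negative means faster.
    return {
        phase: (after[phase][0] - before[phase][0]) / before[phase][0] * 100.0
        for phase in before
    }


# Placeholder numbers for illustration only.
before = summarise({"loadtest": [1.0, 1.2, 1.1], "build": [0.2, 0.25, 0.2]})
after = summarise({"loadtest": [0.5, 0.6, 0.55], "build": [0.2, 0.21, 0.2]})
change = relative_change(before, after)

for phase, pct in change.items():
    print(f"{phase:<10} {pct:+.1f}%")
```

A phase whose min moves while the others stay flat is the one the applied axis touched, which is the reason for changing one axis per run.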
Caveat: deep includes
The loader is recursive and spends ~10 stack frames per include level, so the
deep profile hits Python's RecursionError at around 90 nested levels. The
harness reports this cleanly instead of crashing. Real tests are wide (many
steps / many includes), not deep, so includes/repeat/jinja/mix are the
representative "very long" cases.
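The depth ceiling follows from simple arithmetic on the interpreter's recursion limit. The sketch below is an estimate under stated assumptions, not the harness's logic: the ~10 frames per level comes from the paragraph above, while `base_frames` (frames already on the stack before the first include) is a hypothetical allowance, and `load_nested` is a one-frame-per-level stand-in for the loader.

```python
import sys


def estimate_max_include_depth(frames_per_level=10, base_frames=100, limit=None):
    # base_frames is an assumed allowance, picked only for illustration.
    if limit is None:
        limit = sys.getrecursionlimit()
    return max(0, (limit - base_frames) // frames_per_level)


def load_nested(level, depth):
    # Stand-in for a recursive include load (one frame per level here).
    if level >= depth:
        return level
    return load_nested(level + 1, depth)


def try_load(depth):
    # Report the recursion limit as a status instead of letting it propagate.
    try:
        return ("ok", load_nested(0, depth))
    except RecursionError:
        return ("recursion-limit", None)


print(estimate_max_include_depth(limit=1000))
print(try_load(5))
print(try_load(10**6)[0])
```

With CPython's default limit of 1000 and these assumed values, the estimate lands at 90 levels, in line with the figure quoted above.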
Notes
- No execution is triggered: timing stops where Batch would mark the test loaded.
- The profiles contain no <| |>, so the external eval process is not started. Pass --with-eval to load_bench.py for trees that evaluate at load time.
- Numbers are machine- and disk-specific; only compare runs from the same host.