François ef49789780 test: add load-time benchmark (jinja/include trees)

Generator + in-process harness timing the real loader's three stages and
template/YAML call counts, across tunable profiles. cases/ git-ignored;
see test/benchmark/README.md.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

2026-05-31 10:41:42 +02:00

Load-time benchmark

Measures how long testium takes to load a .tum test tree — template rendering (jinja) + YAML parsing + test-tree construction — without executing it. Purpose: get reproducible numbers before/after load-path optimisations, and attribute any gain to a specific part of the pipeline.

It is meant for very long tests, the kind you can build with jinja loops and !include, where load time becomes noticeable.

Files

File	Role
`gen_bench_test.py`	Generates a synthetic `.tum` tree (the test input).
`load_bench.py`	Drives the real loader in-process and times it.
`run.sh`	Convenience: generate + time across profiles, using the project venv.
`cases/`	Generated trees (git-ignored, recreated on demand).

The benchmark .tum files are generated, not committed — the generator is the artifact. They use only let leaves and group containers, so loading has no runtime side effect (no subprocess, no <| |> eval) and the timing reflects the parse/build pipeline alone.

Quick start

# default matrix (all profiles), 5 repeats each
./test/benchmark/run.sh

# one profile at one size
./test/benchmark/run.sh repeat 2000

# more repeats for a tighter min
REPEAT=10 ./test/benchmark/run.sh includes 1000

test/benchmark/run.sh uses the project venv at test/tmp/.venv, which is created by the top-level ./run.sh. If the venv is missing, run ./run.sh once first.

To drive the harness directly on any .tum (not just generated ones):

test/tmp/.venv/bin/python3 test/benchmark/load_bench.py --repeat 5 --quiet path/to/main.tum
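
To keep a report for a later before/after comparison, the same invocation can be wrapped in a few lines of Python. This is a minimal sketch: the interpreter path, the script path and the `--repeat` / `--quiet` flags are the ones shown above, while `harness_argv`, `run_and_save` and the assumption that the report goes to stdout are illustrative.

```python
import subprocess
from pathlib import Path

# Paths taken from the command above; adjust if the venv lives elsewhere.
VENV_PYTHON = "test/tmp/.venv/bin/python3"
HARNESS = "test/benchmark/load_bench.py"


def harness_argv(tum_path, repeat=5, quiet=True):
    """Build the argument list for one load_bench.py run (hypothetical helper)."""
    argv = [VENV_PYTHON, HARNESS, "--repeat", str(repeat)]
    if quiet:
        argv.append("--quiet")
    argv.append(str(tum_path))
    return argv


def run_and_save(tum_path, out_file, repeat=5):
    """Run the harness and store its report.

    Assumes the report is written to stdout; check=True raises if the
    harness exits with a non-zero status.
    """
    result = subprocess.run(
        harness_argv(tum_path, repeat),
        capture_output=True,
        text=True,
        check=True,
    )
    Path(out_file).write_text(result.stdout)
    return result.stdout
```

For example, `run_and_save("path/to/main.tum", "before.txt")` on the current code, then again into `after.txt` once a change is applied.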

Profiles

Each profile isolates one cost, except mix, which blends several. --size is the profile-specific count.

Profile	What it builds	Stresses
`flat`	one main file, N inline `let` steps	big YAML parse + linear object build
`includes`	main `!include`s N distinct sub-files	per-include template+YAML+tempfile, `sequence` splice
`repeat`	main `!include`s the same parametrised leaf N times	jinja recompilation of an identical template
`jinja`	one main file, `{% for %}` emitting N steps	single large render + single large parse
`deep`	nested includes, depth N	include recursion (see caveat)
`mix`	groups + jinja loop + distinct + repeated includes	realistic blend

Reading the output

phase              min      median
initial         0.1131      0.1285   <- pass 1: discover config files (no includes)
loadtest        1.0724      1.0900   <- config fixpoint loop + full recursive include load
build           0.1850      0.1976   <- TestSet: load_test_recursively tree build
total           1.3886      1.4227
counters  (last run):
  templates :    1003 calls   0.5247s  (exclusive: jinja compile+render+tempfile)
  yaml      :    1004 parses   1.4696s  (inclusive of nested includes)

min is the headline (least noisy); median is a sanity check.
initial / loadtest / build map to the three pipeline stages in interpreter/process.py and interpreter/test_set.py. The main file is rendered+parsed across initial and loadtest (the loader does ~3 passes).
templates = number of template_to_test() calls and their exclusive wall time (one file render each: pure jinja compile+render+tempfile I/O). A high count for the same source file indicates recompilation, which is the repeat case.
yaml = number of yaml_load() parses. Its time is inclusive of nested includes: a parent parse's time contains that of every file it includes, so nested work is counted more than once and the sum can exceed the wall time of the phase. Use the count for attribution, not the seconds.
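
The difference between exclusive and inclusive time can be illustrated with a small counter. This is a sketch of the general technique, not the harness's actual instrumentation: `Counter`, `timed` and `fake_parse` are hypothetical names, and `time.sleep` stands in for real parsing work.

```python
import time
from contextlib import contextmanager


class Counter:
    """Call count plus inclusive and exclusive wall time for one kind of call."""

    def __init__(self):
        self.calls = 0
        self.inclusive = 0.0  # sum of each call's full duration (children included)
        self.exclusive = 0.0  # sum of each call's duration minus nested calls
        self._child_time = []  # one accumulator per call currently in progress

    @contextmanager
    def timed(self):
        self.calls += 1
        self._child_time.append(0.0)
        start = time.perf_counter()
        try:
            yield
        finally:
            elapsed = time.perf_counter() - start
            nested = self._child_time.pop()
            self.inclusive += elapsed
            self.exclusive += elapsed - nested
            if self._child_time:
                # Charge this call's full duration to its parent's child time.
                self._child_time[-1] += elapsed


def fake_parse(counter, depth):
    """Stand-in for a parse that includes one nested file per level."""
    with counter.timed():
        time.sleep(0.001)
        if depth > 0:
            fake_parse(counter, depth - 1)


if __name__ == "__main__":
    c = Counter()
    fake_parse(c, depth=3)
    print(f"{c.calls} calls, inclusive {c.inclusive:.4f}s, exclusive {c.exclusive:.4f}s")
```

With nesting, the inclusive sum counts the innermost work once per enclosing level, which is why it is a poor basis for attribution, whereas the exclusive sum stays close to the real wall time.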

Mapping to the optimisation axes

Axis (see DESIGN / discussion)	Watch	Best profile to prove it
1 — cache compiled jinja templates	`templates` time drops, count unchanged	`repeat`
2 — drop the tempfile round-trip	`templates` time drops	`includes`, `repeat`, `mix`
3 — C YAML loader (libyaml)	`yaml` time / `loadtest` drops	`flat`, `jinja`
6 — O(n²) sequence splice	`build` drops	`includes`, `mix`
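
As an illustration of the kind of cost axis 6 refers to, the sketch below contrasts two ways of expanding include placeholders in a step list. How the loader actually splices is not shown here; `splice_in_place` and `splice_rebuild` are hypothetical functions, and steps are plain strings for simplicity.

```python
def splice_in_place(steps, expansions):
    """Expand placeholders with slice assignment on the same list.

    Every assignment that changes the list length shifts the tail of the
    list, so N includes in a list of N steps cost O(N^2) element moves.
    """
    i = 0
    while i < len(steps):
        sub = expansions.get(steps[i])
        if sub is None:
            i += 1
        else:
            steps[i:i + 1] = sub
            i += len(sub)  # skip past the inserted steps (no re-expansion)
    return steps


def splice_rebuild(steps, expansions):
    """Build a new list in one pass: O(total number of resulting steps)."""
    out = []
    for step in steps:
        sub = expansions.get(step)
        if sub is None:
            out.append(step)
        else:
            out.extend(sub)
    return out


if __name__ == "__main__":
    expansions = {"inc1": ["x", "y"], "inc2": ["z"]}
    print(splice_rebuild(["a", "inc1", "b", "inc2"], expansions))
```

Both functions return the same sequence; only the amount of element shifting differs, which is the effect that would show up in the `build` row for the `includes` and `mix` profiles.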

How to compare before/after a change

Run the matrix on the current code, keep the output.
Apply one axis.
Re-run the same profiles/sizes; compare min per phase and the counters.

Change one axis at a time so the attribution is clean. Run on an idle machine and note the storage medium: on slow media such as a USB stick, the tempfile round-trip targeted by axis 2 weighs more.
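
The comparison step can be scripted once two reports are saved. The sketch below assumes each report contains a single phase table in the layout shown under "Reading the output"; `parse_phases` and `compare` are hypothetical helpers, not part of the benchmark.

```python
import re

# One phase row: name, min, median (any trailing annotation is ignored).
PHASE_ROW = re.compile(r"^\s*(initial|loadtest|build|total)\s+(\d+\.\d+)\s+(\d+\.\d+)")


def parse_phases(report: str) -> dict[str, float]:
    """Return {phase: min_seconds} from one saved report."""
    mins = {}
    for line in report.splitlines():
        match = PHASE_ROW.match(line)
        if match:
            mins[match.group(1)] = float(match.group(2))
    return mins


def compare(before: str, after: str) -> dict[str, float]:
    """Relative change of min per phase; -0.25 means 25% faster."""
    b, a = parse_phases(before), parse_phases(after)
    return {
        phase: (a[phase] - b[phase]) / b[phase]
        for phase in b
        if phase in a and b[phase] > 0
    }


if __name__ == "__main__":
    with open("before.txt") as f_before, open("after.txt") as f_after:
        for phase, change in compare(f_before.read(), f_after.read()).items():
            print(f"{phase:10s} {change:+.1%}")
```

Only `min` is compared because it is the least noisy column; the counters still need to be checked by eye to confirm that the call counts are unchanged.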

Caveat: deep includes

The loader is recursive and spends ~10 stack frames per include level, so the deep profile hits Python's RecursionError at roughly 90 nested levels. The harness reports this cleanly instead of crashing. Real tests are wide (many steps / many includes), not deep, so includes/repeat/jinja/mix are the representative "very long" cases.
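
The depth limit follows from simple arithmetic on the interpreter's recursion limit. In the sketch below, the ~10 frames per level comes from the text above; `frames_in_use` is a guessed allowance for frames already on the stack when loading starts (100 is illustrative, not measured), and `fake_load` and `probe` are hypothetical stand-ins for a recursive loader and a clean error report.

```python
import sys


def estimated_max_include_depth(frames_per_level=10, frames_in_use=100, limit=None):
    """Rough number of nested include levels that fit under the recursion limit."""
    if limit is None:
        limit = sys.getrecursionlimit()
    return max((limit - frames_in_use) // frames_per_level, 0)


def fake_load(depth):
    """Stand-in for a recursive loader: one Python frame per level."""
    return 0 if depth == 0 else 1 + fake_load(depth - 1)


def probe(depth):
    """Report a too-deep tree instead of letting the exception propagate."""
    try:
        fake_load(depth)
        return "ok"
    except RecursionError:
        return "RecursionError"


if __name__ == "__main__":
    print("recursion limit:", sys.getrecursionlimit())
    print("estimated max include depth:", estimated_max_include_depth())
    print("probe(10):", probe(10))
```

With CPython's usual default limit of 1000 and the guessed allowance, the estimate is 90 levels, in line with the figure above. Raising the limit with `sys.setrecursionlimit` would move the ceiling, at the price of deeper native stack use.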

Notes

No execution is triggered — timing stops where Batch would mark the test loaded.
The profiles contain no <| |>, so the external eval process is not started. Pass --with-eval to load_bench.py for trees that evaluate at load time.
Numbers are machine- and disk-specific; only compare runs from the same host.
