François ef49789780 test: add load-time benchmark (jinja/include trees)
Generator + in-process harness timing the real loader's three stages and
template/YAML call counts, across tunable profiles. cases/ git-ignored;
see test/benchmark/README.md.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-05-31 10:41:42 +02:00

Load-time benchmark

Measures how long testium takes to load a .tum test tree — template rendering (jinja) + YAML parsing + test-tree construction — without executing it. Purpose: get reproducible numbers before/after load-path optimisations, and attribute any gain to a specific part of the pipeline.

It is meant for very long tests, the kind you can build with jinja loops and !include, where load time becomes noticeable.
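The measurement described above can be sketched as a small staged timer. This is a minimal illustration and not the code of load_bench.py: the function `time_stages` and the three lambda stages are hypothetical stand-ins for the real loader stages, and only the reporting shape (min and median per stage over several repeats) follows this README.

```python
import statistics
import time
from typing import Callable, Dict, List, Sequence, Tuple

Stage = Tuple[str, Callable[[], object]]


def time_stages(stages: Sequence[Stage], repeat: int = 5) -> Dict[str, Tuple[float, float]]:
    """Run every stage `repeat` times; return {name: (min, median)} in seconds."""
    samples: Dict[str, List[float]] = {name: [] for name, _ in stages}
    for _ in range(repeat):
        for name, fn in stages:
            start = time.perf_counter()
            fn()
            samples[name].append(time.perf_counter() - start)
    return {name: (min(ts), statistics.median(ts)) for name, ts in samples.items()}


# Hypothetical stand-ins for the three load stages (not the real loader).
stages = [
    ("initial", lambda: sum(range(1_000))),
    ("loadtest", lambda: sorted(range(50_000), reverse=True)),
    ("build", lambda: [str(i) for i in range(5_000)]),
]

results = time_stages(stages, repeat=5)
print(f"{'phase':<10} {'min':>10} {'median':>10}")
for name, (lowest, median) in results.items():
    print(f"{name:<10} {lowest:10.4f} {median:10.4f}")
```

Nothing is executed beyond the stage callables themselves, which mirrors the idea of timing the load path only.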

Files

| File | Role |
|---|---|
| gen_bench_test.py | Generates a synthetic .tum tree (the test input). |
| load_bench.py | Drives the real loader in-process and times it. |
| run.sh | Convenience: generate + time across profiles, using the project venv. |
| cases/ | Generated trees (git-ignored, recreated on demand). |

The benchmark .tum files are generated, not committed — the generator is the artifact. They use only let leaves and group containers, so loading has no runtime side effect (no subprocess, no <| |> eval) and the timing reflects the parse/build pipeline alone.

Quick start

# default matrix (all profiles), 5 repeats each
./test/benchmark/run.sh

# one profile at one size
./test/benchmark/run.sh repeat 2000

# more repeats for a tighter min
REPEAT=10 ./test/benchmark/run.sh includes 1000

test/benchmark/run.sh uses the project venv at test/tmp/.venv, which is created by the top-level ./run.sh. If the venv is missing, run ./run.sh once first.

To drive the harness directly on any .tum (not just generated ones):

test/tmp/.venv/bin/python3 test/benchmark/load_bench.py --repeat 5 --quiet path/to/main.tum

Profiles

Each profile isolates one cost. --size is the profile-specific count.

| Profile | What it builds | Stresses |
|---|---|---|
| flat | one main file, N inline let steps | big YAML parse + linear object build |
| includes | main !includes N distinct sub-files | per-include template+YAML+tempfile, sequence splice |
| repeat | main !includes the same parametrised leaf N times | jinja recompilation of an identical template |
| jinja | one main file, {% for %} emitting N steps | single large render + single large parse |
| deep | nested includes, depth N | include recursion (see caveat) |
| mix | groups + jinja loop + distinct + repeated includes | realistic blend |
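To illustrate what the repeat profile stresses, here is a minimal sketch of compiling an identical template N times versus compiling it once and reusing it. It uses the standard library's `string.Template` as a stand-in for jinja, and `compile_template`, `compile_cached` and `LEAF_SOURCE` are hypothetical names, not testium's API or .tum syntax.

```python
from functools import lru_cache
from string import Template
from typing import Dict, List

# Counts how often the (stand-in) compile step actually runs.
stats: Dict[str, int] = {"compiles": 0}


def compile_template(source: str) -> Template:
    """Stand-in for an expensive template compile (jinja in the real loader)."""
    stats["compiles"] += 1
    return Template(source)


@lru_cache(maxsize=None)
def compile_cached(source: str) -> Template:
    """Same compile, memoised on the template source text."""
    return compile_template(source)


# Hypothetical parametrised leaf; this is not real .tum syntax.
LEAF_SOURCE = "step_$index = $index"


def render_all(n: int, cached: bool) -> List[str]:
    """Render the same leaf n times with a different parameter each time."""
    compile_fn = compile_cached if cached else compile_template
    return [compile_fn(LEAF_SOURCE).substitute(index=i) for i in range(n)]


uncached_out = render_all(1000, cached=False)
uncached_compiles = stats["compiles"]

stats["compiles"] = 0
cached_out = render_all(1000, cached=True)
cached_compiles = stats["compiles"]

print(uncached_compiles, cached_compiles)  # 1000 1
```

The number of renders stays at N in both variants; only the number of compiles changes, which is the cost this profile is meant to expose.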

Reading the output

phase              min      median
initial         0.1131      0.1285   <- pass 1: discover config files (no includes)
loadtest        1.0724      1.0900   <- config fixpoint loop + full recursive include load
build           0.1850      0.1976   <- TestSet: load_test_recursively tree build
total           1.3886      1.4227
counters  (last run):
  templates :    1003 calls   0.5247s  (exclusive: jinja compile+render+tempfile)
  yaml      :    1004 parses   1.4696s  (inclusive of nested includes)
  • min is the headline (least noisy); median is a sanity check.
  • initial / loadtest / build map to the three pipeline stages in interpreter/process.py and interpreter/test_set.py. The main file is rendered+parsed across initial and loadtest (the loader does ~3 passes).
  • templates = number of template_to_test() calls and their exclusive wall time (one file render each — pure jinja compile+render+tempfile I/O). A high count with the same source file = recompilation, the repeat case.
  • yaml = number of yaml_load() parses. Its time is inclusive of nested includes, so use the count for attribution, not the seconds.
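The difference between exclusive and inclusive time in the counters above can be shown with a small nested timer. This is a sketch under assumptions, not the harness's instrumentation: `NestedTimer`, `FakeClock` and `load` are hypothetical, and a fake clock is used so the numbers are deterministic.

```python
import time
from contextlib import contextmanager
from typing import Callable, Dict, Iterator, List


class NestedTimer:
    """Tracks call count, inclusive time and exclusive time per span name."""

    def __init__(self, clock: Callable[[], float] = time.perf_counter) -> None:
        self.clock = clock
        self.calls: Dict[str, int] = {}
        self.inclusive: Dict[str, float] = {}
        self.exclusive: Dict[str, float] = {}
        self._child_time: List[float] = []  # one accumulator per open span

    @contextmanager
    def span(self, name: str) -> Iterator[None]:
        start = self.clock()
        self._child_time.append(0.0)
        try:
            yield
        finally:
            elapsed = self.clock() - start
            children = self._child_time.pop()
            if self._child_time:
                # Report our whole duration to the enclosing span.
                self._child_time[-1] += elapsed
            self.calls[name] = self.calls.get(name, 0) + 1
            self.inclusive[name] = self.inclusive.get(name, 0.0) + elapsed
            self.exclusive[name] = self.exclusive.get(name, 0.0) + elapsed - children


class FakeClock:
    """Deterministic clock so the example does not depend on the machine."""

    def __init__(self) -> None:
        self.now = 0.0

    def __call__(self) -> float:
        return self.now

    def advance(self, seconds: float) -> None:
        self.now += seconds


clock = FakeClock()
timer = NestedTimer(clock)


def load(depth: int) -> None:
    """Hypothetical parse: 1 s of own work, then one nested include."""
    with timer.span("yaml"):
        clock.advance(1.0)
        if depth > 0:
            load(depth - 1)


load(2)
print(timer.calls["yaml"], timer.inclusive["yaml"], timer.exclusive["yaml"])  # 3 6.0 3.0
```

Three parses took 3 s of wall time, yet the summed inclusive time is 6 s because each parent also counts its children. That is why the yaml seconds are not suited for attribution while the parse count is.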

Mapping to the optimisation axes

| Axis (see DESIGN / discussion) | Watch | Best profile to prove it |
|---|---|---|
| 1: cache compiled jinja templates | templates time drops, count unchanged | repeat |
| 2: drop the tempfile round-trip | templates time drops | includes, repeat, mix |
| 3: C YAML loader (libyaml) | yaml time / loadtest drops | flat, jinja |
| 6: O(n²) sequence splice | build drops | includes, mix |
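As an illustration of the sequence-splice axis, the sketch below contrasts a splice that rewrites the parent list once per include with a single-pass rebuild. It is an assumption about what a quadratic splice can look like, not the loader's actual code; both function names are hypothetical, and included sequences are modelled as nested Python lists.

```python
from typing import Any, List


def splice_in_place(items: List[Any]) -> List[Any]:
    """Replace each nested list by its elements via slice assignment.

    Every slice assignment shifts the tail of the list, so N includes in a
    long sequence cost roughly O(N * len) overall.
    """
    out = list(items)
    i = 0
    while i < len(out):
        if isinstance(out[i], list):
            out[i:i + 1] = out[i]  # splice here, then re-examine position i
        else:
            i += 1
    return out


def splice_linear(items: List[Any]) -> List[Any]:
    """Build the flattened sequence in one pass using append/extend."""
    out: List[Any] = []
    for item in items:
        if isinstance(item, list):
            out.extend(splice_linear(item))
        else:
            out.append(item)
    return out


# Steps are strings; nested lists stand for included sequences.
tree = ["a", ["b", "c"], "d", [["e"], "f"]]
print(splice_in_place(tree))
print(splice_linear(tree))
```

Both variants return the same flat sequence; only the amount of list shuffling differs, which is what the includes and mix profiles would make visible at large N.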

How to compare before/after a change

  1. Run the matrix on the current code, keep the output.
  2. Apply one axis.
  3. Re-run the same profiles/sizes; compare min per phase and the counters.

Change one axis at a time so the attribution is clean. Run on an idle machine (and note the disk: on a USB stick the tempfile round-trip of axis 2 weighs more).
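The comparison step can be sketched as a small script that extracts the per-phase min from two saved reports and prints the relative change. The parsing assumes the report layout shown under "Reading the output"; `parse_mins` and `compare` are hypothetical helpers, and the AFTER report below contains made-up numbers for illustration, not a measurement.

```python
import re
from typing import Dict

# phase name, min, median (trailing annotations on the line are ignored)
PHASE_RE = re.compile(
    r"^\s*(initial|loadtest|build|total)\s+([0-9.]+)\s+([0-9.]+)", re.MULTILINE
)


def parse_mins(report: str) -> Dict[str, float]:
    """Return {phase: min_seconds} from one harness report."""
    return {m.group(1): float(m.group(2)) for m in PHASE_RE.finditer(report)}


def compare(before: str, after: str) -> Dict[str, float]:
    """Relative change of the per-phase min, in percent (negative = faster)."""
    b, a = parse_mins(before), parse_mins(after)
    return {p: (a[p] - b[p]) / b[p] * 100.0 for p in b if p in a}


BEFORE = """\
phase              min      median
initial         0.1131      0.1285
loadtest        1.0724      1.0900
build           0.1850      0.1976
total           1.3886      1.4227
"""

# Made-up numbers, only to exercise the script.
AFTER = """\
phase              min      median
initial         0.1131      0.1290
loadtest        0.5362      0.5500
build           0.1850      0.1980
total           0.8524      0.8800
"""

deltas = compare(BEFORE, AFTER)
for phase, pct in deltas.items():
    print(f"{phase:<10} {pct:+7.1f}%")
```

Comparing min rather than median follows the advice above that min is the least noisy figure; the counters still need to be compared by eye or with a similar parser.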

Caveat: deep includes

The loader is recursive and spends ~10 stack frames per include level, so the deep profile hits Python's RecursionError at roughly 90 nested levels (CPython's default recursion limit is 1000 frames). The harness reports this cleanly instead of crashing. Real tests are wide (many steps / many includes), not deep, so includes/repeat/jinja/mix are the representative "very long" cases.
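The failure mode can be reproduced with a toy recursive loader. This is a sketch under assumptions: `_load_level` and `try_load` are hypothetical, the 10 frames per level are simulated with dummy calls, and the exact depth at which the error appears depends on the interpreter's recursion limit and on how deep the call stack already is.

```python
import sys

FRAMES_PER_LEVEL = 10  # approximate figure quoted above


def _load_level(levels_left: int, frames_to_burn: int = FRAMES_PER_LEVEL - 1) -> int:
    """Simulate one include level that costs FRAMES_PER_LEVEL stack frames."""
    if frames_to_burn > 0:
        # Stand-in for the helper calls a real loader makes within one level.
        return _load_level(levels_left, frames_to_burn - 1)
    if levels_left == 0:
        return 0
    return 1 + _load_level(levels_left - 1)


def try_load(depth: int) -> str:
    """Report a too-deep tree as a message instead of a traceback."""
    try:
        loaded = _load_level(depth)
    except RecursionError:
        return f"depth {depth}: RecursionError (limit {sys.getrecursionlimit()})"
    return f"depth {depth}: ok ({loaded} levels)"


print(try_load(5))
print(try_load(10_000))
```

Catching RecursionError at the top of the load keeps the report readable, which is the behaviour described for the harness.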

Notes

  • No execution is triggered — timing stops where Batch would mark the test loaded.
  • The profiles contain no <| |>, so the external eval process is not started. Pass --with-eval to load_bench.py for trees that evaluate at load time.
  • Numbers are machine- and disk-specific; only compare runs from the same host.