test: add load-time benchmark (jinja/include trees)

Generator + in-process harness timing the real loader's three stages and template/YAML call counts, across tunable profiles. cases/ git-ignored; see test/benchmark/README.md.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-05-31 10:41:42 +02:00
parent 6e31ae971a
commit ef49789780
5 changed files with 545 additions and 0 deletions
--- a/test/benchmark/README.md
+++ b/test/benchmark/README.md
@@ -0,0 +1,116 @@
+# Load-time benchmark
+
+Measures how long *testium* takes to **load** a `.tum` test tree — template
+rendering (jinja) + YAML parsing + test-tree construction — *without* executing
+it. Purpose: get reproducible numbers before/after load-path optimisations, and
+attribute any gain to a specific part of the pipeline.
+
+It is meant for *very long* tests, the kind you can build with `jinja` loops and
+`!include`, where load time becomes noticeable.
+
+## Files
+
+| File | Role |
+|------|------|
+| `gen_bench_test.py` | Generates a synthetic `.tum` tree (the test input). |
+| `load_bench.py` | Drives the **real** loader in-process and times it. |
+| `run.sh` | Convenience: generate + time across profiles, using the project venv. |
+| `cases/` | Generated trees (git-ignored, recreated on demand). |
+
+The benchmark `.tum` files are **generated**, not committed — the generator is
+the artifact. They use only `let` leaves and `group` containers, so loading has
+no runtime side effect (no subprocess, no `<| |>` eval) and the timing reflects
+the parse/build pipeline alone.
+
+## Quick start
+
+```bash
+# default matrix (all profiles), 5 repeats each
+./test/benchmark/run.sh
+
+# one profile at one size
+./test/benchmark/run.sh repeat 2000
+
+# more repeats for a tighter min
+REPEAT=10 ./test/benchmark/run.sh includes 1000
+```
+
+The benchmark's `run.sh` uses the project venv at `test/tmp/.venv` (created by
+the top-level `./run.sh`). If the venv is missing, run `./run.sh` once first.
+
+To drive the harness directly on any `.tum` (not just generated ones):
+
+```bash
+test/tmp/.venv/bin/python3 test/benchmark/load_bench.py --repeat 5 --quiet path/to/main.tum
+```
+
+## Profiles
+
+Each profile except `mix` isolates one cost. `--size` sets the profile-specific count (N in the table below).
+
+| Profile | What it builds | Stresses |
+|---------|----------------|----------|
+| `flat` | one main file, N inline `let` steps | big YAML parse + linear object build |
+| `includes` | main `!include`s N **distinct** sub-files | per-include template+YAML+tempfile, `sequence` splice |
+| `repeat` | main `!include`s the **same** parametrised leaf N times | jinja **recompilation** of an identical template |
+| `jinja` | one main file, `{% for %}` emitting N steps | single large render + single large parse |
+| `deep` | nested includes, depth N | include recursion (see caveat) |
+| `mix` | groups + jinja loop + distinct + repeated includes | realistic blend |
+
+## Reading the output
+
+```
+phase              min      median
+initial         0.1131      0.1285   <- pass 1: discover config files (no includes)
+loadtest        1.0724      1.0900   <- config fixpoint loop + full recursive include load
+build           0.1850      0.1976   <- TestSet: load_test_recursively tree build
+total           1.3886      1.4227
+counters  (last run):
+  templates :    1003 calls   0.5247s  (exclusive: jinja compile+render+tempfile)
+  yaml      :    1004 parses   1.4696s  (inclusive of nested includes)
+```
+
+- **min** is the headline (least noisy); median is a sanity check.
+- **initial / loadtest / build** map to the three pipeline stages in
+  `interpreter/process.py` and `interpreter/test_set.py`. The main file is
+  rendered+parsed across `initial` *and* `loadtest` (the loader does ~3 passes).
+- **templates** = number of `template_to_test()` calls and their *exclusive*
+  wall time (one file render each — pure jinja compile+render+tempfile I/O).
+  A high count with the same source file = recompilation, the `repeat` case.
+- **yaml** = number of `yaml_load()` parses. Its time is *inclusive* of nested
+  includes, so use the **count** for attribution, not the seconds.
+
+## Mapping to the optimisation axes
+
+| Axis (see DESIGN / discussion) | Watch | Best profile to prove it |
+|--------------------------------|-------|--------------------------|
+| 1 — cache compiled jinja templates | `templates` time drops, count unchanged | `repeat` |
+| 2 — drop the tempfile round-trip | `templates` time drops | `includes`, `repeat`, `mix` |
+| 3 — C YAML loader (libyaml) | `yaml` time / `loadtest` drops | `flat`, `jinja` |
+| 6 — O(n²) sequence splice | `build` drops | `includes`, `mix` |
+
+## How to compare before/after a change
+
+1. Run the matrix on the current code, keep the output.
+2. Apply one axis.
+3. Re-run the **same** profiles/sizes; compare `min` per phase and the counters.
+
+Change one axis at a time so the attribution is clean. Run on an idle machine
+and note the disk: on a slow medium such as a USB stick, the tempfile
+round-trip targeted by axis 2 weighs more.
+
+## Caveat: deep includes
+
+The loader is recursive and spends ~10 stack frames per include level, so
+`deep` hits Python's `RecursionError` at roughly 90 nested levels. The harness
+reports this cleanly instead of crashing. Real tests are *wide* (many steps /
+many includes), not deep, so `includes`/`repeat`/`jinja`/`mix` are the
+representative "very long" cases.
+
+## Notes
+
+- No execution is triggered — timing stops where `Batch` would mark the test
+  *loaded*.
+- The profiles contain no `<| |>`, so the external eval process is not started.
+  Pass `--with-eval` to `load_bench.py` for trees that evaluate at load time.
+- Numbers are machine- and disk-specific; only compare runs from the same host.
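
The README's advice to read the `yaml` counter by count rather than by seconds follows from the difference between exclusive and inclusive timing. The sketch below is a deterministic toy model of that difference, not code from `load_bench.py`: `Counter`, `FakeClock`, `load`, and the two cost constants are hypothetical names and values chosen for illustration, and the model assumes each file's parse pulls in exactly one nested include.

```python
from dataclasses import dataclass


@dataclass
class Counter:
    """Call count plus accumulated seconds, like one line of the counters output."""
    calls: int = 0
    seconds: float = 0.0


class FakeClock:
    """Deterministic stand-in for time.perf_counter (hypothetical, for illustration)."""

    def __init__(self) -> None:
        self.now = 0.0

    def __call__(self) -> float:
        return self.now

    def spend(self, seconds: float) -> None:
        # Pretend that `seconds` of work just happened.
        self.now += seconds


RENDER_COST = 0.5  # assumed per-file render cost, in fake seconds
PARSE_COST = 1.0   # assumed per-file parse cost, in fake seconds


def load(depth: int, clock: FakeClock, templates: Counter, yaml: Counter) -> None:
    """Toy loader: render one file, then parse it; the parse pulls in one nested include."""
    # Exclusive timing: the timer is closed before any nested work starts.
    start = clock()
    clock.spend(RENDER_COST)
    templates.calls += 1
    templates.seconds += clock() - start

    # Inclusive timing: the timer stays open while the nested include loads.
    start = clock()
    clock.spend(PARSE_COST)
    if depth > 0:
        load(depth - 1, clock, templates, yaml)
    yaml.calls += 1
    yaml.seconds += clock() - start


if __name__ == "__main__":
    clock, templates, yaml = FakeClock(), Counter(), Counter()
    load(2, clock, templates, yaml)  # three files: main -> include -> include
    print(f"wall      : {clock():.1f}s")
    print(f"templates : {templates.calls} calls  {templates.seconds:.1f}s (exclusive)")
    print(f"yaml      : {yaml.calls} parses {yaml.seconds:.1f}s (inclusive)")
```

In this model the wall time is 4.5 fake seconds, the exclusive `templates` total is 1.5, and the inclusive `yaml` total is 7.5. The inclusive total exceeds the wall time because every nested parse is counted once for each parse that encloses it, which is why the call count is the safer attribution signal.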