Files
testium/test/benchmark/README.md
François ef49789780 test: add load-time benchmark (jinja/include trees)
Generator + in-process harness timing the real loader's three stages and
template/YAML call counts, across tunable profiles. cases/ git-ignored;
see test/benchmark/README.md.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-05-31 10:41:42 +02:00

117 lines
5.0 KiB
Markdown

# Load-time benchmark
Measures how long *testium* takes to **load** a `.tum` test tree — template
rendering (jinja) + YAML parsing + test-tree construction — *without* executing
it. Purpose: get reproducible numbers before/after load-path optimisations, and
attribute any gain to a specific part of the pipeline.
It is meant for *very long* tests, the kind you can build with `jinja` loops and
`!include`, where load time becomes noticeable.
## Files
| File | Role |
|------|------|
| `gen_bench_test.py` | Generates a synthetic `.tum` tree (the test input). |
| `load_bench.py` | Drives the **real** loader in-process and times it. |
| `run.sh` | Convenience: generate + time across profiles, using the project venv. |
| `cases/` | Generated trees (git-ignored, recreated on demand). |
The benchmark `.tum` files are **generated**, not committed — the generator is
the artifact. They use only `let` leaves and `group` containers, so loading has
no runtime side effect (no subprocess, no `<| |>` eval) and the timing reflects
the parse/build pipeline alone.
## Quick start
```bash
# default matrix (all profiles), 5 repeats each
./test/benchmark/run.sh
# one profile at one size
./test/benchmark/run.sh repeat 2000
# more repeats for a tighter min
REPEAT=10 ./test/benchmark/run.sh includes 1000
```
`run.sh` uses the project venv at `test/tmp/.venv` (created by `./run.sh`). If it
is missing, run `./run.sh` once first.
To drive the harness directly on any `.tum` (not just generated ones):
```bash
test/tmp/.venv/bin/python3 test/benchmark/load_bench.py --repeat 5 --quiet path/to/main.tum
```
## Profiles
Each profile isolates one cost. `--size` is the profile-specific count.
| Profile | What it builds | Stresses |
|---------|----------------|----------|
| `flat` | one main file, N inline `let` steps | big YAML parse + linear object build |
| `includes` | main `!include`s N **distinct** sub-files | per-include template+YAML+tempfile, `sequence` splice |
| `repeat` | main `!include`s the **same** parametrised leaf N times | jinja **recompilation** of an identical template |
| `jinja` | one main file, `{% for %}` emitting N steps | single large render + single large parse |
| `deep` | nested includes, depth N | include recursion (see caveat) |
| `mix` | groups + jinja loop + distinct + repeated includes | realistic blend |
## Reading the output
```
phase min median
initial 0.1131 0.1285 <- pass 1: discover config files (no includes)
loadtest 1.0724 1.0900 <- config fixpoint loop + full recursive include load
build 0.1850 0.1976 <- TestSet: load_test_recursively tree build
total 1.3886 1.4227
counters (last run):
templates : 1003 calls 0.5247s (exclusive: jinja compile+render+tempfile)
yaml : 1004 parses 1.4696s (inclusive of nested includes)
```
- **min** is the headline (least noisy); median is a sanity check.
- **initial / loadtest / build** map to the three pipeline stages in
`interpreter/process.py` and `interpreter/test_set.py`. The main file is
rendered+parsed across `initial` *and* `loadtest` (the loader does ~3 passes).
- **templates** = number of `template_to_test()` calls and their *exclusive*
wall time (one file render each — pure jinja compile+render+tempfile I/O).
A high count with the same source file = recompilation, the `repeat` case.
- **yaml** = number of `yaml_load()` parses. Its time is *inclusive* of nested
includes, so use the **count** for attribution, not the seconds.
## Mapping to the optimisation axes
| Axis (see DESIGN / discussion) | Watch | Best profile to prove it |
|--------------------------------|-------|--------------------------|
| 1 — cache compiled jinja templates | `templates` time drops, count unchanged | `repeat` |
| 2 — drop the tempfile round-trip | `templates` time drops | `includes`, `repeat`, `mix` |
| 3 — C YAML loader (libyaml) | `yaml` time / `loadtest` drops | `flat`, `jinja` |
| 6 — O(n²) sequence splice | `build` drops | `includes`, `mix` |
## How to compare before/after a change
1. Run the matrix on the current code, keep the output.
2. Apply one axis.
3. Re-run the **same** profiles/sizes; compare `min` per phase and the counters.
Change one axis at a time so the attribution is clean. Run on an idle machine
(and note the disk: on a USB stick the tempfile round-trip of axis 2 weighs
more).
## Caveat: deep includes
The loader is recursive and spends ~10 stack frames per include level, so
`deep` hits Python's `RecursionError` around ~90 nested levels. The harness
reports this cleanly instead of crashing. Real tests are *wide* (many steps /
many includes), not deep, so `includes`/`repeat`/`jinja`/`mix` are the
representative "very long" cases.
## Notes
- No execution is triggered — timing stops where `Batch` would mark the test
*loaded*.
- The profiles contain no `<| |>`, so the external eval process is not started.
Pass `--with-eval` to `load_bench.py` for trees that evaluate at load time.
- Numbers are machine- and disk-specific; only compare runs from the same host.