test: add load-time benchmark (jinja/include trees)
Generator + in-process harness timing the real loader's three stages and template/YAML call counts, across tunable profiles. cases/ git-ignored; see test/benchmark/README.md. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
This commit is contained in:
116
test/benchmark/README.md
Normal file
116
test/benchmark/README.md
Normal file
@@ -0,0 +1,116 @@
|
||||
# Load-time benchmark
|
||||
|
||||
Measures how long *testium* takes to **load** a `.tum` test tree — template
|
||||
rendering (jinja) + YAML parsing + test-tree construction — *without* executing
|
||||
it. Purpose: get reproducible numbers before/after load-path optimisations, and
|
||||
attribute any gain to a specific part of the pipeline.
|
||||
|
||||
It is meant for *very long* tests, the kind you can build with `jinja` loops and
|
||||
`!include`, where load time becomes noticeable.
|
||||
|
||||
## Files
|
||||
|
||||
| File | Role |
|
||||
|------|------|
|
||||
| `gen_bench_test.py` | Generates a synthetic `.tum` tree (the test input). |
|
||||
| `load_bench.py` | Drives the **real** loader in-process and times it. |
|
||||
| `run.sh` | Convenience: generate + time across profiles, using the project venv. |
|
||||
| `cases/` | Generated trees (git-ignored, recreated on demand). |
|
||||
|
||||
The benchmark `.tum` files are **generated**, not committed — the generator is
|
||||
the artifact. They use only `let` leaves and `group` containers, so loading has
|
||||
no runtime side effect (no subprocess, no `<| |>` eval) and the timing reflects
|
||||
the parse/build pipeline alone.
|
||||
|
||||
## Quick start
|
||||
|
||||
```bash
|
||||
# default matrix (all profiles), 5 repeats each
|
||||
./test/benchmark/run.sh
|
||||
|
||||
# one profile at one size
|
||||
./test/benchmark/run.sh repeat 2000
|
||||
|
||||
# more repeats for a tighter min
|
||||
REPEAT=10 ./test/benchmark/run.sh includes 1000
|
||||
```
|
||||
|
||||
`run.sh` uses the project venv at `test/tmp/.venv` (created by `./run.sh`). If it
|
||||
is missing, run `./run.sh` once first.
|
||||
|
||||
To drive the harness directly on any `.tum` (not just generated ones):
|
||||
|
||||
```bash
|
||||
test/tmp/.venv/bin/python3 test/benchmark/load_bench.py --repeat 5 --quiet path/to/main.tum
|
||||
```
|
||||
|
||||
## Profiles
|
||||
|
||||
Each profile isolates one cost. `--size` is the profile-specific count.
|
||||
|
||||
| Profile | What it builds | Stresses |
|
||||
|---------|----------------|----------|
|
||||
| `flat` | one main file, N inline `let` steps | big YAML parse + linear object build |
|
||||
| `includes` | main `!include`s N **distinct** sub-files | per-include template+YAML+tempfile, `sequence` splice |
|
||||
| `repeat` | main `!include`s the **same** parametrised leaf N times | jinja **recompilation** of an identical template |
|
||||
| `jinja` | one main file, `{% for %}` emitting N steps | single large render + single large parse |
|
||||
| `deep` | nested includes, depth N | include recursion (see caveat) |
|
||||
| `mix` | groups + jinja loop + distinct + repeated includes | realistic blend |
|
||||
|
||||
## Reading the output
|
||||
|
||||
```
|
||||
phase min median
|
||||
initial 0.1131 0.1285 <- pass 1: discover config files (no includes)
|
||||
loadtest 1.0724 1.0900 <- config fixpoint loop + full recursive include load
|
||||
build 0.1850 0.1976 <- TestSet: load_test_recursively tree build
|
||||
total 1.3886 1.4227
|
||||
counters (last run):
|
||||
templates : 1003 calls 0.5247s (exclusive: jinja compile+render+tempfile)
|
||||
yaml : 1004 parses 1.4696s (inclusive of nested includes)
|
||||
```
|
||||
|
||||
- **min** is the headline (least noisy); median is a sanity check.
|
||||
- **initial / loadtest / build** map to the three pipeline stages in
|
||||
`interpreter/process.py` and `interpreter/test_set.py`. The main file is
|
||||
rendered+parsed across `initial` *and* `loadtest` (the loader does ~3 passes).
|
||||
- **templates** = number of `template_to_test()` calls and their *exclusive*
|
||||
wall time (one file render each — pure jinja compile+render+tempfile I/O).
|
||||
A high count with the same source file = recompilation, the `repeat` case.
|
||||
- **yaml** = number of `yaml_load()` parses. Its time is *inclusive* of nested
|
||||
includes, so use the **count** for attribution, not the seconds.
|
||||
|
||||
## Mapping to the optimisation axes
|
||||
|
||||
| Axis (see DESIGN / discussion) | Watch | Best profile to prove it |
|
||||
|--------------------------------|-------|--------------------------|
|
||||
| 1 — cache compiled jinja templates | `templates` time drops, count unchanged | `repeat` |
|
||||
| 2 — drop the tempfile round-trip | `templates` time drops | `includes`, `repeat`, `mix` |
|
||||
| 3 — C YAML loader (libyaml) | `yaml` time / `loadtest` drops | `flat`, `jinja` |
|
||||
| 6 — O(n²) sequence splice | `build` drops | `includes`, `mix` |
|
||||
|
||||
## How to compare before/after a change
|
||||
|
||||
1. Run the matrix on the current code, keep the output.
|
||||
2. Apply one axis.
|
||||
3. Re-run the **same** profiles/sizes; compare `min` per phase and the counters.
|
||||
|
||||
Change one axis at a time so the attribution is clean. Run on an idle machine
|
||||
(and note the disk: on a USB stick the tempfile round-trip of axis 2 weighs
|
||||
more).
|
||||
|
||||
## Caveat: deep includes
|
||||
|
||||
The loader is recursive and spends ~10 stack frames per include level, so
|
||||
`deep` hits Python's `RecursionError` around ~90 nested levels. The harness
|
||||
reports this cleanly instead of crashing. Real tests are *wide* (many steps /
|
||||
many includes), not deep, so `includes`/`repeat`/`jinja`/`mix` are the
|
||||
representative "very long" cases.
|
||||
|
||||
## Notes
|
||||
|
||||
- No execution is triggered — timing stops where `Batch` would mark the test
|
||||
*loaded*.
|
||||
- The profiles contain no `<| |>`, so the external eval process is not started.
|
||||
Pass `--with-eval` to `load_bench.py` for trees that evaluate at load time.
|
||||
- Numbers are machine- and disk-specific; only compare runs from the same host.
|
||||
Reference in New Issue
Block a user