testium/test/benchmark/README.md

# Load-time benchmark

Measures how long *testium* takes to **load** a `.tum` test tree — template
rendering (jinja) + YAML parsing + test-tree construction — *without* executing
it. Purpose: get reproducible numbers before/after load-path optimisations, and
attribute any gain to a specific part of the pipeline.

It is meant for *very long* tests, the kind you can build with `jinja` loops and
`!include`, where load time becomes noticeable.

## Files

| File | Role |
|------|------|
| `gen_bench_test.py` | Generates a synthetic `.tum` tree (the test input). |
| `load_bench.py` | Drives the **real** loader in-process and times it. |
| `run.sh` | Convenience: generate + time across profiles, using the project venv. |
| `cases/` | Generated trees (git-ignored, recreated on demand). |

The benchmark `.tum` files are **generated**, not committed — the generator is
the artifact. They use only `let` leaves and `group` containers, so loading has
no runtime side effect (no subprocess, no `<| |>` eval) and the timing reflects
the parse/build pipeline alone.

## Quick start

```bash
# default matrix (all profiles), 5 repeats each
./test/benchmark/run.sh

# one profile at one size
./test/benchmark/run.sh repeat 2000

# more repeats for a tighter min
REPEAT=10 ./test/benchmark/run.sh includes 1000
```

`run.sh` uses the project venv at `test/tmp/.venv` (created by `./run.sh`). If it
is missing, run `./run.sh` once first.

To drive the harness directly on any `.tum` (not just generated ones):

```bash
test/tmp/.venv/bin/python3 test/benchmark/load_bench.py --repeat 5 --quiet path/to/main.tum
```

## Profiles

Each profile isolates one cost. `--size` is the profile-specific count.

| Profile | What it builds | Stresses |
|---------|----------------|----------|
| `flat` | one main file, N inline `let` steps | big YAML parse + linear object build |
| `includes` | main `!include`s N **distinct** sub-files | per-include template+YAML+tempfile, `sequence` splice |
| `repeat` | main `!include`s the **same** parametrised leaf N times | jinja **recompilation** of an identical template |
| `jinja` | one main file, `{% for %}` emitting N steps | single large render + single large parse |
| `deep` | nested includes, depth N | include recursion (see caveat) |
| `mix` | groups + jinja loop + distinct + repeated includes | realistic blend |

## Reading the output

```
phase              min      median
initial         0.1131      0.1285   <- pass 1: discover config files (no includes)
loadtest        1.0724      1.0900   <- config fixpoint loop + full recursive include load
build           0.1850      0.1976   <- TestSet: load_test_recursively tree build
total           1.3886      1.4227
counters  (last run):
  templates :    1003 calls   0.5247s  (exclusive: jinja compile+render+tempfile)
  yaml      :    1004 parses   1.4696s  (inclusive of nested includes)
```

- **min** is the headline (least noisy); median is a sanity check.
- **initial / loadtest / build** map to the three pipeline stages in
  `interpreter/process.py` and `interpreter/test_set.py`. The main file is
  rendered+parsed across `initial` *and* `loadtest` (the loader does ~3 passes).
- **templates** = number of `template_to_test()` calls and their *exclusive*
  wall time (one file render each — pure jinja compile+render+tempfile I/O).
  A high count with the same source file = recompilation, the `repeat` case.
- **yaml** = number of `yaml_load()` parses. Its time is *inclusive* of nested
  includes, so use the **count** for attribution, not the seconds.

## Mapping to the optimisation axes

| Axis (see DESIGN / discussion) | Watch | Best profile to prove it |
|--------------------------------|-------|--------------------------|
| 1 — cache compiled jinja templates | `templates` time drops, count unchanged | `repeat` |
| 2 — drop the tempfile round-trip | `templates` time drops | `includes`, `repeat`, `mix` |
| 3 — C YAML loader (libyaml) | `yaml` time / `loadtest` drops | `flat`, `jinja` |
| 6 — O(n²) sequence splice | `build` drops | `includes`, `mix` |

## How to compare before/after a change

1. Run the matrix on the current code, keep the output.
2. Apply one axis.
3. Re-run the **same** profiles/sizes; compare `min` per phase and the counters.

Change one axis at a time so the attribution is clean. Run on an idle machine
(and note the disk: on a USB stick the tempfile round-trip of axis 2 weighs
more).

## Caveat: deep includes

The loader is recursive and spends ~10 stack frames per include level, so
`deep` hits Python's `RecursionError` around ~90 nested levels. The harness
reports this cleanly instead of crashing. Real tests are *wide* (many steps /
many includes), not deep, so `includes`/`repeat`/`jinja`/`mix` are the
representative "very long" cases.

## Notes

- No execution is triggered — timing stops where `Batch` would mark the test
  *loaded*.
- The profiles contain no `<| |>`, so the external eval process is not started.
  Pass `--with-eval` to `load_bench.py` for trees that evaluate at load time.
- Numbers are machine- and disk-specific; only compare runs from the same host.