test: add load-time benchmark (jinja/include trees)

Generator + in-process harness timing the real loader's three stages and template/YAML call counts, across tunable profiles. cases/ git-ignored; see test/benchmark/README.md. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-05-31 10:41:42 +02:00
parent 6e31ae971a
commit ef49789780
5 changed files with 545 additions and 0 deletions
--- a/test/benchmark/.gitignore
+++ b/test/benchmark/.gitignore
@@ -0,0 +1 @@
 cases/
--- a/test/benchmark/README.md
+++ b/test/benchmark/README.md
@@ -0,0 +1,116 @@
 # Load-time benchmark
 Measures how long *testium* takes to **load** a `.tum` test tree — template
 rendering (jinja) + YAML parsing + test-tree construction — *without* executing
 it. Purpose: get reproducible numbers before/after load-path optimisations, and
 attribute any gain to a specific part of the pipeline.
 It is meant for *very long* tests, the kind you can build with `jinja` loops and
 `!include`, where load time becomes noticeable.
 ## Files
 | File | Role |
 |------|------|
 | `gen_bench_test.py` | Generates a synthetic `.tum` tree (the test input). |
 | `load_bench.py` | Drives the **real** loader in-process and times it. |
 | `run.sh` | Convenience: generate + time across profiles, using the project venv. |
 | `cases/` | Generated trees (git-ignored, recreated on demand). |
 The benchmark `.tum` files are **generated**, not committed — the generator is
 the artifact. They use only `let` leaves and `group` containers, so loading has
 no runtime side effect (no subprocess, no `<| |>` eval) and the timing reflects
 the parse/build pipeline alone.
 ## Quick start
 ```bash
 # default matrix (all profiles), 5 repeats each
 ./test/benchmark/run.sh
 # one profile at one size
 ./test/benchmark/run.sh repeat 2000
 # more repeats for a tighter min
 REPEAT=10 ./test/benchmark/run.sh includes 1000
 ```
 `run.sh` uses the project venv at `test/tmp/.venv` (created by `./run.sh`). If it
 is missing, run `./run.sh` once first.
 To drive the harness directly on any `.tum` (not just generated ones):
 ```bash
 test/tmp/.venv/bin/python3 test/benchmark/load_bench.py --repeat 5 --quiet path/to/main.tum
 ```
 ## Profiles
 Each profile isolates one cost. `--size` is the profile-specific count.
 | Profile | What it builds | Stresses |
 |---------|----------------|----------|
 | `flat` | one main file, N inline `let` steps | big YAML parse + linear object build |
 | `includes` | main `!include`s N **distinct** sub-files | per-include template+YAML+tempfile, `sequence` splice |
 | `repeat` | main `!include`s the **same** parametrised leaf N times | jinja **recompilation** of an identical template |
 | `jinja` | one main file, `{% for %}` emitting N steps | single large render + single large parse |
 | `deep` | nested includes, depth N | include recursion (see caveat) |
 | `mix` | groups + jinja loop + distinct + repeated includes | realistic blend |
 ## Reading the output
 ```
 phase              min      median
 initial         0.1131      0.1285   <- pass 1: discover config files (no includes)
 loadtest        1.0724      1.0900   <- config fixpoint loop + full recursive include load
 build           0.1850      0.1976   <- TestSet: load_test_recursively tree build
 total           1.3886      1.4227
 counters  (last run):
  templates :    1003 calls   0.5247s  (exclusive: jinja compile+render+tempfile)
  yaml      :    1004 parses   1.4696s  (inclusive of nested includes)
 ```
 - **min** is the headline (least noisy); median is a sanity check.
 - **initial / loadtest / build** map to the three pipeline stages in
  `interpreter/process.py` and `interpreter/test_set.py`. The main file is
  rendered+parsed across `initial` *and* `loadtest` (the loader does ~3 passes).
 - **templates** = number of `template_to_test()` calls and their *exclusive*
  wall time (one file render each — pure jinja compile+render+tempfile I/O).
  A high count with the same source file = recompilation, the `repeat` case.
 - **yaml** = number of `yaml_load()` parses. Its time is *inclusive* of nested
  includes, so use the **count** for attribution, not the seconds.
 ## Mapping to the optimisation axes
 | Axis (see DESIGN / discussion) | Watch | Best profile to prove it |
 |--------------------------------|-------|--------------------------|
 | 1 — cache compiled jinja templates | `templates` time drops, count unchanged | `repeat` |
 | 2 — drop the tempfile round-trip | `templates` time drops | `includes`, `repeat`, `mix` |
 | 3 — C YAML loader (libyaml) | `yaml` time / `loadtest` drops | `flat`, `jinja` |
 | 6 — O(n²) sequence splice | `build` drops | `includes`, `mix` |
 ## How to compare before/after a change
 1. Run the matrix on the current code, keep the output.
 2. Apply one axis.
 3. Re-run the **same** profiles/sizes; compare `min` per phase and the counters.
 Change one axis at a time so the attribution is clean. Run on an idle machine
 (and note the disk: on a USB stick the tempfile round-trip of axis 2 weighs
 more).
 ## Caveat: deep includes
 The loader is recursive and spends ~10 stack frames per include level, so
 `deep` hits Python's `RecursionError` around ~90 nested levels. The harness
 reports this cleanly instead of crashing. Real tests are *wide* (many steps /
 many includes), not deep, so `includes`/`repeat`/`jinja`/`mix` are the
 representative "very long" cases.
 ## Notes
 - No execution is triggered — timing stops where `Batch` would mark the test
  *loaded*.
 - The profiles contain no `<| |>`, so the external eval process is not started.
  Pass `--with-eval` to `load_bench.py` for trees that evaluate at load time.
 - Numbers are machine- and disk-specific; only compare runs from the same host.
--- a/test/benchmark/gen_bench_test.py
+++ b/test/benchmark/gen_bench_test.py
@@ -0,0 +1,179 @@
 #!/usr/bin/env python3
 """Generate synthetic ``.tum`` test trees to benchmark *load* time.
 The generated trees are deliberately cheap to *build* (only ``let`` leaves and
 ``group`` containers — no subprocess, no runtime side effect) so the load
 benchmark measures the parse / template / tree-build pipeline and nothing else.
 Profiles, each targeting a specific cost in the loader:
  flat      one main file, N inline ``let`` steps, no include, no jinja.
            Baseline: YAML parse of a big document + linear object build.
  includes  main ``!include``s N *distinct* sub-files (a few steps each).
            Stresses the per-include template+YAML+tempfile round-trip and the
            ``sequence`` splice in test_set.load_test_recursively.
  repeat    main ``!include``s the *same* parametrised leaf file N times.
            Stresses jinja *recompilation*: the compiled template is identical
            every time, only the render params (idx) differ -> the case a
            template cache collapses.
  jinja     one main file whose ``{% for %}`` loop emits N steps.
            Stresses a single large jinja render + a single large YAML parse.
  deep      nested includes, depth N (main -> d0 -> d1 -> ...).
            Stresses include recursion and per-level template+YAML.
  mix       a realistic blend: groups, a jinja loop, distinct includes and a
            repeated parametrised include.
 Usage:
    gen_bench_test.py --profile repeat --size 1000 --out cases/repeat_1000
    -> writes <out>/main.tum (+ includes, + param.yaml) and prints the path.
 """
 import argparse
 import os
 import shutil
 def _let(indent, i, name=None):
    name = name if name is not None else f"s{i}"
    pad = " " * indent
    return (
        f"{pad}- let:\n"
        f"{pad}    name: {name}\n"
        f"{pad}    values:\n"
        f"{pad}        - k{i}: {i}\n"
    )
 def gen_flat(out, n):
    body = "".join(_let(8, i) for i in range(n))
    main = f"main:\n    name: bench flat {n}\n    steps:\n{body}"
    _write(out, "main.tum", main)
 def gen_includes(out, n):
    steps = "".join(f"        - !include inc_{i}.tum\n" for i in range(n))
    main = f"main:\n    name: bench includes {n}\n    steps:\n{steps}"
    _write(out, "main.tum", main)
    for i in range(n):
        # each include is a YAML *sequence* (list of steps)
        seq = "".join(_let(0, i * 3 + j, name=f"inc{i}_{j}") for j in range(3))
        _write(out, f"inc_{i}.tum", seq)
 def gen_repeat(out, n):
    steps = "".join(
        f"        - !include {{file: leaf.tum, idx: {i}}}\n" for i in range(n)
    )
    main = f"main:\n    name: bench repeat {n}\n    steps:\n{steps}"
    _write(out, "main.tum", main)
    leaf = (
        "- let:\n"
        "    name: leaf_{{ idx }}\n"
        "    values:\n"
        "        - leaf_{{ idx }}: {{ idx }}\n"
    )
    _write(out, "leaf.tum", leaf)
 def gen_jinja(out, n):
    main = (
        f"main:\n    name: bench jinja {n}\n    steps:\n"
        "{% for i in range(" + str(n) + ") %}\n"
        "        - let:\n"
        "            name: j{{ i }}\n"
        "            values:\n"
        "                - k{{ i }}: {{ i }}\n"
        "{% endfor %}\n"
    )
    _write(out, "main.tum", main)
 def gen_deep(out, n):
    main = (
        f"main:\n    name: bench deep {n}\n    steps:\n"
        "        - let:\n            name: top\n            values:\n                - a: 0\n"
        "        - !include d_0.tum\n"
    )
    _write(out, "main.tum", main)
    for i in range(n):
        seq = _let(0, i, name=f"d{i}")
        if i < n - 1:
            seq += f"- !include d_{i + 1}.tum\n"
        _write(out, f"d_{i}.tum", seq)
 def gen_mix(out, n):
    # n groups, each: 2 inline lets, one distinct include, one repeated include,
    # plus a small jinja loop. Roughly ~6*n steps.
    per = max(1, n)
    parts = [f"main:\n    name: bench mix {n}\n    steps:\n"]
    for g in range(per):
        parts.append(
            f"        - group:\n"
            f"            name: grp{g}\n"
            f"            steps:\n"
        )
        parts.append(_let(16, g * 2, name=f"g{g}_a"))
        parts.append(_let(16, g * 2 + 1, name=f"g{g}_b"))
        parts.append(f"                - !include inc_{g}.tum\n")
        parts.append(f"                - !include {{file: leaf.tum, idx: {g}}}\n")
        parts.append(
            "{% for i in range(3) %}\n"
            f"                - let:\n"
            f"                    name: g{g}_j{{{{ i }}}}\n"
            f"                    values:\n"
            f"                        - g{g}_k{{{{ i }}}}: {{{{ i }}}}\n"
            "{% endfor %}\n"
        )
    _write(out, "main.tum", "".join(parts))
    for g in range(per):
        _write(out, f"inc_{g}.tum", _let(0, g, name=f"mixinc{g}"))
    _write(
        out,
        "leaf.tum",
        "- let:\n    name: mixleaf_{{ idx }}\n    values:\n        - mixleaf_{{ idx }}: {{ idx }}\n",
    )
 PROFILES = {
    "flat": gen_flat,
    "includes": gen_includes,
    "repeat": gen_repeat,
    "jinja": gen_jinja,
    "deep": gen_deep,
    "mix": gen_mix,
 }
 def _write(out, name, content):
    with open(os.path.join(out, name), "w") as f:
        f.write(content)
 def main():
    ap = argparse.ArgumentParser(description=__doc__,
                                 formatter_class=argparse.RawDescriptionHelpFormatter)
    ap.add_argument("--profile", required=True, choices=sorted(PROFILES))
    ap.add_argument("--size", type=int, default=1000,
                    help="profile-specific count (steps / includes / depth)")
    ap.add_argument("--out", required=True, help="output directory (recreated)")
    args = ap.parse_args()
    out = os.path.abspath(args.out)
    if os.path.isdir(out):
        shutil.rmtree(out)
    os.makedirs(out)
    # minimal config file so the loader does not emit "no param file" noise
    _write(out, "param.yaml", "bench_dummy: 1\n")
    PROFILES[args.profile](out, args.size)
    print(os.path.join(out, "main.tum"))
 if __name__ == "__main__":
    main()
--- a/test/benchmark/load_bench.py
+++ b/test/benchmark/load_bench.py
@@ -0,0 +1,200 @@
 #!/usr/bin/env python3
 """Time the testium *load* pipeline on a given ``.tum`` tree.
 It drives the real loader code (``TestProcess._load_initial_params`` /
 ``_load_test`` then ``TestSet(...)``) in-process, so the numbers track the
 production path and stay honest as the code evolves. Execution is never
 triggered — we stop exactly where ``Batch`` would report the test as *loaded*.
 Reported per run, over ``--repeat`` iterations (min is the headline, least
 noisy):
    initial   first pass: discover config files (template+YAML, no includes)
    loadtest  config-file fixpoint loop + full recursive include/template/YAML
    build     TestSet construction: the load_test_recursively tree build
    total     sum of the three
 Plus instrumentation counters (exact call counts, wall time) for the two
 hot leaves the optimisation axes target:
    templates  jinja template_to_test() calls   (axis 1 compile cache, axis 2 tempfile)
    yaml       yaml_load() parses               (axis 3 C loader)
 template time is exclusive (one file render); yaml time is wall-inclusive of
 nested includes, so lean on the *counts* for attribution.
 Must run inside the project venv (jinja2, pyyaml, telnetlib3, ...). The
 benchmark profiles contain no ``<| |>`` so the external eval process is not
 needed; pass --with-eval to start it for faithfulness on eval-heavy trees.
 Usage (see run.sh for the convenience wrapper):
    test/tmp/.venv/bin/python3 test/benchmark/load_bench.py [--repeat 5] <main.tum>
 """
 import argparse
 import os
 import statistics
 import sys
 from queue import Queue
 from time import perf_counter
 # --- bootstrap: src/testium for flat imports, src for `import testium` --------
 HERE = os.path.dirname(os.path.abspath(__file__))
 ROOT = os.path.abspath(os.path.join(HERE, "..", ".."))
 sys.path.insert(0, os.path.join(ROOT, "src"))
 sys.path.insert(0, os.path.join(ROOT, "src", "testium"))
 import api.testium as tm  # noqa: E402
 from interpreter.utils.test_init import env_init, apply_overrides  # noqa: E402
 from interpreter.utils.test_ctrl import TestSetController  # noqa: E402
 from interpreter.process import TestProcess  # noqa: E402
 from interpreter.test_set import TestSet  # noqa: E402
 from interpreter.utils.py_eval import eval_process_init  # noqa: E402
 from interpreter.utils.api_srv import api_request  # noqa: E402
 # --- instrumentation: count + time the two hot leaves -------------------------
 import interpreter.process as _proc  # noqa: E402
 import interpreter.utils.include as _inc  # noqa: E402
 import interpreter.utils.test_init as _ti  # noqa: E402
 import interpreter.utils.template as _tpl  # noqa: E402
 import interpreter.utils.yaml_load as _yl  # noqa: E402
 _C = {"tpl_n": 0, "tpl_t": 0.0, "yaml_n": 0, "yaml_t": 0.0}
 _orig_tpl = _tpl.template_to_test
 _orig_yaml = _yl.yaml_load
 def _wrap_tpl(*a, **k):
    t = perf_counter()
    try:
        return _orig_tpl(*a, **k)
    finally:
        _C["tpl_t"] += perf_counter() - t
        _C["tpl_n"] += 1
 def _wrap_yaml(*a, **k):
    t = perf_counter()
    try:
        return _orig_yaml(*a, **k)
    finally:
        _C["yaml_t"] += perf_counter() - t
        _C["yaml_n"] += 1
 # rebind in every module that did `from ... import template_to_test / yaml_load`
 for _m in (_proc, _inc):
    _m.template_to_test = _wrap_tpl
 for _m in (_proc, _inc, _ti):
    _m.yaml_load = _wrap_yaml
 def _reset_counters():
    _C.update(tpl_n=0, tpl_t=0.0, yaml_n=0, yaml_t=0.0)
 def load_once(tp, fname, test_dir):
    """One full load (no execution). Returns (initial, loadtest, build) seconds."""
    t0 = perf_counter()
    init_pf, gv = tp._load_initial_params(test_dir)
    t1 = perf_counter()
    test_dict, _pf = tp._load_test(init_pf, gv)
    t2 = perf_counter()
    TestSet(fname, test_dict, Queue())
    t3 = perf_counter()
    return (t1 - t0, t2 - t1, t3 - t2)
 def main():
    ap = argparse.ArgumentParser(description=__doc__,
                                 formatter_class=argparse.RawDescriptionHelpFormatter)
    ap.add_argument("main_tum", help="path to the generated main.tum")
    ap.add_argument("--repeat", type=int, default=5)
    ap.add_argument("--with-eval", action="store_true",
                    help="start the external eval process (needed only for <| |> at load)")
    ap.add_argument("--quiet", action="store_true",
                    help="silence the loader's INFO output during runs")
    args = ap.parse_args()
    fname = os.path.abspath(args.main_tum)
    if not os.path.isfile(fname):
        ap.error(f"not found: {fname}")
    test_dir = os.path.dirname(fname)
    env_init()
    apply_overrides({}, {})
    eval_proc = None
    if args.with_eval:
        eval_proc = eval_process_init(api_request, 10, test_dir)
        eval_proc.start()
        eval_proc.wait_ready(10)
    if args.quiet:
        # the loader prints a couple of INFO lines per config file; mute stdout
        # around the measured section to avoid I/O skew.
        devnull = open(os.devnull, "w")
        real_stdout = sys.stdout
    tp = TestProcess(fname, Queue(), TestSetController(),
                     config_files=[], defines={}, gui_defaults={}, text_mode=True)
    samples = []  # list of (initial, loadtest, build)
    last_counters = None
    try:
        for r in range(args.repeat):
            _reset_counters()
            if args.quiet:
                sys.stdout = devnull
            try:
                samples.append(load_once(tp, fname, test_dir))
            except RecursionError:
                if args.quiet:
                    sys.stdout = real_stdout
                print(f"file      : {fname}")
                print("ERROR     : RecursionError during load — the include "
                      "nesting is too deep for the recursive loader.\n"
                      "            (each include level costs ~10 stack frames; "
                      "raise sys.setrecursionlimit to probe further.)")
                return 2
            except Exception as e:  # noqa: BLE001 - report, don't crash the bench
                if args.quiet:
                    sys.stdout = real_stdout
                print(f"file      : {fname}")
                print(f"ERROR     : load failed: {type(e).__name__}: {e}")
                return 2
            finally:
                if args.quiet:
                    sys.stdout = real_stdout
            last_counters = dict(_C)
    finally:
        if eval_proc is not None:
            eval_proc.stop()
            eval_proc.join()
        if args.quiet:
            devnull.close()
    initial = [s[0] for s in samples]
    loadtest = [s[1] for s in samples]
    build = [s[2] for s in samples]
    total = [sum(s) for s in samples]
    def stat(xs):
        return min(xs), statistics.median(xs)
    print(f"file      : {fname}")
    print(f"repeats   : {args.repeat}   (showing  min | median, seconds)")
    print(f"{'phase':<10}{'min':>12}{'median':>12}")
    for name, xs in (("initial", initial), ("loadtest", loadtest),
                     ("build", build), ("total", total)):
        mn, md = stat(xs)
        print(f"{name:<10}{mn:>12.4f}{md:>12.4f}")
    if last_counters:
        print("counters  (last run):")
        print(f"  templates : {last_counters['tpl_n']:>7d} calls   "
              f"{last_counters['tpl_t']:>8.4f}s  (exclusive: jinja compile+render+tempfile)")
        print(f"  yaml      : {last_counters['yaml_n']:>7d} parses  "
              f"{last_counters['yaml_t']:>8.4f}s  (inclusive of nested includes)")
 if __name__ == "__main__":
    sys.exit(main() or 0)
--- a/test/benchmark/run.sh
+++ b/test/benchmark/run.sh
@@ -0,0 +1,49 @@
 #!/bin/bash
 # Load-time benchmark driver: generate synthetic .tum trees and time the
 # testium load pipeline on them, using the project venv.
 #
 # Usage:
 #   ./test/benchmark/run.sh                     # default matrix (all profiles)
 #   ./test/benchmark/run.sh <profile> <size>    # one profile at one size
 #   REPEAT=10 ./test/benchmark/run.sh repeat 2000
 #
 # Profiles: flat includes repeat jinja deep mix   (see gen_bench_test.py)
 #
 # Generated trees go under test/benchmark/cases/ (git-ignored). The numbers
 # are wall-clock; run on an otherwise idle machine and compare min values.
 set -e
 SCRIPT_DIR="$(realpath "$(dirname "$(readlink -f "$0")")")"
 PROJECT_DIR="$(realpath "$SCRIPT_DIR/../..")"
 VPY="$PROJECT_DIR/test/tmp/.venv/bin/python3"
 CASES="$SCRIPT_DIR/cases"
 REPEAT="${REPEAT:-5}"
 if [ ! -x "$VPY" ]; then
    echo "ERROR: project venv not found at $VPY — run ./run.sh once to create it." >&2
    exit 1
 fi
 bench() {
    local profile="$1" size="$2"
    local out="$CASES/${profile}_${size}"
    local main
    main="$("$VPY" "$SCRIPT_DIR/gen_bench_test.py" --profile "$profile" --size "$size" --out "$out")"
    echo "===== profile=$profile size=$size ====="
    "$VPY" "$SCRIPT_DIR/load_bench.py" --repeat "$REPEAT" --quiet "$main"
    echo
 }
 if [ $# -eq 2 ]; then
    bench "$1" "$2"
    exit 0
 fi
 # Default matrix. 'deep' is kept small: the recursive loader hits Python's
 # recursion limit around ~90 nested include levels.
 bench flat     2000
 bench includes 1000
 bench repeat   1000
 bench jinja    2000
 bench deep     40
 bench mix      300