diff --git a/test/benchmark/.gitignore b/test/benchmark/.gitignore new file mode 100644 index 0000000..b325a95 --- /dev/null +++ b/test/benchmark/.gitignore @@ -0,0 +1 @@ +cases/ diff --git a/test/benchmark/README.md b/test/benchmark/README.md new file mode 100644 index 0000000..6db171e --- /dev/null +++ b/test/benchmark/README.md @@ -0,0 +1,116 @@ +# Load-time benchmark + +Measures how long *testium* takes to **load** a `.tum` test tree — template +rendering (jinja) + YAML parsing + test-tree construction — *without* executing +it. Purpose: get reproducible numbers before/after load-path optimisations, and +attribute any gain to a specific part of the pipeline. + +It is meant for *very long* tests, the kind you can build with `jinja` loops and +`!include`, where load time becomes noticeable. + +## Files + +| File | Role | +|------|------| +| `gen_bench_test.py` | Generates a synthetic `.tum` tree (the test input). | +| `load_bench.py` | Drives the **real** loader in-process and times it. | +| `run.sh` | Convenience: generate + time across profiles, using the project venv. | +| `cases/` | Generated trees (git-ignored, recreated on demand). | + +The benchmark `.tum` files are **generated**, not committed — the generator is +the artifact. They use only `let` leaves and `group` containers, so loading has +no runtime side effect (no subprocess, no `<| |>` eval) and the timing reflects +the parse/build pipeline alone. + +## Quick start + +```bash +# default matrix (all profiles), 5 repeats each +./test/benchmark/run.sh + +# one profile at one size +./test/benchmark/run.sh repeat 2000 + +# more repeats for a tighter min +REPEAT=10 ./test/benchmark/run.sh includes 1000 +``` + +`run.sh` uses the project venv at `test/tmp/.venv` (created by `./run.sh`). If it +is missing, run `./run.sh` once first. + +To drive the harness directly on any `.tum` (not just generated ones): + +```bash +test/tmp/.venv/bin/python3 test/benchmark/load_bench.py --repeat 5 --quiet path/to/main.tum +``` + +## Profiles + +Each profile isolates one cost. `--size` is the profile-specific count. + +| Profile | What it builds | Stresses | +|---------|----------------|----------| +| `flat` | one main file, N inline `let` steps | big YAML parse + linear object build | +| `includes` | main `!include`s N **distinct** sub-files | per-include template+YAML+tempfile, `sequence` splice | +| `repeat` | main `!include`s the **same** parametrised leaf N times | jinja **recompilation** of an identical template | +| `jinja` | one main file, `{% for %}` emitting N steps | single large render + single large parse | +| `deep` | nested includes, depth N | include recursion (see caveat) | +| `mix` | groups + jinja loop + distinct + repeated includes | realistic blend | + +## Reading the output + +``` +phase min median +initial 0.1131 0.1285 <- pass 1: discover config files (no includes) +loadtest 1.0724 1.0900 <- config fixpoint loop + full recursive include load +build 0.1850 0.1976 <- TestSet: load_test_recursively tree build +total 1.3886 1.4227 +counters (last run): + templates : 1003 calls 0.5247s (exclusive: jinja compile+render+tempfile) + yaml : 1004 parses 1.4696s (inclusive of nested includes) +``` + +- **min** is the headline (least noisy); median is a sanity check. +- **initial / loadtest / build** map to the three pipeline stages in + `interpreter/process.py` and `interpreter/test_set.py`. The main file is + rendered+parsed across `initial` *and* `loadtest` (the loader does ~3 passes). +- **templates** = number of `template_to_test()` calls and their *exclusive* + wall time (one file render each — pure jinja compile+render+tempfile I/O). + A high count with the same source file = recompilation, the `repeat` case. +- **yaml** = number of `yaml_load()` parses. Its time is *inclusive* of nested + includes, so use the **count** for attribution, not the seconds. + +## Mapping to the optimisation axes + +| Axis (see DESIGN / discussion) | Watch | Best profile to prove it | +|--------------------------------|-------|--------------------------| +| 1 — cache compiled jinja templates | `templates` time drops, count unchanged | `repeat` | +| 2 — drop the tempfile round-trip | `templates` time drops | `includes`, `repeat`, `mix` | +| 3 — C YAML loader (libyaml) | `yaml` time / `loadtest` drops | `flat`, `jinja` | +| 6 — O(n²) sequence splice | `build` drops | `includes`, `mix` | + +## How to compare before/after a change + +1. Run the matrix on the current code, keep the output. +2. Apply one axis. +3. Re-run the **same** profiles/sizes; compare `min` per phase and the counters. + +Change one axis at a time so the attribution is clean. Run on an idle machine +(and note the disk: on a USB stick the tempfile round-trip of axis 2 weighs +more). + +## Caveat: deep includes + +The loader is recursive and spends ~10 stack frames per include level, so +`deep` hits Python's `RecursionError` around ~90 nested levels. The harness +reports this cleanly instead of crashing. Real tests are *wide* (many steps / +many includes), not deep, so `includes`/`repeat`/`jinja`/`mix` are the +representative "very long" cases. + +## Notes + +- No execution is triggered — timing stops where `Batch` would mark the test + *loaded*. +- The profiles contain no `<| |>`, so the external eval process is not started. + Pass `--with-eval` to `load_bench.py` for trees that evaluate at load time. +- Numbers are machine- and disk-specific; only compare runs from the same host. diff --git a/test/benchmark/gen_bench_test.py b/test/benchmark/gen_bench_test.py new file mode 100755 index 0000000..618aa6f --- /dev/null +++ b/test/benchmark/gen_bench_test.py @@ -0,0 +1,179 @@ +#!/usr/bin/env python3 +"""Generate synthetic ``.tum`` test trees to benchmark *load* time. + +The generated trees are deliberately cheap to *build* (only ``let`` leaves and +``group`` containers — no subprocess, no runtime side effect) so the load +benchmark measures the parse / template / tree-build pipeline and nothing else. + +Profiles, each targeting a specific cost in the loader: + + flat one main file, N inline ``let`` steps, no include, no jinja. + Baseline: YAML parse of a big document + linear object build. + + includes main ``!include``s N *distinct* sub-files (a few steps each). + Stresses the per-include template+YAML+tempfile round-trip and the + ``sequence`` splice in test_set.load_test_recursively. + + repeat main ``!include``s the *same* parametrised leaf file N times. + Stresses jinja *recompilation*: the compiled template is identical + every time, only the render params (idx) differ -> the case a + template cache collapses. + + jinja one main file whose ``{% for %}`` loop emits N steps. + Stresses a single large jinja render + a single large YAML parse. + + deep nested includes, depth N (main -> d0 -> d1 -> ...). + Stresses include recursion and per-level template+YAML. + + mix a realistic blend: groups, a jinja loop, distinct includes and a + repeated parametrised include. + +Usage: + gen_bench_test.py --profile repeat --size 1000 --out cases/repeat_1000 + -> writes /main.tum (+ includes, + param.yaml) and prints the path. +""" +import argparse +import os +import shutil + + +def _let(indent, i, name=None): + name = name if name is not None else f"s{i}" + pad = " " * indent + return ( + f"{pad}- let:\n" + f"{pad} name: {name}\n" + f"{pad} values:\n" + f"{pad} - k{i}: {i}\n" + ) + + +def gen_flat(out, n): + body = "".join(_let(8, i) for i in range(n)) + main = f"main:\n name: bench flat {n}\n steps:\n{body}" + _write(out, "main.tum", main) + + +def gen_includes(out, n): + steps = "".join(f" - !include inc_{i}.tum\n" for i in range(n)) + main = f"main:\n name: bench includes {n}\n steps:\n{steps}" + _write(out, "main.tum", main) + for i in range(n): + # each include is a YAML *sequence* (list of steps) + seq = "".join(_let(0, i * 3 + j, name=f"inc{i}_{j}") for j in range(3)) + _write(out, f"inc_{i}.tum", seq) + + +def gen_repeat(out, n): + steps = "".join( + f" - !include {{file: leaf.tum, idx: {i}}}\n" for i in range(n) + ) + main = f"main:\n name: bench repeat {n}\n steps:\n{steps}" + _write(out, "main.tum", main) + leaf = ( + "- let:\n" + " name: leaf_{{ idx }}\n" + " values:\n" + " - leaf_{{ idx }}: {{ idx }}\n" + ) + _write(out, "leaf.tum", leaf) + + +def gen_jinja(out, n): + main = ( + f"main:\n name: bench jinja {n}\n steps:\n" + "{% for i in range(" + str(n) + ") %}\n" + " - let:\n" + " name: j{{ i }}\n" + " values:\n" + " - k{{ i }}: {{ i }}\n" + "{% endfor %}\n" + ) + _write(out, "main.tum", main) + + +def gen_deep(out, n): + main = ( + f"main:\n name: bench deep {n}\n steps:\n" + " - let:\n name: top\n values:\n - a: 0\n" + " - !include d_0.tum\n" + ) + _write(out, "main.tum", main) + for i in range(n): + seq = _let(0, i, name=f"d{i}") + if i < n - 1: + seq += f"- !include d_{i + 1}.tum\n" + _write(out, f"d_{i}.tum", seq) + + +def gen_mix(out, n): + # n groups, each: 2 inline lets, one distinct include, one repeated include, + # plus a small jinja loop. Roughly ~6*n steps. + per = max(1, n) + parts = [f"main:\n name: bench mix {n}\n steps:\n"] + for g in range(per): + parts.append( + f" - group:\n" + f" name: grp{g}\n" + f" steps:\n" + ) + parts.append(_let(16, g * 2, name=f"g{g}_a")) + parts.append(_let(16, g * 2 + 1, name=f"g{g}_b")) + parts.append(f" - !include inc_{g}.tum\n") + parts.append(f" - !include {{file: leaf.tum, idx: {g}}}\n") + parts.append( + "{% for i in range(3) %}\n" + f" - let:\n" + f" name: g{g}_j{{{{ i }}}}\n" + f" values:\n" + f" - g{g}_k{{{{ i }}}}: {{{{ i }}}}\n" + "{% endfor %}\n" + ) + _write(out, "main.tum", "".join(parts)) + for g in range(per): + _write(out, f"inc_{g}.tum", _let(0, g, name=f"mixinc{g}")) + _write( + out, + "leaf.tum", + "- let:\n name: mixleaf_{{ idx }}\n values:\n - mixleaf_{{ idx }}: {{ idx }}\n", + ) + + +PROFILES = { + "flat": gen_flat, + "includes": gen_includes, + "repeat": gen_repeat, + "jinja": gen_jinja, + "deep": gen_deep, + "mix": gen_mix, +} + + +def _write(out, name, content): + with open(os.path.join(out, name), "w") as f: + f.write(content) + + +def main(): + ap = argparse.ArgumentParser(description=__doc__, + formatter_class=argparse.RawDescriptionHelpFormatter) + ap.add_argument("--profile", required=True, choices=sorted(PROFILES)) + ap.add_argument("--size", type=int, default=1000, + help="profile-specific count (steps / includes / depth)") + ap.add_argument("--out", required=True, help="output directory (recreated)") + args = ap.parse_args() + + out = os.path.abspath(args.out) + if os.path.isdir(out): + shutil.rmtree(out) + os.makedirs(out) + + # minimal config file so the loader does not emit "no param file" noise + _write(out, "param.yaml", "bench_dummy: 1\n") + + PROFILES[args.profile](out, args.size) + print(os.path.join(out, "main.tum")) + + +if __name__ == "__main__": + main() diff --git a/test/benchmark/load_bench.py b/test/benchmark/load_bench.py new file mode 100755 index 0000000..9fed9ee --- /dev/null +++ b/test/benchmark/load_bench.py @@ -0,0 +1,200 @@ +#!/usr/bin/env python3 +"""Time the testium *load* pipeline on a given ``.tum`` tree. + +It drives the real loader code (``TestProcess._load_initial_params`` / +``_load_test`` then ``TestSet(...)``) in-process, so the numbers track the +production path and stay honest as the code evolves. Execution is never +triggered — we stop exactly where ``Batch`` would report the test as *loaded*. + +Reported per run, over ``--repeat`` iterations (min is the headline, least +noisy): + + initial first pass: discover config files (template+YAML, no includes) + loadtest config-file fixpoint loop + full recursive include/template/YAML + build TestSet construction: the load_test_recursively tree build + total sum of the three + +Plus instrumentation counters (exact call counts, wall time) for the two +hot leaves the optimisation axes target: + + templates jinja template_to_test() calls (axis 1 compile cache, axis 2 tempfile) + yaml yaml_load() parses (axis 3 C loader) + +template time is exclusive (one file render); yaml time is wall-inclusive of +nested includes, so lean on the *counts* for attribution. + +Must run inside the project venv (jinja2, pyyaml, telnetlib3, ...). The +benchmark profiles contain no ``<| |>`` so the external eval process is not +needed; pass --with-eval to start it for faithfulness on eval-heavy trees. + +Usage (see run.sh for the convenience wrapper): + test/tmp/.venv/bin/python3 test/benchmark/load_bench.py [--repeat 5] +""" +import argparse +import os +import statistics +import sys +from queue import Queue +from time import perf_counter + +# --- bootstrap: src/testium for flat imports, src for `import testium` -------- +HERE = os.path.dirname(os.path.abspath(__file__)) +ROOT = os.path.abspath(os.path.join(HERE, "..", "..")) +sys.path.insert(0, os.path.join(ROOT, "src")) +sys.path.insert(0, os.path.join(ROOT, "src", "testium")) + +import api.testium as tm # noqa: E402 +from interpreter.utils.test_init import env_init, apply_overrides # noqa: E402 +from interpreter.utils.test_ctrl import TestSetController # noqa: E402 +from interpreter.process import TestProcess # noqa: E402 +from interpreter.test_set import TestSet # noqa: E402 +from interpreter.utils.py_eval import eval_process_init # noqa: E402 +from interpreter.utils.api_srv import api_request # noqa: E402 + +# --- instrumentation: count + time the two hot leaves ------------------------- +import interpreter.process as _proc # noqa: E402 +import interpreter.utils.include as _inc # noqa: E402 +import interpreter.utils.test_init as _ti # noqa: E402 +import interpreter.utils.template as _tpl # noqa: E402 +import interpreter.utils.yaml_load as _yl # noqa: E402 + +_C = {"tpl_n": 0, "tpl_t": 0.0, "yaml_n": 0, "yaml_t": 0.0} +_orig_tpl = _tpl.template_to_test +_orig_yaml = _yl.yaml_load + + +def _wrap_tpl(*a, **k): + t = perf_counter() + try: + return _orig_tpl(*a, **k) + finally: + _C["tpl_t"] += perf_counter() - t + _C["tpl_n"] += 1 + + +def _wrap_yaml(*a, **k): + t = perf_counter() + try: + return _orig_yaml(*a, **k) + finally: + _C["yaml_t"] += perf_counter() - t + _C["yaml_n"] += 1 + + +# rebind in every module that did `from ... import template_to_test / yaml_load` +for _m in (_proc, _inc): + _m.template_to_test = _wrap_tpl +for _m in (_proc, _inc, _ti): + _m.yaml_load = _wrap_yaml + + +def _reset_counters(): + _C.update(tpl_n=0, tpl_t=0.0, yaml_n=0, yaml_t=0.0) + + +def load_once(tp, fname, test_dir): + """One full load (no execution). Returns (initial, loadtest, build) seconds.""" + t0 = perf_counter() + init_pf, gv = tp._load_initial_params(test_dir) + t1 = perf_counter() + test_dict, _pf = tp._load_test(init_pf, gv) + t2 = perf_counter() + TestSet(fname, test_dict, Queue()) + t3 = perf_counter() + return (t1 - t0, t2 - t1, t3 - t2) + + +def main(): + ap = argparse.ArgumentParser(description=__doc__, + formatter_class=argparse.RawDescriptionHelpFormatter) + ap.add_argument("main_tum", help="path to the generated main.tum") + ap.add_argument("--repeat", type=int, default=5) + ap.add_argument("--with-eval", action="store_true", + help="start the external eval process (needed only for <| |> at load)") + ap.add_argument("--quiet", action="store_true", + help="silence the loader's INFO output during runs") + args = ap.parse_args() + + fname = os.path.abspath(args.main_tum) + if not os.path.isfile(fname): + ap.error(f"not found: {fname}") + test_dir = os.path.dirname(fname) + + env_init() + apply_overrides({}, {}) + + eval_proc = None + if args.with_eval: + eval_proc = eval_process_init(api_request, 10, test_dir) + eval_proc.start() + eval_proc.wait_ready(10) + + if args.quiet: + # the loader prints a couple of INFO lines per config file; mute stdout + # around the measured section to avoid I/O skew. + devnull = open(os.devnull, "w") + real_stdout = sys.stdout + + tp = TestProcess(fname, Queue(), TestSetController(), + config_files=[], defines={}, gui_defaults={}, text_mode=True) + + samples = [] # list of (initial, loadtest, build) + last_counters = None + try: + for r in range(args.repeat): + _reset_counters() + if args.quiet: + sys.stdout = devnull + try: + samples.append(load_once(tp, fname, test_dir)) + except RecursionError: + if args.quiet: + sys.stdout = real_stdout + print(f"file : {fname}") + print("ERROR : RecursionError during load — the include " + "nesting is too deep for the recursive loader.\n" + " (each include level costs ~10 stack frames; " + "raise sys.setrecursionlimit to probe further.)") + return 2 + except Exception as e: # noqa: BLE001 - report, don't crash the bench + if args.quiet: + sys.stdout = real_stdout + print(f"file : {fname}") + print(f"ERROR : load failed: {type(e).__name__}: {e}") + return 2 + finally: + if args.quiet: + sys.stdout = real_stdout + last_counters = dict(_C) + finally: + if eval_proc is not None: + eval_proc.stop() + eval_proc.join() + if args.quiet: + devnull.close() + + initial = [s[0] for s in samples] + loadtest = [s[1] for s in samples] + build = [s[2] for s in samples] + total = [sum(s) for s in samples] + + def stat(xs): + return min(xs), statistics.median(xs) + + print(f"file : {fname}") + print(f"repeats : {args.repeat} (showing min | median, seconds)") + print(f"{'phase':<10}{'min':>12}{'median':>12}") + for name, xs in (("initial", initial), ("loadtest", loadtest), + ("build", build), ("total", total)): + mn, md = stat(xs) + print(f"{name:<10}{mn:>12.4f}{md:>12.4f}") + if last_counters: + print("counters (last run):") + print(f" templates : {last_counters['tpl_n']:>7d} calls " + f"{last_counters['tpl_t']:>8.4f}s (exclusive: jinja compile+render+tempfile)") + print(f" yaml : {last_counters['yaml_n']:>7d} parses " + f"{last_counters['yaml_t']:>8.4f}s (inclusive of nested includes)") + + +if __name__ == "__main__": + sys.exit(main() or 0) diff --git a/test/benchmark/run.sh b/test/benchmark/run.sh new file mode 100755 index 0000000..222fedb --- /dev/null +++ b/test/benchmark/run.sh @@ -0,0 +1,49 @@ +#!/bin/bash +# Load-time benchmark driver: generate synthetic .tum trees and time the +# testium load pipeline on them, using the project venv. +# +# Usage: +# ./test/benchmark/run.sh # default matrix (all profiles) +# ./test/benchmark/run.sh # one profile at one size +# REPEAT=10 ./test/benchmark/run.sh repeat 2000 +# +# Profiles: flat includes repeat jinja deep mix (see gen_bench_test.py) +# +# Generated trees go under test/benchmark/cases/ (git-ignored). The numbers +# are wall-clock; run on an otherwise idle machine and compare min values. +set -e + +SCRIPT_DIR="$(realpath "$(dirname "$(readlink -f "$0")")")" +PROJECT_DIR="$(realpath "$SCRIPT_DIR/../..")" +VPY="$PROJECT_DIR/test/tmp/.venv/bin/python3" +CASES="$SCRIPT_DIR/cases" +REPEAT="${REPEAT:-5}" + +if [ ! -x "$VPY" ]; then + echo "ERROR: project venv not found at $VPY — run ./run.sh once to create it." >&2 + exit 1 +fi + +bench() { + local profile="$1" size="$2" + local out="$CASES/${profile}_${size}" + local main + main="$("$VPY" "$SCRIPT_DIR/gen_bench_test.py" --profile "$profile" --size "$size" --out "$out")" + echo "===== profile=$profile size=$size =====" + "$VPY" "$SCRIPT_DIR/load_bench.py" --repeat "$REPEAT" --quiet "$main" + echo +} + +if [ $# -eq 2 ]; then + bench "$1" "$2" + exit 0 +fi + +# Default matrix. 'deep' is kept small: the recursive loader hits Python's +# recursion limit around ~90 nested include levels. +bench flat 2000 +bench includes 1000 +bench repeat 1000 +bench jinja 2000 +bench deep 40 +bench mix 300