[claude] Add agent performance regression benchmark suite (#1015) #1053

Merged
claude merged 1 commit from claude/issue-1015 into main 2026-03-22 23:55:27 +00:00

Alexander Whitestone
49990e6aec feat: add agent performance regression benchmark suite
Some checks failed
Tests / lint (pull_request) Successful in 16s
Tests / test (pull_request) Failing after 13m58s
Implement standardised Morrowind benchmark scenarios to detect agent
performance regressions after code changes.

- 5 built-in scenarios: navigation (Seyda Neen→Balmora, Balmora
  intra-city), quest (Fargoth's Ring), combat (Mudcrab), observation
- BenchmarkRunner executes scenarios through the heartbeat loop with
  MockWorldAdapter, tracking cycles, wall time, LLM calls, metabolic cost
- Goal predicates (reached_location, interacted_with) stop a scenario
  early once its goal is met
- BenchmarkMetrics with JSONL persistence and compare_runs() for
  regression detection
- CLI script (scripts/run_benchmarks.py) with tag filtering and
  baseline comparison
- tox -e benchmark environment for CI integration
- 31 unit tests covering scenarios, predicates, metrics, runner, and
  persistence
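
A minimal sketch of how goal predicates, per-run metrics, and regression comparison could fit together. This is illustrative only, not the code in this commit: the dict-based world state, the `Metrics` dataclass, `run_scenario`, `to_jsonl`, the `tolerance` parameter, and the signature given to `compare_runs` here are all assumptions, and the real BenchmarkRunner drives the heartbeat loop through MockWorldAdapter rather than a toy step function.

```python
import json
from dataclasses import dataclass, asdict
from typing import Callable

# Assumption: a plain dict stands in for whatever state the real adapter exposes.
WorldState = dict
GoalPredicate = Callable[[WorldState], bool]


def reached_location(name: str) -> GoalPredicate:
    """Goal holds once the agent's current location equals `name`."""
    return lambda state: state.get("location") == name


def interacted_with(target: str) -> GoalPredicate:
    """Goal holds once `target` appears in the agent's interaction log."""
    return lambda state: target in state.get("interactions", [])


@dataclass
class Metrics:  # hypothetical, much smaller than the real BenchmarkMetrics
    scenario: str
    success: bool
    cycles: int


def run_scenario(name, goal, step, max_cycles=10):
    """Advance one cycle at a time and stop early as soon as the goal holds."""
    state = {"location": "Seyda Neen", "interactions": []}
    for cycle in range(1, max_cycles + 1):
        state = step(state, cycle)
        if goal(state):
            return Metrics(name, True, cycle)
    return Metrics(name, False, max_cycles)


def to_jsonl(runs):
    """Serialise one JSON object per line, as a JSONL log would store them."""
    return "\n".join(json.dumps(asdict(m), sort_keys=True) for m in runs)


def compare_runs(baseline, current, tolerance=0.1):
    """Return scenarios that newly fail or need noticeably more cycles."""
    base = {m.scenario: m for m in baseline}
    regressions = []
    for m in current:
        b = base.get(m.scenario)
        if b is None:
            continue  # no baseline to compare against
        lost_success = b.success and not m.success
        slower = m.cycles > b.cycles * (1 + tolerance)
        if lost_success or slower:
            regressions.append(m.scenario)
    return regressions


def fast_walk(state, cycle):
    # Toy step: the agent arrives in Balmora on the third cycle.
    return {**state, "location": "Balmora"} if cycle >= 3 else state


def slow_walk(state, cycle):
    # Toy step simulating a regression: arrival is delayed to the fifth cycle.
    return {**state, "location": "Balmora"} if cycle >= 5 else state


goal = reached_location("Balmora")
baseline = [run_scenario("seyda_neen_to_balmora", goal, fast_walk)]
current = [run_scenario("seyda_neen_to_balmora", goal, slow_walk)]
print(to_jsonl(baseline))
print(compare_runs(baseline, current))
```

In this sketch a scenario counts as regressed when it stops succeeding or exceeds the baseline cycle count by more than the tolerance; wall time, LLM calls, and metabolic cost could be compared the same way.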

Fixes #1015

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-22 19:54:26 -04:00