[claude] Add agent performance regression benchmark suite (#1015) #1053

Merged

claude merged 1 commit from claude/issue-1015 into main

2026-03-22 23:55:27 +00:00

Author: Alexander Whitestone

SHA1: 49990e6aec

feat: add agent performance regression benchmark suite

Tests / lint (pull_request) Successful in 16s

Tests / test (pull_request) Failing after 13m58s

Implement standardised Morrowind benchmark scenarios to detect agent
performance regressions after code changes.

- 5 built-in scenarios: navigation (Seyda Neen→Balmora, Balmora
  intra-city), quest (Fargoth's Ring), combat (Mudcrab), observation
- BenchmarkRunner executes scenarios through the heartbeat loop with
  MockWorldAdapter, tracking cycles, wall time, LLM calls, metabolic cost
- Goal predicates (reached_location, interacted_with) for early success
- BenchmarkMetrics with JSONL persistence and compare_runs() for
  regression detection
- CLI script (scripts/run_benchmarks.py) with tag filtering and
  baseline comparison
- tox -e benchmark environment for CI integration
- 31 unit tests covering scenarios, predicates, metrics, runner, and
  persistence
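
As a rough illustration of the goal-predicate bullet above, here is a minimal sketch of how predicates such as reached_location and interacted_with could end a run early. Only those two names come from the commit message; the WorldState class, the predicate signatures, run_until_goal, and scripted_walk are hypothetical stand-ins and are not taken from the actual BenchmarkRunner or MockWorldAdapter.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Callable


@dataclass
class WorldState:
    """Hypothetical snapshot of the agent's situation (not the project's real type)."""
    location: str = ""
    interactions: set[str] = field(default_factory=set)


# A goal predicate inspects the current state and reports whether the goal is met.
GoalPredicate = Callable[[WorldState], bool]


def reached_location(name: str) -> GoalPredicate:
    """Assumed signature: succeed once the agent stands in `name`."""
    def check(state: WorldState) -> bool:
        return state.location == name
    return check


def interacted_with(entity: str) -> GoalPredicate:
    """Assumed signature: succeed once the agent has interacted with `entity`."""
    def check(state: WorldState) -> bool:
        return entity in state.interactions
    return check


def run_until_goal(
    step: Callable[[WorldState], WorldState],
    goal: GoalPredicate,
    max_cycles: int,
) -> tuple[bool, int]:
    """Call `step` up to `max_cycles` times, stopping early when `goal` holds.

    Returns (success, cycles_used).
    """
    state = WorldState()
    for cycle in range(1, max_cycles + 1):
        state = step(state)
        if goal(state):
            return True, cycle
    return False, max_cycles


# Illustrative stand-in for one heartbeat cycle: walk a fixed route, one stop per cycle.
ROUTE = ["Seyda Neen", "waypoint", "Balmora"]


def scripted_walk(state: WorldState) -> WorldState:
    nxt = ROUTE.index(state.location) + 1 if state.location in ROUTE else 0
    return WorldState(ROUTE[min(nxt, len(ROUTE) - 1)], state.interactions)


success, cycles = run_until_goal(scripted_walk, reached_location("Balmora"), max_cycles=10)
```

Returning the cycle count alongside the success flag is one way to let a runner record how quickly a scenario finished, which is the quantity a regression check would later compare.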

Fixes #1015

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

2026-03-22 19:54:26 -04:00
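
To make the JSONL persistence and compare_runs() regression detection described in the commit message more concrete, the following sketch appends one JSON object per run to a file and flags metrics that got worse than a baseline. The field names mirror the metrics listed in the commit message (cycles, wall time, LLM calls, metabolic cost), but the RunMetrics class, the append_run and load_runs helpers, the compare_runs signature, and the 10% tolerance are assumptions made for illustration, not the merged implementation.

```python
import json
from dataclasses import asdict, dataclass
from pathlib import Path


@dataclass
class RunMetrics:
    """Hypothetical per-scenario record; the real BenchmarkMetrics may differ."""
    scenario: str
    success: bool
    cycles: int
    wall_time_s: float
    llm_calls: int
    metabolic_cost: float


def append_run(path: str, metrics: RunMetrics) -> None:
    """Append one run as a single JSON line (JSONL: one object per line)."""
    with Path(path).open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(asdict(metrics)) + "\n")


def load_runs(path: str) -> list[RunMetrics]:
    """Read every non-empty line of the JSONL file back into RunMetrics."""
    with Path(path).open(encoding="utf-8") as fh:
        return [RunMetrics(**json.loads(line)) for line in fh if line.strip()]


def compare_runs(
    baseline: RunMetrics, current: RunMetrics, tolerance: float = 0.10
) -> list[str]:
    """Return descriptions of regressions; an empty list means none were found.

    A cost metric regresses when it exceeds the baseline by more than
    `tolerance` (an assumed relative threshold). Losing a previously
    successful scenario always counts as a regression.
    """
    regressions = []
    if baseline.success and not current.success:
        regressions.append("success: passed in baseline, failed now")
    for name in ("cycles", "wall_time_s", "llm_calls", "metabolic_cost"):
        old, new = getattr(baseline, name), getattr(current, name)
        if new > old * (1 + tolerance):
            regressions.append(f"{name}: {old} -> {new}")
    return regressions
```

An append-only JSONL file keeps each run independent, so a baseline can be chosen later by picking any earlier line rather than by maintaining a separate baseline format.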