[claude] Add agent performance regression benchmark suite (#1015) #1053

claude · 2026-03-22T23:54:51Z

claude commented

2026-03-22 23:54:51 +00:00

Fixes #1015

Summary

5 built-in Morrowind benchmark scenarios: Seyda Neen→Balmora navigation, Fargoth quest, Balmora intra-city navigation, Mudcrab combat, Balmora market observation
BenchmarkRunner executes scenarios through the heartbeat loop with MockWorldAdapter, collecting metrics per cycle
Metrics tracked: cycles used, wall-clock time, LLM call count, metabolic cost (≈3 units/cycle for gather+reason+act phases)
Goal predicates (reached_location, interacted_with) enable early success detection
BenchmarkMetrics with JSONL persistence and compare_runs() for regression detection (catches REGRESSION, IMPROVEMENT, SLOWER)
CLI runner (scripts/run_benchmarks.py) with --tags filtering and --compare baseline analysis
tox -e benchmark environment for CI integration
31 unit tests covering scenarios, predicates, metrics, runner, and persistence

Test plan

All 31 benchmark tests pass (pytest tests/infrastructure/world/test_benchmark.py)
Lint clean (tox -e lint)
No regressions in existing test suite (pre-existing failures only)
Run tox -e benchmark to verify CLI execution
Run tox -e benchmark -- --tags navigation to verify tag filtering

Fixes #1015 ## Summary - **5 built-in Morrowind benchmark scenarios**: Seyda Neen→Balmora navigation, Fargoth quest, Balmora intra-city navigation, Mudcrab combat, Balmora market observation - **BenchmarkRunner** executes scenarios through the heartbeat loop with MockWorldAdapter, collecting metrics per cycle - **Metrics tracked**: cycles used, wall-clock time, LLM call count, metabolic cost (≈3 units/cycle for gather+reason+act phases) - **Goal predicates** (`reached_location`, `interacted_with`) enable early success detection - **BenchmarkMetrics** with JSONL persistence and `compare_runs()` for regression detection (catches REGRESSION, IMPROVEMENT, SLOWER) - **CLI runner** (`scripts/run_benchmarks.py`) with `--tags` filtering and `--compare` baseline analysis - **`tox -e benchmark`** environment for CI integration - **31 unit tests** covering scenarios, predicates, metrics, runner, and persistence ## Test plan - [x] All 31 benchmark tests pass (`pytest tests/infrastructure/world/test_benchmark.py`) - [x] Lint clean (`tox -e lint`) - [x] No regressions in existing test suite (pre-existing failures only) - [ ] Run `tox -e benchmark` to verify CLI execution - [ ] Run `tox -e benchmark -- --tags navigation` to verify tag filtering

claude added 1 commit 2026-03-22 23:54:52 +00:00

feat: add agent performance regression benchmark suite

Tests / lint (pull_request) Successful in 16s

Details

Tests / test (pull_request) Failing after 13m58s

Details

49990e6aec

Implement standardised Morrowind benchmark scenarios to detect agent
performance regressions after code changes.

- 5 built-in scenarios: navigation (Seyda Neen→Balmora, Balmora
  intra-city), quest (Fargoth's Ring), combat (Mudcrab), observation
- BenchmarkRunner executes scenarios through the heartbeat loop with
  MockWorldAdapter, tracking cycles, wall time, LLM calls, metabolic cost
- Goal predicates (reached_location, interacted_with) for early success
- BenchmarkMetrics with JSONL persistence and compare_runs() for
  regression detection
- CLI script (scripts/run_benchmarks.py) with tag filtering and
  baseline comparison
- tox -e benchmark environment for CI integration
- 31 unit tests covering scenarios, predicates, metrics, runner, and
  persistence

Fixes #1015

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

claude referenced this pull request

2026-03-22 23:55:06 +00:00

Feature: Agent "Performance Regression" Suite #1015

claude merged commit 45bde4df58 into main

2026-03-22 23:55:27 +00:00

claude deleted branch claude/issue-1015

2026-03-22 23:55:28 +00:00

claude referenced this issue from a commit

2026-03-22 23:55:29 +00:00

[claude] Add agent performance regression benchmark suite (#1015) (#1053)

Sign in to join this conversation.