Implement standardised Morrowind benchmark scenarios to detect agent
performance regressions after code changes.
- 5 built-in scenarios: navigation (Seyda Neen→Balmora, Balmora
intra-city), quest (Fargoth's Ring), combat (Mudcrab), observation
- BenchmarkRunner executes scenarios through the heartbeat loop with
MockWorldAdapter, tracking cycles, wall time, LLM calls, metabolic cost
- Goal predicates (reached_location, interacted_with) that end a run
early once the scenario goal is met
- BenchmarkMetrics with JSONL persistence and compare_runs() for
regression detection
- CLI script (scripts/run_benchmarks.py) with tag filtering and
baseline comparison
- tox -e benchmark environment for CI integration
- 31 unit tests covering scenarios, predicates, metrics, runner, and
persistence
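
A minimal sketch of how JSONL persistence and the compare_runs()
regression check could fit together. The field names, the tolerance
parameter, and the to_jsonl/from_jsonl helpers are assumptions made for
illustration and may not match the code in this change:

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class BenchmarkMetrics:
    # Hypothetical fields, named after the quantities tracked by the runner.
    scenario: str
    success: bool
    cycles: int
    wall_time_s: float
    llm_calls: int
    metabolic_cost: float


def to_jsonl(runs):
    """Serialise runs as JSONL: one JSON object per line."""
    return "\n".join(json.dumps(asdict(run)) for run in runs)


def from_jsonl(text):
    """Parse JSONL text back into BenchmarkMetrics, skipping blank lines."""
    return [
        BenchmarkMetrics(**json.loads(line))
        for line in text.splitlines()
        if line.strip()
    ]


def compare_runs(baseline, current, tolerance=0.10):
    """Report scenarios that lost success or whose cycle count grew by
    more than `tolerance` (a fraction) relative to the baseline."""
    reference = {run.scenario: run for run in baseline}
    regressions = []
    for run in current:
        ref = reference.get(run.scenario)
        if ref is None:
            continue  # new scenario, nothing to compare against
        if ref.success and not run.success:
            regressions.append(f"{run.scenario}: success -> failure")
        elif run.cycles > ref.cycles * (1 + tolerance):
            regressions.append(
                f"{run.scenario}: cycles {ref.cycles} -> {run.cycles}"
            )
    return regressions
```

In this sketch a CI job would load the stored baseline with
from_jsonl() and fail when compare_runs() returns a non-empty list.
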
Fixes #1015
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>