Feature: Agent "Performance Regression" Suite #1015

Closed
opened 2026-03-22 23:04:55 +00:00 by gemini · 1 comment
Collaborator

Objective

Implement a standardized suite of Morrowind tasks to verify agent performance and prevent regressions after code changes.

Scope

  • Define a set of "Benchmark Scenarios" (e.g., "Walk from Seyda Neen to Balmora", "Complete the Fargoth quest").
  • Create a script to run the agent through these scenarios in a headless OpenMW instance.
  • Track metrics: time to completion, success rate, number of LLM calls, total "metabolic" cost.
  • Integrate the suite into the CI/CD pipeline.
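
To make the scope above concrete, here is a minimal sketch of how a benchmark scenario and its metrics might be represented. All names (`Scenario`, `Metrics`, `run_scenario`, `toy_step`, the `distance_to_balmora` key) are hypothetical and used only for illustration; the issue does not prescribe an API, and a toy step function stands in for the real agent and the headless OpenMW instance.

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

# Hypothetical world state: a plain dict standing in for real game state.
WorldState = Dict[str, float]


@dataclass
class Scenario:
    name: str
    tags: List[str]
    max_cycles: int                      # cycle budget before the run counts as failed
    goal: Callable[[WorldState], bool]   # returns True once the task is complete


@dataclass
class Metrics:
    scenario: str
    success: bool
    cycles: int
    wall_time_s: float
    llm_calls: int
    metabolic_cost: float


def run_scenario(
    scenario: Scenario,
    step: Callable[[WorldState], Tuple[int, float]],
    state: WorldState,
) -> Metrics:
    """Run one scenario; `step` advances the state by one agent cycle and
    returns (LLM calls made, metabolic cost spent) for that cycle."""
    start = time.perf_counter()
    cycles, llm_calls, cost = 0, 0, 0.0
    success = scenario.goal(state)
    while not success and cycles < scenario.max_cycles:
        calls, spent = step(state)
        llm_calls += calls
        cost += spent
        cycles += 1
        success = scenario.goal(state)  # stop as soon as the goal is met
    elapsed = time.perf_counter() - start
    return Metrics(scenario.name, success, cycles, elapsed, llm_calls, cost)


def toy_step(state: WorldState) -> Tuple[int, float]:
    """Stand-in for one agent cycle: move one unit closer, one LLM call."""
    state["distance_to_balmora"] -= 1
    return 1, 0.5


walk = Scenario(
    name="walk_seyda_neen_to_balmora",
    tags=["navigation"],
    max_cycles=10,
    goal=lambda s: s["distance_to_balmora"] <= 0,
)
result = run_scenario(walk, toy_step, {"distance_to_balmora": 3})
print(result)
```

Keeping the goal check as a plain predicate on the state lets the same loop serve navigation and quest scenarios alike, and the cycle budget turns a stuck agent into a recorded failure instead of a hung CI job.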
claude was assigned by Rockachopa 2026-03-22 23:30:11 +00:00
Collaborator

PR #1053 created.

Added a full agent performance regression benchmark suite:

  • 5 Morrowind scenarios (navigation, quest, combat, observation)
  • BenchmarkRunner executing through the heartbeat loop with MockWorldAdapter
  • Metrics: cycles, wall time, LLM calls, metabolic cost
  • Goal predicates for early success detection
  • JSONL persistence with regression comparison (compare_runs())
  • CLI script with tag filtering and baseline comparison
  • tox -e benchmark CI environment
  • 31 unit tests, all passing
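
For readers who have not opened the PR, the JSONL persistence and regression comparison could look roughly like the sketch below. This is not the code from PR #1053: `to_jsonl`, `from_jsonl`, `find_regressions`, the record fields, and the 10% tolerance are all assumptions made for illustration, and the actual `compare_runs()` may behave differently.

```python
import json
from typing import Dict, List

# Hypothetical record shape: one dict per scenario per run.
Record = Dict[str, object]


def to_jsonl(records: List[Record]) -> str:
    """Serialize records as JSON Lines: one JSON object per line."""
    return "".join(json.dumps(r) + "\n" for r in records)


def from_jsonl(text: str) -> Dict[str, Record]:
    """Parse JSON Lines into a mapping keyed by scenario name."""
    records = (json.loads(line) for line in text.splitlines() if line.strip())
    return {r["scenario"]: r for r in records}


def find_regressions(
    baseline: Dict[str, Record],
    current: Dict[str, Record],
    tolerance: float = 0.10,  # assumed threshold: allow up to 10% growth
) -> List[str]:
    """List scenarios that started failing or whose cost metrics grew
    beyond the tolerance relative to the baseline."""
    regressions = []
    for name, base in baseline.items():
        cur = current.get(name)
        if cur is None:
            continue  # scenario absent from the current run; nothing to compare
        if base["success"] and not cur["success"]:
            regressions.append(f"{name}: success -> failure")
        for key in ("cycles", "llm_calls", "metabolic_cost"):
            if cur[key] > base[key] * (1 + tolerance):
                regressions.append(f"{name}: {key} {base[key]} -> {cur[key]}")
    return regressions


baseline_text = to_jsonl([
    {"scenario": "walk", "success": True, "cycles": 10,
     "llm_calls": 5, "metabolic_cost": 2.0},
])
current_text = to_jsonl([
    {"scenario": "walk", "success": True, "cycles": 12,
     "llm_calls": 5, "metabolic_cost": 2.0},
])
report = find_regressions(from_jsonl(baseline_text), from_jsonl(current_text))
print(report)
```

Wall time is left out of the comparison here on purpose, since it tends to vary between CI machines; cycles, LLM calls, and metabolic cost should be more stable against a mock world.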

Reference: Rockachopa/Timmy-time-dashboard#1015