[Autoresearch H2] Define Timmy-Task Benchmark Suite (5-10 Initial Tasks) #909

Closed
opened 2026-03-22 13:06:05 +00:00 by perplexity · 0 comments
Collaborator

Parent

Part of #904 (Autoresearch Integration Proposal v2) — Action Item #8

Goal

Create a standardized benchmark suite of 5-10 real Timmy tasks with scoring functions. This gives the meta-autoresearch loop (H2) something beyond training loss to optimize against.

Proposed Task Categories

| Category | Example Task | Scoring Function |
|---|---|---|
| Code generation | Generate a FastAPI endpoint from spec | Passes pytest + matches OpenAPI schema |
| Memory retrieval | Recall a fact from 50 conversations ago | Exact match or semantic similarity > 0.9 |
| Tool use | Search Drive, fetch doc, summarize | Task completion + latency under threshold |
| Multi-step reasoning | Debug a failing test across 3 files | Correct root cause identification |
| Creative writing | Draft comic dialogue in Timmy voice | Style consistency score (LLM-as-judge) |
| Morrowind reasoning | Plan optimal quest route in Balmora | Valid path + NPC/faction awareness |
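
As an illustration of what one scoring function from the table might look like, here is a minimal sketch for the memory-retrieval row ("exact match or semantic similarity > 0.9"). The function names and the `similarity` parameter are hypothetical and not part of any existing Timmy module. The default similarity is a lexical ratio from the standard library, used only as a stand-in; a real scorer would presumably plug in an embedding-based similarity instead.

```python
from difflib import SequenceMatcher
from typing import Callable


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so formatting differences are ignored."""
    return " ".join(text.lower().split())


def lexical_similarity(a: str, b: str) -> float:
    # Stand-in only: character-level overlap, NOT semantic similarity.
    # Assumption: a real scorer would swap in an embedding-based function.
    return SequenceMatcher(None, a, b).ratio()


def score_memory_retrieval(
    answer: str,
    expected: str,
    similarity: Callable[[str, str], float] = lexical_similarity,
    threshold: float = 0.9,
) -> float:
    """Return 1.0 on exact match or similarity strictly above threshold, else 0.0."""
    a, e = normalize(answer), normalize(expected)
    if a == e:
        return 1.0
    return 1.0 if similarity(a, e) > threshold else 0.0
```

Keeping the similarity function injectable means the pass/fail rule (exact match, else threshold) can stay fixed while the similarity backend is changed between experiments.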

Implementation

  1. Create `tests/benchmarks/` directory
  2. Each benchmark is a YAML/JSON config specifying:
    • Task description (natural language prompt)
    • Input fixtures (files, context, prior conversations)
    • Expected output or scoring function
    • Time budget
  3. Runner: `python -m timmy.cli benchmark` executes all tasks and reports scores
  4. Results logged to Vault memory for tracking over time
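
The steps above could be sketched roughly as follows. This is an assumption-laden illustration, not the actual `timmy.cli` implementation: the field names (`name`, `prompt`, `fixtures`, `expected`, `time_budget_s`), the fixture path, and the `Benchmark`, `load_benchmark`, and `run_benchmark` names are all hypothetical. JSON is used so the example needs only the standard library, and the time budget is checked after the call returns rather than enforced by interrupting the task.

```python
import json
import time
from dataclasses import dataclass, field
from typing import Callable

# Hypothetical config; the issue does not define a schema, so every field
# name and the fixture path below are illustrative assumptions.
EXAMPLE_CONFIG = """
{
  "name": "memory_recall_basic",
  "prompt": "What colour was the key mentioned earlier?",
  "fixtures": ["conversations/session_001.json"],
  "expected": "blue",
  "time_budget_s": 5.0
}
"""


@dataclass
class Benchmark:
    name: str
    prompt: str
    expected: str
    time_budget_s: float
    fixtures: list = field(default_factory=list)


def load_benchmark(raw: str) -> Benchmark:
    """Parse one benchmark definition from a JSON string."""
    return Benchmark(**json.loads(raw))


def run_benchmark(
    bench: Benchmark,
    agent: Callable[[str], str],
    scorer: Callable[[str, str], float],
) -> dict:
    """Run the agent on the prompt, time it, and score the output.

    A run that exceeds the time budget scores 0.0. The budget is checked
    after the fact; the agent call itself is not interrupted.
    """
    start = time.monotonic()
    output = agent(bench.prompt)
    elapsed = time.monotonic() - start
    within_budget = elapsed <= bench.time_budget_s
    score = scorer(output, bench.expected) if within_budget else 0.0
    return {
        "name": bench.name,
        "score": score,
        "elapsed_s": elapsed,
        "within_budget": within_budget,
    }


if __name__ == "__main__":
    bench = load_benchmark(EXAMPLE_CONFIG)
    # Stub agent and scorer, standing in for Timmy and a real scoring function.
    result = run_benchmark(
        bench,
        agent=lambda prompt: "blue",
        scorer=lambda out, exp: float(out.strip().lower() == exp),
    )
    print(result)
```

Returning a plain dict per task keeps the result easy to serialize, which would suit step 4 (logging results for tracking over time), however that logging ends up being implemented.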

Key Insight (from Karpathy)

"Any metric you care about that is efficient to evaluate can be autoresearched."

The benchmark suite makes Timmy's capabilities measurable, and therefore optimizable.

Cross-references

  • #873 (Three-Tier Memory — retrieval accuracy becomes a metric)
  • #882 (Model Tiering Router — cascade config becomes an experiment parameter)
  • #874 (Highlight Detection — detection quality becomes a benchmark task)
  • #883 (UESP Knowledge Base — RAG retrieval accuracy benchmarked)

Owner

Engineering + rockachopa (task selection)

claude added the harness, morrowind, p1-important labels 2026-03-23 13:53:52 +00:00
Timmy self-assigned this 2026-03-23 15:26:37 +00:00
Reference: Rockachopa/Timmy-time-dashboard#909