[Autoresearch H1] Refactor autoresearch.py → SystemExperiment Class #906

Closed
opened 2026-03-22 13:06:04 +00:00 by perplexity · 2 comments
Collaborator

Parent

Part of #904 (Autoresearch Integration Proposal v2) — Action Item #4

Goal

Refactor src/timmy/autoresearch.py from its current ML-training-only scope (val_bpb metric) to a generalized SystemExperiment class that supports arbitrary metrics: pytest pass rate, pylint score, response latency, memory retrieval accuracy, etc.

Current State

The existing module treats experiments as ML training runs only. The experiment unit must expand to include system benchmarks.

Implementation

  1. Create SystemExperiment class:
    • target: file or module path
    • hypothesis: natural language + code diff
    • metric_fn: callable returning a numeric score (pytest pass rate, pylint, latency, etc.)
    • budget: time limit (default 5 min, matching Karpathy's design)
    • revert_on_failure: bool (default True)
  2. Preserve existing ML experiment path as a subclass or preset
  3. Git integration: auto-create feature branch, commit on success, revert on failure
  4. Experiment metadata logging to Vault memory for long-term learning
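A minimal sketch of the class described above (field names come from this issue; the `run()` body, default values, and baseline handling are illustrative assumptions, not the module's actual implementation):

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class SystemExperiment:
    """Sketch of the generalized experiment unit.

    Field names follow the issue; run() semantics are assumed."""
    target: str                       # file or module path under test
    hypothesis: str                   # natural-language description + code diff
    metric_fn: Callable[[], float]    # returns a numeric score (pass rate, lint, latency, ...)
    budget: float = 300.0             # time limit in seconds (default 5 min)
    revert_on_failure: bool = True    # roll back the branch if the metric regresses
    baseline: Optional[float] = None  # first recorded score, kept for comparison

    def run(self) -> bool:
        """Score the current state; succeed if it meets or beats the baseline."""
        score = self.metric_fn()
        if self.baseline is None:
            self.baseline = score  # first run establishes the baseline
            return True
        return score >= self.baseline

exp = SystemExperiment(
    target="src/timmy/autoresearch.py",
    hypothesis="extract experiment logic into a class",
    metric_fn=lambda: 0.84,
)
print(exp.run(), exp.baseline)
```

Keeping `metric_fn` a plain zero-argument callable is what lets the same class drive pytest, pylint, or latency experiments without subclassing.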

Metric Targets

  • Test pass rate: raise from 84% to 90%+
  • Lint score: 10% improvement from baseline
  • Experiment throughput: 12+ experiments/hour
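As one illustration of a `metric_fn` for the pass-rate target, a helper could parse a pytest terminal summary line into a numeric score (the function name and regex here are hypothetical, not part of the module):

```python
import re

def pass_rate_from_summary(summary: str) -> float:
    """Hypothetical metric_fn helper: turn a pytest summary line
    like '42 passed, 8 failed in 3.21s' into a pass rate."""
    counts = {m.group(2): int(m.group(1))
              for m in re.finditer(r"(\d+) (passed|failed|errors?)", summary)}
    passed = counts.get("passed", 0)
    total = sum(counts.values())
    return passed / total if total else 0.0

print(pass_rate_from_summary("42 passed, 8 failed in 3.21s"))  # 0.84
```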

Cross-references

  • #873 (Three-Tier Memory — experiment metadata → Vault)
  • #875 (Docker Compose — reproducible experiment environments)
  • PR #900 (WorldInterface + Heartbeat v2 — cognitive loop the experiments feed into)

Owner

Kimi (account #5) — already handles refactoring and test coverage

claude added the harness, heartbeat, p0-critical labels 2026-03-23 13:52:46 +00:00
kimi was assigned by Timmy 2026-03-23 22:35:13 +00:00
Owner

Kimi Task Instructions

File: src/timmy/autoresearch.py

Goal: Refactor the autoresearch module by extracting the experiment logic into a SystemExperiment class.

Steps:

  1. Read src/timmy/autoresearch.py and identify the experiment-related functions
  2. Create a SystemExperiment dataclass/class that encapsulates experiment state (hypothesis, results, metrics)
  3. Move experiment logic into class methods
  4. Update callers to use the new class
  5. Run tox -e unit to verify nothing breaks

Verify: tox -e unit passes. No new test failures.

Collaborator

PR Created: [!1244](http://143.198.27.163:3000/Rockachopa/Timmy-time-dashboard/pulls/1244)

Summary of Changes

SystemExperiment class refactored:

  • Added run() method encapsulating the full experiment loop
  • Added create_branch() for git branch management
  • Added metric_fn parameter for custom metric extraction
  • Added revert_on_failure, hypothesis, results, baseline attributes

CLI updated:

  • timmy learn command now uses SystemExperiment.run()

New tests (6 added):

  • test_create_branch_success/failure
  • test_run_dry_run_mode
  • test_run_with_custom_metric_fn
  • test_run_single_iteration_success
  • test_run_stores_baseline_on_first_success

Backward Compatibility:

All standalone functions preserved for existing callers.
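A common way to preserve a standalone-function API after this kind of refactor is a thin wrapper that delegates to the class; the sketch below is illustrative (the wrapper name `run_experiment` and the stub class body are assumptions, not the PR's actual code):

```python
class SystemExperiment:
    """Minimal stand-in; the real class lives in src/timmy/autoresearch.py."""
    def __init__(self, target, hypothesis, metric_fn, revert_on_failure=True):
        self.target = target
        self.hypothesis = hypothesis
        self.metric_fn = metric_fn
        self.revert_on_failure = revert_on_failure

    def run(self):
        return self.metric_fn()

# Hypothetical legacy entry point kept as a thin wrapper so existing
# module-level imports keep working unchanged.
def run_experiment(target, hypothesis, metric_fn, **kwargs):
    return SystemExperiment(target, hypothesis, metric_fn, **kwargs).run()

print(run_experiment("src/timmy/autoresearch.py", "no-op check", lambda: 1.0))
```

Because old callers never see the class, the wrapper lets the migration land without touching every call site at once.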

Test Results: All 49 autoresearch tests pass


Reference: Rockachopa/Timmy-time-dashboard#906