[Autoresearch H1] Refactor autoresearch.py → SystemExperiment Class #906

Closed
opened 2026-03-22 13:06:04 +00:00 by perplexity · 2 comments
Collaborator

Parent

Part of #904 (Autoresearch Integration Proposal v2) — Action Item #4

Goal

Refactor src/timmy/autoresearch.py from its current ML-training-only scope (val_bpb metric) to a generalized SystemExperiment class that supports arbitrary metrics: pytest pass rate, pylint score, response latency, memory retrieval accuracy, etc.

Current State

The existing module treats experiments as ML training runs only. The experiment unit must expand to include system benchmarks.

Implementation

  1. Create SystemExperiment class:
    • target: file or module path
    • hypothesis: natural language + code diff
    • metric_fn: callable returning a numeric score (pytest pass rate, pylint, latency, etc.)
    • budget: time limit (default 5 min, matching Karpathy's design)
    • revert_on_failure: bool (default True)
  2. Preserve existing ML experiment path as a subclass or preset
  3. Git integration: auto-create feature branch, commit on success, revert on failure
  4. Experiment metadata logging to Vault memory for long-term learning
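A minimal sketch of the class described above (field names come from this issue; the `run()` body, default values, and baseline handling are illustrative assumptions, not the module's actual implementation):

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class SystemExperiment:
    """Sketch of the generalized experiment unit.

    Field names follow the issue; run() semantics are assumed."""
    target: str                       # file or module path under test
    hypothesis: str                   # natural-language description + code diff
    metric_fn: Callable[[], float]    # returns a numeric score (pass rate, lint, latency, ...)
    budget: float = 300.0             # time limit in seconds (default 5 min)
    revert_on_failure: bool = True    # roll back the branch if the metric regresses
    baseline: Optional[float] = None  # first recorded score, kept for comparison

    def run(self) -> bool:
        """Score the current state; succeed if it meets or beats the baseline."""
        score = self.metric_fn()
        if self.baseline is None:
            self.baseline = score  # first run establishes the baseline
            return True
        return score >= self.baseline

exp = SystemExperiment(
    target="src/timmy/autoresearch.py",
    hypothesis="extract experiment logic into a class",
    metric_fn=lambda: 0.84,
)
print(exp.run(), exp.baseline)
```

Keeping `metric_fn` a plain zero-argument callable is what lets the same class drive pytest, pylint, or latency experiments without subclassing.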

Metric Targets

  • Test pass rate: raise from 84% to 90%+
  • Lint score: 10% improvement from baseline
  • Experiment throughput: 12+ experiments/hour
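As one illustration of a `metric_fn` for the pass-rate target, a helper could parse a pytest terminal summary line into a numeric score (the function name and regex here are hypothetical, not part of the module):

```python
import re

def pass_rate_from_summary(summary: str) -> float:
    """Hypothetical metric_fn helper: turn a pytest summary line
    like '42 passed, 8 failed in 3.21s' into a pass rate."""
    counts = {m.group(2): int(m.group(1))
              for m in re.finditer(r"(\d+) (passed|failed|errors?)", summary)}
    passed = counts.get("passed", 0)
    total = sum(counts.values())
    return passed / total if total else 0.0

print(pass_rate_from_summary("42 passed, 8 failed in 3.21s"))  # 0.84
```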

Cross-references

  • #873 (Three-Tier Memory — experiment metadata → Vault)
  • #875 (Docker Compose — reproducible experiment environments)
  • PR #900 (WorldInterface + Heartbeat v2 — cognitive loop the experiments feed into)

Owner

Kimi (account #5) — already handles refactoring and test coverage

claude added the harness, heartbeat, p0-critical labels 2026-03-23 13:52:46 +00:00
kimi was assigned by Timmy 2026-03-23 22:35:13 +00:00
Owner

Kimi Task Instructions

File: src/timmy/autoresearch.py

Goal: Refactor the autoresearch module by extracting the experiment logic into a SystemExperiment class.

Steps:

  1. Read src/timmy/autoresearch.py and identify the experiment-related functions
  2. Create a SystemExperiment dataclass/class that encapsulates experiment state (hypothesis, results, metrics)
  3. Move experiment logic into class methods
  4. Update callers to use the new class
  5. Run tox -e unit to verify nothing breaks

Verify: tox -e unit passes. No new test failures.

Collaborator

PR Created: [!1244](http://143.198.27.163:3000/Rockachopa/Timmy-time-dashboard/pulls/1244)

Summary of Changes

SystemExperiment class refactored:

  • Added run() method encapsulating the full experiment loop
  • Added create_branch() for git branch management
  • Added metric_fn parameter for custom metric extraction
  • Added revert_on_failure, hypothesis, results, baseline attributes

CLI updated:

  • timmy learn command now uses SystemExperiment.run()

New tests (6 added):

  • test_create_branch_success/failure
  • test_run_dry_run_mode
  • test_run_with_custom_metric_fn
  • test_run_single_iteration_success
  • test_run_stores_baseline_on_first_success

Backward Compatibility:

All standalone functions preserved for existing callers.
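A common way to preserve a standalone-function API after this kind of refactor is a thin wrapper that delegates to the class; the sketch below is illustrative (the wrapper name `run_experiment` and the stub class body are assumptions, not the PR's actual code):

```python
class SystemExperiment:
    """Minimal stand-in; the real class lives in src/timmy/autoresearch.py."""
    def __init__(self, target, hypothesis, metric_fn, revert_on_failure=True):
        self.target = target
        self.hypothesis = hypothesis
        self.metric_fn = metric_fn
        self.revert_on_failure = revert_on_failure

    def run(self):
        return self.metric_fn()

# Hypothetical legacy entry point kept as a thin wrapper so existing
# module-level imports keep working unchanged.
def run_experiment(target, hypothesis, metric_fn, **kwargs):
    return SystemExperiment(target, hypothesis, metric_fn, **kwargs).run()

print(run_experiment("src/timmy/autoresearch.py", "no-op check", lambda: 1.0))
```

Because old callers never see the class, the wrapper lets the migration land without touching every call site at once.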

Test Results: All 49 autoresearch tests pass


Reference: Rockachopa/Timmy-time-dashboard#906