[Study] Autoresearch Integration Proposal v2 — Karpathy's Self-Improvement Loop for Timmy Time #904

Closed
opened 2026-03-22 13:04:23 +00:00 by perplexity · 2 comments
Collaborator

Summary

19-page engineering proposal (v2, March 22 2026) for integrating Karpathy's autoresearch framework into Timmy Time's existing Gitea-based development workflow. Authored by Alexander Whitestone & Timmy Time. The core insight: Timmy's Gitea infrastructure (8 agent accounts, fork-based PRs, tox CI, Timmy auto-reviewing) already implements the governance layer autoresearch needs. This is not building new infrastructure — it's plugging autoresearch into what already works.

PDF attached — Autoresearch Integration Proposal v2 (19 pages)

What Is Autoresearch

Karpathy's 630-line Python framework: give an AI agent code + a metric + a fixed compute budget → agent experiments autonomously. Results:

  • Karpathy: Single H100, 2 days, ~700 experiments → 20 improvements, 11% training speedup
  • Shopify CEO (Lutke): overnight → 19% quality gain, smaller model beat 2x larger baseline
  • SkyPilot: 16 GPUs, 8 hours, ~910 experiments → 2.87% improvement
  • Community (Mac Mini M4): 26/35 failed, 7 succeeded → model improved by getting simpler
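The loop behind the numbers above can be sketched in a few lines (a minimal greedy hill-climbing sketch; the function names and the experiment-count budget are illustrative, not Karpathy's actual API):

```python
def autoresearch_loop(propose_change, run_experiment, baseline_score, budget):
    """Greedy autoresearch loop: spend a fixed budget of experiments,
    keep only changes that beat the best metric seen so far."""
    best_score = baseline_score
    accepted = []
    for _ in range(budget):                # fixed compute budget
        change = propose_change()          # e.g. an LLM-proposed code edit
        score = run_experiment(change)     # e.g. val_bpb, speedup, coverage
        if score > best_score:             # hill-climb on the single metric
            best_score = score
            accepted.append(change)
    return best_score, accepted
```

The shape matches the reported results: most proposals fail the metric gate and only a small fraction are kept (~700 experiments → 20 improvements in Karpathy's run).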

Three-Horizon Strategy

Horizon 1 — Codebase Self-Optimization (Weeks 1-3)

  • Create SystemExperiment class replacing ML-focused experiment abstraction
  • Wire Aider as execution engine for multi-file code edits
  • CLI entry point: python -m timmy.cli learn
  • Gitea integration: feature branch → tox → PR → Timmy auto-review
  • Assign to Kimi (account #5) — kimi-task backlog (~12 open) becomes initial experiment queue
  • Targets: 84% → 90%+ test coverage, 12+ experiments/hour, kimi-task backlog ~12 → <4
  • Cross-ref: #875 (Docker Compose), kimi-task backlog
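A minimal sketch of what the `SystemExperiment` unit might look like, wiring the Aider edit step to the tox gate (the fields and method are assumptions about the eventual design, not the final API):

```python
import subprocess
from dataclasses import dataclass


@dataclass
class SystemExperiment:
    """Hypothetical H1 experiment unit: a described code change, the
    branch it runs on, and a pass/fail gate before a PR is opened.
    (Field names are illustrative, not the proposal's final API.)"""
    description: str          # natural-language change for Aider to apply
    branch: str               # feature branch the experiment runs on
    metric_cmd: list[str]     # gate command, e.g. ["tox", "-e", "py311"]

    def run(self) -> bool:
        """Run the metric gate; True means the change is worth a PR.
        (The Aider edit step itself is stubbed out in this sketch.)"""
        result = subprocess.run(self.metric_cmd, capture_output=True)
        return result.returncode == 0
```

A passing `run()` would then feed the Gitea step: push the branch, open a PR, and let Timmy auto-review.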

Horizon 2 — Meta-Autoresearch / System Benchmarks (Weeks 4-8)

  • Experiment unit expands from code changes → system configuration changes (prompts, tools, memory strategies)
  • Timmy-Task Benchmark Suite: code gen, memory retrieval, tool use, multi-step reasoning, creative writing, Morrowind reasoning
  • Self-tuning loop: identify weakness → edit prompts/tools → benchmark → PR if improved
  • Cross-ref: #873 (Memory), #882 (Model Router), #874 (Highlights), #883 (UESP RAG)
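The benchmark suite and self-tuning loop could take roughly this shape: each task scores a model response, and the suite mean becomes the metric the loop optimizes (task names and scoring functions are placeholders):

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class BenchmarkTask:
    """One task in a hypothetical Timmy-Task suite (names illustrative)."""
    name: str
    prompt: str
    score: Callable[[str], float]   # grades a response, 0.0-1.0


def run_suite(tasks, generate):
    """Score a candidate configuration (represented by its `generate`
    function) across the suite; the mean feeds the self-tuning loop."""
    results = {t.name: t.score(generate(t.prompt)) for t in tasks}
    return sum(results.values()) / len(results), results
```

A prompt or tool change is then just another experiment: rerun the suite, open a PR only if the mean improves.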

Horizon 3 — Multi-Agent Research Swarm (Weeks 9-16)

  • Sub-agent personas map to Gitea accounts running parallel autoresearch loops:
    • Forge (kimi #5) → codebase optimization
    • Echo (hermes #4) → memory retrieval
    • Seer (manus #3) → inference quality / model routing
    • Quill (perplexity #7) → prompt quality / content
    • Mace (replit #6) → security / hardening
    • Helm (Timmy #2) → orchestration / approval
  • Metabolic protocol (Lightning balance) governs compute budget per agent
  • Cross-ref: #878 (TES3MP Bridge), #877 (Nostr Identity), #881 (Off-Grid Compute)
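The metabolic budget split could be as simple as a weighted allocation across the swarm (function name, sat denomination, and the example weights are illustrative, not the protocol's actual rules):

```python
def allocate_budget(total_sats, weights):
    """Split a shared compute budget across swarm agents, proportional
    to each persona's weight. Sketch of the 'metabolic protocol' idea:
    the Lightning balance caps total spend; integer division keeps the
    sum of shares at or under the balance."""
    total_weight = sum(weights.values())
    return {agent: total_sats * w // total_weight
            for agent, w in weights.items()}
```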

Identified Gaps (from Current State)

  1. Autoresearch scope too narrow — src/timmy/autoresearch.py only handles ML training (val_bpb)
  2. No meta-evaluation suite — Timmy can't benchmark prompts/tools against standardized tasks
  3. MCP gap — Tools tightly coupled to Agno; needs MCP client for community tool integration
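One incremental way to close gap #3: a neutral tool registry that Agno-bound tools and a future MCP client could both populate, so tools stop being coupled to one framework (all names hypothetical; this does not implement MCP's actual wire protocol):

```python
from typing import Any, Callable


class ToolRegistry:
    """Illustrative decoupling layer: tools register against a neutral
    interface instead of being bound to Agno directly, so an MCP client
    (e.g. a future src/timmy/mcp_client.py) can later feed the same
    registry. Names and shape are assumptions, not an existing API."""

    def __init__(self) -> None:
        self._tools: dict[str, Callable[..., Any]] = {}

    def register(self, name: str, fn: Callable[..., Any]) -> None:
        self._tools[name] = fn

    def call(self, name: str, **kwargs: Any) -> Any:
        if name not in self._tools:
            raise KeyError(f"unknown tool: {name}")
        return self._tools[name](**kwargs)
```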

Hardware Tiers

| Tier | Hardware | Use Case | Experiments/Hr |
|---|---|---|---|
| Local (Resting) | M3 Max + Ollama (8B) | H1 code experiments | ~6-8 |
| VPS (Active) | Hermes VPS + Ollama (70B) | H2 benchmarks | ~10-12 |
| Cloud (Burst) | Modal / RunPod / Groq | H3 parallel swarm | ~50-100+ |

Governance Rules (from KT document)

  • Never create test issues on the real repo
  • Morrowind is the direction (no pivots)
  • OpenGauss rejected — don't reference
  • No cloud dependencies (Ollama on localhost required)
  • Timmy is autonomous but guided; rockachopa holds override via program.md

Priority Alignment

Autoresearch plugs in after Docker Compose (#875) — needs reproducible environment first. Dependency chain: merge PRs #864/#865/#900 → TES3MP Bridge (#878) → TES3MP Server (#818) → Three-Tier Memory (#873) → Docker Compose (#875) → Autoresearch.

Immediate Action Items (logged as separate issues)

See child issues linked in comment below.

Labels

  • 222-epic — This is a multi-horizon epic spanning 16 weeks
  • needs-extraction — Action items extracted as individual issues

📎 PDF attached above — 19-page full proposal with architecture diagrams, agent mapping, and governance framework.

Child Issues (Work Suggestions from Section 14: Immediate Action Items)

  • #905: [Autoresearch H1] Clone karpathy/autoresearch & Run M3 Max Baseline
  • #906: [Autoresearch H1] Refactor autoresearch.py → SystemExperiment Class
  • #907: [Autoresearch H1] Implement python -m timmy.cli learn Entry Point
  • #908: [Autoresearch H1] Wire Gitea API for Automated PR Creation from Experiments
  • #909: [Autoresearch H2] Define Timmy-Task Benchmark Suite (5-10 Initial Tasks)
  • #910: [Autoresearch H2] Initialize MCP Client — src/timmy/mcp_client.py
  • #911: [Autoresearch] Document Experiment Governance Runbook

Dependency Order

These are ordered by dependency — each step enables the next:

  1. Merge PRs #864, #865, #900 (foundation) — already in progress
  2. Complete Docker Compose (#875) — prerequisite environment
  3. #905 — Clone & baseline (establishes what autoresearch can do on M3 Max)
  4. #906 — SystemExperiment class (core abstraction)
  5. #907 — CLI entry point (how to invoke the loop)
  6. #908 — Gitea PR automation (governance integration)
  7. #909 — Benchmark suite (H2 starts here)
  8. #910 — MCP client (tool integration)
  9. #911 — Governance runbook (documents all of the above)

Existing Ticket Acceleration Map

| Horizon | Ticket | How Autoresearch Helps |
|---|---|---|
| H1 | #875 | Docker Compose = reproducible experiment environments |
| H1 | kimi-tasks | Kimi autoresearch loop clears backlog autonomously |
| H2 | #873 | Memory retrieval accuracy becomes an autoresearch metric |
| H2 | #882 | Cascade config becomes an experiment parameter |
| H2 | #883 | RAG retrieval accuracy benchmarked and optimized |
| H2 | #874 | Detection quality becomes a benchmark task |
| H3 | #878 | Morrowind task benchmarks test bridge quality e2e |
| H3 | #877 | Agent discovery/benchmarking via NIP-89/NIP-90 |
| H3 | #881 | Long-running experiment loops on sovereign hardware |
gemini was assigned by Rockachopa 2026-03-22 23:32:36 +00:00
claude added the rejected-direction label 2026-03-23 13:51:20 +00:00

🧹 Closed — Superseded

Autoresearch direction has been superseded by the Sovereignty Loop spec (#953) and the harness-first architecture. The research pipeline now routes through the cascade router and local model stack.

Ref: Deep Backlog Triage #1076. Reopen if needed.

Reference: Rockachopa/Timmy-time-dashboard#904