[Study] Autoresearch Integration Proposal v2 — Karpathy's Self-Improvement Loop for Timmy Time #904

Closed
opened 2026-03-22 13:04:23 +00:00 by perplexity · 2 comments
Collaborator

Summary

19-page engineering proposal (v2, March 22 2026) for integrating Karpathy's autoresearch framework into Timmy Time's existing Gitea-based development workflow. Authored by Alexander Whitestone & Timmy Time. The core insight: Timmy's Gitea infrastructure (8 agent accounts, fork-based PRs, tox CI, Timmy auto-reviewing) already implements the governance layer autoresearch needs. This is not building new infrastructure — it's plugging autoresearch into what already works.

PDF attached — Autoresearch Integration Proposal v2 (19 pages)

What Is Autoresearch

Karpathy's 630-line Python framework: give an AI agent code + a metric + a fixed compute budget → agent experiments autonomously. Results:

  • Karpathy: Single H100, 2 days, ~700 experiments → 20 improvements, 11% training speedup
  • Shopify CEO (Lutke): overnight → 19% quality gain, smaller model beat 2x larger baseline
  • SkyPilot: 16 GPUs, 8 hours, ~910 experiments → 2.87% improvement
  • Community (Mac Mini M4): 26/35 failed, 7 succeeded → model improved by getting simpler
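The loop behind the numbers above can be sketched in a few lines (a minimal greedy hill-climbing sketch; the function names and the experiment-count budget are illustrative, not Karpathy's actual API):

```python
def autoresearch_loop(propose_change, run_experiment, baseline_score, budget):
    """Greedy autoresearch loop: spend a fixed budget of experiments,
    keep only changes that beat the best metric seen so far."""
    best_score = baseline_score
    accepted = []
    for _ in range(budget):                # fixed compute budget
        change = propose_change()          # e.g. an LLM-proposed code edit
        score = run_experiment(change)     # e.g. val_bpb, speedup, coverage
        if score > best_score:             # hill-climb on the single metric
            best_score = score
            accepted.append(change)
    return best_score, accepted
```

The shape matches the reported results: most proposals fail the metric gate and only a small fraction are kept (~700 experiments → 20 improvements in Karpathy's run).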

Three-Horizon Strategy

Horizon 1 — Codebase Self-Optimization (Weeks 1-3)

  • Create SystemExperiment class replacing ML-focused experiment abstraction
  • Wire Aider as execution engine for multi-file code edits
  • CLI entry point: python -m timmy.cli learn
  • Gitea integration: feature branch → tox → PR → Timmy auto-review
  • Assign to Kimi (account #5) — kimi-task backlog (~12 open) becomes initial experiment queue
  • Targets: 84% → 90%+ test coverage, 12+ experiments/hour, kimi-task backlog ~12 → <4
  • Cross-ref: #875 (Docker Compose), kimi-task backlog
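A minimal sketch of what the `SystemExperiment` unit might look like, wiring the Aider edit step to the tox gate (the fields and method are assumptions about the eventual design, not the final API):

```python
import subprocess
from dataclasses import dataclass


@dataclass
class SystemExperiment:
    """Hypothetical H1 experiment unit: a described code change, the
    branch it runs on, and a pass/fail gate before a PR is opened.
    (Field names are illustrative, not the proposal's final API.)"""
    description: str          # natural-language change for Aider to apply
    branch: str               # feature branch the experiment runs on
    metric_cmd: list[str]     # gate command, e.g. ["tox", "-e", "py311"]

    def run(self) -> bool:
        """Run the metric gate; True means the change is worth a PR.
        (The Aider edit step itself is stubbed out in this sketch.)"""
        result = subprocess.run(self.metric_cmd, capture_output=True)
        return result.returncode == 0
```

A passing `run()` would then feed the Gitea step: push the branch, open a PR, and let Timmy auto-review.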

Horizon 2 — Meta-Autoresearch / System Benchmarks (Weeks 4-8)

  • Experiment unit expands from code changes → system configuration changes (prompts, tools, memory strategies)
  • Timmy-Task Benchmark Suite: code gen, memory retrieval, tool use, multi-step reasoning, creative writing, Morrowind reasoning
  • Self-tuning loop: identify weakness → edit prompts/tools → benchmark → PR if improved
  • Cross-ref: #873 (Memory), #882 (Model Router), #874 (Highlights), #883 (UESP RAG)
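The benchmark suite and self-tuning loop could take roughly this shape: each task scores a model response, and the suite mean becomes the metric the loop optimizes (task names and scoring functions are placeholders):

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class BenchmarkTask:
    """One task in a hypothetical Timmy-Task suite (names illustrative)."""
    name: str
    prompt: str
    score: Callable[[str], float]   # grades a response, 0.0-1.0


def run_suite(tasks, generate):
    """Score a candidate configuration (represented by its `generate`
    function) across the suite; the mean feeds the self-tuning loop."""
    results = {t.name: t.score(generate(t.prompt)) for t in tasks}
    return sum(results.values()) / len(results), results
```

A prompt or tool change is then just another experiment: rerun the suite, open a PR only if the mean improves.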

Horizon 3 — Multi-Agent Research Swarm (Weeks 9-16)

  • Sub-agent personas map to Gitea accounts running parallel autoresearch loops:
    • Forge (kimi #5) → codebase optimization
    • Echo (hermes #4) → memory retrieval
    • Seer (manus #3) → inference quality / model routing
    • Quill (perplexity #7) → prompt quality / content
    • Mace (replit #6) → security / hardening
    • Helm (Timmy #2) → orchestration / approval
  • Metabolic protocol (Lightning balance) governs compute budget per agent
  • Cross-ref: #878 (TES3MP Bridge), #877 (Nostr Identity), #881 (Off-Grid Compute)
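The metabolic budget split could be as simple as a weighted allocation across the swarm (function name, sat denomination, and the example weights are illustrative, not the protocol's actual rules):

```python
def allocate_budget(total_sats, weights):
    """Split a shared compute budget across swarm agents, proportional
    to each persona's weight. Sketch of the 'metabolic protocol' idea:
    the Lightning balance caps total spend; integer division keeps the
    sum of shares at or under the balance."""
    total_weight = sum(weights.values())
    return {agent: total_sats * w // total_weight
            for agent, w in weights.items()}
```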

Identified Gaps (from Current State)

  1. Autoresearch scope too narrow — src/timmy/autoresearch.py only handles ML training (val_bpb)
  2. No meta-evaluation suite — Timmy can't benchmark prompts/tools against standardized tasks
  3. MCP gap — Tools tightly coupled to Agno; needs MCP client for community tool integration
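One incremental way to close gap #3: a neutral tool registry that Agno-bound tools and a future MCP client could both populate, so tools stop being coupled to one framework (all names hypothetical; this does not implement MCP's actual wire protocol):

```python
from typing import Any, Callable


class ToolRegistry:
    """Illustrative decoupling layer: tools register against a neutral
    interface instead of being bound to Agno directly, so an MCP client
    (e.g. a future src/timmy/mcp_client.py) can later feed the same
    registry. Names and shape are assumptions, not an existing API."""

    def __init__(self) -> None:
        self._tools: dict[str, Callable[..., Any]] = {}

    def register(self, name: str, fn: Callable[..., Any]) -> None:
        self._tools[name] = fn

    def call(self, name: str, **kwargs: Any) -> Any:
        if name not in self._tools:
            raise KeyError(f"unknown tool: {name}")
        return self._tools[name](**kwargs)
```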

Hardware Tiers

| Tier | Hardware | Use Case | Experiments/Hr |
|---|---|---|---|
| Local (Resting) | M3 Max + Ollama (8B) | H1 code experiments | ~6-8 |
| VPS (Active) | Hermes VPS + Ollama (70B) | H2 benchmarks | ~10-12 |
| Cloud (Burst) | Modal / RunPod / Groq | H3 parallel swarm | ~50-100+ |

Governance Rules (from KT document)

  • Never create test issues on the real repo
  • Morrowind is the direction (no pivots)
  • OpenGauss rejected — don't reference
  • No cloud dependencies (Ollama on localhost required)
  • Timmy is autonomous but guided; rockachopa holds override via program.md

Priority Alignment

Autoresearch plugs in after Docker Compose (#875) — needs reproducible environment first. Dependency chain: merge PRs #864/#865/#900 → TES3MP Bridge (#878) → TES3MP Server (#818) → Three-Tier Memory (#873) → Docker Compose (#875) → Autoresearch.

Immediate Action Items (logged as separate issues)

See child issues linked in comment below.

Labels

  • 222-epic — This is a multi-horizon epic spanning 16 weeks
  • needs-extraction — Action items extracted as individual issues

📎 PDF attached above — 19-page full proposal with architecture diagrams, agent mapping, and governance framework.

Child Issues (Work Suggestions from Section 14: Immediate Action Items)

  • #905: [Autoresearch H1] Clone karpathy/autoresearch & Run M3 Max Baseline
  • #906: [Autoresearch H1] Refactor autoresearch.py → SystemExperiment Class
  • #907: [Autoresearch H1] Implement python -m timmy.cli learn Entry Point
  • #908: [Autoresearch H1] Wire Gitea API for Automated PR Creation from Experiments
  • #909: [Autoresearch H2] Define Timmy-Task Benchmark Suite (5-10 Initial Tasks)
  • #910: [Autoresearch H2] Initialize MCP Client — src/timmy/mcp_client.py
  • #911: [Autoresearch] Document Experiment Governance Runbook

Dependency Order

These are ordered by dependency — each step enables the next:

  1. Merge PRs #864, #865, #900 (foundation) — already in progress
  2. Complete Docker Compose (#875) — prerequisite environment
  3. #905 — Clone & baseline (establishes what autoresearch can do on M3 Max)
  4. #906 — SystemExperiment class (core abstraction)
  5. #907 — CLI entry point (how to invoke the loop)
  6. #908 — Gitea PR automation (governance integration)
  7. #909 — Benchmark suite (H2 starts here)
  8. #910 — MCP client (tool integration)
  9. #911 — Governance runbook (documents all of the above)

Existing Ticket Acceleration Map

| Horizon | Ticket | How Autoresearch Helps |
|---|---|---|
| H1 | #875 | Docker Compose = reproducible experiment environments |
| H1 | kimi-tasks | Kimi autoresearch loop clears backlog autonomously |
| H2 | #873 | Memory retrieval accuracy becomes an autoresearch metric |
| H2 | #882 | Cascade config becomes an experiment parameter |
| H2 | #883 | RAG retrieval accuracy benchmarked and optimized |
| H2 | #874 | Detection quality becomes a benchmark task |
| H3 | #878 | Morrowind task benchmarks test bridge quality e2e |
| H3 | #877 | Agent discovery/benchmarking via NIP-89/NIP-90 |
| H3 | #881 | Long-running experiment loops on sovereign hardware |
gemini was assigned by Rockachopa 2026-03-22 23:32:36 +00:00
claude added the rejected-direction label 2026-03-23 13:51:20 +00:00

🧹 Closed — Superseded

Autoresearch direction has been superseded by the Sovereignty Loop spec (#953) and the harness-first architecture. The research pipeline now routes through the cascade router and local model stack.

Ref: Deep Backlog Triage #1076. Reopen if needed.

Reference: Rockachopa/Timmy-time-dashboard#904