From a927241dbe54889acffab678e7d32843495e7dbe Mon Sep 17 00:00:00 2001 From: Alexander Whitestone <8633216+AlexanderWhitestone@users.noreply.github.com> Date: Wed, 11 Mar 2026 08:11:26 -0400 Subject: [PATCH] polish: make repo presentable for employer review (#162) --- .env.example | 4 +- README.md | 7 +- docs/CHANGELOG_2025-02-27.md | 57 -- docs/IMPLEMENTATION_ARCHIVE.md | 326 --------- docs/IMPLEMENTATION_SUMMARY.md | 180 ----- docs/IMPLEMENTATION_TICKETS.md | 1209 -------------------------------- docs/PLAN_ARCHIVE.md | 478 ------------- docs/QUALITY_ANALYSIS.md | 306 -------- docs/QUALITY_ANALYSIS_v2.md | 245 ------- docs/QUALITY_REVIEW_REPORT.md | 232 ------ docs/index.html | 753 ++++++++------------ pyproject.toml | 3 + src/config.py | 6 +- 13 files changed, 300 insertions(+), 3506 deletions(-) delete mode 100644 docs/CHANGELOG_2025-02-27.md delete mode 100644 docs/IMPLEMENTATION_ARCHIVE.md delete mode 100644 docs/IMPLEMENTATION_SUMMARY.md delete mode 100644 docs/IMPLEMENTATION_TICKETS.md delete mode 100644 docs/PLAN_ARCHIVE.md delete mode 100644 docs/QUALITY_ANALYSIS.md delete mode 100644 docs/QUALITY_ANALYSIS_v2.md delete mode 100644 docs/QUALITY_REVIEW_REPORT.md diff --git a/.env.example b/.env.example index c8b346e5..79c84e1d 100644 --- a/.env.example +++ b/.env.example @@ -14,8 +14,8 @@ # In production (docker-compose.prod.yml), this is set to http://ollama:11434 automatically. 
# OLLAMA_URL=http://localhost:11434 -# LLM model to use via Ollama (default: llama3.2) -# OLLAMA_MODEL=llama3.2 +# LLM model to use via Ollama (default: qwen2.5:14b) +# OLLAMA_MODEL=qwen2.5:14b # Enable FastAPI interactive docs at /docs and /redoc (default: false) # DEBUG=true diff --git a/README.md b/README.md index 911b84ce..e2cf678d 100644 --- a/README.md +++ b/README.md @@ -1,6 +1,9 @@ # Timmy Time — Mission Control [![Tests](https://github.com/AlexanderWhitestone/Timmy-time-dashboard/actions/workflows/tests.yml/badge.svg)](https://github.com/AlexanderWhitestone/Timmy-time-dashboard/actions/workflows/tests.yml) +![Python](https://img.shields.io/badge/python-3.11+-blue) +![Coverage](https://img.shields.io/badge/coverage-73%25-brightgreen) +![License](https://img.shields.io/badge/license-MIT-green) A local-first, sovereign AI agent system. Talk to Timmy, watch his swarm, gate API access with Bitcoin Lightning — all from a browser, no cloud AI required. @@ -66,7 +69,7 @@ make help # see all commands | [CLAUDE.md](CLAUDE.md) | AI assistant development guide | | [AGENTS.md](AGENTS.md) | Multi-agent development standards | | [.env.example](.env.example) | Configuration reference | -| [docs/](docs/) | Architecture docs, ADRs, audits | +| [docs/](docs/) | Architecture, ADRs, security audit, roadmap | --- @@ -79,7 +82,7 @@ cp .env.example .env | Variable | Default | Purpose | |----------|---------|---------| | `OLLAMA_URL` | `http://localhost:11434` | Ollama host | -| `OLLAMA_MODEL` | `llama3.1:8b-instruct` | Model for tool calling. Use llama3.1:8b-instruct for reliable tool use; fallback to qwen2.5:14b | +| `OLLAMA_MODEL` | `qwen2.5:14b` | Primary model for reasoning and tool calling. 
Fallback: `llama3.1:8b-instruct` | | `DEBUG` | `false` | Enable `/docs` and `/redoc` | | `TIMMY_MODEL_BACKEND` | `ollama` | `ollama` \| `airllm` \| `auto` | | `AIRLLM_MODEL_SIZE` | `70b` | `8b` \| `70b` \| `405b` | diff --git a/docs/CHANGELOG_2025-02-27.md b/docs/CHANGELOG_2025-02-27.md deleted file mode 100644 index 3d94797c..00000000 --- a/docs/CHANGELOG_2025-02-27.md +++ /dev/null @@ -1,57 +0,0 @@ -# Changelog — 2025-02-27 - -## Model Upgrade & Hallucination Fix - -### Change 1: Model Upgrade (Primary Fix) -**Problem:** llama3.2 (3B parameters) consistently hallucinated tool output instead of waiting for real results. - -**Solution:** Upgraded default model to `llama3.1:8b-instruct` which is specifically fine-tuned for reliable tool/function calling. - -**Changes:** -- `src/config.py`: Changed `ollama_model` default from `llama3.2` to `llama3.1:8b-instruct` -- Added fallback logic: if primary model unavailable, auto-fallback to `qwen2.5:14b` -- `README.md`: Updated setup instructions with new model requirement - -**User Action Required:** -```bash -ollama pull llama3.1:8b-instruct -``` - -### Change 2: Structured Output Enforcement (Foundation) -**Preparation:** Added infrastructure for two-phase tool calling with JSON schema enforcement. - -**Implementation:** -- Session context tracking in `TimmyOrchestrator` -- `_session_init()` runs on first message to load real data - -### Change 3: Git Tool Working Directory Fix -**Problem:** Git tools failed with "fatal: Not a git repository" due to wrong working directory. - -**Solution:** -- Rewrote `src/tools/git_tools.py` to use subprocess with explicit `cwd=REPO_ROOT` -- Added `REPO_ROOT` module-level constant auto-detected at import time -- All git commands now run from the correct directory - -### Change 4: Session Init with Git Log -**Problem:** Timmy couldn't answer "what's new?" from real data. 
- -**Solution:** -- `_session_init()` now reads `git log --oneline -15` from repo root on first message -- Recent commits prepended to system prompt -- Timmy now grounds self-description in actual commit history - -### Change 5: Documentation Updates -- `README.md`: Updated Quickstart with new model requirement -- `README.md`: Configuration table reflects new default model -- Added notes explaining why llama3.1:8b-instruct is required - -### Files Modified -- `src/config.py` — Model configuration with fallback -- `src/tools/git_tools.py` — Complete rewrite with subprocess + cwd -- `src/agents/timmy.py` — Session init with git log reading -- `README.md` — Updated setup and configuration docs - -### Testing -- All git tool tests pass with new subprocess implementation -- Git log correctly returns commits from repo root -- Session init loads context on first message diff --git a/docs/IMPLEMENTATION_ARCHIVE.md b/docs/IMPLEMENTATION_ARCHIVE.md deleted file mode 100644 index a18af2b3..00000000 --- a/docs/IMPLEMENTATION_ARCHIVE.md +++ /dev/null @@ -1,326 +0,0 @@ -# Timmy Time — Implementation Summary - -**Date:** 2026-02-25 -**Phase:** 1, 2 Complete (MCP, Event Bus, Agents) -**Status:** ✅ Ready for Phase 3 (Cascade Router) - ---- - -## What Was Built - -### 1. 
MCP (Model Context Protocol) ✅ - -**Location:** `src/mcp/` - -| Component | Purpose | Status | -|-----------|---------|--------| -| Registry | Tool catalog with health tracking | ✅ Complete | -| Server | MCP protocol implementation | ✅ Complete | -| Schemas | JSON schema utilities | ✅ Complete | -| Bootstrap | Auto-load all tools | ✅ Complete | - -**Features:** -- 6 tools registered with full schemas -- Health tracking (healthy/degraded/unhealthy) -- Metrics collection (latency, error rates) -- Pattern-based discovery -- `@register_tool` decorator - -**Tools Implemented:** -```python -web_search # DuckDuckGo search -read_file # File reading -write_file # File writing (with confirmation) -list_directory # Directory listing -python # Python execution -memory_search # Vector memory search -``` - -### 2. Event Bus ✅ - -**Location:** `src/events/bus.py` - -**Features:** -- Async publish/subscribe -- Wildcard pattern matching (`agent.task.*`) -- Event history (last 1000 events) -- Concurrent handler execution -- System-wide singleton - -**Usage:** -```python -from events.bus import event_bus, Event - -@event_bus.subscribe("agent.task.*") -async def handle_task(event): - print(f"Task: {event.data}") - -await event_bus.publish(Event( - type="agent.task.assigned", - source="timmy", - data={"task_id": "123"} -)) -``` - -### 3. 
Sub-Agents ✅ - -**Location:** `src/agents/` - -| Agent | ID | Role | Key Tools | -|-------|-----|------|-----------| -| Seer | seer | Research | web_search, read_file, memory_search | -| Forge | forge | Code | python, write_file, read_file | -| Quill | quill | Writing | write_file, read_file, memory_search | -| Echo | echo | Memory | memory_search, read_file, write_file | -| Helm | helm | Routing | memory_search | -| Timmy | timmy | Orchestrator | All tools | - -**BaseAgent Features:** -- Agno Agent integration -- MCP tool registry access -- Event bus connectivity -- Structured logging -- Task execution framework - -**Orchestrator Logic:** -```python -timmy = create_timmy_swarm() - -# Automatic routing: -# - Simple questions → Direct response -# - "Remember..." → Echo agent -# - Complex tasks → Helm routes to specialist -``` - -### 4. Memory System (Previously Complete) ✅ - -**Three-Tier Architecture:** - -``` -Tier 1: Hot Memory (MEMORY.md) - ↓ Always loaded - -Tier 2: Vault (memory/) - ├── self/identity.md - ├── self/user_profile.md - ├── self/methodology.md - ├── notes/*.md - └── aar/*.md - -Tier 3: Semantic Search - └── Vector embeddings over vault -``` - -**Handoff Protocol:** -- `last-session-handoff.md` written at session end -- Auto-loaded at next session start - ---- - -## Architecture Diagram - -``` -┌─────────────────────────────────────────────────────────────┐ -│ USER INTERFACE │ -│ (Dashboard/CLI) │ -└──────────────────────────┬──────────────────────────────────┘ - │ -┌──────────────────────────▼──────────────────────────────────┐ -│ TIMMY ORCHESTRATOR │ -│ │ -│ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │ -│ │ Request │ │ Router │ │ Response │ │ -│ │ Analysis │→ │ (Helm) │→ │ Synthesis │ │ -│ └─────────────┘ └─────────────┘ └─────────────┘ │ -│ │ -└──────────────────────────┬──────────────────────────────────┘ - │ - ┌──────────────────┼──────────────────┐ - │ │ │ -┌───────▼──────┐ ┌───────▼──────┐ ┌───────▼──────┐ -│ Seer │ │ Forge │ │ Quill │ 
-│ (Research) │ │ (Code) │ │ (Writing) │ -└──────────────┘ └──────────────┘ └──────────────┘ - │ -┌───────▼──────┐ ┌───────▼──────┐ -│ Echo │ │ Helm │ -│ (Memory) │ │ (Routing) │ -└──────────────┘ └──────────────┘ - │ - ▼ -┌─────────────────────────────────────────────────────────────┐ -│ MCP TOOL REGISTRY │ -│ │ -│ web_search read_file write_file list_directory │ -│ python memory_search │ -│ │ -└──────────────────────────┬──────────────────────────────────┘ - │ -┌──────────────────────────▼──────────────────────────────────┐ -│ EVENT BUS │ -│ (Async pub/sub, wildcard patterns) │ -└──────────────────────────┬──────────────────────────────────┘ - │ -┌──────────────────────────▼──────────────────────────────────┐ -│ MEMORY SYSTEM │ -│ │ -│ ┌──────────┐ ┌──────────┐ ┌──────────┐ │ -│ │ Hot │ │ Vault │ │ Semantic │ │ -│ │ MEMORY │ │ Files │ │ Search │ │ -│ └──────────┘ └──────────┘ └──────────┘ │ -│ │ -└─────────────────────────────────────────────────────────────┘ -``` - ---- - -## Testing Results - -``` -All 973 tests pass ✅ - -Manual verification: -- MCP Bootstrap: ✅ 6 tools loaded -- Tool Registry: ✅ web_search, file_ops, etc. 
-- Event Bus: ✅ Events published/subscribed -- Agent Imports: ✅ All agents loadable -``` - ---- - -## Files Created - -``` -src/ -├── mcp/ -│ ├── __init__.py -│ ├── bootstrap.py # Auto-load tools -│ ├── registry.py # Tool catalog -│ ├── server.py # MCP protocol -│ └── schemas/ -│ └── base.py # Schema utilities -│ -├── tools/ -│ ├── web_search.py # DuckDuckGo search -│ ├── file_ops.py # File operations -│ ├── code_exec.py # Python execution -│ └── memory_tool.py # Memory search -│ -├── events/ -│ └── bus.py # Event bus -│ -└── agents/ - ├── __init__.py - ├── base.py # Base agent class - ├── timmy.py # Orchestrator - ├── seer.py # Research - ├── forge.py # Code - ├── quill.py # Writing - ├── echo.py # Memory - └── helm.py # Routing - -MEMORY.md # Hot memory -memory/ # Vault structure -``` - ---- - -## Usage Example - -```python -from agents import create_timmy_swarm - -# Create fully configured Timmy -timmy = create_timmy_swarm() - -# Simple chat (handles directly) -response = await timmy.orchestrate("What is your name?") - -# Research (routes to Seer) -response = await timmy.orchestrate("Search for Bitcoin news") - -# Code (routes to Forge) -response = await timmy.orchestrate("Write a Python script to...") - -# Memory (routes to Echo) -response = await timmy.orchestrate("What did we discuss yesterday?") -``` - ---- - -## Next: Phase 3 (Cascade Router) - -To complete the brief, implement: - -### 1. Cascade LLM Router -```yaml -# config/providers.yaml -providers: - - name: ollama-local - type: ollama - url: http://localhost:11434 - priority: 1 - models: [llama3.2, deepseek-r1] - - - name: openai-backup - type: openai - api_key: ${OPENAI_API_KEY} - priority: 2 - models: [gpt-4o-mini] -``` - -Features: -- Priority-ordered fallback -- Latency/error tracking -- Cost accounting -- Health checks - -### 2. Self-Upgrade Loop -- Detect failures from logs -- Propose fixes via Forge -- Present to user for approval -- Apply changes with rollback - -### 3. 
Dashboard Integration -- Tool registry browser -- Agent activity feed -- Memory browser -- Upgrade queue - ---- - -## Success Criteria Status - -| Criteria | Status | -|----------|--------| -| Start with `python main.py` | 🟡 Need entry point | -| Dashboard at localhost | ✅ Exists | -| Timmy responds to questions | ✅ Working | -| Routes to sub-agents | ✅ Implemented | -| MCP tool discovery | ✅ Working | -| LLM failover | 🟡 Phase 3 | -| Search memory | ✅ Working | -| Self-upgrade proposals | 🟡 Phase 3 | -| Lightning payments | ✅ Mock exists | - ---- - -## Key Achievements - -1. ✅ **MCP Protocol** — Full implementation with schemas, registry, server -2. ✅ **6 Production Tools** — All with error handling and health tracking -3. ✅ **Event Bus** — Async pub/sub for agent communication -4. ✅ **6 Agents** — Full roster with specialized roles -5. ✅ **Orchestrator** — Intelligent routing logic -6. ✅ **Memory System** — Three-tier architecture -7. ✅ **All Tests Pass** — No regressions - ---- - -## Ready for Phase 3 - -The foundation is solid. Next steps: -1. Cascade Router for LLM failover -2. Self-upgrade loop -3. Enhanced dashboard views -4. Production hardening diff --git a/docs/IMPLEMENTATION_SUMMARY.md b/docs/IMPLEMENTATION_SUMMARY.md deleted file mode 100644 index 296f1484..00000000 --- a/docs/IMPLEMENTATION_SUMMARY.md +++ /dev/null @@ -1,180 +0,0 @@ -# Implementation Summary: 3 New Features - -## Completed Features - -### 1. 
Cascade Router Integration ✅ - -**Files Created:** -- `src/timmy/cascade_adapter.py` - Adapter between Timmy and Cascade Router -- `src/dashboard/routes/router.py` - Dashboard routes for router status -- `src/dashboard/templates/router_status.html` - Router status UI - -**Files Modified:** -- `src/dashboard/app.py` - Registered router routes -- `src/dashboard/templates/base.html` - Added ROUTER nav link - -**Usage:** -```python -from timmy.cascade_adapter import get_cascade_adapter -adapter = get_cascade_adapter() -response = await adapter.chat("Hello") -print(f"Response: {response.content}") -print(f"Provider: {response.provider_used}") -``` - -**Dashboard:** `/router/status` - ---- - -### 2. Self-Upgrade Approval Queue *(originally implemented, since refactored)* - -> **Note:** The original `src/upgrades/` module was planned for consolidation into -> `src/self_coding/` but that consolidation was never completed. The module paths -> below reflect the original implementation and may no longer be accurate. - -**Original files:** -- `src/upgrades/models.py` - Database models for upgrades table -- `src/upgrades/queue.py` - Queue management logic -- `src/dashboard/routes/upgrades.py` - Dashboard routes -- `src/dashboard/templates/upgrade_queue.html` - Queue UI - -**Dashboard:** `/self-modify/queue` - ---- - -### 3. Real-Time Activity Feed *(originally implemented, since refactored)* - -> **Note:** The original module paths below reflect the pre-refactoring structure. -> Events are now under `src/infrastructure/events/`, WebSocket manager is now under -> `src/infrastructure/ws_manager/`. 
- -**Original files:** -- `src/events/broadcaster.py` → now `src/infrastructure/events/broadcaster.py` -- `src/ws_manager/handler.py` → now `src/infrastructure/ws_manager/handler.py` - -**Architecture:** -``` -Event Occurs → log_event() → SQLite - ↓ - event_broadcaster.broadcast_sync() - ↓ - ws_manager.broadcast_json() - ↓ - Dashboard (WebSocket) -``` - -**Dashboard:** `/swarm/live` (activity feed panel) - ---- - -## Test Results - -**Unit Tests:** 101 passed -``` -tests/test_event_log.py 25 passed -tests/test_ledger.py 18 passed -tests/test_vector_store.py 11 passed -tests/test_swarm.py 29 passed -tests/test_dashboard.py 18 passed -``` - -**E2E Tests:** Created (3 new test files) -- `tests/functional/test_cascade_router_e2e.py` -- `tests/functional/test_upgrade_queue_e2e.py` -- `tests/functional/test_activity_feed_e2e.py` - ---- - -## Running E2E Tests (Non-Headless) - -Watch the browser execute tests in real-time: - -```bash -# 1. Start the server -cd /Users/apayne/Timmy-time-dashboard -source .venv/bin/activate -make dev - -# 2. In another terminal, run E2E tests -source .venv/bin/activate -SELENIUM_UI=1 pytest tests/functional/test_cascade_router_e2e.py -v --headed - -# Or run all E2E tests -SELENIUM_UI=1 pytest tests/functional/ -v --headed -``` - -The `--headed` flag runs Chrome in visible mode so you can watch. 
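The activity-feed chain shown above (event occurs → `log_event()` → SQLite → broadcast → dashboard WebSocket) can be sketched in miniature. This is a hypothetical illustration — the class name, schema, and subscriber mechanism here are stand-ins, not the project's real `event_broadcaster`/`ws_manager` APIs:

```python
import json
import sqlite3


class EventBroadcaster:
    """Persist an event to SQLite, then fan it out to live subscribers.

    Hypothetical sketch of the log_event -> SQLite -> broadcast chain;
    names and signatures are illustrative only.
    """

    def __init__(self, db_path=":memory:"):
        self.conn = sqlite3.connect(db_path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS event_log (type TEXT, payload TEXT)"
        )
        self.subscribers = []  # stand-ins for WebSocket connections

    def log_event(self, event_type, payload):
        # 1. Persist first, so the feed survives restarts
        self.conn.execute(
            "INSERT INTO event_log VALUES (?, ?)",
            (event_type, json.dumps(payload)),
        )
        self.conn.commit()
        # 2. Then broadcast to every connected dashboard client
        message = json.dumps({"type": event_type, "data": payload})
        for send in self.subscribers:
            send(message)


broadcaster = EventBroadcaster()
received = []
broadcaster.subscribers.append(received.append)
broadcaster.log_event("agent.task.done", {"task_id": "123"})
```

Persisting before broadcasting matters: a dashboard that connects late can still backfill its feed from the `event_log` table.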
- ---- - -## Database Schema Updates - -Three new tables created automatically: - -```sql --- Event Log (existing, now with broadcast) -CREATE TABLE event_log (...); - --- Lightning Ledger (existing) -CREATE TABLE ledger (...); - --- Vector Store (existing) -CREATE TABLE memory_entries (...); - --- NEW: Upgrade Queue -CREATE TABLE upgrades ( - id TEXT PRIMARY KEY, - status TEXT NOT NULL, - proposed_at TEXT NOT NULL, - branch_name TEXT NOT NULL, - description TEXT NOT NULL, - files_changed TEXT, - diff_preview TEXT, - test_passed INTEGER DEFAULT 0, - test_output TEXT, - error_message TEXT, - approved_by TEXT -); -``` - ---- - -## Navigation Updates - -New nav links in dashboard header: -- **EVENTS** → `/swarm/events` -- **LEDGER** → `/lightning/ledger` -- **MEMORY** → `/memory` -- **ROUTER** → `/router/status` -- **UPGRADES** → `/self-modify/queue` - ---- - -## Architecture Alignment - -All 3 features follow existing patterns: -- **Singleton pattern** for services (cascade_adapter, event_broadcaster) -- **SQLite persistence** through consistent DB access pattern -- **Dashboard routes** following existing route structure -- **Jinja2 templates** extending base.html -- **Event-driven** using existing event log infrastructure -- **WebSocket** using existing ws_manager - ---- - -## Security Considerations - -| Feature | Risk | Mitigation | -|---------|------|------------| -| Cascade Router | API key exposure | Uses existing config system | -| Upgrade Queue | Unauthorized changes | Human approval required | -| Activity Feed | Data leak | Events sanitized before broadcast | - ---- - -## Next Steps - -1. Run E2E tests with `SELENIUM_UI=1 pytest tests/functional/ -v --headed` -2. Manually test each dashboard page -3. Verify WebSocket real-time updates in `/swarm/live` -4. 
Test upgrade queue workflow end-to-end diff --git a/docs/IMPLEMENTATION_TICKETS.md b/docs/IMPLEMENTATION_TICKETS.md deleted file mode 100644 index 15373668..00000000 --- a/docs/IMPLEMENTATION_TICKETS.md +++ /dev/null @@ -1,1209 +0,0 @@ -# Implementation Tickets: Sovereign AGI Evolution - -**Source:** Perplexity AI synthesis of research report + 5 peer reviews -**Date:** 2026-03-08 -**Usage:** Drop each Claude Code Prompt into a fresh Claude Code session, one at a time, in order. - ---- - -## Ticket 1: Add WAL mode for all SQLite databases ✅ COMPLETED - -**Priority:** Tier 1 -**Estimated scope:** S -**Dependencies:** none -**Status:** DONE (2026-03-08) — WAL + busy_timeout=5000 added to brain/memory.py, swarm/event_log.py, spark/memory.py, spark/eidos.py, swarm/task_queue/models.py, infrastructure/models/registry.py. 8 new tests across 4 files. - -### Objective - -Ensure all SQLite databases use WAL mode and reasonable busy timeouts to improve write concurrency and reduce lock contention as agent activity scales. - -### Claude Code Prompt - -``` -You are working in the Timmy Time Dashboard repo. First, read CLAUDE.md and AGENTS.md at the repo root to understand conventions (pydantic settings, graceful degradation, thin routes, and configuration patterns). Then open brain/memory.py and swarm/event_log.py, and any shared DB initialization helpers under data/ or swarm/ that create SQLite connections. Implement a standard helper or pattern that, immediately after establishing any SQLite connection, sets `PRAGMA journal_mode=WAL` and `PRAGMA busy_timeout=5000` (or the project's existing timeout if one exists). Ensure this is done consistently for all known DBs: data/brain.db (UnifiedMemory), data/events.db or swarm event log DB, timmy.db (Agno history), and any Spark-related SQLite files referenced in the codebase. 
Follow existing error-handling and logging style, and avoid introducing new top-level modules; prefer small helpers colocated with existing connection code or a shared utility if one already exists. Make the behavior safe in production and tests (no hard-coded absolute paths, reuse existing config where possible). Update or add minimal tests to verify that new connections end up with WAL mode and a non-zero busy timeout, using the project's testing layout and any existing database tests as examples. Run make test to verify. -``` - -### Acceptance Criteria - -- [ ] All SQLite connection creation paths configure WAL mode and a non-zero busy timeout -- [ ] UnifiedMemory DB, swarm event log DB, Agno DB, and Spark's DBs all use the shared pattern -- [ ] New logic follows existing logging and error-handling conventions and does not introduce import-time side effects -- [ ] Tests or updated tests confirm the PRAGMA settings for at least one representative database -- [ ] Tests pass (`make test`) - ---- - -## Ticket 2: Introduce lazy initialization for critical singletons ✅ COMPLETED - -**Priority:** Tier 1 -**Estimated scope:** M -**Dependencies:** 1 -**Status:** DONE (2026-03-08) — config.py startup validation moved to validate_startup(). spark_engine, memory_system, event_bus all use lazy getters with __getattr__ backward compat. 15 new tests, 953 passing. -**Files to modify:** `spark/engine.py`, `config.py`, `infrastructure/events/bus.py`, `timmy/memory_system.py`, `infrastructure/router/cascade.py` -**Files to read first:** `CLAUDE.md`, `AGENTS.md`, `spark/engine.py`, `config.py`, `infrastructure/events/bus.py`, `timmy/memory_system.py`, `infrastructure/router/cascade.py` - -### Objective - -Replace import-time singleton instantiation with lazy, explicit getters so tests can import modules without side effects and services initialize only when needed. - -### Claude Code Prompt - -``` -You are working in the Timmy Time Dashboard repo. 
First read CLAUDE.md and AGENTS.md to align with project conventions around config, graceful degradation, and thin abstractions. Then inspect spark/engine.py, config.py, infrastructure/events/bus.py, timmy/memory_system.py, and infrastructure/router/cascade.py for module-level singletons such as spark_engine, settings validations, event_bus, memory_system, and cascade_router that are created at import time. Refactor these modules to use lazy accessors in the pattern already used by brain/memory.py's get_memory(): introduce private module-level variables (e.g., _spark_engine) and public getter functions (e.g., get_spark_engine()) that initialize the underlying object on first access and cache it. Ensure that production-only validation and sys.exit logic in config.py only runs via explicit initialization paths, not at bare import; keep the pydantic-settings pattern and environment cascade intact. Preserve existing public import ergonomics by updating call sites to use the new getters where appropriate, or by exposing lightweight proxy objects if required to avoid large rewrites. Add or update tests to confirm that importing these modules does not attempt DB connections, read .env files, or exit the process, and that repeated calls to the getters return stable instances and can be reset in tests when needed. Run make test to verify. 
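# For reference only — the lazy-getter shape being asked for (hypothetical
# names; mirror brain/memory.py's get_memory() rather than this sketch):
_spark_engine = None

def get_spark_engine():
    """Create the engine on first access and cache it; imports stay side-effect free."""
    global _spark_engine
    if _spark_engine is None:
        _spark_engine = object()  # stand-in for the real SparkEngine()
    return _spark_engine

def reset_spark_engine_for_tests():
    """Allow tests to drop the cached instance between cases."""
    global _spark_engine
    _spark_engine = None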
-``` - -### Acceptance Criteria - -- [ ] `spark/engine.py`, `config.py`, `infrastructure/events/bus.py`, `timmy/memory_system.py`, and cascade router code no longer perform heavy work at import time -- [ ] New getter functions provide lazy initialization while preserving behavior -- [ ] Config validation and exits only happen through explicit initialization paths, not imports -- [ ] Tests demonstrate that imports have no side effects and that lazy initialization works and can be reset in tests -- [ ] Tests pass (`make test`) - ---- - -## Ticket 3: Unify EventBus with swarm event log persistence ✅ COMPLETED - -**Priority:** Tier 1 -**Estimated scope:** M -**Dependencies:** 1, 2 -**Status:** DONE (2026-03-08) — EventBus gains enable_persistence() + replay(). log_event() bridges to EventBus. App startup enables persistence. 10 new tests, 308 passing. - -### Objective - -Make the in-memory EventBus the single event interface while transparently persisting events via the swarm event log, enabling durable history and future replay. - -### Claude Code Prompt - -``` -You are working in the Timmy Time Dashboard repo. Start by reading CLAUDE.md and AGENTS.md to understand event patterns and conventions. Then open infrastructure/events/bus.py and swarm/event_log.py. Refactor these so that there is a single, canonical EventBus abstraction that keeps its current async publish/subscribe behavior but writes all events through to the swarm event log persistence layer. Where swarm/event_log.py currently writes events directly, consolidate that logic into the EventBus implementation (e.g., via a persistence adapter or by having EventBus depend on a logger object), avoiding circular imports by using dependency injection or late imports if needed. Add a simple replay API to EventBus (e.g., replay(filters...)) that delegates to the persisted store to load past events, keeping the interface minimal but usable for debugging and future deterministic replay. 
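# For reference only — publish-through-persistence plus a replay API, in
# miniature (hypothetical; the real bus delegates to swarm/event_log.py):
class EventBus:
    def __init__(self, store):
        self.store = store       # injected persistence adapter (avoids circular imports)
        self.handlers = []

    def publish(self, event):
        self.store.append(event)          # write-through persistence
        for handler in self.handlers:
            handler(event)

    def replay(self, event_type=None):
        # Load past events from the persisted store, optionally filtered
        return [e for e in self.store
                if event_type is None or e["type"] == event_type]

bus = EventBus(store=[])
bus.publish({"type": "agent.task.done", "id": 1})
bus.publish({"type": "agent.task.start", "id": 2})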
Ensure that any existing callers of swarm event_log.py continue to function but are internally routed through the unified abstraction, and that in-memory history limits remain respected while persistence is complete. Add or update tests to cover: publishing an event persists it, replay returns expected events, and system behavior under restart is consistent with at-least-once persistence semantics. Run make test to verify. -``` - -### Acceptance Criteria - -- [ ] There is a single EventBus abstraction used for event publishing and subscription -- [ ] All published events are persisted via the swarm event log backend -- [ ] A basic replay API exists on EventBus that uses persisted events -- [ ] Existing users of `swarm/event_log.py` continue to work without breaking changes -- [ ] Tests cover persistence and basic replay behavior -- [ ] Tests pass (`make test`) - ---- - -## Ticket 4: Consolidate MEMORY.md hot memory into UnifiedMemory - -**Priority:** Tier 1 -**Estimated scope:** M -**Dependencies:** 1, 2, 3 -**Files to modify:** `timmy/memory_system.py`, `brain/memory.py`, any references to `MEMORY.md` in the codebase -**Files to read first:** `CLAUDE.md`, `AGENTS.md`, `timmy/memory_system.py`, `brain/memory.py`, `MEMORY.md` - -### Objective - -Make UnifiedMemory the single source of truth by representing HotMemory (MEMORY.md) as a tier or partition inside brain/memory.py, while preserving the existing hot/vault semantics. - -### Claude Code Prompt - -``` -You are working in the Timmy Time Dashboard repo. Read CLAUDE.md and AGENTS.md first to align with memory conventions. Then open timmy/memory_system.py, brain/memory.py, and the MEMORY.md file at the repo root. 
Refactor the memory design so that timmy/memory_system.py becomes a thin orchestration layer on top of brain/UnifiedMemory: represent HotMemory as a dedicated table, partition, or logical namespace within UnifiedMemory, and implement read/write operations that keep the semantics of "always-loaded context" but store data in brain.db rather than only in MEMORY.md. Preserve Vault memory as markdown files, but add indexing into UnifiedMemory for searchability (e.g., import vault note content and tags into a semantic or metadata table while leaving the files on disk). Wire the HandoffProtocol so that handoffs are serialized into UnifiedMemory with a tag or type field (e.g., "handoff") so they become discoverable and durable. Keep file formats backward compatible and avoid breaking existing routes that expect MEMORY.md content by providing migration or read-through behavior (e.g., syncing the initial contents into the new backend). Add or update tests to verify that hot memory read/writes now go through UnifiedMemory, that vault notes are searchable via brain/memory.py, and that handoff entries are persisted. Run make test to verify. 
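# For reference only — one possible shape for the hot-memory namespace
# (hypothetical schema; adapt to UnifiedMemory's existing tables):
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for brain.db
conn.execute(
    "CREATE TABLE IF NOT EXISTS hot_memory ("
    " key TEXT PRIMARY KEY, content TEXT, updated_at TEXT)"
)

def write_hot(key, content):
    # Upsert keeps one current value per hot-memory key
    conn.execute(
        "INSERT INTO hot_memory(key, content, updated_at)"
        " VALUES (?, ?, datetime('now'))"
        " ON CONFLICT(key) DO UPDATE SET content=excluded.content,"
        " updated_at=excluded.updated_at",
        (key, content),
    )

def read_hot():
    # "Always-loaded context": return the entire hot tier
    return dict(conn.execute("SELECT key, content FROM hot_memory"))

write_hot("identity", "I am Timmy.")
write_hot("identity", "I am Timmy, the orchestrator.")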
-``` - -### Acceptance Criteria - -- [ ] UnifiedMemory in `brain/memory.py` becomes the canonical store for hot memory -- [ ] Vault markdown files remain on disk but are indexed into UnifiedMemory for search -- [ ] HandoffProtocol data is stored and tagged in UnifiedMemory -- [ ] Existing callers of `timmy/memory_system.py` continue to function -- [ ] Tests validate hot memory, vault indexing, and handoff persistence -- [ ] Tests pass (`make test`) - ---- - -## Ticket 5: Introduce MemoryFacade and tiered memory API - -**Priority:** Tier 2 -**Estimated scope:** L -**Dependencies:** 1, 2, 3, 4 -**Files to modify:** `brain/memory.py`, `timmy/memory_system.py`, any memory-related service modules -**Files to read first:** `CLAUDE.md`, `AGENTS.md`, `brain/memory.py`, `timmy/memory_system.py` - -### Objective - -Create a single MemoryFacade API that exposes working, episodic, and semantic memory tiers, preparing the system for durable knowledge extraction and reflection. - -### Claude Code Prompt - -``` -You are working in the Timmy Time Dashboard repo. Read CLAUDE.md and AGENTS.md to understand how agents interact with memory. Then inspect brain/memory.py and timmy/memory_system.py. Implement a MemoryFacade abstraction (in brain/memory.py or another existing brain/ module) that provides clearly delineated operations for Working Memory (current context, active plans, tool state), Episodic Memory (events, conversations, actions), and Semantic Memory (facts, skills, embeddings). Build on the unified backing store from UnifiedMemory and the consolidation of HotMemory so that working memory can be a small, size-limited in-memory cache with periodic persistence, episodic memory can store timestamped episodes and summaries, and semantic memory can store vectorized facts and concepts; keep the interface simple and driven by existing usage patterns (e.g., recall for context, store_episode, store_fact). 
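# For reference only — the tier split at its smallest (hypothetical facade;
# back each tier with UnifiedMemory rather than plain dicts and lists):
class MemoryFacade:
    def __init__(self):
        self.working = {}    # current context; size-limited in the real system
        self.episodic = []   # timestamped episodes and summaries
        self.semantic = []   # durable facts / embeddings

    def recall(self, key):
        return self.working.get(key)

    def store_working(self, key, value):
        self.working[key] = value

    def store_episode(self, episode):
        self.episodic.append(episode)

    def store_fact(self, fact):
        self.semantic.append(fact)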
Adapt timmy/memory_system.py to consume MemoryFacade instead of directly juggling multiple stores where possible, while keeping the public API stable for existing callers. Add tests that exercise basic tiered operations: writing and retrieving working memory, creating and retrieving episodic records, storing and querying semantic entries. Run make test to verify. -``` - -### Acceptance Criteria - -- [ ] A MemoryFacade exists that exposes explicit working, episodic, and semantic operations -- [ ] MemoryFacade uses UnifiedMemory as the underlying store -- [ ] `timmy/memory_system.py` is refactored to rely on MemoryFacade where appropriate -- [ ] Tests demonstrate correct behavior for each memory tier -- [ ] Tests pass (`make test`) - ---- - -## Ticket 6: Implement MemoryConsolidator background task - -**Priority:** Tier 3 -**Estimated scope:** M -**Dependencies:** 1, 2, 3, 4, 5 -**Files to modify:** `brain/memory.py` or a new consolidation module under `brain/`, `spark/engine.py`, relevant task scheduler or background worker setup -**Files to read first:** `CLAUDE.md`, `AGENTS.md`, `brain/memory.py`, `spark/engine.py` - -### Objective - -Add a MemoryConsolidator that periodically summarizes recent episodes, extracts facts, and performs decay/reinforcement to turn raw events into durable semantic knowledge. - -### Claude Code Prompt - -``` -You are working in the Timmy Time Dashboard repo. Read CLAUDE.md and AGENTS.md, then open brain/memory.py and spark/engine.py. Implement a MemoryConsolidator class (inside brain/memory.py or a small adjacent module under brain/) that periodically performs the consolidation pipeline described in the research: fetch recent episodic records, summarize them via the local LLM using the project's CascadeRouter/LLM abstraction, extract facts from summaries, store those facts in semantic memory, decay unaccessed memories older than a configured threshold, and reinforce frequently accessed memories. 
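# For reference only — the consolidation pipeline in miniature (hypothetical
# names; the real implementation must call through the CascadeRouter/LLM layer):
def consolidate(episodes, summarize, extract_facts, semantic_store):
    summary = summarize("\n".join(episodes))   # LLM call in the real system
    for fact in extract_facts(summary):        # LLM call in the real system
        semantic_store.append(fact)
    return summary

store = []
summary = consolidate(
    ["user asked about WAL mode", "agent enabled WAL on brain.db"],
    summarize=lambda text: "Enabled WAL mode on brain.db after a user request.",
    extract_facts=lambda s: ["brain.db uses WAL mode"],
    semantic_store=store,
)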
Reuse any existing Spark consolidation logic (e.g., _maybe_consolidate) where possible by extracting it into shared helpers instead of duplicating logic. Wire the consolidator into the appropriate background task mechanism (FastAPI startup event, existing scheduler, or Spark's event-driven hooks), controlled by configuration flags in config.py following pydantic-settings patterns and existing naming conventions. Add tests (with a MockLLM or equivalent) to verify that consolidation calls the LLM with expected prompts, creates summarized episodes, outputs fact entries, and updates decay/reinforcement metadata as configured. Run make test to verify. -``` - -### Acceptance Criteria - -- [ ] MemoryConsolidator exists and implements summarize, extract facts, decay, reinforce steps -- [ ] The consolidator runs on a configurable cadence without blocking request handling -- [ ] Consolidation uses the project's LLM abstraction and respects dependency injection patterns -- [ ] Tests simulate a consolidation cycle with a mock LLM and verify state changes -- [ ] Tests pass (`make test`) - ---- - -## Ticket 7: Create MockLLM for deterministic agent tests - -**Priority:** Tier 1 -**Estimated scope:** S -**Dependencies:** 2 -**Files to modify:** `tests/conftest.py`, any new test utilities module (under `tests/`), selected orchestrator/agent tests -**Files to read first:** `CLAUDE.md`, `AGENTS.md`, `tests/`, `timmy/agents/timmy.py`, any existing LLM abstraction - -### Objective - -Introduce a MockLLM implementation to enable deterministic tests for orchestrator routing, agent behavior, and non-deterministic flows. - -### Claude Code Prompt - -``` -You are working in the Timmy Time Dashboard repo. Read CLAUDE.md and AGENTS.md first to understand how agents call LLMs. Then open tests/conftest.py, timmy/agents/timmy.py, and any existing LLM provider abstractions. 
Implement a MockLLM class (either as a test-only helper in tests/conftest.py or a small test utility module under tests/) that accepts a mapping of string patterns to canned responses and records prompts passed to it. Wire this MockLLM into tests for the Timmy orchestrator and any existing agent tests so that classification and simple conversations can be verified without hitting real models. Ensure the mock supports both sync and async patterns if the project uses both. Add one or more tests that confirm: given a specific prompt or classification request, the orchestrator routes to the expected agent/tool, and the MockLLM returns deterministic content. Follow the project's pytest structure and naming conventions. Run make test to verify. -``` - -### Acceptance Criteria - -- [ ] A MockLLM exists and is easily reusable across tests -- [ ] Orchestrator and/or agent tests rely on MockLLM instead of real LLM calls -- [ ] Tests verify routing and behavior based on predefined MockLLM responses -- [ ] Real LLM calls are not required for CI -- [ ] Tests pass (`make test`) - ---- - -## Ticket 8: Introduce AgentCore perceive-decide-act-reflect interface - -**Priority:** Tier 2 -**Estimated scope:** M -**Dependencies:** 2, 4, 5, 7 -**Files to modify:** `timmy/agent_core/interface.py`, `timmy/agents/timmy.py`, any central agent orchestration modules -**Files to read first:** `CLAUDE.md`, `AGENTS.md`, `timmy/agent_core/interface.py`, `timmy/agents/timmy.py` - -### Objective - -Define and wire an AgentCore interface implementing the perceive-decide-act-reflect loop, sitting between Agno's Agent.run and the orchestrator to structure the cognitive cycle. - -### Claude Code Prompt - -``` -You are working in the Timmy Time Dashboard repo. Read CLAUDE.md and AGENTS.md, then open timmy/agent_core/interface.py and timmy/agents/timmy.py. 
Expand timmy/agent_core/interface.py to define a robust AgentCore abstract base class that captures the perceive, decide, act, and reflect steps, using appropriate type hints or lightweight data structures (WorldState, MemoryContext, Plan, ActionResult) consistent with existing code. Implement at least one concrete AgentCore wrapper around the current Agno-based agent, so that TimmyOrchestrator (or similar orchestrator) can call AgentCore.perceive/decide/act/reflect instead of only Agent.run, while preserving existing behavior. Integrate the AgentCore into the orchestrator path in a minimal, backward-compatible way, so existing routes and integrations still function but new code can rely on the structured loop. Add tests (using MockLLM where useful) that verify a simple request flows through perceive, decide, act, reflect and produces expected events or memory writes. Run make test to verify. -``` - -### Acceptance Criteria - -- [ ] AgentCore interface exists in `timmy/agent_core/interface.py` with perceive, decide, act, and reflect methods -- [ ] A concrete implementation wraps the current Agno agent behavior -- [ ] Orchestrator uses AgentCore in at least the primary path -- [ ] Tests demonstrate a full perceive-decide-act-reflect cycle -- [ ] Tests pass (`make test`) - ---- - -## Ticket 9: Implement deterministic TaskState machine with persistence - -**Priority:** Tier 2 -**Estimated scope:** M -**Dependencies:** 2, 3, 4, 5, 8 -**Files to modify:** `timmy/agents/timmy.py`, `swarm/event_log.py` or the unified event system, any task or job models -**Files to read first:** `CLAUDE.md`, `AGENTS.md`, `timmy/agents/timmy.py`, `infrastructure/events/bus.py`, `swarm/event_log.py` - -### Objective - -Replace ad-hoc routing state with an explicit TaskState enum and a TaskContext structure that is persisted, enabling auditability and recovery of agent tasks. - -### Claude Code Prompt - -``` -You are working in the Timmy Time Dashboard repo. 
Read CLAUDE.md and AGENTS.md, then open timmy/agents/timmy.py, infrastructure/events/bus.py, and swarm/event_log.py. Define a TaskState enum and a TaskContext dataclass (or pydantic model, consistent with project style) that captures task lifecycle phases (e.g., RECEIVED, CLASSIFIED, ROUTED, EXECUTING, REVIEWING, COMPLETED, FAILED) and associated metadata such as classification, assigned_agent, execution_history, and identifiers. Modify TimmyOrchestrator (or the main orchestrator class) to use TaskContext instances for each incoming request, updating state via a validated transition method that also emits events via the unified EventBus so each transition is recorded. Persist TaskContext snapshots using either the event log (event-sourced) or a dedicated table in an existing DB, following existing patterns in swarm/event_log.py or brain/memory.py, to allow replaying or inspecting task histories. Add tests that ensure legal transitions are allowed, illegal ones are rejected, events are emitted on transitions, and a simple task's life can be reconstructed from persisted data. Run make test to verify. -``` - -### Architecture Note (added 2026-03-08) - -> **Schema versioning required.** All persisted Pydantic models (TaskContext, -> AgentMessage) MUST include `schema_version: int = 1`. Without this, event -> replay/reconstruction will break when fields change in later tickets. -> Also add a `trace_id: str` field to TaskContext so the full execution graph -> of any user request can be reconstructed from the event log. 
- -### Acceptance Criteria - -- [ ] TaskState enum and TaskContext structure are defined and used in the orchestrator -- [ ] TaskContext includes `schema_version` and `trace_id` fields -- [ ] TaskContext transitions emit events and are persisted in a durable store -- [ ] Orchestrator uses TaskContext instead of loosely structured dicts/state for routing -- [ ] Tests validate transitions, persistence, and basic reconstruction of task history -- [ ] Tests pass (`make test`) - ---- - -## Ticket 10: Add LLM-based IntentClassifier for routing - -**Priority:** Tier 2 -**Estimated scope:** M -**Dependencies:** 2, 4, 7, 8, 9 -**Files to modify:** `timmy/agents/timmy.py`, LLM provider wrapper modules, possibly `timmy/agents/intent_classifier.py` -**Files to read first:** `CLAUDE.md`, `AGENTS.md`, `timmy/agents/timmy.py`, `infrastructure/router/cascade.py` - -### Objective - -Replace brittle keyword-based routing in TimmyOrchestrator with an LLM-based IntentClassifier that uses the local LLM stack and caches results. - -### Claude Code Prompt - -``` -You are working in the Timmy Time Dashboard repo. Read CLAUDE.md and AGENTS.md, then inspect timmy/agents/timmy.py and the LLM routing modules under infrastructure/router/, particularly CascadeRouter. Implement an IntentClassifier class (e.g., timmy/agents/intent_classifier.py) that uses the project's LLM abstraction to classify requests into categories such as DIRECT, RESEARCH, CODE, MEMORY, CREATIVE, and COMPLEX, based on the prompt template described in the research. Integrate this classifier into TimmyOrchestrator so that for each incoming request, the orchestrator calls IntentClassifier.classify and uses the result to choose the appropriate agent or workflow instead of relying on keyword lists. Add a simple caching mechanism (e.g., LRU cache keyed by normalized request strings) to avoid repeated LLM calls for identical or highly similar inputs, respecting project memory constraints. 
Update or add tests (using MockLLM) to verify that requests are classified into the correct categories and that orchestrator routing behaves appropriately when the classifier returns each label. Run make test to verify. -``` - -### Architecture Note (added 2026-03-08) - -> **Pivot from LLM classify to embedding similarity.** An LLM call adds 500ms–2s -> latency to the critical routing path. Instead, use embedding vectors + cosine -> similarity against a curated set of "Intent Vectors" (sub-10ms, deterministic). -> This reuses the sqlite-vec infrastructure from Ticket 11. Keep LLM fallback -> for truly ambiguous requests only. - -### Acceptance Criteria - -- [ ] An IntentClassifier is implemented and integrated into the orchestrator -- [ ] Primary routing uses embedding similarity, not LLM calls (LLM as fallback only) -- [ ] Keyword-based routing is minimized or removed for primary paths -- [ ] Classification can be tested via MockLLM and deterministic embeddings -- [ ] Tests cover classification behavior and routing outcomes for each category -- [ ] Tests pass (`make test`) - ---- - -## Ticket 11: Implement sqlite-vec or pluggable vector index for semantic memory - -**Priority:** Tier 2 -**Estimated scope:** M -**Dependencies:** 1, 4, 5 -**Files to modify:** `brain/memory.py`, `config.py`, any embedding/vector search helpers -**Files to read first:** `CLAUDE.md`, `AGENTS.md`, `brain/memory.py`, `config.py` - -### Objective - -Replace O(n) Python-based vector similarity computation with an indexed vector search mechanism (starting with sqlite-vec or an equivalent pluggable index) to scale semantic memory. - -### Claude Code Prompt - -``` -You are working in the Timmy Time Dashboard repo. Read CLAUDE.md and AGENTS.md, then open brain/memory.py and config.py. Identify the code paths where semantic memory queries load all embeddings from SQLite and compute cosine similarity in Python. 
Introduce a pluggable vector index abstraction in brain/memory.py that defaults to the current in-Python implementation but can use sqlite-vec or another efficient on-disk index when available. Add configuration flags in config.py (following the nested pydantic Settings pattern if present) to enable or disable sqlite-vec usage and specify any required extension loading or parameters. Implement the sqlite-vec (or similar) path to store vectors in an indexed column and perform queries using SQL, preserving the current API of memory search methods. Ensure the system gracefully falls back to the pure Python path if the extension is unavailable, aligning with the project's graceful degradation philosophy. Add tests to verify both the default and "fast path" behaviors, stubbing or simulating sqlite-vec calls where necessary, and asserting that sorted results remain consistent. Run make test to verify. -``` - -### Acceptance Criteria - -- [ ] Vector search in `brain/memory.py` is abstracted behind a pluggable index interface -- [ ] A configuration-controlled path uses sqlite-vec or similar indexing when enabled -- [ ] The previous in-Python search remains available as a fallback -- [ ] APIs of semantic search consumers are unchanged -- [ ] Tests cover both code paths and verify consistent ranking -- [ ] Tests pass (`make test`) - ---- - -## Ticket 12: Namespace and split config.py using nested pydantic models - -**Priority:** Tier 2 -**Estimated scope:** M -**Dependencies:** 2 -**Files to modify:** `config.py`, any modules that reference settings attributes -**Files to read first:** `CLAUDE.md`, `AGENTS.md`, `config.py` - -### Objective - -Refactor the monolithic Settings class into namespaced sections (LLM, memory, security, creative, etc.) using nested pydantic models while preserving backwards compatibility. - -### Claude Code Prompt - -``` -You are working in the Timmy Time Dashboard repo. Read CLAUDE.md and AGENTS.md, then open config.py. 
Identify the large Settings class with 90+ settings and group fields into meaningful nested models (e.g., LLMConfig, MemoryConfig, SecurityConfig, CreativeConfig) using pydantic BaseModel and nested BaseSettings patterns. Refactor Settings so it exposes these nested configs as attributes (settings.llm, settings.memory, etc.), but also provide backward-compatible properties or aliases for commonly used flat attributes so existing code does not break immediately. Keep environment and file override behavior unchanged and ensure import-time side effects remain avoided per the lazy init pattern implemented earlier. Update a representative set of modules that are "closest" to each namespace to use the new nested attributes (e.g., LLM code using settings.llm.*, memory code using settings.memory.*) to demonstrate the pattern. Add tests that verify loading from environment variables still works, nested models initialize properly, and the compatibility accessors return expected values. Run make test to verify. -``` - -### Acceptance Criteria - -- [ ] `config.py` defines nested config models grouped by domain -- [ ] Settings exposes namespaced attributes and compatibility accessors -- [ ] At least some core modules are updated to use the namespaced config -- [ ] Environment loading and overrides behave as before -- [ ] Tests validate both nested and backward-compatible access -- [ ] Tests pass (`make test`) - ---- - -## Ticket 13: Implement AgentPool for concurrent agent execution - -**Priority:** Tier 3 -**Estimated scope:** M -**Dependencies:** 2, 8, 9, 10 -**Files to modify:** `timmy/agents/timmy.py` (or agent execution module), any async agent loop implementation -**Files to read first:** `CLAUDE.md`, `AGENTS.md`, `timmy/agents/timmy.py` - -### Objective - -Add an AgentPool abstraction that manages concurrent agent execution with a semaphore and thread pool, replacing ad-hoc asyncio.to_thread usage. 
- -### Claude Code Prompt - -``` -You are working in the Timmy Time Dashboard repo. Read CLAUDE.md and AGENTS.md, then open timmy/agents/timmy.py and any agent loop implementation that wraps Agno's Agent.run in asyncio.to_thread. Implement an AgentPool class that uses an asyncio.Semaphore and a ThreadPoolExecutor (with configurable max_concurrent) to execute blocking agent.run calls, exposing an async execute(agent, message) interface. Replace direct calls to asyncio.to_thread or equivalent ad-hoc patterns in the agentic loop with calls to AgentPool.execute, wiring the pool into the orchestrator or agent manager via dependency injection or a lazy getter, consistent with the project's DI approach. Ensure proper cleanup of the ThreadPoolExecutor on shutdown (e.g., FastAPI shutdown events or equivalent hooks). Add tests with a fake or Mock agent that simulate multiple concurrent executions, verifying that concurrency limits are respected and that results are returned correctly. Run make test to verify. -``` - -### Architecture Note (added 2026-03-08) - -> **SQLite write contention risk.** Even with WAL mode (Ticket 1), 4 concurrent -> agents hitting brain.db in a tight loop will cause `database is locked` errors. -> Python's sqlite3 driver handles concurrency poorly. Wrap database writes in an -> `asyncio.Lock()` at the application layer, or adopt `aiosqlite` for proper -> async SQLite access. This is critical for AgentPool to function correctly. 
- -### Acceptance Criteria - -- [ ] AgentPool exists and controls concurrent use of blocking agents -- [ ] Agentic loop no longer directly uses `asyncio.to_thread`; it uses AgentPool -- [ ] Database write paths use `asyncio.Lock()` or `aiosqlite` to prevent lock contention -- [ ] Concurrency limits are configurable via settings -- [ ] Executor shutdown is handled gracefully -- [ ] Tests cover basic concurrency behavior -- [ ] Tests pass (`make test`) - ---- - -## Ticket 14: Define structured AgentMessage for inter-agent communication - -**Priority:** Tier 3 -**Estimated scope:** S -**Dependencies:** 3, 4, 5, 9 -**Files to modify:** `timmy/agents/timmy.py`, any module where agents send messages via the event bus, possibly `timmy/agents/messages.py` -**Files to read first:** `CLAUDE.md`, `AGENTS.md`, `timmy/agents/timmy.py`, `infrastructure/events/bus.py` - -### Objective - -Replace loosely structured dict messages between agents with a structured AgentMessage data model that captures sender, recipient, type, content, and thread metadata. - -### Claude Code Prompt - -``` -You are working in the Timmy Time Dashboard repo. Read CLAUDE.md and AGENTS.md, then inspect timmy/agents/timmy.py and any code that sends inter-agent messages via the EventBus. Implement an AgentMessage dataclass or pydantic model (e.g., in timmy/agents/messages.py) containing fields such as from_agent, to_agent, message_type (request, response, delegate, escalate), content, context, thread_id, priority, and requires_response. Update inter-agent communication code to construct and emit AgentMessage instances instead of generic dicts, while preserving the underlying event/topic names used by EventBus. Ensure serialization and deserialization for events are handled consistently (e.g., via .dict() or model_dump()), keeping backward compatibility where needed for existing consumers. 
Add tests that verify AgentMessage roundtripping through the event bus and that fields are populated as expected in typical workflows. Run make test to verify. -``` - -### Architecture Note (added 2026-03-08) - -> **Schema versioning + trace_id required.** AgentMessage MUST include -> `schema_version: int = 1` for forward-compatible event replay. Also include -> a `trace_id: str` field that correlates all messages belonging to a single -> user request, enabling full execution graph reconstruction from the event log. - -### Acceptance Criteria - -- [ ] AgentMessage model exists and is used for inter-agent communication -- [ ] AgentMessage includes `schema_version` and `trace_id` fields -- [ ] EventBus payloads carry structured messages instead of arbitrary dicts -- [ ] Serialization and deserialization work correctly with existing event infrastructure -- [ ] Tests confirm structure and roundtrip behavior -- [ ] Tests pass (`make test`) - ---- - -## Ticket 15: Introduce capability-based permission model for agents - -**Priority:** Tier 3 -**Estimated scope:** M -**Dependencies:** 2, 4, 5, 8, 9, 14 -**Files to modify:** `config.py`, `infrastructure/hands/tools.py`, `timmy/agents/timmy.py`, any permission/flag checks -**Files to read first:** `CLAUDE.md`, `AGENTS.md`, `config.py`, `infrastructure/hands/tools.py`, `timmy/agents/timmy.py` - -### Objective - -Evolve from global feature flags to a capability-based permission system where each agent has explicit capabilities with scopes, approval requirements, and cost bounds. - -### Claude Code Prompt - -``` -You are working in the Timmy Time Dashboard repo. Read CLAUDE.md and AGENTS.md, then open config.py, infrastructure/hands/tools.py, and timmy/agents/timmy.py. Define Capability and AgentPermissions types (e.g., dataclasses or pydantic models) representing permissions like shell.execute, git.push, memory.write with fields for scope, requires_approval, and max_cost_sats. 
Implement a basic AgentPermissions container that can answer can(action: str) queries. Refactor the code paths that currently rely on global flags such as self_modify_enabled and hands_shell_enabled to consult AgentPermissions per agent instead, starting with the most sensitive operations (shell tools, git, self-modification). Integrate this with the ToolRegistry or hands tool registration so capabilities are associated with tools and filtered by agent. Keep global flags as high-level defaults or safety rails but have the permission checks happen through the new capability model. Add tests to confirm that agents with and without specific capabilities behave differently when attempting sensitive operations and that cost bounds or approval flags are respected where enforced. Run make test to verify. -``` - -### Acceptance Criteria - -- [ ] Capability and AgentPermissions abstractions are implemented -- [ ] Sensitive operations consult AgentPermissions instead of only global flags -- [ ] Tool registration associates capabilities with tools -- [ ] Tests verify permission enforcement and backward compatibility with existing flags -- [ ] Tests pass (`make test`) - ---- - -## Ticket 16: Add GhostCore scaffolding and align with AgentCore - -**Priority:** Tier 4 -**Estimated scope:** L -**Dependencies:** 2, 4, 5, 6, 8, 9, 10, 11, 13, 15 -**Files to modify:** `timmy/ghost.py` (or an existing core module under `timmy/`), `timmy/agent_core/interface.py`, relevant workflow/registry stubs -**Files to read first:** `CLAUDE.md`, `AGENTS.md`, `timmy/agent_core/interface.py`, `timmy/` (core agent modules) - -### Objective - -Introduce an initial GhostCore class implementing a ReAct+Reflexion loop that plugs into AgentCore and prepares for YAML workflows and externalized tools without disrupting current behavior. - -### Claude Code Prompt - -``` -You are working in the Timmy Time Dashboard repo. 
Read CLAUDE.md and AGENTS.md, then open timmy/agent_core/interface.py and the main agent modules under timmy/. Implement an initial GhostCore class (in an existing timmy core file or an appropriately named module under timmy/) that encapsulates the ReAct+Reflexion loop described in the research: observe/perceive inputs, choose a plan, execute tools or actions, and reflect/store lessons using the existing memory abstractions and LLM router. For this ticket, focus on a minimal but functional implementation that wraps the existing AgentCore or orchestrator logic rather than adding full YAML workflows or Docker tool registry: GhostCore should delegate tool invocations to current in-process tools and use MemoryFacade where available. Ensure GhostCore integrates cleanly with the AgentCore interface so that existing routes can opt into GhostCore-based execution for specific workflows without breaking legacy behavior. Add tests that simulate a simple goal being processed by GhostCore, verifying that observe and reflect steps call into memory and that the loop can handle at least one iteration of critique and adjustment. Run make test to verify. 
-``` - -### Acceptance Criteria - -- [ ] GhostCore exists and implements a basic ReAct+Reflexion loop on top of AgentCore -- [ ] GhostCore uses existing LLM and memory abstractions rather than new dependencies -- [ ] Integration points exist so orchestrator or routes can opt into GhostCore -- [ ] Tests cover a simple GhostCore run and verify memory interaction and loop structure -- [ ] Tests pass (`make test`) - ---- - -## Ticket 17: Strengthen non-deterministic testing with golden-path and chaos tests - -**Priority:** Tier 4 -**Estimated scope:** M -**Dependencies:** 2, 3, 4, 7, 8, 9, 10, 13 -**Files to modify:** `tests/` (new golden and chaos tests), any test helpers or fixtures related to LLMs, agents, and EventBus -**Files to read first:** `CLAUDE.md`, `AGENTS.md`, `tests/`, `timmy/agents/timmy.py`, `infrastructure/events/bus.py` - -### Objective - -Extend the test suite with golden-path end-to-end tests and basic chaos tests for agent resilience, using MockLLM and the unified event system. - -### Claude Code Prompt - -``` -You are working in the Timmy Time Dashboard repo. Read CLAUDE.md and AGENTS.md, then review the existing tests directory, timmy/agents/timmy.py, and infrastructure/events/bus.py. Using the MockLLM and AgentPool abstractions, add golden-path tests that run a full agentic loop for simple tasks (e.g., "List files in current directory") and assert that results satisfy high-level properties: status completed or partial, non-empty step list, valid summary, positive duration. Add chaos-style tests (marked with an appropriate pytest marker) that simulate failures of core components (e.g., temporary LLM errors, EventBus persistence failures, or underlying LLM backend crashes) using context managers or fixtures, verifying that the system degrades gracefully (errors captured, partial results returned, no crashes) in line with the project's graceful degradation philosophy. 
Ensure tests are deterministic by using MockLLM and stubbed backends where possible and that they are safe to run in CI (chaos tests can be optionally skipped via markers). Run make test to verify. -``` - -### Acceptance Criteria - -- [ ] Golden-path tests exist that exercise full agentic flows with assertions on outcomes -- [ ] Chaos tests simulate core infrastructure failures and verify graceful degradation -- [ ] Tests use MockLLM and do not rely on real external LLMs -- [ ] Test markers allow chaos tests to be selectively run -- [ ] Tests pass (`make test`) - ---- - -## Ticket 18: Tighten tool/plugin architecture and ToolCapability model - -**Priority:** Tier 4 -**Estimated scope:** M -**Dependencies:** 2, 3, 4, 5, 11, 12, 15 -**Files to modify:** `infrastructure/hands/tools.py`, any MCP or tool registry modules, `plugins/` (if present) -**Files to read first:** `CLAUDE.md`, `AGENTS.md`, `infrastructure/hands/tools.py`, any tool registry code - -### Objective - -Formalize tools as capabilities with schemas and safety metadata, and ensure the tool registry supports discovery and agent-specific tool sets. - -### Claude Code Prompt - -``` -You are working in the Timmy Time Dashboard repo. Read CLAUDE.md and AGENTS.md, then open infrastructure/hands/tools.py and any existing tool registry or MCP integration modules. Implement a ToolCapability model (dataclass or pydantic) containing fields like name, description, input_schema, output_schema, cost_estimate, requires_approval, and safety_level. Extend the tool registry so it registers ToolCapability instances alongside their handlers, supports natural-language discovery (e.g., discover(query: str) -> list[ToolCapability]), and can return tools permitted for a specific agent using AgentPermissions. Update existing shell/git tools and other core tools to register via this new structure while keeping any current registration APIs working for backward compatibility. 
Add tests that confirm registration works, discovery returns expected capabilities, and agent-specific tool lists respect permissions. Run make test to verify. -``` - -### Acceptance Criteria - -- [ ] ToolCapability model exists and is used in the tool registry -- [ ] Tool registry supports registration, discovery, and agent-scoped tool queries -- [ ] Core tools are registered with capabilities including safety metadata -- [ ] Tests validate the registry behavior and permission filtering -- [ ] Tests pass (`make test`) - ---- - -## Execution Order Summary - -| Order | Ticket | Title | Tier | Scope | -|-------|--------|-------|------|-------| -| 1 | T1 | WAL mode for all SQLite databases | 1 | S | -| 2 | T2 | Lazy initialization for singletons | 1 | M | -| 3 | T7 | MockLLM for deterministic tests | 1 | S | -| 4 | T3 | Unify EventBus with swarm persistence | 1 | M | -| 5 | T4 | Consolidate MEMORY.md into UnifiedMemory | 1 | M | -| 6 | T5 | MemoryFacade and tiered memory API | 2 | L | -| 7 | T12 | Namespace config.py with nested models | 2 | M | -| 8 | T11 | sqlite-vec pluggable vector index | 2 | M | -| 9 | T8 | AgentCore perceive-decide-act-reflect | 2 | M | -| 10 | T9 | TaskState machine with persistence | 2 | M | -| 11 | T10 | LLM-based IntentClassifier | 2 | M | -| 12 | T14 | Structured AgentMessage model | 3 | S | -| 13 | T13 | AgentPool for concurrent execution | 3 | M | -| 14 | T15 | Capability-based permission model | 3 | M | -| 15 | T6 | MemoryConsolidator background task | 3 | M | -| 16 | T16 | GhostCore scaffolding | 4 | L | -| 17 | T17 | Golden-path and chaos tests | 4 | M | -| 18 | T18 | ToolCapability model and registry | 4 | M | - -*Note: T1 and T2 can run in parallel. T7 can run after T2 completes. 
Tickets within -the same tier can often be parallelized if their dependencies are satisfied.* - ---- - -## Supplemental Tickets (Peer Review Gaps) - -The following tickets cover consensus findings from all 5 peer reviews that -Perplexity's 18 tickets did not address. Numbering continues from T19. - ---- - -## Ticket 19: Add security threat model and sovereignty boundary documentation - -**Priority:** Tier 1 (all 5 reviewers flagged as CRITICAL) -**Estimated scope:** M -**Dependencies:** none -**Files to modify:** `docs/THREAT_MODEL.md` (new, under `docs/`), `docs/SOVEREIGN_AGI_RESEARCH.md` (cross-reference) -**Files to read first:** `CLAUDE.md`, `AGENTS.md`, `config.py`, `infrastructure/hands/shell.py`, `dashboard/middleware/csrf.py`, `dashboard/middleware/security_headers.py` - -### Objective - -Create an explicit threat model documenting adversaries, attack vectors, mitigations, and the sovereignty boundary — what is fully local vs what depends on external infrastructure. - -### Claude Code Prompt - -``` -You are working in the Timmy Time Dashboard repo. Read CLAUDE.md and AGENTS.md first. Then read the existing security code: dashboard/middleware/csrf.py, dashboard/middleware/security_headers.py, infrastructure/hands/shell.py (especially the blocked commands at lines 29-39 and allow-list at lines 42-71), and config.py (production validation at lines 349-363, self_modify_enabled, hands_shell_enabled flags). Also read docs/SOVEREIGN_AGI_RESEARCH.md for the architecture context. - -Create docs/THREAT_MODEL.md with the following sections: - -1. Sovereignty Boundary Diagram (text/ASCII): classify each component as Fully Local (Ghost Core, Ollama, SQLite, brain/memory), Locally Controlled External (LN peers, Docker containers), or External Dependency (Discord/Telegram APIs, Docker Hub images). - -2. Threat Actor Table with columns: Actor, Capabilities, Attack Vectors, Current Mitigations, Residual Risk, Recommended Mitigations. 
Include at minimum: (a) Malicious/compromised tool container, (b) Prompt-injected content from web/Discord/Telegram triggering tool calls or LN spend, (c) Compromised LN peer, (d) Malicious local user/process tampering with state or keys, (e) Supply chain attack via Docker images. - -3. Network Egress Policy: define "no outbound HTTP except allowed" rule for Ghost Core, listing exactly which outbound connections are permitted (Ollama localhost, LN node, configured webhook URLs). - -4. Secrets Management: document current approach (pydantic-settings .env, production enforcement in config.py), identify gaps (no rotation strategy, no encrypted-at-rest), recommend next steps (SOPS or similar). - -5. Container/Tool Sandboxing Requirements: for each tool trust level (trusted in-process, sandboxed subprocess, untrusted container), specify filesystem isolation, network isolation, and capability constraints. - -Keep the document concise and actionable — tables over prose. Add a cross-reference link from docs/SOVEREIGN_AGI_RESEARCH.md's security section. Run make test to verify nothing is broken. 
-``` - -### Acceptance Criteria - -- [ ] `docs/THREAT_MODEL.md` exists with all 5 sections -- [ ] Threat actor table covers at least 5 adversary types with mitigations -- [ ] Sovereignty boundary clearly classifies all major components -- [ ] Network egress policy is explicit -- [ ] Cross-reference added to research document -- [ ] Tests pass (`make test`) - ---- - -## Ticket 20: Add OpenTelemetry observability infrastructure - -**Priority:** Tier 1 (all 5 reviewers flagged as HIGH) -**Estimated scope:** M -**Dependencies:** 2, 3 -**Files to modify:** `infrastructure/observability.py` (new under `infrastructure/`), `infrastructure/events/bus.py`, `dashboard/app.py`, `config.py` -**Files to read first:** `CLAUDE.md`, `AGENTS.md`, `infrastructure/events/bus.py`, `infrastructure/events/broadcaster.py`, `dashboard/app.py`, `config.py` - -### Objective - -Add optional OpenTelemetry tracing with LLM-specific span attributes, wired into the EventBus and FastAPI middleware, so agent workflows can be debugged end-to-end. - -### Claude Code Prompt - -``` -You are working in the Timmy Time Dashboard repo. Read CLAUDE.md and AGENTS.md first. Then read infrastructure/events/bus.py, dashboard/app.py (middleware stack at lines 291-313, lifespan at lines 200-267), and config.py. - -Create infrastructure/observability.py with: - -1. A config flag in config.py: `observability_enabled: bool = False` and `observability_endpoint: str = ""` following pydantic-settings patterns. - -2. An init_tracing() function that sets up OpenTelemetry TracerProvider with OTLP exporter when enabled, no-op when disabled. Use graceful degradation — if opentelemetry packages aren't installed, log a warning and return a no-op tracer. Never crash. - -3. A trace_agent_step() context manager that creates spans with attributes: llm.model, llm.prompt_tokens, workflow.step, memory.retrieved_count, economic.sats_spent. Make it safe to use even when tracing is disabled (no-op path). - -4. 
Wire init_tracing() into dashboard/app.py's lifespan startup. Add FastAPI middleware or instrument_fastapi() call for HTTP request tracing when enabled. - -5. Add a subscriber to EventBus that creates spans for published events, linking event_id to trace context so events appear in traces. - -Ensure the entire feature is opt-in and zero-cost when disabled. Add opentelemetry packages to an optional dependency group (not required). Add tests that verify: tracing initializes without error when disabled, trace_agent_step produces expected attributes when a mock tracer is injected, and the EventBus subscriber doesn't break event publishing. Run make test to verify. -``` - -### Acceptance Criteria - -- [ ] `infrastructure/observability.py` exists with init_tracing() and trace_agent_step() -- [ ] Feature is fully opt-in via config flag, zero-cost when disabled -- [ ] Graceful degradation when opentelemetry packages aren't installed -- [ ] EventBus events are linked to trace context -- [ ] FastAPI middleware adds HTTP tracing when enabled -- [ ] Tests pass without opentelemetry installed (`make test`) - ---- - -## Ticket 21: Add human-in-the-loop approval gates - -**Priority:** Tier 2 (4 of 5 reviewers flagged as HIGH) -**Estimated scope:** M -**Dependencies:** 2, 3, 9, 15 -**Files to modify:** `timmy/approvals.py`, `infrastructure/hands/shell.py`, `infrastructure/hands/git.py`, `config.py`, `dashboard/routes/tasks.py` -**Files to read first:** `CLAUDE.md`, `AGENTS.md`, `timmy/approvals.py`, `infrastructure/hands/shell.py`, `infrastructure/hands/git.py`, `config.py` - -### Objective - -Implement approval gates that pause agent execution for human review before high-risk actions (shell commands, git push, LN payments above threshold), with configurable auto-approve for low-risk operations. - -### Claude Code Prompt - -``` -You are working in the Timmy Time Dashboard repo. Read CLAUDE.md and AGENTS.md first. 
Then read timmy/approvals.py (existing approval queue), infrastructure/hands/shell.py (shell sandboxing), infrastructure/hands/git.py (git execution), and config.py. - -Extend the existing approval system in timmy/approvals.py to support: - -1. An ApprovalPolicy model (pydantic or dataclass) with fields: action_type (str), requires_approval (bool), auto_approve_below_sats (int), max_invocations_per_run (int). Add a list of default policies in config.py following existing naming patterns. - -2. An async request_approval(action, context) method that checks the policy, auto-approves if within bounds, otherwise queues the action and emits an "approval.requested" event via EventBus, then awaits resolution (approved/denied) with a configurable timeout. - -3. Wire shell.execute in infrastructure/hands/shell.py and git operations in infrastructure/hands/git.py to call request_approval() before executing when the policy requires it. Keep the existing allow-list/block-list as a first pass; approval gates are a second pass for allowed-but-risky commands. - -4. Add a simple approval review endpoint to dashboard/routes/tasks.py (or an existing route file) that lists pending approvals and allows approve/deny via HTMX, following the existing route registration pattern in dashboard/app.py. - -5. Add a kill switch config flag: `agent_kill_switch: bool = False` that, when True, immediately denies all pending and new approval requests and prevents new workflow starts. - -Add tests that verify: approval is requested for shell commands when policy requires it, auto-approve works below threshold, kill switch blocks all actions, and the approval endpoint returns pending items. Run make test to verify. 
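The core of items 1 and 2 is an awaitable approval request backed by one future per pending action. A simplified sketch with assumed names and defaults, not the existing `timmy/approvals.py` API: auto-approval below the sats threshold, deny-on-timeout, and a kill switch that short-circuits everything:

```python
import asyncio
from dataclasses import dataclass

@dataclass
class ApprovalPolicy:
    action_type: str
    requires_approval: bool = True
    auto_approve_below_sats: int = 0

class ApprovalQueue:
    """Pending approvals keyed by action id; the kill switch denies everything."""

    def __init__(self) -> None:
        self.kill_switch = False
        self.pending: dict[str, asyncio.Future] = {}

    async def request_approval(self, action_id: str, policy: ApprovalPolicy,
                               cost_sats: int = 0, timeout: float = 60.0) -> bool:
        if self.kill_switch:
            return False  # deny immediately, no queuing
        if not policy.requires_approval or cost_sats < policy.auto_approve_below_sats:
            return True  # auto-approve low-risk actions within bounds
        future = asyncio.get_running_loop().create_future()
        self.pending[action_id] = future
        try:
            # The real system would also emit "approval.requested" here.
            return await asyncio.wait_for(future, timeout)
        except asyncio.TimeoutError:
            return False  # unresolved approvals deny by default
        finally:
            self.pending.pop(action_id, None)

    def resolve(self, action_id: str, approved: bool) -> None:
        future = self.pending.get(action_id)
        if future is not None and not future.done():
            future.set_result(approved)
```

The shell and git wrappers would call `await queue.request_approval(...)` after the allow-list check and before executing, matching the two-pass design in item 3.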
-``` - -### Acceptance Criteria - -- [ ] ApprovalPolicy model exists and is configurable via config.py -- [ ] Shell and git operations check approval before executing when policy requires -- [ ] Auto-approve works for low-risk operations within bounds -- [ ] Kill switch immediately blocks all agent actions -- [ ] Dashboard endpoint shows pending approvals -- [ ] Tests pass (`make test`) - ---- - -## Ticket 22: Add YAML workflow schema validation with Pydantic - -**Priority:** Tier 2 (4 of 5 reviewers flagged as HIGH) -**Estimated scope:** S -**Dependencies:** 2 -**Files to modify:** `timmy/workflows/` (or appropriate existing module), `config.py` -**Files to read first:** `CLAUDE.md`, `AGENTS.md`, any existing YAML workflow files or references in the codebase - -### Objective - -Prevent malformed YAML workflows from crashing the system by validating all workflow files against a strict Pydantic schema before execution, with schema versioning for future migration. - -### Claude Code Prompt - -``` -You are working in the Timmy Time Dashboard repo. Read CLAUDE.md and AGENTS.md first. Search the codebase for any existing YAML workflow files or references to workflow loading/parsing (grep for "yaml", "workflow", "YAML" in src/). - -Implement workflow validation: - -1. Define a WorkflowSchema pydantic model that captures: schema_version (str), workflow_id (str), description (str), steps (list of StepSchema with name, tool, parameters, on_failure), budget (optional: max_sats, max_file_writes, max_external_domains, max_shell_commands), and review_required (list of action types requiring human approval). Include a migration section with from_version and instructions fields for future use. - -2. Create a validate_workflow(yaml_path_or_content) function that parses YAML via safe_load, validates against WorkflowSchema, and returns either the validated model or raises a clear ValidationError. This function MUST be called before any workflow is executed or saved. - -3. 
If no workflow loading code exists yet, create the scaffolding in an appropriate module under timmy/ (e.g., timmy/workflow_engine.py) with load_workflow() and validate_workflow() functions. If workflow code already exists, integrate validation into the existing loading path. - -4. Add a budget enforcement check: before a workflow runs, verify its declared budget doesn't exceed system-level caps from config.py. - -Add tests that verify: valid workflows pass validation, invalid YAML is rejected with clear errors, missing required fields are caught, budget caps are enforced, and schema_version is present. Run make test to verify. -``` - -### Acceptance Criteria - -- [ ] WorkflowSchema pydantic model exists with all specified fields -- [ ] validate_workflow() rejects malformed YAML with clear error messages -- [ ] Schema versioning field is required -- [ ] Budget enforcement checks system-level caps -- [ ] Tests cover valid, invalid, and edge-case workflows -- [ ] Tests pass (`make test`) - ---- - -## Ticket 23: Add ToolSubstrate abstraction (subprocess + Docker backends) - -**Priority:** Tier 2 (all 5 reviewers flagged as HIGH) -**Estimated scope:** M -**Dependencies:** 2, 15 -**Files to modify:** `infrastructure/hands/tools.py`, `infrastructure/hands/shell.py`, `config.py` -**Files to read first:** `CLAUDE.md`, `AGENTS.md`, `infrastructure/hands/tools.py`, `infrastructure/hands/shell.py`, `config.py` - -### Objective - -Abstract tool execution behind a ToolSubstrate interface with subprocess (bubblewrap) and Docker backends, so tools run on RPi/bare-metal without requiring Docker and container cold-start latency is avoided for lightweight tools. - -### Claude Code Prompt - -``` -You are working in the Timmy Time Dashboard repo. Read CLAUDE.md and AGENTS.md first. Then read infrastructure/hands/tools.py, infrastructure/hands/shell.py (existing sandbox with allow-list), and config.py. - -Implement a ToolSubstrate abstraction: - -1. 
Define a ToolSubstrate protocol/ABC in infrastructure/hands/tools.py with methods: async execute(tool_name, params) -> ToolResult, async health_check() -> bool, async shutdown(). - -2. Implement SubprocessSubstrate that runs tools as local subprocesses using the existing shell sandboxing in infrastructure/hands/shell.py. This is the default — works everywhere, fast startup, uses existing allow-list and timeout enforcement. If bubblewrap (bwrap) is available on the system, use it for additional filesystem/network isolation; if not, fall back to plain subprocess with existing restrictions. - -3. Implement ContainerSubstrate that runs tools in Docker/Podman containers. Detect whether docker or podman is available (prefer podman for rootless security). Support configurable image registry, network isolation mode, and volume mounts. Handle cold-start by supporting a warm pool of N pre-started containers (configurable via config.py). - -4. Add config flags: tool_substrate_default (str, "subprocess" or "container"), tool_container_runtime (str, "auto" | "docker" | "podman"), tool_warm_pool_size (int, default 0). - -5. Wire the substrate selection into the tool execution path so that tools declare their required substrate (or use the default), and the runtime selects the appropriate backend. - -Add tests that verify: SubprocessSubstrate executes a simple command, ContainerSubstrate gracefully degrades when Docker/Podman isn't available, substrate selection respects config, and health checks work. Run make test to verify. 
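Items 1 and 2 reduce to a small protocol plus a subprocess backend that opportunistically wraps commands in bubblewrap. A sketch with assumed names (`ToolResult` here stands in for whatever `infrastructure/hands/tools.py` already defines, and the bwrap flags are illustrative, not a hardened profile):

```python
import asyncio
import shutil
from dataclasses import dataclass
from typing import Protocol

@dataclass
class ToolResult:
    ok: bool
    stdout: str
    stderr: str

class ToolSubstrate(Protocol):
    async def execute(self, tool_name: str, params: list) -> ToolResult: ...
    async def health_check(self) -> bool: ...
    async def shutdown(self) -> None: ...

class SubprocessSubstrate:
    """Default backend: plain subprocess, with bubblewrap when available."""

    def __init__(self, timeout: float = 30.0) -> None:
        self.timeout = timeout
        self.bwrap = shutil.which("bwrap")  # None when bubblewrap is absent

    async def execute(self, tool_name: str, params: list) -> ToolResult:
        argv = [tool_name, *params]
        if self.bwrap:
            # Read-only rootfs, no network; a real profile would be stricter.
            argv = [self.bwrap, "--ro-bind", "/", "/", "--unshare-net", *argv]
        proc = await asyncio.create_subprocess_exec(
            *argv,
            stdout=asyncio.subprocess.PIPE,
            stderr=asyncio.subprocess.PIPE,
        )
        out, err = await asyncio.wait_for(proc.communicate(), self.timeout)
        return ToolResult(proc.returncode == 0, out.decode(), err.decode())

    async def health_check(self) -> bool:
        return True  # a container backend would ping the runtime here

    async def shutdown(self) -> None:
        pass  # nothing persistent to tear down for subprocesses
```

`ContainerSubstrate` would implement the same protocol, which is what lets the runtime swap backends per tool without callers noticing.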
-``` - -### Acceptance Criteria - -- [ ] ToolSubstrate interface defined with execute/health_check/shutdown -- [ ] SubprocessSubstrate works as default, with optional bubblewrap -- [ ] ContainerSubstrate supports Docker and Podman with graceful degradation -- [ ] Config flags control substrate selection and warm pool size -- [ ] Tests pass without Docker installed (`make test`) - ---- - -## Ticket 24: Add event sourcing snapshot and compaction - -**Priority:** Tier 2 (3 of 5 reviewers flagged as MEDIUM) -**Estimated scope:** M -**Dependencies:** 1, 3 -**Files to modify:** `infrastructure/events/bus.py` or `swarm/event_log.py`, `config.py` -**Files to read first:** `CLAUDE.md`, `AGENTS.md`, `infrastructure/events/bus.py`, `swarm/event_log.py` - -### Objective - -Prevent infinite event log growth by adding periodic snapshotting and compaction, archiving old events while preserving replay capability from the most recent snapshot. - -### Claude Code Prompt - -``` -You are working in the Timmy Time Dashboard repo. Read CLAUDE.md and AGENTS.md first. Then read infrastructure/events/bus.py and swarm/event_log.py (the unified event system from Ticket 3). - -Implement event compaction: - -1. Add a snapshots table to the event log database schema: snapshot_id, timestamp, state_json (serialized system state at that point), event_count_at_snapshot. - -2. Implement create_snapshot() that captures the current materialized state from the event log (counts, latest task states, agent states) and stores it as a snapshot row. - -3. Implement compact(before_date) that: creates a snapshot at the compaction point, moves events older than before_date to an archive table (or deletes them if archive isn't needed), and logs the compaction. Ensure the snapshot + remaining events can fully reconstruct current state. - -4. Add config flags: event_compaction_enabled (bool, default False), event_compaction_interval_hours (int, default 168 = 1 week), event_max_age_days (int, default 90). - -5. 
Wire compaction into the FastAPI lifespan or an existing background scheduler so it runs periodically when enabled. - -6. Update the replay API (from Ticket 3) to start from the most recent snapshot and replay only subsequent events, avoiding O(n) full replay. - -Add tests that verify: snapshots capture expected state, compaction removes old events, replay from snapshot + remaining events produces correct state, and the feature is inert when disabled. Run make test to verify. -``` - -### Acceptance Criteria - -- [ ] Snapshots table exists and create_snapshot() works -- [ ] compact() archives old events and creates a snapshot -- [ ] Replay starts from snapshot when available -- [ ] Feature is opt-in via config flags -- [ ] Tests verify snapshot correctness and replay consistency -- [ ] Tests pass (`make test`) - ---- - -## Ticket 25: Add data-driven policy engine for agent permissions - -**Priority:** Tier 2 (3 of 5 reviewers flagged as MEDIUM) -**Estimated scope:** M -**Dependencies:** 2, 15, 22 -**Files to modify:** `config.py`, `timmy/agents/timmy.py`, `infrastructure/hands/tools.py` -**Files to read first:** `CLAUDE.md`, `AGENTS.md`, `config.py`, `timmy/agents/timmy.py`, `infrastructure/hands/tools.py` - -### Objective - -Replace hardcoded permission checks with a data-driven policy engine where policies are defined as YAML/JSON configuration, enabling users to adjust agent behavior without editing Python. - -### Claude Code Prompt - -``` -You are working in the Timmy Time Dashboard repo. Read CLAUDE.md and AGENTS.md first. Then read config.py, timmy/agents/timmy.py, and infrastructure/hands/tools.py. - -Implement a minimal policy engine: - -1. Define a Policy pydantic model with fields: id (str), description (str), match (dict with capability pattern, e.g., "shell.execute"), constraints (dict with requires_approval: bool, max_invocations_per_run: int, allowed_time_window: optional str, max_cost_sats: optional int). - -2. 
Add a policies_file config setting in config.py (str, default "policies.yaml") and a PolicyEngine class that loads policies from this file on startup, with graceful fallback to sensible defaults if the file doesn't exist. - -3. Implement PolicyEngine.evaluate(capability: str, context: dict) -> PolicyDecision that checks all matching policies and returns ALLOW, DENY, or REQUIRE_APPROVAL with the reason. - -4. Wire PolicyEngine.evaluate into the capability check path from Ticket 15 (AgentPermissions.can()) so that the policy engine is consulted for every sensitive action. Keep the existing global flags as ultimate overrides (kill switch pattern). - -5. Create a default policies.yaml in the repo with sensible defaults: shell.execute requires approval with max 3 per run, git.push requires approval, memory.write is allowed, file.write is allowed within src/ and tests/ only. - -Add tests that verify: policies load from YAML, evaluate returns correct decisions for matching and non-matching rules, default policies work, and missing policy file triggers graceful fallback. Run make test to verify. 
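Items 1 through 3 describe a small first-match evaluator. A sketch with flattened constraint fields (the real model would keep the nested `constraints` dict and the YAML loading):

```python
from dataclasses import dataclass, field
from enum import Enum
from fnmatch import fnmatchcase
from typing import Optional

class Decision(Enum):
    ALLOW = "allow"
    DENY = "deny"
    REQUIRE_APPROVAL = "require_approval"

@dataclass
class Policy:
    id: str
    match: str                        # capability pattern, e.g. "shell.*"
    requires_approval: bool = False
    max_cost_sats: Optional[int] = None

@dataclass
class PolicyEngine:
    policies: list = field(default_factory=list)

    def evaluate(self, capability: str, context: dict) -> Decision:
        for policy in self.policies:
            if not fnmatchcase(capability, policy.match):
                continue
            if (policy.max_cost_sats is not None
                    and context.get("cost_sats", 0) > policy.max_cost_sats):
                return Decision.DENY
            if policy.requires_approval:
                return Decision.REQUIRE_APPROVAL
            return Decision.ALLOW
        return Decision.DENY  # default-deny for capabilities nobody matched
```

First match wins, so ordering in policies.yaml doubles as precedence; the global flags (kill switch pattern) would still be checked before the engine is consulted.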
-``` - -### Acceptance Criteria - -- [ ] Policy model and PolicyEngine class exist -- [ ] Policies load from YAML file with graceful fallback -- [ ] evaluate() returns correct decisions based on matching rules -- [ ] PolicyEngine is wired into capability checks -- [ ] Default policies.yaml provides sensible defaults -- [ ] Tests pass (`make test`) - ---- - -## Ticket 26: Add LN payment error handling and compensation events - -**Priority:** Tier 3 (3 of 5 reviewers flagged as MEDIUM) -**Estimated scope:** M -**Dependencies:** 3, 9, 20 -**Files to modify:** modules referencing Lightning/L402 payments, `config.py`, `infrastructure/events/bus.py` -**Files to read first:** `CLAUDE.md`, `AGENTS.md`, `config.py` (L402 settings), any LN/payment modules in the codebase - -### Objective - -Treat Lightning Network operations as asynchronous events with retry and compensation logic, handling partial failures gracefully instead of assuming payments always succeed. - -### Claude Code Prompt - -``` -You are working in the Timmy Time Dashboard repo. Read CLAUDE.md and AGENTS.md first. Search the codebase for Lightning, L402, payment, invoice, and sats references (grep in src/). Read config.py lines 349-363 for L402 production validation. - -Implement LN payment resilience: - -1. Define PaymentEvent pydantic models: PaymentRequested, PaymentCompleted, PaymentFailed, PaymentCompensated, each extending the project's Event model with fields like invoice_id, amount_sats, recipient, workflow_id, retry_count, failure_reason. - -2. Implement a PaymentManager class (in the appropriate existing module or a new module under infrastructure/) that wraps LN payment calls with: retry logic (configurable max_retries, exponential backoff), timeout handling, and compensation event emission on failure. - -3. 
Add budget tracking: PaymentManager maintains a per-workflow running total of sats spent and refuses payments that would exceed the workflow's declared budget (from WorkflowSchema if available) or a global daily limit from config.py (e.g., ln_daily_limit_sats: int = 10000). - -4. Emit all payment lifecycle events through the unified EventBus so they appear in traces and can be audited. - -5. Handle the "workflow succeeded but payment failed" case by marking the workflow as "financially_incomplete" in TaskState and emitting a compensation event for manual review. - -Add tests that verify: payment retry logic, budget enforcement, compensation events on failure, and the financially_incomplete state. Mock all actual LN calls. Run make test to verify. -``` - -### Acceptance Criteria - -- [ ] PaymentEvent models exist for the full payment lifecycle -- [ ] PaymentManager implements retry, timeout, and budget enforcement -- [ ] Per-workflow and daily budget limits are enforced -- [ ] Payment events flow through EventBus -- [ ] Partial failure produces financially_incomplete state -- [ ] Tests pass without real LN node (`make test`) - ---- - -## Ticket 27: Add Timmy Console introspection dashboard route - -**Priority:** Tier 2 (recommended by 3 of 5 reviewers) -**Estimated scope:** M -**Dependencies:** 3, 9, 20 -**Files to modify:** `dashboard/routes/` (new route file), `dashboard/templates/` (new template), `dashboard/app.py` -**Files to read first:** `CLAUDE.md`, `AGENTS.md`, `dashboard/app.py`, `dashboard/routes/system.py`, `dashboard/templates/base.html` - -### Objective - -Add a /console route that provides real-time operator introspection: active workflows, recent events, LN balance, current capabilities/policies, and a kill switch toggle. - -### Claude Code Prompt - -``` -You are working in the Timmy Time Dashboard repo. Read CLAUDE.md and AGENTS.md first. 
Then read dashboard/app.py (route registration at lines 323-348), dashboard/routes/system.py (existing system info route as a pattern), and dashboard/templates/base.html (base template with navbar). - -Create a Timmy Console page: - -1. Add dashboard/routes/console.py with an APIRouter following the existing pattern. Implement GET /console that renders a template with: (a) list of active workflows/tasks with their TaskState, (b) recent events from EventBus (last 50, with filtering by type), (c) LN wallet balance and recent payments if available, (d) current agent capabilities/policies summary, (e) kill switch toggle button. - -2. Create dashboard/templates/console.html extending base.html. Use HTMX for live updates — poll /console/events every 5 seconds for the event feed, similar to the swarm_live.html pattern. Display events in a scrollable log with timestamp, type, source, and data preview. - -3. Add POST /console/kill-switch endpoint that toggles the agent_kill_switch config flag and returns an HTMX partial confirming the new state. - -4. Register the router in dashboard/app.py following the existing pattern. - -5. Add a "Console" link to the navbar in base.html. - -Add tests that verify: GET /console returns 200, event feed returns recent events, kill switch toggle works, and the page renders without errors. Follow the existing test patterns in tests/dashboard/. Run make test to verify. 
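The "last 50 events, with filtering by type" feed in item 1 is a ring buffer at heart. A dependency-free sketch (the real route would populate this from an EventBus subscription rather than direct calls, and the Event fields here are assumptions):

```python
from collections import deque
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional

@dataclass
class Event:
    type: str
    source: str
    data: dict
    timestamp: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc))

class EventFeed:
    """Bounded recent-events buffer backing the /console event log."""

    def __init__(self, maxlen: int = 50) -> None:
        self._events = deque(maxlen=maxlen)  # old events fall off the front

    def record(self, event: Event) -> None:
        self._events.append(event)

    def recent(self, event_type: Optional[str] = None) -> list:
        events = list(self._events)
        if event_type is not None:
            events = [e for e in events if e.type == event_type]
        return list(reversed(events))  # newest first for display
```

The HTMX poll endpoint would then just render `feed.recent(type_filter)` into the log partial every 5 seconds.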
-``` - -### Acceptance Criteria - -- [ ] GET /console renders with active tasks, events, and controls -- [ ] HTMX live event feed updates every 5 seconds -- [ ] Kill switch toggle works via POST endpoint -- [ ] Route registered in app.py, link in navbar -- [ ] Tests cover route responses and kill switch behavior -- [ ] Tests pass (`make test`) - ---- - -## Ticket 28: Add SQLite write queue with backpressure - -**Priority:** Tier 3 (3 of 5 reviewers flagged as MEDIUM) -**Estimated scope:** M -**Dependencies:** 1, 3 -**Files to modify:** `brain/memory.py`, `swarm/event_log.py`, or a shared DB utility module -**Files to read first:** `CLAUDE.md`, `AGENTS.md`, `brain/memory.py`, `swarm/event_log.py` - -### Objective - -Add a write queue for SQLite operations that serializes writes, prevents connection exhaustion under concurrent agent load, and applies backpressure when the queue is full. - -### Claude Code Prompt - -``` -You are working in the Timmy Time Dashboard repo. Read CLAUDE.md and AGENTS.md first. Then read brain/memory.py and swarm/event_log.py for existing SQLite connection patterns. - -Implement a SQLite write queue: - -1. Create a WriteQueue class (in an existing infrastructure or utility module) that accepts write operations as callables, queues them in an asyncio.Queue with a configurable max size (default 1000), and processes them sequentially in a background task using a single dedicated SQLite connection per database. - -2. Implement backpressure: when the queue is full, callers receive a QueueFullError (or the write blocks with a timeout), preventing unbounded memory growth. Log warnings when the queue reaches 80% capacity. - -3. Wire WriteQueue into the highest-traffic write paths: event persistence (swarm/event_log.py or the unified EventBus) and memory writes (brain/memory.py). Read operations should continue to use their own connections directly — only writes go through the queue. - -4. 
Add config flags: db_write_queue_size (int, default 1000), db_write_queue_timeout (float, default 30.0). - -5. Ensure the queue drains gracefully on shutdown (process remaining items before closing the connection). - -Add tests that verify: writes are serialized (no concurrent write errors), backpressure triggers when queue is full, graceful shutdown drains the queue, and reads are unaffected. Run make test to verify. -``` - -### Acceptance Criteria - -- [ ] WriteQueue class serializes SQLite writes through a single connection -- [ ] Backpressure triggers at configurable queue limit -- [ ] Event persistence and memory writes use the queue -- [ ] Reads bypass the queue and work normally -- [ ] Graceful shutdown drains remaining writes -- [ ] Tests pass (`make test`) - ---- - -## Updated Execution Order (Full 28 Tickets) - -| Order | Ticket | Title | Tier | Scope | -|-------|--------|-------|------|-------| -| 1 | T1 | WAL mode for all SQLite databases | 1 | S | -| 2 | T2 | Lazy initialization for singletons | 1 | M | -| 3 | T7 | MockLLM for deterministic tests | 1 | S | -| 4 | T19 | Security threat model documentation | 1 | M | -| 5 | T3 | Unify EventBus with swarm persistence | 1 | M | -| 6 | T4 | Consolidate MEMORY.md into UnifiedMemory | 1 | M | -| 7 | T20 | OpenTelemetry observability infrastructure | 1 | M | -| 8 | T5 | MemoryFacade and tiered memory API | 2 | L | -| 9 | T12 | Namespace config.py with nested models | 2 | M | -| 10 | T11 | sqlite-vec pluggable vector index | 2 | M | -| 11 | T22 | YAML workflow schema validation | 2 | S | -| 12 | T23 | ToolSubstrate abstraction | 2 | M | -| 13 | T24 | Event sourcing snapshot and compaction | 2 | M | -| 14 | T21 | Human-in-the-loop approval gates | 2 | M | -| 15 | T27 | Timmy Console introspection dashboard | 2 | M | -| 16 | T8 | AgentCore perceive-decide-act-reflect | 2 | M | -| 17 | T9 | TaskState machine with persistence | 2 | M | -| 18 | T25 | Data-driven policy engine | 2 | M | -| 19 | T10 | LLM-based IntentClassifier | 2 | M |
-| 20 | T14 | Structured AgentMessage model | 3 | S | -| 21 | T13 | AgentPool for concurrent execution | 3 | M | -| 22 | T15 | Capability-based permission model | 3 | M | -| 23 | T28 | SQLite write queue with backpressure | 3 | M | -| 24 | T6 | MemoryConsolidator background task | 3 | M | -| 25 | T26 | LN payment error handling | 3 | M | -| 26 | T16 | GhostCore scaffolding | 4 | L | -| 27 | T17 | Golden-path and chaos tests | 4 | M | -| 28 | T18 | ToolCapability model and registry | 4 | M | - -*T1, T2, T19 can all run in parallel as they have no dependencies on each other.* - ---- - -## Architectural Review Notes (2026-03-08) - -**Source:** Independent technical review of the 28-ticket roadmap. - -### Verdict - -| Aspect | Assessment | -|--------|------------| -| Overall direction | Good — consolidation first, features second | -| Tier 1 (Tickets 1–5) | Execute fully. Reduces bloat, improves testability. | -| Tier 2 (Tickets 6–11) | Reasonable operational features. Selectively execute. Skip rate limiting if behind a gateway; skip validation middleware if FastAPI already covers it. | -| Tier 3 (Tickets 12–20) | **Major bloat risk.** Pick 2–3 max. Keep WebSocket (if real-time needed), metrics (if lacking observability), and retention. **Skip multi-tenancy and plugins unless paying customers demand them.** | - -### What's Good (Anti-Bloat) - -1. **Tier 1 consolidates, doesn't add** — WAL mode, lazy init, EventBus unification, memory consolidation all reduce surface area. -2. **Several tickets explicitly reduce complexity** — Ticket 2 removes import-time side effects, Ticket 3 unifies two event interfaces into one, Ticket 5 creates a facade. -3. **Scope estimates are realistic** — 4S, 9M, 4L, 3XL distribution is honest. - -### Bloat Risks to Watch - -1. **Tier 3 has 9 tickets** — multi-tenancy (XL) and plugin system are massive complexity multipliers. Only build if there's real demand. -2.
**Some features duplicate ecosystem tools** — Prometheus metrics adds ops complexity; backup systems often already exist at the infrastructure layer; config hot-reload is nice-to-have. -3. **Ticket prompts are overly prescriptive** — they specify implementation details ("sqlite-vec or similar") better left to the implementer. - -### Recommendation - -Execute **Tier 1 fully** (done: #1, #2, #3). Execute **Tier 2 selectively**. **Cut Tier 3 in half** — the plan is well-organized but ~12 tickets would suffice for real-world needs. - ---- - -## Clean Architecture Review (2026-03-08) - -**Source:** Second independent review — Clean Architecture critique. - -### Core Problem: Infrastructure-First, Not Domain-First - -This plan is almost entirely infrastructure thinking. Clean Architecture (Martin, -Hexagonal/Ports & Adapters, Onion) prescribes: start with Entities and Use Cases, -then build infrastructure as adapters. This plan inverts that. - -### What's Missing - -| Gap | Detail | -|-----|--------| -| No domain layer | 0 tickets define domain entities, use cases, or business invariants | -| Database drives design | Tickets 1, 4, 13, 17–19 are all SQLite schema/storage mechanics. DB should be a swappable detail. | -| No dependency rule | Everything depends on `brain/memory.py`, `swarm/event_log.py` — infrastructure modules. Dependencies should point inward: Infra → Use Cases → Entities. | -| Facades hide, don't abstract | Ticket 5's "MemoryFacade" exposes storage tiers (`store(tier, key, value)`), not domain operations (`recordTimeEntry()`, `generateReport()`). 
| - -### What Clean Architecture Would Look Like - -| This Plan | Clean Architecture | -|-----------|--------------------| -| "Add WAL mode for SQLite" | Define `TimeEntryRepository` interface in domain; SQLite is one implementation | -| "MemoryFacade with 4 tiers" | Domain entities with clear lifecycle; storage strategy is infrastructure | -| "Unify EventBus with event log" | Domain events (`TimeEntryRecorded`) published to abstract event bus | -| "Multi-tenancy support" | `TenantId` value object in domain; infrastructure handles isolation | - -### Verdict - -This plan will produce a system that: -- Has high load handling (WAL, circuit breakers) -- Is observable (metrics, logging) -- Has lots of features (WebSocket, plugins, embeddings) -- **Has no clear domain boundaries** -- **Is hard to unit test without the database** -- **Will be hard to refactor when requirements change** - -### Recommendation - -> **Stop after Ticket 2 (lazy init) and define:** -> 1. What are the domain entities? (TimeEntry? Agent? Task?) -> 2. What are the use cases? (Record time? Generate report? Hand off?) -> 3. What interfaces do use cases need? (Repository? EventPublisher?) -> -> Then build infrastructure (SQLite, EventBus, WebSocket) as adapters -> implementing those interfaces — not as the foundation. - -### Action Items - -- [ ] Before Ticket 4: define domain entities and repository interfaces -- [ ] Ticket 5 (MemoryFacade): reframe as domain operations, not storage tiers -- [ ] All new abstractions: dependency rule — domain depends on nothing, infra depends on domain -- [ ] Evaluate remaining tickets through "is this domain or infrastructure?" lens - ---- - -## 2000-Line Philosophy Review (2026-03-08) - -**Source:** Third independent review — radical simplicity critique. 
- -### Core Argument: This Plan Abandons YAGNI - -| 2000-Line Philosophy | This Plan | -|----------------------|-----------| -| Small, comprehensible units | 28 tickets, 4 tiers, XL scopes | -| YAGNI — prove you need it | Multi-tenancy, plugins, semantic search, hot-reload — all speculative | -| Delete code, don't add it | Mostly adding infrastructure | -| One database, simple schema | WAL tuning, retention, archival, backups, multi-tenant modes | -| A few solid abstractions | MemoryFacade + 4 tiers + EventBus + Plugin system + Job queue | -| Understand the whole system | "Drop each Claude Code Prompt into a fresh session" — you won't understand it | - -### Telltale Signs - -1. **Prompts are massive.** Each "Claude Code Prompt" is 200+ words of prescriptive - implementation detail. That's outsourcing thinking, not planning. -2. **28 tickets for a time dashboard.** Cumulative surface area is enormous. - A 2000-line codebase has ~10–15 source files, 3–4 core abstractions, one way - to do things. This plan creates 20+ new modules. -3. **"S" scope tickets aren't small.** Ticket 1 (WAL mode) touches multiple databases, - requires shared helpers, needs tests across the codebase — that's a cross-cutting concern. - -### What 2000 Lines Looks Like - -```python -# memory.py (~150 lines) -class Memory: - def get(self, key): ... - def set(self, key, value): ... - # SQLite behind a simple interface. No WAL tuning exposed. - # No "tiers." No "facade." Just store and retrieve. - -# events.py (~100 lines) -class Events: - def publish(self, event): ... - # SQLite table. Simple. Blocking is fine for now. -``` - -Need multi-tenancy later? Fork and add it when a customer pays for it. -Need plugins? Monkey-patch or add a hook — in 20 lines. -Need semantic search? `grep` works surprisingly well under 10k documents. - -### Verdict - -> The 2000-line philosophy isn't about the number — it's about being willing -> to say **"no."** This plan says yes to everything. 
That's not architecture — -> that's accumulation. - -### Recommendation - -> **Cut to 4 tickets max.** WAL mode (if hitting contention), lazy init (for -> tests), and maybe health checks. Everything else waits until it hurts. - -### Decision Matrix (Updated with All 3 Reviews) - -| Ticket | Bloat Review | Clean Arch Review | 2000-Line Review | Final Call | -|--------|-------------|-------------------|-------------------|-----------| -| T1 WAL mode | Do it | Infrastructure detail | Do if contention exists | **DONE** | -| T2 Lazy init | Do it | Good for testability | Do it | **DONE** | -| T3 EventBus unify | Do it | Needs domain events first | Overkill | **DONE** | -| T4 Memory consolidation | Do it | Define domain entities first | Wait until it hurts | **BLOCKED: needs domain model** | -| T5 MemoryFacade | Do it | Reframe as domain ops | Overkill — one `Memory` class | **BLOCKED: needs domain model** | -| T7 MockLLM | Do it | Good | Good | **NEXT** | -| T19 Threat model | Do it | Good | Skip unless deploying | Evaluate | -| T20 OpenTelemetry | Selective | Infrastructure | Skip | Skip | -| T6–T18, T21–T28 | Cut in half | Define domain first | Cut to zero | **PARKED** | diff --git a/docs/PLAN_ARCHIVE.md b/docs/PLAN_ARCHIVE.md deleted file mode 100644 index a54ab1bf..00000000 --- a/docs/PLAN_ARCHIVE.md +++ /dev/null @@ -1,478 +0,0 @@ -# Plan: Full Creative & DevOps Capabilities for Timmy - -## Overview - -Add five major capability domains to Timmy's agent system, turning it into a -sovereign creative studio and full-stack DevOps operator. All tools are -open-source, self-hosted, and GPU-accelerated where needed. - ---- - -## Phase 1: Git & DevOps Tools (Forge + Helm personas) - -**Goal:** Timmy can observe local/remote repos, read code, create branches, -stage changes, commit, diff, log, and manage PRs — all through the swarm -task system with Spark event capture. 
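The plan wires git awareness into `_infer_tools_needed()` in `src/swarm/tool_executor.py` so the swarm can route git-flavored tasks to Forge and Helm. A minimal sketch of that keyword routing — the keyword list comes from the plan, while the function shape and matching logic are illustrative guesses, not the repo's actual implementation:

```python
# Hypothetical sketch of the _infer_tools_needed() enhancement.
# GIT_KEYWORDS mirrors the plan (commit, branch, push, pull, diff,
# clone, merge); the word-set matching strategy is an assumption.
GIT_KEYWORDS = {"commit", "branch", "push", "pull", "diff", "clone", "merge"}

def infer_tools_needed(task_description: str) -> set[str]:
    """Return the tool groups a task description appears to need."""
    words = set(task_description.lower().split())
    needed: set[str] = set()
    if words & GIT_KEYWORDS:
        needed.add("git")
    return needed
```

Keyword routing like this is cheap but deliberately naive — "branch office" would match just as well as "create a branch" — which is one reason the plan also adds per-invocation `on_tool_executed()` events to Spark, so misroutes are at least observable.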
- -### New module: `src/tools/git_tools.py` - -Tools to add (using **GitPython** — BSD-3, `pip install GitPython`): - -| Tool | Function | Persona Access | -|---|---|---| -| `git_clone` | Clone a remote repo to local path | Forge, Helm | -| `git_status` | Show working tree status | Forge, Helm, Timmy | -| `git_diff` | Show staged/unstaged diffs | Forge, Helm, Timmy | -| `git_log` | Show recent commit history | Forge, Helm, Echo, Timmy | -| `git_branch` | List/create/switch branches | Forge, Helm | -| `git_add` | Stage files for commit | Forge, Helm | -| `git_commit` | Create a commit with message | Forge, Helm | -| `git_push` | Push to remote | Forge, Helm | -| `git_pull` | Pull from remote | Forge, Helm | -| `git_blame` | Show line-by-line authorship | Forge, Echo | -| `git_stash` | Stash/pop changes | Forge, Helm | - -### Changes to existing files - -- **`src/timmy/tools.py`** — Add `create_git_tools()` factory, wire into - `PERSONA_TOOLKITS` for Forge and Helm -- **`src/swarm/tool_executor.py`** — Enhance `_infer_tools_needed()` with - git keywords (commit, branch, push, pull, diff, clone, merge) -- **`src/config.py`** — Add `git_default_repo_dir: str = "~/repos"` setting -- **`src/spark/engine.py`** — Add `on_tool_executed()` method to capture - individual tool invocations (not just task-level events) -- **`src/swarm/personas.py`** — Add git-related keywords to Forge and Helm - preferred_keywords - -### New dependency - -```toml -# pyproject.toml -dependencies = [ - ..., - "GitPython>=3.1.40", -] -``` - -### Dashboard - -- **`/tools`** page updated to show git tools in the catalog -- Git tool usage stats visible per agent - -### Tests - -- `tests/test_git_tools.py` — test all git tool functions against tmp repos -- Mock GitPython's `Repo` class for unit tests - ---- - -## Phase 2: Image Generation (new "Pixel" persona) - -**Goal:** Generate storyboard frames and standalone images from text prompts -using FLUX.2 Klein 4B locally. 
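Both creative phases lean on the same pattern: load a heavyweight model lazily on first use, then keep it cached in memory for subsequent calls. Independent of diffusers or ACE-Step, that pattern can be sketched with a one-slot cache — the loader body here is a stand-in (`object()` in place of a real pipeline), and the `generate_image` signature only echoes the plan's defaults:

```python
import uuid
from functools import lru_cache
from pathlib import Path

@lru_cache(maxsize=1)
def get_pipeline():
    """Load the model once, on first call; later calls reuse the cached instance."""
    # Stand-in for an expensive load such as
    # FluxPipeline.from_pretrained("black-forest-labs/FLUX.2-klein-4b").
    return object()

def generate_image(prompt: str, output_dir: str = "data/images") -> str:
    pipeline = get_pipeline()  # cheap after the first call
    out = Path(output_dir) / f"{uuid.uuid4()}.png"
    # Real code would run the pipeline and save the result:
    #   pipeline(prompt).images[0].save(out)
    return str(out)
```

The one-slot `lru_cache` keeps the sketch honest about the trade-off the plan accepts: first call pays the full load cost, every later call is effectively free, and the model stays resident in (V)RAM until the process exits.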
- -### New persona: Pixel — Visual Architect - -```python -"pixel": { - "id": "pixel", - "name": "Pixel", - "role": "Visual Architect", - "description": "Image generation, storyboard frames, and visual design.", - "capabilities": "image-generation,storyboard,design", - "rate_sats": 80, - "bid_base": 60, - "bid_jitter": 20, - "preferred_keywords": [ - "image", "picture", "photo", "draw", "illustration", - "storyboard", "frame", "visual", "design", "generate", - "portrait", "landscape", "scene", "artwork", - ], -} -``` - -### New module: `src/tools/image_tools.py` - -Tools (using **diffusers** + **FLUX.2 Klein 4B** — Apache 2.0): - -| Tool | Function | -|---|---| -| `generate_image` | Text-to-image generation (returns file path) | -| `generate_storyboard` | Generate N frames from scene descriptions | -| `image_variations` | Generate variations of an existing image | - -### Architecture - -``` -generate_image(prompt, width=1024, height=1024, steps=4) - → loads FLUX.2 Klein via diffusers FluxPipeline - → saves to data/images/{uuid}.png - → returns path + metadata -``` - -- Model loaded lazily on first use, kept in memory for subsequent calls -- Falls back to CPU generation (slower) if no GPU -- Output saved to `data/images/` with metadata JSON sidecar - -### New dependency (optional extra) - -```toml -[project.optional-dependencies] -creative = [ - "diffusers>=0.30.0", - "transformers>=4.40.0", - "accelerate>=0.30.0", - "torch>=2.2.0", - "safetensors>=0.4.0", -] -``` - -### Config - -```python -# config.py additions -flux_model_id: str = "black-forest-labs/FLUX.2-klein-4b" -image_output_dir: str = "data/images" -image_default_steps: int = 4 -``` - -### Dashboard - -- `/creative/ui` — new Creative Studio page (image gallery + generation form) -- HTMX-powered: submit prompt, poll for result, display inline -- Gallery view of all generated images with metadata - -### Tests - -- `tests/test_image_tools.py` — mock diffusers pipeline, test prompt handling, - file output, 
storyboard generation - ---- - -## Phase 3: Music Generation (new "Lyra" persona) - -**Goal:** Generate full songs with vocals, instrumentals, and lyrics using -ACE-Step 1.5 locally. - -### New persona: Lyra — Sound Weaver - -```python -"lyra": { - "id": "lyra", - "name": "Lyra", - "role": "Sound Weaver", - "description": "Music and song generation with vocals, instrumentals, and lyrics.", - "capabilities": "music-generation,vocals,composition", - "rate_sats": 90, - "bid_base": 70, - "bid_jitter": 20, - "preferred_keywords": [ - "music", "song", "sing", "vocal", "instrumental", - "melody", "beat", "track", "compose", "lyrics", - "audio", "sound", "album", "remix", - ], -} -``` - -### New module: `src/tools/music_tools.py` - -Tools (using **ACE-Step 1.5** — Apache 2.0, `pip install ace-step`): - -| Tool | Function | -|---|---| -| `generate_song` | Text/lyrics → full song (vocals + instrumentals) | -| `generate_instrumental` | Text prompt → instrumental track | -| `generate_vocals` | Lyrics + style → vocal track | -| `list_genres` | Return supported genre/style tags | - -### Architecture - -``` -generate_song(lyrics, genre="pop", duration=120, language="en") - → loads ACE-Step model (lazy, cached) - → generates audio - → saves to data/music/{uuid}.wav - → returns path + metadata (duration, genre, etc.) -``` - -- Model loaded lazily, ~4GB VRAM minimum -- Output saved to `data/music/` with metadata sidecar -- Supports 19 languages, genre tags, tempo control - -### New dependency (optional extra, extends `creative`) - -```toml -[project.optional-dependencies] -creative = [ - ..., - "ace-step>=1.5.0", -] -``` - -### Config - -```python -music_output_dir: str = "data/music" -ace_step_model: str = "ace-step/ACE-Step-v1.5" -``` - -### Dashboard - -- `/creative/ui` expanded with Music tab -- Audio player widget (HTML5 `