forked from Rockachopa/Timmy-time-dashboard
---

## Ticket 1: Add WAL mode for all SQLite databases ✅ COMPLETED

**Priority:** Tier 1
**Estimated scope:** S
**Dependencies:** none
**Files to modify:** `brain/memory.py`, `swarm/event_log.py`, any SQLite init helpers under `data/` or `swarm/`
**Files to read first:** `CLAUDE.md`, `AGENTS.md`, `brain/memory.py`, `swarm/event_log.py`
**Status:** DONE (2026-03-08) — WAL + `busy_timeout=5000` added to `brain/memory.py`, `swarm/event_log.py`, `spark/memory.py`, `spark/eidos.py`, `swarm/task_queue/models.py`, and `infrastructure/models/registry.py`. 8 new tests across 4 files.

### Objective

---

## Ticket 2: Introduce lazy initialization for critical singletons ✅ COMPLETED

**Priority:** Tier 1
**Estimated scope:** M
**Dependencies:** 1
**Files to modify:** `spark/engine.py`, `config.py`, `infrastructure/events/bus.py`, `timmy/memory_system.py`, `infrastructure/router/cascade.py`
**Files to read first:** `CLAUDE.md`, `AGENTS.md`, `spark/engine.py`, `config.py`, `infrastructure/events/bus.py`, `timmy/memory_system.py`, `infrastructure/router/cascade.py`
**Status:** DONE (2026-03-08) — `config.py` startup validation moved to `validate_startup()`. `spark_engine`, `memory_system`, and `event_bus` all use lazy getters with `__getattr__` backward compat. 15 new tests, 953 passing.

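The pattern named in the status line ("lazy getters with `__getattr__` backward compat") looks roughly like this; module and class names here are illustrative, not the repo's:

```python
_event_bus = None            # not constructed at import time

class EventBus:
    def __init__(self):
        self.subscribers = []

def get_event_bus() -> EventBus:
    # Lazy getter: the singleton is built on first use, so importing the
    # module stays side-effect free (the point of this ticket).
    global _event_bus
    if _event_bus is None:
        _event_bus = EventBus()
    return _event_bus

def __getattr__(name):
    # PEP 562 module-level __getattr__: old call sites that do
    # `from module import event_bus` keep working without eager init.
    if name == "event_bus":
        return get_event_bus()
    raise AttributeError(name)
```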

---

## Ticket 3: Unify EventBus with swarm event log persistence ✅ COMPLETED

**Priority:** Tier 1
**Estimated scope:** M
**Dependencies:** 1, 2
**Files to modify:** `infrastructure/events/bus.py`, `swarm/event_log.py`, any event-related models or helpers
**Files to read first:** `CLAUDE.md`, `AGENTS.md`, `infrastructure/events/bus.py`, `swarm/event_log.py`
**Status:** DONE (2026-03-08) — EventBus gains `enable_persistence()` + `replay()`. `log_event()` bridges to EventBus. App startup enables persistence. 10 new tests, 308 passing.

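A sketch of the unified bus, assuming only the API names given in the status line (`enable_persistence()`, `replay()`); the table schema and wiring are guesses, not the repo's implementation:

```python
import json
import sqlite3

class EventBus:
    """In-memory pub/sub with optional SQLite persistence and replay."""

    def __init__(self):
        self._subs = []
        self._db = None

    def subscribe(self, handler):
        self._subs.append(handler)

    def enable_persistence(self, path=":memory:"):
        self._db = sqlite3.connect(path)
        self._db.execute(
            "CREATE TABLE IF NOT EXISTS events (topic TEXT, payload TEXT)")

    def publish(self, topic, payload):
        if self._db is not None:   # record durably first, then fan out
            self._db.execute("INSERT INTO events VALUES (?, ?)",
                             (topic, json.dumps(payload)))
            self._db.commit()
        for handler in self._subs:
            handler(topic, payload)

    def replay(self):
        # Yield every persisted event in insertion order.
        for topic, payload in self._db.execute(
                "SELECT topic, payload FROM events"):
            yield topic, json.loads(payload)
```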
### Objective

---

Replace ad-hoc routing state with an explicit TaskState enum and a TaskContext structure.

```
You are working in the Timmy Time Dashboard repo. Read CLAUDE.md and AGENTS.md, then open timmy/agents/timmy.py, infrastructure/events/bus.py, and swarm/event_log.py. Define a TaskState enum and a TaskContext dataclass (or pydantic model, consistent with project style) that captures task lifecycle phases (e.g., RECEIVED, CLASSIFIED, ROUTED, EXECUTING, REVIEWING, COMPLETED, FAILED) and associated metadata such as classification, assigned_agent, execution_history, and identifiers. Modify TimmyOrchestrator (or the main orchestrator class) to use TaskContext instances for each incoming request, updating state via a validated transition method that also emits events via the unified EventBus so each transition is recorded. Persist TaskContext snapshots using either the event log (event-sourced) or a dedicated table in an existing DB, following existing patterns in swarm/event_log.py or brain/memory.py, to allow replaying or inspecting task histories. Add tests that ensure legal transitions are allowed, illegal ones are rejected, events are emitted on transitions, and a simple task's life can be reconstructed from persisted data. Run make test to verify.
```


### Architecture Note (added 2026-03-08)

> **Schema versioning required.** All persisted Pydantic models (TaskContext,
> AgentMessage) MUST include `schema_version: int = 1`. Without this, event
> replay/reconstruction will break when fields change in later tickets.
> Also add a `trace_id: str` field to TaskContext so the full execution graph
> of any user request can be reconstructed from the event log.

### Acceptance Criteria

- [ ] TaskState enum and TaskContext structure are defined and used in the orchestrator
- [ ] TaskContext includes `schema_version` and `trace_id` fields
- [ ] TaskContext transitions emit events and are persisted in a durable store
- [ ] Orchestrator uses TaskContext instead of loosely structured dicts/state for routing
- [ ] Tests validate transitions, persistence, and basic reconstruction of task history

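The validated transition method described in the prompt, carrying the `schema_version` and `trace_id` fields the architecture note requires, could look like this (states are from the prompt; the legal-transition table is an illustrative guess, and event emission/persistence is omitted):

```python
from dataclasses import dataclass, field
from enum import Enum, auto
from typing import Optional

class TaskState(Enum):
    RECEIVED = auto()
    CLASSIFIED = auto()
    ROUTED = auto()
    EXECUTING = auto()
    REVIEWING = auto()
    COMPLETED = auto()
    FAILED = auto()

# Illustrative transition table; FAILED is reachable from any active state.
_LEGAL = {
    TaskState.RECEIVED:   {TaskState.CLASSIFIED, TaskState.FAILED},
    TaskState.CLASSIFIED: {TaskState.ROUTED, TaskState.FAILED},
    TaskState.ROUTED:     {TaskState.EXECUTING, TaskState.FAILED},
    TaskState.EXECUTING:  {TaskState.REVIEWING, TaskState.COMPLETED,
                           TaskState.FAILED},
    TaskState.REVIEWING:  {TaskState.COMPLETED, TaskState.FAILED},
}

@dataclass
class TaskContext:
    trace_id: str
    schema_version: int = 1
    state: TaskState = TaskState.RECEIVED
    classification: Optional[str] = None
    assigned_agent: Optional[str] = None
    execution_history: list = field(default_factory=list)

    def transition(self, new_state: TaskState) -> None:
        # Reject illegal moves; record legal ones so the lifecycle
        # can be reconstructed later.
        if new_state not in _LEGAL.get(self.state, set()):
            raise ValueError(f"illegal transition {self.state} -> {new_state}")
        self.execution_history.append((self.state, new_state))
        self.state = new_state
```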

---

Replace brittle keyword-based routing in TimmyOrchestrator with an LLM-based IntentClassifier.

```
You are working in the Timmy Time Dashboard repo. Read CLAUDE.md and AGENTS.md, then inspect timmy/agents/timmy.py and the LLM routing modules under infrastructure/router/, particularly CascadeRouter. Implement an IntentClassifier class (e.g., timmy/agents/intent_classifier.py) that uses the project's LLM abstraction to classify requests into categories such as DIRECT, RESEARCH, CODE, MEMORY, CREATIVE, and COMPLEX, based on the prompt template described in the research. Integrate this classifier into TimmyOrchestrator so that for each incoming request, the orchestrator calls IntentClassifier.classify and uses the result to choose the appropriate agent or workflow instead of relying on keyword lists. Add a simple caching mechanism (e.g., LRU cache keyed by normalized request strings) to avoid repeated LLM calls for identical or highly similar inputs, respecting project memory constraints. Update or add tests (using MockLLM) to verify that requests are classified into the correct categories and that orchestrator routing behaves appropriately when the classifier returns each label. Run make test to verify.
```


### Architecture Note (added 2026-03-08)

> **Pivot from LLM classify to embedding similarity.** An LLM call adds 500ms–2s
> of latency to the critical routing path. Instead, use embedding vectors + cosine
> similarity against a curated set of "Intent Vectors" (sub-10ms, deterministic).
> This reuses the sqlite-vec infrastructure from Ticket 11. Keep the LLM fallback
> for truly ambiguous requests only.

### Acceptance Criteria

- [ ] An IntentClassifier is implemented and integrated into the orchestrator
- [ ] Primary routing uses embedding similarity, not LLM calls (LLM as fallback only)
- [ ] Keyword-based routing is minimized or removed for primary paths
- [ ] Classification can be tested via MockLLM and deterministic embeddings
- [ ] Tests cover classification behavior and routing outcomes for each category
- [ ] Tests pass (`make test`)

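The pivot the note describes reduces primary routing to a nearest-vector lookup. A toy sketch with hand-made 3-d "Intent Vectors" (real vectors would come from the embedding model and the sqlite-vec store; the names, threshold, and dimensions here are illustrative):

```python
import math

# Hypothetical curated "Intent Vectors": one representative embedding
# per category. Toy 3-d vectors, just to show the routing math.
INTENT_VECTORS = {
    "DIRECT":   [1.0, 0.0, 0.0],
    "RESEARCH": [0.0, 1.0, 0.0],
    "CODE":     [0.0, 0.0, 1.0],
}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def classify(embedding, threshold=0.6):
    # Nearest intent wins; below the threshold the request counts as
    # ambiguous and would fall back to the (slower) LLM classifier.
    label, score = max(
        ((name, cosine(embedding, vec)) for name, vec in INTENT_VECTORS.items()),
        key=lambda item: item[1])
    return label if score >= threshold else "AMBIGUOUS"
```

Because the lookup is deterministic, tests can feed fixed vectors and assert exact labels, which is what the "deterministic embeddings" criterion asks for.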

---

Add an AgentPool abstraction that manages concurrent agent execution with a semaphore.

```
You are working in the Timmy Time Dashboard repo. Read CLAUDE.md and AGENTS.md, then open timmy/agents/timmy.py and any agent loop implementation that wraps Agno's Agent.run in asyncio.to_thread. Implement an AgentPool class that uses an asyncio.Semaphore and a ThreadPoolExecutor (with configurable max_concurrent) to execute blocking agent.run calls, exposing an async execute(agent, message) interface. Replace direct calls to asyncio.to_thread or equivalent ad-hoc patterns in the agentic loop with calls to AgentPool.execute, wiring the pool into the orchestrator or agent manager via dependency injection or a lazy getter, consistent with the project's DI approach. Ensure proper cleanup of the ThreadPoolExecutor on shutdown (e.g., FastAPI shutdown events or equivalent hooks). Add tests with a fake or Mock agent that simulate multiple concurrent executions, verifying that concurrency limits are respected and that results are returned correctly. Run make test to verify.
```


### Architecture Note (added 2026-03-08)

> **SQLite write contention risk.** Even with WAL mode (Ticket 1), 4 concurrent
> agents hitting brain.db in a tight loop will cause `database is locked` errors.
> Python's sqlite3 driver handles concurrency poorly. Wrap database writes in an
> `asyncio.Lock()` at the application layer, or adopt `aiosqlite` for proper
> async SQLite access. This is critical for AgentPool to function correctly.

### Acceptance Criteria

- [ ] AgentPool exists and controls concurrent use of blocking agents
- [ ] Agentic loop no longer directly uses `asyncio.to_thread`; it uses AgentPool
- [ ] Database write paths use `asyncio.Lock()` or `aiosqlite` to prevent lock contention
- [ ] Concurrency limits are configurable via settings
- [ ] Executor shutdown is handled gracefully
- [ ] Tests cover basic concurrency behavior


---

Replace loosely structured dict messages between agents with a structured AgentMessage model.

```
You are working in the Timmy Time Dashboard repo. Read CLAUDE.md and AGENTS.md, then inspect timmy/agents/timmy.py and any code that sends inter-agent messages via the EventBus. Implement an AgentMessage dataclass or pydantic model (e.g., in timmy/agents/messages.py) containing fields such as from_agent, to_agent, message_type (request, response, delegate, escalate), content, context, thread_id, priority, and requires_response. Update inter-agent communication code to construct and emit AgentMessage instances instead of generic dicts, while preserving the underlying event/topic names used by EventBus. Ensure serialization and deserialization for events is handled consistently (e.g., via .dict() or model_dump()), keeping backward compatibility where needed for existing consumers. Add tests that verify AgentMessage roundtripping through the event bus and that fields are populated as expected in typical workflows. Run make test to verify.
```


### Architecture Note (added 2026-03-08)

> **Schema versioning + trace_id required.** AgentMessage MUST include
> `schema_version: int = 1` for forward-compatible event replay. Also include
> a `trace_id: str` field that correlates all messages belonging to a single
> user request, enabling full execution graph reconstruction from the event log.

### Acceptance Criteria

- [ ] AgentMessage model exists and is used for inter-agent communication
- [ ] AgentMessage includes `schema_version` and `trace_id` fields
- [ ] EventBus payloads carry structured messages instead of arbitrary dicts
- [ ] Serialization and deserialization work correctly with existing event infrastructure
- [ ] Tests confirm structure and roundtrip behavior

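A roundtrip-safe AgentMessage can be sketched as a dataclass (the prompt allows dataclass or pydantic); the field set comes from the prompt plus the note's `schema_version` and `trace_id`, while the payload-helper names are illustrative:

```python
from dataclasses import asdict, dataclass, field
from typing import Any, Dict, Optional

@dataclass
class AgentMessage:
    from_agent: str
    to_agent: str
    message_type: str            # "request" | "response" | "delegate" | "escalate"
    content: str
    trace_id: str                # correlates all messages of one user request
    schema_version: int = 1      # bump when fields change, for replay safety
    context: Dict[str, Any] = field(default_factory=dict)
    thread_id: Optional[str] = None
    priority: int = 0
    requires_response: bool = False

    def to_payload(self) -> Dict[str, Any]:
        # Plain dict, suitable for publishing on the existing EventBus topics.
        return asdict(self)

    @classmethod
    def from_payload(cls, payload: Dict[str, Any]) -> "AgentMessage":
        return cls(**payload)
```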

---

| 28 | T18 | ToolCapability model and registry | 4 | M |

*T1, T2, T19 can all run in parallel as they have no dependencies on each other.*

---

## Architectural Review Notes (2026-03-08)

**Source:** Independent technical review of the 28-ticket roadmap.

### Verdict

| Aspect | Assessment |
|--------|------------|
| Overall direction | Good — consolidation first, features second |
| Tier 1 (Tickets 1–5) | Execute fully. Reduces bloat, improves testability. |
| Tier 2 (Tickets 6–11) | Reasonable operational features. Execute selectively. Skip rate limiting if behind a gateway; skip validation middleware if FastAPI already covers it. |
| Tier 3 (Tickets 12–20) | **Major bloat risk.** Pick 2–3 max. Keep WebSocket (if real-time is needed), metrics (if observability is lacking), and retention. **Skip multi-tenancy and plugins unless paying customers demand them.** |

### What's Good (Anti-Bloat)

1. **Tier 1 consolidates, doesn't add** — WAL mode, lazy init, EventBus unification, and memory consolidation all reduce surface area.
2. **Several tickets explicitly reduce complexity** — Ticket 2 removes import-time side effects, Ticket 3 unifies two event interfaces into one, Ticket 5 creates a facade.
3. **Scope estimates are realistic** — the 4S/9M/4L/3XL distribution is honest.

### Bloat Risks to Watch

1. **Tier 3 has 9 tickets** — multi-tenancy (XL) and the plugin system are massive complexity multipliers. Only build them if there is real demand.
2. **Some features duplicate ecosystem tools** — Prometheus metrics adds ops complexity; backup systems often already exist at the infrastructure layer; config hot-reload is a nice-to-have.
3. **Ticket prompts are overly prescriptive** — they specify implementation details ("sqlite-vec or similar") better left to the implementer.

### Recommendation

Execute **Tier 1 fully** (done: #1, #2, #3). Execute **Tier 2 selectively**. **Cut Tier 3 in half** — the plan is well organized, but ~12 tickets would suffice for real-world needs.

---

## Clean Architecture Review (2026-03-08)

**Source:** Second independent review — a Clean Architecture critique.

### Core Problem: Infrastructure-First, Not Domain-First

This plan is almost entirely infrastructure thinking. Clean Architecture (Martin, Hexagonal/Ports & Adapters, Onion) prescribes starting with Entities and Use Cases, then building infrastructure as adapters. This plan inverts that.

### What's Missing

| Gap | Detail |
|-----|--------|
| No domain layer | 0 tickets define domain entities, use cases, or business invariants |
| Database drives design | Tickets 1, 4, 13, 17–19 are all SQLite schema/storage mechanics. The DB should be a swappable detail. |
| No dependency rule | Everything depends on `brain/memory.py` and `swarm/event_log.py` — infrastructure modules. Dependencies should point inward: Infra → Use Cases → Entities. |
| Facades hide, don't abstract | Ticket 5's "MemoryFacade" exposes storage tiers (`store(tier, key, value)`), not domain operations (`recordTimeEntry()`, `generateReport()`). |

### What Clean Architecture Would Look Like

| This Plan | Clean Architecture |
|-----------|--------------------|
| "Add WAL mode for SQLite" | Define a `TimeEntryRepository` interface in the domain; SQLite is one implementation |
| "MemoryFacade with 4 tiers" | Domain entities with a clear lifecycle; the storage strategy is infrastructure |
| "Unify EventBus with event log" | Domain events (`TimeEntryRecorded`) published to an abstract event bus |
| "Multi-tenancy support" | A `TenantId` value object in the domain; infrastructure handles isolation |

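The first row of the table, a domain-owned repository interface with SQLite as just one adapter, can be sketched like this (all names come from the review's hypothetical `TimeEntryRepository`, not the repo; the in-memory adapter stands in for a SQLite one):

```python
from typing import List, Protocol

class TimeEntry:
    """Domain entity sketch — fields are illustrative."""
    def __init__(self, task: str, minutes: int):
        self.task = task
        self.minutes = minutes

class TimeEntryRepository(Protocol):
    """Port defined by the domain; any storage backend implements it."""
    def add(self, entry: TimeEntry) -> None: ...
    def for_task(self, task: str) -> List[TimeEntry]: ...

class InMemoryTimeEntryRepository:
    """Infrastructure adapter — swappable for a SQLite-backed one,
    which is where WAL tuning would live, invisible to the domain."""
    def __init__(self):
        self._entries: List[TimeEntry] = []

    def add(self, entry: TimeEntry) -> None:
        self._entries.append(entry)

    def for_task(self, task: str) -> List[TimeEntry]:
        return [e for e in self._entries if e.task == task]
```

Use cases depend only on the `TimeEntryRepository` protocol, so they can be unit tested with the in-memory adapter and never touch a database.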
### Verdict

This plan will produce a system that:

- Handles load well (WAL, circuit breakers)
- Is observable (metrics, logging)
- Has lots of features (WebSocket, plugins, embeddings)
- **Has no clear domain boundaries**
- **Is hard to unit test without the database**
- **Will be hard to refactor when requirements change**

### Recommendation

> **Stop after Ticket 2 (lazy init) and define:**
> 1. What are the domain entities? (TimeEntry? Agent? Task?)
> 2. What are the use cases? (Record time? Generate report? Hand off?)
> 3. What interfaces do the use cases need? (Repository? EventPublisher?)
>
> Then build infrastructure (SQLite, EventBus, WebSocket) as adapters
> implementing those interfaces — not as the foundation.

### Action Items

- [ ] Before Ticket 4: define domain entities and repository interfaces
- [ ] Ticket 5 (MemoryFacade): reframe as domain operations, not storage tiers
- [ ] All new abstractions: follow the dependency rule — the domain depends on nothing; infra depends on the domain
- [ ] Evaluate the remaining tickets through an "is this domain or infrastructure?" lens

---

## 2000-Line Philosophy Review (2026-03-08)

**Source:** Third independent review — a radical-simplicity critique.

### Core Argument: This Plan Abandons YAGNI

| 2000-Line Philosophy | This Plan |
|----------------------|-----------|
| Small, comprehensible units | 28 tickets, 4 tiers, XL scopes |
| YAGNI — prove you need it | Multi-tenancy, plugins, semantic search, hot-reload — all speculative |
| Delete code, don't add it | Mostly adds infrastructure |
| One database, simple schema | WAL tuning, retention, archival, backups, multi-tenant modes |
| A few solid abstractions | MemoryFacade + 4 tiers + EventBus + plugin system + job queue |
| Understand the whole system | "Drop each Claude Code Prompt into a fresh session" — you won't understand it |

### Telltale Signs

1. **The prompts are massive.** Each "Claude Code Prompt" is 200+ words of prescriptive implementation detail. That's outsourcing thinking, not planning.
2. **28 tickets for a time dashboard.** The cumulative surface area is enormous. A 2000-line codebase has ~10–15 source files, 3–4 core abstractions, and one way to do things. This plan creates 20+ new modules.
3. **"S" scope tickets aren't small.** Ticket 1 (WAL mode) touches multiple databases, requires shared helpers, and needs tests across the codebase — that's a cross-cutting concern.

### What 2000 Lines Looks Like

```python
# memory.py (~150 lines)
class Memory:
    def get(self, key): ...
    def set(self, key, value): ...
# SQLite behind a simple interface. No WAL tuning exposed.
# No "tiers." No "facade." Just store and retrieve.

# events.py (~100 lines)
class Events:
    def publish(self, event): ...
# SQLite table. Simple. Blocking is fine for now.
```

Need multi-tenancy later? Fork and add it when a customer pays for it.
Need plugins? Monkey-patch or add a hook — in 20 lines.
Need semantic search? `grep` works surprisingly well under 10k documents.

### Verdict

> The 2000-line philosophy isn't about the number — it's about being willing
> to say **"no."** This plan says yes to everything. That's not architecture —
> that's accumulation.

### Recommendation

> **Cut to 4 tickets max.** WAL mode (if hitting contention), lazy init (for
> tests), and maybe health checks. Everything else waits until it hurts.

### Decision Matrix (Updated with All 3 Reviews)

| Ticket | Bloat Review | Clean Arch Review | 2000-Line Review | Final Call |
|--------|--------------|-------------------|------------------|------------|
| T1 WAL mode | Do it | Infrastructure detail | Do if contention exists | **DONE** |
| T2 Lazy init | Do it | Good for testability | Do it | **DONE** |
| T3 EventBus unify | Do it | Needs domain events first | Overkill | **DONE** |
| T4 Memory consolidation | Do it | Define domain entities first | Wait until it hurts | **BLOCKED: needs domain model** |
| T5 MemoryFacade | Do it | Reframe as domain ops | Overkill — one `Memory` class | **BLOCKED: needs domain model** |
| T7 MockLLM | Do it | Good | Good | **NEXT** |
| T19 Threat model | Do it | Good | Skip unless deploying | Evaluate |
| T20 OpenTelemetry | Selective | Infrastructure | Skip | Skip |
| T6–T18, T21–T28 | Cut in half | Define domain first | Cut to zero | **PARKED** |