feat: enable SQLite WAL mode for all databases (AGI ticket #1) (#153)

Alexander Whitestone
2026-03-08 16:07:02 -04:00
committed by GitHub
parent 11ba21418a
commit 82fb2417e3
31 changed files with 1042 additions and 170 deletions

View File

@@ -17,7 +17,7 @@ jobs:
python-version: "3.11"
- name: Install linters
run: pip install black==23.12.1 isort==5.13.2 bandit==1.8.0
run: pip install black==26.3.0 isort==5.13.2 bandit==1.8.0
- name: Check formatting (black)
run: black --check --line-length 100 src/ tests/

View File

@@ -6,7 +6,7 @@
repos:
# Code formatting
- repo: https://github.com/psf/black
rev: 23.12.1
rev: 26.3.0
hooks:
- id: black
language_version: python3.11
@@ -60,7 +60,7 @@ repos:
entry: timeout 30 poetry run pytest
language: system
types: [python]
stages: [commit]
stages: [pre-commit]
pass_filenames: false
always_run: true
args:
@@ -68,8 +68,8 @@ repos:
- -q
- --tb=short
- --timeout=10
- -m
- unit
- -p
- no:xdist
- --ignore=tests/e2e
- --ignore=tests/functional
verbose: true

View File

@@ -6,13 +6,12 @@
---
## Ticket 1: Add WAL mode for all SQLite databases
## Ticket 1: Add WAL mode for all SQLite databases ✅ COMPLETED
**Priority:** Tier 1
**Estimated scope:** S
**Dependencies:** none
**Files to modify:** `brain/memory.py`, `swarm/event_log.py`, any SQLite init helpers under `data/` or `swarm/`
**Files to read first:** `CLAUDE.md`, `AGENTS.md`, `brain/memory.py`, `swarm/event_log.py`
**Status:** DONE (2026-03-08) — WAL + busy_timeout=5000 added to brain/memory.py, swarm/event_log.py, spark/memory.py, spark/eidos.py, swarm/task_queue/models.py, infrastructure/models/registry.py. 8 new tests across 4 files.
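The pattern applied across those files can be sketched as a small connection helper (the helper name is hypothetical; the repo applies the pragmas inline per module):

```python
import sqlite3
from pathlib import Path


def connect_wal(db_path: Path) -> sqlite3.Connection:
    """Open a SQLite connection with WAL journaling and a 5s busy timeout.

    WAL allows concurrent readers alongside a single writer; busy_timeout
    makes writers wait instead of failing immediately with
    'database is locked'.
    """
    db_path.parent.mkdir(parents=True, exist_ok=True)
    conn = sqlite3.connect(str(db_path))
    conn.row_factory = sqlite3.Row
    conn.execute("PRAGMA journal_mode=WAL")
    conn.execute("PRAGMA busy_timeout=5000")
    return conn
```

Note that `journal_mode=WAL` is persistent per database file, while `busy_timeout` is per connection, which is why the diff below sets the timeout again in every `_get_conn`.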
### Objective
@@ -34,11 +33,12 @@ You are working in the Timmy Time Dashboard repo. First, read CLAUDE.md and AGEN
---
## Ticket 2: Introduce lazy initialization for critical singletons
## Ticket 2: Introduce lazy initialization for critical singletons ✅ COMPLETED
**Priority:** Tier 1
**Estimated scope:** M
**Dependencies:** 1
**Status:** DONE (2026-03-08) — config.py startup validation moved to validate_startup(). spark_engine, memory_system, event_bus all use lazy getters with __getattr__ backward compat. 15 new tests, 953 passing.
**Files to modify:** `spark/engine.py`, `config.py`, `infrastructure/events/bus.py`, `timmy/memory_system.py`, `infrastructure/router/cascade.py`
**Files to read first:** `CLAUDE.md`, `AGENTS.md`, `spark/engine.py`, `config.py`, `infrastructure/events/bus.py`, `timmy/memory_system.py`, `infrastructure/router/cascade.py`
@@ -62,13 +62,12 @@ You are working in the Timmy Time Dashboard repo. First read CLAUDE.md and AGENT
---
## Ticket 3: Unify EventBus with swarm event log persistence
## Ticket 3: Unify EventBus with swarm event log persistence ✅ COMPLETED
**Priority:** Tier 1
**Estimated scope:** M
**Dependencies:** 1, 2
**Files to modify:** `infrastructure/events/bus.py`, `swarm/event_log.py`, any event-related models or helpers
**Files to read first:** `CLAUDE.md`, `AGENTS.md`, `infrastructure/events/bus.py`, `swarm/event_log.py`
**Status:** DONE (2026-03-08) — EventBus gains enable_persistence() + replay(). log_event() bridges to EventBus. App startup enables persistence. 10 new tests, 308 passing.
### Objective
@@ -250,9 +249,18 @@ Replace ad-hoc routing state with an explicit TaskState enum and a TaskContext s
You are working in the Timmy Time Dashboard repo. Read CLAUDE.md and AGENTS.md, then open timmy/agents/timmy.py, infrastructure/events/bus.py, and swarm/event_log.py. Define a TaskState enum and a TaskContext dataclass (or pydantic model, consistent with project style) that captures task lifecycle phases (e.g., RECEIVED, CLASSIFIED, ROUTED, EXECUTING, REVIEWING, COMPLETED, FAILED) and associated metadata such as classification, assigned_agent, execution_history, and identifiers. Modify TimmyOrchestrator (or the main orchestrator class) to use TaskContext instances for each incoming request, updating state via a validated transition method that also emits events via the unified EventBus so each transition is recorded. Persist TaskContext snapshots using either the event log (event-sourced) or a dedicated table in an existing DB, following existing patterns in swarm/event_log.py or brain/memory.py, to allow replaying or inspecting task histories. Add tests that ensure legal transitions are allowed, illegal ones are rejected, events are emitted on transitions, and a simple task's life can be reconstructed from persisted data. Run make test to verify.
```
### Architecture Note (added 2026-03-08)
> **Schema versioning required.** All persisted Pydantic models (TaskContext,
> AgentMessage) MUST include `schema_version: int = 1`. Without this, event
> replay/reconstruction will break when fields change in later tickets.
> Also add a `trace_id: str` field to TaskContext so the full execution graph
> of any user request can be reconstructed from the event log.
### Acceptance Criteria
- [ ] TaskState enum and TaskContext structure are defined and used in the orchestrator
- [ ] TaskContext includes `schema_version` and `trace_id` fields
- [ ] TaskContext transitions emit events and are persisted in a durable store
- [ ] Orchestrator uses TaskContext instead of loosely structured dicts/state for routing
- [ ] Tests validate transitions, persistence, and basic reconstruction of task history
@@ -278,11 +286,20 @@ Replace brittle keyword-based routing in TimmyOrchestrator with an LLM-based Int
You are working in the Timmy Time Dashboard repo. Read CLAUDE.md and AGENTS.md, then inspect timmy/agents/timmy.py and the LLM routing modules under infrastructure/router/, particularly CascadeRouter. Implement an IntentClassifier class (e.g., timmy/agents/intent_classifier.py) that uses the project's LLM abstraction to classify requests into categories such as DIRECT, RESEARCH, CODE, MEMORY, CREATIVE, and COMPLEX, based on the prompt template described in the research. Integrate this classifier into TimmyOrchestrator so that for each incoming request, the orchestrator calls IntentClassifier.classify and uses the result to choose the appropriate agent or workflow instead of relying on keyword lists. Add a simple caching mechanism (e.g., LRU cache keyed by normalized request strings) to avoid repeated LLM calls for identical or highly similar inputs, respecting project memory constraints. Update or add tests (using MockLLM) to verify that requests are classified into the correct categories and that orchestrator routing behaves appropriately when the classifier returns each label. Run make test to verify.
```
### Architecture Note (added 2026-03-08)
> **Pivot from LLM classify to embedding similarity.** An LLM call adds 500ms–2s
> latency to the critical routing path. Instead, use embedding vectors + cosine
> similarity against a curated set of "Intent Vectors" (sub-10ms, deterministic).
> This reuses the sqlite-vec infrastructure from Ticket 11. Keep LLM fallback
> for truly ambiguous requests only.
### Acceptance Criteria
- [ ] An IntentClassifier is implemented and integrated into the orchestrator
- [ ] Primary routing uses embedding similarity, not LLM calls (LLM as fallback only)
- [ ] Keyword-based routing is minimized or removed for primary paths
- [ ] Classification uses the project's LLM abstraction and can be tested via MockLLM
- [ ] Classification can be tested via MockLLM and deterministic embeddings
- [ ] Tests cover classification behavior and routing outcomes for each category
- [ ] Tests pass (`make test`)
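The embedding-similarity pivot can be sketched as cosine matching against curated intent vectors (the vectors below are toy 3-dimensional placeholders; a real implementation would reuse the sqlite-vec embedder from Ticket 11):

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


# Curated "intent vectors" — one reference embedding per routing label.
INTENT_VECTORS: dict[str, list[float]] = {
    "CODE": [0.9, 0.1, 0.0],
    "RESEARCH": [0.1, 0.9, 0.0],
    "MEMORY": [0.0, 0.1, 0.9],
}


def classify(embedding: list[float], threshold: float = 0.75) -> str:
    """Best-matching intent, or AMBIGUOUS to trigger the LLM fallback."""
    label, score = max(
        ((name, cosine(embedding, vec)) for name, vec in INTENT_VECTORS.items()),
        key=lambda pair: pair[1],
    )
    return label if score >= threshold else "AMBIGUOUS"
```

The threshold is the knob that decides how often the slow LLM fallback fires; it would need tuning against real traffic.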
@@ -364,10 +381,19 @@ Add an AgentPool abstraction that manages concurrent agent execution with a sema
You are working in the Timmy Time Dashboard repo. Read CLAUDE.md and AGENTS.md, then open timmy/agents/timmy.py and any agent loop implementation that wraps Agno's Agent.run in asyncio.to_thread. Implement an AgentPool class that uses an asyncio.Semaphore and a ThreadPoolExecutor (with configurable max_concurrent) to execute blocking agent.run calls, exposing an async execute(agent, message) interface. Replace direct calls to asyncio.to_thread or equivalent ad-hoc patterns in the agentic loop with calls to AgentPool.execute, wiring the pool into the orchestrator or agent manager via dependency injection or a lazy getter, consistent with the project's DI approach. Ensure proper cleanup of the ThreadPoolExecutor on shutdown (e.g., FastAPI shutdown events or equivalent hooks). Add tests with a fake or Mock agent that simulate multiple concurrent executions, verifying that concurrency limits are respected and that results are returned correctly. Run make test to verify.
```
### Architecture Note (added 2026-03-08)
> **SQLite write contention risk.** Even with WAL mode (Ticket 1), 4 concurrent
> agents hitting brain.db in a tight loop will cause `database is locked` errors.
> Python's sqlite3 driver handles concurrency poorly. Wrap database writes in an
> `asyncio.Lock()` at the application layer, or adopt `aiosqlite` for proper
> async SQLite access. This is critical for AgentPool to function correctly.
### Acceptance Criteria
- [ ] AgentPool exists and controls concurrent use of blocking agents
- [ ] Agentic loop no longer directly uses `asyncio.to_thread`; it uses AgentPool
- [ ] Database write paths use `asyncio.Lock()` or `aiosqlite` to prevent lock contention
- [ ] Concurrency limits are configurable via settings
- [ ] Executor shutdown is handled gracefully
- [ ] Tests cover basic concurrency behavior
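A minimal sketch of the AgentPool shape described above, including the application-layer write lock flagged in the architecture note (class and method names are hypothetical; the repo may wire this differently):

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor


class AgentPool:
    """Bounds concurrent blocking agent.run calls with a semaphore + executor."""

    def __init__(self, max_concurrent: int = 4) -> None:
        self._semaphore = asyncio.Semaphore(max_concurrent)
        self._executor = ThreadPoolExecutor(max_workers=max_concurrent)
        # Serialize SQLite writes at the app layer (see note above) unless
        # the project adopts aiosqlite instead.
        self.db_write_lock = asyncio.Lock()

    async def execute(self, agent, message: str):
        """Run a blocking agent in the pool, respecting the concurrency limit."""
        async with self._semaphore:
            loop = asyncio.get_running_loop()
            return await loop.run_in_executor(self._executor, agent.run, message)

    def shutdown(self) -> None:
        """Call from the app's shutdown hook to release executor threads."""
        self._executor.shutdown(wait=True)
```

Callers would hold `pool.db_write_lock` around any `conn.execute(...)` write path shared by concurrent agents.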
@@ -393,9 +419,17 @@ Replace loosely structured dict messages between agents with a structured AgentM
You are working in the Timmy Time Dashboard repo. Read CLAUDE.md and AGENTS.md, then inspect timmy/agents/timmy.py and any code that sends inter-agent messages via the EventBus. Implement an AgentMessage dataclass or pydantic model (e.g., in timmy/agents/messages.py) containing fields such as from_agent, to_agent, message_type (request, response, delegate, escalate), content, context, thread_id, priority, and requires_response. Update inter-agent communication code to construct and emit AgentMessage instances instead of generic dicts, while preserving the underlying event/topic names used by EventBus. Ensure serialization and deserialization for events is handled consistently (e.g., via .dict() or model_dump()), keeping backward compatibility where needed for existing consumers. Add tests that verify AgentMessage roundtripping through the event bus and that fields are populated as expected in typical workflows. Run make test to verify.
```
### Architecture Note (added 2026-03-08)
> **Schema versioning + trace_id required.** AgentMessage MUST include
> `schema_version: int = 1` for forward-compatible event replay. Also include
> a `trace_id: str` field that correlates all messages belonging to a single
> user request, enabling full execution graph reconstruction from the event log.
### Acceptance Criteria
- [ ] AgentMessage model exists and is used for inter-agent communication
- [ ] AgentMessage includes `schema_version` and `trace_id` fields
- [ ] EventBus payloads carry structured messages instead of arbitrary dicts
- [ ] Serialization and deserialization work correctly with existing event infrastructure
- [ ] Tests confirm structure and roundtrip behavior
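A minimal AgentMessage sketch carrying the `schema_version` and `trace_id` fields from the note (a plain dataclass with JSON roundtripping here; the repo may prefer a pydantic model with `model_dump()` instead):

```python
import json
import uuid
from dataclasses import asdict, dataclass, field


@dataclass
class AgentMessage:
    from_agent: str
    to_agent: str
    message_type: str  # request | response | delegate | escalate
    content: str
    trace_id: str  # correlates all messages for one user request
    schema_version: int = 1
    context: dict = field(default_factory=dict)
    thread_id: str = field(default_factory=lambda: str(uuid.uuid4()))
    priority: int = 0
    requires_response: bool = False

    def to_event_payload(self) -> str:
        """Serialize for the EventBus; JSON keeps existing consumers working."""
        return json.dumps(asdict(self))

    @classmethod
    def from_event_payload(cls, payload: str) -> "AgentMessage":
        return cls(**json.loads(payload))
```

Keeping the existing event/topic names and serializing the structured message into the payload is what preserves backward compatibility for current EventBus consumers.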
@@ -1013,3 +1047,163 @@ Add tests that verify: writes are serialized (no concurrent write errors), backp
| 28 | T18 | ToolCapability model and registry | 4 | M |
*T1, T2, T19 can all run in parallel as they have no dependencies on each other.*
---
## Architectural Review Notes (2026-03-08)
**Source:** Independent technical review of the 28-ticket roadmap.
### Verdict
| Aspect | Assessment |
|--------|------------|
| Overall direction | Good — consolidation first, features second |
| Tier 1 (Tickets 1–5) | Execute fully. Reduces bloat, improves testability. |
| Tier 2 (Tickets 6–11) | Reasonable operational features. Selectively execute. Skip rate limiting if behind a gateway; skip validation middleware if FastAPI already covers it. |
| Tier 3 (Tickets 12–20) | **Major bloat risk.** Pick 2–3 max. Keep WebSocket (if real-time needed), metrics (if lacking observability), and retention. **Skip multi-tenancy and plugins unless paying customers demand them.** |
### What's Good (Anti-Bloat)
1. **Tier 1 consolidates, doesn't add** — WAL mode, lazy init, EventBus unification, memory consolidation all reduce surface area.
2. **Several tickets explicitly reduce complexity** — Ticket 2 removes import-time side effects, Ticket 3 unifies two event interfaces into one, Ticket 5 creates a facade.
3. **Scope estimates are realistic** — 4S, 9M, 4L, 3XL distribution is honest.
### Bloat Risks to Watch
1. **Tier 3 has 9 tickets** — multi-tenancy (XL) and plugin system are massive complexity multipliers. Only build if there's real demand.
2. **Some features duplicate ecosystem tools** — Prometheus metrics adds ops complexity; backup systems often already exist at the infrastructure layer; config hot-reload is nice-to-have.
3. **Ticket prompts are overly prescriptive** — they specify implementation details ("sqlite-vec or similar") better left to the implementer.
### Recommendation
Execute **Tier 1 fully** (done: #1, #2, #3). Execute **Tier 2 selectively**. **Cut Tier 3 in half** — the plan is well-organized but ~12 tickets would suffice for real-world needs.
---
## Clean Architecture Review (2026-03-08)
**Source:** Second independent review — Clean Architecture critique.
### Core Problem: Infrastructure-First, Not Domain-First
This plan is almost entirely infrastructure thinking. Clean Architecture (Martin,
Hexagonal/Ports & Adapters, Onion) prescribes: start with Entities and Use Cases,
then build infrastructure as adapters. This plan inverts that.
### What's Missing
| Gap | Detail |
|-----|--------|
| No domain layer | 0 tickets define domain entities, use cases, or business invariants |
| Database drives design | Tickets 1, 4, 13, 17–19 are all SQLite schema/storage mechanics. DB should be a swappable detail. |
| No dependency rule | Everything depends on `brain/memory.py`, `swarm/event_log.py` — infrastructure modules. Dependencies should point inward: Infra → Use Cases → Entities. |
| Facades hide, don't abstract | Ticket 5's "MemoryFacade" exposes storage tiers (`store(tier, key, value)`), not domain operations (`recordTimeEntry()`, `generateReport()`). |
### What Clean Architecture Would Look Like
| This Plan | Clean Architecture |
|-----------|--------------------|
| "Add WAL mode for SQLite" | Define `TimeEntryRepository` interface in domain; SQLite is one implementation |
| "MemoryFacade with 4 tiers" | Domain entities with clear lifecycle; storage strategy is infrastructure |
| "Unify EventBus with event log" | Domain events (`TimeEntryRecorded`) published to abstract event bus |
| "Multi-tenancy support" | `TenantId` value object in domain; infrastructure handles isolation |
### Verdict
This plan will produce a system that:
- Has high load handling (WAL, circuit breakers)
- Is observable (metrics, logging)
- Has lots of features (WebSocket, plugins, embeddings)
- **Has no clear domain boundaries**
- **Is hard to unit test without the database**
- **Will be hard to refactor when requirements change**
### Recommendation
> **Stop after Ticket 2 (lazy init) and define:**
> 1. What are the domain entities? (TimeEntry? Agent? Task?)
> 2. What are the use cases? (Record time? Generate report? Hand off?)
> 3. What interfaces do use cases need? (Repository? EventPublisher?)
>
> Then build infrastructure (SQLite, EventBus, WebSocket) as adapters
> implementing those interfaces — not as the foundation.
### Action Items
- [ ] Before Ticket 4: define domain entities and repository interfaces
- [ ] Ticket 5 (MemoryFacade): reframe as domain operations, not storage tiers
- [ ] All new abstractions: dependency rule — domain depends on nothing, infra depends on domain
- [ ] Evaluate remaining tickets through "is this domain or infrastructure?" lens
---
## 2000-Line Philosophy Review (2026-03-08)
**Source:** Third independent review — radical simplicity critique.
### Core Argument: This Plan Abandons YAGNI
| 2000-Line Philosophy | This Plan |
|----------------------|-----------|
| Small, comprehensible units | 28 tickets, 4 tiers, XL scopes |
| YAGNI — prove you need it | Multi-tenancy, plugins, semantic search, hot-reload — all speculative |
| Delete code, don't add it | Mostly adding infrastructure |
| One database, simple schema | WAL tuning, retention, archival, backups, multi-tenant modes |
| A few solid abstractions | MemoryFacade + 4 tiers + EventBus + Plugin system + Job queue |
| Understand the whole system | "Drop each Claude Code Prompt into a fresh session" — you won't understand it |
### Telltale Signs
1. **Prompts are massive.** Each "Claude Code Prompt" is 200+ words of prescriptive
implementation detail. That's outsourcing thinking, not planning.
2. **28 tickets for a time dashboard.** Cumulative surface area is enormous.
A 2000-line codebase has ~10–15 source files, 3–4 core abstractions, one way
to do things. This plan creates 20+ new modules.
3. **"S" scope tickets aren't small.** Ticket 1 (WAL mode) touches multiple databases,
requires shared helpers, needs tests across the codebase — that's a cross-cutting concern.
### What 2000 Lines Looks Like
```python
# memory.py (~150 lines)
class Memory:
def get(self, key): ...
def set(self, key, value): ...
# SQLite behind a simple interface. No WAL tuning exposed.
# No "tiers." No "facade." Just store and retrieve.
# events.py (~100 lines)
class Events:
def publish(self, event): ...
# SQLite table. Simple. Blocking is fine for now.
```
Need multi-tenancy later? Fork and add it when a customer pays for it.
Need plugins? Monkey-patch or add a hook — in 20 lines.
Need semantic search? `grep` works surprisingly well under 10k documents.
### Verdict
> The 2000-line philosophy isn't about the number — it's about being willing
> to say **"no."** This plan says yes to everything. That's not architecture —
> that's accumulation.
### Recommendation
> **Cut to 4 tickets max.** WAL mode (if hitting contention), lazy init (for
> tests), and maybe health checks. Everything else waits until it hurts.
### Decision Matrix (Updated with All 3 Reviews)
| Ticket | Bloat Review | Clean Arch Review | 2000-Line Review | Final Call |
|--------|-------------|-------------------|-------------------|-----------|
| T1 WAL mode | Do it | Infrastructure detail | Do if contention exists | **DONE** |
| T2 Lazy init | Do it | Good for testability | Do it | **DONE** |
| T3 EventBus unify | Do it | Needs domain events first | Overkill | **DONE** |
| T4 Memory consolidation | Do it | Define domain entities first | Wait until it hurts | **BLOCKED: needs domain model** |
| T5 MemoryFacade | Do it | Reframe as domain ops | Overkill — one `Memory` class | **BLOCKED: needs domain model** |
| T7 MockLLM | Do it | Good | Good | **NEXT** |
| T19 Threat model | Do it | Good | Skip unless deploying | Evaluate |
| T20 OpenTelemetry | Selective | Infrastructure | Skip | Skip |
| T6–T18, T21–T28 | Cut in half | Define domain first | Cut to zero | **PARKED** |

View File

@@ -93,16 +93,19 @@ class UnifiedMemory:
conn = sqlite3.connect(str(self.db_path))
try:
conn.execute("PRAGMA journal_mode=WAL")
conn.execute("PRAGMA busy_timeout=5000")
conn.executescript(_LOCAL_SCHEMA)
conn.commit()
logger.info("Brain local DB initialized at %s", self.db_path)
logger.info("Brain local DB initialized at %s (WAL mode)", self.db_path)
finally:
conn.close()
def _get_conn(self) -> sqlite3.Connection:
"""Get a SQLite connection."""
"""Get a SQLite connection with WAL mode and busy timeout."""
conn = sqlite3.connect(str(self.db_path))
conn.row_factory = sqlite3.Row
conn.execute("PRAGMA busy_timeout=5000")
return conn
def _get_embedder(self):

View File

@@ -1,3 +1,6 @@
import logging as _logging
import os
import sys
from typing import Literal
from pydantic_settings import BaseSettings, SettingsConfigDict
@@ -339,14 +342,26 @@ def get_effective_ollama_model() -> str:
# ── Startup validation ───────────────────────────────────────────────────────
# Enforce security requirements — fail fast in production.
import logging as _logging
import sys
_startup_logger = _logging.getLogger("config")
_startup_validated = False
# Production mode: require secrets to be set
if settings.timmy_env == "production":
def validate_startup(*, force: bool = False) -> None:
"""Enforce security requirements — call from app entry points, not import.
Skipped in test mode (TIMMY_TEST_MODE=1) unless force=True.
In production: sys.exit(1) if required secrets are missing.
In development: log warnings only.
"""
global _startup_validated
if _startup_validated and not force:
return
if os.environ.get("TIMMY_TEST_MODE") == "1" and not force:
_startup_validated = True
return
if settings.timmy_env == "production":
_missing = []
if not settings.l402_hmac_secret:
_missing.append("L402_HMAC_SECRET")
@@ -361,8 +376,7 @@ if settings.timmy_env == "production":
)
sys.exit(1)
_startup_logger.info("Production mode: security secrets validated ✓")
else:
# Development mode: warn but continue
else:
if not settings.l402_hmac_secret:
_startup_logger.warning(
"SEC: L402_HMAC_SECRET is not set — "
@@ -373,3 +387,5 @@ else:
"SEC: L402_MACAROON_SECRET is not set — "
"set a unique secret in .env before deploying to production."
)
_startup_validated = True

View File

@@ -9,7 +9,6 @@ Key improvements:
import asyncio
import logging
import os
from contextlib import asynccontextmanager
from pathlib import Path
@@ -200,13 +199,23 @@ async def _discord_token_watcher() -> None:
async def lifespan(app: FastAPI):
"""Application lifespan manager with non-blocking startup."""
# Validate security config (no-op in test mode)
from config import validate_startup
validate_startup()
# Enable event persistence (unified EventBus + swarm event_log)
from infrastructure.events.bus import init_event_bus_persistence
init_event_bus_persistence()
# Create all background tasks without waiting for them
briefing_task = asyncio.create_task(_briefing_scheduler())
# Initialize Spark Intelligence engine
from spark.engine import spark_engine
from spark.engine import get_spark_engine
if spark_engine.enabled:
if get_spark_engine().enabled:
logger.info("Spark Intelligence active — event capture enabled")
# Auto-prune old vector store memories on startup

View File

@@ -116,9 +116,9 @@ class SecurityHeadersMiddleware(BaseHTTPMiddleware):
# HTTPS enforcement (production only)
if self.production:
response.headers[
"Strict-Transport-Security"
] = "max-age=31536000; includeSubDomains; preload"
response.headers["Strict-Transport-Security"] = (
"max-age=31536000; includeSubDomains; preload"
)
async def dispatch(self, request: Request, call_next) -> Response:
"""Add security headers to the response.

View File

@@ -40,8 +40,7 @@ def _get_db() -> sqlite3.Connection:
DB_PATH.parent.mkdir(parents=True, exist_ok=True)
conn = sqlite3.connect(str(DB_PATH))
conn.row_factory = sqlite3.Row
conn.execute(
"""
conn.execute("""
CREATE TABLE IF NOT EXISTS tasks (
id TEXT PRIMARY KEY,
title TEXT NOT NULL,
@@ -54,8 +53,7 @@ def _get_db() -> sqlite3.Connection:
created_at TEXT DEFAULT (datetime('now')),
completed_at TEXT
)
"""
)
""")
conn.commit()
return conn

View File

@@ -26,8 +26,7 @@ def _get_db() -> sqlite3.Connection:
DB_PATH.parent.mkdir(parents=True, exist_ok=True)
conn = sqlite3.connect(str(DB_PATH))
conn.row_factory = sqlite3.Row
conn.execute(
"""
conn.execute("""
CREATE TABLE IF NOT EXISTS work_orders (
id TEXT PRIMARY KEY,
title TEXT NOT NULL,
@@ -42,8 +41,7 @@ def _get_db() -> sqlite3.Connection:
created_at TEXT DEFAULT (datetime('now')),
completed_at TEXT
)
"""
)
""")
conn.commit()
return conn

View File

@@ -1,14 +1,18 @@
"""Async Event Bus for inter-agent communication.
Agents publish and subscribe to events for loose coupling.
Events are typed and carry structured data.
Events are typed and carry structured data. Optionally persists
events to SQLite for durability and replay.
"""
import asyncio
import json
import logging
import sqlite3
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Any, Callable, Coroutine
from pathlib import Path
from typing import Any, Callable, Coroutine, Optional
logger = logging.getLogger(__name__)
@@ -27,32 +31,181 @@ class Event:
# Type alias for event handlers
EventHandler = Callable[[Event], Coroutine[Any, Any, None]]
# Schema for the unified events table
_EVENTS_SCHEMA = """
CREATE TABLE IF NOT EXISTS events (
id TEXT PRIMARY KEY,
event_type TEXT NOT NULL,
source TEXT DEFAULT '',
task_id TEXT DEFAULT '',
agent_id TEXT DEFAULT '',
data TEXT DEFAULT '{}',
timestamp TEXT NOT NULL
);
CREATE INDEX IF NOT EXISTS idx_events_type ON events(event_type);
CREATE INDEX IF NOT EXISTS idx_events_source ON events(source);
CREATE INDEX IF NOT EXISTS idx_events_task ON events(task_id);
CREATE INDEX IF NOT EXISTS idx_events_ts ON events(timestamp);
"""
class EventBus:
"""Async event bus for publish/subscribe pattern.
Supports optional SQLite persistence via enable_persistence().
When enabled, all published events are durably stored and can be
replayed via the replay() method.
Usage:
bus = EventBus()
bus.enable_persistence(Path("data/events.db"))
# Subscribe to events
@bus.subscribe("agent.task.*")
async def handle_task(event: Event):
print(f"Task event: {event.data}")
# Publish events
await bus.publish(Event(
type="agent.task.assigned",
source="default",
data={"task_id": "123", "agent": "forge"}
))
# Replay persisted events
events = bus.replay(event_type="task.created")
"""
def __init__(self) -> None:
self._subscribers: dict[str, list[EventHandler]] = {}
self._history: list[Event] = []
self._max_history = 1000
self._persistence_db_path: Optional[Path] = None
logger.info("EventBus initialized")
# ── Persistence ──────────────────────────────────────────────────────
def enable_persistence(self, db_path: Path) -> None:
"""Enable SQLite persistence for all published events.
Args:
db_path: Path to the SQLite database file.
"""
self._persistence_db_path = db_path
self._init_persistence_db()
logger.info("EventBus persistence enabled at %s", db_path)
def _init_persistence_db(self) -> None:
"""Initialize the persistence database with schema."""
if self._persistence_db_path is None:
return
self._persistence_db_path.parent.mkdir(parents=True, exist_ok=True)
conn = sqlite3.connect(str(self._persistence_db_path))
try:
conn.execute("PRAGMA journal_mode=WAL")
conn.execute("PRAGMA busy_timeout=5000")
conn.executescript(_EVENTS_SCHEMA)
conn.commit()
finally:
conn.close()
def _get_persistence_conn(self) -> Optional[sqlite3.Connection]:
"""Get a connection to the persistence database."""
if self._persistence_db_path is None:
return None
conn = sqlite3.connect(str(self._persistence_db_path))
conn.row_factory = sqlite3.Row
conn.execute("PRAGMA busy_timeout=5000")
return conn
def _persist_event(self, event: Event) -> None:
"""Write an event to the persistence database."""
conn = self._get_persistence_conn()
if conn is None:
return
try:
task_id = event.data.get("task_id", "")
agent_id = event.data.get("agent_id", "")
conn.execute(
"INSERT OR IGNORE INTO events "
"(id, event_type, source, task_id, agent_id, data, timestamp) "
"VALUES (?, ?, ?, ?, ?, ?, ?)",
(
event.id,
event.type,
event.source,
task_id,
agent_id,
json.dumps(event.data),
event.timestamp,
),
)
conn.commit()
except Exception as exc:
logger.debug("Failed to persist event: %s", exc)
finally:
conn.close()
# ── Replay ───────────────────────────────────────────────────────────
def replay(
self,
event_type: Optional[str] = None,
source: Optional[str] = None,
task_id: Optional[str] = None,
limit: int = 100,
) -> list[Event]:
"""Replay persisted events from SQLite with optional filters.
Args:
event_type: Filter by exact event type.
source: Filter by event source.
task_id: Filter by task_id.
limit: Max events to return (most recent first).
Returns:
List of Event objects from persistent storage.
"""
conn = self._get_persistence_conn()
if conn is None:
return []
try:
conditions = []
params: list = []
if event_type:
conditions.append("event_type = ?")
params.append(event_type)
if source:
conditions.append("source = ?")
params.append(source)
if task_id:
conditions.append("task_id = ?")
params.append(task_id)
where = " AND ".join(conditions) if conditions else "1=1"
sql = f"SELECT * FROM events WHERE {where} ORDER BY timestamp DESC LIMIT ?"
params.append(limit)
rows = conn.execute(sql, params).fetchall()
return [
Event(
id=row["id"],
type=row["event_type"],
source=row["source"],
data=json.loads(row["data"]) if row["data"] else {},
timestamp=row["timestamp"],
)
for row in rows
]
except Exception as exc:
logger.debug("Failed to replay events: %s", exc)
return []
finally:
conn.close()
# ── Subscribe / Publish ──────────────────────────────────────────────
def subscribe(self, event_pattern: str) -> Callable[[EventHandler], EventHandler]:
"""Decorator to subscribe to events matching a pattern.
@@ -87,10 +240,15 @@ class EventBus:
async def publish(self, event: Event) -> int:
"""Publish an event to all matching subscribers.
If persistence is enabled, the event is also written to SQLite.
Returns:
Number of handlers invoked
"""
# Store in history
# Persist to SQLite (graceful — never crash on persistence failure)
self._persist_event(event)
# Store in in-memory history
self._history.append(event)
if len(self._history) > self._max_history:
self._history = self._history[-self._max_history :]
@@ -105,7 +263,8 @@ class EventBus:
# Invoke handlers concurrently
if handlers:
await asyncio.gather(
*[self._invoke_handler(h, event) for h in handlers], return_exceptions=True
*[self._invoke_handler(h, event) for h in handlers],
return_exceptions=True,
)
logger.debug("Published event '%s' to %d handlers", event.type, len(handlers))
@@ -147,18 +306,39 @@ class EventBus:
return events[-limit:]
def clear_history(self) -> None:
"""Clear event history."""
"""Clear in-memory event history."""
self._history.clear()
# Module-level singleton
event_bus = EventBus()
# ── Lazy singleton ────────────────────────────────────────────────────────────
_event_bus: EventBus | None = None
def get_event_bus() -> EventBus:
"""Return the module-level EventBus, creating it on first access."""
global _event_bus
if _event_bus is None:
_event_bus = EventBus()
return _event_bus
def init_event_bus_persistence(db_path: Optional[Path] = None) -> None:
"""Enable persistence on the module-level EventBus singleton.
Call this during app startup to enable durable event storage.
If db_path is not provided, uses `data/events.db`.
"""
bus = get_event_bus()
if bus._persistence_db_path is not None:
return # already initialized
path = db_path or Path("data/events.db")
bus.enable_persistence(path)
# Convenience functions
async def emit(event_type: str, source: str, data: dict) -> int:
"""Quick emit an event."""
return await event_bus.publish(
return await get_event_bus().publish(
Event(
type=event_type,
source=source,
@@ -169,4 +349,11 @@ async def emit(event_type: str, source: str, data: dict) -> int:
def on(event_pattern: str) -> Callable[[EventHandler], EventHandler]:
"""Quick subscribe decorator."""
return event_bus.subscribe(event_pattern)
return get_event_bus().subscribe(event_pattern)
def __getattr__(name: str):
"""Module-level __getattr__ for lazy backward-compatible access to event_bus."""
if name == "event_bus":
return get_event_bus()
raise AttributeError(f"module {__name__!r} has no attribute {name!r}")

View File

@@ -67,8 +67,9 @@ def _get_conn() -> sqlite3.Connection:
DB_PATH.parent.mkdir(parents=True, exist_ok=True)
conn = sqlite3.connect(str(DB_PATH))
conn.row_factory = sqlite3.Row
conn.execute(
"""
conn.execute("PRAGMA journal_mode=WAL")
conn.execute("PRAGMA busy_timeout=5000")
conn.execute("""
CREATE TABLE IF NOT EXISTS custom_models (
name TEXT PRIMARY KEY,
format TEXT NOT NULL,
@@ -81,18 +82,15 @@ def _get_conn() -> sqlite3.Connection:
default_temperature REAL NOT NULL DEFAULT 0.7,
max_tokens INTEGER NOT NULL DEFAULT 2048
)
"""
)
conn.execute(
"""
""")
conn.execute("""
CREATE TABLE IF NOT EXISTS agent_model_assignments (
agent_id TEXT PRIMARY KEY,
model_name TEXT NOT NULL,
assigned_at TEXT NOT NULL,
FOREIGN KEY (model_name) REFERENCES custom_models(name)
)
"""
)
""")
conn.commit()
return conn
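A minimal sketch (`open_wal_conn` is an illustrative name, not the repo's helper) of the pattern each `_get_conn` above applies. Note the asymmetry: `journal_mode=WAL` is a persistent property of the database file, while `busy_timeout` is per-connection and must be reissued on every connect. In-memory databases ignore WAL, so a file path is required:

```python
import sqlite3
import tempfile
from pathlib import Path

def open_wal_conn(db_path: Path) -> sqlite3.Connection:
    """Open a SQLite connection with WAL journaling and a 5s busy timeout."""
    db_path.parent.mkdir(parents=True, exist_ok=True)
    conn = sqlite3.connect(str(db_path))
    conn.execute("PRAGMA journal_mode=WAL")   # readers no longer block writers
    conn.execute("PRAGMA busy_timeout=5000")  # wait up to 5s on a locked db
    return conn

db = Path(tempfile.mkdtemp()) / "demo.db"
conn = open_wal_conn(db)
journal_mode = conn.execute("PRAGMA journal_mode").fetchone()[0]
busy_timeout = conn.execute("PRAGMA busy_timeout").fetchone()[0]
conn.close()
```

This is why the tests later in this commit assert both pragmas on a fresh connection rather than only on the connection that set them.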

View File

@@ -31,8 +31,7 @@ logger = logging.getLogger(__name__)
class Orchestrator(Protocol):
"""Anything with an ``execute_task`` matching Timmy's orchestrator."""
async def execute_task(self, task_id: str, description: str, context: dict) -> Any:
...
async def execute_task(self, task_id: str, description: str, context: dict) -> Any: ...
def _wrap_orchestrator(orch: Orchestrator) -> Callable:

View File

@@ -44,8 +44,9 @@ def _get_conn() -> sqlite3.Connection:
DB_PATH.parent.mkdir(parents=True, exist_ok=True)
conn = sqlite3.connect(str(DB_PATH))
conn.row_factory = sqlite3.Row
conn.execute(
"""
conn.execute("PRAGMA journal_mode=WAL")
conn.execute("PRAGMA busy_timeout=5000")
conn.execute("""
CREATE TABLE IF NOT EXISTS spark_predictions (
id TEXT PRIMARY KEY,
task_id TEXT NOT NULL,
@@ -56,8 +57,7 @@ def _get_conn() -> sqlite3.Connection:
created_at TEXT NOT NULL,
evaluated_at TEXT
)
"""
)
""")
conn.execute("CREATE INDEX IF NOT EXISTS idx_pred_task ON spark_predictions(task_id)")
conn.execute("CREATE INDEX IF NOT EXISTS idx_pred_type ON spark_predictions(prediction_type)")
conn.commit()
@@ -277,8 +277,7 @@ def get_predictions(
def get_accuracy_stats() -> dict:
"""Return aggregate accuracy statistics for the EIDOS loop."""
conn = _get_conn()
row = conn.execute(
"""
row = conn.execute("""
SELECT
COUNT(*) AS total_predictions,
COUNT(evaluated_at) AS evaluated,
@@ -286,8 +285,7 @@ def get_accuracy_stats() -> dict:
MIN(CASE WHEN accuracy IS NOT NULL THEN accuracy END) AS min_accuracy,
MAX(CASE WHEN accuracy IS NOT NULL THEN accuracy END) AS max_accuracy
FROM spark_predictions
"""
).fetchone()
""").fetchone()
conn.close()
return {

View File

@@ -6,7 +6,8 @@ memory consolidation, and the advisory system.
Usage
-----
from spark.engine import spark_engine
from spark.engine import get_spark_engine
spark_engine = get_spark_engine()
# Capture a swarm event
spark_engine.on_task_posted(task_id, description)
@@ -346,14 +347,31 @@ class SparkEngine:
return spark_eidos.get_predictions(limit=limit)
# Module-level singleton — respects SPARK_ENABLED config
def _create_engine() -> SparkEngine:
# ── Lazy singleton ────────────────────────────────────────────────────────────
_spark_engine: SparkEngine | None = None
def get_spark_engine() -> SparkEngine:
"""Return the module-level SparkEngine, creating it on first access."""
global _spark_engine
if _spark_engine is None:
try:
from config import settings
return SparkEngine(enabled=settings.spark_enabled)
_spark_engine = SparkEngine(enabled=settings.spark_enabled)
except Exception:
return SparkEngine(enabled=True)
_spark_engine = SparkEngine(enabled=True)
return _spark_engine
spark_engine = _create_engine()
def reset_spark_engine() -> None:
"""Reset the singleton for test isolation."""
global _spark_engine
_spark_engine = None
def __getattr__(name: str):
"""Module-level __getattr__ for lazy backward-compatible access to spark_engine."""
if name == "spark_engine":
return get_spark_engine()
raise AttributeError(f"module {__name__!r} has no attribute {name!r}")

View File

@@ -57,8 +57,9 @@ def _get_conn() -> sqlite3.Connection:
DB_PATH.parent.mkdir(parents=True, exist_ok=True)
conn = sqlite3.connect(str(DB_PATH))
conn.row_factory = sqlite3.Row
conn.execute(
"""
conn.execute("PRAGMA journal_mode=WAL")
conn.execute("PRAGMA busy_timeout=5000")
conn.execute("""
CREATE TABLE IF NOT EXISTS spark_events (
id TEXT PRIMARY KEY,
event_type TEXT NOT NULL,
@@ -69,10 +70,8 @@ def _get_conn() -> sqlite3.Connection:
importance REAL NOT NULL DEFAULT 0.5,
created_at TEXT NOT NULL
)
"""
)
conn.execute(
"""
""")
conn.execute("""
CREATE TABLE IF NOT EXISTS spark_memories (
id TEXT PRIMARY KEY,
memory_type TEXT NOT NULL,
@@ -83,8 +82,7 @@ def _get_conn() -> sqlite3.Connection:
created_at TEXT NOT NULL,
expires_at TEXT
)
"""
)
""")
conn.execute("CREATE INDEX IF NOT EXISTS idx_events_type ON spark_events(event_type)")
conn.execute("CREATE INDEX IF NOT EXISTS idx_events_agent ON spark_events(agent_id)")
conn.execute("CREATE INDEX IF NOT EXISTS idx_events_task ON spark_events(task_id)")

View File

@@ -2,8 +2,12 @@
Provides EventType enum, EventLogEntry dataclass, and log_event() function
used by error_capture, thinking engine, and the event broadcaster.
Events are persisted to SQLite and also published to the unified EventBus
(infrastructure.events.bus) for subscriber notification.
"""
import json
import logging
import sqlite3
import uuid
@@ -73,8 +77,9 @@ def _ensure_db() -> sqlite3.Connection:
DB_PATH.parent.mkdir(parents=True, exist_ok=True)
conn = sqlite3.connect(str(DB_PATH))
conn.row_factory = sqlite3.Row
conn.execute(
"""
conn.execute("PRAGMA journal_mode=WAL")
conn.execute("PRAGMA busy_timeout=5000")
conn.execute("""
CREATE TABLE IF NOT EXISTS events (
id TEXT PRIMARY KEY,
event_type TEXT NOT NULL,
@@ -84,12 +89,45 @@ def _ensure_db() -> sqlite3.Connection:
data TEXT DEFAULT '{}',
timestamp TEXT NOT NULL
)
"""
)
""")
conn.commit()
return conn
def _publish_to_event_bus(entry: EventLogEntry) -> None:
"""Publish an event to the unified EventBus (non-blocking).
This bridges the synchronous log_event() callers to the async EventBus
so subscribers get notified of all events regardless of origin.
"""
try:
import asyncio
from infrastructure.events.bus import Event, event_bus
event = Event(
id=entry.id,
type=entry.event_type.value,
source=entry.source,
data={
**entry.data,
"task_id": entry.task_id,
"agent_id": entry.agent_id,
},
timestamp=entry.timestamp,
)
try:
asyncio.get_running_loop()
asyncio.create_task(event_bus.publish(event))
except RuntimeError:
# No event loop running — skip async publish
pass
except Exception:
# Graceful degradation — never crash on EventBus integration
pass
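The `get_running_loop()` / `create_task()` dance above is the standard way to bridge a synchronous caller into an async bus without blocking. A self-contained sketch of the same idea (`fire_and_forget` is a hypothetical helper, not the repo's API):

```python
import asyncio

def fire_and_forget(coro_factory) -> bool:
    """Schedule coro_factory() on the running loop; return False if there is none."""
    try:
        loop = asyncio.get_running_loop()
    except RuntimeError:
        return False  # sync caller outside any event loop: skip, don't crash
    loop.create_task(coro_factory())
    return True

received = []

async def _publish():
    received.append("event")

async def main() -> bool:
    scheduled = fire_and_forget(_publish)
    await asyncio.sleep(0)  # yield once so the scheduled task can run
    return scheduled

scheduled_inside = asyncio.run(main())
scheduled_outside = fire_and_forget(_publish)  # no running loop: returns False
```

Calling the factory only after the loop check also avoids "coroutine was never awaited" warnings on the no-loop path, which is why `_publish_to_event_bus` swallows the `RuntimeError` before building the task.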
def log_event(
event_type: EventType,
source: str = "",
@@ -99,11 +137,9 @@ def log_event(
) -> EventLogEntry:
"""Record an event and return the entry.
Also broadcasts to WebSocket clients via the event broadcaster
(lazy import to avoid circular deps).
Persists to SQLite, publishes to EventBus for subscribers,
and broadcasts to WebSocket clients.
"""
import json
entry = EventLogEntry(
id=str(uuid.uuid4()),
event_type=event_type,
@@ -137,6 +173,9 @@ def log_event(
except Exception as exc:
logger.debug("Failed to persist event: %s", exc)
# Publish to unified EventBus (non-blocking)
_publish_to_event_bus(entry)
# Broadcast to WebSocket clients (non-blocking)
try:
from infrastructure.events.broadcaster import event_broadcaster
@@ -150,8 +189,6 @@ def log_event(
def get_task_events(task_id: str, limit: int = 50) -> list[EventLogEntry]:
"""Retrieve events for a specific task."""
import json
db = _ensure_db()
try:
rows = db.execute(

View File

@@ -29,8 +29,9 @@ def _ensure_db() -> sqlite3.Connection:
DB_PATH.parent.mkdir(parents=True, exist_ok=True)
conn = sqlite3.connect(str(DB_PATH))
conn.row_factory = sqlite3.Row
conn.execute(
"""
conn.execute("PRAGMA journal_mode=WAL")
conn.execute("PRAGMA busy_timeout=5000")
conn.execute("""
CREATE TABLE IF NOT EXISTS tasks (
id TEXT PRIMARY KEY,
title TEXT NOT NULL,
@@ -43,8 +44,7 @@ def _ensure_db() -> sqlite3.Connection:
created_at TEXT DEFAULT (datetime('now')),
completed_at TEXT
)
"""
)
""")
conn.commit()
return conn

View File

@@ -328,9 +328,7 @@ class TimmyOrchestrator(BaseAgent):
# Build session-specific context block for the prompt
recent_changes = self._session_context.get("git_log_oneline", "")
if recent_changes and recent_changes != "Git log unavailable":
self._session_context[
"recent_changes_block"
] = f"""
self._session_context["recent_changes_block"] = f"""
## Recent Changes to Your Codebase (last 15 commits):
```
{recent_changes}

View File

@@ -48,8 +48,7 @@ def _get_conn(db_path: Path = _DEFAULT_DB) -> sqlite3.Connection:
db_path.parent.mkdir(parents=True, exist_ok=True)
conn = sqlite3.connect(str(db_path))
conn.row_factory = sqlite3.Row
conn.execute(
"""
conn.execute("""
CREATE TABLE IF NOT EXISTS approval_items (
id TEXT PRIMARY KEY,
title TEXT NOT NULL,
@@ -59,8 +58,7 @@ def _get_conn(db_path: Path = _DEFAULT_DB) -> sqlite3.Connection:
created_at TEXT NOT NULL,
status TEXT NOT NULL DEFAULT 'pending'
)
"""
)
""")
conn.commit()
return conn

View File

@@ -63,8 +63,7 @@ def _get_cache_conn(db_path: Path = _DEFAULT_DB) -> sqlite3.Connection:
db_path.parent.mkdir(parents=True, exist_ok=True)
conn = sqlite3.connect(str(db_path))
conn.row_factory = sqlite3.Row
conn.execute(
"""
conn.execute("""
CREATE TABLE IF NOT EXISTS briefings (
id INTEGER PRIMARY KEY AUTOINCREMENT,
generated_at TEXT NOT NULL,
@@ -72,8 +71,7 @@ def _get_cache_conn(db_path: Path = _DEFAULT_DB) -> sqlite3.Connection:
period_end TEXT NOT NULL,
summary TEXT NOT NULL
)
"""
)
""")
conn.commit()
return conn

View File

@@ -122,8 +122,7 @@ def _get_conn() -> sqlite3.Connection:
_has_vss = False
# Create tables
conn.execute(
"""
conn.execute("""
CREATE TABLE IF NOT EXISTS memory_entries (
id TEXT PRIMARY KEY,
content TEXT NOT NULL,
@@ -136,8 +135,7 @@ def _get_conn() -> sqlite3.Connection:
embedding TEXT, -- JSON array of floats
timestamp TEXT NOT NULL
)
"""
)
""")
# Create indexes
conn.execute("CREATE INDEX IF NOT EXISTS idx_memory_agent ON memory_entries(agent_id)")

View File

@@ -10,7 +10,6 @@ Handoff Protocol:
- Inject into next session automatically
"""
import hashlib
import logging
import re
from datetime import datetime, timezone
@@ -160,7 +159,7 @@ class VaultMemory:
filepath = self.path / namespace / filename
# Add header
full_content = f"""# {name.replace('_', ' ').title()}
full_content = f"""# {name.replace("_", " ").title()}
> Created: {datetime.now(timezone.utc).isoformat()}
> Namespace: {namespace}
@@ -256,9 +255,7 @@ class VaultMemory:
---
*Last updated: {date}*
""".format(
date=datetime.now(timezone.utc).strftime("%Y-%m-%d")
)
""".format(date=datetime.now(timezone.utc).strftime("%Y-%m-%d"))
profile_path.write_text(default)
@@ -462,5 +459,26 @@ class MemorySystem:
return "\n\n---\n\n".join(context_parts)
# Module-level singleton
memory_system = MemorySystem()
# ── Lazy singleton ────────────────────────────────────────────────────────────
_memory_system: MemorySystem | None = None
def get_memory_system() -> MemorySystem:
"""Return the module-level MemorySystem, creating it on first access."""
global _memory_system
if _memory_system is None:
_memory_system = MemorySystem()
return _memory_system
def reset_memory_system() -> None:
"""Reset the singleton for test isolation."""
global _memory_system
_memory_system = None
def __getattr__(name: str):
"""Module-level __getattr__ for lazy backward-compatible access."""
if name == "memory_system":
return get_memory_system()
raise AttributeError(f"module {__name__!r} has no attribute {name!r}")

View File

@@ -115,8 +115,7 @@ class SemanticMemory:
"""Initialize SQLite with vector storage."""
self.db_path.parent.mkdir(parents=True, exist_ok=True)
conn = sqlite3.connect(str(self.db_path))
conn.execute(
"""
conn.execute("""
CREATE TABLE IF NOT EXISTS chunks (
id TEXT PRIMARY KEY,
source TEXT NOT NULL,
@@ -125,8 +124,7 @@ class SemanticMemory:
created_at TEXT NOT NULL,
source_hash TEXT NOT NULL
)
"""
)
""")
conn.execute("CREATE INDEX IF NOT EXISTS idx_source ON chunks(source)")
conn.commit()
conn.close()

View File

@@ -88,8 +88,7 @@ def _get_conn(db_path: Path = _DEFAULT_DB) -> sqlite3.Connection:
db_path.parent.mkdir(parents=True, exist_ok=True)
conn = sqlite3.connect(str(db_path))
conn.row_factory = sqlite3.Row
conn.execute(
"""
conn.execute("""
CREATE TABLE IF NOT EXISTS thoughts (
id TEXT PRIMARY KEY,
content TEXT NOT NULL,
@@ -97,8 +96,7 @@ def _get_conn(db_path: Path = _DEFAULT_DB) -> sqlite3.Connection:
parent_id TEXT,
created_at TEXT NOT NULL
)
"""
)
""")
conn.execute("CREATE INDEX IF NOT EXISTS idx_thoughts_time ON thoughts(created_at)")
conn.commit()
return conn

View File

@@ -9,8 +9,6 @@ Any substrate that reads/writes memory goes through this interface.
from __future__ import annotations
import json
import tempfile
from pathlib import Path
import pytest
@@ -86,6 +84,24 @@ class TestUnifiedMemoryInit:
results = m2.recall_sync("first")
assert len(results) >= 1
def test_wal_mode_enabled(self, memory):
"""Database should use WAL journal mode for concurrency."""
conn = memory._get_conn()
try:
mode = conn.execute("PRAGMA journal_mode").fetchone()[0]
assert mode == "wal", f"Expected WAL mode, got {mode}"
finally:
conn.close()
def test_busy_timeout_set(self, memory):
"""Database connections should have busy_timeout configured."""
conn = memory._get_conn()
try:
timeout = conn.execute("PRAGMA busy_timeout").fetchone()[0]
assert timeout == 5000, f"Expected 5000ms busy_timeout, got {timeout}"
finally:
conn.close()
# ── Remember (Sync) ──────────────────────────────────────────────────────────

View File

@@ -164,8 +164,7 @@ def db_connection():
"""Provide a fresh in-memory SQLite connection for tests."""
conn = sqlite3.connect(":memory:")
conn.row_factory = sqlite3.Row
conn.executescript(
"""
conn.executescript("""
CREATE TABLE IF NOT EXISTS agents (
id TEXT PRIMARY KEY,
name TEXT NOT NULL,
@@ -183,8 +182,7 @@ def db_connection():
created_at TEXT NOT NULL,
completed_at TEXT
);
"""
)
""")
conn.commit()
yield conn
conn.close()

View File

@@ -1,6 +1,6 @@
"""Tests for the async event bus (infrastructure.events.bus)."""
import asyncio
import sqlite3
import pytest
@@ -231,3 +231,121 @@ class TestConvenienceFunctions:
# Cleanup
event_bus._subscribers.clear()
event_bus.clear_history()
# ── Persistence ──────────────────────────────────────────────────────────
class TestEventBusPersistence:
"""Test that EventBus persists events to SQLite."""
@pytest.fixture
def persistent_bus(self, tmp_path):
"""Create an EventBus with persistence enabled."""
db_path = tmp_path / "events.db"
bus = EventBus()
bus.enable_persistence(db_path)
return bus
async def test_publish_persists_event(self, persistent_bus):
"""Published events should be written to SQLite."""
await persistent_bus.publish(
Event(type="task.created", source="test", data={"task_id": "t1"})
)
events = persistent_bus.replay(event_type="task.created")
assert len(events) >= 1
assert events[0].type == "task.created"
assert events[0].data["task_id"] == "t1"
async def test_replay_returns_persisted_events(self, persistent_bus):
"""Replay should return events from SQLite, not just in-memory history."""
for i in range(5):
await persistent_bus.publish(Event(type="task.created", source="test", data={"i": i}))
# Create a fresh bus pointing at the same DB to prove persistence
bus2 = EventBus()
bus2.enable_persistence(persistent_bus._persistence_db_path)
events = bus2.replay(event_type="task.created")
assert len(events) == 5
async def test_replay_filters_by_type(self, persistent_bus):
"""Replay should filter by event type."""
await persistent_bus.publish(Event(type="task.created", source="s"))
await persistent_bus.publish(Event(type="agent.joined", source="s"))
tasks = persistent_bus.replay(event_type="task.created")
agents = persistent_bus.replay(event_type="agent.joined")
assert len(tasks) == 1
assert len(agents) == 1
async def test_replay_filters_by_source(self, persistent_bus):
"""Replay should filter by source."""
await persistent_bus.publish(Event(type="x", source="alpha"))
await persistent_bus.publish(Event(type="x", source="beta"))
alpha_events = persistent_bus.replay(source="alpha")
assert len(alpha_events) == 1
assert alpha_events[0].source == "alpha"
async def test_replay_filters_by_task_id(self, persistent_bus):
"""Replay should filter by task_id in data."""
await persistent_bus.publish(
Event(type="task.started", source="s", data={"task_id": "abc"})
)
await persistent_bus.publish(
Event(type="task.started", source="s", data={"task_id": "xyz"})
)
events = persistent_bus.replay(task_id="abc")
assert len(events) == 1
assert events[0].data["task_id"] == "abc"
async def test_replay_respects_limit(self, persistent_bus):
"""Replay should respect the limit parameter."""
for i in range(10):
await persistent_bus.publish(Event(type="x", source="s"))
events = persistent_bus.replay(limit=3)
assert len(events) == 3
async def test_persistence_failure_does_not_crash(self, tmp_path):
"""If persistence fails, publish should still work (graceful degradation)."""
bus = EventBus()
        # Persistence enabled; publish must not raise even if the write path misbehaves
bus.enable_persistence(tmp_path / "events.db")
received = []
@bus.subscribe("test.event")
async def handler(event):
received.append(event)
# Should not raise even if persistence has issues
count = await bus.publish(Event(type="test.event", source="test"))
assert count == 1
assert len(received) == 1
async def test_bus_without_persistence_still_works(self):
"""EventBus should work fine without persistence enabled."""
bus = EventBus()
received = []
@bus.subscribe("x")
async def handler(event):
received.append(event)
await bus.publish(Event(type="x", source="s"))
assert len(received) == 1
# replay returns empty when no persistence
events = bus.replay()
assert events == []
async def test_wal_mode_on_persistence_db(self, persistent_bus):
"""Persistence database should use WAL mode."""
conn = sqlite3.connect(str(persistent_bus._persistence_db_path))
try:
mode = conn.execute("PRAGMA journal_mode").fetchone()[0]
assert mode == "wal"
finally:
conn.close()
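The replay filters these tests exercise can be pictured with a miniature SQLite-backed store. This is a hypothetical sketch, not the EventBus's actual schema: table layout, column names, and the `replay` signature are assumptions for illustration only.

```python
import json
import sqlite3
import tempfile
from pathlib import Path

db = Path(tempfile.mkdtemp()) / "events.db"
conn = sqlite3.connect(str(db))
conn.execute("PRAGMA journal_mode=WAL")
conn.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, type TEXT, source TEXT, data TEXT)")
for etype, source, data in [
    ("task.created", "test", {"task_id": "t1"}),
    ("agent.joined", "test", {}),
]:
    conn.execute(
        "INSERT INTO events (type, source, data) VALUES (?, ?, ?)",
        (etype, source, json.dumps(data)),
    )
conn.commit()

def replay(event_type=None, limit=50):
    """Return stored events, optionally filtered by type, newest first."""
    query, args = "SELECT type, source, data FROM events", []
    if event_type:
        query += " WHERE type = ?"
        args.append(event_type)
    query += " ORDER BY id DESC LIMIT ?"
    args.append(limit)
    return [
        {"type": t, "source": s, "data": json.loads(d)}
        for t, s, d in conn.execute(query, args)
    ]

tasks = replay(event_type="task.created")
```

Reading through SQLite rather than in-memory history is what lets a second `EventBus` pointed at the same file see events it never published, as `test_replay_returns_persisted_events` verifies.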

View File

@@ -208,3 +208,31 @@ class TestCustomModelDataclass:
assert ModelFormat.SAFETENSORS.value == "safetensors"
assert ModelFormat.HF_CHECKPOINT.value == "hf"
assert ModelFormat.OLLAMA.value == "ollama"
class TestWALMode:
"""Verify WAL mode is enabled for the model registry database."""
def test_registry_db_uses_wal(self, tmp_path):
db = tmp_path / "wal_test.db"
with patch("infrastructure.models.registry.DB_PATH", db):
from infrastructure.models.registry import _get_conn
conn = _get_conn()
try:
mode = conn.execute("PRAGMA journal_mode").fetchone()[0]
assert mode == "wal"
finally:
conn.close()
def test_registry_db_busy_timeout(self, tmp_path):
db = tmp_path / "wal_test.db"
with patch("infrastructure.models.registry.DB_PATH", db):
from infrastructure.models.registry import _get_conn
conn = _get_conn()
try:
timeout = conn.execute("PRAGMA busy_timeout").fetchone()[0]
assert timeout == 5000
finally:
conn.close()

View File

@@ -0,0 +1,66 @@
"""Tests for swarm.event_log — WAL mode, basic operations, and EventBus bridge."""
import pytest
from swarm.event_log import EventType, _ensure_db, log_event
@pytest.fixture(autouse=True)
def tmp_event_db(tmp_path, monkeypatch):
"""Redirect event_log writes to a temp directory."""
db_path = tmp_path / "events.db"
monkeypatch.setattr("swarm.event_log.DB_PATH", db_path)
yield db_path
class TestEventLogWAL:
"""Verify WAL mode is enabled for the event log database."""
def test_event_db_uses_wal(self):
conn = _ensure_db()
try:
mode = conn.execute("PRAGMA journal_mode").fetchone()[0]
assert mode == "wal", f"Expected WAL mode, got {mode}"
finally:
conn.close()
def test_event_db_busy_timeout(self):
conn = _ensure_db()
try:
timeout = conn.execute("PRAGMA busy_timeout").fetchone()[0]
assert timeout == 5000
finally:
conn.close()
class TestEventLogBasics:
"""Basic event logging operations."""
def test_log_event_returns_entry(self):
entry = log_event(EventType.SYSTEM_INFO, source="test", data={"msg": "hello"})
assert entry.id
assert entry.event_type == EventType.SYSTEM_INFO
assert entry.source == "test"
def test_log_event_persists(self):
log_event(EventType.TASK_CREATED, source="test", task_id="t1")
from swarm.event_log import get_task_events
events = get_task_events("t1")
assert len(events) == 1
assert events[0].event_type == EventType.TASK_CREATED
def test_log_event_with_agent_id(self):
entry = log_event(
EventType.AGENT_JOINED,
source="test",
agent_id="forge",
data={"persona_id": "forge"},
)
assert entry.agent_id == "forge"
def test_log_event_data_roundtrip(self):
data = {"bid_sats": 42, "reason": "testing"}
entry = log_event(EventType.BID_SUBMITTED, source="test", data=data)
assert entry.data["bid_sats"] == 42
assert entry.data["reason"] == "testing"

View File

@@ -473,3 +473,40 @@ class TestSparkRoutes:
def test_spark_insights(self, client):
resp = client.get("/spark/insights")
assert resp.status_code == 200
# ── WAL Mode ──────────────────────────────────────────────────────────────
class TestWALMode:
"""Verify SQLite WAL mode is enabled for all Spark databases."""
def test_spark_memory_uses_wal(self):
from spark.memory import _get_conn
conn = _get_conn()
try:
mode = conn.execute("PRAGMA journal_mode").fetchone()[0]
assert mode == "wal", f"Expected WAL mode, got {mode}"
finally:
conn.close()
def test_spark_eidos_uses_wal(self):
from spark.eidos import _get_conn
conn = _get_conn()
try:
mode = conn.execute("PRAGMA journal_mode").fetchone()[0]
assert mode == "wal", f"Expected WAL mode, got {mode}"
finally:
conn.close()
def test_spark_memory_busy_timeout(self):
from spark.memory import _get_conn
conn = _get_conn()
try:
timeout = conn.execute("PRAGMA busy_timeout").fetchone()[0]
assert timeout == 5000
finally:
conn.close()

tests/test_lazy_init.py
View File

@@ -0,0 +1,150 @@
"""Tests for lazy singleton initialization (Ticket #2).
Verifies that importing modules does NOT trigger heavy side effects
(DB connections, HTTP calls, sys.exit, directory creation) and that
lazy getters return stable, resettable singleton instances.
"""
import sys
from unittest.mock import patch
import pytest
class TestConfigLazyValidation:
"""config.py should not run startup validation at import time."""
def test_import_config_does_not_exit(self):
"""Importing config should never call sys.exit, even in production."""
with patch.dict("os.environ", {"TIMMY_ENV": "production"}, clear=False):
            # config is typically already imported here; verify the explicit init path exists
if "config" in sys.modules:
mod = sys.modules["config"]
# validate_startup should exist as a callable
assert callable(
getattr(mod, "validate_startup", None)
), "config.validate_startup() must exist as an explicit init path"
def test_validate_startup_exits_on_missing_secrets_in_production(self):
"""validate_startup() should exit in production when secrets are missing."""
from config import settings, validate_startup
with (
patch.object(settings, "timmy_env", "production"),
patch.object(settings, "l402_hmac_secret", ""),
patch.object(settings, "l402_macaroon_secret", ""),
pytest.raises(SystemExit),
):
validate_startup(force=True)
def test_validate_startup_ok_with_secrets(self):
"""validate_startup() should not exit when secrets are set."""
from config import settings, validate_startup
with (
patch.object(settings, "timmy_env", "production"),
patch.object(settings, "l402_hmac_secret", "test-secret-hex-value-32"),
patch.object(settings, "l402_macaroon_secret", "test-macaroon-hex-value-32"),
):
# Should not raise
validate_startup(force=True)
def test_validate_startup_skips_in_test_mode(self):
"""validate_startup() should be a no-op in test mode."""
from config import validate_startup
# TIMMY_TEST_MODE=1 is set by conftest — should not raise
validate_startup()
class TestSparkEngineLazy:
"""spark.engine should not create the engine at import time."""
def test_get_spark_engine_returns_instance(self):
"""get_spark_engine() should return a SparkEngine."""
from spark.engine import SparkEngine, get_spark_engine
engine = get_spark_engine()
assert isinstance(engine, SparkEngine)
def test_get_spark_engine_is_singleton(self):
"""Repeated calls return the same instance."""
from spark.engine import get_spark_engine
a = get_spark_engine()
b = get_spark_engine()
assert a is b
def test_get_spark_engine_reset(self):
"""reset_spark_engine() allows re-initialization for tests."""
from spark.engine import get_spark_engine, reset_spark_engine
a = get_spark_engine()
reset_spark_engine()
b = get_spark_engine()
assert a is not b
def test_spark_engine_backward_compat(self):
"""spark_engine module-level name still works via get_spark_engine."""
from spark.engine import spark_engine
assert spark_engine is not None
class TestMemorySystemLazy:
"""timmy.memory_system should not create the system at import time."""
def test_get_memory_system_returns_instance(self):
"""get_memory_system() should return a MemorySystem."""
from timmy.memory_system import MemorySystem, get_memory_system
ms = get_memory_system()
assert isinstance(ms, MemorySystem)
def test_get_memory_system_is_singleton(self):
"""Repeated calls return the same instance."""
from timmy.memory_system import get_memory_system
a = get_memory_system()
b = get_memory_system()
assert a is b
def test_get_memory_system_reset(self):
"""reset_memory_system() allows re-initialization for tests."""
from timmy.memory_system import get_memory_system, reset_memory_system
a = get_memory_system()
reset_memory_system()
b = get_memory_system()
assert a is not b
def test_memory_system_backward_compat(self):
"""memory_system module-level name still works."""
from timmy.memory_system import memory_system
assert memory_system is not None
class TestEventBusLazy:
"""infrastructure.events.bus should use lazy initialization."""
def test_get_event_bus_returns_instance(self):
"""get_event_bus() should return an EventBus."""
from infrastructure.events.bus import EventBus, get_event_bus
bus = get_event_bus()
assert isinstance(bus, EventBus)
def test_get_event_bus_is_singleton(self):
"""Repeated calls return the same instance."""
from infrastructure.events.bus import get_event_bus
a = get_event_bus()
b = get_event_bus()
assert a is b
def test_event_bus_backward_compat(self):
"""event_bus module-level name still works."""
from infrastructure.events.bus import event_bus
assert event_bus is not None