docs/SOVEREIGN_AGI_RESEARCH.md

# Sovereign AGI Research: Maintainability & Scalability Analysis

**Date:** 2026-03-08
**Scope:** Deep architecture review of Timmy Time Dashboard with actionable
recommendations for evolving toward a robust sovereign AGI.
**Inputs:** Codebase analysis, ROADMAP.md, REVELATION_PLAN.md, Kimi's Ghost
Core Knowledge Transfer (March 9, 2026), Grok's research validation
(March 9, 2026)

---

## Executive Summary

Timmy Time has a solid foundation: local-first inference, graceful degradation,
event-driven architecture, and a clear module boundary system. However, several
architectural patterns need to evolve to support the sovereign AGI vision —
particularly around **memory durability**, **agent coordination determinism**,
**interface contracts**, and **dependency injection**.

The Ghost Core vision (from Kimi's consolidation) proposes a radical
minimization: the current ~13K-line monolith compressed to a ~2,000-line
cognitive kernel that orchestrates intelligence through YAML workflows and
containerized tools. This report synthesizes both perspectives — the
incremental improvements needed now and the architectural target state — into
a unified roadmap with 14 specific areas for improvement across 4 priority
tiers.

**The core tension this report resolves:** How to move from "dashboard that
contains an agent" to "agent that exposes a dashboard" without breaking what
already works.

---

## Table of Contents

1. [Current Architecture Assessment](#1-current-architecture-assessment)
2. [Critical Maintainability Issues](#2-critical-maintainability-issues)
3. [Scalability Bottlenecks](#3-scalability-bottlenecks)
4. [Sovereign AGI Architecture Recommendations](#4-sovereign-agi-architecture-recommendations)
5. [The Ghost Core Vision](#5-the-ghost-core-vision)
6. [Memory System Evolution](#6-memory-system-evolution)
7. [Agent Coordination & Orchestration](#7-agent-coordination--orchestration)
8. [Testing Strategy for Non-Deterministic Systems](#8-testing-strategy-for-non-deterministic-systems)
9. [Plugin & Extension Architecture](#9-plugin--extension-architecture)
10. [Self-Improvement Loops](#10-self-improvement-loops)
11. [Implementation Priority Matrix](#11-implementation-priority-matrix)

---

## 1. Current Architecture Assessment

### Strengths

| Area | Implementation | Quality |
|------|---------------|---------|
| Config management | `pydantic-settings` with env/file cascade | Excellent |
| Graceful degradation | Try/except with fallback at every integration point | Excellent |
| Event system | Async EventBus with wildcard subscriptions | Good |
| LLM routing | CascadeRouter with circuit breakers | Good |
| Memory tiers | Hot (MEMORY.md) → Vault (markdown) → Semantic (SQLite+vectors) | Good foundation |
| Module boundaries | 8 packages with clear responsibilities | Good |
| Multi-backend LLM | Ollama/AirLLM/Grok/Claude with auto-detection | Good |
| Security posture | CSRF, security headers, secret validation, telemetry off | Good |

### Architecture Diagram (Current State)

```
┌──────────────────────────────────────────────────────────────┐
│                    Dashboard (FastAPI)                        │
│  23 route modules · HTMX+Jinja2 · WebSocket · CSRF          │
├───────────────┬──────────────┬───────────────┬───────────────┤
│   timmy/      │  spark/      │  swarm/       │ timmy_serve/  │
│ Agent core    │ Intelligence │ Task queue    │ API server    │
│ Memory system │ EIDOS predict│ Event log     │ Voice TTS     │
│ Agentic loop  │ Advisory     │               │ Inter-agent   │
├───────────────┴──────────────┴───────────────┴───────────────┤
│                   infrastructure/                             │
│ CascadeRouter · EventBus · WebSocket · Notifications · Hands │
├──────────────────────────────────────────────────────────────┤
│                   integrations/                               │
│ Discord · Telegram · Siri · Voice NLU · Paperclip · ChatBridge│
├──────────────────────────────────────────────────────────────┤
│                      brain/                                   │
│ UnifiedMemory (SQLite+vectors) · Embeddings · rqlite client  │
├──────────────────────────────────────────────────────────────┤
│                    config.py                                  │
│ 90+ settings · pydantic-settings · env/file cascade          │
└──────────────────────────────────────────────────────────────┘
```

---

## 2. Critical Maintainability Issues

### 2.1 Singleton Proliferation (Severity: HIGH)

**Problem:** The codebase uses module-level singletons extensively:
- `config.settings` (global mutable state)
- `event_bus` (infrastructure/events/bus.py)
- `memory_system` (timmy/memory_system.py)
- `spark_engine` (spark/engine.py)
- `cascade_router` (infrastructure/router/cascade.py)
- `ws_manager`, `notifier`, etc.

These singletons are instantiated at import time, making testing difficult and
creating hidden dependencies between modules. A test that imports `spark.engine`
triggers a chain: `spark_engine` → `settings` → reads `.env` → tries to
connect to SQLite.

**Recommendation:** Migrate to a lightweight dependency injection container.
Not a heavy framework — a simple registry pattern:

```python
# infrastructure/container.py
class Container:
    """Lightweight DI container — singletons with lazy init."""

    _registry: dict[str, Any] = {}
    _factories: dict[str, Callable] = {}

    @classmethod
    def register(cls, name: str, factory: Callable):
        cls._factories[name] = factory

    @classmethod
    def get(cls, name: str) -> Any:
        if name not in cls._registry:
            cls._registry[name] = cls._factories[name]()
        return cls._registry[name]

    @classmethod
    def reset(cls):
        """For tests — clear all instances."""
        cls._registry.clear()
```

This preserves the simplicity of `from x import singleton` while enabling
test isolation and lazy initialization.

### 2.2 Dual Memory Systems (Severity: HIGH)

**Problem:** There are TWO independent memory systems that don't talk to each
other:

1. **`timmy/memory_system.py`** — `MemorySystem` with HotMemory (MEMORY.md),
   VaultMemory (markdown files), HandoffProtocol. Used by the agent creation
   path (`agent.py`).

2. **`brain/memory.py`** — `UnifiedMemory` with SQLite-backed memories, facts,
   embeddings, rqlite support. Used by brain-related routes and tools.

Data stored in one system is invisible to the other. The agent's context
includes HotMemory (MEMORY.md) but not brain facts. The brain stores semantic
memories but doesn't know about vault markdown files.

**Recommendation:** Unify under `brain/memory.py` (UnifiedMemory) as the single
source of truth. Make `MemorySystem` a thin orchestration layer that delegates:
- Hot memory → a "hot" table/partition in UnifiedMemory
- Vault → keep markdown files but index them in UnifiedMemory for search
- Handoff → serialize to UnifiedMemory with a "handoff" tag

### 2.3 Config Monolith (Severity: MEDIUM)

**Problem:** `config.py` has 90+ settings in a single `Settings` class. This
violates single-responsibility and makes it hard to understand which settings
belong to which module. Adding a new feature means touching the same 280-line
file every time.

**Recommendation:** Split into namespaced configs using pydantic's nested model
support:

```python
class LLMConfig(BaseModel):
    ollama_url: str = "http://localhost:11434"
    ollama_model: str = "qwen3.5:latest"
    # ... all LLM settings

class MemoryConfig(BaseModel):
    prune_days: int = 90
    vault_max_mb: int = 100
    # ... all memory settings

class Settings(BaseSettings):
    llm: LLMConfig = LLMConfig()
    memory: MemoryConfig = MemoryConfig()
    # ...
```

This can be done incrementally — start with the largest groups (LLM, memory,
security, creative) and keep the flat namespace available via properties for
backwards compatibility.

### 2.4 Import-Time Side Effects (Severity: MEDIUM)

**Problem:** Several modules execute significant logic at import time:
- `config.py` lines 342-376: production validation, logging setup, `sys.exit(1)`
- `spark/engine.py` line 359: `spark_engine = _create_engine()`
- `swarm/event_log.py`: SQLite table creation on `_ensure_db()`
- `dashboard/app.py`: middleware setup, router registration

This means importing any module can trigger database connections, file I/O, or
even process exits. This makes testing fragile and increases startup time.

**Recommendation:** Guard all side effects behind explicit initialization:
```python
# Instead of module-level execution:
spark_engine = _create_engine()

# Use lazy pattern:
_spark_engine = None
def get_spark_engine() -> SparkEngine:
    global _spark_engine
    if _spark_engine is None:
        _spark_engine = _create_engine()
    return _spark_engine
```

The `brain/memory.py` already does this correctly with `get_memory()`.

---

## 3. Scalability Bottlenecks

### 3.1 SQLite Concurrency (Severity: HIGH for AGI vision)

**Problem:** Multiple components write to different SQLite databases:
- `data/brain.db` — UnifiedMemory
- `data/events.db` — Swarm event log
- `timmy.db` — Agno conversation history
- Spark memory (internal SQLite)

SQLite has a single-writer lock. With multiple async tasks (agentic loop,
background thinking, event capture, WebSocket handlers), write contention will
increase as the system becomes more autonomous.

**Current impact:** Low (single-user system), but the AGI vision requires
concurrent agent operation.

**Recommendations (progressive):**

1. **Immediate:** Use WAL mode for all SQLite databases:
   ```python
   conn.execute("PRAGMA journal_mode=WAL")
   conn.execute("PRAGMA busy_timeout=5000")
   ```

2. **Near-term:** Consolidate into fewer databases. Currently 4+ separate
   `.db` files that could be 1-2 databases with proper schemas.

3. **Medium-term:** Move to DuckDB for analytical queries (agent performance
   metrics, Spark intelligence) while keeping SQLite for transactional data.

4. **Long-term:** The rqlite path (`brain/client.py`) is the right direction
   for distributed memory across federated nodes.

### 3.2 In-Memory Event Bus (Severity: MEDIUM)

**Problem:** The `EventBus` at `infrastructure/events/bus.py` is in-memory
only. Events are lost on restart. History is capped at 1000 entries. No
persistence, no replay.

For a sovereign AGI, every event is valuable data. Agent decisions, tool
executions, task outcomes — all of this is training signal.

**Recommendations:**

1. **Persist events to SQLite** (the swarm `event_log.py` already does this
   separately — unify these two systems).

2. **Add event replay** for debugging agent behavior:
   ```python
   # Replay all events from a specific task
   events = event_bus.replay(task_id="abc123")
   ```

3. **Event sourcing pattern:** Make events the source of truth for system
   state. Current state is derived by replaying events. This enables:
   - Time-travel debugging
   - State reconstruction after crashes
   - Deterministic replay for testing

### 3.3 Synchronous Agent Execution (Severity: MEDIUM)

**Problem:** The `BaseAgent.run()` method calls `self.agent.run()` synchronously
(Agno's run is blocking). The agentic loop wraps this in `asyncio.to_thread()`
which works but creates thread contention with multiple concurrent agents.

**Recommendation:** Create a proper async agent execution pool:

```python
class AgentPool:
    """Manages concurrent agent execution with backpressure."""

    def __init__(self, max_concurrent: int = 4):
        self._semaphore = asyncio.Semaphore(max_concurrent)
        self._executor = ThreadPoolExecutor(max_workers=max_concurrent)

    async def execute(self, agent: Agent, message: str) -> str:
        async with self._semaphore:
            loop = asyncio.get_event_loop()
            return await loop.run_in_executor(
                self._executor,
                lambda: agent.run(message, stream=False)
            )
```

### 3.4 Vector Search Performance (Severity: MEDIUM)

**Problem:** The current vector search in `brain/memory.py` loads ALL embeddings
into memory and computes cosine similarity in Python:

```python
rows = conn.execute("SELECT ... FROM memories WHERE embedding IS NOT NULL").fetchall()
for row in rows:
    stored_vec = np.frombuffer(row["embedding"], dtype=np.float32)
    score = float(np.dot(query_vec, stored_vec))
```

This is O(n) with no indexing. At 100K+ memories, this becomes a significant
latency bottleneck.

**Recommendations (as documented in ROADMAP.md Phase 3):**
1. **sqlite-vec** — drop-in SQLite extension, zero new dependencies
2. **LanceDB** — embedded, disk-based, IVF_PQ indexing, handles millions
3. **Qdrant** — only if federated search across nodes is needed

sqlite-vec is the clear first step given the SQLite-native architecture.

---

## 4. Sovereign AGI Architecture Recommendations

### 4.1 The Perceive-Decide-Act-Remember Loop

The REVELATION_PLAN.md defines the right core interface:

```python
class TimAgent(ABC):
    async def perceive(self, input) -> WorldState
    async def decide(self, state) -> Action
    async def act(self, action) -> Result
    async def remember(self, key, value)
    async def recall(self, key) -> Value
```

**Current gap:** The existing agent system (Agno-based) doesn't implement this
pattern. The `BaseAgent` and `SubAgent` classes just wrap `Agent.run()` with
event bus integration. There's no structured perception→decision→action cycle.

**Recommendation:** Introduce this as the **AgentCore** interface, sitting
between the current Agno wrapper and the orchestrator:

```python
# timmy/agent_core/interface.py (file already exists but is minimal)

class AgentCore(ABC):
    """Sovereign agent loop — the fundamental cognitive cycle."""

    @abstractmethod
    async def perceive(self, inputs: list[Perception]) -> WorldState:
        """Gather and fuse sensory inputs."""

    @abstractmethod
    async def decide(self, state: WorldState, memory: MemoryContext) -> Plan:
        """Deliberate and choose actions."""

    @abstractmethod
    async def act(self, plan: Plan) -> list[ActionResult]:
        """Execute planned actions."""

    @abstractmethod
    async def reflect(self, results: list[ActionResult]) -> list[Memory]:
        """Extract learnings from action results."""
```

The `reflect` step is the key addition — it closes the loop for self-improvement.

### 4.2 Capability-Based Security Model

For a sovereign AGI, the permission model must evolve from "config flags" to
a capability-based system:

```python
@dataclass
class Capability:
    name: str           # "shell.execute", "git.push", "memory.write"
    scope: str          # "project", "system", "network"
    requires_approval: bool
    max_cost_sats: int  # Economic bound on the capability

class AgentPermissions:
    """Each agent has a set of capabilities, not global flags."""
    capabilities: dict[str, Capability]

    def can(self, action: str) -> bool:
        cap = self.capabilities.get(action)
        return cap is not None and not cap.requires_approval
```

Currently, permissions are global (`self_modify_enabled`,
`hands_shell_enabled`). For a multi-agent sovereign system, each agent needs
its own permission set with escalation paths.

### 4.3 Substrate Abstraction Layer

The REVELATION_PLAN envisions multiple substrates (Cloud, Desktop, Robot, Sim).
The current architecture is tightly coupled to the "cloud/server" substrate.

**Recommendation:** Introduce a substrate interface early:

```python
class Substrate(ABC):
    """Where and how the agent executes."""

    @abstractmethod
    def get_llm_provider(self) -> LLMProvider: ...

    @abstractmethod
    def get_memory_backend(self) -> MemoryBackend: ...

    @abstractmethod
    def get_io_channels(self) -> list[IOChannel]: ...

    @abstractmethod
    def get_capabilities(self) -> list[Capability]: ...

class ServerSubstrate(Substrate):
    """Current implementation — Ollama + SQLite + FastAPI."""
    ...

class RobotSubstrate(Substrate):
    """RPi + camera + motors + local Ollama."""
    ...
```

This doesn't require rewriting existing code — it's an abstraction that wraps
current components and makes substrate-switching possible.

---

## 5. The Ghost Core Vision

*Source: Kimi's Knowledge Transfer Document (March 9, 2026)*

### 5.1 The Proposition

The Ghost Core philosophy proposes a fundamental inversion of the current
architecture:

```
CURRENT:  Dashboard (monolith) → contains agents → agents use tools
TARGET:   Ghost Core (~2K lines) → orchestrates workflows → tools are containers
```

**The formula:**
```
Ghost Core (ReAct+Reflexion)
  + Workflow Layer (YAML-defined, self-modifying)
  + Tool Registry (dynamic container spin-up/spin-down)
  + Three-Tier Memory (hot/vault/semantic)
  + Lightning Economic Layer
  = Sovereign, Wealth-Generating Agent
```

### 5.2 Ghost Core Module Budget

The proposal enforces a strict 2,000-line limit for `src/timmy/`:

| Module | Lines | Purpose |
|--------|-------|---------|
| `ghost.py` | ~150 | ReAct loop: Observe → Plan → Act → Reflect |
| `reflexion.py` | ~100 | Critique generation + lesson extraction |
| `workflow_engine.py` | ~200 | YAML loader, step executor, state machine |
| `tool_registry.py` | ~200 | Dynamic tool discovery, spawn, health check |
| `memory_system.py` | ~300 | Hot/Vault/Semantic memory interface (existing) |
| `backends.py` | ~200 | Ollama/AirLLM/Claude/Grok adapters |
| `config.py` | ~150 | Pydantic-settings (existing) |
| `lightning_wallet.py` | ~200 | L402 handling, invoice generation, balance |
| `utils/` | ~300 | Shared helpers, logging, serialization |
| **Total** | **~1,600** | **Headroom for 400 lines critical fixes** |

### 5.3 What Gets Externalized

Everything heavy moves to Docker containers with standardized HTTP APIs:

| Current In-Process | Target Container | Interface |
|-------------------|-----------------|-----------|
| Discord/Telegram bots | `timmy-bridges` | POST /ingest |
| Voice (pyttsx3) | `timmy-voice` | POST /speak, /transcribe |
| Creative tools | `timmy-creative` | POST /generate |
| Web scraping | `timmy-tools/scraper` | POST /execute |
| Experiment runners | `timmy-tools/lab` | POST /execute |

### 5.4 The ReAct+Reflexion Loop

```python
class GhostCore:
    async def run(self, goal: str, workflow_id: Optional[str] = None) -> Result:
        workflow = (self.workflow_engine.load(workflow_id)
                    if workflow_id
                    else self.planner.create_workflow(goal))

        context = {
            "goal": goal,
            "memory": self.memory.retrieve_relevant(goal),
            "lessons": self.reflexion.get_lessons(goal),
            "wallet": self.lightning.get_balance()
        }

        for step in workflow.steps:
            state = await self.observe(step, context)
            tool = await self.registry.get(step.tool)
            result = await tool.execute(step.params, context)

            critique = await self.reflexion.critique(step, result, context)
            if not critique.success:
                workflow.insert_step(critique.fix_step)
                continue

            context["results"].append(result)
            if critique.is_novel:
                await self.reflexion.store(critique.lesson)

        summary = await self.reflexion.summarize(context)
        return Result(context["results"], summary)
```

### 5.5 YAML Workflow Layer

Intelligence lives in workflows, not code. Workflows are YAML files that the
Ghost Core executes. Timmy can modify his own workflows.

```yaml
workflow_id: alpha_hunter_v2
name: Alpha Hunter - Crypto Scanner
goal: Find 3 undervalued assets with 5x potential

budget:
  max_sats_per_run: 1000
  estimated_time: 300s

tools_required:
  - name: web_scraper
    image: timmy-tools/scraper:latest
  - name: sentiment_analyzer
    image: timmy-tools/sentiment:latest

steps:
  - id: fetch_market_data
    tool: web_scraper
    action: scrape
    params: { source: coingecko, assets: top_100 }
    output: market_data

  - id: analyze_sentiment
    tool: sentiment_analyzer
    action: analyze
    input: ${market_data.asset_names}
    output: sentiment_scores

  - id: validate_candidates
    tool: web_scraper
    action: deep_dive
    foreach: ${candidates}
    output: reports

self_modify:
  enabled: true
  max_iterations: 5
  persist_lessons: true
```

### 5.6 Dynamic Tool Registry

Tools are Docker images spawned on demand, health-checked, and shut down after
idle timeout:

```python
class ToolRegistry:
    async def get(self, tool_name: str) -> ToolEndpoint:
        if tool_name in self.active_containers:
            return self.active_containers[tool_name]

        image = f"timmy-tools/{tool_name}:latest"
        if not self.docker.has_image(image):
            await self.docker.pull(image)

        port = self._find_free_port()
        container = await self.docker.run(image, ports={f"{port}/tcp": port})
        endpoint = ToolEndpoint(url=f"http://localhost:{port}", ...)

        if not await self._health_check(endpoint):
            await self.destroy(tool_name)
            raise ToolUnavailable(tool_name)

        self.active_containers[tool_name] = endpoint
        asyncio.create_task(self._idle_watcher(tool_name))
        return endpoint
```

All tool containers expose a standardized contract:
- `GET /health` — capability advertisement
- `POST /execute` — stateless command execution
- `POST /shutdown` — graceful teardown

### 5.7 Economic Layer

Every workflow has a Lightning budget. Tools cost sats to run:

```python
class LightningWallet:
    async def execute_with_budget(self, workflow: Workflow) -> Result:
        estimated_cost = self.estimate_cost(workflow.tools_required)
        if self.balance < estimated_cost:
            await self.request_funding(estimated_cost)

        self.escrow[run_id] = estimated_cost
        self.balance -= estimated_cost

        try:
            result = await self.execute_workflow(workflow)
            actual_cost = self.calculate_actual_cost(result)
            self.balance += (estimated_cost - actual_cost)
            return result
        except Exception:
            self.balance += estimated_cost
            raise
```

### 5.8 Assessment: Ghost Core vs Current Architecture

**Where Ghost Core aligns with this research:**
- ReAct+Reflexion loop = Section 4.1's perceive-decide-act-reflect
- YAML workflows = formalized version of the agentic loop
- Dynamic tool registry = Section 8's plugin architecture
- Economic layer = REVELATION_PLAN's Lightning treasury
- Self-modifying workflows = Section 9's self-improvement loops

**Open questions requiring decisions (see Interview section):**

1. **Migration strategy:** Big-bang rewrite to ~2K lines vs incremental
   extraction? The 2K line limit is aspirational but risks breaking working
   code.

2. **Docker dependency:** The tool registry assumes Docker everywhere. What
   about bare-metal RPi deployments? Need a substrate-aware fallback.

3. **YAML as intelligence:** Workflows in YAML are powerful for structured
   tasks but may be limiting for emergent agent behavior. Need both
   structured (YAML) and unstructured (LLM-driven) execution paths.

4. **Dashboard fate:** The Ghost Core vision externalizes the dashboard, but
   the dashboard IS the primary user interface today. Needs careful
   migration planning.

### 5.9 Proposed Migration Phases

| Phase | Weeks | Scope | Risk |
|-------|-------|-------|------|
| 1. Extract Bots | 1-2 | Discord/Telegram to `timmy-bridges` container | Low |
| 2. Workflow Engine | 3-4 | `workflow_engine.py` + convert hardcoded logic to YAML | Medium |
| 3. Tool Registry | 5-6 | Dockerize first tool, implement spawn/destroy | Medium |
| 4. Economic Layer | 7-8 | Lightning wallet integration, budget constraints | Medium |
| 5. Reflexion | 9-10 | Self-critique, YAML auto-modification, LESSONS.md | Low |

### 5.10 Anti-Patterns to Enforce

From the Ghost Core philosophy, enforceable via CI:

- Line count check: CI warns if `src/timmy/` exceeds target threshold
- No in-process heavy tools: PyTorch, transformers in containers only
- All capabilities as workflows or registered tools
- No cloud dependencies in core (user-provided API keys are opt-in)
- Every YAML workflow must have a contract test

---

## 6. Memory System Evolution

### 6.1 Current State Analysis

The system has **four** separate memory stores:

| Store | Location | Purpose | Technology |
|-------|----------|---------|------------|
| Hot Memory | `MEMORY.md` | Always-loaded context | Flat file |
| Vault | `memory/` dir | Structured notes | Markdown files |
| UnifiedMemory | `data/brain.db` | Semantic search, facts | SQLite + embeddings |
| Agno DB | `timmy.db` | Conversation history | SQLite (Agno-managed) |

Plus Spark has its own event/memory SQLite tables.

### 6.2 Target Architecture: Unified Memory with Tiers

```
┌──────────────────────────────────────────────────────────┐
│                    MemoryFacade                           │
│  Single API for all memory operations                    │
├──────────────────────────────────────────────────────────┤
│  Working Memory    │ Episodic Memory  │ Semantic Memory  │
│  (current context, │ (conversations,  │ (facts, skills,  │
│   active plans,    │  events, actions) │  patterns)       │
│   tool state)      │                  │                  │
├──────────────────────────────────────────────────────────┤
│  Storage Backend (pluggable)                             │
│  SQLite (dev) → DuckDB (analytics) → rqlite (federation)│
└──────────────────────────────────────────────────────────┘
```

**Working Memory** replaces MEMORY.md and hot memory:
- In-memory cache with periodic persistence
- Current conversation context, active plans, tool state
- Size-limited with LRU eviction
- Survives restarts via checkpoint

**Episodic Memory** replaces vault notes and conversation history:
- Timestamped records of events, conversations, decisions
- Compressed summaries for older episodes
- Searchable by time range, tags, participants
- The agentic loop steps should automatically become episodes

**Semantic Memory** replaces brain facts and vector store:
- Long-term knowledge: user preferences, learned patterns, world knowledge
- Vector-indexed for similarity search
- Confidence-scored with decay over time
- Facts extracted from episodic memory via reflection

### 6.3 Memory Consolidation Pipeline

For a sovereign AGI, raw memories must be consolidated into durable knowledge:

```
Raw Events → Working Memory → Episodic Summary → Semantic Extraction
                                    ↓
                           Reflection Agent
                                    ↓
                    Patterns / Skills / Facts
```

The Spark engine already has a primitive version of this
(`_maybe_consolidate`). Generalize it:

```python
class MemoryConsolidator:
    """Periodic background task that strengthens memories."""

    async def consolidate(self):
        # 1. Summarize recent episodes into condensed form
        recent = await self.memory.get_episodes(hours=24)
        summary = await self.llm.summarize(recent)
        await self.memory.store_episode_summary(summary)

        # 2. Extract facts from summaries
        facts = await self.llm.extract_facts(summary)
        for fact in facts:
            await self.memory.store_fact(fact)

        # 3. Decay old, unaccessed memories
        await self.memory.decay_unused(older_than_days=30)

        # 4. Strengthen frequently accessed memories
        await self.memory.reinforce_popular(min_access_count=5)
```

---

## 7. Agent Coordination & Orchestration

### 7.1 Current Orchestrator Analysis

`TimmyOrchestrator` in `timmy/agents/timmy.py` routes requests using keyword
matching:

```python
direct_patterns = ["your name", "who are you", "hello", ...]
memory_patterns = ["we talked about", "remember", ...]
```

This is fragile — it doesn't handle ambiguous requests well and doesn't learn
from routing outcomes.

### 7.2 Deterministic State Machine

Replace keyword-based routing with an explicit state machine:

```python
from enum import Enum, auto

class TaskState(Enum):
    RECEIVED = auto()
    CLASSIFIED = auto()
    ROUTED = auto()
    EXECUTING = auto()
    REVIEWING = auto()
    COMPLETED = auto()
    FAILED = auto()

@dataclass
class TaskContext:
    state: TaskState
    request: str
    classification: dict  # intent, complexity, required_capabilities
    assigned_agent: str
    execution_history: list[StepResult]

    def transition(self, new_state: TaskState):
        # Validate transition is legal
        # Persist state for recovery
        # Emit event
        ...
```

**Key benefits:**
- Every task has a clear audit trail
- Failed tasks can be replayed from last checkpoint
- State is serializable — survives restarts
- Transitions emit events → Spark can learn from routing patterns

### 7.3 Intelligent Routing (Replace Keyword Matching)

Instead of regex patterns, use the LLM to classify intent:

```python
class IntentClassifier:
    """Classify user intent using the local LLM."""

    CLASSIFICATION_PROMPT = """Classify this request into ONE category:
    - DIRECT: Simple question, greeting, status check
    - RESEARCH: Needs external information gathering
    - CODE: Programming, file operations, tool building
    - MEMORY: Recall past conversations or decisions
    - CREATIVE: Writing, content generation
    - COMPLEX: Multi-step task requiring orchestration

    Request: {request}
    Category:"""

    async def classify(self, request: str) -> str:
        result = await self.llm.complete(
            self.CLASSIFICATION_PROMPT.format(request=request)
        )
        return result.strip().upper()
```

Cache classifications to avoid repeated LLM calls for similar requests.

### 7.4 Inter-Agent Communication Protocol

Currently agents communicate through the event bus with unstructured dicts.
Define a protocol:

```python
@dataclass
class AgentMessage:
    """Structured inter-agent communication."""
    from_agent: str
    to_agent: str
    message_type: str  # "request", "response", "delegate", "escalate"
    content: str
    context: dict       # Shared context for the conversation
    thread_id: str      # Groups related messages
    priority: int = 0
    requires_response: bool = True
```

---

## 8. Testing Strategy for Non-Deterministic Systems

### 8.1 Current Testing Gaps

The test suite has good coverage (73%+) but tests primarily cover deterministic
code paths. The agent behavior, LLM responses, and orchestration decisions are
largely untested because they're non-deterministic.

### 8.2 Testing Layers for AGI Systems

```
┌─────────────────────────────────────────────┐
│  Property-Based Tests                        │
│  "Agent always stores user facts"            │
│  "Memory never loses data on restart"        │
├─────────────────────────────────────────────┤
│  Behavioral Tests (with mock LLM)            │
│  "Given X input, agent routes to Y agent"    │
│  "Agentic loop produces 3+ steps"            │
├─────────────────────────────────────────────┤
│  Contract Tests                              │
│  "Agent always returns valid JSON"           │
│  "Memory API satisfies interface contract"   │
├─────────────────────────────────────────────┤
│  Unit Tests (current)                        │
│  "Config loads correctly"                    │
│  "Router circuits break at threshold"        │
└─────────────────────────────────────────────┘
```

### 8.3 Mock LLM for Deterministic Agent Tests

```python
class MockLLM:
    """Deterministic LLM for testing agent behavior."""

    def __init__(self, responses: dict[str, str]):
        self.responses = responses
        self.calls = []

    def complete(self, prompt: str) -> str:
        self.calls.append(prompt)
        for pattern, response in self.responses.items():
            if pattern in prompt.lower():
                return response
        return "I don't know."

# Usage in tests:
def test_orchestrator_routes_code_requests():
    llm = MockLLM({"classify": "CODE"})
    orch = TimmyOrchestrator(llm=llm)
    result = await orch.orchestrate("Write a Python function")
    assert result.routed_to == "forge"
```

### 8.4 Golden Path Tests

Record real agent interactions and replay them as regression tests:

```python
@pytest.mark.golden
def test_agentic_loop_solves_simple_task():
    """Golden test: agentic loop produces valid plan and executes it."""
    result = run_agentic_loop("List the files in the current directory")

    assert result.status in ("completed", "partial")
    assert len(result.steps) >= 1
    assert result.summary  # Non-empty summary
    assert result.total_duration_ms > 0
```

### 8.5 Chaos Testing for Resilience

```python
@pytest.mark.chaos
def test_agent_handles_ollama_crash():
    """Agent degrades gracefully when Ollama dies mid-request."""
    with SimulatedOllamaCrash(after_n_requests=2):
        result = run_agentic_loop("Complex multi-step task")
        assert result.status in ("partial", "failed")
        assert "error" not in result.summary.lower() or "recovered" in result.summary.lower()
```

---

## 9. Plugin & Extension Architecture

### 9.1 Current State

The MCP (Model Context Protocol) tool registry is partially implemented:
- `infrastructure/hands/tools.py` registers shell and git tools
- `BaseAgent._create_agent()` looks up tools from registry
- But the registry import is wrapped in `try/except` — it's optional

### 9.2 Recommendation: Tool Capability System

Formalize tools as capabilities with a discovery protocol:

```python
@dataclass
class ToolCapability:
    name: str
    description: str
    input_schema: dict
    output_schema: dict
    cost_estimate: int  # sats per invocation (for economic routing)
    requires_approval: bool
    safety_level: str  # "safe", "review", "dangerous"

class ToolRegistry:
    """Central registry for all agent capabilities."""

    def register(self, tool: ToolCapability, handler: Callable):
        ...

    def discover(self, query: str) -> list[ToolCapability]:
        """Natural language tool discovery."""
        ...

    def get_tools_for_agent(self, agent_id: str) -> list[ToolCapability]:
        """Permission-filtered tools for a specific agent."""
        ...
```

### 9.3 Plugin Loading Pattern

```python
# plugins/my_plugin/__init__.py
def register(registry: ToolRegistry):
    registry.register(
        ToolCapability(name="my_tool", ...),
        handler=my_handler,
    )

# App startup
for plugin_dir in Path("plugins").iterdir():
    if plugin_dir.is_dir() and (plugin_dir / "__init__.py").exists():
        module = importlib.import_module(f"plugins.{plugin_dir.name}")
        module.register(tool_registry)
```

This enables third-party extensions without modifying core code.

---

## 10. Self-Improvement Loops

### 10.1 Current Self-Modification

The `self_modify_enabled` config flag exists but the implementation is basic.
For sovereign AGI, self-improvement must be systematic.

### 10.2 Reflection-Based Self-Improvement

```
┌─────────────────────────────────────────────────┐
│                Improvement Loop                  │
│                                                  │
│  1. OBSERVE: Collect performance metrics         │
│     - Task success rates per agent               │
│     - Latency distributions                      │
│     - Memory retrieval relevance scores          │
│     - User satisfaction signals                  │
│                                                  │
│  2. REFLECT: Identify patterns                   │
│     - Which tasks fail most?                     │
│     - Which tool combinations work best?         │
│     - Which prompts produce best results?        │
│                                                  │
│  3. HYPOTHESIZE: Generate improvements           │
│     - Better system prompts                      │
│     - Better tool selection strategies           │
│     - Better memory retrieval queries            │
│                                                  │
│  4. EXPERIMENT: Test improvements                │
│     - A/B test prompt variations                 │
│     - Measure before/after metrics               │
│     - Roll back if worse                         │
│                                                  │
│  5. INTEGRATE: Apply validated improvements      │
│     - Update system prompts                      │
│     - Adjust routing weights                     │
│     - Store new patterns in semantic memory      │
└─────────────────────────────────────────────────┘
```

### 10.3 Reward Model Integration

The config already has `reward_model_enabled` and `reward_model_votes`. Wire
this into the reflection loop:

```python
class RewardSignal:
    """Evaluate agent output quality."""

    async def score(self, task: str, output: str) -> float:
        if settings.reward_model_enabled:
            # Use PRM-style scoring via Ollama
            scores = []
            for _ in range(settings.reward_model_votes):
                score = await self.reward_model.evaluate(task, output)
                scores.append(score)
            return statistics.median(scores)

        # Fallback: heuristic scoring
        return self._heuristic_score(task, output)
```

### 10.4 Autoresearch Integration

The `autoresearch_*` config fields show intent for Karpathy-style experiment
loops. Connect this to the Lab agent:

```python
class AutoresearchLoop:
    """Autonomous ML experiment iteration."""

    async def run_iteration(self, experiment: Experiment):
        # 1. Run experiment with time budget
        result = await self.run_with_timeout(
            experiment, timeout=settings.autoresearch_time_budget
        )

        # 2. Evaluate metric
        metric_value = result.metrics[settings.autoresearch_metric]

        # 3. Log to Spark for pattern detection
        spark_engine.on_creative_step(
            project_id=experiment.id,
            step_name="experiment_run",
            agent_id="lab",
        )

        # 4. Generate next hypothesis
        next_experiment = await self.lab_agent.hypothesize(
            experiment, result, metric_value
        )

        return next_experiment
```

---

## 11. Implementation Priority Matrix

### Tier 1: Do Now (Weeks 1-4)

These changes directly improve maintainability with minimal risk:

| # | Change | Files Affected | Effort |
|---|--------|---------------|--------|
| 1 | Enable WAL mode for all SQLite databases | `brain/memory.py`, `swarm/event_log.py` | 1 hour |
| 2 | Unify EventBus + swarm event_log | `infrastructure/events/`, `swarm/event_log.py` | 1 day |
| 3 | Add lazy init guards to singletons | `spark/engine.py`, `config.py` | 1 day |
| 4 | Add MockLLM for deterministic agent tests | `tests/conftest.py` | 1 day |
| 5 | Consolidate MEMORY.md hot memory into brain DB | `timmy/memory_system.py`, `brain/memory.py` | 2 days |

### Tier 2: Near-Term (Months 1-2)

Architectural improvements that set up the AGI foundation:

| # | Change | Impact | Effort |
|---|--------|--------|--------|
| 6 | Introduce AgentCore perceive-decide-act-reflect interface | Enables substrate portability | 3 days |
| 7 | Replace keyword routing with LLM-based intent classification | Better orchestration quality | 2 days |
| 8 | Add explicit TaskState machine with persistence | Debuggability, recovery | 3 days |
| 9 | Split config.py into namespaced sections | Maintainability | 2 days |
| 10 | Integrate sqlite-vec for vector search | Memory performance at scale | 2 days |

### Tier 3: Medium-Term (Months 2-4)

Scale and autonomy features:

| # | Change | Impact | Effort |
|---|--------|--------|--------|
| 11 | Memory consolidation pipeline | Durable knowledge extraction | 1 week |
| 12 | Capability-based permission model | Multi-agent security | 1 week |
| 13 | Agent execution pool with backpressure | Concurrent agent scaling | 3 days |
| 14 | Reflection-based self-improvement loop | Autonomous improvement | 2 weeks |

### Tier 4: Long-Term (Months 4+)

As documented in ROADMAP.md and REVELATION_PLAN.md:
- Substrate abstraction layer
- Federation via Nostr/rqlite
- Plugin marketplace
- ZK-ML verification

---

## Appendix A: Key Architectural Principles for Sovereign AGI

1. **Local-first, always.** Every capability must work without internet.
   Cloud is augmentation, never requirement.

2. **Memory is identity.** The memory system IS the agent. Lose the memories,
   lose the agent. Durability is non-negotiable.

3. **Events are truth.** Every action, decision, and outcome is an event.
   Current state is derived from events. This enables replay, debugging,
   and learning.

4. **Capabilities, not permissions.** Agents have capabilities they can use,
   not global flags that enable features. Capabilities can be granted,
   revoked, and audited.

5. **Reflection closes the loop.** Every action should produce a reflection
   opportunity. Did it work? What can be learned? How can the system improve?

6. **Substrate independence.** The cognitive architecture (perceive, decide,
   act, remember) doesn't change. Only the substrate (server, desktop, robot)
   changes.

7. **Economic sovereignty.** The agent should be able to earn, save, spend,
   and invest. Bitcoin/Lightning is the economic substrate.

8. **Graceful degradation, always.** If a component fails, the system
   continues with reduced capability. Never crash. Never lose data.

---

## Appendix B: File-Level Impact Map

Key files that need changes for each tier:

```
Tier 1 (Immediate):
  src/brain/memory.py          — WAL mode, hot memory integration
  src/infrastructure/events/bus.py — Persistence, replay
  src/swarm/event_log.py       — Merge into EventBus
  src/spark/engine.py          — Lazy singleton
  tests/conftest.py            — MockLLM fixture

Tier 2 (Foundation):
  src/timmy/agent_core/interface.py — AgentCore ABC
  src/timmy/agents/timmy.py    — LLM-based routing
  src/timmy/agents/base.py     — TaskState machine
  src/config.py                — Namespace split
  src/brain/memory.py          — sqlite-vec integration

Tier 3 (Scale):
  src/timmy/memory_system.py   — Consolidation pipeline
  src/infrastructure/hands/tools.py — Capability model
  src/timmy/agentic_loop.py    — Execution pool
  NEW: src/timmy/reflection.py — Self-improvement loop

Tier 4 (Vision):
  NEW: src/timmy/substrate.py  — Substrate abstraction
  src/brain/client.py          — Federation
  NEW: src/plugins/           — Plugin system
```

---

## Appendix C: Grok's Research Validation (March 9, 2026)

Grok performed independent research validation against 2026 literature and
confirmed alignment on all major architectural choices:

| Component | Research Backing | Confidence |
|-----------|-----------------|------------|
| ReAct+Reflexion loop | Original 2023 Reflexion paper (most-cited in 2026). Agents that self-critique after every step outperform GPT-4 by 11-22% on real tasks. Minimal ReAct cores routinely built in <300 lines. | High |
| YAML workflows as intelligence | Proven anti-bloat strategy in production agents. Self-modifying YAML with git versioning keeps agents lean while evolving. Intelligence in patchable files, not redeployable code. | High |
| Dynamic Docker tool registry | 2026 cutting-edge patterns (Docker MCP Gateway, Agent Sandbox) use on-demand container spin-up. Keeps core tiny and secure. | High |
| Lightning L402 economic layer | Lightning Labs' 2026 AI toolkit (lnget + L402) enables autonomous API payment and paid service hosting. Workflows can self-fund. | High |
| Three-tier memory (hot/vault/semantic) | 2026 enterprise pattern. Lightweight, local-first, pairs with Reflexion's episodic lessons. | High |

**Grok's key insight:** The Ghost Core spec is not an approximation — it's the
refined, research-validated evolution of current agent architecture patterns.
The 2,000-line core constraint is achievable and maintainable.

---

## Appendix D: Architectural Decisions Record (ADR)

These decisions were made during the research interview (March 8, 2026):

### ADR-1: Migration Strategy — Hybrid Core Rewrite + Incremental Shell

**Decision:** Build the Ghost Core (ghost.py, reflexion.py, workflow_engine.py)
as NEW modules alongside existing code. Gradually route traffic from the old
orchestrator to the new Ghost Core. Keep the dashboard as-is during migration.

**Rationale:** Avoids the risk of a big-bang rewrite while still achieving the
architectural target. The old code continues to work while the new cognitive
kernel is validated. Traffic can be shifted incrementally per-route.

**Consequence:** Temporary complexity from having two orchestration paths. Must
be disciplined about completing migration — don't let both paths persist
indefinitely.

### ADR-2: Intelligence Model — Dual-Track (YAML + LLM)

**Decision:** Known tasks use YAML workflows (fast, deterministic, auditable).
Novel tasks trigger LLM-driven agentic loops. Over time, successful LLM
patterns get codified into new YAML workflows automatically.

**Rationale:** Pure YAML is too rigid for emergent AGI behavior. Pure LLM is
too unpredictable and expensive. The dual-track model gives determinism where
possible and flexibility where needed, with a natural path for workflows to
evolve.

**Consequence:** Need a clear decision point for "is this a known task?"
(solved by the intent classifier in Section 7.3). Need a workflow generation
pipeline that captures successful LLM patterns as YAML.

### ADR-3: Tool Runtime — Subprocess Sandboxing

**Decision:** Tools run as separate processes with bubblewrap/namespace
sandboxing on Linux. Not Docker-only.

**Rationale:** Docker is heavy for RPi deployments and adds latency for
lightweight tools. Subprocess sandboxing with bubblewrap provides isolation
without the container overhead. Works on bare metal, still isolated.

**Consequence:** Need a ToolRunner abstraction that supports both subprocess
and Docker backends. The tool contract (health check, execute, shutdown) stays
the same regardless of backend.

### ADR-4: Immediate Priority — Event System + State Machine

**Decision:** First 2-week sprint focuses on unifying EventBus + swarm
event_log, adding persistent event sourcing, and implementing the TaskState
machine.

**Rationale:** This is the foundation everything else depends on. Without
reliable event persistence and task state tracking, the Ghost Core can't
debug itself, the Reflexion loop can't learn from history, and the workflow
engine can't recover from failures.

**Work items for breakdown in follow-up session:**
1. Merge `infrastructure/events/bus.py` + `swarm/event_log.py` into unified
   persistent event system
2. Add WAL mode to all SQLite databases
3. Implement TaskState enum and TaskContext dataclass with persistence
4. Add event replay capability for debugging
5. Wire Spark engine to consume from unified event stream
6. Add lazy init guards to all module-level singletons

---

*This research is a living document. Update as decisions are made and
implementations are validated. Track decisions in `docs/DECISIONS.md`.*

---

## Appendices: Peer Reviews

The following appendices contain external peer reviews solicited from five
independent AI systems. Each was given the full research document and asked to
identify gaps, risks, alternative approaches, prioritization feedback, and
concrete suggestions. Reviews are reproduced as-received with minimal
formatting edits.

---

### Appendix E: Replit Review

**Reviewer:** Replit AI
**Date:** March 2026

#### 1. Gaps and Blind Spots

- **Distributed Tracing & Observability:** The report champions event sourcing and a local-first approach (even turning telemetry off). However, by breaking tools out into ephemeral Docker containers/subprocess sandboxes, you are introducing a distributed system. The report misses a strategy for distributed tracing (e.g., passing trace IDs or OpenTelemetry context across container boundaries). Without this, debugging a workflow where an agent calls a container that calls another API will be nearly impossible.
- **Data Privacy & Scrubbing:** A local-first system is inherently private, but "UnifiedMemory" acts as a sponge for all events and context. There is no mention of a memory scrubbing or redaction layer. If the agent acts on Discord or Telegram (as noted in integrations), how do we prevent PII or sensitive keys from becoming permanent "Semantic Memory" that is later retrieved and potentially leaked by the LLM?
- **Workflow Resiliency Semantics:** The report introduces YAML workflows and a TaskState machine. But what happens when the host machine reboots halfway through a 3-day workflow? The report misses the specific snapshotting/resume mechanics required for long-running processes.
- **Network Sandboxing:** While Bubblewrap/Docker are mentioned for execution sandboxing, network isolation is omitted. If a dynamic web-scraper tool is spun up, how do we prevent it from performing SSRF attacks against the host's internal network?

#### 2. Challenges and Risks

- **The "Cold Start" Latency Penalty:** Spinning up Docker containers dynamically for tools (ToolRegistry) introduces significant latency. An LLM ReAct loop that waits 2–5 seconds per tool step for a container to boot will feel sluggish and break the illusion of continuous thought. This is a severe risk to the user experience.
- **Fragility of Self-Modifying YAML:** Using YAML as the medium for LLM self-modification is highly risky. LLMs frequently make subtle indentation or syntax errors. A single malformed YAML file could brick the workflow engine. The assumption that an LLM can reliably edit YAML orchestrations without breaking the parser is an underestimation of complexity.
- **The 2,000-Line Code Golf Trap:** Setting a strict 2K line limit for the Ghost Core is an excellent guiding philosophy but a dangerous metric. It risks encouraging "code golf" (overly dense, clever code) over readability.
- **Dual-Track Orchestration Drift:** ADR-1 proposes running the old core and Ghost Core side-by-side. The risk of these diverging and never actually completing the migration is extremely high. "Strangler Fig" patterns often leave behind permanent legacy appendages if not aggressively time-boxed.

#### 3. Alternative Approaches

- **WebAssembly (WASM) over Docker/Bubblewrap:** For the dynamic tool registry, strongly consider WASM (via Wasmtime or Extism) instead of Docker. WASM provides millisecond cold starts, strict capability-based security (WASI), and language agnosticism. It perfectly aligns with "Substrate Independence" and "Capability-Based Security" goals while eliminating the Docker latency tax.
- **DSL or Embedded Scripting over YAML:** Instead of self-modifying YAML, consider using a sandboxed scripting language (like Starlark, Lua, or RestrictedPython) or strict JSON with a Pydantic schema validator. Starlark is designed exactly for this kind of deterministic, hermetic execution and is much safer for an LLM to generate than YAML.
- **Standardized DI over Homegrown:** In Section 2.1, the report proposes a custom Container class. Instead of reinventing dependency injection, leverage FastAPI's existing Depends system, or a lightweight standard like contextvars to manage scoped state without global singletons.

#### 4. Prioritization Feedback

- **Security Must Shift Left:** The "Capability-based permission model" is currently in Tier 3 (Months 2-4). However, you are introducing dynamic Docker tool registries and self-modifying YAML in Tier 2. You cannot introduce dynamic code execution without the capability model already in place. Reprioritize Capability-based permissions to Tier 2.
- **Evaluation Harnesses belong in Tier 1:** Tier 1 lists "MockLLM for deterministic tests," which is good, but structural refactoring requires behavioral evaluations. Before rewriting the core, Tier 1 should include an automated eval suite (even just 10 core prompts) to guarantee the Ghost Core migration doesn't degrade intelligence.
- **WAL Mode and DB Unification:** Complete agreement on Tier 1. Consolidating SQLite databases and enabling WAL mode will yield immediate, high-ROI stability improvements.

#### 5. Concrete Suggestions

- **Actionable YAML Validation (Section 5.5):** If you commit to YAML for workflows, implement a strict Pydantic model for the YAML schema. Force the agent to pass its proposed YAML modifications through an isolated validation tool before it is allowed to overwrite the actual file on disk.
- **Tool Warming (Section 5.6):** Implement "warm pools" for tool execution. Keep 1-2 generic worker processes running idly, and inject the specific tool instructions dynamically. This sidesteps the Docker cold-start issue.
- **Deprecation Deadline (ADR-1):** Add a concrete termination condition to ADR-1. For example: "The legacy orchestrator will be entirely deleted exactly 4 weeks after Ghost Core handles 50% of traffic, regardless of edge-case parity."
- **Vector Search (Section 3.4):** Proceed with sqlite-vec as recommended, but ensure you are chunking memories intelligently. Vector search performance degradation is often a symptom of storing monolithic chunks rather than the math itself. Implement a rolling summary for long contexts before they ever hit the vector DB.

---

### Appendix F: Kimi Review

**Reviewer:** Kimi AI
**Date:** March 2026

#### Executive Summary

The report presents a compelling vision for evolving Timmy Time from a dashboard-centric architecture to a sovereign AGI system via the "Ghost Core" pattern. The analysis of current maintainability issues (singleton proliferation, dual memory systems, import-time side effects) is accurate and actionable.

However, the report significantly underestimates production complexities in four critical areas: **security architecture for autonomous systems**, **operational feasibility of the 2,000-line constraint**, **data migration safety**, and **human oversight mechanisms**.

**Recommendation:** Revise Tier 1 priorities to include security hardening and observability infrastructure before proceeding with Ghost Core extraction. Increase core line budget to 4,000 lines with strict justification requirements. Add explicit human-in-the-loop circuit breakers before enabling self-modification.

#### 1. Gaps & Blind Spots

**1.1 Security Architecture (CRITICAL GAP)**

The report mentions CSRF and security headers but entirely omits a threat model for autonomous agent security.

| Risk | Current State | Required Mitigation |
|------|--------------|---------------------|
| Prompt Injection | No discussion of input sanitization for YAML workflows that execute shell commands | YAML schema validation + capability sandboxing |
| Capability Escalation | Timmy can spawn Docker containers and self-modify workflows; no containment strategy | Substrate isolation (gVisor/Firecracker) for untrusted tools |
| Secrets Rotation | Lightning wallet keys, API keys stored in `.env` with no rotation strategy | HashiCorp Vault integration or SOPS-based secret management |
| Network Segmentation | Externalized tools communicate over plaintext HTTP on localhost | mTLS or WireGuard mesh between core and tools |

Required Addition: A `security/threat_model.md` documenting attack vectors for self-modifying YAML, container escape prevention, and memory poisoning defenses.

**1.2 Observability & Debugging (HIGH SEVERITY)**

Missing: distributed tracing, LLM call inspection, state machine inspection, and memory provenance tracking.

**1.3 Data Migration & Backward Compatibility (MEDIUM SEVERITY)**

No data migration strategy for unifying four memory systems. Missing: migration tooling with dry-run capability, rollback procedures, dual-write strategy during transition.

**1.4 Human-in-the-Loop & Oversight (PHILOSOPHICAL GAP)**

Missing safeguards: emergency stop for runaway self-modification, human approval gates for high-cost actions, alignment checkpoints, kill switch for autonomous tool spawning.

#### 2. Challenges & Risks

**2.1 The "2,000 Line Core" Constraint (SEVERELY UNDERESTIMATED)**

Line budget breakdown shows zero headroom for edge cases, platform abstractions, or migration code. The Linux kernel's scheduler is ~3,000 lines. Recommendation: 4,000-line soft limit with explicit security/observability/reliability justification for lines >2,000.

**2.2 YAML as Intelligence (MODERATELY UNDERESTIMATED)**

Schema evolution, validation overhead, git conflicts with concurrent human edits, and the Turing tarpit risk of YAML-with-conditionals becoming "a programming language—but a bad one."

**2.3 Docker Dependency for Tools**

Image pull latency on RPi, storage overhead, cold start latency, ARM64 availability. Required: subprocess sandboxing (bubblewrap) as primary runtime, Docker opt-in for heavy ML tools.

**2.4 Event Sourcing Complexity**

Missing: event schema evolution, snapshotting strategy, handling of non-deterministic replay.

**2.5 SQLite Concurrency**

WAL mode is necessary but insufficient. Missing: write queue management, connection pool exhaustion strategy, distributed SQLite integration.

#### 3. Alternative Approaches

- **Cellular Architecture:** Self-contained agent cells with peer-to-peer communication instead of central Ghost Core. Consider hybrid—Ghost Core for orchestration, cells for network partition resilience.
- **WebAssembly Components:** WASM for lightweight tools (ms startup, KB size, capability-based sandboxing), Docker for heavy ML tools.
- **Fedimint + Cashu:** Ecash for privacy-critical workflows alongside Lightning.

#### 4. Prioritization Feedback

**Reprioritized Roadmap:**

- Tier 1: WAL mode, lazy init guards, capability-based permissions (NEW), input validation (NEW), MockLLM, unified EventBus
- Tier 2: AgentCore interface, observability infrastructure (NEW), TaskState machine, sqlite-vec, config splitting
- Tier 3: Memory consolidation, agent pool with backpressure, human-in-the-loop gates (NEW), LLM intent classification
- Tier 4: Substrate abstraction, federation, gated self-improvement loops

#### 5. Concrete Suggestions

- **Guardian Layer:** Security component that approves/rejects actions based on economic bounds, workflow safety verification, and human approval for high-risk actions.
- **Workflow Schema Versioning:** `schema_version` field with migration instructions.
- **Event Compaction Strategy:** Snapshot + summarize + archive pattern to prevent infinite log growth.
- **Line Budget Revision:** 4,000 lines with explicit allocation including 250 lines for security/guardian and 300 for observability.

**Go/No-Go Criteria for Ghost Core Migration:**
- Guardian layer implemented and tested
- Observability infrastructure operational
- Migration tooling tested on real data
- Human approval gates wired for high-cost actions
- Line count under 4,000 with documented allocation
- Rollback strategy validated

---

### Appendix G: Claude (Anthropic) Review

**Reviewer:** Claude (Anthropic)
**Date:** March 2026

*Note: This review was identical in structure and content to Appendix F (Kimi). Both models independently converged on the same critical gaps, risk assessments, and recommendations. This convergence strengthens the signal: security architecture, line budget realism, human oversight, and migration safety are genuine blind spots requiring attention. The duplicate review has been consolidated here by reference rather than repeated in full.*

---

### Appendix H: Perplexity Review

**Reviewer:** Perplexity AI
**Date:** March 2026

#### 1. Gaps and Blind Spots

**1.1 Security and Threat Model**

Missing elements:

- **Defined adversaries:** Malicious/compromised tools/containers, prompt-injected content triggering tool calls or LN spend, local OS users/processes tampering with state or keys.
- **Clear sovereignty scope:** What "sovereign" actually guarantees at each layer (hardware, OS, runtime, data, models, economics). Ollama models and LN peers are still external dependencies.
- **Container/tool sandboxing details:** Network isolation policies, filesystem isolation, capability-to-container constraint mapping.

Concrete suggestion: Add a "Security & Sovereignty Model" section with threat actors table, explicit network egress policies, and OS-level capability constraints.

**1.2 Operational & SRE Concerns**

Missing: standard event schema for ALL subsystems, forensic query indices, minimal operator console, disaster recovery procedures (DB snapshot cadence, crash-safe backups, LN key sync), and upgrade/rollback playbooks.

**1.3 Governance & Multi-Human Use**

Implicitly single-user. Missing: role/permission model (Owner/Maintainer/Guest), conflict resolution for incompatible goals, formal process for changing critical policies.

**1.4 Safety & Alignment**

Missing: interruption/rollback for multi-step workflows with external effects, side-effect budgeting (`max_file_writes`, `max_external_domains`, `max_shell_commands`), goal scoping with explicit action domains, and a kill switch component.

#### 2. Challenges and Risks

**2.1 Multi-Agent & Workflow Complexity**

Emergent loops from recursive self-improvement. State explosion from three-tier memory + event logs + workflow versions. Need global `max_self_modify_depth`, `max_total_workflow_versions_per_id`, and per-run `max_spawned_workflows`.

**2.2 Tool Registry & Containerization**

Docker-everywhere assumption fails on RPi/homelab. Cold start and port exhaustion under concurrency. Supply chain trust for container images (signing, checksums, local mirroring).

**2.3 LN Economic Layer**

Channel lifecycle complexity with intermittent uptime. Fee/route variability makes per-workflow cost estimation simplistic. Partial failure recovery undefined.

Recommendation: Treat LN as asynchronous side-effect with retry/compensation events. Start with manual channel management.

**2.4 SQLite & Concurrency**

Long-running transactions from event-sourced batched writes. Schema coordination across agents/tools. Need strict short-transaction discipline and append-only events with materialized views.

#### 3. Alternative Approaches

- **Event-First Core:** Single canonical `events` table with normalized schema (`event_id`, `parent_event_id`, `timestamp`, `actor`, `type`, `payload_json`, `run_id`, `workflow_id`, `agent_id`). All subsystems produce/consume events. Natural audit trail, easier replay, fits non-deterministic testing.
- **Policy Engine:** Capabilities as data (YAML/JSON) with tiny evaluator instead of hardcoded checks. Enables user configuration without code edits and lets Timmy propose policy changes.
- **Substrate-Aware Tooling:** `ToolSubstrate` with Container/Process/Remote/WASM variants, selected per environment based on capabilities and policies.
- **Split Storage:** Operational store (SQLite for transactional state) vs analytical/knowledge store (DuckDB for large-scale events, metrics, memory materializations).

#### 4. Prioritization Feedback

**Move earlier:**
- Threat model + basic policy/capability enforcement (Tier 1, even if hardcoded)
- Minimal "Timmy Console" for event/state introspection

**Move later:**
- Aggressive Docker-based tool externalization (start subprocess/bubblewrap first)
- Full autoresearch loops (defer until policies, budgets, and observability are solid)

**Tier 1 additions:** Minimal threat model, hard "no outbound HTTP except allowed" rule, centralized capability check path.
**Tier 2 additions:** Simple governance roles, toggleable approval mode, basic Timmy Console.

#### 5. Concrete Suggestions

- **Add Security & Sovereignty section** with threat model table and sovereignty boundary diagram.
- **Expand capability model into mini-policy system** with YAML-defined policy rules.
- **Clarify tool substrates** (container/subprocess/remote) with environment-based selection.
- **Tighten LN description:** estimated_cost as upper bound, async event representation, partial failure handling.
- **Make event sourcing more central:** canonical Event type, unified events table/schema.
- **Add governance & modes section:** `assist_only`, `propose_then_execute`, `autonomous_with_budget` with capability/approval implications.

---

### Appendix I: Grok Review

**Reviewer:** Grok (xAI)
**Date:** March 2026

#### 1. Gaps and Blind Spots

**Security & Threat Model:** No explicit adversary list or sovereignty boundary. Tools from Docker images risk malicious tags or prompt-injection triggering LN spend. Missing:

- Network egress policy (core never outbound except via proxy)
- Prompt sanitization layer before tool calls
- Rootless Podman as default (daemonless, no root escalation—beats Docker on homelab security)

**Operational Resilience:** No backup cadence, crash-recovery playbook, or event schema for forensics. Power cut mid-channel-open = funds gone. LN key sync undefined.

**Governance & Alignment:** Single-user assumption. No roles, approval gates for risky steps, or intent drift detection.

#### 2. Challenges and Risks

- **Self-Modification Loops:** Recursive "improve" steps burning sats on useless tools. No global depth cap or meta-guardrail against spawning sub-workflows.
- **LN Economics in Practice:** Channel liquidity, routing fees, offline failures. Partial payment fails with no compensation logic.
- **Tool Cold Starts & Port Hell:** Per-step container spin-up = latency spikes. Port exhaustion under concurrency. No image signing/checksums = supply chain attack vector.
- **State Explosion:** Three-tier memory + event logs + workflow versions = unprunable mess. No hard caps on active runs or decay policy.

#### 3. Alternative Approaches

- **Tool Substrate:** Rootless Podman (security-first), subprocess (no container tax for trusted tools), WASM (sandboxed, fast cold-start).
- **Event-First Core:** Append-only JSONL events table. Replayable, audit-proof. DuckDB for analytics.
- **Policy Engine:** YAML policies (`shell.execute: requires_approval, max_3_per_run`). Tiny evaluator scales without code bloat.
- **LN Handling:** Payments as async events with retry/compensation. Start manual-channel only.

#### 4. Prioritization Feedback

**Move up (Tier 1):**
- Threat model + basic policy checks. One rogue LN call kills trust.
- Unified events + minimal console. Debug hell otherwise.

**Defer slightly (Tier 2):**
- Full Docker registry. Start subprocess + Podman hybrid.
- Advanced reflexion/self-modify. Stabilize budgets/policies first.

**Keep as-is:** Migration phases—bots out first is perfect.

#### 5. Concrete Suggestions

**Workflow Schema Extension:**
```yaml
budget:
  max_sats: 500
  max_file_writes: 5
  max_domains: 10
  review_required: [shell.execute, git.push, ln.payout]
```

**Registry Patch:**
```python
if tool.trust_level == "untrusted":
    use_substrate("subprocess", airgapped=True)
else:
    use_substrate("podman_rootless")
```

**Ops Playbook:** Hourly `tar + gpg` backups offsite. Drain workflows before schema migration, replay events after. `/admin/stop` kill switch route (auth-only).

**Event Schema:**
```json
{"event_id": "...", "timestamp": "...", "actor": "workflow:alpha_hunter", "type": "tool_call", "payload": {...}}
```

---

## Cross-Review Consensus Summary

All five reviewers independently converged on the following critical findings:

| Finding | Reviewers Flagging | Severity |
|---------|-------------------|----------|
| Missing security/threat model | All 5 | CRITICAL |
| 2,000-line budget is unrealistic | 4 of 5 | HIGH |
| Docker cold-start latency risk | All 5 | HIGH |
| Need WASM/subprocess tool substrates | All 5 | HIGH |
| Missing human-in-the-loop safeguards | 4 of 5 | HIGH |
| Self-modifying YAML fragility | 4 of 5 | HIGH |
| No observability/tracing strategy | All 5 | HIGH |
| LN partial failure handling missing | 3 of 5 | MEDIUM |
| Missing data migration strategy | 3 of 5 | MEDIUM |
| Event sourcing needs snapshotting | 3 of 5 | MEDIUM |
| Security/capabilities must move to Tier 1 | All 5 | CRITICAL |
| Need policy engine (data-driven, not hardcoded) | 3 of 5 | MEDIUM |
| SQLite concurrency insufficiently addressed | 3 of 5 | MEDIUM |

**Unanimous recommendation:** Security architecture and observability infrastructure
must be Tier 1 prerequisites before Ghost Core extraction or self-modification
capabilities are enabled.