Sovereign AGI Research: Maintainability & Scalability Analysis

Date: 2026-03-08
Scope: Deep architecture review of Timmy Time Dashboard with actionable recommendations for evolving toward a robust sovereign AGI.
Inputs: Codebase analysis, ROADMAP.md, REVELATION_PLAN.md, Kimi's Ghost Core Knowledge Transfer (March 9, 2026), Grok's research validation (March 9, 2026)

Executive Summary

Timmy Time has a solid foundation: local-first inference, graceful degradation, event-driven architecture, and a clear module boundary system. However, several architectural patterns need to evolve to support the sovereign AGI vision — particularly around memory durability, agent coordination determinism, interface contracts, and dependency injection.

The Ghost Core vision (from Kimi's consolidation) proposes a radical minimization: the current ~13K-line monolith compressed to a ~2,000-line cognitive kernel that orchestrates intelligence through YAML workflows and containerized tools. This report synthesizes both perspectives — the incremental improvements needed now and the architectural target state — into a unified roadmap with 14 specific areas for improvement across 4 priority tiers.

The core tension this report resolves: How to move from "dashboard that contains an agent" to "agent that exposes a dashboard" without breaking what already works.

1. Current Architecture Assessment
2. Critical Maintainability Issues
3. Scalability Bottlenecks
4. Sovereign AGI Architecture Recommendations
5. The Ghost Core Vision
6. Memory System Evolution
7. Agent Coordination & Orchestration
8. Testing Strategy for Non-Deterministic Systems
9. Plugin & Extension Architecture
10. Self-Improvement Loops
11. Implementation Priority Matrix

1. Current Architecture Assessment

Strengths

Area	Implementation	Quality
Config management	`pydantic-settings` with env/file cascade	Excellent
Graceful degradation	Try/except with fallback at every integration point	Excellent
Event system	Async EventBus with wildcard subscriptions	Good
LLM routing	CascadeRouter with circuit breakers	Good
Memory tiers	Hot (MEMORY.md) → Vault (markdown) → Semantic (SQLite+vectors)	Good foundation
Module boundaries	8 packages with clear responsibilities	Good
Multi-backend LLM	Ollama/AirLLM/Grok/Claude with auto-detection	Good
Security posture	CSRF, security headers, secret validation, telemetry off	Good

Architecture Diagram (Current State)

┌──────────────────────────────────────────────────────────────┐
│                    Dashboard (FastAPI)                        │
│  23 route modules · HTMX+Jinja2 · WebSocket · CSRF          │
├───────────────┬──────────────┬───────────────┬───────────────┤
│   timmy/      │  spark/      │  swarm/       │ timmy_serve/  │
│ Agent core    │ Intelligence │ Task queue    │ API server    │
│ Memory system │ EIDOS predict│ Event log     │ Voice TTS     │
│ Agentic loop  │ Advisory     │               │ Inter-agent   │
├───────────────┴──────────────┴───────────────┴───────────────┤
│                   infrastructure/                             │
│ CascadeRouter · EventBus · WebSocket · Notifications · Hands │
├──────────────────────────────────────────────────────────────┤
│                   integrations/                               │
│ Discord · Telegram · Siri · Voice NLU · Paperclip · ChatBridge│
├──────────────────────────────────────────────────────────────┤
│                      brain/                                   │
│ UnifiedMemory (SQLite+vectors) · Embeddings · rqlite client  │
├──────────────────────────────────────────────────────────────┤
│                    config.py                                  │
│ 90+ settings · pydantic-settings · env/file cascade          │
└──────────────────────────────────────────────────────────────┘

2. Critical Maintainability Issues

2.1 Singleton Proliferation (Severity: HIGH)

Problem: The codebase uses module-level singletons extensively:

config.settings (global mutable state)
event_bus (infrastructure/events/bus.py)
memory_system (timmy/memory_system.py)
spark_engine (spark/engine.py)
cascade_router (infrastructure/router/cascade.py)
ws_manager, notifier, etc.

These singletons are instantiated at import time, making testing difficult and creating hidden dependencies between modules. A test that imports spark.engine triggers a chain: spark_engine → settings → reads .env → tries to connect to SQLite.

Recommendation: Migrate to a lightweight dependency injection container. Not a heavy framework — a simple registry pattern:

# infrastructure/container.py
from typing import Any, Callable


class Container:
    """Lightweight DI container: singletons with lazy init."""

    _registry: dict[str, Any] = {}                 # built instances, keyed by name
    _factories: dict[str, Callable[[], Any]] = {}  # zero-argument factories

    @classmethod
    def register(cls, name: str, factory: Callable[[], Any]) -> None:
        cls._factories[name] = factory

    @classmethod
    def get(cls, name: str) -> Any:
        # Build on first access; raises KeyError if nothing was registered.
        if name not in cls._registry:
            cls._registry[name] = cls._factories[name]()
        return cls._registry[name]

    @classmethod
    def reset(cls) -> None:
        """For tests: clear all instances (registered factories are kept)."""
        cls._registry.clear()

This preserves the simplicity of from x import singleton while enabling test isolation and lazy initialization.
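
A minimal sketch of how the registry pattern above could give tests a fresh instance per run. `FakeEventBus` and `make_bus` are hypothetical stand-ins, not names from the codebase, and the `Container` class is repeated only so the example runs on its own.

```python
from typing import Any, Callable


class Container:
    """Same registry pattern as above, repeated so the sketch is self-contained."""

    _registry: dict[str, Any] = {}
    _factories: dict[str, Callable[[], Any]] = {}

    @classmethod
    def register(cls, name, factory):
        cls._factories[name] = factory

    @classmethod
    def get(cls, name):
        if name not in cls._registry:
            cls._registry[name] = cls._factories[name]()
        return cls._registry[name]

    @classmethod
    def reset(cls):
        cls._registry.clear()


class FakeEventBus:
    """Hypothetical test double: records events instead of dispatching them."""

    def __init__(self):
        self.published = []

    def publish(self, name, payload):
        self.published.append((name, payload))


created = []  # tracks how often the factory actually runs


def make_bus():
    created.append(1)
    return FakeEventBus()


Container.register("event_bus", make_bus)  # nothing is built yet
bus = Container.get("event_bus")           # built on first access
same_bus = Container.get("event_bus")      # cached instance is reused

Container.reset()                          # e.g. in a pytest fixture teardown
fresh_bus = Container.get("event_bus")     # rebuilt after reset
```

Calling `Container.reset()` from a fixture teardown would keep state from leaking between tests while leaving the registered factories in place.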

2.2 Dual Memory Systems (Severity: HIGH)

Problem: There are TWO independent memory systems that don't talk to each other:

timmy/memory_system.py — MemorySystem with HotMemory (MEMORY.md), VaultMemory (markdown files), HandoffProtocol. Used by the agent creation path (agent.py).
brain/memory.py — UnifiedMemory with SQLite-backed memories, facts, embeddings, rqlite support. Used by brain-related routes and tools.

Data stored in one system is invisible to the other. The agent's context includes HotMemory (MEMORY.md) but not brain facts. The brain stores semantic memories but doesn't know about vault markdown files.

Recommendation: Unify under brain/memory.py (UnifiedMemory) as the single source of truth. Make MemorySystem a thin orchestration layer that delegates:

Hot memory → a "hot" table/partition in UnifiedMemory
Vault → keep markdown files but index them in UnifiedMemory for search
Handoff → serialize to UnifiedMemory with a "handoff" tag

2.3 Config Monolith (Severity: MEDIUM)

Problem: config.py has 90+ settings in a single Settings class. This violates single-responsibility and makes it hard to understand which settings belong to which module. Adding a new feature means touching the same 280-line file every time.

Recommendation: Split into namespaced configs using pydantic's nested model support:

from pydantic import BaseModel
from pydantic_settings import BaseSettings


class LLMConfig(BaseModel):
    ollama_url: str = "http://localhost:11434"
    ollama_model: str = "qwen3.5:latest"
    # ... all LLM settings


class MemoryConfig(BaseModel):
    prune_days: int = 90
    vault_max_mb: int = 100
    # ... all memory settings


class Settings(BaseSettings):
    llm: LLMConfig = LLMConfig()
    memory: MemoryConfig = MemoryConfig()
    # ...

This can be done incrementally — start with the largest groups (LLM, memory, security, creative) and keep the flat namespace available via properties for backwards compatibility.
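
The backwards-compatibility idea can be sketched as read-only properties that forward the old flat names to the nested groups. This example uses plain dataclasses as stand-ins for the pydantic models so it runs without dependencies; the same property trick should carry over to a `BaseSettings` class, though that is an assumption rather than something verified against the codebase.

```python
from dataclasses import dataclass, field


@dataclass
class LLMConfig:
    ollama_url: str = "http://localhost:11434"
    ollama_model: str = "qwen3.5:latest"


@dataclass
class Settings:
    llm: LLMConfig = field(default_factory=LLMConfig)

    # Legacy flat names kept as read-only aliases during the migration.
    @property
    def ollama_model(self) -> str:
        return self.llm.ollama_model

    @property
    def ollama_url(self) -> str:
        return self.llm.ollama_url


settings = Settings()
legacy_value = settings.ollama_model      # old call sites keep working
settings.llm.ollama_model = "other:tag"   # new call sites use the nested path
updated_value = settings.ollama_model     # alias reflects the nested value
```

Because the aliases are derived, there is only one stored value per setting, so the flat and nested views cannot drift apart.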

2.4 Import-Time Side Effects (Severity: MEDIUM)

Problem: Several modules execute significant logic at import time:

config.py lines 342-376: production validation, logging setup, sys.exit(1)
spark/engine.py line 359: spark_engine = _create_engine()
swarm/event_log.py: SQLite table creation on _ensure_db()
dashboard/app.py: middleware setup, router registration

This means importing any module can trigger database connections, file I/O, or even process exits. This makes testing fragile and increases startup time.

Recommendation: Guard all side effects behind explicit initialization:

# Instead of module-level execution:
spark_engine = _create_engine()

# Use lazy pattern:
_spark_engine = None
def get_spark_engine() -> SparkEngine:
    global _spark_engine
    if _spark_engine is None:
        _spark_engine = _create_engine()
    return _spark_engine

The brain/memory.py already does this correctly with get_memory().

3. Scalability Bottlenecks

3.1 SQLite Concurrency (Severity: HIGH for AGI vision)

Problem: Multiple components write to different SQLite databases:

data/brain.db — UnifiedMemory
data/events.db — Swarm event log
timmy.db — Agno conversation history
Spark memory (internal SQLite)

SQLite has a single-writer lock. With multiple async tasks (agentic loop, background thinking, event capture, WebSocket handlers), write contention will increase as the system becomes more autonomous.

Current impact: Low (single-user system), but the AGI vision requires concurrent agent operation.

Recommendations (progressive):

Immediate: Use WAL mode for all SQLite databases:

conn.execute("PRAGMA journal_mode=WAL")
conn.execute("PRAGMA busy_timeout=5000")

Near-term: Consolidate into fewer databases. Currently 4+ separate .db files that could be 1-2 databases with proper schemas.
Medium-term: Move to DuckDB for analytical queries (agent performance metrics, Spark intelligence) while keeping SQLite for transactional data.
Long-term: The rqlite path (brain/client.py) is the right direction for distributed memory across federated nodes.
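
The immediate step could be centralized in one connection helper so every database gets the same pragmas. This is a sketch using only the standard library; the helper name `connect` and the demo path are illustrative.

```python
import os
import sqlite3
import tempfile


def connect(path: str) -> sqlite3.Connection:
    """Open a SQLite database with WAL journaling and a 5 s busy timeout."""
    conn = sqlite3.connect(path)
    conn.execute("PRAGMA journal_mode=WAL")   # persistent: stored in the database file
    conn.execute("PRAGMA busy_timeout=5000")  # per-connection: set on every connect
    return conn


db_path = os.path.join(tempfile.mkdtemp(), "demo.db")
conn = connect(db_path)
journal_mode = conn.execute("PRAGMA journal_mode").fetchone()[0]
busy_timeout = conn.execute("PRAGMA busy_timeout").fetchone()[0]
conn.close()
```

WAL lets readers proceed while a write is in progress, but there is still only one writer at a time, which is why the busy timeout matters as concurrency grows.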

3.2 In-Memory Event Bus (Severity: MEDIUM)

Problem: The EventBus at infrastructure/events/bus.py is in-memory only. Events are lost on restart. History is capped at 1000 entries. No persistence, no replay.

For a sovereign AGI, every event is valuable data. Agent decisions, tool executions, task outcomes — all of this is training signal.

Recommendations:

Persist events to SQLite (the swarm event_log.py already does this separately — unify these two systems).

Add event replay for debugging agent behavior:

# Replay all events from a specific task
events = event_bus.replay(task_id="abc123")

Event sourcing pattern: Make events the source of truth for system state. Current state is derived by replaying events. This enables:
- Time-travel debugging
- State reconstruction after crashes
- Deterministic replay for testing
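
A minimal sketch of persisted events with replay and derived state, assuming a single SQLite table. The class name, schema, and event names (`step.completed`, `task.finished`) are illustrative and not taken from `event_log.py` or the EventBus.

```python
import json
import sqlite3


class PersistentEventLog:
    """Append-only event store; names and schema are illustrative."""

    def __init__(self, path: str = ":memory:"):
        self._conn = sqlite3.connect(path)
        self._conn.execute(
            "CREATE TABLE IF NOT EXISTS events ("
            " id INTEGER PRIMARY KEY AUTOINCREMENT,"
            " task_id TEXT, name TEXT, payload TEXT)"
        )

    def append(self, task_id: str, name: str, payload: dict) -> None:
        with self._conn:  # commits on success, rolls back on error
            self._conn.execute(
                "INSERT INTO events (task_id, name, payload) VALUES (?, ?, ?)",
                (task_id, name, json.dumps(payload)),
            )

    def replay(self, task_id: str) -> list[tuple[str, dict]]:
        """Return one task's events in the order they were recorded."""
        rows = self._conn.execute(
            "SELECT name, payload FROM events WHERE task_id = ? ORDER BY id",
            (task_id,),
        )
        return [(name, json.loads(payload)) for name, payload in rows]


def derive_state(events: list[tuple[str, dict]]) -> dict:
    """Fold events into current state (the event-sourcing step)."""
    state = {"status": "received", "steps": 0}
    for name, payload in events:
        if name == "step.completed":
            state["steps"] += 1
        elif name == "task.finished":
            state["status"] = payload["status"]
    return state


log = PersistentEventLog()
log.append("abc123", "step.completed", {"tool": "shell"})
log.append("other", "step.completed", {"tool": "git"})
log.append("abc123", "step.completed", {"tool": "git"})
log.append("abc123", "task.finished", {"status": "completed"})
state = derive_state(log.replay("abc123"))
```

Because `derive_state` is a pure function of the event list, the same replay yields the same state every time, which is what makes crash recovery and deterministic test replay possible.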

3.3 Synchronous Agent Execution (Severity: MEDIUM)

Problem: The BaseAgent.run() method calls self.agent.run() synchronously (Agno's run is blocking). The agentic loop wraps this in asyncio.to_thread() which works but creates thread contention with multiple concurrent agents.

Recommendation: Create a proper async agent execution pool:

import asyncio
from concurrent.futures import ThreadPoolExecutor

from agno.agent import Agent


class AgentPool:
    """Manages concurrent agent execution with backpressure."""

    def __init__(self, max_concurrent: int = 4):
        self._semaphore = asyncio.Semaphore(max_concurrent)
        self._executor = ThreadPoolExecutor(max_workers=max_concurrent)

    async def execute(self, agent: Agent, message: str) -> str:
        async with self._semaphore:
            # get_running_loop() is the supported call inside a coroutine
            loop = asyncio.get_running_loop()
            return await loop.run_in_executor(
                self._executor,
                lambda: agent.run(message, stream=False)
            )

3.4 Vector Search Performance (Severity: MEDIUM)

Problem: The current vector search in brain/memory.py loads ALL embeddings into memory and computes cosine similarity in Python:

rows = conn.execute("SELECT ... FROM memories WHERE embedding IS NOT NULL").fetchall()
for row in rows:
    stored_vec = np.frombuffer(row["embedding"], dtype=np.float32)
    score = float(np.dot(query_vec, stored_vec))

This is O(n) with no indexing. At 100K+ memories, this becomes a significant latency bottleneck.

Recommendations (as documented in ROADMAP.md Phase 3):

sqlite-vec — drop-in SQLite extension, zero new dependencies
LanceDB — embedded, disk-based, IVF_PQ indexing, handles millions
Qdrant — only if federated search across nodes is needed

sqlite-vec is the clear first step given the SQLite-native architecture.
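
Until an index is in place, the per-row Python loop could be replaced by one matrix product. The sketch below is still O(n) and is not a substitute for sqlite-vec; it assumes the embeddings are already L2-normalized and held in a single NumPy matrix, and `top_k` is a hypothetical helper name.

```python
import numpy as np


def top_k(query_vec: np.ndarray, matrix: np.ndarray, k: int = 5) -> list[tuple[int, float]]:
    """Return (row_index, cosine_score) for the k most similar rows.

    Assumes every vector is already L2-normalized, so the dot product
    equals cosine similarity.
    """
    scores = matrix @ query_vec                # one vectorized product, no Python loop
    k = min(k, len(scores))
    if k == 0:
        return []
    idx = np.argpartition(-scores, k - 1)[:k]  # unordered top-k without a full sort
    idx = idx[np.argsort(-scores[idx])]        # order only those k
    return [(int(i), float(scores[i])) for i in idx]


rng = np.random.default_rng(0)
vectors = rng.normal(size=(1000, 64)).astype(np.float32)
vectors /= np.linalg.norm(vectors, axis=1, keepdims=True)
hits = top_k(vectors[42], vectors, k=3)
```

Querying with a stored vector should return that vector's own row first, which makes a convenient sanity check.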

4. Sovereign AGI Architecture Recommendations

4.1 The Perceive-Decide-Act-Remember Loop

The REVELATION_PLAN.md defines the right core interface:

class TimAgent(ABC):
    async def perceive(self, input) -> WorldState: ...
    async def decide(self, state) -> Action: ...
    async def act(self, action) -> Result: ...
    async def remember(self, key, value): ...
    async def recall(self, key) -> Value: ...

Current gap: The existing agent system (Agno-based) doesn't implement this pattern. The BaseAgent and SubAgent classes just wrap Agent.run() with event bus integration. There's no structured perception→decision→action cycle.

Recommendation: Introduce this as the AgentCore interface, sitting between the current Agno wrapper and the orchestrator:

# timmy/agent_core/interface.py (file already exists but is minimal)
from abc import ABC, abstractmethod


class AgentCore(ABC):
    """Sovereign agent loop: the fundamental cognitive cycle."""

    @abstractmethod
    async def perceive(self, inputs: list[Perception]) -> WorldState:
        """Gather and fuse sensory inputs."""

    @abstractmethod
    async def decide(self, state: WorldState, memory: MemoryContext) -> Plan:
        """Deliberate and choose actions."""

    @abstractmethod
    async def act(self, plan: Plan) -> list[ActionResult]:
        """Execute planned actions."""

    @abstractmethod
    async def reflect(self, results: list[ActionResult]) -> list[Memory]:
        """Extract learnings from action results."""

The reflect step is the key addition — it closes the loop for self-improvement.

4.2 Capability-Based Security Model

For a sovereign AGI, the permission model must evolve from "config flags" to a capability-based system:

from dataclasses import dataclass, field


@dataclass
class Capability:
    name: str           # "shell.execute", "git.push", "memory.write"
    scope: str          # "project", "system", "network"
    requires_approval: bool
    max_cost_sats: int  # Economic bound on the capability


@dataclass
class AgentPermissions:
    """Each agent has a set of capabilities, not global flags."""
    capabilities: dict[str, Capability] = field(default_factory=dict)

    def can(self, action: str) -> bool:
        # True only for capabilities the agent holds that need no approval
        cap = self.capabilities.get(action)
        return cap is not None and not cap.requires_approval

Currently, permissions are global (self_modify_enabled, hands_shell_enabled). For a multi-agent sovereign system, each agent needs its own permission set with escalation paths.
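
One way to model the escalation path is a three-way decision instead of a boolean. The sketch below is an assumption about how escalation might work: the `Decision` enum, the `check` method, and the agent name `forge` used for the demo are illustrative, not part of the existing permission code.

```python
from dataclasses import dataclass, field
from enum import Enum


class Decision(Enum):
    ALLOW = "allow"
    ESCALATE = "escalate"  # hand off to a human or a supervising agent
    DENY = "deny"


@dataclass(frozen=True)
class Capability:
    name: str
    scope: str
    requires_approval: bool
    max_cost_sats: int


@dataclass
class AgentPermissions:
    capabilities: dict[str, Capability] = field(default_factory=dict)

    def check(self, action: str, cost_sats: int = 0) -> Decision:
        cap = self.capabilities.get(action)
        if cap is None:
            return Decision.DENY      # capability not granted at all
        if cap.requires_approval or cost_sats > cap.max_cost_sats:
            return Decision.ESCALATE  # granted, but needs sign-off
        return Decision.ALLOW


forge = AgentPermissions({
    "git.push": Capability("git.push", "project", requires_approval=True, max_cost_sats=0),
    "memory.write": Capability("memory.write", "project", requires_approval=False, max_cost_sats=10),
})
```

Separating "not granted" from "granted but needs approval" lets the orchestrator route the second case to a reviewer instead of failing the task outright.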

4.3 Substrate Abstraction Layer

The REVELATION_PLAN envisions multiple substrates (Cloud, Desktop, Robot, Sim). The current architecture is tightly coupled to the "cloud/server" substrate.

Recommendation: Introduce a substrate interface early:

from abc import ABC, abstractmethod


class Substrate(ABC):
    """Where and how the agent executes."""

    @abstractmethod
    def get_llm_provider(self) -> LLMProvider: ...

    @abstractmethod
    def get_memory_backend(self) -> MemoryBackend: ...

    @abstractmethod
    def get_io_channels(self) -> list[IOChannel]: ...

    @abstractmethod
    def get_capabilities(self) -> list[Capability]: ...


class ServerSubstrate(Substrate):
    """Current implementation: Ollama + SQLite + FastAPI."""
    ...


class RobotSubstrate(Substrate):
    """RPi + camera + motors + local Ollama."""
    ...

This doesn't require rewriting existing code — it's an abstraction that wraps current components and makes substrate-switching possible.

5. The Ghost Core Vision

Source: Kimi's Knowledge Transfer Document (March 9, 2026)

5.1 The Proposition

The Ghost Core philosophy proposes a fundamental inversion of the current architecture:

CURRENT:  Dashboard (monolith) → contains agents → agents use tools
TARGET:   Ghost Core (~2K lines) → orchestrates workflows → tools are containers

The formula:

Ghost Core (ReAct+Reflexion)
  + Workflow Layer (YAML-defined, self-modifying)
  + Tool Registry (dynamic container spin-up/spin-down)
  + Three-Tier Memory (hot/vault/semantic)
  + Lightning Economic Layer
  = Sovereign, Wealth-Generating Agent

5.2 Ghost Core Module Budget

The proposal enforces a strict 2,000-line limit for src/timmy/:

Module	Lines	Purpose
`ghost.py`	~150	ReAct loop: Observe → Plan → Act → Reflect
`reflexion.py`	~100	Critique generation + lesson extraction
`workflow_engine.py`	~200	YAML loader, step executor, state machine
`tool_registry.py`	~200	Dynamic tool discovery, spawn, health check
`memory_system.py`	~300	Hot/Vault/Semantic memory interface (existing)
`backends.py`	~200	Ollama/AirLLM/Claude/Grok adapters
`config.py`	~150	Pydantic-settings (existing)
`lightning_wallet.py`	~200	L402 handling, invoice generation, balance
`utils/`	~300	Shared helpers, logging, serialization
Total	~1,800	Headroom of ~200 lines for critical fixes

5.3 What Gets Externalized

Everything heavy moves to Docker containers with standardized HTTP APIs:

Current In-Process	Target Container	Interface
Discord/Telegram bots	`timmy-bridges`	POST /ingest
Voice (pyttsx3)	`timmy-voice`	POST /speak, /transcribe
Creative tools	`timmy-creative`	POST /generate
Web scraping	`timmy-tools/scraper`	POST /execute
Experiment runners	`timmy-tools/lab`	POST /execute

5.4 The ReAct+Reflexion Loop

from typing import Optional


class GhostCore:
    async def run(self, goal: str, workflow_id: Optional[str] = None) -> Result:
        workflow = (self.workflow_engine.load(workflow_id)
                    if workflow_id
                    else self.planner.create_workflow(goal))

        context = {
            "goal": goal,
            "memory": self.memory.retrieve_relevant(goal),
            "lessons": self.reflexion.get_lessons(goal),
            "wallet": self.lightning.get_balance(),
            "results": [],  # filled as steps succeed
        }

        for step in workflow.steps:
            state = await self.observe(step, context)
            tool = await self.registry.get(step.tool)
            result = await tool.execute(step.params, context)

            # Reflexion: critique the step and queue a fix if it failed
            critique = await self.reflexion.critique(step, result, context)
            if not critique.success:
                workflow.insert_step(critique.fix_step)
                continue

            context["results"].append(result)
            if critique.is_novel:
                await self.reflexion.store(critique.lesson)

        summary = await self.reflexion.summarize(context)
        return Result(context["results"], summary)

5.5 YAML Workflow Layer

Intelligence lives in workflows, not code. Workflows are YAML files that the Ghost Core executes. Timmy can modify his own workflows.

workflow_id: alpha_hunter_v2
name: Alpha Hunter - Crypto Scanner
goal: Find 3 undervalued assets with 5x potential

budget:
  max_sats_per_run: 1000
  estimated_time: 300s

tools_required:
  - name: web_scraper
    image: timmy-tools/scraper:latest
  - name: sentiment_analyzer
    image: timmy-tools/sentiment:latest

steps:
  - id: fetch_market_data
    tool: web_scraper
    action: scrape
    params: { source: coingecko, assets: top_100 }
    output: market_data

  - id: analyze_sentiment
    tool: sentiment_analyzer
    action: analyze
    input: ${market_data.asset_names}
    output: sentiment_scores

  - id: validate_candidates
    tool: web_scraper
    action: deep_dive
    foreach: ${candidates}
    output: reports

self_modify:
  enabled: true
  max_iterations: 5
  persist_lessons: true
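
The `${market_data.asset_names}` style references in the workflow above imply a substitution step before each tool call. The source does not specify those semantics, so the resolver below is a sketch under assumptions: dotted paths index into a dict of earlier step outputs, and a value that is exactly one placeholder keeps its original type.

```python
import re

_PLACEHOLDER = re.compile(r"\$\{([A-Za-z_][\w.]*)\}")


def resolve(value, outputs: dict):
    """Replace ${step_output.path} references with values from earlier steps."""
    if isinstance(value, dict):
        return {k: resolve(v, outputs) for k, v in value.items()}
    if isinstance(value, list):
        return [resolve(v, outputs) for v in value]
    if not isinstance(value, str):
        return value

    def lookup(path: str):
        node = outputs
        for part in path.split("."):
            node = node[part]  # KeyError signals a reference to a missing output
        return node

    whole = _PLACEHOLDER.fullmatch(value)
    if whole:  # "${x}" alone keeps the original type (list, dict, ...)
        return lookup(whole.group(1))
    return _PLACEHOLDER.sub(lambda m: str(lookup(m.group(1))), value)


outputs = {"market_data": {"asset_names": ["BTC", "ETH"], "count": 2}}
step_input = resolve("${market_data.asset_names}", outputs)
message = resolve("scanned ${market_data.count} assets", outputs)
```

Failing loudly on an unknown reference would surface workflow mistakes such as a `foreach` over an output that no earlier step produces.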

5.6 Dynamic Tool Registry

Tools are Docker images spawned on demand, health-checked, and shut down after idle timeout:

class ToolRegistry:
    async def get(self, tool_name: str) -> ToolEndpoint:
        if tool_name in self.active_containers:
            return self.active_containers[tool_name]

        image = f"timmy-tools/{tool_name}:latest"
        if not self.docker.has_image(image):
            await self.docker.pull(image)

        port = self._find_free_port()
        container = await self.docker.run(image, ports={f"{port}/tcp": port})
        endpoint = ToolEndpoint(url=f"http://localhost:{port}", ...)

        if not await self._health_check(endpoint):
            await self.destroy(tool_name)
            raise ToolUnavailable(tool_name)

        self.active_containers[tool_name] = endpoint
        asyncio.create_task(self._idle_watcher(tool_name))
        return endpoint

All tool containers expose a standardized contract:

GET /health — capability advertisement
POST /execute — stateless command execution
POST /shutdown — graceful teardown

5.7 Economic Layer

Every workflow has a Lightning budget. Tools cost sats to run:

import uuid


class LightningWallet:
    async def execute_with_budget(self, workflow: Workflow) -> Result:
        estimated_cost = self.estimate_cost(workflow.tools_required)
        if self.balance < estimated_cost:
            await self.request_funding(estimated_cost)

        # Assumption: each run is keyed by a freshly generated ID
        run_id = uuid.uuid4().hex
        self.escrow[run_id] = estimated_cost
        self.balance -= estimated_cost

        try:
            result = await self.execute_workflow(workflow)
            actual_cost = self.calculate_actual_cost(result)
            self.balance += (estimated_cost - actual_cost)  # refund the unspent part
            return result
        except Exception:
            self.balance += estimated_cost  # full refund on failure
            raise
        finally:
            self.escrow.pop(run_id, None)  # release the escrow entry either way

5.8 Assessment: Ghost Core vs Current Architecture

Where Ghost Core aligns with this research:

ReAct+Reflexion loop = Section 4.1's perceive-decide-act-reflect
YAML workflows = formalized version of the agentic loop
Dynamic tool registry = Section 9's plugin architecture
Economic layer = REVELATION_PLAN's Lightning treasury
Self-modifying workflows = Section 10's self-improvement loops

Open questions requiring decisions (see Interview section):

Migration strategy: Big-bang rewrite to ~2K lines vs incremental extraction? The 2K line limit is aspirational but risks breaking working code.
Docker dependency: The tool registry assumes Docker everywhere. What about bare-metal RPi deployments? Need a substrate-aware fallback.
YAML as intelligence: Workflows in YAML are powerful for structured tasks but may be limiting for emergent agent behavior. Need both structured (YAML) and unstructured (LLM-driven) execution paths.
Dashboard fate: The Ghost Core vision externalizes the dashboard, but the dashboard IS the primary user interface today. Needs careful migration planning.

5.9 Proposed Migration Phases

Phase	Weeks	Scope	Risk
1. Extract Bots	1-2	Discord/Telegram to `timmy-bridges` container	Low
2. Workflow Engine	3-4	`workflow_engine.py` + convert hardcoded logic to YAML	Medium
3. Tool Registry	5-6	Dockerize first tool, implement spawn/destroy	Medium
4. Economic Layer	7-8	Lightning wallet integration, budget constraints	Medium
5. Reflexion	9-10	Self-critique, YAML auto-modification, LESSONS.md	Low

5.10 Anti-Patterns to Enforce

From the Ghost Core philosophy, enforceable via CI:

Line count check: CI warns if src/timmy/ exceeds target threshold
No in-process heavy tools: PyTorch, transformers in containers only
All capabilities as workflows or registered tools
No cloud dependencies in core (user-provided API keys are opt-in)
Every YAML workflow must have a contract test
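
The line count check could be a short script run as a CI step. This is a sketch: it counts physical lines in `.py` files (blank lines and comments included), and the function names and the 2,000-line default are assumptions drawn from the budget in Section 5.2.

```python
import tempfile
from pathlib import Path


def count_python_lines(root: Path) -> int:
    """Total physical lines across all .py files under root."""
    return sum(
        len(path.read_text(encoding="utf-8").splitlines())
        for path in root.rglob("*.py")
    )


def check_budget(root: Path, limit: int = 2000) -> tuple[bool, str]:
    total = count_python_lines(root)
    ok = total <= limit
    return ok, f"{root}: {total} lines (limit {limit}) {'OK' if ok else 'OVER BUDGET'}"


# Demo on a throwaway tree; CI would call check_budget(Path("src/timmy")).
demo = Path(tempfile.mkdtemp())
(demo / "ghost.py").write_text("a = 1\nb = 2\n", encoding="utf-8")
(demo / "utils").mkdir()
(demo / "utils" / "log.py").write_text("x = 0\n", encoding="utf-8")
ok, report = check_budget(demo, limit=2)
```

Since the rule is phrased as a warning, the CI job could print `report` and exit successfully, switching to a hard failure once the migration is further along.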

6. Memory System Evolution

6.1 Current State Analysis

The system has four separate memory stores:

Store	Location	Purpose	Technology
Hot Memory	`MEMORY.md`	Always-loaded context	Flat file
Vault	`memory/` dir	Structured notes	Markdown files
UnifiedMemory	`data/brain.db`	Semantic search, facts	SQLite + embeddings
Agno DB	`timmy.db`	Conversation history	SQLite (Agno-managed)

Plus Spark has its own event/memory SQLite tables.

6.2 Target Architecture: Unified Memory with Tiers

┌──────────────────────────────────────────────────────────┐
│                    MemoryFacade                           │
│  Single API for all memory operations                    │
├──────────────────────────────────────────────────────────┤
│  Working Memory    │ Episodic Memory  │ Semantic Memory  │
│  (current context, │ (conversations,  │ (facts, skills,  │
│   active plans,    │  events, actions)│  patterns)       │
│   tool state)      │                  │                  │
├──────────────────────────────────────────────────────────┤
│  Storage Backend (pluggable)                             │
│  SQLite (dev) → DuckDB (analytics) → rqlite (federation)│
└──────────────────────────────────────────────────────────┘

Working Memory replaces MEMORY.md and hot memory:

In-memory cache with periodic persistence
Current conversation context, active plans, tool state
Size-limited with LRU eviction
Survives restarts via checkpoint
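
The working-memory properties listed above can be sketched with an ordered dict and a JSON checkpoint. The class and method names are hypothetical, and values are assumed to be JSON-serializable.

```python
import json
import os
import tempfile
from collections import OrderedDict


class WorkingMemory:
    """Size-limited key/value cache with LRU eviction and JSON checkpoints."""

    def __init__(self, max_items: int = 128):
        self.max_items = max_items
        self._items = OrderedDict()

    def put(self, key: str, value) -> None:
        self._items[key] = value
        self._items.move_to_end(key)         # mark as most recently used
        while len(self._items) > self.max_items:
            self._items.popitem(last=False)  # evict least recently used

    def get(self, key: str, default=None):
        if key not in self._items:
            return default
        self._items.move_to_end(key)
        return self._items[key]

    def checkpoint(self, path: str) -> None:
        tmp = path + ".tmp"
        with open(tmp, "w", encoding="utf-8") as fh:
            json.dump(list(self._items.items()), fh)
        os.replace(tmp, path)                # atomic swap: no half-written checkpoint

    @classmethod
    def restore(cls, path: str, max_items: int = 128) -> "WorkingMemory":
        mem = cls(max_items)
        with open(path, encoding="utf-8") as fh:
            for key, value in json.load(fh):
                mem.put(key, value)
        return mem


wm = WorkingMemory(max_items=2)
wm.put("plan", "refactor config")
wm.put("tool_state", {"cwd": "/tmp"})
wm.get("plan")                # touch: "plan" is now most recent
wm.put("context", "chat #7")  # evicts "tool_state"

path = os.path.join(tempfile.mkdtemp(), "working_memory.json")
wm.checkpoint(path)
restored = WorkingMemory.restore(path)
```

Writing to a temporary file and swapping it in means a crash during checkpointing leaves the previous checkpoint intact.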

Episodic Memory replaces vault notes and conversation history:

Timestamped records of events, conversations, decisions
Compressed summaries for older episodes
Searchable by time range, tags, participants
The agentic loop steps should automatically become episodes

Semantic Memory replaces brain facts and vector store:

Long-term knowledge: user preferences, learned patterns, world knowledge
Vector-indexed for similarity search
Confidence-scored with decay over time
Facts extracted from episodic memory via reflection
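
Confidence decay could follow a simple half-life rule, with each access nudging the score back up. Both formulas and the 30-day half-life are assumptions chosen for illustration; the source does not define the decay function.

```python
import math


def decayed_confidence(confidence: float, days_since_access: float,
                       half_life_days: float = 30.0) -> float:
    """Exponential decay: confidence halves every half_life_days without access."""
    return confidence * math.pow(0.5, days_since_access / half_life_days)


def reinforce(confidence: float, boost: float = 0.1) -> float:
    """Move confidence a fraction of the way toward 1.0 on each access."""
    return confidence + boost * (1.0 - confidence)


after_one_half_life = decayed_confidence(0.8, days_since_access=30)
after_access = reinforce(0.5)
```

Both functions keep the score inside the 0 to 1 range as long as the input starts there, so no clamping step is needed.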

6.3 Memory Consolidation Pipeline

For a sovereign AGI, raw memories must be consolidated into durable knowledge:

Raw Events → Working Memory → Episodic Summary → Semantic Extraction
                                    ↓
                           Reflection Agent
                                    ↓
                    Patterns / Skills / Facts

The Spark engine already has a primitive version of this (_maybe_consolidate). Generalize it:

class MemoryConsolidator:
    """Periodic background task that strengthens memories."""

    async def consolidate(self):
        # 1. Summarize recent episodes into condensed form
        recent = await self.memory.get_episodes(hours=24)
        summary = await self.llm.summarize(recent)
        await self.memory.store_episode_summary(summary)

        # 2. Extract facts from summaries
        facts = await self.llm.extract_facts(summary)
        for fact in facts:
            await self.memory.store_fact(fact)

        # 3. Decay old, unaccessed memories
        await self.memory.decay_unused(older_than_days=30)

        # 4. Strengthen frequently accessed memories
        await self.memory.reinforce_popular(min_access_count=5)

7. Agent Coordination & Orchestration

7.1 Current Orchestrator Analysis

TimmyOrchestrator in timmy/agents/timmy.py routes requests using keyword matching:

direct_patterns = ["your name", "who are you", "hello", ...]
memory_patterns = ["we talked about", "remember", ...]

This is fragile — it doesn't handle ambiguous requests well and doesn't learn from routing outcomes.

7.2 Deterministic State Machine

Replace keyword-based routing with an explicit state machine:

from dataclasses import dataclass
from enum import Enum, auto


class TaskState(Enum):
    RECEIVED = auto()
    CLASSIFIED = auto()
    ROUTED = auto()
    EXECUTING = auto()
    REVIEWING = auto()
    COMPLETED = auto()
    FAILED = auto()


@dataclass
class TaskContext:
    state: TaskState
    request: str
    classification: dict  # intent, complexity, required_capabilities
    assigned_agent: str
    execution_history: list[StepResult]

    def transition(self, new_state: TaskState):
        # Validate transition is legal
        # Persist state for recovery
        # Emit event
        ...

Key benefits:

Every task has a clear audit trail
Failed tasks can be replayed from last checkpoint
State is serializable — survives restarts
Transitions emit events → Spark can learn from routing patterns
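
The "validate transition is legal" step could be driven by an explicit table. The table below is an assumption, since the source does not enumerate which moves are allowed, and the `Task` class is a simplified stand-in for `TaskContext`.

```python
from enum import Enum, auto


class TaskState(Enum):
    RECEIVED = auto()
    CLASSIFIED = auto()
    ROUTED = auto()
    EXECUTING = auto()
    REVIEWING = auto()
    COMPLETED = auto()
    FAILED = auto()


# Assumed transition table; the source does not enumerate legal moves.
LEGAL = {
    TaskState.RECEIVED: {TaskState.CLASSIFIED, TaskState.FAILED},
    TaskState.CLASSIFIED: {TaskState.ROUTED, TaskState.FAILED},
    TaskState.ROUTED: {TaskState.EXECUTING, TaskState.FAILED},
    TaskState.EXECUTING: {TaskState.REVIEWING, TaskState.FAILED},
    TaskState.REVIEWING: {TaskState.COMPLETED, TaskState.EXECUTING, TaskState.FAILED},
    TaskState.COMPLETED: set(),
    TaskState.FAILED: set(),
}


class IllegalTransition(Exception):
    pass


class Task:
    def __init__(self, request: str):
        self.request = request
        self.state = TaskState.RECEIVED
        self.history = [TaskState.RECEIVED]  # audit trail of visited states

    def transition(self, new_state: TaskState) -> None:
        if new_state not in LEGAL[self.state]:
            raise IllegalTransition(f"{self.state.name} -> {new_state.name}")
        self.state = new_state
        self.history.append(new_state)       # persistence and event emission would hook in here


task = Task("Write a Python function")
for step in (TaskState.CLASSIFIED, TaskState.ROUTED, TaskState.EXECUTING):
    task.transition(step)
```

Keeping the table as data makes the allowed flow reviewable at a glance and easy to cover with a contract test.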

7.3 Intelligent Routing (Replace Keyword Matching)

Instead of keyword patterns, use the LLM to classify intent:

class IntentClassifier:
    """Classify user intent using the local LLM."""

    CLASSIFICATION_PROMPT = """Classify this request into ONE category:
    - DIRECT: Simple question, greeting, status check
    - RESEARCH: Needs external information gathering
    - CODE: Programming, file operations, tool building
    - MEMORY: Recall past conversations or decisions
    - CREATIVE: Writing, content generation
    - COMPLEX: Multi-step task requiring orchestration

    Request: {request}
    Category:"""

    async def classify(self, request: str) -> str:
        result = await self.llm.complete(
            self.CLASSIFICATION_PROMPT.format(request=request)
        )
        return result.strip().upper()

Cache classifications to avoid repeated LLM calls for similar requests.
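
A sketch of that cache, combined with validation of the model's answer. It assumes an exact-match cache on normalized text (matching genuinely similar requests would need embeddings) and falls back to `COMPLEX` for unrecognized labels; `CachedClassifier` and `fake_llm_classify` are hypothetical names.

```python
import asyncio

CATEGORIES = {"DIRECT", "RESEARCH", "CODE", "MEMORY", "CREATIVE", "COMPLEX"}


class CachedClassifier:
    """Wraps an async classify(request) -> str callable with validation and a cache."""

    def __init__(self, classify, fallback: str = "COMPLEX"):
        self._classify = classify
        self._fallback = fallback
        self._cache: dict[str, str] = {}

    @staticmethod
    def _key(request: str) -> str:
        # Normalize case and whitespace so trivial variations share an entry.
        return " ".join(request.lower().split())

    async def classify(self, request: str) -> str:
        key = self._key(request)
        if key not in self._cache:
            label = (await self._classify(request)).strip().upper()
            # Models sometimes answer with extra words; fall back rather than misroute.
            self._cache[key] = label if label in CATEGORIES else self._fallback
        return self._cache[key]


calls = []


async def fake_llm_classify(request: str) -> str:
    """Stand-in for the real classifier; returns canned answers."""
    calls.append(request)
    return "code\n" if "function" in request else "Sure! I think it's RESEARCH."


classifier = CachedClassifier(fake_llm_classify)


async def demo():
    a = await classifier.classify("Write a Python function")
    b = await classifier.classify("  write a python   FUNCTION ")
    c = await classifier.classify("What is the weather?")
    return a, b, c


results = asyncio.run(demo())
```

Validating against the fixed category set keeps a chatty or malformed model reply from producing a route that no agent handles.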

7.4 Inter-Agent Communication Protocol

Currently agents communicate through the event bus with unstructured dicts. Define a protocol:

@dataclass
class AgentMessage:
    """Structured inter-agent communication."""
    from_agent: str
    to_agent: str
    message_type: str  # "request", "response", "delegate", "escalate"
    content: str
    context: dict       # Shared context for the conversation
    thread_id: str      # Groups related messages
    priority: int = 0
    requires_response: bool = True

8. Testing Strategy for Non-Deterministic Systems

8.1 Current Testing Gaps

The test suite has good coverage (73%+) but tests primarily cover deterministic code paths. The agent behavior, LLM responses, and orchestration decisions are largely untested because they're non-deterministic.

8.2 Testing Layers for AGI Systems

┌─────────────────────────────────────────────┐
│  Property-Based Tests                        │
│  "Agent always stores user facts"            │
│  "Memory never loses data on restart"        │
├─────────────────────────────────────────────┤
│  Behavioral Tests (with mock LLM)            │
│  "Given X input, agent routes to Y agent"    │
│  "Agentic loop produces 3+ steps"            │
├─────────────────────────────────────────────┤
│  Contract Tests                              │
│  "Agent always returns valid JSON"           │
│  "Memory API satisfies interface contract"   │
├─────────────────────────────────────────────┤
│  Unit Tests (current)                        │
│  "Config loads correctly"                    │
│  "Router circuits break at threshold"        │
└─────────────────────────────────────────────┘

8.3 Mock LLM for Deterministic Agent Tests

import pytest


class MockLLM:
    """Deterministic LLM for testing agent behavior."""

    def __init__(self, responses: dict[str, str]):
        self.responses = responses
        self.calls = []

    async def complete(self, prompt: str) -> str:
        # async so it can stand in wherever the real client is awaited
        self.calls.append(prompt)
        for pattern, response in self.responses.items():
            if pattern in prompt.lower():
                return response
        return "I don't know."


# Usage in tests (requires pytest-asyncio):
@pytest.mark.asyncio
async def test_orchestrator_routes_code_requests():
    llm = MockLLM({"classify": "CODE"})
    orch = TimmyOrchestrator(llm=llm)
    result = await orch.orchestrate("Write a Python function")
    assert result.routed_to == "forge"

8.4 Golden Path Tests

Record real agent interactions and replay them as regression tests:

@pytest.mark.golden
def test_agentic_loop_solves_simple_task():
    """Golden test: agentic loop produces valid plan and executes it."""
    result = run_agentic_loop("List the files in the current directory")

    assert result.status in ("completed", "partial")
    assert len(result.steps) >= 1
    assert result.summary  # Non-empty summary
    assert result.total_duration_ms > 0

8.5 Chaos Testing for Resilience

@pytest.mark.chaos
def test_agent_handles_ollama_crash():
    """Agent degrades gracefully when Ollama dies mid-request."""
    with SimulatedOllamaCrash(after_n_requests=2):
        result = run_agentic_loop("Complex multi-step task")
        assert result.status in ("partial", "failed")
        assert "error" not in result.summary.lower() or "recovered" in result.summary.lower()
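`SimulatedOllamaCrash` is not defined in this document. One possible shape, assuming all LLM traffic passes through a single callable attribute on some client object, is sketched here. The `target` and `attribute` arguments are additions made so the sketch is self-contained; the test above passes only `after_n_requests`.

```python
from unittest import mock


class SimulatedOllamaCrash:
    """Context manager that makes a callable fail after N successful calls."""

    def __init__(self, target: object, attribute: str, after_n_requests: int):
        self._target = target
        self._attribute = attribute
        self._remaining = after_n_requests
        self._patcher = None

    def __enter__(self):
        original = getattr(self._target, self._attribute)

        def flaky(*args, **kwargs):
            # Once the allowance is used up, every call looks like a dead server.
            if self._remaining <= 0:
                raise ConnectionError("simulated Ollama crash")
            self._remaining -= 1
            return original(*args, **kwargs)

        self._patcher = mock.patch.object(self._target, self._attribute, flaky)
        self._patcher.start()
        return self

    def __exit__(self, exc_type, exc, tb):
        self._patcher.stop()  # restore the real callable
        return False  # never swallow exceptions raised inside the block
```

Raising `ConnectionError` is a choice made for this sketch; a real fixture would raise whatever exception the project's HTTP client produces when Ollama is unreachable.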

9. Plugin & Extension Architecture

9.1 Current State

The MCP (Model Context Protocol) tool registry is partially implemented:

infrastructure/hands/tools.py registers shell and git tools
BaseAgent._create_agent() looks up tools from registry
However, the registry import is wrapped in try/except, so the registry is effectively optional

9.2 Recommendation: Tool Capability System

Formalize tools as capabilities with a discovery protocol:

from dataclasses import dataclass
from typing import Callable

@dataclass
class ToolCapability:
    name: str
    description: str
    input_schema: dict
    output_schema: dict
    cost_estimate: int  # sats per invocation (for economic routing)
    requires_approval: bool
    safety_level: str  # "safe", "review", "dangerous"

class ToolRegistry:
    """Central registry for all agent capabilities."""

    def register(self, tool: ToolCapability, handler: Callable):
        ...

    def discover(self, query: str) -> list[ToolCapability]:
        """Natural language tool discovery."""
        ...

    def get_tools_for_agent(self, agent_id: str) -> list[ToolCapability]:
        """Permission-filtered tools for a specific agent."""
        ...
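To make the interface concrete, here is a minimal in-memory sketch. It is not the project's implementation: `InMemoryToolRegistry` and `grant` are hypothetical names, the dataclass is trimmed to three fields, and keyword overlap stands in for real natural-language discovery.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class ToolCapability:
    # Trimmed copy of the dataclass above so this sketch runs on its own.
    name: str
    description: str
    safety_level: str = "safe"  # "safe", "review", "dangerous"


class InMemoryToolRegistry:
    def __init__(self) -> None:
        self._tools: dict[str, tuple[ToolCapability, Callable]] = {}
        self._grants: dict[str, set[str]] = {}  # agent_id -> allowed tool names

    def register(self, tool: ToolCapability, handler: Callable) -> None:
        if tool.name in self._tools:
            raise ValueError(f"tool already registered: {tool.name}")
        self._tools[tool.name] = (tool, handler)

    def grant(self, agent_id: str, tool_name: str) -> None:
        self._grants.setdefault(agent_id, set()).add(tool_name)

    def discover(self, query: str) -> list[ToolCapability]:
        # Keyword overlap is a placeholder for embedding-based search.
        words = set(query.lower().split())
        return [
            tool for tool, _ in self._tools.values()
            if words & set(f"{tool.name} {tool.description}".lower().split())
        ]

    def get_tools_for_agent(self, agent_id: str) -> list[ToolCapability]:
        # Default deny: an agent with no grants sees no tools.
        allowed = self._grants.get(agent_id, set())
        return [tool for name, (tool, _) in self._tools.items() if name in allowed]
```

The default-deny behavior in `get_tools_for_agent` is the part that matters most for the capability model: tools are invisible to an agent until explicitly granted.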

9.3 Plugin Loading Pattern

# plugins/my_plugin/__init__.py
def register(registry: ToolRegistry):
    registry.register(
        ToolCapability(name="my_tool", ...),
        handler=my_handler,
    )

# App startup
import importlib
from pathlib import Path

for plugin_dir in sorted(Path("plugins").iterdir()):  # sorted for a stable load order
    if plugin_dir.is_dir() and (plugin_dir / "__init__.py").exists():
        module = importlib.import_module(f"plugins.{plugin_dir.name}")
        module.register(tool_registry)

This enables third-party extensions without modifying core code.

10. Self-Improvement Loops

10.1 Current Self-Modification

The self_modify_enabled config flag exists, but the implementation is basic. For sovereign AGI, self-improvement must be systematic.

10.2 Reflection-Based Self-Improvement

┌─────────────────────────────────────────────────┐
│                Improvement Loop                  │
│                                                  │
│  1. OBSERVE: Collect performance metrics         │
│     - Task success rates per agent               │
│     - Latency distributions                      │
│     - Memory retrieval relevance scores          │
│     - User satisfaction signals                  │
│                                                  │
│  2. REFLECT: Identify patterns                   │
│     - Which tasks fail most?                     │
│     - Which tool combinations work best?         │
│     - Which prompts produce best results?        │
│                                                  │
│  3. HYPOTHESIZE: Generate improvements           │
│     - Better system prompts                      │
│     - Better tool selection strategies           │
│     - Better memory retrieval queries            │
│                                                  │
│  4. EXPERIMENT: Test improvements                │
│     - A/B test prompt variations                 │
│     - Measure before/after metrics               │
│     - Roll back if worse                         │
│                                                  │
│  5. INTEGRATE: Apply validated improvements      │
│     - Update system prompts                      │
│     - Adjust routing weights                     │
│     - Store new patterns in semantic memory      │
└─────────────────────────────────────────────────┘
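Step 4 of the loop (experiment, measure, roll back if worse) can be reduced to a small comparison function. This is a sketch under assumptions: `run_prompt_experiment`, the `score` callback, and the `min_gain` threshold are hypothetical, and a real experiment would also need enough tasks for the difference to be meaningful.

```python
import statistics
from typing import Callable, Sequence


def run_prompt_experiment(
    baseline: str,
    candidate: str,
    score: Callable[[str, str], float],
    tasks: Sequence[str],
    min_gain: float = 0.05,
) -> str:
    """Return the prompt to keep after scoring both on the same tasks."""
    base_mean = statistics.mean(score(baseline, task) for task in tasks)
    cand_mean = statistics.mean(score(candidate, task) for task in tasks)
    # Roll back (keep the baseline) unless the candidate wins by a clear margin.
    return candidate if cand_mean - base_mean >= min_gain else baseline
```

Requiring a minimum gain, rather than any improvement at all, keeps noise in the scoring from slowly drifting the system prompt.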

10.3 Reward Model Integration

The config already has reward_model_enabled and reward_model_votes. Wire this into the reflection loop:

import statistics

class RewardSignal:
    """Evaluate agent output quality."""

    async def score(self, task: str, output: str) -> float:
        if settings.reward_model_enabled:
            # Use PRM-style scoring via Ollama
            scores = []
            for _ in range(settings.reward_model_votes):
                score = await self.reward_model.evaluate(task, output)
                scores.append(score)
            return statistics.median(scores)

        # Fallback: heuristic scoring
        return self._heuristic_score(task, output)

10.4 Autoresearch Integration

The autoresearch_* config fields show intent for Karpathy-style experiment loops. Connect this to the Lab agent:

class AutoresearchLoop:
    """Autonomous ML experiment iteration."""

    async def run_iteration(self, experiment: Experiment):
        # 1. Run experiment with time budget
        result = await self.run_with_timeout(
            experiment, timeout=settings.autoresearch_time_budget
        )

        # 2. Evaluate metric
        metric_value = result.metrics[settings.autoresearch_metric]

        # 3. Log to Spark for pattern detection
        spark_engine.on_creative_step(
            project_id=experiment.id,
            step_name="experiment_run",
            agent_id="lab",
        )

        # 4. Generate next hypothesis
        next_experiment = await self.lab_agent.hypothesize(
            experiment, result, metric_value
        )

        return next_experiment

11. Implementation Priority Matrix

Tier 1: Do Now (Weeks 1-4)

These changes directly improve maintainability with minimal risk:

#	Change	Files Affected	Effort
1	Enable WAL mode for all SQLite databases	`brain/memory.py`, `swarm/event_log.py`	1 hour
2	Unify EventBus + swarm event_log	`infrastructure/events/`, `swarm/event_log.py`	1 day
3	Add lazy init guards to singletons	`spark/engine.py`, `config.py`	1 day
4	Add MockLLM for deterministic agent tests	`tests/conftest.py`	1 day
5	Consolidate MEMORY.md hot memory into brain DB	`timmy/memory_system.py`, `brain/memory.py`	2 days
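Item 1 is small enough to show in full. The helper below is a sketch: `connect_wal` is a hypothetical name, and the five-second busy timeout is an assumed value rather than a project setting.

```python
import sqlite3


def connect_wal(db_path: str) -> sqlite3.Connection:
    """Open a SQLite database and switch it to write-ahead logging."""
    conn = sqlite3.connect(db_path)
    mode = conn.execute("PRAGMA journal_mode=WAL").fetchone()[0]
    if mode.lower() != "wal":
        # In-memory databases, for example, cannot use WAL.
        raise RuntimeError(f"could not enable WAL, journal_mode is {mode!r}")
    # A busy timeout makes writers wait briefly instead of failing at once.
    conn.execute("PRAGMA busy_timeout=5000")
    return conn
```

WAL mode is stored in the database file, so it persists across connections, but routing every connection through one helper keeps the busy timeout consistent as well.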

Tier 2: Near-Term (Months 1-2)

Architectural improvements that set up the AGI foundation:

#	Change	Impact	Effort
6	Introduce AgentCore perceive-decide-act-reflect interface	Enables substrate portability	3 days
7	Replace keyword routing with LLM-based intent classification	Better orchestration quality	2 days
8	Add explicit TaskState machine with persistence	Debuggability, recovery	3 days
9	Split config.py into namespaced sections	Maintainability	2 days
10	Integrate sqlite-vec for vector search	Memory performance at scale	2 days
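For item 8, a persisted state machine might look like the sketch below. The four states, the transition table, and the `TaskStore` class are assumptions chosen for illustration; the document does not fix the actual set of states.

```python
import sqlite3
from enum import Enum


class TaskState(str, Enum):
    PENDING = "pending"
    RUNNING = "running"
    COMPLETED = "completed"
    FAILED = "failed"


# Allowed transitions; anything not listed is rejected.
TRANSITIONS = {
    TaskState.PENDING: {TaskState.RUNNING},
    TaskState.RUNNING: {TaskState.COMPLETED, TaskState.FAILED},
    TaskState.COMPLETED: set(),
    TaskState.FAILED: {TaskState.PENDING},  # retry
}


class TaskStore:
    def __init__(self, conn: sqlite3.Connection) -> None:
        self.conn = conn
        conn.execute(
            "CREATE TABLE IF NOT EXISTS tasks (id TEXT PRIMARY KEY, state TEXT NOT NULL)"
        )

    def create(self, task_id: str) -> None:
        with self.conn:  # commits on success
            self.conn.execute(
                "INSERT INTO tasks VALUES (?, ?)", (task_id, TaskState.PENDING.value)
            )

    def state(self, task_id: str) -> TaskState:
        row = self.conn.execute(
            "SELECT state FROM tasks WHERE id = ?", (task_id,)
        ).fetchone()
        if row is None:
            raise KeyError(task_id)
        return TaskState(row[0])

    def transition(self, task_id: str, new_state: TaskState) -> None:
        current = self.state(task_id)
        if new_state not in TRANSITIONS[current]:
            raise ValueError(f"illegal transition {current.value} -> {new_state.value}")
        with self.conn:
            self.conn.execute(
                "UPDATE tasks SET state = ? WHERE id = ?", (new_state.value, task_id)
            )
```

Because the state lives in SQLite, a task interrupted by a restart is still visible as `running` on the next boot and can be retried or marked failed.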

Tier 3: Medium-Term (Months 2-4)

Scale and autonomy features:

#	Change	Impact	Effort
11	Memory consolidation pipeline	Durable knowledge extraction	1 week
12	Capability-based permission model	Multi-agent security	1 week
13	Agent execution pool with backpressure	Concurrent agent scaling	3 days
14	Reflection-based self-improvement loop	Autonomous improvement	2 weeks

Tier 4: Long-Term (Months 4+)

As documented in ROADMAP.md and REVELATION_PLAN.md:

Substrate abstraction layer
Federation via Nostr/rqlite
Plugin marketplace
ZK-ML verification

Appendix A: Key Architectural Principles for Sovereign AGI

Local-first, always. Every capability must work without internet. Cloud is augmentation, never requirement.
Memory is identity. The memory system IS the agent. Lose the memories, lose the agent. Durability is non-negotiable.
Events are truth. Every action, decision, and outcome is an event. Current state is derived from events. This enables replay, debugging, and learning.
Capabilities, not permissions. Agents have capabilities they can use, not global flags that enable features. Capabilities can be granted, revoked, and audited.
Reflection closes the loop. Every action should produce a reflection opportunity. Did it work? What can be learned? How can the system improve?
Substrate independence. The cognitive architecture (perceive, decide, act, remember) doesn't change. Only the substrate (server, desktop, robot) changes.
Economic sovereignty. The agent should be able to earn, save, spend, and invest. Bitcoin/Lightning is the economic substrate.
Graceful degradation, always. If a component fails, the system continues with reduced capability. Never crash. Never lose data.

Appendix B: File-Level Impact Map

Key files that need changes for each tier:

Tier 1 (Immediate):
  src/brain/memory.py          — WAL mode, hot memory integration
  src/infrastructure/events/bus.py — Persistence, replay
  src/swarm/event_log.py       — Merge into EventBus
  src/spark/engine.py          — Lazy singleton
  tests/conftest.py            — MockLLM fixture

Tier 2 (Foundation):
  src/timmy/agent_core/interface.py — AgentCore ABC
  src/timmy/agents/timmy.py    — LLM-based routing
  src/timmy/agents/base.py     — TaskState machine
  src/config.py                — Namespace split
  src/brain/memory.py          — sqlite-vec integration

Tier 3 (Scale):
  src/timmy/memory_system.py   — Consolidation pipeline
  src/infrastructure/hands/tools.py — Capability model
  src/timmy/agentic_loop.py    — Execution pool
  NEW: src/timmy/reflection.py — Self-improvement loop

Tier 4 (Vision):
  NEW: src/timmy/substrate.py  — Substrate abstraction
  src/brain/client.py          — Federation
  NEW: src/plugins/           — Plugin system

Appendix C: Grok's Research Validation (March 9, 2026)

Grok performed independent research validation against 2026 literature and confirmed alignment on all major architectural choices:

Component	Research Backing	Confidence
ReAct+Reflexion loop	Original 2023 Reflexion paper (most-cited in 2026). Agents that self-critique after every step outperform GPT-4 by 11-22% on real tasks. Minimal ReAct cores routinely built in <300 lines.	High
YAML workflows as intelligence	Proven anti-bloat strategy in production agents. Self-modifying YAML with git versioning keeps agents lean while evolving. Intelligence in patchable files, not redeployable code.	High
Dynamic Docker tool registry	2026 cutting-edge patterns (Docker MCP Gateway, Agent Sandbox) use on-demand container spin-up. Keeps core tiny and secure.	High
Lightning L402 economic layer	Lightning Labs' 2026 AI toolkit (lnget + L402) enables autonomous API payment and paid service hosting. Workflows can self-fund.	High
Three-tier memory (hot/vault/semantic)	2026 enterprise pattern. Lightweight, local-first, pairs with Reflexion's episodic lessons.	High

Grok's key insight: The Ghost Core spec is not an approximation — it's the refined, research-validated evolution of current agent architecture patterns. The 2,000-line core constraint is achievable and maintainable.

Appendix D: Architectural Decisions Record (ADR)

These decisions were made during the research interview (March 8, 2026):

ADR-1: Migration Strategy — Hybrid Core Rewrite + Incremental Shell

Decision: Build the Ghost Core (ghost.py, reflexion.py, workflow_engine.py) as NEW modules alongside existing code. Gradually route traffic from the old orchestrator to the new Ghost Core. Keep the dashboard as-is during migration.

Rationale: Avoids the risk of a big-bang rewrite while still achieving the architectural target. The old code continues to work while the new cognitive kernel is validated. Traffic can be shifted incrementally per-route.

Consequence: Temporary complexity from having two orchestration paths. Must be disciplined about completing migration — don't let both paths persist indefinitely.

ADR-2: Intelligence Model — Dual-Track (YAML + LLM)

Decision: Known tasks use YAML workflows (fast, deterministic, auditable). Novel tasks trigger LLM-driven agentic loops. Over time, successful LLM patterns get codified into new YAML workflows automatically.

Rationale: Pure YAML is too rigid for emergent AGI behavior. Pure LLM is too unpredictable and expensive. The dual-track model gives determinism where possible and flexibility where needed, with a natural path for workflows to evolve.

Consequence: Need a clear decision point for "is this a known task?" (solved by the intent classifier in Section 7.3). Need a workflow generation pipeline that captures successful LLM patterns as YAML.
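The decision point in ADR-2 can be sketched as a single dispatch function. All names here (`dispatch`, `classify`, `workflows`, `agentic_loop`) are hypothetical stand-ins, and workflows are modeled as plain callables rather than parsed YAML.

```python
from typing import Callable

# Hypothetical type: a workflow takes the request text and returns a result.
Workflow = Callable[[str], str]


def dispatch(
    request: str,
    classify: Callable[[str], str],
    workflows: dict[str, Workflow],
    agentic_loop: Callable[[str], str],
) -> tuple[str, str]:
    """Route a request to a known workflow, or fall back to the LLM loop."""
    intent = classify(request)
    workflow = workflows.get(intent)
    if workflow is not None:
        return "workflow", workflow(request)  # deterministic track
    return "agentic", agentic_loop(request)  # novel task: LLM-driven track
```

Returning the track name alongside the result gives the workflow generation pipeline a simple signal: requests that keep landing on the agentic track are candidates for codification.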

ADR-3: Tool Runtime — Subprocess Sandboxing

Decision: Tools run as separate processes with bubblewrap/namespace sandboxing on Linux. Not Docker-only.

Rationale: Docker is heavy for RPi deployments and adds latency for lightweight tools. Subprocess sandboxing with bubblewrap provides isolation without the container overhead. Works on bare metal, still isolated.

Consequence: Need a ToolRunner abstraction that supports both subprocess and Docker backends. The tool contract (health check, execute, shutdown) stays the same regardless of backend.
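A subprocess backend for the ToolRunner abstraction could be sketched as follows. The class names and the `execute` signature are assumptions. The bubblewrap flag set is illustrative only, not a hardened profile, and the health check and shutdown parts of the tool contract are omitted.

```python
import subprocess
import sys
from typing import Protocol


class ToolRunner(Protocol):
    """Backend-independent contract (hypothetical): run a tool, return stdout."""

    def execute(self, argv: list[str], timeout: float) -> str: ...


class SubprocessRunner:
    """Runs a tool as a child process, optionally wrapped in bubblewrap."""

    def __init__(self, sandbox: bool = False) -> None:
        self.sandbox = sandbox

    def build_argv(self, argv: list[str]) -> list[str]:
        if not self.sandbox:
            return argv
        # Read-only view of the host filesystem, fresh namespaces, no network.
        return [
            "bwrap", "--ro-bind", "/", "/", "--dev", "/dev", "--proc", "/proc",
            "--unshare-all", "--die-with-parent", *argv,
        ]

    def execute(self, argv: list[str], timeout: float = 10.0) -> str:
        done = subprocess.run(
            self.build_argv(argv), capture_output=True, text=True,
            timeout=timeout, check=True,
        )
        return done.stdout


if __name__ == "__main__":
    runner = SubprocessRunner()  # sandbox=True requires bwrap on the host
    print(runner.execute([sys.executable, "-c", "print('hello from tool')"]))
```

A Docker backend would implement the same `execute` method, so callers depend only on the `ToolRunner` protocol and not on how isolation is achieved.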

ADR-4: Immediate Priority — Event System + State Machine

Decision: First 2-week sprint focuses on unifying EventBus + swarm event_log, adding persistent event sourcing, and implementing the TaskState machine.

Rationale: This is the foundation everything else depends on. Without reliable event persistence and task state tracking, the Ghost Core can't debug itself, the Reflexion loop can't learn from history, and the workflow engine can't recover from failures.
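As a rough illustration of what persistent event sourcing with replay involves, consider this sketch. The `EventStore` class and its table layout are assumptions, not the existing EventBus or event_log schema.

```python
import json
import sqlite3
import time
from typing import Callable


class EventStore:
    """Append-only event log backed by SQLite."""

    def __init__(self, conn: sqlite3.Connection) -> None:
        self.conn = conn
        conn.execute(
            "CREATE TABLE IF NOT EXISTS events ("
            "seq INTEGER PRIMARY KEY AUTOINCREMENT, ts REAL NOT NULL, "
            "type TEXT NOT NULL, payload TEXT NOT NULL)"
        )

    def append(self, event_type: str, payload: dict) -> int:
        with self.conn:  # commits on success, rolls back on error
            cur = self.conn.execute(
                "INSERT INTO events (ts, type, payload) VALUES (?, ?, ?)",
                (time.time(), event_type, json.dumps(payload)),
            )
        return cur.lastrowid

    def replay(self, handler: Callable[[str, dict], None], after_seq: int = 0) -> None:
        """Feed stored events to a handler in the order they were written."""
        rows = self.conn.execute(
            "SELECT type, payload FROM events WHERE seq > ? ORDER BY seq",
            (after_seq,),
        )
        for event_type, payload in rows:
            handler(event_type, json.loads(payload))
```

The `after_seq` argument lets a consumer such as the Spark engine resume from the last event it processed instead of re-reading the whole log.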

Work items for breakdown in follow-up session:

Merge infrastructure/events/bus.py + swarm/event_log.py into unified persistent event system
Add WAL mode to all SQLite databases
Implement TaskState enum and TaskContext dataclass with persistence
Add event replay capability for debugging
Wire Spark engine to consume from unified event stream
Add lazy init guards to all module-level singletons
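The last work item, lazy init guards, typically amounts to replacing a module-level instance with an accessor. In this sketch `SparkEngine` is a stand-in class, not the project's real engine, and `get_spark_engine` is a hypothetical accessor name.

```python
import threading
from typing import Optional


class SparkEngine:
    """Stand-in for an expensive component; not the project's real class."""

    instances_created = 0

    def __init__(self) -> None:
        SparkEngine.instances_created += 1


_engine: Optional[SparkEngine] = None
_engine_lock = threading.Lock()


def get_spark_engine() -> SparkEngine:
    """Create the engine on first use instead of at import time."""
    global _engine
    if _engine is None:
        with _engine_lock:
            if _engine is None:  # re-check after acquiring the lock
                _engine = SparkEngine()
    return _engine
```

Importing the module now has no side effects, which also makes it straightforward for tests to substitute a fake engine before the first call.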

This research is a living document. Update as decisions are made and implementations are validated. Track decisions in docs/DECISIONS.md.

Appendices: Peer Reviews

The following appendices contain external peer reviews solicited from five independent AI systems. Each was given the full research document and asked to identify gaps, risks, alternative approaches, prioritization feedback, and concrete suggestions. Reviews are reproduced as-received with minimal formatting edits.

Appendix E: Replit Review

Reviewer: Replit AI Date: March 2026

1. Gaps and Blind Spots

Distributed Tracing & Observability: The report champions event sourcing and a local-first approach (even turning telemetry off). However, by breaking tools out into ephemeral Docker containers/subprocess sandboxes, you are introducing a distributed system. The report misses a strategy for distributed tracing (e.g., passing trace IDs or OpenTelemetry context across container boundaries). Without this, debugging a workflow where an agent calls a container that calls another API will be nearly impossible.
Data Privacy & Scrubbing: A local-first system is inherently private, but "UnifiedMemory" acts as a sponge for all events and context. There is no mention of a memory scrubbing or redaction layer. If the agent acts on Discord or Telegram (as noted in integrations), how do we prevent PII or sensitive keys from becoming permanent "Semantic Memory" that is later retrieved and potentially leaked by the LLM?
Workflow Resiliency Semantics: The report introduces YAML workflows and a TaskState machine. But what happens when the host machine reboots halfway through a 3-day workflow? The report misses the specific snapshotting/resume mechanics required for long-running processes.
Network Sandboxing: While Bubblewrap/Docker are mentioned for execution sandboxing, network isolation is omitted. If a dynamic web-scraper tool is spun up, how do we prevent it from performing SSRF attacks against the host's internal network?

2. Challenges and Risks

The "Cold Start" Latency Penalty: Spinning up Docker containers dynamically for tools (ToolRegistry) introduces significant latency. An LLM ReAct loop that waits 2–5 seconds per tool step for a container to boot will feel sluggish and break the illusion of continuous thought. This is a severe risk to the user experience.
Fragility of Self-Modifying YAML: Using YAML as the medium for LLM self-modification is highly risky. LLMs frequently make subtle indentation or syntax errors. A single malformed YAML file could brick the workflow engine. The assumption that an LLM can reliably edit YAML orchestrations without breaking the parser is an underestimation of complexity.
The 2,000-Line Code Golf Trap: Setting a strict 2K line limit for the Ghost Core is an excellent guiding philosophy but a dangerous metric. It risks encouraging "code golf" (overly dense, clever code) over readability.
Dual-Track Orchestration Drift: ADR-1 proposes running the old core and Ghost Core side-by-side. The risk of these diverging and never actually completing the migration is extremely high. "Strangler Fig" patterns often leave behind permanent legacy appendages if not aggressively time-boxed.

3. Alternative Approaches

WebAssembly (WASM) over Docker/Bubblewrap: For the dynamic tool registry, strongly consider WASM (via Wasmtime or Extism) instead of Docker. WASM provides millisecond cold starts, strict capability-based security (WASI), and language agnosticism. It perfectly aligns with "Substrate Independence" and "Capability-Based Security" goals while eliminating the Docker latency tax.
DSL or Embedded Scripting over YAML: Instead of self-modifying YAML, consider using a sandboxed scripting language (like Starlark, Lua, or RestrictedPython) or strict JSON with a Pydantic schema validator. Starlark is designed exactly for this kind of deterministic, hermetic execution and is much safer for an LLM to generate than YAML.
Standardized DI over Homegrown: In Section 2.1, the report proposes a custom Container class. Instead of reinventing dependency injection, leverage FastAPI's existing Depends system, or a lightweight standard like contextvars to manage scoped state without global singletons.

4. Prioritization Feedback

Security Must Shift Left: The "Capability-based permission model" is currently in Tier 3 (Months 2-4). However, you are introducing dynamic Docker tool registries and self-modifying YAML in Tier 2. You cannot introduce dynamic code execution without the capability model already in place. Reprioritize Capability-based permissions to Tier 2.
Evaluation Harnesses belong in Tier 1: Tier 1 lists "MockLLM for deterministic tests," which is good, but structural refactoring requires behavioral evaluations. Before rewriting the core, Tier 1 should include an automated eval suite (even just 10 core prompts) to guarantee the Ghost Core migration doesn't degrade intelligence.
WAL Mode and DB Unification: Complete agreement on Tier 1. Consolidating SQLite databases and enabling WAL mode will yield immediate, high-ROI stability improvements.

5. Concrete Suggestions

Actionable YAML Validation (Section 5.5): If you commit to YAML for workflows, implement a strict Pydantic model for the YAML schema. Force the agent to pass its proposed YAML modifications through an isolated validation tool before it is allowed to overwrite the actual file on disk.
Tool Warming (Section 5.6): Implement "warm pools" for tool execution. Keep 1-2 generic worker processes running idly, and inject the specific tool instructions dynamically. This sidesteps the Docker cold-start issue.
Deprecation Deadline (ADR-1): Add a concrete termination condition to ADR-1. For example: "The legacy orchestrator will be entirely deleted exactly 4 weeks after Ghost Core handles 50% of traffic, regardless of edge-case parity."
Vector Search (Section 3.4): Proceed with sqlite-vec as recommended, but ensure you are chunking memories intelligently. Vector search performance degradation is often a symptom of storing monolithic chunks rather than the math itself. Implement a rolling summary for long contexts before they ever hit the vector DB.
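The validate-before-overwrite suggestion can be illustrated with the standard library alone. This sketch makes several substitutions that should be read as assumptions: the workflow is already parsed into a Python dict, plain checks stand in for a Pydantic model, JSON stands in for YAML on disk, and the schema (`name`, `steps`, `tool`, `args`) is hypothetical.

```python
import json
import os
import tempfile

ALLOWED_STEP_KEYS = {"tool", "args"}  # hypothetical workflow schema


def validate_workflow(doc: object) -> list[str]:
    """Return a list of problems; an empty list means the document is valid."""
    if not isinstance(doc, dict):
        return ["workflow must be a mapping"]
    errors = []
    if not isinstance(doc.get("name"), str) or not doc["name"]:
        errors.append("name must be a non-empty string")
    steps = doc.get("steps")
    if not isinstance(steps, list) or not steps:
        return errors + ["steps must be a non-empty list"]
    for i, step in enumerate(steps):
        if not isinstance(step, dict) or not isinstance(step.get("tool"), str):
            errors.append(f"step {i}: tool must be a string")
        elif set(step) - ALLOWED_STEP_KEYS:
            errors.append(f"step {i}: unknown keys {sorted(set(step) - ALLOWED_STEP_KEYS)}")
    return errors


def save_if_valid(doc: object, path: str) -> list[str]:
    """Write the workflow only when validation passes, via an atomic rename."""
    errors = validate_workflow(doc)
    if errors:
        return errors
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as handle:
        json.dump(doc, handle, indent=2)
    os.replace(tmp, path)  # the old file stays intact until this succeeds
    return []
```

Writing to a temporary file and renaming means a rejected or interrupted edit never leaves a half-written workflow on disk.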

Appendix F: Kimi Review

Reviewer: Kimi AI Date: March 2026

Executive Summary

The report presents a compelling vision for evolving Timmy Time from a dashboard-centric architecture to a sovereign AGI system via the "Ghost Core" pattern. The analysis of current maintainability issues (singleton proliferation, dual memory systems, import-time side effects) is accurate and actionable.

However, the report significantly underestimates production complexities in four critical areas: security architecture for autonomous systems, operational feasibility of the 2,000-line constraint, data migration safety, and human oversight mechanisms.

Recommendation: Revise Tier 1 priorities to include security hardening and observability infrastructure before proceeding with Ghost Core extraction. Increase core line budget to 4,000 lines with strict justification requirements. Add explicit human-in-the-loop circuit breakers before enabling self-modification.

1. Gaps & Blind Spots

1.1 Security Architecture (CRITICAL GAP)

The report mentions CSRF and security headers but entirely omits a threat model for autonomous agent security.

Risk	Current State	Required Mitigation
Prompt Injection	No discussion of input sanitization for YAML workflows that execute shell commands	YAML schema validation + capability sandboxing
Capability Escalation	Timmy can spawn Docker containers and self-modify workflows; no containment strategy	Substrate isolation (gVisor/Firecracker) for untrusted tools
Secrets Rotation	Lightning wallet keys, API keys stored in `.env` with no rotation strategy	HashiCorp Vault integration or SOPS-based secret management
Network Segmentation	Externalized tools communicate over plaintext HTTP on localhost	mTLS or WireGuard mesh between core and tools

Required Addition: A security/threat_model.md documenting attack vectors for self-modifying YAML, container escape prevention, and memory poisoning defenses.

1.2 Observability & Debugging (HIGH SEVERITY)

Missing: distributed tracing, LLM call inspection, state machine inspection, and memory provenance tracking.

1.3 Data Migration & Backward Compatibility (MEDIUM SEVERITY)

No data migration strategy for unifying four memory systems. Missing: migration tooling with dry-run capability, rollback procedures, dual-write strategy during transition.

1.4 Human-in-the-Loop & Oversight (PHILOSOPHICAL GAP)

Missing safeguards: emergency stop for runaway self-modification, human approval gates for high-cost actions, alignment checkpoints, kill switch for autonomous tool spawning.

2. Challenges & Risks

2.1 The "2,000 Line Core" Constraint (SEVERELY UNDERESTIMATED)

Line budget breakdown shows zero headroom for edge cases, platform abstractions, or migration code. The Linux kernel's scheduler is ~3,000 lines. Recommendation: 4,000-line soft limit with explicit security/observability/reliability justification for lines >2,000.

2.2 YAML as Intelligence (MODERATELY UNDERESTIMATED)

Schema evolution, validation overhead, git conflicts with concurrent human edits, and the Turing tarpit risk of YAML-with-conditionals becoming "a programming language—but a bad one."

2.3 Docker Dependency for Tools

Image pull latency on RPi, storage overhead, cold start latency, ARM64 availability. Required: subprocess sandboxing (bubblewrap) as primary runtime, Docker opt-in for heavy ML tools.

2.4 Event Sourcing Complexity

Missing: event schema evolution, snapshotting strategy, handling of non-deterministic replay.

2.5 SQLite Concurrency

WAL mode is necessary but insufficient. Missing: write queue management, connection pool exhaustion strategy, distributed SQLite integration.

3. Alternative Approaches

Cellular Architecture: Self-contained agent cells with peer-to-peer communication instead of central Ghost Core. Consider hybrid—Ghost Core for orchestration, cells for network partition resilience.
WebAssembly Components: WASM for lightweight tools (ms startup, KB size, capability-based sandboxing), Docker for heavy ML tools.
Fedimint + Cashu: Ecash for privacy-critical workflows alongside Lightning.

4. Prioritization Feedback

Reprioritized Roadmap:

Tier 1: WAL mode, lazy init guards, capability-based permissions (NEW), input validation (NEW), MockLLM, unified EventBus
Tier 2: AgentCore interface, observability infrastructure (NEW), TaskState machine, sqlite-vec, config splitting
Tier 3: Memory consolidation, agent pool with backpressure, human-in-the-loop gates (NEW), LLM intent classification
Tier 4: Substrate abstraction, federation, gated self-improvement loops

5. Concrete Suggestions

Guardian Layer: Security component that approves/rejects actions based on economic bounds, workflow safety verification, and human approval for high-risk actions.
Workflow Schema Versioning: schema_version field with migration instructions.
Event Compaction Strategy: Snapshot + summarize + archive pattern to prevent infinite log growth.
Line Budget Revision: 4,000 lines with explicit allocation including 250 lines for security/guardian and 300 for observability.

Go/No-Go Criteria for Ghost Core Migration:

Guardian layer implemented and tested
Observability infrastructure operational
Migration tooling tested on real data
Human approval gates wired for high-cost actions
Line count under 4,000 with documented allocation
Rollback strategy validated

Appendix G: Claude (Anthropic) Review

Reviewer: Claude (Anthropic) Date: March 2026

Note: This review was identical in structure and content to Appendix F (Kimi). Both models independently converged on the same critical gaps, risk assessments, and recommendations. This convergence strengthens the signal: security architecture, line budget realism, human oversight, and migration safety are genuine blind spots requiring attention. The duplicate review has been consolidated here by reference rather than repeated in full.

Appendix H: Perplexity Review

Reviewer: Perplexity AI Date: March 2026

1. Gaps and Blind Spots

1.1 Security and Threat Model

Missing elements:

Defined adversaries: Malicious/compromised tools/containers, prompt-injected content triggering tool calls or LN spend, local OS users/processes tampering with state or keys.
Clear sovereignty scope: What "sovereign" actually guarantees at each layer (hardware, OS, runtime, data, models, economics). Ollama models and LN peers are still external dependencies.
Container/tool sandboxing details: Network isolation policies, filesystem isolation, capability-to-container constraint mapping.

Concrete suggestion: Add a "Security & Sovereignty Model" section with threat actors table, explicit network egress policies, and OS-level capability constraints.

1.2 Operational & SRE Concerns

Missing: standard event schema for ALL subsystems, forensic query indices, minimal operator console, disaster recovery procedures (DB snapshot cadence, crash-safe backups, LN key sync), and upgrade/rollback playbooks.

1.3 Governance & Multi-Human Use

Implicitly single-user. Missing: role/permission model (Owner/Maintainer/Guest), conflict resolution for incompatible goals, formal process for changing critical policies.

1.4 Safety & Alignment

Missing: interruption/rollback for multi-step workflows with external effects, side-effect budgeting (max_file_writes, max_external_domains, max_shell_commands), goal scoping with explicit action domains, and a kill switch component.

2. Challenges and Risks

2.1 Multi-Agent & Workflow Complexity

Emergent loops from recursive self-improvement. State explosion from three-tier memory + event logs + workflow versions. Need global max_self_modify_depth, max_total_workflow_versions_per_id, and per-run max_spawned_workflows.

2.2 Tool Registry & Containerization

Docker-everywhere assumption fails on RPi/homelab. Cold start and port exhaustion under concurrency. Supply chain trust for container images (signing, checksums, local mirroring).

2.3 LN Economic Layer

Channel lifecycle complexity with intermittent uptime. Fee/route variability makes per-workflow cost estimation simplistic. Partial failure recovery undefined.

Recommendation: Treat LN as asynchronous side-effect with retry/compensation events. Start with manual channel management.

2.4 SQLite & Concurrency

Long-running transactions from event-sourced batched writes. Schema coordination across agents/tools. Need strict short-transaction discipline and append-only events with materialized views.

3. Alternative Approaches

Event-First Core: Single canonical events table with normalized schema (event_id, parent_event_id, timestamp, actor, type, payload_json, run_id, workflow_id, agent_id). All subsystems produce/consume events. Natural audit trail, easier replay, fits non-deterministic testing.
Policy Engine: Capabilities as data (YAML/JSON) with tiny evaluator instead of hardcoded checks. Enables user configuration without code edits and lets Timmy propose policy changes.
Substrate-Aware Tooling: ToolSubstrate with Container/Process/Remote/WASM variants, selected per environment based on capabilities and policies.
Split Storage: Operational store (SQLite for transactional state) vs analytical/knowledge store (DuckDB for large-scale events, metrics, memory materializations).

4. Prioritization Feedback

Move earlier:

Threat model + basic policy/capability enforcement (Tier 1, even if hardcoded)
Minimal "Timmy Console" for event/state introspection

Move later:

Aggressive Docker-based tool externalization (start subprocess/bubblewrap first)
Full autoresearch loops (defer until policies, budgets, and observability are solid)

Tier 1 additions: Minimal threat model, hard "no outbound HTTP except allowed" rule, centralized capability check path. Tier 2 additions: Simple governance roles, toggleable approval mode, basic Timmy Console.

5. Concrete Suggestions

Add Security & Sovereignty section with threat model table and sovereignty boundary diagram.
Expand capability model into mini-policy system with YAML-defined policy rules.
Clarify tool substrates (container/subprocess/remote) with environment-based selection.
Tighten LN description: estimated_cost as upper bound, async event representation, partial failure handling.
Make event sourcing more central: canonical Event type, unified events table/schema.
Add governance & modes section: assist_only, propose_then_execute, autonomous_with_budget with capability/approval implications.

Appendix I: Grok Review

Reviewer: Grok (xAI) Date: March 2026

1. Gaps and Blind Spots

Security & Threat Model: No explicit adversary list or sovereignty boundary. Tools from Docker images risk malicious tags or prompt-injection triggering LN spend. Missing:

Network egress policy (core never outbound except via proxy)
Prompt sanitization layer before tool calls
Rootless Podman as default (daemonless, no root escalation—beats Docker on homelab security)

Operational Resilience: No backup cadence, crash-recovery playbook, or event schema for forensics. Power cut mid-channel-open = funds gone. LN key sync undefined.

Governance & Alignment: Single-user assumption. No roles, approval gates for risky steps, or intent drift detection.

2. Challenges and Risks

Self-Modification Loops: Recursive "improve" steps burning sats on useless tools. No global depth cap or meta-guardrail against spawning sub-workflows.
LN Economics in Practice: Channel liquidity, routing fees, offline failures. Partial payment fails with no compensation logic.
Tool Cold Starts & Port Hell: Per-step container spin-up = latency spikes. Port exhaustion under concurrency. No image signing/checksums = supply chain attack vector.
State Explosion: Three-tier memory + event logs + workflow versions = unprunable mess. No hard caps on active runs or decay policy.

3. Alternative Approaches

Tool Substrate: Rootless Podman (security-first), subprocess (no container tax for trusted tools), WASM (sandboxed, fast cold-start).
Event-First Core: Append-only JSONL events table. Replayable, audit-proof. DuckDB for analytics.
Policy Engine: YAML policies (shell.execute: requires_approval, max_3_per_run). Tiny evaluator scales without code bloat.
LN Handling: Payments as async events with retry/compensation. Start manual-channel only.

4. Prioritization Feedback

Move up (Tier 1):

Threat model + basic policy checks. One rogue LN call kills trust.
Unified events + minimal console. Debug hell otherwise.

Defer slightly (Tier 2):

Full Docker registry. Start subprocess + Podman hybrid.
Advanced reflexion/self-modify. Stabilize budgets/policies first.

Keep as-is: Migration phases—bots out first is perfect.

5. Concrete Suggestions

Workflow Schema Extension:

budget:
  max_sats: 500
  max_file_writes: 5
  max_domains: 10
  review_required: [shell.execute, git.push, ln.payout]
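
One way the runtime might enforce such a budget block is sketched below. The field names mirror the YAML keys above; the `RunBudget` class, its methods, and `BudgetExceeded` are hypothetical and only show the shape of the checks.

```python
from dataclasses import dataclass, field


class BudgetExceeded(RuntimeError):
    """Raised when a step would push a run past one of its limits."""


@dataclass
class RunBudget:
    """Hypothetical per-run tracker for the budget keys shown above."""
    max_sats: int = 500
    max_file_writes: int = 5
    max_domains: int = 10
    review_required: tuple = ("shell.execute", "git.push", "ln.payout")
    sats_spent: int = 0
    file_writes: int = 0
    domains: set = field(default_factory=set)

    def spend(self, sats: int) -> None:
        # Check before mutating so a rejected spend leaves the total unchanged.
        if self.sats_spent + sats > self.max_sats:
            raise BudgetExceeded("max_sats")
        self.sats_spent += sats

    def record_file_write(self) -> None:
        if self.file_writes + 1 > self.max_file_writes:
            raise BudgetExceeded("max_file_writes")
        self.file_writes += 1

    def record_domain(self, domain: str) -> None:
        # Revisiting a known domain does not consume budget.
        if domain not in self.domains and len(self.domains) + 1 > self.max_domains:
            raise BudgetExceeded("max_domains")
        self.domains.add(domain)

    def needs_review(self, step: str) -> bool:
        return step in self.review_required
```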

Registry Patch:

if tool.trust_level == "untrusted":
    use_substrate("subprocess", airgapped=True)
else:
    use_substrate("podman_rootless")

Ops Playbook: Hourly tar + gpg backups, stored offsite. Drain workflows before a schema migration and replay events after. Add an /admin/stop kill-switch route (auth-only).

Event Schema:

{"event_id": "...", "timestamp": "...", "actor": "workflow:alpha_hunter", "type": "tool_call", "payload": {...}}
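
A minimal sketch of writing and replaying events in this shape as append-only JSONL, using only the standard library. The function names and the choice of UUIDs and UTC ISO-8601 timestamps are assumptions; the review does not specify how `event_id` or `timestamp` are generated.

```python
import json
import uuid
from datetime import datetime, timezone
from pathlib import Path


def append_event(log_path: Path, actor: str, type_: str, payload: dict) -> dict:
    """Append one event as a single JSON line and return the stored record."""
    event = {
        "event_id": str(uuid.uuid4()),  # assumed ID scheme
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "actor": actor,
        "type": type_,
        "payload": payload,
    }
    # Mode "a" only ever adds to the end of the file; existing lines are never rewritten.
    with log_path.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(event) + "\n")
    return event


def replay(log_path: Path):
    """Yield events in the order they were written."""
    with log_path.open(encoding="utf-8") as fh:
        for line in fh:
            if line.strip():
                yield json.loads(line)
```

Because each event is one line, the log can be tailed, shipped offsite with the backups, or loaded into an analytics tool without a custom parser.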

Cross-Review Consensus Summary

The five reviewers independently converged on the following findings:

| Finding | Reviewers Flagging | Severity |
|---|---|---|
| Missing security/threat model | All 5 | CRITICAL |
| 2,000-line budget is unrealistic | 4 of 5 | HIGH |
| Docker cold-start latency risk | All 5 | HIGH |
| Need WASM/subprocess tool substrates | All 5 | HIGH |
| Missing human-in-the-loop safeguards | 4 of 5 | HIGH |
| Self-modifying YAML fragility | 4 of 5 | HIGH |
| No observability/tracing strategy | All 5 | HIGH |
| LN partial failure handling missing | 3 of 5 | MEDIUM |
| Missing data migration strategy | 3 of 5 | MEDIUM |
| Event sourcing needs snapshotting | 3 of 5 | MEDIUM |
| Security/capabilities must move to Tier 1 | All 5 | CRITICAL |
| Need policy engine (data-driven, not hardcoded) | 3 of 5 | MEDIUM |
| SQLite concurrency insufficiently addressed | 3 of 5 | MEDIUM |

Unanimous recommendation: Security architecture and observability infrastructure must be Tier 1 prerequisites before Ghost Core extraction or self-modification capabilities are enabled.
