Sovereign AGI Research: Maintainability & Scalability Analysis
Date: 2026-03-08
Scope: Deep architecture review of Timmy Time Dashboard with actionable recommendations for evolving toward a robust sovereign AGI.
Inputs: Codebase analysis, ROADMAP.md, REVELATION_PLAN.md, Kimi's Ghost Core Knowledge Transfer (March 9, 2026), Grok's research validation (March 9, 2026)
Executive Summary
Timmy Time has a solid foundation: local-first inference, graceful degradation, event-driven architecture, and a clear module boundary system. However, several architectural patterns need to evolve to support the sovereign AGI vision — particularly around memory durability, agent coordination determinism, interface contracts, and dependency injection.
The Ghost Core vision (from Kimi's consolidation) proposes a radical minimization: the current ~13K-line monolith compressed to a ~2,000-line cognitive kernel that orchestrates intelligence through YAML workflows and containerized tools. This report synthesizes both perspectives — the incremental improvements needed now and the architectural target state — into a unified roadmap with 14 specific areas for improvement across 4 priority tiers.
The core tension this report resolves: How to move from "dashboard that contains an agent" to "agent that exposes a dashboard" without breaking what already works.
Table of Contents
- Current Architecture Assessment
- Critical Maintainability Issues
- Scalability Bottlenecks
- Sovereign AGI Architecture Recommendations
- The Ghost Core Vision
- Memory System Evolution
- Agent Coordination & Orchestration
- Testing Strategy for Non-Deterministic Systems
- Plugin & Extension Architecture
- Self-Improvement Loops
- Implementation Priority Matrix
1. Current Architecture Assessment
Strengths
| Area | Implementation | Quality |
|---|---|---|
| Config management | pydantic-settings with env/file cascade | Excellent |
| Graceful degradation | Try/except with fallback at every integration point | Excellent |
| Event system | Async EventBus with wildcard subscriptions | Good |
| LLM routing | CascadeRouter with circuit breakers | Good |
| Memory tiers | Hot (MEMORY.md) → Vault (markdown) → Semantic (SQLite+vectors) | Good foundation |
| Module boundaries | 8 packages with clear responsibilities | Good |
| Multi-backend LLM | Ollama/AirLLM/Grok/Claude with auto-detection | Good |
| Security posture | CSRF, security headers, secret validation, telemetry off | Good |
Architecture Diagram (Current State)
┌──────────────────────────────────────────────────────────────┐
│ Dashboard (FastAPI) │
│ 23 route modules · HTMX+Jinja2 · WebSocket · CSRF │
├───────────────┬──────────────┬───────────────┬───────────────┤
│ timmy/ │ spark/ │ swarm/ │ timmy_serve/ │
│ Agent core │ Intelligence │ Task queue │ API server │
│ Memory system │ EIDOS predict│ Event log │ Voice TTS │
│ Agentic loop │ Advisory │ │ Inter-agent │
├───────────────┴──────────────┴───────────────┴───────────────┤
│ infrastructure/ │
│ CascadeRouter · EventBus · WebSocket · Notifications · Hands │
├──────────────────────────────────────────────────────────────┤
│ integrations/ │
│ Discord · Telegram · Siri · Voice NLU · Paperclip · ChatBridge│
├──────────────────────────────────────────────────────────────┤
│ brain/ │
│ UnifiedMemory (SQLite+vectors) · Embeddings · rqlite client │
├──────────────────────────────────────────────────────────────┤
│ config.py │
│ 90+ settings · pydantic-settings · env/file cascade │
└──────────────────────────────────────────────────────────────┘
2. Critical Maintainability Issues
2.1 Singleton Proliferation (Severity: HIGH)
Problem: The codebase uses module-level singletons extensively:
- config.settings (global mutable state)
- event_bus (infrastructure/events/bus.py)
- memory_system (timmy/memory_system.py)
- spark_engine (spark/engine.py)
- cascade_router (infrastructure/router/cascade.py)
- ws_manager, notifier, etc.
These singletons are instantiated at import time, making testing difficult and
creating hidden dependencies between modules. A test that imports spark.engine
triggers a chain: spark_engine → settings → reads .env → tries to
connect to SQLite.
Recommendation: Migrate to a lightweight dependency injection container. Not a heavy framework — a simple registry pattern:
# infrastructure/container.py
from typing import Any, Callable


class Container:
    """Lightweight DI container: singletons with lazy init."""

    _registry: dict[str, Any] = {}
    _factories: dict[str, Callable[[], Any]] = {}

    @classmethod
    def register(cls, name: str, factory: Callable[[], Any]) -> None:
        cls._factories[name] = factory

    @classmethod
    def get(cls, name: str) -> Any:
        # Build the instance on first access, then reuse it.
        if name not in cls._registry:
            cls._registry[name] = cls._factories[name]()
        return cls._registry[name]

    @classmethod
    def reset(cls) -> None:
        """For tests: clear all instances (registered factories are kept)."""
        cls._registry.clear()
This preserves the simplicity of from x import singleton while enabling
test isolation and lazy initialization.
2.2 Dual Memory Systems (Severity: HIGH)
Problem: There are TWO independent memory systems that don't talk to each other:
- timmy/memory_system.py: MemorySystem with HotMemory (MEMORY.md), VaultMemory (markdown files), HandoffProtocol. Used by the agent creation path (agent.py).
- brain/memory.py: UnifiedMemory with SQLite-backed memories, facts, embeddings, rqlite support. Used by brain-related routes and tools.
Data stored in one system is invisible to the other. The agent's context includes HotMemory (MEMORY.md) but not brain facts. The brain stores semantic memories but doesn't know about vault markdown files.
Recommendation: Unify under brain/memory.py (UnifiedMemory) as the single
source of truth. Make MemorySystem a thin orchestration layer that delegates:
- Hot memory → a "hot" table/partition in UnifiedMemory
- Vault → keep markdown files but index them in UnifiedMemory for search
- Handoff → serialize to UnifiedMemory with a "handoff" tag
2.3 Config Monolith (Severity: MEDIUM)
Problem: config.py has 90+ settings in a single Settings class. This
violates single-responsibility and makes it hard to understand which settings
belong to which module. Adding a new feature means touching the same 280-line
file every time.
Recommendation: Split into namespaced configs using pydantic's nested model support:
from pydantic import BaseModel
from pydantic_settings import BaseSettings


class LLMConfig(BaseModel):
    ollama_url: str = "http://localhost:11434"
    ollama_model: str = "qwen3.5:latest"
    # ... all LLM settings


class MemoryConfig(BaseModel):
    prune_days: int = 90
    vault_max_mb: int = 100
    # ... all memory settings


class Settings(BaseSettings):
    llm: LLMConfig = LLMConfig()
    memory: MemoryConfig = MemoryConfig()
    # ...
This can be done incrementally — start with the largest groups (LLM, memory, security, creative) and keep the flat namespace available via properties for backwards compatibility.
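The backwards-compatible flat namespace can be sketched as a read-only property that forwards to the nested config. This is a minimal illustration that uses plain dataclasses as stand-ins for the pydantic models, so it runs without third-party packages. The alias name `ollama_model` mirrors the existing flat setting; everything else is an assumption.

```python
from dataclasses import dataclass, field


@dataclass
class LLMConfig:
    ollama_url: str = "http://localhost:11434"
    ollama_model: str = "qwen3.5:latest"


@dataclass
class Settings:
    # default_factory gives every Settings instance its own nested config.
    llm: LLMConfig = field(default_factory=LLMConfig)

    @property
    def ollama_model(self) -> str:
        """Flat alias kept for callers that still read settings.ollama_model."""
        return self.llm.ollama_model


settings = Settings()
print(settings.ollama_model)       # old call sites keep working
print(settings.llm.ollama_model)   # new call sites use the namespace
```

Once all call sites read `settings.llm.*`, the alias properties can be removed one group at a time.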
2.4 Import-Time Side Effects (Severity: MEDIUM)
Problem: Several modules execute significant logic at import time:
- config.py lines 342-376: production validation, logging setup, sys.exit(1)
- spark/engine.py line 359: spark_engine = _create_engine()
- swarm/event_log.py: SQLite table creation on _ensure_db()
- dashboard/app.py: middleware setup, router registration
This means importing any module can trigger database connections, file I/O, or even process exits. This makes testing fragile and increases startup time.
Recommendation: Guard all side effects behind explicit initialization:
# Instead of module-level execution:
spark_engine = _create_engine()

# Use lazy pattern:
_spark_engine = None

def get_spark_engine() -> SparkEngine:
    global _spark_engine
    if _spark_engine is None:
        _spark_engine = _create_engine()
    return _spark_engine
brain/memory.py already does this correctly with get_memory().
3. Scalability Bottlenecks
3.1 SQLite Concurrency (Severity: HIGH for AGI vision)
Problem: Multiple components write to different SQLite databases:
- data/brain.db: UnifiedMemory
- data/events.db: Swarm event log
- timmy.db: Agno conversation history
- Spark memory (internal SQLite)
SQLite has a single-writer lock. With multiple async tasks (agentic loop, background thinking, event capture, WebSocket handlers), write contention will increase as the system becomes more autonomous.
Current impact: Low (single-user system), but the AGI vision requires concurrent agent operation.
Recommendations (progressive):
- Immediate: Use WAL mode for all SQLite databases:
    conn.execute("PRAGMA journal_mode=WAL")
    conn.execute("PRAGMA busy_timeout=5000")
- Near-term: Consolidate into fewer databases. Currently 4+ separate .db files that could be 1-2 databases with proper schemas.
- Medium-term: Move to DuckDB for analytical queries (agent performance metrics, Spark intelligence) while keeping SQLite for transactional data.
- Long-term: The rqlite path (brain/client.py) is the right direction for distributed memory across federated nodes.
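The immediate step can be wrapped in one connection helper so every database is opened the same way. The sketch below uses only the standard-library `sqlite3` module. The helper name `open_db` and the choice to raise when WAL cannot be enabled are assumptions, not existing project code.

```python
import sqlite3


def open_db(path: str, busy_timeout_ms: int = 5000) -> sqlite3.Connection:
    """Open a SQLite connection with WAL journaling and a busy timeout."""
    conn = sqlite3.connect(path)
    # WAL lets readers proceed while a single writer appends to the log.
    mode = conn.execute("PRAGMA journal_mode=WAL").fetchone()[0]
    # Wait for a competing writer instead of failing at once with SQLITE_BUSY.
    conn.execute(f"PRAGMA busy_timeout={int(busy_timeout_ms)}")
    if mode.lower() != "wal":
        # In-memory databases, for example, report "memory" here.
        conn.close()
        raise RuntimeError(f"WAL not enabled, journal_mode={mode}")
    return conn
```

WAL mode is persistent per database file, so the pragma is cheap to repeat. The busy timeout is per connection and has to be set every time.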
3.2 In-Memory Event Bus (Severity: MEDIUM)
Problem: The EventBus at infrastructure/events/bus.py is in-memory
only. Events are lost on restart. History is capped at 1000 entries. No
persistence, no replay.
For a sovereign AGI, every event is valuable data. Agent decisions, tool executions, task outcomes — all of this is training signal.
Recommendations:
- Persist events to SQLite (the swarm event_log.py already does this separately; unify these two systems).
- Add event replay for debugging agent behavior:
    # Replay all events from a specific task
    events = event_bus.replay(task_id="abc123")
- Event sourcing pattern: Make events the source of truth for system state. Current state is derived by replaying events. This enables:
  - Time-travel debugging
  - State reconstruction after crashes
  - Deterministic replay for testing
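A minimal sketch of the persist-and-replay idea, assuming an append-only `events` table and a `fold_state` reducer. The class name, schema, and event type strings are hypothetical and are not taken from the existing EventBus or event_log.py.

```python
import json
import sqlite3
import time


class PersistentEventLog:
    """Append-only event store (illustrative schema)."""

    def __init__(self, conn: sqlite3.Connection):
        self._conn = conn
        self._conn.execute(
            """CREATE TABLE IF NOT EXISTS events (
                   seq INTEGER PRIMARY KEY AUTOINCREMENT,
                   ts REAL NOT NULL,
                   task_id TEXT,
                   type TEXT NOT NULL,
                   payload TEXT NOT NULL
               )"""
        )
        self._conn.execute(
            "CREATE INDEX IF NOT EXISTS idx_events_task ON events(task_id)"
        )

    def append(self, event_type: str, payload: dict, task_id=None) -> int:
        cur = self._conn.execute(
            "INSERT INTO events (ts, task_id, type, payload) VALUES (?, ?, ?, ?)",
            (time.time(), task_id, event_type, json.dumps(payload)),
        )
        self._conn.commit()
        return cur.lastrowid

    def replay(self, task_id: str) -> list:
        """Return all events for one task in insertion order."""
        rows = self._conn.execute(
            "SELECT seq, type, payload FROM events WHERE task_id = ? ORDER BY seq",
            (task_id,),
        ).fetchall()
        return [{"seq": s, "type": t, "payload": json.loads(p)} for s, t, p in rows]


def fold_state(events: list) -> dict:
    """Derive the current task state purely from its event history."""
    state = {"status": "unknown", "steps": 0}
    for ev in events:
        if ev["type"] == "task.started":
            state["status"] = "running"
        elif ev["type"] == "step.completed":
            state["steps"] += 1
        elif ev["type"] in ("task.completed", "task.failed"):
            state["status"] = ev["type"].split(".", 1)[1]
    return state
```

Because `fold_state` is a pure function of the event list, the same history always reconstructs the same state, which is what makes crash recovery and deterministic replay possible.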
3.3 Synchronous Agent Execution (Severity: MEDIUM)
Problem: The BaseAgent.run() method calls self.agent.run() synchronously
(Agno's run is blocking). The agentic loop wraps this in asyncio.to_thread()
which works but creates thread contention with multiple concurrent agents.
Recommendation: Create a proper async agent execution pool:
import asyncio
from concurrent.futures import ThreadPoolExecutor


class AgentPool:
    """Manages concurrent agent execution with backpressure."""

    def __init__(self, max_concurrent: int = 4):
        self._semaphore = asyncio.Semaphore(max_concurrent)
        self._executor = ThreadPoolExecutor(max_workers=max_concurrent)

    async def execute(self, agent: Agent, message: str) -> str:
        # Callers beyond max_concurrent wait here instead of piling up threads.
        async with self._semaphore:
            loop = asyncio.get_running_loop()
            return await loop.run_in_executor(
                self._executor,
                lambda: agent.run(message, stream=False),
            )
3.4 Vector Search Performance (Severity: MEDIUM)
Problem: The current vector search in brain/memory.py loads ALL embeddings
into memory and computes cosine similarity in Python:
rows = conn.execute("SELECT ... FROM memories WHERE embedding IS NOT NULL").fetchall()
for row in rows:
    stored_vec = np.frombuffer(row["embedding"], dtype=np.float32)
    score = float(np.dot(query_vec, stored_vec))
This is an O(n) scan with no index, and the dot product equals cosine similarity only if the stored embeddings are unit-normalized. At 100K+ memories, the scan becomes a significant latency bottleneck.
Recommendations (as documented in ROADMAP.md Phase 3):
- sqlite-vec — drop-in SQLite extension, zero new dependencies
- LanceDB — embedded, disk-based, IVF_PQ indexing, handles millions
- Qdrant — only if federated search across nodes is needed
sqlite-vec is the clear first step given the SQLite-native architecture.
4. Sovereign AGI Architecture Recommendations
4.1 The Perceive-Decide-Act-Remember Loop
The REVELATION_PLAN.md defines the right core interface:
class TimAgent(ABC):
    async def perceive(self, input) -> WorldState
    async def decide(self, state) -> Action
    async def act(self, action) -> Result
    async def remember(self, key, value)
    async def recall(self, key) -> Value
Current gap: The existing agent system (Agno-based) doesn't implement this
pattern. The BaseAgent and SubAgent classes just wrap Agent.run() with
event bus integration. There's no structured perception→decision→action cycle.
Recommendation: Introduce this as the AgentCore interface, sitting between the current Agno wrapper and the orchestrator:
# timmy/agent_core/interface.py (file already exists but is minimal)
from abc import ABC, abstractmethod


class AgentCore(ABC):
    """Sovereign agent loop: the fundamental cognitive cycle."""

    @abstractmethod
    async def perceive(self, inputs: list[Perception]) -> WorldState:
        """Gather and fuse sensory inputs."""

    @abstractmethod
    async def decide(self, state: WorldState, memory: MemoryContext) -> Plan:
        """Deliberate and choose actions."""

    @abstractmethod
    async def act(self, plan: Plan) -> list[ActionResult]:
        """Execute planned actions."""

    @abstractmethod
    async def reflect(self, results: list[ActionResult]) -> list[Memory]:
        """Extract learnings from action results."""
The reflect step is the key addition — it closes the loop for self-improvement.
4.2 Capability-Based Security Model
For a sovereign AGI, the permission model must evolve from "config flags" to a capability-based system:
from dataclasses import dataclass


@dataclass
class Capability:
    name: str               # "shell.execute", "git.push", "memory.write"
    scope: str              # "project", "system", "network"
    requires_approval: bool
    max_cost_sats: int      # Economic bound on the capability


@dataclass
class AgentPermissions:
    """Each agent has a set of capabilities, not global flags."""

    capabilities: dict[str, Capability]

    def can(self, action: str) -> bool:
        # True only for capabilities the agent holds that need no approval.
        cap = self.capabilities.get(action)
        return cap is not None and not cap.requires_approval
Currently, permissions are global (self_modify_enabled,
hands_shell_enabled). For a multi-agent sovereign system, each agent needs
its own permission set with escalation paths.
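One way to express an escalation path is to return a three-way decision instead of a boolean. The sketch below is illustrative: the `Decision` enum, the `check` method, and the rule that exceeding `max_cost_sats` escalates rather than denies are all assumptions.

```python
from dataclasses import dataclass, field
from enum import Enum


class Decision(Enum):
    ALLOW = "allow"
    ESCALATE = "escalate"  # hand off to a human or a higher-privilege agent
    DENY = "deny"


@dataclass(frozen=True)
class Capability:
    name: str
    scope: str
    requires_approval: bool
    max_cost_sats: int


@dataclass
class AgentPermissions:
    agent_id: str
    capabilities: dict = field(default_factory=dict)

    def check(self, action: str, cost_sats: int = 0) -> Decision:
        cap = self.capabilities.get(action)
        if cap is None:
            # The agent was never granted this capability.
            return Decision.DENY
        if cap.requires_approval or cost_sats > cap.max_cost_sats:
            # Granted, but outside what the agent may do unattended.
            return Decision.ESCALATE
        return Decision.ALLOW


forge = AgentPermissions(
    agent_id="forge",
    capabilities={
        "git.push": Capability("git.push", "project", True, 0),
        "memory.write": Capability("memory.write", "project", False, 10),
    },
)
print(forge.check("memory.write", cost_sats=5).value)
print(forge.check("git.push").value)
print(forge.check("shell.execute").value)
```

Separating "not granted" from "granted but needs approval" lets the orchestrator route the second case to an approval queue instead of failing the task.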
4.3 Substrate Abstraction Layer
The REVELATION_PLAN envisions multiple substrates (Cloud, Desktop, Robot, Sim). The current architecture is tightly coupled to the "cloud/server" substrate.
Recommendation: Introduce a substrate interface early:
class Substrate(ABC):
    """Where and how the agent executes."""

    @abstractmethod
    def get_llm_provider(self) -> LLMProvider: ...

    @abstractmethod
    def get_memory_backend(self) -> MemoryBackend: ...

    @abstractmethod
    def get_io_channels(self) -> list[IOChannel]: ...

    @abstractmethod
    def get_capabilities(self) -> list[Capability]: ...


class ServerSubstrate(Substrate):
    """Current implementation: Ollama + SQLite + FastAPI."""
    ...


class RobotSubstrate(Substrate):
    """RPi + camera + motors + local Ollama."""
    ...
This doesn't require rewriting existing code — it's an abstraction that wraps current components and makes substrate-switching possible.
5. The Ghost Core Vision
Source: Kimi's Knowledge Transfer Document (March 9, 2026)
5.1 The Proposition
The Ghost Core philosophy proposes a fundamental inversion of the current architecture:
CURRENT: Dashboard (monolith) → contains agents → agents use tools
TARGET: Ghost Core (~2K lines) → orchestrates workflows → tools are containers
The formula:
Ghost Core (ReAct+Reflexion)
+ Workflow Layer (YAML-defined, self-modifying)
+ Tool Registry (dynamic container spin-up/spin-down)
+ Three-Tier Memory (hot/vault/semantic)
+ Lightning Economic Layer
= Sovereign, Wealth-Generating Agent
5.2 Ghost Core Module Budget
The proposal enforces a strict 2,000-line limit for src/timmy/:
| Module | Lines | Purpose |
|---|---|---|
| ghost.py | ~150 | ReAct loop: Observe → Plan → Act → Reflect |
| reflexion.py | ~100 | Critique generation + lesson extraction |
| workflow_engine.py | ~200 | YAML loader, step executor, state machine |
| tool_registry.py | ~200 | Dynamic tool discovery, spawn, health check |
| memory_system.py | ~300 | Hot/Vault/Semantic memory interface (existing) |
| backends.py | ~200 | Ollama/AirLLM/Claude/Grok adapters |
| config.py | ~150 | Pydantic-settings (existing) |
| lightning_wallet.py | ~200 | L402 handling, invoice generation, balance |
| utils/ | ~300 | Shared helpers, logging, serialization |
| Total | ~1,800 | Headroom of ~200 lines for critical fixes |
5.3 What Gets Externalized
Everything heavy moves to Docker containers with standardized HTTP APIs:
| Current In-Process | Target Container | Interface |
|---|---|---|
| Discord/Telegram bots | timmy-bridges | POST /ingest |
| Voice (pyttsx3) | timmy-voice | POST /speak, /transcribe |
| Creative tools | timmy-creative | POST /generate |
| Web scraping | timmy-tools/scraper | POST /execute |
| Experiment runners | timmy-tools/lab | POST /execute |
5.4 The ReAct+Reflexion Loop
from typing import Optional


class GhostCore:
    async def run(self, goal: str, workflow_id: Optional[str] = None) -> Result:
        workflow = (self.workflow_engine.load(workflow_id)
                    if workflow_id
                    else self.planner.create_workflow(goal))
        context = {
            "goal": goal,
            "memory": self.memory.retrieve_relevant(goal),
            "lessons": self.reflexion.get_lessons(goal),
            "wallet": self.lightning.get_balance(),
            "results": [],  # filled in as steps succeed
        }
        for step in workflow.steps:
            state = await self.observe(step, context)
            tool = await self.registry.get(step.tool)
            result = await tool.execute(step.params, context)
            critique = await self.reflexion.critique(step, result, context)
            if not critique.success:
                # Queue a corrective step instead of recording the failed result.
                workflow.insert_step(critique.fix_step)
                continue
            context["results"].append(result)
            if critique.is_novel:
                await self.reflexion.store(critique.lesson)
        summary = await self.reflexion.summarize(context)
        return Result(context["results"], summary)
5.5 YAML Workflow Layer
Intelligence lives in workflows, not code. Workflows are YAML files that the Ghost Core executes. Timmy can modify his own workflows.
workflow_id: alpha_hunter_v2
name: Alpha Hunter - Crypto Scanner
goal: Find 3 undervalued assets with 5x potential
budget:
  max_sats_per_run: 1000
  estimated_time: 300s
tools_required:
  - name: web_scraper
    image: timmy-tools/scraper:latest
  - name: sentiment_analyzer
    image: timmy-tools/sentiment:latest
steps:
  - id: fetch_market_data
    tool: web_scraper
    action: scrape
    params: { source: coingecko, assets: top_100 }
    output: market_data
  - id: analyze_sentiment
    tool: sentiment_analyzer
    action: analyze
    input: ${market_data.asset_names}
    output: sentiment_scores
  - id: validate_candidates
    tool: web_scraper
    action: deep_dive
    foreach: ${candidates}
    output: reports
self_modify:
  enabled: true
  max_iterations: 5
  persist_lessons: true
5.6 Dynamic Tool Registry
Tools are Docker images spawned on demand, health-checked, and shut down after idle timeout:
class ToolRegistry:
    async def get(self, tool_name: str) -> ToolEndpoint:
        # Reuse a running container when one exists.
        if tool_name in self.active_containers:
            return self.active_containers[tool_name]

        image = f"timmy-tools/{tool_name}:latest"
        if not self.docker.has_image(image):
            await self.docker.pull(image)

        port = self._find_free_port()
        container = await self.docker.run(image, ports={f"{port}/tcp": port})
        endpoint = ToolEndpoint(url=f"http://localhost:{port}", ...)

        if not await self._health_check(endpoint):
            await self.destroy(tool_name)
            raise ToolUnavailable(tool_name)

        self.active_containers[tool_name] = endpoint
        asyncio.create_task(self._idle_watcher(tool_name))
        return endpoint
All tool containers expose a standardized contract:
- GET /health: capability advertisement
- POST /execute: stateless command execution
- POST /shutdown: graceful teardown
5.7 Economic Layer
Every workflow has a Lightning budget. Tools cost sats to run:
from uuid import uuid4


class LightningWallet:
    async def execute_with_budget(self, workflow: Workflow) -> Result:
        estimated_cost = self.estimate_cost(workflow.tools_required)
        if self.balance < estimated_cost:
            await self.request_funding(estimated_cost)

        # Reserve the estimate up front; run_id is assumed to be a fresh
        # random identifier per run.
        run_id = uuid4().hex
        self.escrow[run_id] = estimated_cost
        self.balance -= estimated_cost
        try:
            result = await self.execute_workflow(workflow)
            actual_cost = self.calculate_actual_cost(result)
            # Refund the part of the estimate that was not spent.
            self.balance += estimated_cost - actual_cost
            return result
        except Exception:
            # Full refund on failure.
            self.balance += estimated_cost
            raise
        finally:
            self.escrow.pop(run_id, None)
5.8 Assessment: Ghost Core vs Current Architecture
Where Ghost Core aligns with this research:
- ReAct+Reflexion loop = Section 4.1's perceive-decide-act-reflect
- YAML workflows = formalized version of the agentic loop
- Dynamic tool registry = Section 9's plugin architecture
- Economic layer = REVELATION_PLAN's Lightning treasury
- Self-modifying workflows = Section 10's self-improvement loops
Open questions requiring decisions (see Interview section):
- Migration strategy: Big-bang rewrite to ~2K lines vs incremental extraction? The 2K line limit is aspirational but risks breaking working code.
- Docker dependency: The tool registry assumes Docker everywhere. What about bare-metal RPi deployments? Need a substrate-aware fallback.
- YAML as intelligence: Workflows in YAML are powerful for structured tasks but may be limiting for emergent agent behavior. Need both structured (YAML) and unstructured (LLM-driven) execution paths.
- Dashboard fate: The Ghost Core vision externalizes the dashboard, but the dashboard IS the primary user interface today. Needs careful migration planning.
5.9 Proposed Migration Phases
| Phase | Weeks | Scope | Risk |
|---|---|---|---|
| 1. Extract Bots | 1-2 | Discord/Telegram to timmy-bridges container | Low |
| 2. Workflow Engine | 3-4 | workflow_engine.py + convert hardcoded logic to YAML | Medium |
| 3. Tool Registry | 5-6 | Dockerize first tool, implement spawn/destroy | Medium |
| 4. Economic Layer | 7-8 | Lightning wallet integration, budget constraints | Medium |
| 5. Reflexion | 9-10 | Self-critique, YAML auto-modification, LESSONS.md | Low |
5.10 Anti-Patterns to Enforce
From the Ghost Core philosophy, enforceable via CI:
- Line count check: CI warns if src/timmy/ exceeds target threshold
- No in-process heavy tools: PyTorch, transformers in containers only
- All capabilities as workflows or registered tools
- No cloud dependencies in core (user-provided API keys are opt-in)
- Every YAML workflow must have a contract test
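The line count check could be a small script that CI runs against src/timmy/. This is a sketch under stated assumptions: it counts non-blank lines in .py files only, and the function names and the default limit of 2,000 are illustrative choices.

```python
import sys
from pathlib import Path


def count_lines(root: Path) -> int:
    """Count non-blank lines across all .py files under root."""
    total = 0
    for path in root.rglob("*.py"):
        with path.open(encoding="utf-8") as fh:
            total += sum(1 for line in fh if line.strip())
    return total


def check_budget(root: Path, limit: int = 2000) -> int:
    """Return a process exit code: 0 within budget, 1 over budget."""
    lines = count_lines(root)
    if lines > limit:
        print(f"WARNING: {root} has {lines} lines (limit {limit})", file=sys.stderr)
        return 1
    print(f"{root}: {lines}/{limit} lines")
    return 0
```

A CI step would call `sys.exit(check_budget(Path("src/timmy")))`. Since the rule is phrased as a warning, the job can be marked as allowed to fail until the budget becomes a hard limit.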
6. Memory System Evolution
6.1 Current State Analysis
The system has four separate memory stores:
| Store | Location | Purpose | Technology |
|---|---|---|---|
| Hot Memory | MEMORY.md | Always-loaded context | Flat file |
| Vault | memory/ dir | Structured notes | Markdown files |
| UnifiedMemory | data/brain.db | Semantic search, facts | SQLite + embeddings |
| Agno DB | timmy.db | Conversation history | SQLite (Agno-managed) |
Plus Spark has its own event/memory SQLite tables.
6.2 Target Architecture: Unified Memory with Tiers
┌──────────────────────────────────────────────────────────┐
│ MemoryFacade │
│ Single API for all memory operations │
├──────────────────────────────────────────────────────────┤
│ Working Memory │ Episodic Memory │ Semantic Memory │
│ (current context, │ (conversations, │ (facts, skills, │
│ active plans, │ events, actions) │ patterns) │
│ tool state) │ │ │
├──────────────────────────────────────────────────────────┤
│ Storage Backend (pluggable) │
│ SQLite (dev) → DuckDB (analytics) → rqlite (federation)│
└──────────────────────────────────────────────────────────┘
Working Memory replaces MEMORY.md and hot memory:
- In-memory cache with periodic persistence
- Current conversation context, active plans, tool state
- Size-limited with LRU eviction
- Survives restarts via checkpoint
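The working-memory properties listed above (size limit, LRU eviction, checkpointing) fit in a small class. The sketch below is one possible shape; the class name, the JSON checkpoint format, and the default size are assumptions, and values must be JSON-serializable.

```python
import json
from collections import OrderedDict
from pathlib import Path


class WorkingMemory:
    """Size-limited key/value cache with LRU eviction and JSON checkpoints."""

    def __init__(self, max_items: int = 128):
        self.max_items = max_items
        self._items = OrderedDict()  # oldest entry first

    def put(self, key: str, value) -> None:
        self._items[key] = value
        self._items.move_to_end(key)
        while len(self._items) > self.max_items:
            self._items.popitem(last=False)  # evict the least recently used

    def get(self, key: str, default=None):
        if key not in self._items:
            return default
        self._items.move_to_end(key)  # a read counts as a use
        return self._items[key]

    def checkpoint(self, path: Path) -> None:
        # Write to a temp file, then rename, so a crash never leaves
        # a half-written checkpoint behind.
        tmp = path.with_suffix(path.suffix + ".tmp")
        tmp.write_text(json.dumps(list(self._items.items())), encoding="utf-8")
        tmp.replace(path)

    @classmethod
    def restore(cls, path: Path, max_items: int = 128) -> "WorkingMemory":
        wm = cls(max_items)
        if path.exists():
            # Items were saved oldest first, so re-inserting keeps LRU order.
            for key, value in json.loads(path.read_text(encoding="utf-8")):
                wm.put(key, value)
        return wm
```

Calling `checkpoint` on a timer gives the "periodic persistence" behavior, and `restore` at startup gives the restart survival.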
Episodic Memory replaces vault notes and conversation history:
- Timestamped records of events, conversations, decisions
- Compressed summaries for older episodes
- Searchable by time range, tags, participants
- The agentic loop steps should automatically become episodes
Semantic Memory replaces brain facts and vector store:
- Long-term knowledge: user preferences, learned patterns, world knowledge
- Vector-indexed for similarity search
- Confidence-scored with decay over time
- Facts extracted from episodic memory via reflection
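Confidence decay can be modeled as exponential decay with a half-life, with a bounded boost on access. The formula choice, the 30-day half-life, and the function names are assumptions for illustration; the source does not specify a decay model.

```python
def decayed_confidence(
    confidence: float, days_since_access: float, half_life_days: float = 30.0
) -> float:
    """Halve the confidence for every half_life_days without access."""
    return confidence * 0.5 ** (days_since_access / half_life_days)


def reinforce(confidence: float, boost: float = 0.1) -> float:
    """Move confidence a fraction of the way toward 1.0 on each access."""
    return confidence + (1.0 - confidence) * boost


# A fact stored at 0.8 and untouched for one half-life drops to 0.4.
print(decayed_confidence(0.8, days_since_access=30))
```

Computing the decayed value at read time from a stored `last_accessed` timestamp avoids a background job that rewrites every row.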
6.3 Memory Consolidation Pipeline
For a sovereign AGI, raw memories must be consolidated into durable knowledge:
Raw Events → Working Memory → Episodic Summary → Semantic Extraction
↓
Reflection Agent
↓
Patterns / Skills / Facts
The Spark engine already has a primitive version of this
(_maybe_consolidate). Generalize it:
class MemoryConsolidator:
    """Periodic background task that strengthens memories."""

    async def consolidate(self):
        # 1. Summarize recent episodes into condensed form
        recent = await self.memory.get_episodes(hours=24)
        summary = await self.llm.summarize(recent)
        await self.memory.store_episode_summary(summary)

        # 2. Extract facts from summaries
        facts = await self.llm.extract_facts(summary)
        for fact in facts:
            await self.memory.store_fact(fact)

        # 3. Decay old, unaccessed memories
        await self.memory.decay_unused(older_than_days=30)

        # 4. Strengthen frequently accessed memories
        await self.memory.reinforce_popular(min_access_count=5)
7. Agent Coordination & Orchestration
7.1 Current Orchestrator Analysis
TimmyOrchestrator in timmy/agents/timmy.py routes requests using keyword
matching:
direct_patterns = ["your name", "who are you", "hello", ...]
memory_patterns = ["we talked about", "remember", ...]
This is fragile — it doesn't handle ambiguous requests well and doesn't learn from routing outcomes.
7.2 Deterministic State Machine
Replace keyword-based routing with an explicit state machine:
from dataclasses import dataclass
from enum import Enum, auto


class TaskState(Enum):
    RECEIVED = auto()
    CLASSIFIED = auto()
    ROUTED = auto()
    EXECUTING = auto()
    REVIEWING = auto()
    COMPLETED = auto()
    FAILED = auto()


@dataclass
class TaskContext:
    state: TaskState
    request: str
    classification: dict  # intent, complexity, required_capabilities
    assigned_agent: str
    execution_history: list[StepResult]

    def transition(self, new_state: TaskState):
        # Validate transition is legal
        # Persist state for recovery
        # Emit event
        ...
Key benefits:
- Every task has a clear audit trail
- Failed tasks can be replayed from last checkpoint
- State is serializable — survives restarts
- Transitions emit events → Spark can learn from routing patterns
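The transition validation and serialization can be made concrete with an explicit table of legal moves. The table below, the `IllegalTransition` exception, and the JSON round-trip are assumptions; in particular, which states may follow REVIEWING or FAILED is a design decision the source leaves open. Event emission is omitted to keep the sketch short.

```python
import json
from dataclasses import dataclass, field
from enum import Enum, auto


class TaskState(Enum):
    RECEIVED = auto()
    CLASSIFIED = auto()
    ROUTED = auto()
    EXECUTING = auto()
    REVIEWING = auto()
    COMPLETED = auto()
    FAILED = auto()


# Assumed transition table: each state maps to the states it may move to.
LEGAL_TRANSITIONS = {
    TaskState.RECEIVED: {TaskState.CLASSIFIED, TaskState.FAILED},
    TaskState.CLASSIFIED: {TaskState.ROUTED, TaskState.FAILED},
    TaskState.ROUTED: {TaskState.EXECUTING, TaskState.FAILED},
    TaskState.EXECUTING: {TaskState.REVIEWING, TaskState.FAILED},
    TaskState.REVIEWING: {TaskState.COMPLETED, TaskState.EXECUTING, TaskState.FAILED},
    TaskState.COMPLETED: set(),
    TaskState.FAILED: set(),
}


class IllegalTransition(Exception):
    """Raised when a task attempts a move the table does not allow."""


@dataclass
class TaskContext:
    request: str
    state: TaskState = TaskState.RECEIVED
    history: list = field(default_factory=list)  # audit trail of state names

    def transition(self, new_state: TaskState) -> None:
        if new_state not in LEGAL_TRANSITIONS[self.state]:
            raise IllegalTransition(f"{self.state.name} -> {new_state.name}")
        self.history.append(new_state.name)
        self.state = new_state

    def to_json(self) -> str:
        return json.dumps(
            {"request": self.request, "state": self.state.name, "history": self.history}
        )

    @classmethod
    def from_json(cls, raw: str) -> "TaskContext":
        data = json.loads(raw)
        return cls(data["request"], TaskState[data["state"]], data["history"])
```

Persisting `to_json()` after every transition gives the checkpoint from which a failed or interrupted task can be resumed.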
7.3 Intelligent Routing (Replace Keyword Matching)
Instead of keyword patterns, use the LLM to classify intent:
class IntentClassifier:
    """Classify user intent using the local LLM."""

    CLASSIFICATION_PROMPT = """Classify this request into ONE category:
- DIRECT: Simple question, greeting, status check
- RESEARCH: Needs external information gathering
- CODE: Programming, file operations, tool building
- MEMORY: Recall past conversations or decisions
- CREATIVE: Writing, content generation
- COMPLEX: Multi-step task requiring orchestration
Request: {request}
Category:"""

    async def classify(self, request: str) -> str:
        result = await self.llm.complete(
            self.CLASSIFICATION_PROMPT.format(request=request)
        )
        return result.strip().upper()
Cache classifications to avoid repeated LLM calls for similar requests.
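A minimal caching wrapper might look like the following. It only deduplicates requests that are identical after lowercasing and whitespace normalization; matching semantically similar requests would need embeddings and is out of scope here. The wrapper name, the fallback to COMPLEX for unrecognized model output, and the eviction policy are assumptions.

```python
import asyncio

CATEGORIES = {"DIRECT", "RESEARCH", "CODE", "MEMORY", "CREATIVE", "COMPLEX"}


class CachedClassifier:
    """Wraps an async classify function with a bounded in-memory cache."""

    def __init__(self, classify_fn, max_entries: int = 1024):
        self._classify_fn = classify_fn  # async callable: str -> str
        self._cache = {}
        self._max_entries = max_entries

    @staticmethod
    def _key(request: str) -> str:
        # Case-insensitive, whitespace-insensitive cache key.
        return " ".join(request.lower().split())

    async def classify(self, request: str) -> str:
        key = self._key(request)
        if key in self._cache:
            return self._cache[key]
        raw = (await self._classify_fn(request)).strip().upper()
        # Guard against free-form model output outside the known set.
        category = raw if raw in CATEGORIES else "COMPLEX"
        if len(self._cache) >= self._max_entries:
            # Dicts keep insertion order, so this drops the oldest entry.
            self._cache.pop(next(iter(self._cache)))
        self._cache[key] = category
        return category


async def _demo() -> None:
    async def fake_llm_classify(request: str) -> str:
        return " code\n"  # stand-in for a real LLM call

    clf = CachedClassifier(fake_llm_classify)
    print(await clf.classify("Write a function"))
    print(await clf.classify("  write   a FUNCTION "))  # served from cache


asyncio.run(_demo())
```

Validating the output against the category set also protects the router from a model that answers with a sentence instead of a label.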
7.4 Inter-Agent Communication Protocol
Currently agents communicate through the event bus with unstructured dicts. Define a protocol:
@dataclass
class AgentMessage:
    """Structured inter-agent communication."""

    from_agent: str
    to_agent: str
    message_type: str  # "request", "response", "delegate", "escalate"
    content: str
    context: dict      # Shared context for the conversation
    thread_id: str     # Groups related messages
    priority: int = 0
    requires_response: bool = True
8. Testing Strategy for Non-Deterministic Systems
8.1 Current Testing Gaps
The test suite has good coverage (73%+) but tests primarily cover deterministic code paths. The agent behavior, LLM responses, and orchestration decisions are largely untested because they're non-deterministic.
8.2 Testing Layers for AGI Systems
┌─────────────────────────────────────────────┐
│ Property-Based Tests │
│ "Agent always stores user facts" │
│ "Memory never loses data on restart" │
├─────────────────────────────────────────────┤
│ Behavioral Tests (with mock LLM) │
│ "Given X input, agent routes to Y agent" │
│ "Agentic loop produces 3+ steps" │
├─────────────────────────────────────────────┤
│ Contract Tests │
│ "Agent always returns valid JSON" │
│ "Memory API satisfies interface contract" │
├─────────────────────────────────────────────┤
│ Unit Tests (current) │
│ "Config loads correctly" │
│ "Router circuits break at threshold" │
└─────────────────────────────────────────────┘
8.3 Mock LLM for Deterministic Agent Tests
import pytest


class MockLLM:
    """Deterministic LLM for testing agent behavior."""

    def __init__(self, responses: dict[str, str]):
        self.responses = responses
        self.calls = []

    def complete(self, prompt: str) -> str:
        self.calls.append(prompt)
        # Return the first canned response whose key appears in the prompt.
        for pattern, response in self.responses.items():
            if pattern in prompt.lower():
                return response
        return "I don't know."


# Usage in tests (the test must be async because orchestrate() is awaited):
@pytest.mark.asyncio
async def test_orchestrator_routes_code_requests():
    llm = MockLLM({"classify": "CODE"})
    orch = TimmyOrchestrator(llm=llm)
    result = await orch.orchestrate("Write a Python function")
    assert result.routed_to == "forge"
8.4 Golden Path Tests
Record real agent interactions and replay them as regression tests:
@pytest.mark.golden
def test_agentic_loop_solves_simple_task():
    """Golden test: agentic loop produces valid plan and executes it."""
    result = run_agentic_loop("List the files in the current directory")
    assert result.status in ("completed", "partial")
    assert len(result.steps) >= 1
    assert result.summary  # Non-empty summary
    assert result.total_duration_ms > 0
8.5 Chaos Testing for Resilience
@pytest.mark.chaos
def test_agent_handles_ollama_crash():
    """Agent degrades gracefully when Ollama dies mid-request."""
    with SimulatedOllamaCrash(after_n_requests=2):
        result = run_agentic_loop("Complex multi-step task")
    assert result.status in ("partial", "failed")
    assert "error" not in result.summary.lower() or "recovered" in result.summary.lower()
9. Plugin & Extension Architecture
9.1 Current State
The MCP (Model Context Protocol) tool registry is partially implemented:
- infrastructure/hands/tools.py registers shell and git tools
- BaseAgent._create_agent() looks up tools from registry
- But the registry import is wrapped in try/except, so it is optional
9.2 Recommendation: Tool Capability System
Formalize tools as capabilities with a discovery protocol:
from dataclasses import dataclass
from typing import Callable


@dataclass
class ToolCapability:
    name: str
    description: str
    input_schema: dict
    output_schema: dict
    cost_estimate: int      # sats per invocation (for economic routing)
    requires_approval: bool
    safety_level: str       # "safe", "review", "dangerous"


class ToolRegistry:
    """Central registry for all agent capabilities."""

    def register(self, tool: ToolCapability, handler: Callable):
        ...

    def discover(self, query: str) -> list[ToolCapability]:
        """Natural language tool discovery."""
        ...

    def get_tools_for_agent(self, agent_id: str) -> list[ToolCapability]:
        """Permission-filtered tools for a specific agent."""
        ...
9.3 Plugin Loading Pattern
# plugins/my_plugin/__init__.py
def register(registry: ToolRegistry):
    registry.register(
        ToolCapability(name="my_tool", ...),
        handler=my_handler,
    )


# App startup
import importlib
from pathlib import Path

for plugin_dir in Path("plugins").iterdir():
    if plugin_dir.is_dir() and (plugin_dir / "__init__.py").exists():
        module = importlib.import_module(f"plugins.{plugin_dir.name}")
        module.register(tool_registry)
This enables third-party extensions without modifying core code.
10. Self-Improvement Loops
10.1 Current Self-Modification
The self_modify_enabled config flag exists but the implementation is basic.
For sovereign AGI, self-improvement must be systematic.
10.2 Reflection-Based Self-Improvement
┌─────────────────────────────────────────────────┐
│ Improvement Loop │
│ │
│ 1. OBSERVE: Collect performance metrics │
│ - Task success rates per agent │
│ - Latency distributions │
│ - Memory retrieval relevance scores │
│ - User satisfaction signals │
│ │
│ 2. REFLECT: Identify patterns │
│ - Which tasks fail most? │
│ - Which tool combinations work best? │
│ - Which prompts produce best results? │
│ │
│ 3. HYPOTHESIZE: Generate improvements │
│ - Better system prompts │
│ - Better tool selection strategies │
│ - Better memory retrieval queries │
│ │
│ 4. EXPERIMENT: Test improvements │
│ - A/B test prompt variations │
│ - Measure before/after metrics │
│ - Roll back if worse │
│ │
│ 5. INTEGRATE: Apply validated improvements │
│ - Update system prompts │
│ - Adjust routing weights │
│ - Store new patterns in semantic memory │
└─────────────────────────────────────────────────┘
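Step 4 of the loop (EXPERIMENT) can be sketched as a small A/B comparison with an explicit adoption threshold. This is an illustrative sketch only: `run_ab_experiment`, `ExperimentResult`, and the `min_gain` threshold are hypothetical names, and the `score` callable stands in for whatever metric the OBSERVE step collects.

```python
import statistics
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class ExperimentResult:
    baseline_mean: float
    candidate_mean: float
    adopted: bool


def run_ab_experiment(
    baseline: str,
    candidate: str,
    score: Callable[[str, str], float],
    tasks: Sequence[str],
    min_gain: float = 0.05,
) -> ExperimentResult:
    """Score both variants on the same tasks; adopt the candidate only if it
    beats the baseline by at least min_gain. Otherwise the baseline stays,
    which is the "roll back if worse" branch of the loop."""
    base = statistics.mean([score(baseline, task) for task in tasks])
    cand = statistics.mean([score(candidate, task) for task in tasks])
    return ExperimentResult(base, cand, adopted=(cand - base) >= min_gain)
```

Requiring a minimum gain rather than any improvement is one way to avoid churn from noisy metrics; the right threshold would need to be tuned against real data.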
10.3 Reward Model Integration
The config already has reward_model_enabled and reward_model_votes. Wire
this into the reflection loop:
import statistics

class RewardSignal:
    """Evaluate agent output quality."""

    async def score(self, task: str, output: str) -> float:
        if settings.reward_model_enabled:
            # Use PRM-style scoring via Ollama
            scores = []
            for _ in range(settings.reward_model_votes):
                score = await self.reward_model.evaluate(task, output)
                scores.append(score)
            return statistics.median(scores)
        # Fallback: heuristic scoring
        return self._heuristic_score(task, output)
10.4 Autoresearch Integration
The autoresearch_* config fields show intent for Karpathy-style experiment
loops. Connect this to the Lab agent:
class AutoresearchLoop:
    """Autonomous ML experiment iteration."""

    async def run_iteration(self, experiment: Experiment):
        # 1. Run experiment with time budget
        result = await self.run_with_timeout(
            experiment, timeout=settings.autoresearch_time_budget
        )
        # 2. Evaluate metric
        metric_value = result.metrics[settings.autoresearch_metric]
        # 3. Log to Spark for pattern detection
        spark_engine.on_creative_step(
            project_id=experiment.id,
            step_name="experiment_run",
            agent_id="lab",
        )
        # 4. Generate next hypothesis
        next_experiment = await self.lab_agent.hypothesize(
            experiment, result, metric_value
        )
        return next_experiment
11. Implementation Priority Matrix
Tier 1: Do Now (Weeks 1-4)
These changes directly improve maintainability with minimal risk:
| # | Change | Files Affected | Effort |
|---|---|---|---|
| 1 | Enable WAL mode for all SQLite databases | brain/memory.py, swarm/event_log.py | 1 hour |
| 2 | Unify EventBus + swarm event_log | infrastructure/events/, swarm/event_log.py | 1 day |
| 3 | Add lazy init guards to singletons | spark/engine.py, config.py | 1 day |
| 4 | Add MockLLM for deterministic agent tests | tests/conftest.py | 1 day |
| 5 | Consolidate MEMORY.md hot memory into brain DB | timmy/memory_system.py, brain/memory.py | 2 days |
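Item 1 is small enough to sketch in full. The helper below is an illustration using only the standard `sqlite3` module; `connect_wal` and the 5-second busy timeout are assumptions, not existing project code.

```python
import sqlite3
import warnings


def connect_wal(db_path: str, busy_timeout_ms: int = 5000) -> sqlite3.Connection:
    """Open a SQLite database and switch it to write-ahead logging.

    The journal_mode pragma returns the mode actually in effect, so the
    result is checked: in-memory databases, for example, cannot use WAL.
    """
    conn = sqlite3.connect(db_path)
    mode = conn.execute("PRAGMA journal_mode=WAL").fetchone()[0]
    if mode.lower() != "wal":
        warnings.warn(f"WAL not enabled for {db_path!r}; journal_mode is {mode!r}")
    # Wait for a competing writer instead of failing immediately with SQLITE_BUSY.
    conn.execute(f"PRAGMA busy_timeout={int(busy_timeout_ms)}")
    return conn
```

WAL mode is persistent for a database file, so it only needs to be set once per file, but calling the helper on every connection is harmless and keeps the busy timeout consistent.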
Tier 2: Near-Term (Months 1-2)
Architectural improvements that set up the AGI foundation:
| # | Change | Impact | Effort |
|---|---|---|---|
| 6 | Introduce AgentCore perceive-decide-act-reflect interface | Enables substrate portability | 3 days |
| 7 | Replace keyword routing with LLM-based intent classification | Better orchestration quality | 2 days |
| 8 | Add explicit TaskState machine with persistence | Debuggability, recovery | 3 days |
| 9 | Split config.py into namespaced sections | Maintainability | 2 days |
| 10 | Integrate sqlite-vec for vector search | Memory performance at scale | 2 days |
Tier 3: Medium-Term (Months 2-4)
Scale and autonomy features:
| # | Change | Impact | Effort |
|---|---|---|---|
| 11 | Memory consolidation pipeline | Durable knowledge extraction | 1 week |
| 12 | Capability-based permission model | Multi-agent security | 1 week |
| 13 | Agent execution pool with backpressure | Concurrent agent scaling | 3 days |
| 14 | Reflection-based self-improvement loop | Autonomous improvement | 2 weeks |
Tier 4: Long-Term (Months 4+)
As documented in ROADMAP.md and REVELATION_PLAN.md:
- Substrate abstraction layer
- Federation via Nostr/rqlite
- Plugin marketplace
- ZK-ML verification
Appendix A: Key Architectural Principles for Sovereign AGI
- Local-first, always. Every capability must work without internet. Cloud is augmentation, never requirement.
- Memory is identity. The memory system IS the agent. Lose the memories, lose the agent. Durability is non-negotiable.
- Events are truth. Every action, decision, and outcome is an event. Current state is derived from events. This enables replay, debugging, and learning.
- Capabilities, not permissions. Agents have capabilities they can use, not global flags that enable features. Capabilities can be granted, revoked, and audited.
- Reflection closes the loop. Every action should produce a reflection opportunity. Did it work? What can be learned? How can the system improve?
- Substrate independence. The cognitive architecture (perceive, decide, act, remember) doesn't change. Only the substrate (server, desktop, robot) changes.
- Economic sovereignty. The agent should be able to earn, save, spend, and invest. Bitcoin/Lightning is the economic substrate.
- Graceful degradation, always. If a component fails, the system continues with reduced capability. Never crash. Never lose data.
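The "Events are truth" principle above can be made concrete with a tiny fold over an event log. This is a sketch under assumptions: the `Event` shape and the event type names (`task_created`, `task_started`, and so on) are invented for illustration and are not taken from the codebase.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Event:
    type: str
    task_id: str


# Hypothetical event types mapped to the task status they imply.
TRANSITIONS = {
    "task_created": "pending",
    "task_started": "running",
    "task_completed": "done",
    "task_failed": "failed",
}


def replay(events: Iterable[Event]) -> dict[str, str]:
    """Derive current task status purely from the ordered event log."""
    state: dict[str, str] = {}
    for event in events:
        if event.type in TRANSITIONS:
            state[event.task_id] = TRANSITIONS[event.type]
    return state
```

Replaying a prefix of the log reconstructs the state at that point in time, which is what makes debugging and learning from history possible.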
Appendix B: File-Level Impact Map
Key files that need changes for each tier:
Tier 1 (Immediate):
src/brain/memory.py — WAL mode, hot memory integration
src/infrastructure/events/bus.py — Persistence, replay
src/swarm/event_log.py — Merge into EventBus
src/spark/engine.py — Lazy singleton
tests/conftest.py — MockLLM fixture
Tier 2 (Foundation):
src/timmy/agent_core/interface.py — AgentCore ABC
src/timmy/agents/timmy.py — LLM-based routing
src/timmy/agents/base.py — TaskState machine
src/config.py — Namespace split
src/brain/memory.py — sqlite-vec integration
Tier 3 (Scale):
src/timmy/memory_system.py — Consolidation pipeline
src/infrastructure/hands/tools.py — Capability model
src/timmy/agentic_loop.py — Execution pool
NEW: src/timmy/reflection.py — Self-improvement loop
Tier 4 (Vision):
NEW: src/timmy/substrate.py — Substrate abstraction
src/brain/client.py — Federation
NEW: src/plugins/ — Plugin system
Appendix C: Grok's Research Validation (March 9, 2026)
Grok performed independent research validation against 2026 literature and confirmed alignment on all major architectural choices:
| Component | Research Backing | Confidence |
|---|---|---|
| ReAct+Reflexion loop | Original 2023 Reflexion paper (most-cited in 2026). Agents that self-critique after every step outperform GPT-4 by 11-22% on real tasks. Minimal ReAct cores routinely built in <300 lines. | High |
| YAML workflows as intelligence | Proven anti-bloat strategy in production agents. Self-modifying YAML with git versioning keeps agents lean while evolving. Intelligence in patchable files, not redeployable code. | High |
| Dynamic Docker tool registry | 2026 cutting-edge patterns (Docker MCP Gateway, Agent Sandbox) use on-demand container spin-up. Keeps core tiny and secure. | High |
| Lightning L402 economic layer | Lightning Labs' 2026 AI toolkit (lnget + L402) enables autonomous API payment and paid service hosting. Workflows can self-fund. | High |
| Three-tier memory (hot/vault/semantic) | 2026 enterprise pattern. Lightweight, local-first, pairs with Reflexion's episodic lessons. | High |
Grok's key insight: The Ghost Core spec is not an approximation — it's the refined, research-validated evolution of current agent architecture patterns. The 2,000-line core constraint is achievable and maintainable.
Appendix D: Architectural Decisions Record (ADR)
These decisions were made during the research interview (March 8, 2026):
ADR-1: Migration Strategy — Hybrid Core Rewrite + Incremental Shell
Decision: Build the Ghost Core (ghost.py, reflexion.py, workflow_engine.py) as NEW modules alongside existing code. Gradually route traffic from the old orchestrator to the new Ghost Core. Keep the dashboard as-is during migration.
Rationale: Avoids the risk of a big-bang rewrite while still achieving the architectural target. The old code continues to work while the new cognitive kernel is validated. Traffic can be shifted incrementally per-route.
Consequence: Temporary complexity from having two orchestration paths. Must be disciplined about completing migration — don't let both paths persist indefinitely.
ADR-2: Intelligence Model — Dual-Track (YAML + LLM)
Decision: Known tasks use YAML workflows (fast, deterministic, auditable). Novel tasks trigger LLM-driven agentic loops. Over time, successful LLM patterns get codified into new YAML workflows automatically.
Rationale: Pure YAML is too rigid for emergent AGI behavior. Pure LLM is too unpredictable and expensive. The dual-track model gives determinism where possible and flexibility where needed, with a natural path for workflows to evolve.
Consequence: Need a clear decision point for "is this a known task?" (solved by the intent classifier in Section 7.3). Need a workflow generation pipeline that captures successful LLM patterns as YAML.
ADR-3: Tool Runtime — Subprocess Sandboxing
Decision: Tools run as separate processes with bubblewrap/namespace sandboxing on Linux. Not Docker-only.
Rationale: Docker is heavy for RPi deployments and adds latency for lightweight tools. Subprocess sandboxing with bubblewrap provides isolation without the container overhead. Works on bare metal, still isolated.
Consequence: Need a ToolRunner abstraction that supports both subprocess and Docker backends. The tool contract (health check, execute, shutdown) stays the same regardless of backend.
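One possible shape for the ToolRunner abstraction in ADR-3 is sketched below. The three-method contract (health check, execute, shutdown) comes from the ADR; everything else is an assumption, including the `ToolResult` fields and the `wrapper` argument, which is where a bubblewrap command prefix would go (its flags are deliberately left out here).

```python
import subprocess
import sys
from dataclasses import dataclass
from typing import Protocol, Sequence


@dataclass
class ToolResult:
    exit_code: int
    stdout: str
    stderr: str


class ToolRunner(Protocol):
    """Backend-agnostic tool contract; a Docker backend would implement the same methods."""

    def health_check(self) -> bool: ...
    def execute(self, argv: Sequence[str], timeout: float) -> ToolResult: ...
    def shutdown(self) -> None: ...


class SubprocessRunner:
    """Runs each tool call as a separate process, optionally behind a sandbox wrapper."""

    def __init__(self, wrapper: Sequence[str] = ()):
        # Command prefix prepended to every call; empty means no sandbox.
        self.wrapper = list(wrapper)

    def health_check(self) -> bool:
        return self.execute([sys.executable, "-c", "pass"], timeout=10).exit_code == 0

    def execute(self, argv: Sequence[str], timeout: float) -> ToolResult:
        try:
            proc = subprocess.run(
                [*self.wrapper, *argv],
                capture_output=True,
                text=True,
                timeout=timeout,
            )
        except subprocess.TimeoutExpired:
            # Report a timeout as a failed result rather than raising.
            return ToolResult(exit_code=-1, stdout="", stderr="timeout")
        return ToolResult(proc.returncode, proc.stdout, proc.stderr)

    def shutdown(self) -> None:
        # Nothing persistent to clean up for one-shot subprocesses.
        pass
```

Callers depend only on the `ToolRunner` protocol, so swapping the subprocess backend for a container backend would not change the calling code.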
ADR-4: Immediate Priority — Event System + State Machine
Decision: First 2-week sprint focuses on unifying EventBus + swarm event_log, adding persistent event sourcing, and implementing the TaskState machine.
Rationale: This is the foundation everything else depends on. Without reliable event persistence and task state tracking, the Ghost Core can't debug itself, the Reflexion loop can't learn from history, and the workflow engine can't recover from failures.
Work items for breakdown in follow-up session:
- Merge infrastructure/events/bus.py + swarm/event_log.py into unified persistent event system
- Add WAL mode to all SQLite databases
- Implement TaskState enum and TaskContext dataclass with persistence
- Add event replay capability for debugging
- Wire Spark engine to consume from unified event stream
- Add lazy init guards to all module-level singletons
This research is a living document. Update as decisions are made and
implementations are validated. Track decisions in docs/DECISIONS.md.
Appendices: Peer Reviews
The following appendices contain external peer reviews solicited from five independent AI systems. Each was given the full research document and asked to identify gaps, risks, alternative approaches, prioritization feedback, and concrete suggestions. Reviews are reproduced as-received with minimal formatting edits.
Appendix E: Replit Review
Reviewer: Replit AI Date: March 2026
1. Gaps and Blind Spots
- Distributed Tracing & Observability: The report champions event sourcing and a local-first approach (even turning telemetry off). However, by breaking tools out into ephemeral Docker containers/subprocess sandboxes, you are introducing a distributed system. The report misses a strategy for distributed tracing (e.g., passing trace IDs or OpenTelemetry context across container boundaries). Without this, debugging a workflow where an agent calls a container that calls another API will be nearly impossible.
- Data Privacy & Scrubbing: A local-first system is inherently private, but "UnifiedMemory" acts as a sponge for all events and context. There is no mention of a memory scrubbing or redaction layer. If the agent acts on Discord or Telegram (as noted in integrations), how do we prevent PII or sensitive keys from becoming permanent "Semantic Memory" that is later retrieved and potentially leaked by the LLM?
- Workflow Resiliency Semantics: The report introduces YAML workflows and a TaskState machine. But what happens when the host machine reboots halfway through a 3-day workflow? The report misses the specific snapshotting/resume mechanics required for long-running processes.
- Network Sandboxing: While Bubblewrap/Docker are mentioned for execution sandboxing, network isolation is omitted. If a dynamic web-scraper tool is spun up, how do we prevent it from performing SSRF attacks against the host's internal network?
2. Challenges and Risks
- The "Cold Start" Latency Penalty: Spinning up Docker containers dynamically for tools (ToolRegistry) introduces significant latency. An LLM ReAct loop that waits 2–5 seconds per tool step for a container to boot will feel sluggish and break the illusion of continuous thought. This is a severe risk to the user experience.
- Fragility of Self-Modifying YAML: Using YAML as the medium for LLM self-modification is highly risky. LLMs frequently make subtle indentation or syntax errors. A single malformed YAML file could brick the workflow engine. The assumption that an LLM can reliably edit YAML orchestrations without breaking the parser is an underestimation of complexity.
- The 2,000-Line Code Golf Trap: Setting a strict 2K line limit for the Ghost Core is an excellent guiding philosophy but a dangerous metric. It risks encouraging "code golf" (overly dense, clever code) over readability.
- Dual-Track Orchestration Drift: ADR-1 proposes running the old core and Ghost Core side-by-side. The risk of these diverging and never actually completing the migration is extremely high. "Strangler Fig" patterns often leave behind permanent legacy appendages if not aggressively time-boxed.
3. Alternative Approaches
- WebAssembly (WASM) over Docker/Bubblewrap: For the dynamic tool registry, strongly consider WASM (via Wasmtime or Extism) instead of Docker. WASM provides millisecond cold starts, strict capability-based security (WASI), and language agnosticism. It perfectly aligns with "Substrate Independence" and "Capability-Based Security" goals while eliminating the Docker latency tax.
- DSL or Embedded Scripting over YAML: Instead of self-modifying YAML, consider using a sandboxed scripting language (like Starlark, Lua, or RestrictedPython) or strict JSON with a Pydantic schema validator. Starlark is designed exactly for this kind of deterministic, hermetic execution and is much safer for an LLM to generate than YAML.
- Standardized DI over Homegrown: In Section 2.1, the report proposes a custom Container class. Instead of reinventing dependency injection, leverage FastAPI's existing Depends system, or a lightweight standard like contextvars to manage scoped state without global singletons.
4. Prioritization Feedback
- Security Must Shift Left: The "Capability-based permission model" is currently in Tier 3 (Months 2-4). However, you are introducing dynamic Docker tool registries and self-modifying YAML in Tier 2. You cannot introduce dynamic code execution without the capability model already in place. Reprioritize Capability-based permissions to Tier 2.
- Evaluation Harnesses belong in Tier 1: Tier 1 lists "MockLLM for deterministic tests," which is good, but structural refactoring requires behavioral evaluations. Before rewriting the core, Tier 1 should include an automated eval suite (even just 10 core prompts) to guarantee the Ghost Core migration doesn't degrade intelligence.
- WAL Mode and DB Unification: Complete agreement on Tier 1. Consolidating SQLite databases and enabling WAL mode will yield immediate, high-ROI stability improvements.
5. Concrete Suggestions
- Actionable YAML Validation (Section 5.5): If you commit to YAML for workflows, implement a strict Pydantic model for the YAML schema. Force the agent to pass its proposed YAML modifications through an isolated validation tool before it is allowed to overwrite the actual file on disk.
- Tool Warming (Section 5.6): Implement "warm pools" for tool execution. Keep 1-2 generic worker processes running idly, and inject the specific tool instructions dynamically. This sidesteps the Docker cold-start issue.
- Deprecation Deadline (ADR-1): Add a concrete termination condition to ADR-1. For example: "The legacy orchestrator will be entirely deleted exactly 4 weeks after Ghost Core handles 50% of traffic, regardless of edge-case parity."
- Vector Search (Section 3.4): Proceed with sqlite-vec as recommended, but ensure you are chunking memories intelligently. Vector search performance degradation is often a symptom of storing monolithic chunks rather than the math itself. Implement a rolling summary for long contexts before they ever hit the vector DB.
Appendix F: Kimi Review
Reviewer: Kimi AI Date: March 2026
Executive Summary
The report presents a compelling vision for evolving Timmy Time from a dashboard-centric architecture to a sovereign AGI system via the "Ghost Core" pattern. The analysis of current maintainability issues (singleton proliferation, dual memory systems, import-time side effects) is accurate and actionable.
However, the report significantly underestimates production complexities in four critical areas: security architecture for autonomous systems, operational feasibility of the 2,000-line constraint, data migration safety, and human oversight mechanisms.
Recommendation: Revise Tier 1 priorities to include security hardening and observability infrastructure before proceeding with Ghost Core extraction. Increase core line budget to 4,000 lines with strict justification requirements. Add explicit human-in-the-loop circuit breakers before enabling self-modification.
1. Gaps & Blind Spots
1.1 Security Architecture (CRITICAL GAP)
The report mentions CSRF and security headers but entirely omits a threat model for autonomous agent security.
| Risk | Current State | Required Mitigation |
|---|---|---|
| Prompt Injection | No discussion of input sanitization for YAML workflows that execute shell commands | YAML schema validation + capability sandboxing |
| Capability Escalation | Timmy can spawn Docker containers and self-modify workflows; no containment strategy | Substrate isolation (gVisor/Firecracker) for untrusted tools |
| Secrets Rotation | Lightning wallet keys, API keys stored in .env with no rotation strategy | HashiCorp Vault integration or SOPS-based secret management |
| Network Segmentation | Externalized tools communicate over plaintext HTTP on localhost | mTLS or WireGuard mesh between core and tools |
Required Addition: A security/threat_model.md documenting attack vectors for self-modifying YAML, container escape prevention, and memory poisoning defenses.
1.2 Observability & Debugging (HIGH SEVERITY)
Missing: distributed tracing, LLM call inspection, state machine inspection, and memory provenance tracking.
1.3 Data Migration & Backward Compatibility (MEDIUM SEVERITY)
No data migration strategy for unifying four memory systems. Missing: migration tooling with dry-run capability, rollback procedures, dual-write strategy during transition.
1.4 Human-in-the-Loop & Oversight (PHILOSOPHICAL GAP)
Missing safeguards: emergency stop for runaway self-modification, human approval gates for high-cost actions, alignment checkpoints, kill switch for autonomous tool spawning.
2. Challenges & Risks
2.1 The "2,000 Line Core" Constraint (SEVERELY UNDERESTIMATED)
Line budget breakdown shows zero headroom for edge cases, platform abstractions, or migration code. The Linux kernel's scheduler is ~3,000 lines. Recommendation: 4,000-line soft limit with explicit security/observability/reliability justification for lines >2,000.
2.2 YAML as Intelligence (MODERATELY UNDERESTIMATED)
Schema evolution, validation overhead, git conflicts with concurrent human edits, and the Turing tarpit risk of YAML-with-conditionals becoming "a programming language—but a bad one."
2.3 Docker Dependency for Tools
Image pull latency on RPi, storage overhead, cold start latency, ARM64 availability. Required: subprocess sandboxing (bubblewrap) as primary runtime, Docker opt-in for heavy ML tools.
2.4 Event Sourcing Complexity
Missing: event schema evolution, snapshotting strategy, handling of non-deterministic replay.
2.5 SQLite Concurrency
WAL mode is necessary but insufficient. Missing: write queue management, connection pool exhaustion strategy, distributed SQLite integration.
3. Alternative Approaches
- Cellular Architecture: Self-contained agent cells with peer-to-peer communication instead of central Ghost Core. Consider hybrid—Ghost Core for orchestration, cells for network partition resilience.
- WebAssembly Components: WASM for lightweight tools (ms startup, KB size, capability-based sandboxing), Docker for heavy ML tools.
- Fedimint + Cashu: Ecash for privacy-critical workflows alongside Lightning.
4. Prioritization Feedback
Reprioritized Roadmap:
- Tier 1: WAL mode, lazy init guards, capability-based permissions (NEW), input validation (NEW), MockLLM, unified EventBus
- Tier 2: AgentCore interface, observability infrastructure (NEW), TaskState machine, sqlite-vec, config splitting
- Tier 3: Memory consolidation, agent pool with backpressure, human-in-the-loop gates (NEW), LLM intent classification
- Tier 4: Substrate abstraction, federation, gated self-improvement loops
5. Concrete Suggestions
- Guardian Layer: Security component that approves/rejects actions based on economic bounds, workflow safety verification, and human approval for high-risk actions.
- Workflow Schema Versioning: schema_version field with migration instructions.
- Event Compaction Strategy: Snapshot + summarize + archive pattern to prevent infinite log growth.
- Line Budget Revision: 4,000 lines with explicit allocation including 250 lines for security/guardian and 300 for observability.
Go/No-Go Criteria for Ghost Core Migration:
- Guardian layer implemented and tested
- Observability infrastructure operational
- Migration tooling tested on real data
- Human approval gates wired for high-cost actions
- Line count under 4,000 with documented allocation
- Rollback strategy validated
Appendix G: Claude (Anthropic) Review
Reviewer: Claude (Anthropic) Date: March 2026
Note: This review was identical in structure and content to Appendix F (Kimi). Both models independently converged on the same critical gaps, risk assessments, and recommendations. This convergence strengthens the signal: security architecture, line budget realism, human oversight, and migration safety are genuine blind spots requiring attention. The duplicate review has been consolidated here by reference rather than repeated in full.
Appendix H: Perplexity Review
Reviewer: Perplexity AI Date: March 2026
1. Gaps and Blind Spots
1.1 Security and Threat Model
Missing elements:
- Defined adversaries: Malicious/compromised tools/containers, prompt-injected content triggering tool calls or LN spend, local OS users/processes tampering with state or keys.
- Clear sovereignty scope: What "sovereign" actually guarantees at each layer (hardware, OS, runtime, data, models, economics). Ollama models and LN peers are still external dependencies.
- Container/tool sandboxing details: Network isolation policies, filesystem isolation, capability-to-container constraint mapping.
Concrete suggestion: Add a "Security & Sovereignty Model" section with threat actors table, explicit network egress policies, and OS-level capability constraints.
1.2 Operational & SRE Concerns
Missing: standard event schema for ALL subsystems, forensic query indices, minimal operator console, disaster recovery procedures (DB snapshot cadence, crash-safe backups, LN key sync), and upgrade/rollback playbooks.
1.3 Governance & Multi-Human Use
Implicitly single-user. Missing: role/permission model (Owner/Maintainer/Guest), conflict resolution for incompatible goals, formal process for changing critical policies.
1.4 Safety & Alignment
Missing: interruption/rollback for multi-step workflows with external effects, side-effect budgeting (max_file_writes, max_external_domains, max_shell_commands), goal scoping with explicit action domains, and a kill switch component.
2. Challenges and Risks
2.1 Multi-Agent & Workflow Complexity
Emergent loops from recursive self-improvement. State explosion from three-tier memory + event logs + workflow versions. Need global max_self_modify_depth, max_total_workflow_versions_per_id, and per-run max_spawned_workflows.
2.2 Tool Registry & Containerization
Docker-everywhere assumption fails on RPi/homelab. Cold start and port exhaustion under concurrency. Supply chain trust for container images (signing, checksums, local mirroring).
2.3 LN Economic Layer
Channel lifecycle complexity with intermittent uptime. Fee/route variability makes per-workflow cost estimation simplistic. Partial failure recovery undefined.
Recommendation: Treat LN as asynchronous side-effect with retry/compensation events. Start with manual channel management.
2.4 SQLite & Concurrency
Long-running transactions from event-sourced batched writes. Schema coordination across agents/tools. Need strict short-transaction discipline and append-only events with materialized views.
3. Alternative Approaches
- Event-First Core: Single canonical events table with normalized schema (event_id, parent_event_id, timestamp, actor, type, payload_json, run_id, workflow_id, agent_id). All subsystems produce/consume events. Natural audit trail, easier replay, fits non-deterministic testing.
- Policy Engine: Capabilities as data (YAML/JSON) with tiny evaluator instead of hardcoded checks. Enables user configuration without code edits and lets Timmy propose policy changes.
- Substrate-Aware Tooling: ToolSubstrate with Container/Process/Remote/WASM variants, selected per environment based on capabilities and policies.
- Split Storage: Operational store (SQLite for transactional state) vs analytical/knowledge store (DuckDB for large-scale events, metrics, memory materializations).
4. Prioritization Feedback
Move earlier:
- Threat model + basic policy/capability enforcement (Tier 1, even if hardcoded)
- Minimal "Timmy Console" for event/state introspection
Move later:
- Aggressive Docker-based tool externalization (start subprocess/bubblewrap first)
- Full autoresearch loops (defer until policies, budgets, and observability are solid)
Tier 1 additions: Minimal threat model, hard "no outbound HTTP except allowed" rule, centralized capability check path. Tier 2 additions: Simple governance roles, toggleable approval mode, basic Timmy Console.
5. Concrete Suggestions
- Add Security & Sovereignty section with threat model table and sovereignty boundary diagram.
- Expand capability model into mini-policy system with YAML-defined policy rules.
- Clarify tool substrates (container/subprocess/remote) with environment-based selection.
- Tighten LN description: estimated_cost as upper bound, async event representation, partial failure handling.
- Make event sourcing more central: canonical Event type, unified events table/schema.
- Add governance & modes section: assist_only, propose_then_execute, autonomous_with_budget with capability/approval implications.
Appendix I: Grok Review
Reviewer: Grok (xAI) Date: March 2026
1. Gaps and Blind Spots
Security & Threat Model: No explicit adversary list or sovereignty boundary. Tools from Docker images risk malicious tags or prompt-injection triggering LN spend. Missing:
- Network egress policy (core never outbound except via proxy)
- Prompt sanitization layer before tool calls
- Rootless Podman as default (daemonless, no root escalation—beats Docker on homelab security)
Operational Resilience: No backup cadence, crash-recovery playbook, or event schema for forensics. Power cut mid-channel-open = funds gone. LN key sync undefined.
Governance & Alignment: Single-user assumption. No roles, approval gates for risky steps, or intent drift detection.
2. Challenges and Risks
- Self-Modification Loops: Recursive "improve" steps burning sats on useless tools. No global depth cap or meta-guardrail against spawning sub-workflows.
- LN Economics in Practice: Channel liquidity, routing fees, offline failures. Partial payment fails with no compensation logic.
- Tool Cold Starts & Port Hell: Per-step container spin-up = latency spikes. Port exhaustion under concurrency. No image signing/checksums = supply chain attack vector.
- State Explosion: Three-tier memory + event logs + workflow versions = unprunable mess. No hard caps on active runs or decay policy.
3. Alternative Approaches
- Tool Substrate: Rootless Podman (security-first), subprocess (no container tax for trusted tools), WASM (sandboxed, fast cold-start).
- Event-First Core: Append-only JSONL events table. Replayable, audit-proof. DuckDB for analytics.
- Policy Engine: YAML policies (shell.execute: requires_approval, max_3_per_run). Tiny evaluator scales without code bloat.
- LN Handling: Payments as async events with retry/compensation. Start manual-channel only.
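A "tiny evaluator" of the kind described in the Policy Engine bullet might look like the sketch below. It assumes the YAML policies have already been parsed into `Policy` objects; the class names, the three return values, and the default-deny behaviour for unlisted actions are illustrative choices, not part of the reviewer's proposal.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Policy:
    requires_approval: bool = False
    max_per_run: Optional[int] = None  # None means unlimited


@dataclass
class PolicyEngine:
    """Evaluates actions against data-defined policies for a single run."""

    policies: dict[str, Policy]
    _calls: Counter = field(default_factory=Counter)

    def check(self, action: str, approved: bool = False) -> str:
        """Return "allow", "needs_approval", or "deny" for one attempted action."""
        policy = self.policies.get(action)
        if policy is None:
            return "deny"  # assumed default: actions without a policy are refused
        if policy.max_per_run is not None and self._calls[action] >= policy.max_per_run:
            return "deny"
        if policy.requires_approval and not approved:
            return "needs_approval"
        # Only allowed calls count against the per-run limit.
        self._calls[action] += 1
        return "allow"
```

With the reviewer's example policy, `PolicyEngine({"shell.execute": Policy(requires_approval=True, max_per_run=3)})` would ask for approval on each call and refuse a fourth approved call in the same run.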
4. Prioritization Feedback
Move up (Tier 1):
- Threat model + basic policy checks. One rogue LN call kills trust.
- Unified events + minimal console. Debug hell otherwise.
Defer slightly (Tier 2):
- Full Docker registry. Start subprocess + Podman hybrid.
- Advanced reflexion/self-modify. Stabilize budgets/policies first.
Keep as-is: Migration phases—bots out first is perfect.
5. Concrete Suggestions
Workflow Schema Extension:
budget:
max_sats: 500
max_file_writes: 5
max_domains: 10
review_required: [shell.execute, git.push, ln.payout]
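Enforcing the budget fields above at runtime could be as simple as the following sketch. The limit names mirror the YAML keys; `RunBudget`, `BudgetExceeded`, and the rule that unknown resources are rejected are assumptions for illustration.

```python
from dataclasses import dataclass, field


class BudgetExceeded(RuntimeError):
    """Raised when a charge would push a resource past its limit."""


@dataclass
class RunBudget:
    limits: dict[str, int]
    spent: dict[str, int] = field(default_factory=dict)

    def charge(self, resource: str, amount: int = 1) -> None:
        """Record usage, refusing the charge if it would exceed the limit."""
        if resource not in self.limits:
            # Assumed behaviour: a resource without a declared limit is not allowed.
            raise BudgetExceeded(f"no budget declared for {resource}")
        used = self.spent.get(resource, 0) + amount
        if used > self.limits[resource]:
            raise BudgetExceeded(
                f"{resource}: {used} would exceed limit {self.limits[resource]}"
            )
        self.spent[resource] = used
```

The check happens before the counter is updated, so a refused charge leaves the recorded spend unchanged and the workflow can stop cleanly or escalate for approval.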
Registry Patch:
if tool.trust_level == "untrusted":
    use_substrate("subprocess", airgapped=True)
else:
    use_substrate("podman_rootless")
Ops Playbook: Hourly tar + gpg backups offsite. Drain workflows before schema migration, replay events after. /admin/stop kill switch route (auth-only).
Event Schema:
{"event_id": "...", "timestamp": "...", "actor": "workflow:alpha_hunter", "type": "tool_call", "payload": {...}}
Cross-Review Consensus Summary
All five reviewers independently converged on the following critical findings:
| Finding | Reviewers Flagging | Severity |
|---|---|---|
| Missing security/threat model | All 5 | CRITICAL |
| 2,000-line budget is unrealistic | 4 of 5 | HIGH |
| Docker cold-start latency risk | All 5 | HIGH |
| Need WASM/subprocess tool substrates | All 5 | HIGH |
| Missing human-in-the-loop safeguards | 4 of 5 | HIGH |
| Self-modifying YAML fragility | 4 of 5 | HIGH |
| No observability/tracing strategy | All 5 | HIGH |
| LN partial failure handling missing | 3 of 5 | MEDIUM |
| Missing data migration strategy | 3 of 5 | MEDIUM |
| Event sourcing needs snapshotting | 3 of 5 | MEDIUM |
| Security/capabilities must move to Tier 1 | All 5 | CRITICAL |
| Need policy engine (data-driven, not hardcoded) | 3 of 5 | MEDIUM |
| SQLite concurrency insufficiently addressed | 3 of 5 | MEDIUM |
Unanimous recommendation: Security architecture and observability infrastructure must be Tier 1 prerequisites before Ghost Core extraction or self-modification capabilities are enabled.