diff --git a/docs/research/integration-architecture-deep-dives.md b/docs/research/integration-architecture-deep-dives.md
new file mode 100644
index 00000000..f23a62e4
--- /dev/null
+++ b/docs/research/integration-architecture-deep-dives.md
@@ -0,0 +1,74 @@
# Timmy Time Integration Architecture: Eight Deep Dives into Real Deployment

> **Source:** PDF attached to issue #946, written during the Veloren exploration phase.
> Many patterns are game-agnostic and apply to the Morrowind/OpenClaw pivot.

## Summary of Eight Deep Dives

### 1. Veloren Client Sidecar (Game-Specific)
- WebSocket JSON-line pattern for wrapping game clients
- PyO3 direct binding is infeasible; a sidecar process wins
- IPC latency is negligible (~11 µs TCP, ~5 µs pipes) next to LLM inference time
- **Status:** Superseded by the OpenMW Lua bridge (#964)

### 2. Agno Ollama Tool Calling is Broken
- Agno issues #2231, #2625, #1419, #1612, and #4715 document persistent breakage
- Root cause: Agno's Ollama model class does not robustly parse native `tool_calls`
- **Fix:** Use Ollama's `format` parameter with Pydantic JSON schemas directly
- Recommended models: qwen3-coder:32b (top), glm-4.7-flash, gpt-oss:20b
- Critical settings: temperature 0.0-0.2, `stream=False` for tool calls
- **Status:** Covered by #966 (three-tier router)

### 3. MCP is the Right Abstraction
- FastMCP averages 26.45 ms per tool call (TM Dev Lab benchmark, Feb 2026)
- Total MCP overhead per cycle: ~20-60 ms (<3% of the 2-second budget)
- Agno has first-class bidirectional MCP integration (MCPTools, MultiMCPTools)
- Use the stdio transport for near-zero latency; return compressed JPEGs, not base64 blobs
- **Status:** Covered by #984 (MCP restore)

### 4. Human + AI Co-op Architecture (Game-Specific)
- The server treats a headless client identically to a graphical client
- Leverages the party system, trade API, and /tell for communication
- Mode switching: solo autonomous play when the human is absent, assist mode when present
- **Status:** Defer until after tutorial completion

### 5. Real Latency Numbers
- All-local M3 Max pipeline: 4-9 seconds per full cycle
- Groq hybrid pipeline: 3-7 seconds per full cycle
- VLM inference is 50-70% of total pipeline time (the bottleneck)
- Dual-model Ollama on a 96 GB M3 Max: ~11-14 GB used, ~70 GB free
- **Status:** Superseded by API-first perception (#963)

### 6. Content Moderation (Three-Layer Defense)
- Layer 1: Game-context system prompts (frame Morrowind themes as game mechanics)
- Layer 2: Llama Guard 3 1B at <30 ms/sentence for real-time filtering
- Layer 3: Per-game moderation profiles with vocabulary whitelists
- Run moderation and TTS preprocessing in parallel for zero added latency
- The Neuro-sama incident (Dec 2022) is the cautionary tale
- **Status:** New issue created → #1056

### 7. Model Selection (Qwen3-8B vs Hermes 3)
- Three-role architecture: Perception (Qwen3-VL 8B), Decision (Qwen3-8B), Narration (Hermes 3 8B)
- Qwen3-8B outperforms Qwen2.5-14B on 15 benchmarks
- Hermes 3 is best for narration (steerability, roleplaying)
- Both use the identical Hermes Function Calling standard
- **Status:** Partially covered by #966 (three-tier router)

### 8. Split Hetzner + Mac Deployment
- Hetzner GEX44 (RTX 4000 SFF Ada, €184/month) for rendering/streaming
- Mac M3 Max for all AI inference, connected via Tailscale
- Use FFmpeg x11grab + NVENC, not OBS (OBS has no headless support)
- Use headless Xorg, not Xvfb (Vulkan needs GPU access; Xvfb is software-only)
- Total cost: ~$200/month
- **Status:** Referenced in #982 sprint plan

## Cross-Reference to Active Issues

| Research Topic | Active Issue | Status |
|----------------|--------------|--------|
| Pydantic structured output for Ollama | #966 (three-tier router) | In progress |
| FastMCP tool server | #984 (MCP restore) | In progress |
| Content moderation pipeline | #1056 (new) | Created from this research |
| Split Hetzner + Mac deployment | #982 (sprint plan) | Referenced |
| VLM latency / perception | #963 (perception bottleneck) | API-first approach |
| OpenMW bridge (replaces Veloren sidecar) | #964 | In progress |
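
## Sketch: Structured Output via Ollama `format` (Deep Dive 2)

The fix recommended in deep dive 2 — bypass Agno's tool-call parsing and constrain Ollama's output with a Pydantic JSON schema — can be sketched as follows. This is a minimal illustration, not project code: the `MoveTo` action model is a hypothetical example, and the `ollama.chat` call is shown commented out since it needs a live Ollama server; the validation path below runs against a canned response.

```python
# Sketch of the deep-dive-2 workaround: request JSON constrained by a
# Pydantic-generated schema instead of relying on native tool_calls parsing.
# `MoveTo` is a hypothetical action model, not part of the project.
from pydantic import BaseModel, Field

class MoveTo(BaseModel):
    """One structured 'tool call' the decision model may emit."""
    action: str = Field(description="Must be 'move_to'")
    x: float
    y: float
    reason: str

# With a live Ollama server the call would look like this (untested sketch):
#
#   import ollama
#   resp = ollama.chat(
#       model="qwen3-coder:32b",
#       messages=[{"role": "user", "content": "Walk to the door."}],
#       format=MoveTo.model_json_schema(),  # constrains output to the schema
#       options={"temperature": 0.0},       # deep dive 2: 0.0-0.2
#       stream=False,                       # deep dive 2: no streaming for tool calls
#   )
#   raw = resp["message"]["content"]
#
# Here we validate a canned response to show the parsing path:
raw = '{"action": "move_to", "x": 12.5, "y": -3.0, "reason": "door is north"}'
call = MoveTo.model_validate_json(raw)  # raises ValidationError if schema broken
print(call.action, call.x, call.y)
```

Because the schema is enforced server-side by Ollama and re-validated client-side by Pydantic, a malformed "tool call" fails loudly at parse time instead of silently dropping, which is the failure mode the Agno issues describe.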
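
## Sketch: Parallel Moderation + TTS Preprocessing (Deep Dive 6)

Deep dive 6 notes that moderation and TTS preprocessing can run in parallel for zero added latency. A minimal `asyncio` sketch of that shape, with both stage bodies as stand-in stubs (neither is project code):

```python
# Minimal sketch of deep dive 6's parallelism: moderation and TTS text
# preprocessing run concurrently, so the slower stage sets the latency,
# not the sum. Both stage bodies are stand-in stubs.
import asyncio
from typing import Optional

async def moderate(text: str) -> bool:
    # Stand-in for a Llama Guard 3 1B check (<30 ms/sentence in the research).
    await asyncio.sleep(0.03)
    return "forbidden" not in text

async def preprocess_for_tts(text: str) -> str:
    # Stand-in for TTS normalization (numbers, abbreviations, pauses).
    await asyncio.sleep(0.02)
    return text.strip()

async def narrate(text: str) -> Optional[str]:
    # Run both stages at once; total wait is max(30 ms, 20 ms), not 50 ms.
    safe, prepared = await asyncio.gather(moderate(text), preprocess_for_tts(text))
    return prepared if safe else None  # drop the line if moderation fails

result = asyncio.run(narrate("  Greetings, outlander.  "))
print(result)
```

The key property is that the TTS-side work is never serialized behind the moderation check: by the time Llama Guard returns, the normalized text is already waiting, so moderation adds no wall-clock time unless it is the slower stage.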
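
## Sketch: Headless Capture Command (Deep Dive 8)

The FFmpeg x11grab + NVENC path from deep dive 8 might look roughly like the command below. This is an illustrative command fragment, not project config: the display number, resolution, framerate, bitrate, and RTMP endpoint are all placeholder assumptions, and it presumes a headless Xorg server on `:0` with the NVIDIA driver loaded.

```shell
# Illustrative sketch of the deep-dive-8 capture path (placeholder values).
# Assumes headless Xorg on display :0 with GPU access (required for Vulkan
# rendering and for the h264_nvenc hardware encoder).
ffmpeg -f x11grab -framerate 60 -video_size 1920x1080 -i :0.0 \
       -c:v h264_nvenc -preset p4 -b:v 6M -maxrate 6M -bufsize 12M \
       -g 120 -pix_fmt yuv420p \
       -f flv rtmp://live.example.invalid/stream/KEY
```

This keeps the whole render-and-stream loop on the Hetzner box's GPU (x11grab reads the Xorg framebuffer, NVENC encodes on-card), which is why OBS — with no headless mode — and Xvfb — with no GPU backing — are both ruled out.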