Timmy-time-dashboard/docs/research/integration-architecture-deep-dives.md
Claude (Opus 4.6) 092c982341
[claude] Ingest integration architecture research and triage work (#946) (#1057)
2026-03-23 01:40:39 +00:00


Timmy Time Integration Architecture: Eight Deep Dives into Real Deployment

Source: PDF attached to issue #946, written during the Veloren exploration phase. Many of the patterns are game-agnostic and carry over to the Morrowind/OpenClaw pivot.

Summary of Eight Deep Dives

1. Veloren Client Sidecar (Game-Specific)

  • WebSocket JSON-line pattern for wrapping game clients
  • PyO3 direct binding infeasible; sidecar process wins
  • IPC latency negligible (~11 µs TCP, ~5 µs pipes) vs LLM inference
  • Status: Superseded by OpenMW Lua bridge (#964)
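
The JSON-line framing above can be sketched with the standard library alone. This is a minimal illustration of newline-delimited JSON message framing, not the actual sidecar protocol; the event field names (`type`, `hp`, etc.) are hypothetical.

```python
import json

def encode_event(event: dict) -> bytes:
    """Frame one event as a single JSON line (newline-delimited JSON)."""
    return (json.dumps(event, separators=(",", ":")) + "\n").encode("utf-8")

def decode_events(buffer: bytes):
    """Split a received byte buffer into complete JSON-line events.
    Returns (events, remainder) so a partial trailing line is kept for
    the next read instead of being dropped."""
    *lines, remainder = buffer.split(b"\n")
    events = [json.loads(line) for line in lines if line.strip()]
    return events, remainder

# Round-trip: two complete events plus a partial one still in flight.
buf = (encode_event({"type": "observation", "hp": 80})
       + encode_event({"type": "chat", "text": "hi"})
       + b'{"type":"par')
events, rest = decode_events(buf)
```

The remainder-handling is the point of the pattern: TCP delivers byte streams, not messages, so the decoder must tolerate events split across reads.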

2. Agno Ollama Tool Calling is Broken

  • Agno issues #2231, #2625, #1419, #1612, #4715 document persistent breakage
  • Root cause: Agno's Ollama model class doesn't robustly parse native tool_calls
  • Fix: Use Ollama's format parameter with Pydantic JSON schemas directly
  • Recommended models: qwen3-coder:32b (top), glm-4.7-flash, gpt-oss:20b
  • Critical settings: temperature 0.0-0.2, stream=False for tool calls
  • Status: Covered by #966 (three-tier router)
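
A sketch of the recommended workaround: build the request payload with Ollama's `format` parameter set to a JSON schema, bypassing Agno's tool-call parsing. The schema here is hand-written for illustration; with Pydantic you would pass `Model.model_json_schema()` instead. The model tag and prompt are examples, not fixed choices.

```python
# Hypothetical action schema; replace with Model.model_json_schema() when
# using Pydantic models.
ACTION_SCHEMA = {
    "type": "object",
    "properties": {
        "tool": {"type": "string"},
        "arguments": {"type": "object"},
    },
    "required": ["tool", "arguments"],
}

def build_chat_request(prompt: str) -> dict:
    """Build an Ollama /api/chat payload that constrains the response to a
    JSON schema instead of relying on native tool_calls parsing."""
    return {
        "model": "qwen3-coder:32b",
        "messages": [{"role": "user", "content": prompt}],
        "format": ACTION_SCHEMA,          # structured output, schema-constrained
        "stream": False,                  # per the settings above: no streaming for tool calls
        "options": {"temperature": 0.1},  # within the recommended 0.0-0.2 range
    }

payload = build_chat_request("Open the door in front of you.")
```

The returned message content is then guaranteed-parseable JSON matching the schema, so a single `json.loads` plus schema validation replaces the fragile tool-call extraction path.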

3. MCP is the Right Abstraction

  • FastMCP averages 26.45ms per tool call (TM Dev Lab benchmark, Feb 2026)
  • Total MCP overhead per cycle: ~20-60ms (<3% of 2-second budget)
  • Agno has first-class bidirectional MCP integration (MCPTools, MultiMCPTools)
  • Use stdio transport for near-zero latency; return compressed JPEG not base64
  • Status: Covered by #984 (MCP restore)
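
A back-of-envelope check of the overhead claim above, using the cited FastMCP average. The calls-per-cycle figure is an assumption chosen to land inside the ~20-60 ms range stated in the notes.

```python
PER_CALL_MS = 26.45   # FastMCP average per tool call (TM Dev Lab benchmark, Feb 2026)
CALLS_PER_CYCLE = 2   # assumption: one or two tool calls per decision cycle
BUDGET_MS = 2000      # the 2-second cycle budget

overhead_ms = PER_CALL_MS * CALLS_PER_CYCLE   # ~53 ms per cycle
fraction = overhead_ms / BUDGET_MS            # well under the 3% threshold
```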

4. Human + AI Co-op Architecture (Game-Specific)

  • Headless client treated identically to graphical client by server
  • Leverages party system, trade API, and /tell for communication
  • Mode switching: solo autonomous play when human absent, assist when present
  • Status: Defer until after tutorial completion
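
The mode-switching rule reduces to a single predicate on human presence. A minimal sketch, with the mode names chosen for illustration:

```python
from enum import Enum

class Mode(Enum):
    SOLO = "solo_autonomous"   # play independently when no human is online
    ASSIST = "assist"          # follow/support the human player when present

def select_mode(human_online: bool) -> Mode:
    """Co-op mode switch: assist when the human is present, solo otherwise."""
    return Mode.ASSIST if human_online else Mode.SOLO
```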

5. Real Latency Numbers

  • All-local M3 Max pipeline: 4-9 seconds per full cycle
  • Groq hybrid pipeline: 3-7 seconds per full cycle
  • VLM inference is 50-70% of total pipeline time (bottleneck)
  • Dual-model Ollama on 96GB M3 Max: ~11-14GB resident, ~70GB free
  • Status: Superseded by API-first perception (#963)
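
Combining the numbers above: if the all-local cycle is 4-9 s and VLM inference is 50-70% of it, the VLM alone accounts for roughly 2.0-6.3 s, which is why perception is the optimization target.

```python
LOCAL_CYCLE_S = (4.0, 9.0)   # all-local M3 Max pipeline range, per the notes above
VLM_SHARE = (0.50, 0.70)     # VLM inference fraction of total pipeline time

vlm_min_s = LOCAL_CYCLE_S[0] * VLM_SHARE[0]   # best case
vlm_max_s = LOCAL_CYCLE_S[1] * VLM_SHARE[1]   # worst case
```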

6. Content Moderation (Three-Layer Defense)

  • Layer 1: Game-context system prompts (Morrowind themes as game mechanics)
  • Layer 2: Llama Guard 3 1B at <30ms/sentence for real-time filtering
  • Layer 3: Per-game moderation profiles with vocabulary whitelists
  • Run moderation + TTS preprocessing in parallel for zero added latency
  • Neuro-sama incident (Dec 2022) is the cautionary tale
  • Status: New issue created → #1056
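
The "zero added latency" claim follows from running the moderation check and TTS preprocessing concurrently: total latency becomes max() of the two rather than their sum. A minimal sketch with thread-based concurrency; `moderate` and `preprocess_tts` are placeholder stand-ins for the Llama Guard check and the real TTS pipeline.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Optional

def moderate(text: str) -> bool:
    """Stand-in for the Llama Guard 3 1B check (Layer 2); True means safe."""
    return "forbidden" not in text  # placeholder rule, illustration only

def preprocess_tts(text: str) -> str:
    """Stand-in for TTS preprocessing (normalization, phoneme prep, ...)."""
    return text.strip()

def speak_if_safe(text: str) -> Optional[str]:
    """Run moderation and TTS prep in parallel; only emit speech if safe."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        safe = pool.submit(moderate, text)
        prepped = pool.submit(preprocess_tts, text)
        return prepped.result() if safe.result() else None
```

If moderation rejects the line, the TTS preprocessing result is simply discarded; the wasted work buys the latency win.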

7. Model Selection (Qwen3-8B vs Hermes 3)

  • Three-role architecture: Perception (Qwen3-VL 8B), Decision (Qwen3-8B), Narration (Hermes 3 8B)
  • Qwen3-8B outperforms Qwen2.5-14B on 15 benchmarks
  • Hermes 3 best for narration (steerability, roleplaying)
  • Both use identical Hermes Function Calling standard
  • Status: Partially covered by #966 (three-tier router)
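
The three-role architecture amounts to a static role-to-model routing table. A sketch, with illustrative (not confirmed) Ollama model tags:

```python
# Hypothetical routing table for the three-role architecture above.
ROLE_MODELS = {
    "perception": "qwen3-vl:8b",   # screen/frame understanding
    "decision": "qwen3:8b",        # tool-calling game actions
    "narration": "hermes3:8b",     # streamer-voice commentary
}

def model_for(role: str) -> str:
    """Resolve the model tag for a pipeline role."""
    try:
        return ROLE_MODELS[role]
    except KeyError:
        raise ValueError(f"unknown role: {role}") from None
```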

8. Split Hetzner + Mac Deployment

  • Hetzner GEX44 (RTX 4000 SFF Ada, €184/month) for rendering/streaming
  • Mac M3 Max for all AI inference via Tailscale
  • Use FFmpeg x11grab + NVENC, not OBS (no headless support)
  • Use headless Xorg, not Xvfb (GPU access required for Vulkan)
  • Total cost: ~$200/month
  • Status: Referenced in #982 sprint plan
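
The capture path above can be expressed as a single FFmpeg invocation. This is a configuration sketch, not a tested command: the display number, resolution, bitrate, preset, and ingest URL are placeholders to be tuned for the GEX44.

```shell
# Capture the headless Xorg display and encode on the RTX 4000's NVENC.
ffmpeg -f x11grab -framerate 60 -video_size 1920x1080 -i :0.0 \
       -c:v h264_nvenc -preset p4 -b:v 6000k \
       -f flv rtmp://live.example/app/STREAM_KEY
```

This avoids OBS entirely (no headless support) while keeping encode load off the Mac, which stays dedicated to inference.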

Cross-Reference to Active Issues

| Research Topic | Active Issue | Status |
| --- | --- | --- |
| Pydantic structured output for Ollama | #966 (three-tier router) | In progress |
| FastMCP tool server | #984 (MCP restore) | In progress |
| Content moderation pipeline | #1056 (new) | Created from this research |
| Split Hetzner + Mac deployment | #982 (sprint plan) | Referenced |
| VLM latency / perception | #963 (perception bottleneck) | API-first approach |
| OpenMW bridge (replaces Veloren sidecar) | #964 | In progress |