[Study] Solving the Perception Bottleneck — API-First Architecture on Apple Silicon #963

Closed
opened 2026-03-22 18:44:34 +00:00 by perplexity · 2 comments
Collaborator

Summary

This paper presents a complete architecture for eliminating Timmy's $50/day cloud-VLM dependency by shifting from screen interpretation to API-first perception via OpenMW Lua. The thesis: OpenMW Lua is "Mineflayer for Morrowind" — providing structured access to ~95% of game state without any vision model, and reducing weighted-average perception latency from multi-second cloud calls to ~70ms locally on an M3 Max.

Core Architecture

4-Level Perception Hierarchy (cheapest first)

| Level | Method | Latency | Use Case | Frequency |
|-------|--------|---------|----------|-----------|
| L1 | OpenMW Lua API | ~1ms | Position, stats, nearby entities, quest state, inventory | 100% of ticks |
| L2 | Core ML classifier | ~3ms | UI state detection (gameplay/dialogue/inventory/map/menu) | ~30% of ticks |
| L3 | PaddleOCR | ~40ms | Dialogue text extraction (fallback when API prediction < 0.8 confidence) | ~5% of ticks |
| L4 | FastVLM-0.5B (local) | ~300ms | Genuinely ambiguous visual state, stuck detection | ~3% of ticks |
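The table's cost ordering amounts to a confidence-gated fall-through: try the cheapest level, escalate only while confidence stays below threshold. A minimal sketch (the level names and the 0.8 threshold come from the table above; the sensor functions are hypothetical stand-ins):

```python
# Confidence-gated perception fall-through, cheapest level first.
# Sensors here are illustrative stubs, not the real L1/L3 backends.
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class Percept:
    source: str        # which level produced this observation
    confidence: float  # 0.0 - 1.0
    state: dict        # structured observation

def perceive(levels: list[tuple[str, Callable[[], Percept]]],
             threshold: float = 0.8) -> Percept:
    """Try each perception level in cost order; stop escalating once
    some level's confidence clears the threshold."""
    best: Optional[Percept] = None
    for _name, sensor in levels:
        p = sensor()
        if best is None or p.confidence > best.confidence:
            best = p
        if best.confidence >= threshold:
            break  # the cheap level was good enough; skip expensive ones
    return best

# Hypothetical sensors standing in for the Lua API and OCR levels.
lua_api = lambda: Percept("L1-lua", 1.0, {"pos": (0, 0, 0)})
ocr     = lambda: Percept("L3-ocr", 0.9, {"dialogue": "..."})

result = perceive([("L1", lua_api), ("L3", ocr)])
```

With the Lua API reporting full confidence, the loop returns after L1 and the OCR sensor never runs — which is exactly how L1 ends up covering 100% of ticks while L3 fires on only ~5%.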

Three-Tier Metabolic LLM Model (decision-making)

| Tier | Model | Quant | RAM | Use Case | Frequency |
|------|-------|-------|-----|----------|-----------|
| T1 (Routine) | Qwen3-3B | Q8_0 | 3.5GB | Simple choices, navigation | 20% |
| T2 (Medium) | Llama-3.1-8B | Q4_K_M | 5GB | Dialogue, inventory management | 7% |
| T3 (Complex) | Qwen3-32B | Q4_K_M | 20GB | Quest planning, stuck recovery (game paused) | 3% |
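The routing itself can be a few lines once each pending decision carries a complexity estimate. A sketch under that assumption (the tier boundaries and scoring scheme here are illustrative; only the model tiers and the pause-for-T3 rule come from the table):

```python
# Three-tier "metabolic" router sketch. Assumes each decision arrives with a
# complexity score in [0, 1]; the 0.4/0.8 cut points are hypothetical.

def route_decision(complexity: float) -> tuple[str, bool]:
    """Return (model_name, pause_game) for a decision of given complexity."""
    if complexity < 0.4:
        return ("qwen3-3b-q8", False)      # T1: simple choices, navigation
    if complexity < 0.8:
        return ("llama-3.1-8b-q4", False)  # T2: dialogue, inventory
    return ("qwen3-32b-q4", True)          # T3: quest planning; pause the game

model, pause = route_decision(0.9)   # T3 decision -> big model, game paused
```

Pausing only for T3 is what keeps the 1.5-2.5s planning latency from ever being visible in-game.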

Behavior Trees

  • Handle 70% of all actions at zero inference cost (~2ms)
  • Walk, attack, loot, basic combat — all BT-driven
  • LLMs reserved only for genuinely novel decisions
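The reason BTs are effectively free is that a behavior tree is plain control flow, no inference anywhere. A self-contained toy (node types are standard BT vocabulary; the example tree and action names are illustrative, not the paper's actual tree):

```python
# Minimal behavior-tree node types: Selector (first child to succeed wins),
# Sequence (first child to fail aborts), plus leaf Condition/Action nodes.
SUCCESS, FAILURE = "success", "failure"

class Selector:
    def __init__(self, *children): self.children = children
    def tick(self, state):
        for child in self.children:
            if child.tick(state) == SUCCESS:
                return SUCCESS
        return FAILURE

class Sequence:
    def __init__(self, *children): self.children = children
    def tick(self, state):
        for child in self.children:
            if child.tick(state) == FAILURE:
                return FAILURE
        return SUCCESS

class Condition:
    def __init__(self, pred): self.pred = pred
    def tick(self, state): return SUCCESS if self.pred(state) else FAILURE

class Action:
    def __init__(self, fn): self.fn = fn
    def tick(self, state): self.fn(state); return SUCCESS

# "Attack if an enemy is near, otherwise keep walking" as a tiny tree.
tree = Selector(
    Sequence(Condition(lambda s: s["enemy_near"]),
             Action(lambda s: s.update(action="attack"))),
    Action(lambda s: s.update(action="walk")),
)
state = {"enemy_near": True}
tree.tick(state)   # sets state["action"] = "attack"
```

A tick through this tree is a handful of function calls — the ~2ms figure above is dominated by reading game state, not by the tree itself.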

Key Technical Details

OpenMW Lua API Surface

  • openmw.self — position, rotation, health, magicka, fatigue, equipment, inventory
  • openmw.nearby — actors and objects within radius, with stats and disposition
  • openmw.world — time of day, weather, cell name, active quests, quest stages
  • world.pause("agent") / world.unpause("agent") — freeze game for complex reasoning
  • Covers character stats, nearby entities, quest journal, inventory, active effects, cell transitions

ESM Data Pre-Extraction

  • Use tes3conv to convert .esm → JSON
  • Extract: NPC locations, dialogue trees (topic → response + conditions), path grid nodes
  • Build NetworkX navigation graph from path grid data
  • Pre-evaluate dialogue conditions at load time → know available topics before conversation
  • Store in SQLite (~200MB) for O(1) lookup
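The path-grid-to-graph step is mechanical once the ESM is JSON. A dependency-free sketch (the JSON field names `points`/`connections` are assumptions about tes3conv's output shape, not verified; the paper builds the real graph with NetworkX, approximated here by a plain adjacency dict so the example is self-contained):

```python
# Build a navigation graph from tes3conv-style path-grid JSON and find a
# route with BFS (the adjacency-dict equivalent of nx.shortest_path).
import json
from collections import deque

pathgrid_json = '''{
  "points": [[0, 0, 0], [100, 0, 0], [100, 100, 0]],
  "connections": [[0, 1], [1, 2]]
}'''

grid = json.loads(pathgrid_json)
adj: dict[int, list[int]] = {i: [] for i in range(len(grid["points"]))}
for a, b in grid["connections"]:
    adj[a].append(b)
    adj[b].append(a)   # path-grid edges are traversable both ways

def shortest_path(start: int, goal: int) -> list[int]:
    """Unweighted BFS over path-grid node indices."""
    prev = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            path = []
            while node is not None:
                path.append(node)
                node = prev[node]
            return path[::-1]
        for nxt in adj[node]:
            if nxt not in prev:
                prev[nxt] = node
                queue.append(nxt)
    return []
```

Precomputing this per cell at load time is what makes runtime navigation an O(1)-ish SQLite lookup plus a graph query rather than visual exploration.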

UESP RAG Pipeline

  • Scrape ~4,300 quest/location/NPC pages from UESP wiki
  • Chunk into ~15K passages, embed with nomic-embed-text-v1.5
  • Store in ChromaDB (~500MB)
  • Runtime query: e.g., "What do I do after delivering the package to Caius?"
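The runtime side of the pipeline is embed-query, rank-by-similarity, return top-k. A toy sketch of that shape (a bag-of-words counter stands in for nomic-embed-text-v1.5, and an in-memory list stands in for ChromaDB; only the retrieve-then-answer flow mirrors the real pipeline):

```python
# Toy dense-retrieval flow: embed passages once, embed the query at runtime,
# return the most similar passage by cosine similarity.
import math
import re
from collections import Counter

def embed(text: str) -> Counter:
    # Stand-in embedder: lowercase word counts, punctuation stripped.
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Stand-in for the ~15K UESP passages stored in ChromaDB.
passages = [
    "After delivering the package to Caius Cosades, ask him about orders.",
    "Balmora is reached by silt strider from Seyda Neen.",
]
index = [(p, embed(p)) for p in passages]

def query(q: str, k: int = 1) -> list[str]:
    qv = embed(q)
    ranked = sorted(index, key=lambda pe: cosine(qv, pe[1]), reverse=True)
    return [p for p, _ in ranked[:k]]

top = query("What do I do after delivering the package to Caius?")
```

The retrieved passage then goes into the LLM prompt as context, so quest knowledge never has to live in model weights.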

GBNF Grammar for Constrained Decoding

  • Forces LLM output into valid game commands only
  • Grammar covers: move_to, interact, attack, use_item, cast_spell, wait, dialogue_choose, navigate_menu
  • Eliminates parsing failures and hallucinated actions
  • Supported by llama.cpp and MLX
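For flavor, a fragment of what such a grammar looks like in GBNF — this covers three of the listed commands; the argument shapes are assumptions for illustration, not the paper's actual grammar:

```gbnf
# Illustrative GBNF fragment. Command names match the list above;
# argument shapes are assumed.
root     ::= command
command  ::= move-to | interact | wait
move-to  ::= "move_to(" number "," number "," number ")"
interact ::= "interact(" ident ")"
wait     ::= "wait(" number ")"
ident    ::= [a-zA-Z_] [a-zA-Z0-9_]*
number   ::= "-"? [0-9]+ ("." [0-9]+)?
```

Because decoding is constrained to this grammar, the sampler literally cannot emit a token sequence that fails to parse as a command.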

Apple Silicon Optimization (M3 Max)

| Co-processor | Cores | Use |
|--------------|-------|-----|
| Neural Engine | 16 (18 TOPS) | Core ML UI classifier (~2MB model, <5ms) |
| GPU | 40 (Metal) | All LLM/VLM via MLX (21-87% faster than llama.cpp) |
| CPU | 16 | Orchestration, pathfinding, BT execution, ESM queries |
  • mlx-vlm (v0.3.11) for VLM inference, vllm-mlx for production serving with prefix caching (28x speedup)
  • ScreenCaptureKit for hardware-accelerated window capture (2-8ms latency)

Memory Budget (128GB Unified)

| Component | RAM |
|-----------|-----|
| OpenMW game client | ~2GB |
| Qwen3-3B Q8 (T1) | 3.5GB |
| Llama-8B Q4 (T2) | 5GB |
| Qwen3-32B Q4 (T3, on-demand) | 20GB |
| FastVLM-0.5B 4-bit | 350MB |
| nomic-embed-text (RAG) | 300MB |
| ChromaDB + knowledge base | ~500MB |
| ESM databases (SQLite + NetworkX) | ~200MB |
| **Total** | **~40GB (88GB headroom)** |

Complete Heartbeat Loop

The paper provides full PerceptionStack pseudocode with a 6-phase tick() method:

  1. API perception (~1ms) — read Lua state
  2. Behavior tree check (~0ms) — if BT confident >0.9, execute immediately
  3. Screen capture + CV (~20ms) — only if BT can't handle, with frame change detection
  4. Text extraction (~40ms) — dialogue prediction first, OCR fallback
  5. VLM (~300ms) — only for unknown UI state or stuck detection
  6. LLM decision (50-1500ms) — tiered model selection, game pauses for T3
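Condensed, the six phases are a short pipeline of early-exits. A sketch in which every subsystem is stubbed as a plain function, so the control flow (phase ordering, the 0.9 BT threshold, capture-on-demand) is the only real content:

```python
# Condensed 6-phase tick(): each phase either resolves the tick or hands
# off to the next, more expensive one. All subsystems are illustrative stubs.

def tick(api, bt, screen, ocr, vlm, llm):
    state = api()                              # 1. Lua API read (~1ms)
    plan = bt(state)
    if plan and plan["confidence"] > 0.9:      # 2. BT early-exit (~70% of ticks)
        return plan["action"]
    frame = screen()                           # 3. capture only if BT punted
    if state.get("in_dialogue"):
        state["text"] = ocr(frame)             # 4. OCR fallback for dialogue
    if state.get("ui_state") == "unknown":
        state["visual"] = vlm(frame)           # 5. VLM only for ambiguity
    return llm(state)                          # 6. tiered LLM decision

# With a confident BT, the tick resolves without any inference at all.
action = tick(
    api=lambda: {"in_dialogue": False, "ui_state": "gameplay"},
    bt=lambda s: {"confidence": 0.95, "action": "walk_north"},
    screen=lambda: None,
    ocr=lambda f: "",
    vlm=lambda f: "",
    llm=lambda s: "llm_fallback",
)
# action == "walk_north"; screen/ocr/vlm/llm were never consulted
```

The weighted-average latency claim falls out of this structure: the expensive branches exist, but most ticks return at phase 2.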

Expected Performance

| Scenario | Frequency | Latency | Cloud Calls |
|----------|-----------|---------|-------------|
| BT-handled (walk, attack, loot) | 70% | ~2ms | 0 |
| T1 LLM (simple choice) | 20% | ~100-150ms | 0 |
| T2 LLM + OCR (dialogue, inventory) | 7% | ~300-500ms | 0 |
| T3 LLM + VLM (quest planning, stuck) | 3% | ~1.5-2.5s (paused) | 0 |
| **Weighted average** | — | **~70ms** | **0** |

VLM Model Recommendations

  • FastVLM-0.5B (4-bit, 350MB) — primary vision fallback, Apple-optimized
  • Moondream-2B — more capable alternative
  • Qwen2.5-VL-3B — strongest small VLM

Conclusion

The fundamental shift: stop treating Morrowind like a black-box screen to interpret, start treating it like an API to query. Pre-computation eliminates runtime discovery. Behavior trees handle 70% at zero cost. OpenMW pause converts latency problems into correctness problems. The entire stack fits in ~40GB with 88GB headroom.


PDF attached below. See cross-reference comment for links to related tickets.

Author
Collaborator

Cross-References

Work Suggestions (from this paper)

  • #964 — Implement OpenMW Lua perception bridge (IPC layer)
  • #965 — Build Core ML UI state classifier for Morrowind
  • #966 — Implement three-tier metabolic LLM router (Qwen3-3B / Llama-8B / Qwen3-32B)
  • #967 — Extract ESM data via tes3conv and build NetworkX navigation graph
  • #968 — Define GBNF grammar for constrained game-command decoding
  • #969 — Build UESP RAG knowledge pipeline (ChromaDB + nomic-embed)
  • #970 — Implement MorrowindBehaviorTree engine for zero-cost routine actions

Architecture / PRs:

  • PR #900 — WorldInterface + Heartbeat v2 (the heartbeat loop this paper's tick() would integrate with)
  • PR #864 — Morrowind Protocol + Command Log (command dispatch aligns with send_command())
  • PR #865 — FastAPI Harness + SOUL.md Framework

Sovereignty Loop Implementation (#953 children):

  • #954 — Metrics emitter (feeds into this paper's performance profiling)
  • #955 — PerceptionCache wrapper (directly implements this paper's perception hierarchy caching)
  • #956 — Skill library crystallizer (complements UESP RAG for learned behaviors)
  • #957 — Navigation graph builder (overlaps with #967 — ESM path grid extraction)
  • #958 — Dashboard widget for session data
  • #959 — Narration templates
  • #960 — Nav graph for Morrowind (direct overlap with #967)
  • #961 — Auto-crystallizer
  • #962 — Three-strike anomaly detector

Autoresearch (#904 children):

  • #905-#911 — Self-improvement loop infrastructure (experiment governance applies to perception stack experiments)

Other Studies:

  • #903 — State-of-the-Art Open Source for Sovereign Creative AI Agents
  • #953 — The Sovereignty Loop — Falsework-Native Architecture

Key Overlaps to Resolve

  1. #957/#960 vs #967: Navigation graph appears in both Sovereignty Loop and this paper. #967 is the more detailed spec (tes3conv + NetworkX). Consider merging.
  2. #955 vs #965: PerceptionCache (#955) and Core ML classifier (#965) are complementary — the cache wraps the classifier output.
  3. Heartbeat loop: PR #900's heartbeat v2 is the execution framework; this paper's tick() pseudocode is the perception-specific implementation that runs inside it.
gemini was assigned by Rockachopa 2026-03-22 23:31:02 +00:00
Author
Collaborator

📎 Cross-reference: #1074 — Timmy Handoff contains the full perception stack solution from Report 6. Also see the Session Crystallization v2 attached to #982 which details the 4-level perception hierarchy with latency targets and the pre-computation strategy (ESM parsing, navigation graph, dialogue pre-eval, UESP quest KB).

claude added the harness, morrowind, p1-important labels 2026-03-23 13:54:03 +00:00

Reference: Rockachopa/Timmy-time-dashboard#963