Implement three-tier metabolic LLM router (Qwen3-3B / Llama-8B / Qwen3-32B) #966

Closed
opened 2026-03-22 18:45:49 +00:00 by perplexity · 3 comments
Collaborator

Parent

  • #963 — [Study] Solving the Perception Bottleneck

Objective

Build the tiered LLM decision engine that routes queries to the cheapest-sufficient model, using MLX for all inference on Apple Silicon GPU.

Scope

  • T1 — Routine (Qwen3-3B Q8_0, 3.5GB): Simple navigation, basic choices. Always loaded. ~100-150ms.
  • T2 — Medium (Llama-3.1-8B Q4_K_M, 5GB): Dialogue responses, inventory management. Always loaded. ~300-500ms.
  • T3 — Complex (Qwen3-32B Q4_K_M, 20GB): Quest planning, stuck recovery. Loaded on demand. ~1.5-2.5s with game paused.
  • Implement classify_complexity(task, state) function that determines which tier to use
  • Integrate with vllm-mlx for production serving with prefix caching (28x speedup on repeated prompts)
  • Implement prompt building: build_prompt(state, ui_state, text, visual_context)
  • All inference via mlx-vlm / MLX — 21-87% faster than llama.cpp on Apple Silicon
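The PR comment below notes that classification is keyword- plus state-based. A minimal sketch of what `classify_complexity(task, state)` could look like under that assumption (the keyword sets, the `stuck_ticks` field, and the `Tier` enum are all illustrative, not the actual implementation):

```python
from enum import Enum

class Tier(Enum):
    T1 = "routine"   # Qwen3-3B, always loaded
    T2 = "medium"    # Llama-3.1-8B, always loaded
    T3 = "complex"   # Qwen3-32B, loaded on demand

# Illustrative keyword sets; the real classifier may use richer signals.
_T3_KEYWORDS = {"quest", "plan", "stuck"}
_T2_KEYWORDS = {"dialogue", "inventory", "trade"}

def classify_complexity(task: str, state: dict) -> Tier:
    """Route a task to the cheapest-sufficient tier."""
    words = set(task.lower().split())
    # Escalate to T3 for planning keywords or when the agent looks stuck.
    if words & _T3_KEYWORDS or state.get("stuck_ticks", 0) > 10:
        return Tier.T3
    if words & _T2_KEYWORDS:
        return Tier.T2
    return Tier.T1
```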

Key Design Notes

  • T3 triggers world.pause() before inference and world.unpause() after
  • GBNF grammar constrains all output to valid game commands (see sibling issue)
  • 70% of ticks never reach the LLM (handled by behavior trees)
  • Memory budget: T1+T2 always loaded (~8.5GB), T3 on-demand (+20GB)
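The pause-around-T3 flow above can be sketched as follows. The `world`, `models`, and `classifier` interfaces here are assumed shapes for illustration; only the `pause()`/`unpause()` bracketing comes from the design note:

```python
class MetabolicRouter:
    """Sketch of route(): classify, then dispatch, freezing the world for T3.

    `world` is assumed to expose pause()/unpause(); `models` maps a tier
    name to a callable(prompt) -> reply; `classifier` is (task, state) -> tier.
    """

    def __init__(self, world, models, classifier):
        self.world = world
        self.models = models
        self.classifier = classifier

    def route(self, task, state, prompt):
        tier = self.classifier(task, state)
        if tier == "T3":
            # T3 inference takes ~1.5-2.5 s, so the game world is paused
            # for the duration and resumed even if inference raises.
            self.world.pause()
            try:
                return self.models[tier](prompt)
            finally:
                self.world.unpause()
        return self.models[tier](prompt)
```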

References

  • Paper §Three-Tier Metabolic Model (pp. 9-10)
  • Paper §Memory Budget (p. 16)
  • Related: #954 (metrics emitter), #955 (PerceptionCache)
claude was assigned by Rockachopa 2026-03-22 23:30:51 +00:00
Author
Collaborator

📎 Cross-reference: #1063 — [Study] Best Local Uncensored Agent Model for M3 Max 36GB

This new study provides the specific model recommendations for the router tiers:

  • Fast tier (Tier 1): Qwen3-8B Q6_K — 0.933 F1 tool calling, ~45–55 tok/s
  • Primary tier (Tier 2): Qwen3-14B Q5_K_M — 0.971 F1 tool calling, ~20–28 tok/s
  • Cloud fallback (Tier 3): Addressed separately

The dual-model routing strategy (#1065) directly implements tiers 1 and 2.

Author
Collaborator

📎 Cross-reference: #1075 — Claude Quota Monitor + Metabolic Protocol. The metabolic protocol (BURST/ACTIVE/RESTING) governs tier selection based on Claude quota state. Wire QuotaMonitor.should_use_cloud() into cascade.py to auto-downshift.
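A minimal sketch of how that downshift hook could look inside `cascade.py`. The `should_use_cloud()` name comes from the cross-referenced issue; the `select_tier` helper and its semantics are assumptions for illustration:

```python
def select_tier(requested: str, quota_monitor) -> str:
    """Hypothetical cascade.py hook: downshift cloud-bound (T3) requests to
    the local T2 model when the quota monitor says cloud is off-limits.

    `quota_monitor` is assumed to expose should_use_cloud() -> bool,
    reflecting the BURST/ACTIVE/RESTING metabolic state.
    """
    if requested == "T3" and not quota_monitor.should_use_cloud():
        return "T2"  # quota-constrained: stay local
    return requested
```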

claude added the harness, inference, p0-critical labels 2026-03-23 13:52:56 +00:00
Collaborator

PR created: #1160

Implemented the three-tier metabolic LLM router:

  • classify_complexity(task, state) → T1/T2/T3 based on keywords + game state
  • build_prompt(state, ui_state, text, visual_context) → OpenAI-compatible messages
  • MetabolicRouter.route() → classify + prompt + dispatch, T3 pauses world during inference
  • Default models: T1=qwen3:8b, T2=qwen3:14b, T3=qwen3:30b (all overridable)
  • 48 unit tests, all passing in tox -e unit
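The PR describes `build_prompt` as producing OpenAI-compatible messages. A sketch of that shape, assuming a simple system/user split (the system prompt wording and field layout here are illustrative, not the merged code):

```python
def build_prompt(state, ui_state, text, visual_context):
    """Assemble an OpenAI-compatible messages list from game observations.

    The exact system prompt and field formatting are assumptions; only the
    signature and the messages-list output format come from the PR summary.
    """
    system = "You are the game agent. Respond with a single valid game command."
    user = (
        f"Game state: {state}\n"
        f"UI: {ui_state}\n"
        f"On-screen text: {text}\n"
        f"Visual context: {visual_context}"
    )
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": user},
    ]
```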

Reference: Rockachopa/Timmy-time-dashboard#966