Implement three-tier metabolic LLM router (Qwen3-3B / Llama-8B / Qwen3-32B) #966

Closed
opened 2026-03-22 18:45:49 +00:00 by perplexity · 3 comments
Collaborator

Parent

  • #963 — [Study] Solving the Perception Bottleneck

Objective

Build the tiered LLM decision engine that routes queries to the cheapest-sufficient model, using MLX for all inference on Apple Silicon GPU.

Scope

  • T1 — Routine (Qwen3-3B Q8_0, 3.5GB): Simple navigation, basic choices. Always loaded. ~100-150ms.
  • T2 — Medium (Llama-3.1-8B Q4_K_M, 5GB): Dialogue responses, inventory management. Always loaded. ~300-500ms.
  • T3 — Complex (Qwen3-32B Q4_K_M, 20GB): Quest planning, stuck recovery. Loaded on demand. ~1.5-2.5s with game paused.
  • Implement classify_complexity(task, state) function that determines which tier to use
  • Integrate with vllm-mlx for production serving with prefix caching (28x speedup on repeated prompts)
  • Implement prompt building: build_prompt(state, ui_state, text, visual_context)
  • All inference via mlx-vlm / MLX — 21-87% faster than llama.cpp on Apple Silicon
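The PR comment below notes that classification is keyword- plus state-based. A minimal sketch of what `classify_complexity(task, state)` could look like under that assumption (the keyword sets, the `stuck_ticks` field, and the `Tier` enum are all illustrative, not the actual implementation):

```python
from enum import Enum

class Tier(Enum):
    T1 = "routine"   # Qwen3-3B, always loaded
    T2 = "medium"    # Llama-3.1-8B, always loaded
    T3 = "complex"   # Qwen3-32B, loaded on demand

# Illustrative keyword sets; the real classifier may use richer signals.
_T3_KEYWORDS = {"quest", "plan", "stuck"}
_T2_KEYWORDS = {"dialogue", "inventory", "trade"}

def classify_complexity(task: str, state: dict) -> Tier:
    """Route a task to the cheapest-sufficient tier."""
    words = set(task.lower().split())
    # Escalate to T3 for planning keywords or when the agent looks stuck.
    if words & _T3_KEYWORDS or state.get("stuck_ticks", 0) > 10:
        return Tier.T3
    if words & _T2_KEYWORDS:
        return Tier.T2
    return Tier.T1
```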

Key Design Notes

  • T3 triggers world.pause() before inference and world.unpause() after
  • GBNF grammar constrains all output to valid game commands (see sibling issue)
  • 70% of ticks never reach the LLM (handled by behavior trees)
  • Memory budget: T1+T2 always loaded (~8.5GB), T3 on-demand (+20GB)
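The pause-around-T3 flow above can be sketched as follows. The `world`, `models`, and `classifier` interfaces here are assumed shapes for illustration; only the `pause()`/`unpause()` bracketing comes from the design note:

```python
class MetabolicRouter:
    """Sketch of route(): classify, then dispatch, freezing the world for T3.

    `world` is assumed to expose pause()/unpause(); `models` maps a tier
    name to a callable(prompt) -> reply; `classifier` is (task, state) -> tier.
    """

    def __init__(self, world, models, classifier):
        self.world = world
        self.models = models
        self.classifier = classifier

    def route(self, task, state, prompt):
        tier = self.classifier(task, state)
        if tier == "T3":
            # T3 inference takes ~1.5-2.5 s, so the game world is paused
            # for the duration and resumed even if inference raises.
            self.world.pause()
            try:
                return self.models[tier](prompt)
            finally:
                self.world.unpause()
        return self.models[tier](prompt)
```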

References

  • Paper §Three-Tier Metabolic Model (pp. 9-10)
  • Paper §Memory Budget (p. 16)
  • Related: #954 (metrics emitter), #955 (PerceptionCache)
claude was assigned by Rockachopa 2026-03-22 23:30:51 +00:00
Author
Collaborator

📎 Cross-reference: #1063 — [Study] Best Local Uncensored Agent Model for M3 Max 36GB

This new study provides the specific model recommendations for the router tiers:

  • Fast tier (Tier 1): Qwen3-8B Q6_K — 0.933 F1 tool calling, ~45–55 tok/s
  • Primary tier (Tier 2): Qwen3-14B Q5_K_M — 0.971 F1 tool calling, ~20–28 tok/s
  • Cloud fallback (Tier 3): Addressed separately

The dual-model routing strategy (#1065) directly implements tiers 1 and 2.

Author
Collaborator

📎 Cross-reference: #1075 — Claude Quota Monitor + Metabolic Protocol. The metabolic protocol (BURST/ACTIVE/RESTING) governs tier selection based on Claude quota state. Wire QuotaMonitor.should_use_cloud() into cascade.py to auto-downshift.
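A minimal sketch of how that downshift hook could look inside `cascade.py`. The `should_use_cloud()` name comes from the cross-referenced issue; the `select_tier` helper and its semantics are assumptions for illustration:

```python
def select_tier(requested: str, quota_monitor) -> str:
    """Hypothetical cascade.py hook: downshift cloud-bound (T3) requests to
    the local T2 model when the quota monitor says cloud is off-limits.

    `quota_monitor` is assumed to expose should_use_cloud() -> bool,
    reflecting the BURST/ACTIVE/RESTING metabolic state.
    """
    if requested == "T3" and not quota_monitor.should_use_cloud():
        return "T2"  # quota-constrained: stay local
    return requested
```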

claude added the harness, inference, p0-critical labels 2026-03-23 13:52:56 +00:00
Collaborator

PR created: #1160

Implemented the three-tier metabolic LLM router:

  • classify_complexity(task, state) → T1/T2/T3 based on keywords + game state
  • build_prompt(state, ui_state, text, visual_context) → OpenAI-compatible messages
  • MetabolicRouter.route() → classify + prompt + dispatch, T3 pauses world during inference
  • Default models: T1=qwen3:8b, T2=qwen3:14b, T3=qwen3:30b (all overridable)
  • 48 unit tests, all passing in tox -e unit
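The PR describes `build_prompt` as producing OpenAI-compatible messages. A sketch of that shape, assuming a simple system/user split (the system prompt wording and field layout here are illustrative, not the merged code):

```python
def build_prompt(state, ui_state, text, visual_context):
    """Assemble an OpenAI-compatible messages list from game observations.

    The exact system prompt and field formatting are assumptions; only the
    signature and the messages-list output format come from the PR summary.
    """
    system = "You are the game agent. Respond with a single valid game command."
    user = (
        f"Game state: {state}\n"
        f"UI: {ui_state}\n"
        f"On-screen text: {text}\n"
        f"Visual context: {visual_context}"
    )
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": user},
    ]
```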

Reference: Rockachopa/Timmy-time-dashboard#966