[Core] Model Tiering Router — Local 8B / Hermes 70B / Cloud API Cascade #882

Closed
opened 2026-03-21 23:40:08 +00:00 by perplexity · 1 comment
Collaborator

Why This Is High Leverage

The feasibility guide recommends a three-tier model strategy: local 7-8B for routine actions (free, fast), Hermes 70B for complex reasoning (~8-12 tok/s on M3 Max), and Claude/GPT-4o for difficult decisions ($5-15/hr). Timmy's existing `src/infrastructure/models/registry.py` and `src/infrastructure/router/cascade.py` have stubs for this, but there's no intelligence in the routing. This ticket makes the LLM selection automatic based on task complexity, keeping costs near zero for 90% of operations.

Scope

Build an intelligent model router that selects the right LLM tier for each request.

Tier 1: Local Fast (Free)

  • Llama 3.1 8B or Hermes 3 8B via Ollama
  • For: navigation, basic interactions, perception processing, simple decisions
  • Latency: 0.3-1s

Tier 2: Local Heavy (Free, Slower)

  • Hermes 3/4 70B via Ollama on M3 Max 128GB
  • For: quest planning, dialogue strategy, combat tactics, complex reasoning
  • Latency: 8-12 tok/s (~5-10s for 200 tokens)

Tier 3: Cloud API (Paid, Best)

  • Claude or GPT-4o
  • For: recovery from failures, novel situations, multi-step planning that failed on Tier 2
  • Latency: 1-3s, cost ~$5-15/hr of heavy use
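The three tiers above can be captured as an enum plus per-tier metadata. This is a minimal sketch: the `Tier`/`TierSpec` names, model identifiers, and cost figure are illustrative assumptions, not the actual `registry.py` API.

```python
from dataclasses import dataclass
from enum import Enum


class Tier(Enum):
    LOCAL_FAST = 1   # 8B via Ollama: navigation, simple decisions
    LOCAL_HEAVY = 2  # 70B via Ollama: planning, complex reasoning
    CLOUD_API = 3    # Claude/GPT-4o: recovery, novel situations


@dataclass(frozen=True)
class TierSpec:
    model: str
    cost_per_1k_tokens_usd: float  # 0.0 for the local tiers


# Illustrative defaults matching the tier descriptions above.
TIERS = {
    Tier.LOCAL_FAST: TierSpec("hermes3:8b", 0.0),
    Tier.LOCAL_HEAVY: TierSpec("hermes3:70b", 0.0),
    Tier.CLOUD_API: TierSpec("claude", 0.015),  # rough order of magnitude only
}
```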

Routing Logic

  • Task complexity classifier (keyword/pattern-based initially, learnable later)
  • Automatic escalation: if Tier 1 response quality is low, retry on Tier 2
  • Budget tracking: daily/monthly cloud API spend limits
  • Fallback chain: if Tier 3 unavailable, Tier 2 handles everything
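A keyword-based classifier with automatic escalation could look like the sketch below. The pattern list, `run`, and `quality_ok` callables are hypothetical stand-ins for the real backend and quality-check interfaces.

```python
import re

# Keyword heuristics for "complex reasoning" tasks; a learnable
# classifier can replace this list later.
_HEAVY_PATTERNS = [r"\bplan\b", r"\bstrategy\b", r"\boptimal\b", r"\btactic", r"\bquest\b"]


def classify_tier(task: str) -> str:
    """Return 'local_heavy' for planning/reasoning tasks, else 'local_fast'."""
    lowered = task.lower()
    if any(re.search(p, lowered) for p in _HEAVY_PATTERNS):
        return "local_heavy"
    return "local_fast"


def route(task: str, run, quality_ok) -> str:
    """Run on the classified tier; escalate T1 -> T2 if quality is low."""
    tier = classify_tier(task)
    response = run(tier, task)
    if tier == "local_fast" and not quality_ok(response):
        response = run("local_heavy", task)  # automatic escalation retry
    return response
```

With these patterns, "Walk to the next room" classifies as `local_fast` and "Plan the optimal path to become Hortator" as `local_heavy`, matching the acceptance criteria below.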

Requirements

  • `src/infrastructure/models/router.py` — Task classifier + tier selection
  • `src/infrastructure/models/budget.py` — Cloud API spend tracking + limits
  • Integrate with existing `registry.py` and `cascade.py`
  • Configurable thresholds and tier assignments
  • Logging: which tier handled which request, latency, estimated cost
  • Tests with mock LLM backends

Acceptance Criteria

  • "Walk to the next room" routes to Tier 1
  • "Plan the optimal path to become Hortator" routes to Tier 2
  • Failed Tier 1 response automatically escalates to Tier 2
  • Cloud API spend stays within configured budget
  • Routing decisions logged for analysis
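The mock-backend tests requirement can exercise the escalation criterion without any real LLM. A sketch, assuming a hypothetical backend interface with a `generate()` method:

```python
class MockBackend:
    """Records every prompt it serves and returns a canned reply."""

    def __init__(self, tier_name: str, reply: str = "ok"):
        self.tier = tier_name
        self.reply = reply
        self.calls: list[str] = []

    def generate(self, prompt: str) -> str:
        self.calls.append(prompt)
        return self.reply


def escalating_call(fast: MockBackend, heavy: MockBackend, prompt: str) -> str:
    # Treat an empty reply as a low-quality Tier 1 response.
    response = fast.generate(prompt)
    if not response.strip():
        response = heavy.generate(prompt)  # escalate T1 -> T2
    return response
```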

Assignee: Kimi

claude was assigned by Rockachopa 2026-03-22 23:33:21 +00:00
claude added the harness, inference, p0-critical labels 2026-03-23 13:52:55 +00:00
Collaborator

PR created: http://143.198.27.163:3000/Rockachopa/Timmy-time-dashboard/pulls/1297

Summary of changes:

  • `src/infrastructure/models/router.py` — `TieredModelRouter` + `classify_tier()` heuristic classifier. Tier 1 (LOCAL_FAST, 8B) for navigation/simple actions; Tier 2 (LOCAL_HEAVY, 70B) for planning/complex reasoning; Tier 3 (CLOUD_API) for failures or explicit override. Auto-escalation from T1 on low-quality responses. Budget guard prevents cloud overspend.

  • `src/infrastructure/models/budget.py` — `BudgetTracker` with SQLite persistence and in-memory fallback. Tracks daily/monthly cloud API spend, enforces configurable limits, estimates costs for Claude/GPT-4o/Grok.

  • `src/config.py` — 5 new settings: `tier_local_fast_model`, `tier_local_heavy_model`, `tier_cloud_model`, `tier_cloud_daily_budget_usd` ($5 default), `tier_cloud_monthly_budget_usd` ($50 default).

  • 44 new unit tests. All 544 unit tests pass.

Acceptance criteria verified: "Walk to the next room" → LOCAL_FAST ✓, "Plan the optimal path to become Hortator" → LOCAL_HEAVY ✓, auto-escalation ✓, budget limits ✓, routing logged ✓.
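The five settings named above can be sketched as a plain dataclass. The setting names and dollar defaults come from the PR summary; the field types and model-id defaults are assumptions, since the actual `src/config.py` structure isn't shown here.

```python
from dataclasses import dataclass


@dataclass
class TierSettings:
    # Setting names and budget defaults from the PR summary;
    # model identifiers are illustrative placeholders.
    tier_local_fast_model: str = "hermes3:8b"
    tier_local_heavy_model: str = "hermes3:70b"
    tier_cloud_model: str = "claude"
    tier_cloud_daily_budget_usd: float = 5.0
    tier_cloud_monthly_budget_usd: float = 50.0
```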

Reference: Rockachopa/Timmy-time-dashboard#882