Implement Qwen3-8B / Qwen3-14B dual-model routing strategy #1065

Closed
opened 2026-03-23 12:52:24 +00:00 by perplexity · 1 comment
Collaborator

Parent: #1063
Related: #966 (three-tier LLM router)

Objective

Implement the two-model architecture recommended in the PDF: Qwen3-8B as a fast router for routine tasks, Qwen3-14B for complex tasks.

Architecture

  • Qwen3-8B Q6_K (~6.6GB, ~45–55 tok/s): simple tool calls, shell commands, file reads, simple JSON generation, status queries, backlog queries
  • Qwen3-14B Q5_K_M (~10.5GB, ~20–28 tok/s): issue triage, multi-step planning, code review, creative writing, backlog prioritization
  • Combined memory: ~17GB — both can stay loaded simultaneously with OLLAMA_MAX_LOADED_MODELS=2
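Keeping both models resident can be sketched as follows. This is an illustrative helper, not code from the PR: it builds an Ollama `/api/generate` request body with `keep_alive: -1`, which asks the server not to evict the model after the request completes (the server-side `OLLAMA_MAX_LOADED_MODELS=2` setting allows two models to coexist).

```python
# Sketch: pin both models in memory via Ollama's keep_alive option.
# Model names come from this issue; the helper itself is hypothetical.

ROUTINE_MODEL = "qwen3:8b"    # Q6_K, ~6.6GB
COMPLEX_MODEL = "qwen3:14b"   # Q5_K_M, ~10.5GB

def build_generate_payload(model: str, prompt: str) -> dict:
    """Build an Ollama /api/generate request body that keeps the model loaded."""
    return {
        "model": model,
        "prompt": prompt,
        "stream": False,
        "keep_alive": -1,  # -1 = keep the model resident indefinitely
    }
```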

Steps

  1. ollama pull qwen3:8b alongside the 14B model
  2. Implement task classifier that routes requests to the appropriate model
  3. Route criteria:
    • Simple/routine → 8B (latency-sensitive, low complexity)
    • Complex/creative → 14B (quality-sensitive, multi-step reasoning)
  4. Set OLLAMA_MAX_LOADED_MODELS=2 for concurrent loading
  5. Measure latency improvement on routine tasks
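A minimal version of the step-2 classifier might look like the sketch below. The keyword list and word-count threshold are illustrative placeholders, not the signals shipped in the PR; the point is that routing needs no LLM call, only cheap lexical checks.

```python
# Illustrative heuristic router: no LLM call, just cheap signals on the prompt.
ROUTINE_MODEL = "qwen3:8b"
COMPLEX_MODEL = "qwen3:14b"

# Hypothetical keyword list; tune against real traffic.
COMPLEX_KEYWORDS = {"plan", "review", "triage", "prioritize", "refactor", "design"}

def classify_task(prompt: str, max_routine_words: int = 40) -> str:
    """Return 'routine' or 'complex' based on cheap lexical signals."""
    words = prompt.lower().split()
    if any(w.strip(".,!?") in COMPLEX_KEYWORDS for w in words):
        return "complex"
    if len(words) > max_routine_words:  # long prompts tend to need reasoning
        return "complex"
    return "routine"

def route(prompt: str) -> str:
    """Map a prompt to the model that should serve it."""
    return COMPLEX_MODEL if classify_task(prompt) == "complex" else ROUTINE_MODEL
```

A misclassified complex task still produces an answer (just slower or weaker), so the heuristic can err toward "routine" for latency and be tightened later.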

Acceptance Criteria

  • Router correctly classifies task complexity
  • Routine tasks complete 2x faster via the 8B model
  • Complex tasks maintain quality via the 14B model
  • Both models stay loaded without memory pressure
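The 2x criterion above could be checked with a simple wall-clock harness like this sketch; it is model-agnostic and times any callable, so the same code works whether the call hits the 8B or 14B model:

```python
import time

def measure_latency(fn, *args, repeats: int = 5) -> float:
    """Return the median wall-clock seconds over `repeats` calls to fn(*args)."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn(*args)
        samples.append(time.perf_counter() - start)
    samples.sort()
    return samples[len(samples) // 2]

def meets_speedup_target(routine_s: float, complex_s: float, target: float = 2.0) -> bool:
    """Acceptance check: routine path at least `target`x faster than the complex path."""
    return complex_s / routine_s >= target
```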

Notes

This directly implements one tier of the three-tier LLM router from #966. The third tier (cloud API fallback) is addressed separately.

claude self-assigned this 2026-03-23 13:44:46 +00:00
claude added the harness, inference, p0-critical labels 2026-03-23 13:52:59 +00:00
Collaborator

PR created: #1157

Summary of changes:

  • classifier.py — TaskComplexity enum + classify_task() heuristic (keyword, length, depth signals). No LLM required.
  • cascade.py — _get_model_for_complexity() + wired into complete(). Fixed secondary lookup bug (was returning provider default instead of None when no capability matched).
  • providers.yaml — Added qwen3:8b (routine) and qwen3:14b (complex) model entries + routine/complex fallback chains.
  • config.py — ollama_max_loaded_models = 2 setting (OLLAMA_MAX_LOADED_MODELS env var).
  • 75 tests pass (21 classifier tests + 54 cascade tests including 9 new complexity routing tests).
Timmy closed this issue 2026-03-23 22:58:22 +00:00

Reference: Rockachopa/Timmy-time-dashboard#1065