[claude] Qwen3 two-model strategy: 14B primary + 8B fast router (#1063) (#1143)

2026-03-23 18:35:57 +00:00
parent 128aa4427f
commit ed63877f75
4 changed files with 409 additions and 11 deletions
--- a/Modelfile.qwen3-8b
+++ b/Modelfile.qwen3-8b
@@ -0,0 +1,43 @@
+# Modelfile.qwen3-8b
+#
+# Qwen3-8B Q6_K — Fast routing model for routine agent tasks (Issue #1063)
+#
+# Tool calling F1: 0.933 at ~45-55 tok/s — 2x speed of Qwen3-14B.
+# Use for: simple tool calls, shell commands, file reads, status checks, JSON ops.
+# Route complex tasks (issue triage, multi-step planning, code review) to qwen3:14b.
+#
+# Build:
+#   ollama pull qwen3:8b
+#   ollama create qwen3-8b -f Modelfile.qwen3-8b
+#
+# Memory budget: ~6.6 GB weights + ~5 GB KV cache = ~11.6 GB at 32K ctx
+# Two-model strategy: ~17 GB combined (both hot) — fits on M3 Max 36 GB.
+# Set OLLAMA_MAX_LOADED_MODELS=2 in the Ollama environment.
+#
+# Generation: ~35-45 tok/s (Ollama) / ~45-60 tok/s (MLX)
+
+FROM qwen3:8b
+
+# 32K context
+PARAMETER num_ctx 32768
+
+# Low temperature for consistent, near-deterministic tool calls
+PARAMETER temperature 0.2
+
+# Nucleus sampling: draw from the smallest token set whose cumulative probability reaches 0.9
+PARAMETER top_p 0.9
+
+# Mild repeat penalty (1.0 disables it) to discourage repetition loops
+PARAMETER repeat_penalty 1.05
+
+SYSTEM """You are Timmy's fast-routing agent. You handle routine tasks quickly and precisely.
+
+For simple tasks (tool calls, shell commands, file reads, status checks, JSON ops): respond immediately without a think block.
+For anything requiring multi-step planning: defer to the primary agent.
+
+Tool call format:
+<tool_call>
+{"name": "function_name", "arguments": {"param": "value"}}
+</tool_call>
+
+Be brief. Be accurate. Execute."""
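
The system prompt above asks the model to wrap each tool call in <tool_call> tags around a JSON object with "name" and "arguments" keys. A minimal sketch of how a caller might pull those calls out of a reply is shown here. The function name extract_tool_calls and the read_file tool in the usage note are hypothetical and not part of this commit; the sketch assumes the model follows the format exactly and silently skips anything that does not parse.

```python
import json
import re

# Matches the <tool_call>...</tool_call> wrapper described in the SYSTEM prompt.
# DOTALL lets the JSON body span multiple lines.
TOOL_CALL_RE = re.compile(r"<tool_call>(.*?)</tool_call>", re.DOTALL)


def extract_tool_calls(reply):
    """Return a list of (name, arguments) tuples found in a model reply.

    Hypothetical helper: blocks whose body is not valid JSON, or that lack
    a string "name" or a dict "arguments", are skipped rather than raised.
    """
    calls = []
    for match in TOOL_CALL_RE.finditer(reply):
        try:
            payload = json.loads(match.group(1).strip())
        except json.JSONDecodeError:
            continue  # malformed JSON inside the tags
        if not isinstance(payload, dict):
            continue
        name = payload.get("name")
        arguments = payload.get("arguments", {})
        if isinstance(name, str) and isinstance(arguments, dict):
            calls.append((name, arguments))
    return calls
```

For example, a reply containing one block with {"name": "read_file", "arguments": {"path": "a.txt"}} would yield a single ("read_file", {"path": "a.txt"}) tuple. A reply with no tags, or with non-JSON text between the tags, yields an empty list, which a router could treat as a plain-text answer.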