[claude] Qwen3 two-model strategy: 14B primary + 8B fast router (#1063) (#1143)

2026-03-23 18:35:57 +00:00
parent 128aa4427f
commit ed63877f75
4 changed files with 409 additions and 11 deletions
--- a/Modelfile.qwen3-14b
+++ b/Modelfile.qwen3-14b
@@ -0,0 +1,51 @@
+# Modelfile.qwen3-14b
+#
+# Qwen3-14B Q5_K_M — Primary local agent model (Issue #1063)
+#
+# Tool calling F1: 0.971 — GPT-4-class structured output reliability.
+# Hybrid thinking/non-thinking mode: toggle per-request via /think or /no_think
+# in the prompt for planning vs rapid execution.
+#
+# Build:
+#   ollama pull qwen3:14b            # downloads Q4_K_M (~8.2 GB) by default
+#   # For Q5_K_M (~10.5 GB, recommended):
+#   # ollama pull bartowski/Qwen3-14B-GGUF:Q5_K_M
+#   ollama create qwen3-14b -f Modelfile.qwen3-14b
+#
+# Memory budget: ~10.5 GB weights + ~7 GB KV cache = ~17.5 GB total at 32K ctx
+# Headroom on M3 Max 36 GB: ~10.5 GB free (enough to run qwen3:8b simultaneously)
+# Generation: ~20-28 tok/s (Ollama) / ~28-38 tok/s (MLX)
+# Context: 32K native, extensible to 131K with YaRN
+#
+# Two-model strategy: set OLLAMA_MAX_LOADED_MODELS=2 so qwen3:8b stays
+# hot for fast routing while qwen3:14b handles complex tasks.
+
+FROM qwen3:14b
+
+# 32K context — optimal balance of quality and memory on M3 Max 36 GB.
+# At 32K, total memory (weights + KV cache) is ~17.5 GB — well within budget.
+# Extend to 131K with YaRN if needed: PARAMETER rope_scaling_type yarn
+PARAMETER num_ctx 32768
+
+# Tool-calling temperature — lower = more reliable structured JSON output.
+# Raise to 0.7+ for creative/narrative tasks.
+PARAMETER temperature 0.3
+
+# Nucleus sampling
+PARAMETER top_p 0.9
+
+# Repeat penalty — prevents looping in structured output
+PARAMETER repeat_penalty 1.05
+
+SYSTEM """You are Timmy, Alexander's personal sovereign AI agent.
+
+You are concise, direct, and helpful. You complete tasks efficiently and report results clearly. You do not add unnecessary caveats or disclaimers.
+
+You have access to tool calling. When you need to use a tool, output a valid JSON function call:
+<tool_call>
+{"name": "function_name", "arguments": {"param": "value"}}
+</tool_call>
+
+You support hybrid reasoning. For complex planning, include <think>...</think> before your answer. For rapid execution (simple tool calls, status checks), skip the think block.
+
+You always start your responses with "Timmy here:" when acting as an agent."""