[claude] Qwen3 two-model strategy: 14B primary + 8B fast router (#1063) #1143
Fixes #1063
## What this does
Translates the M3 Max 36 GB model selection study into production-ready artifacts.
## New files
- `Modelfile.qwen3-14b` – primary agent model (Qwen3-14B Q5_K_M, 32K context, temperature 0.3). Tool-calling F1 0.971, GPT-4-class reliability. ~17.5 GB total on the M3 Max, leaving ~10.5 GB of headroom. Hybrid thinking/non-thinking via `/think` and `/no_think`.
- `Modelfile.qwen3-8b` – fast routing model (Qwen3-8B Q6_K, 32K context, temperature 0.2). Tool-calling F1 0.933 at ~45–55 tok/s. ~11.6 GB total; combined with the 14B, ~17 GB, so both stay loaded simultaneously with `OLLAMA_MAX_LOADED_MODELS=2`.
- `scripts/benchmark_local_model.sh` – 5-test evaluation suite from the study.

## Config changes (`src/config.py`)

| Setting | Old | New |
|---|---|---|
| `ollama_model` | `qwen3:30b` | `qwen3:14b` |
| `ollama_num_ctx` | `4096` | `32768` |
| `ollama_fast_model` | (new) | `qwen3:8b` |
| `ollama_max_loaded_models` | (new) | `2` |
| `fallback_models[0]` | `llama3.1:8b-instruct` | `qwen3:8b` |

## Test plan
- `tox -e unit` – 20/20 tests pass
- `ollama create qwen3-14b -f Modelfile.qwen3-14b && ollama create qwen3-8b -f Modelfile.qwen3-8b`
- `./scripts/benchmark_local_model.sh qwen3:14b qwen3:8b`

🤖 Generated with Claude Code
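For reviewers unfamiliar with Ollama Modelfiles, the primary model's file might look roughly like this minimal sketch. The base tag is an assumption (the actual file in this PR may pin a specific Q5_K_M quantization); only the parameters stated above are shown.

```
# Sketch of Modelfile.qwen3-14b (illustrative, not the PR's actual file).
# FROM tag assumed; the real Modelfile may reference an explicit quant.
FROM qwen3:14b
PARAMETER temperature 0.3
PARAMETER num_ctx 32768
```

Build and run it with `ollama create qwen3-14b -f Modelfile.qwen3-14b`, as in the test plan.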
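The two-model split can be sketched as a config object plus a router. This is a hypothetical illustration of the intent, not the actual code in `src/config.py`: the field names follow the table above, but the class shape, the `pick_model` helper, and the task labels are assumptions.

```python
from dataclasses import dataclass, field

@dataclass
class OllamaConfig:
    """Illustrative defaults matching this PR's config table."""
    ollama_model: str = "qwen3:14b"        # primary agent model
    ollama_fast_model: str = "qwen3:8b"    # fast routing model (new)
    ollama_num_ctx: int = 32768            # was 4096
    ollama_max_loaded_models: int = 2      # keep both models resident (new)
    fallback_models: list = field(default_factory=lambda: ["qwen3:8b"])

def pick_model(cfg: OllamaConfig, task: str) -> str:
    """Hypothetical router: cheap routing/classification tasks go to the
    8B model; everything else uses the 14B primary."""
    fast_tasks = {"route", "classify", "triage"}
    return cfg.ollama_fast_model if task in fast_tasks else cfg.ollama_model

cfg = OllamaConfig()
print(pick_model(cfg, "route"))    # qwen3:8b
print(pick_model(cfg, "codegen"))  # qwen3:14b
```

Because `OLLAMA_MAX_LOADED_MODELS=2` keeps both models in memory, switching between them this way avoids a reload penalty on each call.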