Translates findings from the M3 Max 36 GB model selection study (Issue #1063)
into production-ready config and tooling artifacts.
Changes:
- Modelfile.qwen3-14b: primary agent model (Q5_K_M, 32K ctx, temp 0.3)
  Tool calling F1 0.971, ~17.5 GB total — fits M3 Max with 10 GB headroom
- Modelfile.qwen3-8b: fast routing model (Q6_K, 32K ctx, temp 0.2)
  Tool calling F1 0.933 at ~45-55 tok/s; ~11.6 GB total
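  Both Modelfiles follow Ollama's standard Modelfile syntax. A minimal sketch of
  the primary-agent one, assuming only the parameters listed above are set (the
  base tag is an assumption; a quant-specific tag may be used instead):

  ```
  # Modelfile.qwen3-14b (sketch; the actual file may set more parameters)
  FROM qwen3:14b
  PARAMETER num_ctx 32768
  PARAMETER temperature 0.3
  ```

  The fast-routing Modelfile would differ only in base model, quantization,
  and temperature (0.2).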
- scripts/benchmark_local_model.sh: 5-test evaluation suite
  Tests tool call compliance, code gen, shell gen, multi-turn coherence,
  and issue triage quality with pass/fail thresholds
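  The tool call compliance check, for instance, might reduce to something like
  the following (shown in Python for illustration; the actual script is bash,
  and `check_tool_call` and its pass criteria are assumptions, not the script's
  real logic):

  ```python
  import json
  import re

  def check_tool_call(output: str) -> bool:
      """Pass if the model's raw output contains a parseable JSON object
      shaped like a tool call (a "name" and an "arguments" key)."""
      match = re.search(r"\{.*\}", output, re.DOTALL)
      if match is None:
          return False
      try:
          call = json.loads(match.group(0))
      except json.JSONDecodeError:
          return False
      return isinstance(call, dict) and "name" in call and "arguments" in call
  ```

  Aggregating such per-prompt checks against a threshold is what would yield
  the pass/fail verdicts and the F1 figures quoted above.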
- config.py: update defaults for M3 Max 36 GB hardware
  - ollama_model: qwen3:30b → qwen3:14b (within memory budget at 32K ctx)
  - ollama_fast_model: new field, defaults to qwen3:8b
  - ollama_max_loaded_models: new field, defaults to 2 (keeps both models hot)
  - ollama_num_ctx: 4096 → 32768 (qwen3:14b fits at 32K; the 30b did not)
  - fallback_models: promote qwen3:8b to first fallback
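  Taken together, the new config.py defaults could be sketched as follows (the
  class name OllamaSettings and the use of a dataclass are assumptions; field
  names and values come from the list above, and any further fallback entries
  are not shown):

  ```python
  from dataclasses import dataclass

  @dataclass
  class OllamaSettings:
      # Sketch only; the actual config.py structure may differ.
      ollama_model: str = "qwen3:14b"         # was qwen3:30b
      ollama_fast_model: str = "qwen3:8b"     # new: fast routing model
      ollama_max_loaded_models: int = 2       # new: keep both models hot
      ollama_num_ctx: int = 32768             # was 4096
      fallback_models: tuple = ("qwen3:8b",)  # qwen3:8b promoted to first
  ```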
Refs #1063