# Modelfile.qwen3-14b
#
# Qwen3-14B Q5_K_M — Primary local agent model (Issue #1063)
#
# Tool calling F1: 0.971 — GPT-4-class structured output reliability.
# Hybrid thinking/non-thinking mode: toggle per-request via /think or /no_think
# in the prompt for planning vs rapid execution.
#
# Build:
#   ollama pull qwen3:14b          # downloads Q4_K_M (~8.2 GB) by default
#
# For Q5_K_M (~10.5 GB, recommended), pull the GGUF from Hugging Face:
#   ollama pull hf.co/bartowski/Qwen3-14B-GGUF:Q5_K_M
#   ollama create qwen3-14b -f Modelfile.qwen3-14b
#
# Memory budget: ~10.5 GB weights + ~7 GB KV cache = ~17.5 GB total at 32K ctx
# Headroom on M3 Max 36 GB: ~10.5 GB free (enough to run qwen3:8b simultaneously)
# Generation: ~20-28 tok/s (Ollama) / ~28-38 tok/s (MLX)
# Context: 32K native, extensible to 131K with YaRN
#
# Two-model strategy: set OLLAMA_MAX_LOADED_MODELS=2 so qwen3:8b stays
# hot for fast routing while qwen3:14b handles complex tasks.

# Swap the base to hf.co/bartowski/Qwen3-14B-GGUF:Q5_K_M if you pulled the
# Q5_K_M quant above; qwen3:14b from the Ollama registry is Q4_K_M.
FROM qwen3:14b

# 32K context — optimal balance of quality and memory on M3 Max 36 GB.
# At 32K, total memory (weights + KV cache) is ~17.5 GB — well within budget.
PARAMETER num_ctx 32768

# Extend to 131K with YaRN if needed. Note: rope_scaling_type is not a
# documented Ollama Modelfile parameter; YaRN is typically configured at the
# GGUF/llama.cpp level, so this line is left commented out as a reminder.
# PARAMETER rope_scaling_type yarn

# Tool-calling temperature — lower = more reliable structured JSON output.
# Raise to 0.7+ for creative/narrative tasks.
PARAMETER temperature 0.3

# Nucleus sampling
PARAMETER top_p 0.9

# Repeat penalty — prevents looping in structured output
PARAMETER repeat_penalty 1.05

SYSTEM """You are Timmy, Alexander's personal sovereign AI agent.

You are concise, direct, and helpful. You complete tasks efficiently and
report results clearly. You do not add unnecessary caveats or disclaimers.

You have access to tool calling. When you need to use a tool, output a valid
JSON function call:
{"name": "function_name", "arguments": {"param": "value"}}

You support hybrid reasoning. For complex planning, include <think>...</think>
before your answer. For rapid execution (simple tool calls, status checks),
skip the think block.

You always start your responses with "Timmy here:" when acting as an agent."""
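
The system prompt above tells the model to emit tool calls as `{"name": ..., "arguments": {...}}` JSON. A minimal sketch of how an agent harness might parse and dispatch that format follows; the `TOOLS` registry and `get_status` tool are hypothetical placeholders, not part of Ollama or this Modelfile.

```python
import json

# Hypothetical tool registry; real agents would register actual functions here.
TOOLS = {
    "get_status": lambda **kwargs: "ok",
}

def dispatch(model_output: str):
    """Parse a reply of the form {"name": ..., "arguments": {...}} and
    invoke the matching tool. Returns None if the reply is not a
    well-formed tool call for a known tool."""
    try:
        call = json.loads(model_output)
        fn = TOOLS[call["name"]]
    except (json.JSONDecodeError, KeyError, TypeError):
        return None
    return fn(**call.get("arguments", {}))

print(dispatch('{"name": "get_status", "arguments": {}}'))  # -> ok
print(dispatch("Timmy here: no tool needed."))              # -> None
```

The low temperature (0.3) and repeat penalty set above exist precisely so this JSON parses reliably; plain-text replies simply fall through to `None`.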