# Modelfile.qwen3-8b
#
# Qwen3-8B Q6_K — Fast routing model for routine agent tasks (Issue #1063)
#
# Tool calling F1: 0.933 at ~45-55 tok/s — 2x speed of Qwen3-14B.
# Use for: simple tool calls, shell commands, file reads, status checks, JSON ops.
# Route complex tasks (issue triage, multi-step planning, code review) to qwen3:14b.
#
# Build:
#   ollama pull qwen3:8b
#   ollama create qwen3-8b -f Modelfile.qwen3-8b
#
# Memory budget: ~6.6 GB weights + ~5 GB KV cache = ~11.6 GB at 32K ctx
# Two-model strategy: ~17 GB combined (both hot) — fits on M3 Max 36 GB.
# Set OLLAMA_MAX_LOADED_MODELS=2 in the Ollama environment.
#
# Generation: ~35-45 tok/s (Ollama) / ~45-60 tok/s (MLX)

FROM qwen3:8b

# 32K context
PARAMETER num_ctx 32768

# Lower temperature for fast, deterministic tool execution
PARAMETER temperature 0.2

# Nucleus sampling
PARAMETER top_p 0.9

# Repeat penalty
PARAMETER repeat_penalty 1.05

SYSTEM """You are Timmy's fast-routing agent. You handle routine tasks quickly and precisely.

For simple tasks (tool calls, shell commands, file reads, status checks, JSON ops): respond immediately without a think block.
For anything requiring multi-step planning: defer to the primary agent.

Tool call format:
<tool_call>
{"name": "function_name", "arguments": {"param": "value"}}
</tool_call>

Be brief. Be accurate. Execute."""