# Modelfile.qwen3-8b
#
# Qwen3-8B Q6_K: fast routing model for routine agent tasks (Issue #1063)
#
# Tool calling F1: 0.933 at ~45-55 tok/s, 2x the speed of Qwen3-14B.
# Use for: simple tool calls, shell commands, file reads, status checks, JSON ops.
# Route complex tasks (issue triage, multi-step planning, code review) to qwen3:14b.
#
# Build:
# ollama pull qwen3:8b
# ollama create qwen3-8b -f Modelfile.qwen3-8b
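#
# Verify and smoke-test after building (a minimal sketch using standard
# Ollama CLI commands; the prompt text is only an illustration):
# ollama show qwen3-8b --modelfile
# ollama run qwen3-8b "Report the current git branch."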
#
# Memory budget: ~6.6 GB weights + ~5 GB KV cache = ~11.6 GB at 32K ctx
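#   Rough KV cache check (assumes Qwen3-8B's published config of 36 layers,
#   8 KV heads, head dim 128, and an f16 cache; these figures are not from this file):
#     2 (K+V) x 36 x 8 x 128 x 2 bytes = 147,456 bytes = 144 KiB per token
#     144 KiB x 32768 tokens = 4.5 GiB (~4.8 GB), in line with the ~5 GB above.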
# Two-model strategy: ~17 GB combined (both hot), which fits on an M3 Max with 36 GB.
# Set OLLAMA_MAX_LOADED_MODELS=2 in the Ollama environment.
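#   How to set it depends on how Ollama is launched (a sketch, not a
#   project requirement; adjust for your setup):
#     OLLAMA_MAX_LOADED_MODELS=2 ollama serve       (server started from a shell)
#     launchctl setenv OLLAMA_MAX_LOADED_MODELS 2   (macOS app; restart Ollama afterwards)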
#
# Generation: ~35-45 tok/s (Ollama) / ~45-60 tok/s (MLX)

FROM qwen3:8b

# 32K context
PARAMETER num_ctx 32768

# Low temperature for near-deterministic tool execution
PARAMETER temperature 0.2

# Nucleus sampling
PARAMETER top_p 0.9

# Repeat penalty
PARAMETER repeat_penalty 1.05

SYSTEM """You are Timmy's fast-routing agent. You handle routine tasks quickly and precisely.

For simple tasks (tool calls, shell commands, file reads, status checks, JSON ops): respond immediately without a think block.
For anything requiring multi-step planning: defer to the primary agent.

Tool call format:
<tool_call>
{"name": "function_name", "arguments": {"param": "value"}}
</tool_call>

Be brief. Be accurate. Execute."""