[Study] Best Local Uncensored Agent Model for M3 Max 36GB #1063

Closed
opened 2026-03-23 12:51:52 +00:00 by perplexity · 1 comment
Collaborator

Source

PDF: The-Best-Local-Uncensored-Agent-Model-for-M3-Max-36GB.pdf
Submitted by: rockachopa

Summary

Definitive model selection research for Timmy's local brain on Apple Silicon M3 Max with 36GB unified memory. The document evaluates quantized open-source models for agent orchestration — tool calling, code generation, shell execution, issue triage, and creative writing — under the hard constraint of 28GB usable VRAM (after ~8GB macOS/app overhead).

Key Findings

Primary Recommendation: Qwen3-14B Q5_K_M

  • Tool calling F1: 0.971 — GPT-4-class structured output reliability
  • File size: ~10.5 GB | Total memory (32K ctx): ~17.5 GB | Headroom: ~10.5 GB free
  • Generation speed: ~20–28 tok/s (Ollama), ~28–38 tok/s (MLX)
  • Context: 32K native, extensible to 131K with YaRN
  • Hybrid thinking/non-thinking mode — toggle per-request for planning vs rapid execution
  • Does NOT refuse code generation, shell commands, file operations, or tool calls
  • Ollama tag: qwen3:14b | GGUF source: bartowski/Qwen3-14B-GGUF
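
The tag and context length above can be wired into an Ollama Modelfile. A minimal sketch, assuming the 32K context and temp 0.3 parameters described in the artifacts section; the system prompt text is an illustrative assumption, not the one shipped in `Modelfile.timmy`:

```
FROM qwen3:14b
PARAMETER num_ctx 32768
PARAMETER temperature 0.3
SYSTEM """You are an agent orchestrator. Respond with tool calls or concise answers."""
```

Build it with `ollama create timmy -f Modelfile.timmy` and the model runs with these defaults on every request.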

Runner-up: Dolphin 3.0-R1-Mistral-24B Q4_K_M

  • Explicit uncensored training + DeepSeek R1-style chain-of-thought
  • ~14GB weights, fits at 8K context (~16.4GB total)
  • Tool calling ~85–90% F1 (vs Qwen3's 97%) — less reliable for agent loops
  • 128K native context, Apache 2.0 license
  • Use case: fallback when Qwen3's residual alignment interferes with creative/security-adjacent tasks

Fast Mode: Qwen3-8B Q6_K

  • 0.933 F1 tool calling at ~45–55 tok/s (2x speed of 14B)
  • ~6.6GB model, ~11.6GB with 32K context
  • For routine tasks: simple tool calls, quick code edits, status checks, backlog queries

Two-Model Strategy (Recommended)

  • Qwen3-8B as fast router for routine tool calls (shell, file reads, simple JSON, status)
  • Qwen3-14B for complex tasks (issue triage, multi-step planning, code review, creative writing)
  • Combined: ~17GB — both can stay loaded simultaneously with OLLAMA_MAX_LOADED_MODELS=2
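
The fast/deep split above can be sketched as a trivial router. A minimal sketch; the task-category names and the function itself are illustrative assumptions, only the model tags and the routing split come from the study:

```python
# Hypothetical two-model router: send routine task kinds to the fast 8B
# model and everything else to the deeper 14B model.

FAST_MODEL = "qwen3:8b"    # routine tool calls, file reads, status checks
DEEP_MODEL = "qwen3:14b"   # triage, planning, code review, creative writing

ROUTINE_TASKS = {"shell", "file_read", "simple_json", "status", "backlog_query"}

def pick_model(task_kind: str) -> str:
    """Return the Ollama model tag to use for a given task category."""
    return FAST_MODEL if task_kind in ROUTINE_TASKS else DEEP_MODEL
```

With `OLLAMA_MAX_LOADED_MODELS=2` both models stay resident, so switching between them costs no reload time.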

Critical Insight: "Uncensored" is a Red Herring

  • Abliteration (removing safety training) actively degrades structured output, tool-calling, and reasoning
  • Skywork analysis: abliterated Qwen3-30B-A3B "often underperforms smaller non-abliterated models"
  • Dolphin/abliterated models trade the capabilities an orchestrator needs most for freedoms it doesn't need
  • Qwen3-14B is permissive enough for all legitimate agent tasks; Ollama's constrained decoding forces compliance where needed
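
Ollama's constrained decoding works by passing a JSON schema in the request's `format` field, which forces the reply to conform during generation. A minimal sketch of building such a request body; the schema fields and the user message are illustrative assumptions:

```python
import json

# A tool-call schema the model's reply must satisfy (hypothetical shape).
schema = {
    "type": "object",
    "properties": {
        "action": {"type": "string"},
        "args": {"type": "object"},
    },
    "required": ["action", "args"],
}

# Request body for Ollama's /api/chat endpoint with schema-constrained output.
payload = {
    "model": "qwen3:14b",
    "messages": [{"role": "user", "content": "List files in /tmp"}],
    "format": schema,
    "stream": False,
}

body = json.dumps(payload)
```

POSTing `body` to `http://localhost:11434/api/chat` yields a reply guaranteed to parse against the schema, which is what makes "compliance" enforceable regardless of the model's alignment.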

Hermes 3 8B — Notable Mention

  • NousResearch's "neutrally-aligned" model — follows system prompts faithfully without imposing morality
  • "For Hermes, there is no such thing as latent thoughtcrime"
  • Strong philosophy but 8B can't match 14B reasoning depth; Hermes 4 at 14B would be a serious contender

Ollama vs MLX Performance

| Model | Quant | Ollama | MLX |
|---|---|---|---|
| 8B (Qwen3-8B) | Q4_K_M | ~35–45 tok/s | ~45–60 tok/s |
| 14B (Qwen3-14B) | Q5_K_M | ~20–28 tok/s | ~28–38 tok/s |
| 24B (Devstral) | Q4_K_M | ~15–20 tok/s | ~20–28 tok/s |
| 30B-A3B MoE (Qwen3) | Q4_K_M | ~30–40 tok/s | ~50–65 tok/s |

MLX is 25–50% faster than Ollama, but Ollama has the superior ecosystem for agent orchestration (built-in tool calling API, JSON mode, model management, OpenAI-compatible endpoint).
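
The 25–50% gap can be sanity-checked with simple arithmetic, using the midpoints of the 14B ranges above (~24 tok/s Ollama, ~33 tok/s MLX); the 1000-token response length is an illustrative assumption:

```python
def seconds_for(tokens: int, tok_per_s: float) -> float:
    """Wall-clock seconds to generate `tokens` at a steady decode rate."""
    return tokens / tok_per_s

ollama_s = seconds_for(1000, 24.0)   # ~41.7 s
mlx_s = seconds_for(1000, 33.0)      # ~30.3 s
speedup = ollama_s / mlx_s           # 1.375, i.e. MLX ~37.5% faster
```

A ~11-second difference per long response matters for interactive use, but for agent loops the ecosystem advantages listed above usually outweigh it.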

Includes: Production-Ready Artifacts

  1. Modelfile.timmy — Complete Ollama Modelfile with optimized parameters (32K ctx, temp 0.3, system prompt)
  2. API curl examples — Tool calling and schema-constrained JSON output
  3. 5-test benchmark suite — Shell scripts to evaluate any candidate model:
    • Test 1: Tool calling compliance (>90% valid JSON)
    • Test 2: Code generation correctness
    • Test 3: Shell command generation (no refusal)
    • Test 4: Multi-turn agent loop coherence
    • Test 5: Issue triage quality
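
Test 1's pass criterion (>90% valid JSON) can be sketched as a small checker. A minimal sketch in Python rather than the study's shell scripts; the function name and sample outputs are illustrative assumptions:

```python
import json

def tool_call_compliance(outputs: list[str], threshold: float = 0.9) -> tuple[float, bool]:
    """Fraction of model outputs that parse as JSON objects, plus
    pass/fail against the >90% valid-JSON threshold from Test 1."""
    valid = 0
    for text in outputs:
        try:
            obj = json.loads(text)
            valid += isinstance(obj, dict)
        except json.JSONDecodeError:
            pass
    rate = valid / len(outputs) if outputs else 0.0
    return rate, rate > threshold

# Hypothetical model outputs: two valid tool calls, one refusal in prose.
samples = ['{"tool": "ls", "args": {}}', "I cannot do that.", '{"tool": "cat"}']
rate, ok = tool_call_compliance(samples)   # rate ≈ 0.667, ok = False
```

The same shape extends to Tests 2–5 by swapping the per-output predicate (code compiles, no refusal string, coherent turn, etc.).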

MCP Integration Path

  • Qwen-Agent has native MCP support (pip install qwen-agent[mcp])
  • Multiple Ollama-MCP bridges exist (ollmcp, mcphost, ollama-mcp-bridge)
  • Qwen3 models explicitly listed among recommended Ollama models for MCP/tool use

Cross-References

  • #966 — Three-tier LLM router architecture (this PDF provides the specific model recommendations for each tier)
  • #982 — Session Crystallization Playbook (Qwen3-14B is the recommended local model for the sovereignty stack)
  • #972 — Replacing Claude spec (this research directly answers "which local model replaces Claude")

Work Suggestions

See child issues for actionable implementation tasks.

claude self-assigned this 2026-03-23 14:01:48 +00:00
Collaborator

PR created: #1143

Artifacts delivered from the study:

  1. Modelfile.qwen3-14b — Primary agent model (Q5_K_M, 32K ctx, temp 0.3). Tool calling F1 0.971, ~17.5 GB on M3 Max 36 GB.
  2. Modelfile.qwen3-8b — Fast routing model (Q6_K, 32K ctx, temp 0.2). F1 0.933 at ~45–55 tok/s, ~11.6 GB. Both models combined: ~17 GB — stay loaded simultaneously with OLLAMA_MAX_LOADED_MODELS=2.
  3. scripts/benchmark_local_model.sh — 5-test evaluation suite (tool call compliance, code gen, shell gen, multi-turn coherence, issue triage quality).
  4. src/config.py — Updated defaults: ollama_model → qwen3:14b, ollama_num_ctx → 32768, added ollama_fast_model = qwen3:8b and ollama_max_loaded_models = 2.
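
The updated defaults in item 4 can be sketched as a settings object. A minimal sketch, assuming a dataclass shape; the field names follow the comment, but the class itself is an assumption about how `src/config.py` is organized:

```python
from dataclasses import dataclass

@dataclass
class OllamaConfig:
    """Default Ollama settings after the study's changes."""
    ollama_model: str = "qwen3:14b"        # primary agent model
    ollama_num_ctx: int = 32768            # 32K context window
    ollama_fast_model: str = "qwen3:8b"    # fast routing model
    ollama_max_loaded_models: int = 2      # keep both resident

cfg = OllamaConfig()
```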

All 20 unit tests pass.


Reference: Rockachopa/Timmy-time-dashboard#1063