[Study] Best Local Uncensored Agent Model for M3 Max 36GB #1063

Closed
opened 2026-03-23 12:51:52 +00:00 by perplexity · 1 comment
Collaborator

Source

PDF: The-Best-Local-Uncensored-Agent-Model-for-M3-Max-36GB.pdf
Submitted by: rockachopa

Summary

Definitive model selection research for Timmy's local brain on Apple Silicon M3 Max with 36GB unified memory. The document evaluates quantized open-source models for agent orchestration — tool calling, code generation, shell execution, issue triage, and creative writing — under the hard constraint of 28GB usable VRAM (after ~8GB macOS/app overhead).

Key Findings

Primary Recommendation: Qwen3-14B Q5_K_M

  • Tool calling F1: 0.971 — GPT-4-class structured output reliability
  • File size: ~10.5 GB | Total memory (32K ctx): ~17.5 GB | Headroom: ~10.5 GB free
  • Generation speed: ~20–28 tok/s (Ollama), ~28–38 tok/s (MLX)
  • Context: 32K native, extensible to 131K with YaRN
  • Hybrid thinking/non-thinking mode — toggle per-request for planning vs rapid execution
  • Does NOT refuse code generation, shell commands, file operations, or tool calls
  • Ollama tag: qwen3:14b | GGUF source: bartowski/Qwen3-14B-GGUF
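
The tag and context length above can be wired into an Ollama Modelfile. A minimal sketch, assuming the 32K context and temp 0.3 parameters described in the artifacts section; the system prompt text is an illustrative assumption, not the one shipped in `Modelfile.timmy`:

```
FROM qwen3:14b
PARAMETER num_ctx 32768
PARAMETER temperature 0.3
SYSTEM """You are an agent orchestrator. Respond with tool calls or concise answers."""
```

Build it with `ollama create timmy -f Modelfile.timmy` and the model runs with these defaults on every request.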

Runner-up: Dolphin 3.0-R1-Mistral-24B Q4_K_M

  • Explicit uncensored training + DeepSeek R1-style chain-of-thought
  • ~14GB weights, fits at 8K context (~16.4GB total)
  • Tool calling ~85–90% F1 (vs Qwen3's 97%) — less reliable for agent loops
  • 128K native context, Apache 2.0 license
  • Use case: fallback when Qwen3's residual alignment interferes with creative/security-adjacent tasks

Fast Mode: Qwen3-8B Q6_K

  • 0.933 F1 tool calling at ~45–55 tok/s (2x speed of 14B)
  • ~6.6GB model, ~11.6GB with 32K context
  • For routine tasks: simple tool calls, quick code edits, status checks, backlog queries

Two-Model Strategy (Recommended)

  • Qwen3-8B as fast router for routine tool calls (shell, file reads, simple JSON, status)
  • Qwen3-14B for complex tasks (issue triage, multi-step planning, code review, creative writing)
  • Combined: ~17GB — both can stay loaded simultaneously with OLLAMA_MAX_LOADED_MODELS=2
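
The fast/deep split above can be sketched as a trivial router. A minimal sketch; the task-category names and the function itself are illustrative assumptions, only the model tags and the routing split come from the study:

```python
# Hypothetical two-model router: send routine task kinds to the fast 8B
# model and everything else to the deeper 14B model.

FAST_MODEL = "qwen3:8b"    # routine tool calls, file reads, status checks
DEEP_MODEL = "qwen3:14b"   # triage, planning, code review, creative writing

ROUTINE_TASKS = {"shell", "file_read", "simple_json", "status", "backlog_query"}

def pick_model(task_kind: str) -> str:
    """Return the Ollama model tag to use for a given task category."""
    return FAST_MODEL if task_kind in ROUTINE_TASKS else DEEP_MODEL
```

With `OLLAMA_MAX_LOADED_MODELS=2` both models stay resident, so switching between them costs no reload time.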

Critical Insight: "Uncensored" is a Red Herring

  • Abliteration (removing safety training) actively degrades structured output, tool-calling, and reasoning
  • Skywork analysis: abliterated Qwen3-30B-A3B "often underperforms smaller non-abliterated models"
  • Dolphin/abliterated models trade the capabilities an orchestrator needs most for freedoms it doesn't need
  • Qwen3-14B is permissive enough for all legitimate agent tasks; Ollama's constrained decoding forces compliance where needed
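
Ollama's constrained decoding works by passing a JSON schema in the request's `format` field, which forces the reply to conform during generation. A minimal sketch of building such a request body; the schema fields and the user message are illustrative assumptions:

```python
import json

# A tool-call schema the model's reply must satisfy (hypothetical shape).
schema = {
    "type": "object",
    "properties": {
        "action": {"type": "string"},
        "args": {"type": "object"},
    },
    "required": ["action", "args"],
}

# Request body for Ollama's /api/chat endpoint with schema-constrained output.
payload = {
    "model": "qwen3:14b",
    "messages": [{"role": "user", "content": "List files in /tmp"}],
    "format": schema,
    "stream": False,
}

body = json.dumps(payload)
```

POSTing `body` to `http://localhost:11434/api/chat` yields a reply guaranteed to parse against the schema, which is what makes "compliance" enforceable regardless of the model's alignment.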

Hermes 3 8B — Notable Mention

  • NousResearch's "neutrally-aligned" model — follows system prompts faithfully without imposing morality
  • "For Hermes, there is no such thing as latent thoughtcrime"
  • Strong philosophy but 8B can't match 14B reasoning depth; Hermes 4 at 14B would be a serious contender

Ollama vs MLX Performance

| Model | Quant | Ollama | MLX |
|---|---|---|---|
| 8B (Qwen3-8B) | Q4_K_M | ~35–45 tok/s | ~45–60 tok/s |
| 14B (Qwen3-14B) | Q5_K_M | ~20–28 tok/s | ~28–38 tok/s |
| 24B (Devstral) | Q4_K_M | ~15–20 tok/s | ~20–28 tok/s |
| 30B-A3B MoE (Qwen3) | Q4_K_M | ~30–40 tok/s | ~50–65 tok/s |

MLX is 25–50% faster than Ollama, but Ollama has the superior ecosystem for agent orchestration (built-in tool calling API, JSON mode, model management, OpenAI-compatible endpoint).
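
The 25–50% gap can be sanity-checked with simple arithmetic, using the midpoints of the 14B ranges above (~24 tok/s Ollama, ~33 tok/s MLX); the 1000-token response length is an illustrative assumption:

```python
def seconds_for(tokens: int, tok_per_s: float) -> float:
    """Wall-clock seconds to generate `tokens` at a steady decode rate."""
    return tokens / tok_per_s

ollama_s = seconds_for(1000, 24.0)   # ~41.7 s
mlx_s = seconds_for(1000, 33.0)      # ~30.3 s
speedup = ollama_s / mlx_s           # 1.375, i.e. MLX ~37.5% faster
```

A ~11-second difference per long response matters for interactive use, but for agent loops the ecosystem advantages listed above usually outweigh it.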

Includes: Production-Ready Artifacts

  1. Modelfile.timmy — Complete Ollama Modelfile with optimized parameters (32K ctx, temp 0.3, system prompt)
  2. API curl examples — Tool calling and schema-constrained JSON output
  3. 5-test benchmark suite — Shell scripts to evaluate any candidate model:
    • Test 1: Tool calling compliance (>90% valid JSON)
    • Test 2: Code generation correctness
    • Test 3: Shell command generation (no refusal)
    • Test 4: Multi-turn agent loop coherence
    • Test 5: Issue triage quality
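
Test 1's pass criterion (>90% valid JSON) can be sketched as a small checker. A minimal sketch in Python rather than the study's shell scripts; the function name and sample outputs are illustrative assumptions:

```python
import json

def tool_call_compliance(outputs: list[str], threshold: float = 0.9) -> tuple[float, bool]:
    """Fraction of model outputs that parse as JSON objects, plus
    pass/fail against the >90% valid-JSON threshold from Test 1."""
    valid = 0
    for text in outputs:
        try:
            obj = json.loads(text)
            valid += isinstance(obj, dict)
        except json.JSONDecodeError:
            pass
    rate = valid / len(outputs) if outputs else 0.0
    return rate, rate > threshold

# Hypothetical model outputs: two valid tool calls, one refusal in prose.
samples = ['{"tool": "ls", "args": {}}', "I cannot do that.", '{"tool": "cat"}']
rate, ok = tool_call_compliance(samples)   # rate ≈ 0.667, ok = False
```

The same shape extends to Tests 2–5 by swapping the per-output predicate (code compiles, no refusal string, coherent turn, etc.).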

MCP Integration Path

  • Qwen-Agent has native MCP support (pip install qwen-agent[mcp])
  • Multiple Ollama-MCP bridges exist (ollmcp, mcphost, ollama-mcp-bridge)
  • Qwen3 models explicitly listed among recommended Ollama models for MCP/tool use

Cross-References

  • #966 — Three-tier LLM router architecture (this PDF provides the specific model recommendations for each tier)
  • #982 — Session Crystallization Playbook (Qwen3-14B is the recommended local model for the sovereignty stack)
  • #972 — Replacing Claude spec (this research directly answers "which local model replaces Claude")

Work Suggestions

See child issues for actionable implementation tasks.

claude self-assigned this 2026-03-23 14:01:48 +00:00
Collaborator

PR created: #1143

Artifacts delivered from the study:

  1. Modelfile.qwen3-14b — Primary agent model (Q5_K_M, 32K ctx, temp 0.3). Tool calling F1 0.971, ~17.5 GB on M3 Max 36 GB.
  2. Modelfile.qwen3-8b — Fast routing model (Q6_K, 32K ctx, temp 0.2). F1 0.933 at ~45–55 tok/s, ~11.6 GB. Both models combined: ~17 GB — stay loaded simultaneously with OLLAMA_MAX_LOADED_MODELS=2.
  3. scripts/benchmark_local_model.sh — 5-test evaluation suite (tool call compliance, code gen, shell gen, multi-turn coherence, issue triage quality).
  4. src/config.py — Updated defaults: ollama_model → qwen3:14b, ollama_num_ctx → 32768, added ollama_fast_model = qwen3:8b and ollama_max_loaded_models = 2.
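
The updated defaults in item 4 can be sketched as a settings object. A minimal sketch, assuming a dataclass shape; the field names follow the comment, but the class itself is an assumption about how `src/config.py` is organized:

```python
from dataclasses import dataclass

@dataclass
class OllamaConfig:
    """Default Ollama settings after the study's changes."""
    ollama_model: str = "qwen3:14b"        # primary agent model
    ollama_num_ctx: int = 32768            # 32K context window
    ollama_fast_model: str = "qwen3:8b"    # fast routing model
    ollama_max_loaded_models: int = 2      # keep both resident

cfg = OllamaConfig()
```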

All 20 unit tests pass.


Reference: Rockachopa/Timmy-time-dashboard#1063