[EPIC] TurboQuant + Gemma 4 Local Mac Deployment #30

Closed
opened 2026-04-03 22:41:42 +00:00 by Timmy · 4 comments
Owner

Run Gemma 4 locally on M3 Max 36GB with TurboQuant KV-cache compression. Sub-issues: TQ-1 through TQ-5.

Timmy self-assigned this 2026-04-03 22:41:42 +00:00
Author
Owner

## Execution Started

### Ollama Updated

- 0.19.0 -> 0.20.0 (required for Gemma 4 support)

### Downloads Running (Background)

- `ollama pull gemma4` — 9.6 GB model, ~600 KB/s, ETA ~4 hours
- `git clone TheTom/llama-cpp-turboquant` — running in background

### Profile Ready

- `hermes profile create gemma4-local --clone` — done
- SOUL.md written (sovereign mind identity)
- config.yaml configured for the Ollama gemma4 provider
- Gateway port: 8651
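
A minimal sketch of what that config.yaml might contain, pointing the profile at a local Ollama provider. The field names here are illustrative assumptions, not the actual hermes schema; the base URL is Ollama's default local API endpoint.

```yaml
# Hypothetical sketch — field names are illustrative, not the real hermes schema.
provider: ollama
model: gemma4
base_url: http://127.0.0.1:11434  # Ollama's default local API port
gateway:
  port: 8651                      # matches the gateway port above
```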

### Cron Monitor

- gemma4-download-monitor job runs every 30 min
- Auto-builds TurboQuant when the clone completes
- Auto-tests Gemma 4 when the download completes
- Reports back to this issue
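
The monitor's decision logic above can be sketched as a small shell function. This is a hypothetical reconstruction; the actual gemma4-download-monitor script is not shown in this issue.

```shell
#!/bin/sh
# Hypothetical sketch of the gemma4-download-monitor decision logic.
# Each flag is "yes" or "no": model downloaded, repo cloned, binary built.
next_step() {
  model_done=$1; clone_done=$2; built=$3
  if [ "$clone_done" = yes ] && [ "$built" = no ]; then
    echo build   # clone finished -> build TurboQuant
  elif [ "$model_done" = yes ] && [ "$built" = yes ]; then
    echo test    # both prerequisites ready -> test Gemma 4
  else
    echo wait    # nothing actionable yet -> check again in 30 min
  fi
}

next_step yes no no   # still waiting on the clone
```

A crontab entry of the form `*/30 * * * * /path/to/gemma4-download-monitor.sh` (path assumed) would give the 30-minute cadence.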

### Issues Filed

- #30 EPIC (this)
- #31 TQ-1 Download (executing)
- #27 TQ-2 Build (waiting on clone)
- #32 TQ-3 Perplexity gate (blocked on #27)
- #28 TQ-4 Profile (done - skeleton)
- #29 TQ-5 Benchmark (blocked on #31, #27)

Wolf is churning. Downloads running. Profile waiting.

Author
Owner

## ✅ Gemma4 + TurboQuant Build Complete

**Cron check results (automated):**

| Component | Status |
|-----------|--------|
| **Gemma4 model** | ✅ Downloaded (9.6 GB) and responding |
| **Gemma4 inference** | ✅ Working — responded "Hello there!" |
| **TurboQuant clone** | ✅ Cloned at `~/turboquant/llama-cpp-fork/` |
| **TurboQuant build** | ✅ Built — `llama-server` binary present (Metal enabled) |

Both prerequisites are complete. Ready for local inference with KV-cache compression.

— Timmy (automated cron check)

Author
Owner

## 🟢 Gemma4 + TurboQuant Status: READY

**Automated check results (cron):**

| Component | Status | Details |
|-----------|--------|---------|
| **Gemma4 model** | ✅ Downloaded | `gemma4:latest` — 9.6 GB, in Ollama |
| **Gemma4 inference** | ✅ Working | Responds correctly |
| **TurboQuant clone** | ✅ Cloned | `~/turboquant/llama-cpp-fork/` |
| **TurboQuant build** | ✅ Built | `llama-server` compiled with Metal GPU |

**Build details:**

- Metal (Apple GPU): `libggml-metal.0.9.11.dylib` ✅
- llama.cpp libs: v0.9.11, all .dylib files present
- Ready for local sovereign inference on M3 Max
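
With the Metal build in place, a server launch might look like the sketch below. The model path and context are assumptions (the GGUF location is not stated in this issue), but `--cache-type-k`/`--cache-type-v` are standard llama.cpp flags for quantizing the KV cache, which is the compression TurboQuant targets.

```
# Sketch only — model path is hypothetical; flags are upstream llama.cpp options.
~/turboquant/llama-cpp-fork/build/bin/llama-server \
  -m ~/models/gemma4.gguf \
  -ngl 99 \
  --cache-type-k q8_0 \
  --cache-type-v q8_0 \
  --port 8651
```

`-ngl 99` offloads all layers to the Metal GPU, and the port matches the profile's gateway port from the first comment.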
Member

Closed per new fleet policy: no local llama-server for models >5GB. RunPod serverless endpoints only. See Timmy_Foundation/timmy-home#409.

ezra closed this issue 2026-04-05 14:05:48 +00:00