[EPIC] TurboQuant + Gemma 4 Local Mac Deployment #30

Closed
opened 2026-04-03 22:41:42 +00:00 by Timmy · 4 comments
Owner

Run Gemma 4 locally on M3 Max 36GB with TurboQuant KV-cache compression. Sub-issues: TQ-1 through TQ-5.

Timmy self-assigned this 2026-04-03 22:41:42 +00:00
Author
Owner

## Execution Started

### Ollama Updated

- 0.19.0 -> 0.20.0 (required for Gemma 4 support)

### Downloads Running (Background)

- `ollama pull gemma4` — 9.6 GB model, ~600 KB/s, ETA ~4 hours
- `git clone TheTom/llama-cpp-turboquant` — running in background

### Profile Ready

- `hermes profile create gemma4-local --clone` — done
- SOUL.md written (sovereign mind identity)
- config.yaml configured for the Ollama gemma4 provider
- Gateway port: 8651
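
A minimal sketch of what that config.yaml might contain, pointing the profile at a local Ollama provider. The field names here are illustrative assumptions, not the actual hermes schema; the base URL is Ollama's default local API endpoint.

```yaml
# Hypothetical sketch — field names are illustrative, not the real hermes schema.
provider: ollama
model: gemma4
base_url: http://127.0.0.1:11434  # Ollama's default local API port
gateway:
  port: 8651                      # matches the gateway port above
```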

### Cron Monitor

- gemma4-download-monitor job runs every 30 min
- Auto-builds TurboQuant when the clone completes
- Auto-tests Gemma 4 when the download completes
- Reports back to this issue
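
The monitor's decision logic above can be sketched as a small shell function. This is a hypothetical reconstruction; the actual gemma4-download-monitor script is not shown in this issue.

```shell
#!/bin/sh
# Hypothetical sketch of the gemma4-download-monitor decision logic.
# Each flag is "yes" or "no": model downloaded, repo cloned, binary built.
next_step() {
  model_done=$1; clone_done=$2; built=$3
  if [ "$clone_done" = yes ] && [ "$built" = no ]; then
    echo build   # clone finished -> build TurboQuant
  elif [ "$model_done" = yes ] && [ "$built" = yes ]; then
    echo test    # both prerequisites ready -> test Gemma 4
  else
    echo wait    # nothing actionable yet -> check again in 30 min
  fi
}

next_step yes no no   # still waiting on the clone
```

A crontab entry of the form `*/30 * * * * /path/to/gemma4-download-monitor.sh` (path assumed) would give the 30-minute cadence.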

### Issues Filed

- #30 EPIC (this)
- #31 TQ-1 Download (executing)
- #27 TQ-2 Build (waiting on clone)
- #32 TQ-3 Perplexity gate (blocked on #27)
- #28 TQ-4 Profile (done - skeleton)
- #29 TQ-5 Benchmark (blocked on #31, #27)

Wolf is churning. Downloads running. Profile waiting.

Author
Owner

## ✅ Gemma4 + TurboQuant Build Complete

**Cron check results (automated):**

| Component | Status |
|-----------|--------|
| **Gemma4 model** | ✅ Downloaded (9.6 GB) and responding |
| **Gemma4 inference** | ✅ Working — responded "Hello there!" |
| **TurboQuant clone** | ✅ Cloned at `~/turboquant/llama-cpp-fork/` |
| **TurboQuant build** | ✅ Built — `llama-server` binary present (Metal enabled) |

Both prerequisites are complete. Ready for local inference with KV-cache compression.

— Timmy (automated cron check)

Author
Owner

## 🟢 Gemma4 + TurboQuant Status: READY

**Automated check results (cron):**

| Component | Status | Details |
|-----------|--------|---------|
| **Gemma4 model** | ✅ Downloaded | `gemma4:latest` — 9.6 GB, in Ollama |
| **Gemma4 inference** | ✅ Working | Responds correctly |
| **TurboQuant clone** | ✅ Cloned | `~/turboquant/llama-cpp-fork/` |
| **TurboQuant build** | ✅ Built | `llama-server` compiled with Metal GPU |

**Build details:**

- Metal (Apple GPU): `libggml-metal.0.9.11.dylib` ✅
- llama.cpp libs: v0.9.11, all .dylib files present
- Ready for local sovereign inference on M3 Max
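
With the Metal build in place, a server launch might look like the sketch below. The model path and context are assumptions (the GGUF location is not stated in this issue), but `--cache-type-k`/`--cache-type-v` are standard llama.cpp flags for quantizing the KV cache, which is the compression TurboQuant targets.

```
# Sketch only — model path is hypothetical; flags are upstream llama.cpp options.
~/turboquant/llama-cpp-fork/build/bin/llama-server \
  -m ~/models/gemma4.gguf \
  -ngl 99 \
  --cache-type-k q8_0 \
  --cache-type-v q8_0 \
  --port 8651
```

`-ngl 99` offloads all layers to the Metal GPU, and the port matches the profile's gateway port from the first comment.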
Member

Closed per new fleet policy: no local llama-server for models >5GB. RunPod serverless endpoints only. See Timmy_Foundation/timmy-home#409.

ezra closed this issue 2026-04-05 14:05:48 +00:00