[P1-S1] Build llama.cpp fork with Metal backend on M4 Max #4


Timmy · 2026-03-30T17:11:05Z


Parent: #1 | Depends on: #3 (fork assessment)

Build the llama.cpp TurboQuant fork with Metal backend on MacBook Pro M4 Max.

2-Hour Cap

If the fork does not compile and pass the smoke test (load model, generate 10 tokens) within 2 hours, STOP. Pivot to the MLX path and report what broke.
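One way to enforce the cap mechanically is to run the build and smoke-test commands under a shared time budget. The sketch below is illustrative only: `run_within_cap` and `TIME_CAP_SECONDS` are hypothetical names, and the commands passed in are whatever the chosen build path requires.

```python
import subprocess
import time

# 2-hour cap from this issue, expressed in seconds.
TIME_CAP_SECONDS = 2 * 60 * 60


def run_within_cap(commands, cap_seconds=TIME_CAP_SECONDS):
    """Run commands in order under one shared time budget.

    Each command is a list of arguments, as accepted by subprocess.run.
    Returns (ok, elapsed_seconds, failure_description).
    """
    start = time.monotonic()
    for cmd in commands:
        remaining = cap_seconds - (time.monotonic() - start)
        if remaining <= 0:
            return False, time.monotonic() - start, f"time cap reached before: {cmd}"
        try:
            # subprocess.run kills the child if the timeout expires.
            result = subprocess.run(cmd, timeout=remaining)
        except subprocess.TimeoutExpired:
            return False, time.monotonic() - start, f"time cap reached during: {cmd}"
        if result.returncode != 0:
            return False, time.monotonic() - start, f"exit code {result.returncode} from: {cmd}"
    return True, time.monotonic() - start, None
```

The elapsed time in the return value can double as the "build time reported" figure, and the failure description gives a starting point for the "report what broke" step.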

Paths

Direct build: If the fork is fresh, build it as-is
Cherry-pick: If the fork is 2-4 weeks stale, cherry-pick the TurboQuant commits onto current HEAD
Clean-room: If conflicts are extensive, use turboquant_plus as a reference and implement into current HEAD (60-90 min)
MLX pivot: If all llama.cpp paths are blocked, switch to rachittshah/mlx-turboquant
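The path selection above can be sketched as a small decision function. This is an illustration, not part of the plan: `BuildPath` and `choose_build_path` are hypothetical names, and the 14-day freshness threshold is an assumption taken from the lower bound of the "2-4 weeks stale" range, since the issue does not define "fresh". Forks older than 4 weeks are not covered by the list; the sketch falls back to cherry-picking unless conflicts are flagged as extensive.

```python
from enum import Enum


class BuildPath(Enum):
    DIRECT = "direct build"
    CHERRY_PICK = "cherry-pick"
    CLEAN_ROOM = "clean-room"
    MLX_PIVOT = "MLX pivot"


# Assumption: a fork younger than this counts as "fresh".
FRESH_THRESHOLD_DAYS = 14


def choose_build_path(fork_age_days, conflicts_extensive, llama_cpp_blocked):
    """Map the fork assessment onto one of the four paths."""
    if llama_cpp_blocked:
        # All llama.cpp paths are blocked: switch to the MLX implementation.
        return BuildPath.MLX_PIVOT
    if conflicts_extensive:
        # Re-implement into current HEAD using the reference code.
        return BuildPath.CLEAN_ROOM
    if fork_age_days < FRESH_THRESHOLD_DAYS:
        return BuildPath.DIRECT
    return BuildPath.CHERRY_PICK
```

The inputs would come from the fork assessment in #3: fork age, whether a trial merge shows extensive conflicts, and whether every llama.cpp route has already failed.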

Smoke Test

Load qwen3.5:27b
Generate 10 tokens
No crashes, no Metal errors
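The "no crashes, no Metal errors" check can be approximated by inspecting the exit code and captured log of the smoke-test run. The sketch below assumes a simple heuristic that is not taken from llama.cpp documentation: any log line mentioning both "metal" and "error" (case-insensitive) is treated as a Metal error. The function names are hypothetical, and the sketch does not verify the token count, because the output format is not specified in this issue.

```python
def find_metal_errors(log_text):
    """Return log lines that mention both 'metal' and 'error'.

    Heuristic only: the real log format may differ, so treat matches
    as lines to inspect by hand rather than as a definitive verdict.
    """
    suspicious = []
    for line in log_text.splitlines():
        lowered = line.lower()
        if "metal" in lowered and "error" in lowered:
            suspicious.append(line)
    return suspicious


def smoke_test_passed(returncode, log_text):
    """Pass if the process exited cleanly and no Metal error lines were found."""
    return returncode == 0 and not find_metal_errors(log_text)
```

A non-zero exit code covers the crash case, since a process killed by a signal does not exit with status 0.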

Acceptance Criteria

llama.cpp builds with Metal backend (or MLX pivot documented)
Smoke test passes: model loads, generates tokens
Build path documented (which approach was used)
Build time reported


Timmy added this to the Phase 1 — PolarQuant MVP milestone 2026-03-30 17:11:05 +00:00

Timmy added the build phase-1 owner:cid labels 2026-03-30 17:11:05 +00:00

Timmy referenced this issue

2026-03-30 17:11:07 +00:00

[P1-S1] PolarQuant verification checklist #5

Timmy referenced this issue

2026-03-30 17:11:08 +00:00

[P1-S2] Baseline benchmarks — FP16 KV cache (no TurboQuant) #6

allegro referenced this issue

2026-03-30 17:44:23 +00:00

TurboQuant — KV Cache Compression for Local Inference on M4 Max #1

allegro referenced this issue

2026-03-30 17:44:30 +00:00

TurboQuant — KV Cache Compression for Local Inference on M4 Max #1

allegro referenced this issue

2026-03-30 17:45:24 +00:00

[P1-S0] Fork assessment — age, conflicts, build path estimate #3

allegro referenced this issue

2026-03-30 17:51:08 +00:00

[P1-GATE] Metal kernel check — determines llama.cpp vs MLX path #2

Timmy commented

2026-03-30 20:09:50 +00:00

Build Complete ✅

Branch: feature/turboquant-kv-cache (commit adac2c6)
Build: cmake -DGGML_METAL=ON -DCMAKE_BUILD_TYPE=Release → 100% clean build
Metal init output confirms TurboQuant:

ggml_metal_library_init: turbo3 using 4-mag LUT (pre-M5 hardware)

All binaries built: llama-cli, llama-bench, llama-perplexity, llama-server.
Smoke test PASSED: model loads, generates coherent text at ~34 t/s.

Build time: ~3 minutes (cmake configure + make -j14)
Path used: Direct build from feature branch (no cherry-picking needed)


Timmy closed this issue

2026-03-30 20:09:51 +00:00

Timmy referenced this issue from a commit

2026-03-30 20:12:03 +00:00

Phase 1 Report: PolarQuant MVP complete

Timmy referenced this issue

2026-04-04 01:18:41 +00:00

TurboQuant — KV Cache Compression for Local Inference on M4 Max #1
