[P1-GATE] Metal kernel check — determines llama.cpp vs MLX path #2

Closed
opened 2026-03-30 17:11:02 +00:00 by Timmy · 2 comments
Owner

Parent: #1

BLOCKING — Do this before anything else

This is the single highest-risk assumption in the spec. The check takes 2 minutes and determines the entire build strategy.

Action

git clone https://github.com/TheTom/llama-cpp-turboquant.git
cd llama-cpp-turboquant

# Check for Metal shader files referencing TurboQuant/PolarQuant
grep -rn "turbo\|polar\|turboquant\|polarquant" ggml/src/ggml-metal* 2>/dev/null
grep -rn "turbo\|polar" ggml/src/ggml-metal.metal 2>/dev/null

# Check for Metal kernel dispatch for turbo KV types
grep -rn "GGML_TYPE_.*TURBO\|turbo.*metal\|kv.*turbo" . --include="*.m" --include="*.metal" --include="*.h" 2>/dev/null
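The greps above can also be wrapped in a small script that prints a single YES/NO verdict. A minimal sketch: the patterns and file extensions mirror the grep commands, and metal_kernels_present is a hypothetical helper, not part of either repo.

```python
import os
import re

# Patterns mirroring the grep commands above (case-insensitive).
TURBO_PATTERN = re.compile(r"turbo|polar", re.IGNORECASE)
EXTENSIONS = (".metal", ".m", ".h")

def metal_kernels_present(repo_root):
    """Return True if any Metal-related source file mentions turbo/polar."""
    for dirpath, _dirnames, filenames in os.walk(repo_root):
        for name in filenames:
            if not name.endswith(EXTENSIONS):
                continue
            path = os.path.join(dirpath, name)
            try:
                with open(path, encoding="utf-8", errors="ignore") as f:
                    if TURBO_PATTERN.search(f.read()):
                        return True
            except OSError:
                continue
    return False

if __name__ == "__main__":
    verdict = metal_kernels_present("llama-cpp-turboquant")
    print("Metal shaders present:", "YES" if verdict else "NO")
```

Note this only checks that the strings appear somewhere in Metal-side sources; the manual greps above are still needed to judge whether the kernels are complete.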

Decision

  • Metal shaders exist: Proceed with llama.cpp fork path (primary)
  • Metal shaders DO NOT exist: MLX becomes PRIMARY path, not fallback. Switch to rachittshah/mlx-turboquant immediately. Reframe Phase 1 around MLX + API proxy.

Acceptance Criteria

  • Cloned TheTom/llama-cpp-turboquant
  • Searched for Metal TurboQuant/PolarQuant kernels
  • Reported: Metal shaders present YES/NO
  • If NO: flagged for path change before any more work
Timmy added this to the Phase 1 — PolarQuant MVP milestone 2026-03-30 17:11:02 +00:00
Timmy added the phase-1, blocker, owner:cid labels 2026-03-30 17:11:02 +00:00
Member

🛣️ Fallback Path Analysis: If Metal Fails

The Risk

Issue #2 [P1-GATE] determines our path:

  • IF TheTom's llama.cpp fork has working Metal kernels → Use that
  • IF Metal kernels incomplete/missing → Fall back to MLX

Option B: MLX-TurboQuant (rachittshah/mlx-turboquant)

Status: Active Python implementation using Apple MLX framework

Pros:

  • Pure Python (easier to debug/modify)
  • Uses MLX Metal shaders automatically
  • Simpler integration than C++
  • Can wrap existing llama.cpp models

Cons:

  • Python overhead (slower than C++)
  • Not drop-in llama.cpp replacement
  • Requires model conversion to MLX format
  • Less mature than llama.cpp ecosystem

MLX Architecture

┌─────────────────────────────────────────┐
│  MLX TurboQuant                         │
│  ├── polar_quant.py (WHT rotation)      │
│  ├── qjl.py (residual quantization)     │
│  ├── codebooks.py (optimal centroids)   │
│  ├── cache.py (KV cache management)     │
│  └── attention.py (compressed attn)     │
└─────────────────────────────────────────┘
                   │
                   ▼
┌─────────────────────────────────────────┐
│  MLX Framework                          │
│  ├── Python frontend                    │
│  └── Metal GPU kernels (automatic)      │
└─────────────────────────────────────────┘

Performance Comparison (Expected)

Path                     Prefill    Decode   Maturity
llama.cpp + TurboQuant   1600 t/s   60 t/s   ⭐⭐⭐⭐⭐
MLX-TurboQuant           1200 t/s   45 t/s   ⭐⭐⭐

Tradeoff: 20-25% slower but guaranteed Metal support.

Integration Strategy

If we need MLX fallback:

# 1. Install MLX
pip install mlx

# 2. Convert model to MLX format
# (requires Qwen3.5-27B → MLX conversion)

# 3. Run with TurboQuant
python -m mlx_turboquant.server \
  --model qwen3.5-27b-mlx \
  --cache-type-k turbo3 \
  --cache-type-v turbo3 \
  --max-context 32768
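To put the tradeoff in memory terms, a rough KV-cache size estimate at 32K context. The model dimensions below are illustrative placeholders, not Qwen3.5-27B's actual config, and the estimate ignores the per-block scale metadata a real turbo3 cache would carry:

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, context, bits_per_elem):
    """Bytes for K + V caches: 2 tensors x layers x heads x dim x tokens."""
    elems = 2 * n_layers * n_kv_heads * head_dim * context
    return elems * bits_per_elem / 8

# Hypothetical dimensions for a ~27B model (illustrative only).
layers, kv_heads, dim, ctx = 64, 8, 128, 32768

fp16 = kv_cache_bytes(layers, kv_heads, dim, ctx, 16)
turbo3 = kv_cache_bytes(layers, kv_heads, dim, ctx, 3)

print(f"fp16 KV cache:   {fp16 / 2**30:.1f} GiB")
print(f"turbo3 KV cache: {turbo3 / 2**30:.1f} GiB ({fp16 / turbo3:.1f}x smaller)")
```

Whatever the exact dimensions, the ratio is fixed at 16/3 ≈ 5.3x, which is what makes a 32K context fit on a single Mac.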

Decision Matrix

Condition                          Action
TheTom fork builds + Metal works   Use llama.cpp path
Build fails                        Try upstream llama.cpp + patch
Metal kernels missing              Use MLX fallback
PPL regression > 2%                Tune bit-width or use asymmetric

Hybrid Approach

Recommendation: Don't commit to one path. Prepare both:

  1. Primary: TheTom/llama-cpp-turboquant (fastest)
  2. Fallback: rachittshah/mlx-turboquant (guaranteed Metal)

Parallel effort:

  • Cid works on llama.cpp build (Issues #3, #4)
  • Locke prepares MLX environment (backup)
  • Switch to MLX only if llama.cpp path fails

MLX Pre-Flight Checklist

If needed, verify on Mac:

# 1. MLX installation
python -c "import mlx; print(mlx.__version__)"

# 2. Metal availability
python -c "import mlx.core as mx; print(mx.metal.is_available())"

# 3. Run benchmarks
cd mlx-turboquant
python benchmarks/bench_quality.py
python benchmarks/bench_memory_speed.py

# 4. Test with small model
python -m mlx_turboquant.server --model tiny-llama

QJL Residual Correction

Note: The MLX repo has QJL (1-bit residual) but it's Python-only. The full TurboQuant paper combines:

  1. PolarQuant (3-bit) + QJL (1-bit) = 4-bit total
  2. But QJL is more complex to implement in Metal

Practical approach: Start with PolarQuant-only (turbo3 = 3-bit), add QJL later if needed.
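The PolarQuant-only starting point can be sketched end to end: rotate with a fast Walsh-Hadamard transform, then snap each value to the nearest of 8 centroids (3-bit). The centroids below are evenly spaced placeholders, not the fork's tuned turbo_centroids_3bit values, and real code would quantize per 128-element block (turbo_fwht_128):

```python
def fwht(vec):
    """Fast Walsh-Hadamard transform (length must be a power of 2)."""
    v = list(vec)
    h = 1
    while h < len(v):
        for i in range(0, len(v), h * 2):
            for j in range(i, i + h):
                a, b = v[j], v[j + h]
                v[j], v[j + h] = a + b, a - b
        h *= 2
    return v

# Placeholder 3-bit codebook: 8 evenly spaced centroids over [-1, 1].
CENTROIDS_3BIT = [-1 + (2 * k + 1) / 8 for k in range(8)]

def quantize_3bit(vec):
    """Rotate, then store a 3-bit centroid index per element plus one scale."""
    rot = fwht(vec)
    scale = max(abs(x) for x in rot) or 1.0
    idx = [min(range(8), key=lambda k: abs(x / scale - CENTROIDS_3BIT[k]))
           for x in rot]
    return idx, scale

def dequantize_3bit(idx, scale, n):
    """Look up centroids, rescale, invert the rotation (WHT/n is its inverse)."""
    rot = [CENTROIDS_3BIT[k] * scale for k in idx]
    return [x / n for x in fwht(rot)]
```

The point of the rotation is that the WHT spreads outliers across the block, so a coarse shared codebook loses less than it would on raw activations; QJL would later quantize the residual of this step with 1-bit signs.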

Summary

Path                   Effort   Risk              Reward
llama.cpp TurboQuant   High     Medium (Metal)    Fastest
MLX TurboQuant         Medium   Low (works)       Slower
Wait for upstream      Low      High (timeline)   N/A

Recommendation: Attempt llama.cpp first (this week). If Metal fails, switch to MLX without losing momentum.


Analysis of rachittshah/mlx-turboquant as contingency plan

Author
Owner

GATE CHECK: METAL SHADERS EXIST

Verdict: PROCEED with llama.cpp path.

Evidence

  • Feature branch: feature/turboquant-kv-cache (commit 64dd362)
  • ggml-metal.metal: 516 lines referencing turbo/TURBO (file grew from 10,360 to 11,662 lines)
  • ggml-metal-device.m: 26 turbo-related lines (type dispatch, env vars)
  • ggml-metal-ops.cpp: 14 turbo-related lines (op dispatch, pipeline selection)

Metal Kernels Found

  • kernel_turbo4_dequant_f16 — 4-bit dequantization
  • kernel_turbo_wht — Walsh-Hadamard Transform
  • kernel_set_rows_turbo — quantization with rotation
  • Flash attention instantiations for ALL turbo types (turbo2/3/4, dk32-dk576 variants)
  • Asymmetric K/V: q8_0 x turbo mixed pairs supported

PolarQuant Infrastructure

  • Codebooks: turbo_centroids_2bit[4], turbo_centroids_3bit[8], turbo_centroids_4bit[16]
  • Decision boundaries: turbo_mid_2bit, turbo_mid_3bit, turbo_mid_4bit
  • WHT: turbo_fwht_128(), turbo_rotate_forward(), turbo_rotate_inverse()
  • QJL sign arrays present

Runtime Features

  • TURBO_FORCE_4MAG env: 4-magnitude LUT optimization (auto on M4+)
  • TURBO_SPARSE_V env: sparse V dequant (attention-gated V skipping)
  • Profiling modes 0-4 available

Additional Experiment Branches

smem-pre-dequant, layer-adaptive, fused-centroid-decode, asymmetric-kv, speed-optimization

This is production-quality Metal work. No MLX pivot needed.

Timmy closed this issue 2026-03-30 19:40:54 +00:00