[P1-GATE] Metal kernel check — determines llama.cpp vs MLX path #2

Closed
opened 2026-03-30 17:11:02 +00:00 by Timmy · 2 comments
Owner

Parent: #1

BLOCKING — Do this before anything else

This is the single highest-risk assumption in the spec. The check takes 2 minutes and determines the entire build strategy.

Action

git clone https://github.com/TheTom/llama-cpp-turboquant.git
cd llama-cpp-turboquant

# Check for Metal shader files referencing TurboQuant/PolarQuant
grep -rn "turbo\|polar\|turboquant\|polarquant" ggml/src/ggml-metal* 2>/dev/null
grep -rn "turbo\|polar" ggml/src/ggml-metal.metal 2>/dev/null

# Check for Metal kernel dispatch for turbo KV types
grep -rn "GGML_TYPE_.*TURBO\|turbo.*metal\|kv.*turbo" . --include="*.m" --include="*.metal" --include="*.h" 2>/dev/null
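The greps above can also be wrapped in a small script that prints a single YES/NO verdict. A minimal sketch: the patterns and file extensions mirror the grep commands, and metal_kernels_present is a hypothetical helper, not part of either repo.

```python
import os
import re

# Patterns mirroring the grep commands above (case-insensitive).
TURBO_PATTERN = re.compile(r"turbo|polar", re.IGNORECASE)
EXTENSIONS = (".metal", ".m", ".h")

def metal_kernels_present(repo_root):
    """Return True if any Metal-related source file mentions turbo/polar."""
    for dirpath, _dirnames, filenames in os.walk(repo_root):
        for name in filenames:
            if not name.endswith(EXTENSIONS):
                continue
            path = os.path.join(dirpath, name)
            try:
                with open(path, encoding="utf-8", errors="ignore") as f:
                    if TURBO_PATTERN.search(f.read()):
                        return True
            except OSError:
                continue
    return False

if __name__ == "__main__":
    verdict = metal_kernels_present("llama-cpp-turboquant")
    print("Metal shaders present:", "YES" if verdict else "NO")
```

Note this only checks that the strings appear somewhere in Metal-side sources; the manual greps above are still needed to judge whether the kernels are complete.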

Decision

  • Metal shaders exist: Proceed with llama.cpp fork path (primary)
  • Metal shaders DO NOT exist: MLX becomes PRIMARY path, not fallback. Switch to rachittshah/mlx-turboquant immediately. Reframe Phase 1 around MLX + API proxy.

Acceptance Criteria

  • Cloned TheTom/llama-cpp-turboquant
  • Searched for Metal TurboQuant/PolarQuant kernels
  • Reported: Metal shaders present YES/NO
  • If NO: flagged for path change before any more work
Timmy added this to the Phase 1 — PolarQuant MVP milestone 2026-03-30 17:11:02 +00:00
Timmy added the phase-1, blocker, owner:cid labels 2026-03-30 17:11:02 +00:00
Member

🛣️ Fallback Path Analysis: If Metal Fails

The Risk

Issue #2 [P1-GATE] determines our path:

  • IF TheTom's llama.cpp fork has working Metal kernels → Use that
  • IF Metal kernels incomplete/missing → Fall back to MLX

Option B: MLX-TurboQuant (rachittshah/mlx-turboquant)

Status: Active Python implementation using Apple MLX framework

Pros:

  • Pure Python (easier to debug/modify)
  • Uses MLX Metal shaders automatically
  • Simpler integration than C++
  • Can wrap existing llama.cpp models

Cons:

  • Python overhead (slower than C++)
  • Not drop-in llama.cpp replacement
  • Requires model conversion to MLX format
  • Less mature than llama.cpp ecosystem

MLX Architecture

┌─────────────────────────────────────────┐
│  MLX TurboQuant                         │
│  ├── polar_quant.py (WHT rotation)      │
│  ├── qjl.py (residual quantization)     │
│  ├── codebooks.py (optimal centroids)   │
│  ├── cache.py (KV cache management)     │
│  └── attention.py (compressed attn)     │
└─────────────────────────────────────────┘
                   │
                   ▼
┌─────────────────────────────────────────┐
│  MLX Framework                          │
│  ├── Python frontend                    │
│  └── Metal GPU kernels (automatic)      │
└─────────────────────────────────────────┘

Performance Comparison (Expected)

Path                     Prefill    Decode   Maturity
llama.cpp + TurboQuant   1600 t/s   60 t/s   ⭐⭐⭐⭐⭐
MLX-TurboQuant           1200 t/s   45 t/s   ⭐⭐⭐

Tradeoff: 20-25% slower but guaranteed Metal support.

Integration Strategy

If we need MLX fallback:

# 1. Install MLX
pip install mlx

# 2. Convert model to MLX format
# (requires Qwen3.5-27B → MLX conversion)

# 3. Run with TurboQuant
python -m mlx_turboquant.server \
  --model qwen3.5-27b-mlx \
  --cache-type-k turbo3 \
  --cache-type-v turbo3 \
  --max-context 32768
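To put the tradeoff in memory terms, a rough KV-cache size estimate at 32K context. The model dimensions below are illustrative placeholders, not Qwen3.5-27B's actual config, and the estimate ignores the per-block scale metadata a real turbo3 cache would carry:

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, context, bits_per_elem):
    """Bytes for K + V caches: 2 tensors x layers x heads x dim x tokens."""
    elems = 2 * n_layers * n_kv_heads * head_dim * context
    return elems * bits_per_elem / 8

# Hypothetical dimensions for a ~27B model (illustrative only).
layers, kv_heads, dim, ctx = 64, 8, 128, 32768

fp16 = kv_cache_bytes(layers, kv_heads, dim, ctx, 16)
turbo3 = kv_cache_bytes(layers, kv_heads, dim, ctx, 3)

print(f"fp16 KV cache:   {fp16 / 2**30:.1f} GiB")
print(f"turbo3 KV cache: {turbo3 / 2**30:.1f} GiB ({fp16 / turbo3:.1f}x smaller)")
```

Whatever the exact dimensions, the ratio is fixed at 16/3 ≈ 5.3x, which is what makes a 32K context fit on a single Mac.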

Decision Matrix

Condition                          Action
TheTom fork builds + Metal works   Use llama.cpp path
Build fails                        Try upstream llama.cpp + patch
Metal kernels missing              Use MLX fallback
PPL regression > 2%                Tune bit-width or use asymmetric

Hybrid Approach

Recommendation: Don't commit to one path. Prepare both:

  1. Primary: TheTom/llama-cpp-turboquant (fastest)
  2. Fallback: rachittshah/mlx-turboquant (guaranteed Metal)

Parallel effort:

  • Cid works on llama.cpp build (Issues #3, #4)
  • Locke prepares MLX environment (backup)
  • Switch to MLX only if llama.cpp path fails

MLX Pre-Flight Checklist

If needed, verify on Mac:

# 1. MLX installation
python -c "import mlx; print(mlx.__version__)"

# 2. Metal availability
python -c "import mlx.core as mx; print(mx.metal.is_available())"

# 3. Run benchmarks
cd mlx-turboquant
python benchmarks/bench_quality.py
python benchmarks/bench_memory_speed.py

# 4. Test with small model
python -m mlx_turboquant.server --model tiny-llama

QJL Residual Correction

Note: The MLX repo has QJL (1-bit residual) but it's Python-only. The full TurboQuant paper combines:

  1. PolarQuant (3-bit) + QJL (1-bit) = 4-bit total
  2. But QJL is more complex to implement in Metal

Practical approach: Start with PolarQuant-only (turbo3 = 3-bit), add QJL later if needed.
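The PolarQuant-only starting point can be sketched end to end: rotate with a fast Walsh-Hadamard transform, then snap each value to the nearest of 8 centroids (3-bit). The centroids below are evenly spaced placeholders, not the fork's tuned turbo_centroids_3bit values, and real code would quantize per 128-element block (turbo_fwht_128):

```python
def fwht(vec):
    """Fast Walsh-Hadamard transform (length must be a power of 2)."""
    v = list(vec)
    h = 1
    while h < len(v):
        for i in range(0, len(v), h * 2):
            for j in range(i, i + h):
                a, b = v[j], v[j + h]
                v[j], v[j + h] = a + b, a - b
        h *= 2
    return v

# Placeholder 3-bit codebook: 8 evenly spaced centroids over [-1, 1].
CENTROIDS_3BIT = [-1 + (2 * k + 1) / 8 for k in range(8)]

def quantize_3bit(vec):
    """Rotate, then store a 3-bit centroid index per element plus one scale."""
    rot = fwht(vec)
    scale = max(abs(x) for x in rot) or 1.0
    idx = [min(range(8), key=lambda k: abs(x / scale - CENTROIDS_3BIT[k]))
           for x in rot]
    return idx, scale

def dequantize_3bit(idx, scale, n):
    """Look up centroids, rescale, invert the rotation (WHT/n is its inverse)."""
    rot = [CENTROIDS_3BIT[k] * scale for k in idx]
    return [x / n for x in fwht(rot)]
```

The point of the rotation is that the WHT spreads outliers across the block, so a coarse shared codebook loses less than it would on raw activations; QJL would later quantize the residual of this step with 1-bit signs.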

Summary

Path                   Effort   Risk              Reward
llama.cpp TurboQuant   High     Medium (Metal)    Fastest
MLX TurboQuant         Medium   Low (works)       Slower
Wait for upstream      Low      High (timeline)   N/A

Recommendation: Attempt llama.cpp first (this week). If Metal fails, switch to MLX without losing momentum.


Analysis of rachittshah/mlx-turboquant as contingency plan

Author
Owner

GATE CHECK: METAL SHADERS EXIST

Verdict: PROCEED with llama.cpp path.

Evidence

  • Feature branch: feature/turboquant-kv-cache (commit 64dd362)
  • ggml-metal.metal: 516 lines referencing turbo/TURBO (file grew from 10,360 to 11,662 lines)
  • ggml-metal-device.m: 26 turbo-related lines (type dispatch, env vars)
  • ggml-metal-ops.cpp: 14 turbo-related lines (op dispatch, pipeline selection)

Metal Kernels Found

  • kernel_turbo4_dequant_f16 — 4-bit dequantization
  • kernel_turbo_wht — Walsh-Hadamard Transform
  • kernel_set_rows_turbo — quantization with rotation
  • Flash attention instantiations for ALL turbo types (turbo2/3/4, dk32-dk576 variants)
  • Asymmetric K/V: q8_0 x turbo mixed pairs supported

PolarQuant Infrastructure

  • Codebooks: turbo_centroids_2bit[4], turbo_centroids_3bit[8], turbo_centroids_4bit[16]
  • Decision boundaries: turbo_mid_2bit, turbo_mid_3bit, turbo_mid_4bit
  • WHT: turbo_fwht_128(), turbo_rotate_forward(), turbo_rotate_inverse()
  • QJL sign arrays present

Runtime Features

  • TURBO_FORCE_4MAG env: 4-magnitude LUT optimization (auto on M4+)
  • TURBO_SPARSE_V env: sparse V dequant (attention-gated V skipping)
  • Profiling modes 0-4 available

Additional Experiment Branches

smem-pre-dequant, layer-adaptive, fused-centroid-decode, asymmetric-kv, speed-optimization

This is production-quality Metal work. No MLX pivot needed.

Timmy closed this issue 2026-03-30 19:40:54 +00:00