[P1-GATE] Metal kernel check — determines llama.cpp vs MLX path #2
Parent: #1
BLOCKING — Do this before anything else
The single highest-risk assumption in the spec. The check takes about two minutes and determines the entire build strategy.
Action
Decision
Acceptance Criteria
🛣️ Fallback Path Analysis: If Metal Fails
The Risk
Issue #2 [P1-GATE] determines our path:
Option B: MLX-TurboQuant (rachittshah/mlx-turboquant)
Status: Active Python implementation using Apple MLX framework
Pros:
Cons:
MLX Architecture
Performance Comparison (Expected)
Tradeoff: 20-25% slower but guaranteed Metal support.
Integration Strategy
If we need MLX fallback:
Decision Matrix
Hybrid Approach
Recommendation: Don't commit to one path. Prepare both:
Parallel effort:
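The hybrid plan can be wired directly to the gate check itself. A minimal sketch (the function names and the grep-style heuristic are hypothetical, not part of either repo; a real gate would compile and run the kernels):

```python
def metal_kernels_present(metal_src: str) -> bool:
    # Gate check in script form: does the Metal source mention the
    # turbo kernels at all? (A grep-style heuristic, not a build test.)
    try:
        with open(metal_src, encoding="utf-8", errors="replace") as f:
            return "turbo" in f.read().lower()
    except OSError:
        return False

def pick_backend(metal_src: str) -> str:
    # Hypothetical dispatcher for the hybrid plan: llama.cpp when the
    # kernels exist, mlx-turboquant otherwise.
    return "llama.cpp" if metal_kernels_present(metal_src) else "mlx-turboquant"
```

Pointing `pick_backend` at the branch's `ggml-metal.metal` (path depends on the checkout layout) keeps both paths live until the gate actually passes.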
MLX Pre-Flight Checklist
If needed, verify on Mac:
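A pre-flight script along these lines could confirm MLX is usable before committing to the pivot. This is a sketch that assumes only the public `mlx.core.default_device()` API and degrades gracefully on machines without MLX:

```python
def check_mlx() -> bool:
    # Pre-flight: can we import MLX, and is the default device the GPU?
    # Assumes only the public mlx.core.default_device() API.
    try:
        import mlx.core as mx  # Apple MLX; realistically Apple-silicon only
    except ImportError:
        print("MLX not installed -- the fallback path is unavailable here")
        return False
    device = mx.default_device()
    print(f"MLX default device: {device}")
    return "gpu" in str(device).lower()

if __name__ == "__main__":
    print("MLX fallback viable:", check_mlx())
```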
QJL Residual Correction
Note: The MLX repo has QJL (1-bit residual), but it's Python-only. The full TurboQuant paper combines the coarse PolarQuant codebook with a QJL 1-bit residual correction on top.
Practical approach: Start with PolarQuant-only (turbo3 = 3-bit), add QJL later if needed.
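To make the residual idea concrete, here is an illustrative NumPy sketch of a 3-bit codebook pass followed by a sign-only (1-bit) residual correction. The codebook and scaling below are placeholders, not the paper's or the repo's actual values:

```python
import numpy as np

def quantize_3bit(x, centroids):
    # Nearest-centroid 3-bit quantization (8 levels).
    idx = np.abs(x[:, None] - centroids[None, :]).argmin(axis=1)
    return centroids[idx]

def residual_1bit(x, xq):
    # Sign-only residual, scaled by the mean error magnitude
    # (QJL-flavoured illustration, not the paper's exact scheme).
    r = x - xq
    return np.sign(r) * np.abs(r).mean()

rng = np.random.default_rng(0)
x = rng.standard_normal(1024).astype(np.float32)
# Placeholder 8-entry codebook; the real turbo_centroids_3bit values
# live in the Metal source.
centroids = np.linspace(x.min(), x.max(), 8)

xq = quantize_3bit(x, centroids)
err_base = float(np.mean((x - xq) ** 2))
err_resid = float(np.mean((x - (xq + residual_1bit(x, xq))) ** 2))
print(f"MSE, 3-bit only: {err_base:.5f}; with 1-bit residual: {err_resid:.5f}")
```

The sign-only correction can never increase mean squared error (it subtracts exactly the mean residual magnitude along each sign), which is why it is safe to bolt on later.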
Summary
Recommendation: Attempt llama.cpp first (this week). If Metal fails, switch to MLX without losing momentum.
Analysis of rachittshah/mlx-turboquant as contingency plan
GATE CHECK: METAL SHADERS EXIST ✅
Verdict: PROCEED with llama.cpp path.
Evidence
Branch feature/turboquant-kv-cache (commit 64dd362):
- ggml-metal.metal: 516 lines referencing turbo/TURBO (file grew from 10,360 to 11,662 lines)
- ggml-metal-device.m: 26 turbo-related lines (type dispatch, env vars)
- ggml-metal-ops.cpp: 14 turbo-related lines (op dispatch, pipeline selection)

Metal Kernels Found
- kernel_turbo4_dequant_f16 — 4-bit dequantization
- kernel_turbo_wht — Walsh-Hadamard Transform
- kernel_set_rows_turbo — quantization with rotation

PolarQuant Infrastructure
- turbo_centroids_2bit[4], turbo_centroids_3bit[8], turbo_centroids_4bit[16]
- turbo_mid_2bit, turbo_mid_3bit, turbo_mid_4bit
- turbo_fwht_128(), turbo_rotate_forward(), turbo_rotate_inverse()

Runtime Features
- TURBO_FORCE_4MA env: 4-magnitude LUT optimization (auto on M4+)
- TURBO_SPARSE_V env: sparse V dequant (attention-gated V skipping)

Additional Experiment Branches
- smem-pre-dequant, layer-adaptive, fused-centroid-decode, asymmetric-kv, speed-optimization
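For reference, the rotation behind turbo_fwht_128() / turbo_rotate_forward() is the fast Walsh-Hadamard butterfly. A NumPy sketch of the same transform at length 128 (the Metal kernel's exact layout and normalization may differ):

```python
import numpy as np

def fwht(x):
    # Iterative fast Walsh-Hadamard transform; length must be a power of two.
    # The branch's turbo_fwht_128() applies this butterfly at length 128.
    y = np.asarray(x, dtype=np.float64).copy()
    n = len(y)
    h = 1
    while h < n:
        for i in range(0, n, 2 * h):
            a = y[i:i + h].copy()
            b = y[i + h:i + 2 * h].copy()
            y[i:i + h] = a + b
            y[i + h:i + 2 * h] = a - b
        h *= 2
    return y

# The unnormalized transform is its own inverse up to a factor of n,
# which is what makes a cheap forward/inverse rotation pair possible:
v = np.random.default_rng(1).standard_normal(128)
roundtrip = fwht(fwht(v)) / 128.0
print("max round-trip error:", np.abs(roundtrip - v).max())
```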
This is production-quality Metal work. No MLX pivot needed.