PolarQuant Implementation & Phase 2 Integration Plan #18
This PR bridges the gap between the build spec and actual implementation. It adds the core PolarQuant C++ logic and Metal shaders needed for 128K context on M4 hardware.
🔬 Review: PolarQuant Implementation — Merge as Scaffold
Reviewer: gemini (audit pass 2026-03-30)
Algorithm Review
The WHT and Lloyd-Max quantization are implemented correctly. The encode→decode path is sound: encoding runs WHT rotation → norm extraction → nearest-neighbor quantization → 4-bit packing, and decoding reverses those steps.
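For readers unfamiliar with the rotation step, here is a minimal sketch of a normalized in-place Walsh-Hadamard transform. It is an illustration under stated assumptions, not the PR's fwht(): the names fwht_sketch and l2_norm are hypothetical, and the input length is assumed to be a power of two. With the 1/√n scaling the transform preserves the L2 norm, and applying it twice returns the input.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical reference sketch, not the PR's fwht().
// In-place fast Walsh-Hadamard transform with 1/sqrt(n) scaling.
// x.size() must be a power of two (for example 128 for one head).
void fwht_sketch(std::vector<float>& x) {
    const std::size_t n = x.size();
    assert(n != 0 && (n & (n - 1)) == 0 && "length must be a power of two");
    // log2(n) butterfly stages; each stage pairs elements `half` apart.
    for (std::size_t half = 1; half < n; half *= 2) {
        for (std::size_t block = 0; block < n; block += 2 * half) {
            for (std::size_t i = block; i < block + half; ++i) {
                const float a = x[i];
                const float b = x[i + half];
                x[i] = a + b;
                x[i + half] = a - b;
            }
        }
    }
    // Scaling by 1/sqrt(n) makes the transform orthonormal.
    const float scale = 1.0f / std::sqrt(static_cast<float>(n));
    for (float& v : x) v *= scale;
}

// L2 norm, accumulated in double to limit rounding error.
float l2_norm(const std::vector<float>& x) {
    double acc = 0.0;
    for (float v : x) acc += static_cast<double>(v) * v;
    return static_cast<float>(std::sqrt(acc));
}
```

Calling fwht_sketch twice on the same buffer should reproduce the original values up to floating-point rounding, which is why a decoder can reuse the forward transform as the inverse.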
What Works
fwht(): correct in-place Walsh-Hadamard Transform with 1/√n normalization
polar_quant_encode_turbo4(): correct L2 norm extraction, proper 4-bit packing
polar_quant_decode_turbo4(): correct inverse (IWHT = WHT, because the normalized Hadamard matrix is both symmetric and orthogonal)
Issues
Metal kernel kernel_attention_turbo4 is empty: The comment says "this is where the real speed win happens", but the function body is only // 1. Dequantize K... // 2. Compute dot... // 3. Store score. This needs implementation before any benchmarking.
Metal kernel kernel_fwht_128 is not SIMD: Despite the "SIMD-optimized" comment, it uses scalar loops. Real Metal performance needs simd_shuffle_xor for the butterfly operations, or threadgroup shared memory.
Lloyd-Max centroids are hardcoded: turbo4_centroids[16] is precomputed for N(0, 1/128). Models with different head dimensions will need different centroids. This should be parameterized or include a calibration step.
Tail centroid values are approximate: 0.2800, 0.3500 need citation. The standard Lloyd-Max centroids for N(0,σ) with 16 levels are well-documented; cite the source or document the derivation.
No build integration: No CMakeLists.txt or Makefile changes. Cannot build as-is.
No tests: An encode→decode round-trip test with SNR measurement is essential and trivial to add.
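A round-trip test of the kind requested could be built on the following sketch. It is not the PR's code: wht, nearest, calibrate_centroids, Packed, encode, decode, and snr_db are hypothetical names, and the low-nibble-first byte layout is an assumption. Instead of the hardcoded turbo4_centroids table, calibrate_centroids runs Lloyd iterations on samples drawn from N(0, 1/dim), which is one possible form of the calibration step suggested in the issues list.

```cpp
#include <array>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <numeric>
#include <random>
#include <vector>

// All names here are hypothetical stand-ins, not the PR's functions.
using Centroids = std::array<float, 16>;

// Normalized in-place Walsh-Hadamard transform; size must be a power of two.
void wht(std::vector<float>& x) {
    const std::size_t n = x.size();
    assert(n != 0 && (n & (n - 1)) == 0);
    for (std::size_t h = 1; h < n; h *= 2)
        for (std::size_t b = 0; b < n; b += 2 * h)
            for (std::size_t i = b; i < b + h; ++i) {
                const float u = x[i], v = x[i + h];
                x[i] = u + v;
                x[i + h] = u - v;
            }
    const float s = 1.0f / std::sqrt(static_cast<float>(n));
    for (float& v : x) v *= s;
}

// Index of the centroid closest to v (linear scan over 16 levels).
int nearest(const Centroids& c, float v) {
    int best = 0;
    for (int k = 1; k < 16; ++k)
        if (std::fabs(v - c[k]) < std::fabs(v - c[best])) best = k;
    return best;
}

// Calibration stand-in: Lloyd iterations on samples drawn from N(0, 1/dim).
Centroids calibrate_centroids(std::size_t dim, std::size_t samples = 50000) {
    const float sigma = 1.0f / std::sqrt(static_cast<float>(dim));
    std::mt19937 rng(42);
    std::normal_distribution<float> gauss(0.0f, sigma);
    std::vector<float> data(samples);
    for (float& v : data) v = gauss(rng);
    Centroids c{};
    for (int k = 0; k < 16; ++k) c[k] = sigma * (-3.0f + 0.4f * k);  // uniform start
    for (int iter = 0; iter < 40; ++iter) {
        std::array<double, 16> sum{};
        std::array<int, 16> count{};
        for (float v : data) {
            const int k = nearest(c, v);
            sum[k] += v;
            ++count[k];
        }
        for (int k = 0; k < 16; ++k)
            if (count[k] > 0) c[k] = static_cast<float>(sum[k] / count[k]);
    }
    return c;
}

struct Packed {
    float norm;                         // L2 norm of the input vector
    std::vector<unsigned char> nibbles; // two 4-bit indices per byte
};

// WHT rotation, norm extraction, nearest-centroid quantization, 4-bit packing.
Packed encode(std::vector<float> x, const Centroids& c) {
    wht(x);
    const float norm = std::sqrt(std::inner_product(x.begin(), x.end(), x.begin(), 0.0f));
    const float inv = norm > 0.0f ? 1.0f / norm : 0.0f;
    Packed p{norm, std::vector<unsigned char>(x.size() / 2)};
    for (std::size_t i = 0; i + 1 < x.size(); i += 2)
        p.nibbles[i / 2] = static_cast<unsigned char>(
            nearest(c, x[i] * inv) | (nearest(c, x[i + 1] * inv) << 4));
    return p;
}

// Unpack, look up centroids, rescale by the norm, and apply the WHT again.
std::vector<float> decode(const Packed& p, const Centroids& c) {
    std::vector<float> x(p.nibbles.size() * 2);
    for (std::size_t i = 0; i < p.nibbles.size(); ++i) {
        x[2 * i] = c[p.nibbles[i] & 0x0F] * p.norm;
        x[2 * i + 1] = c[p.nibbles[i] >> 4] * p.norm;
    }
    wht(x);
    return x;
}

// Signal-to-noise ratio of a reconstruction, in decibels.
double snr_db(const std::vector<float>& ref, const std::vector<float>& rec) {
    double sig = 0.0, err = 0.0;
    for (std::size_t i = 0; i < ref.size(); ++i) {
        const double d = static_cast<double>(ref[i]) - rec[i];
        sig += static_cast<double>(ref[i]) * ref[i];
        err += d * d;
    }
    return 10.0 * std::log10(sig / err);
}
```

A test would calibrate for dim = 128, encode and decode a random vector, and assert that snr_db stays above a chosen floor. The floor itself is a project decision and is not specified in this review.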
Verdict
Merge as reference scaffold. The C++ reference implementation is correct and useful for understanding the algorithm. But the Metal kernels need real work before benchmarking, and tests are mandatory before calling Phase 2 complete.