PolarQuant Implementation & Phase 2 Integration Plan #18

Merged
allegro merged 4 commits from feature/polarquant-implementation into main 2026-03-30 23:49:52 +00:00
Member

This PR bridges the gap between the build spec and actual implementation. It adds the core PolarQuant C++ logic and Metal shaders needed for 128K context on M4 hardware.

gemini added 4 commits 2026-03-30 21:06:53 +00:00
gemini added 1 commit 2026-03-30 21:06:53 +00:00
gemini reviewed 2026-03-30 22:12:53 +00:00
gemini left a comment
Author
Member

🔬 Review: PolarQuant Implementation — Merge as Scaffold

Reviewer: gemini (audit pass 2026-03-30)

Algorithm Review

The WHT and Lloyd-Max quantization are implemented correctly. The encode→decode path is sound: WHT rotation → norm extraction → nearest-neighbor quantization → 4-bit packing, and inverse.
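The path described above can be sketched in plain C++ as follows. This is an illustrative sketch only, not the code from the PR: `Block4`, `encode`, `decode`, `centroid`, and `roundtrip_snr_db` are hypothetical names, the head dimension of 128 is assumed from the review, and the codebook is a uniform placeholder rather than the PR's `turbo4_centroids` table.

```cpp
#include <array>
#include <cassert>
#include <cmath>
#include <cstdint>
#include <random>

constexpr int kDim = 128;  // head dimension assumed from the review

// PLACEHOLDER codebook: 16 uniform levels spanning about +/-3 sigma, with
// sigma = 1/sqrt(128). These are NOT the PR's turbo4_centroids values.
inline float centroid(int k) {
    const float sigma = 1.0f / std::sqrt(static_cast<float>(kDim));
    return (static_cast<float>(k) - 7.5f) * 0.4f * sigma;
}

// In-place Walsh-Hadamard transform with 1/sqrt(n) scaling (self-inverse).
void fwht(float* x, int n) {
    assert(n > 0 && (n & (n - 1)) == 0);  // n must be a power of two
    for (int h = 1; h < n; h *= 2)
        for (int i = 0; i < n; i += 2 * h)
            for (int j = i; j < i + h; ++j) {
                const float a = x[j], b = x[j + h];
                x[j] = a + b;
                x[j + h] = a - b;
            }
    const float scale = 1.0f / std::sqrt(static_cast<float>(n));
    for (int i = 0; i < n; ++i) x[i] *= scale;
}

struct Block4 {
    float norm;                             // L2 norm of the rotated vector
    std::array<std::uint8_t, kDim / 2> q;   // two 4-bit indices per byte
};

Block4 encode(std::array<float, kDim> x) {  // by value: a copy is rotated
    fwht(x.data(), kDim);
    double ss = 0.0;
    for (float v : x) ss += static_cast<double>(v) * v;
    Block4 out{};
    out.norm = static_cast<float>(std::sqrt(ss));
    const float inv = out.norm > 0.0f ? 1.0f / out.norm : 0.0f;
    for (int i = 0; i < kDim; ++i) {
        const float v = x[i] * inv;         // unit-norm coordinate
        int best = 0;                       // nearest-neighbor search
        for (int k = 1; k < 16; ++k)
            if (std::fabs(v - centroid(k)) < std::fabs(v - centroid(best))) best = k;
        if (i % 2 == 0) out.q[i / 2] = static_cast<std::uint8_t>(best);        // low nibble
        else out.q[i / 2] |= static_cast<std::uint8_t>(best << 4);             // high nibble
    }
    return out;
}

std::array<float, kDim> decode(const Block4& b) {
    std::array<float, kDim> x{};
    for (int i = 0; i < kDim; ++i) {
        const int idx = (i % 2 == 0) ? (b.q[i / 2] & 0x0F) : (b.q[i / 2] >> 4);
        x[i] = centroid(idx) * b.norm;      // dequantize and rescale
    }
    fwht(x.data(), kDim);                   // inverse rotation (same transform)
    return x;
}

// Encode and decode one random Gaussian vector; return the SNR in dB.
double roundtrip_snr_db(unsigned seed) {
    std::mt19937 rng(seed);
    std::normal_distribution<float> gauss(0.0f, 1.0f);
    std::array<float, kDim> x{};
    for (float& v : x) v = gauss(rng);
    const std::array<float, kDim> y = decode(encode(x));
    double sig = 0.0, err = 0.0;
    for (int i = 0; i < kDim; ++i) {
        sig += static_cast<double>(x[i]) * x[i];
        err += static_cast<double>(x[i] - y[i]) * (x[i] - y[i]);
    }
    return 10.0 * std::log10(sig / err);
}
```

Because the normalized transform is orthonormal, the error measured on the decoded vector equals the quantization error in the rotated domain scaled by the stored norm, so `roundtrip_snr_db` reflects the codebook quality directly. A helper of this shape could also serve as the basis for the roundtrip test requested under Issues.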

What Works

  • fwht() — correct in-place Walsh-Hadamard Transform with 1/√n normalization
  • polar_quant_encode_turbo4() — correct L2 norm extraction, proper 4-bit packing
  • polar_quant_decode_turbo4() — correct inverse (IWHT = WHT, because the normalized Hadamard matrix is both symmetric and orthogonal)
  • Integration plan is clear and actionable

Issues

  1. Metal kernel kernel_attention_turbo4 is empty — The comment says "this is where the real speed win happens" but the function body is // 1. Dequantize K... // 2. Compute dot... // 3. Store score. This needs implementation before any benchmarking.

  2. Metal kernel kernel_fwht_128 is not SIMD — Despite the comment "SIMD-optimized", this uses scalar loops. Real Metal performance needs simd_shuffle_xor for butterfly operations or threadgroup shared memory.

  3. Lloyd-Max centroids are hardcoded — turbo4_centroids[16] is precomputed for N(0, 1/128). Different models with different head dimensions will need different centroids. Should be parameterized or include a calibration step.

  4. Tail centroid values are approximate — 0.2800, 0.3500 need citation. The standard Lloyd-Max centroids for N(0,σ) with 16 levels are well-documented — cite the source or document the derivation.

  5. No build integration — No CMakeLists.txt or Makefile changes. Cannot build as-is.

  6. No tests — An encode→decode roundtrip test with SNR measurement is essential and trivial to add.
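For issues 3 and 4, one way to document the derivation is to regenerate the table from the Lloyd-Max fixed-point conditions instead of hardcoding it. The sketch below is a hypothetical helper (`lloyd_max_16` is not a function in the PR) and assumes the rotated coordinates are zero-mean Gaussian with standard deviation `sigma`, for example 1/√d for head dimension d. It alternates the two optimality conditions: decision boundaries at the midpoints of adjacent centroids, and each centroid at the conditional mean of its cell.

```cpp
#include <array>
#include <cmath>
#include <limits>

// Standard normal density and cumulative distribution.
inline double pdf(double z) {
    return std::exp(-0.5 * z * z) / std::sqrt(2.0 * 3.14159265358979323846);
}
inline double cdf(double z) {
    return 0.5 * (1.0 + std::erf(z / std::sqrt(2.0)));
}

// 16-level Lloyd-Max centroids for N(0, sigma^2), in ascending order.
// Iterates on the unit-variance problem, then scales the result by sigma.
std::array<double, 16> lloyd_max_16(double sigma, int iters = 2000) {
    std::array<double, 16> c{};
    for (int k = 0; k < 16; ++k) c[k] = (k - 7.5) * 0.4;  // uniform initial guess
    const double inf = std::numeric_limits<double>::infinity();
    for (int it = 0; it < iters; ++it) {
        // Boundaries: midpoints between neighbors, open-ended at both tails.
        std::array<double, 17> t{};
        t[0] = -inf;
        t[16] = inf;
        for (int k = 1; k < 16; ++k) t[k] = 0.5 * (c[k - 1] + c[k]);
        // Centroids: E[Z | t[k] < Z < t[k+1]] for a standard normal Z.
        for (int k = 0; k < 16; ++k) {
            const double mass = cdf(t[k + 1]) - cdf(t[k]);
            c[k] = (pdf(t[k]) - pdf(t[k + 1])) / mass;
        }
    }
    for (double& v : c) v *= sigma;
    return c;
}
```

Calling `lloyd_max_16(1.0 / std::sqrt(128.0))` produces a table that can be compared entry by entry against `turbo4_centroids[16]`, including the two tail values flagged above, and the same call with a different `sigma` covers other head dimensions. Whether a Gaussian model is accurate enough for a given model's KV cache is an assumption that a calibration pass on real activations would need to confirm.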

Verdict

Merge as reference scaffold. The C++ reference implementation is correct and useful for understanding the algorithm. But the Metal kernels need real work before benchmarking, and tests are mandatory before calling Phase 2 complete.

allegro merged commit ab5ae173c2 into main 2026-03-30 23:49:52 +00:00