[P1-S1] PolarQuant verification checklist #5
Parent: #1 | Depends on: #4 (build)
Verify that the fork implements PolarQuant correctly BEFORE trusting any benchmark numbers. A fork that gets the rotation wrong will still compress successfully, but it will degrade quality in ways that short perplexity (PPL) benchmarks may miss.
Checklist (from spec Section 1a)
How to Verify
Acceptance Criteria
🔬 Deep Dive: PolarQuant Algorithm (Phase 1 Core)
What PolarQuant Actually Does
PolarQuant is NOT standard quantization. It is a vector quantization algorithm that applies a random rotation and then an optimal scalar quantizer to the rotated coordinates.
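As a rough illustration of "random rotation + scalar quantization", here is a minimal pure-Python sketch. It is not the fork's code. It assumes the rotation is a random sign flip followed by an orthonormal Walsh-Hadamard transform (WHT), and it uses textbook 2-bit Lloyd-Max centroids for a unit Gaussian as a stand-in codebook. All function names and the `GAUSS_2BIT` table are hypothetical.

```python
import math
import random

# Approximate textbook Lloyd-Max centroids for a unit Gaussian at 4 levels.
# Illustrative stand-in only; NOT the centroid table used by the repo.
GAUSS_2BIT = (-1.510, -0.4528, 0.4528, 1.510)

def fwht(values):
    """Orthonormal fast Walsh-Hadamard transform (length must be a power of two)."""
    out = list(values)
    n = len(out)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    h = 1
    while h < n:
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = out[i], out[i + h]
                out[i], out[i + h] = a + b, a - b
        h *= 2
    scale = 1.0 / math.sqrt(n)
    return [v * scale for v in out]

def random_signs(d, seed):
    """Fixed pseudo-random +/-1 diagonal (assumed form of the randomization)."""
    rng = random.Random(seed)
    return [rng.choice((-1.0, 1.0)) for _ in range(d)]

def rotate(x, signs):
    """Apply R = H * D: sign flip, then WHT."""
    return fwht([s * v for s, v in zip(signs, x)])

def unrotate(y, signs):
    """Apply R^T = D * H; the orthonormal WHT is its own inverse."""
    return [s * v for s, v in zip(signs, fwht(y))]

def quantize(y, centroids):
    """Map each rotated coordinate to the index of its nearest centroid."""
    return [min(range(len(centroids)), key=lambda k: abs(v - centroids[k]))
            for v in y]

def dequantize(codes, centroids):
    """Look up the centroid for each stored index."""
    return [centroids[k] for k in codes]
```

For a unit-norm input of dimension `d`, the stand-in codebook would be scaled by `1 / math.sqrt(d)` before calling `quantize`, and the reconstruction is `unrotate(dequantize(codes, centroids), signs)`. The encoder and decoder must share the same `signs`.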
Key Mathematical Insight
Why rotation helps:
Bit-width Formats (The "turbo" Family)
Optimal Centroids (High-d Limit)
For unit-norm vectors in d dimensions:
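The centroid values themselves are not reproduced here. As a hedged sketch of how such a table could be derived, the code below runs Lloyd-Max iteration under the assumption that each coordinate of a randomly rotated unit-norm vector is approximately Gaussian with variance 1/d when d is large. The function names are hypothetical, and the results are not guaranteed to match the repo's constants.

```python
import math

def _pdf(t):
    """Standard normal density."""
    return math.exp(-0.5 * t * t) / math.sqrt(2.0 * math.pi)

def _cdf(t):
    """Standard normal cumulative distribution."""
    return 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))

def lloyd_max_gaussian(levels, iterations=200):
    """Lloyd-Max centroids for a standard normal source (fixed iteration count)."""
    # Start from evenly spaced points in [-2, 2].
    centroids = [-2.0 + 4.0 * (k + 0.5) / levels for k in range(levels)]
    for _ in range(iterations):
        # Decision boundaries are midpoints between neighbouring centroids.
        mids = [(a + b) / 2.0 for a, b in zip(centroids, centroids[1:])]
        edges = [-math.inf] + mids + [math.inf]
        updated = []
        for lo, hi in zip(edges, edges[1:]):
            mass = _cdf(hi) - _cdf(lo)
            # Conditional mean of N(0, 1) restricted to (lo, hi).
            updated.append((_pdf(lo) - _pdf(hi)) / mass)
        centroids = updated
    return centroids

def centroids_for_dim(bits, d):
    """Scale unit-Gaussian centroids to the assumed N(0, 1/d) coordinate law."""
    sigma = 1.0 / math.sqrt(d)
    return [c * sigma for c in lloyd_max_gaussian(2 ** bits)]
```

Comparing the output of `centroids_for_dim(bits, d)` with the constants compiled into the fork is one way to sanity-check the centroid item on the checklist. Small differences are expected if the fork uses the exact finite-d coordinate distribution rather than the Gaussian approximation.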
Norm Correction (Critical Detail)
The paper stores norms separately and rescales. BUT there's a subtlety:
Impact: -1.17% PPL on CUDA, +1.1% on Metal vs no correction.
Memory Layout
🎯 Implications for M4 Max Build
1. Metal Kernel Requirements
Need kernels for:
2. Build Configuration
3. Expected Performance (from paper/benchmarks)
On M4 Max (extrapolated from M5/M1 data):
The tradeoff: turbo3 uses less memory but turbo4 has better decode speed on Apple Silicon.
4. The "Sparse V" Optimization
Not strictly TurboQuant, but bundled in the repo:
⚠️ Unasked Questions
Q: Why not just use Q4_0 (standard 4-bit)?
A: TurboQuant (polar) beats Q4_0 on quality at same compression:
The reason: optimal quantization of rotated coordinates outperforms uniform quantization of raw coordinates.
Q: What's the catch?
A: Computational overhead:
Q: Should we use symmetric or asymmetric K/V?
A: Test results from repo:
K controls attention routing → higher precision matters more.
📋 Build Verification Checklist
Once the Mac build is ready, verify:
Research compiled from the turboquant_plus repo and the ICLR 2026 paper
PolarQuant Verification Checklist
Result: 5/6 PASS, 1 PARTIAL PASS
⚠️ Item 1 caveat: The CPU reference quantizer for turbo4 uses a dense random orthogonal matrix (O(d²)) instead of the WHT. This is legacy code. The Metal GPU production path correctly uses the WHT for ALL types. If the CPU fallback were invoked for turbo4, it would produce data incompatible with the Metal dequantization path.
Recommendation: Flag the CPU turbo4 ref path as incompatible. For Metal inference (our use case), all items PASS.
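To make the incompatibility concrete, the following sketch (an illustration, not the repo's code) rotates a vector with a dense random orthogonal matrix and then inverts it with a WHT. Quantization is left out on purpose, because the mismatch shows up from the rotations alone. The helper names, the Gram-Schmidt construction, and the chosen dimension are assumptions.

```python
import math
import random

def fwht(values):
    """Orthonormal fast Walsh-Hadamard transform (its own inverse)."""
    out = list(values)
    n = len(out)
    h = 1
    while h < n:
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = out[i], out[i + h]
                out[i], out[i + h] = a + b, a - b
        h *= 2
    return [v / math.sqrt(n) for v in out]

def random_orthogonal(d, seed):
    """Dense random orthogonal matrix via Gram-Schmidt on Gaussian rows."""
    rng = random.Random(seed)
    rows = []
    for _ in range(d):
        v = [rng.gauss(0.0, 1.0) for _ in range(d)]
        for q in rows:
            dot = sum(a * b for a, b in zip(v, q))
            v = [a - dot * b for a, b in zip(v, q)]
        norm = math.sqrt(sum(a * a for a in v))
        rows.append([a / norm for a in v])
    return rows

def matvec(m, x):
    """Compute m @ x for a list-of-rows matrix."""
    return [sum(a * b for a, b in zip(row, x)) for row in m]

def matvec_t(m, y):
    """Compute m.T @ y, the inverse of matvec for an orthogonal m."""
    d = len(y)
    return [sum(m[i][j] * y[i] for i in range(d)) for j in range(d)]

def rel_error(x, x_hat):
    """Squared reconstruction error relative to the squared norm of x."""
    num = sum((a - b) ** 2 for a, b in zip(x, x_hat))
    return num / sum(a * a for a in x)

def round_trip_errors(d=64, seed=0):
    """Return (WHT/WHT, dense/dense, dense-encode/WHT-decode) relative errors."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in range(d)]
    q = random_orthogonal(d, seed + 1)
    matched_wht = rel_error(x, fwht(fwht(x)))
    matched_dense = rel_error(x, matvec_t(q, matvec(q, x)))
    mismatched = rel_error(x, fwht(matvec(q, x)))
    return matched_wht, matched_dense, mismatched
```

Each matched pair reconstructs the input to within floating-point error, while the mismatched pair returns a vector essentially unrelated to the input. A cheap guard along these lines, run once against the CPU and Metal paths, would catch an accidental fallback before it corrupts a benchmark run.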