[P1-S1] PolarQuant verification checklist #5

Closed
opened 2026-03-30 17:11:07 +00:00 by Timmy · 2 comments
Owner

Parent: #1 | Depends on: #4 (build)

Verify the fork implements PolarQuant correctly BEFORE trusting any benchmark numbers. A fork that gets the rotation wrong will compress successfully but degrade quality in ways short PPL benchmarks may miss.

Checklist (from spec Section 1a)

  • Rotation is WHT or equivalent structured orthogonal (NOT learned, NOT dense random)
  • Same rotation matrix used for quantization and dequantization
  • Codebook is Lloyd-Max (NOT uniform) — boundaries precomputed for post-WHT distribution
  • Radius stored separately at FP16+ precision
  • No per-vector normalization constants stored (this is the whole point of PolarQuant)
  • Dequant path in Metal shader matches the quantization path exactly

How to Verify

  • Read the fork's PolarQuant implementation source
  • Compare against turboquant_plus reference implementation (511+ tests)
  • Check Metal kernel dequant matches CPU path
  • If using uniform quantization instead of Lloyd-Max: flag as quality regression
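Items 1 and 2 also admit a quick property test: a sign-flipped Walsh-Hadamard rotation is orthonormal and self-inverse, so forward followed by inverse must reconstruct the input exactly. A minimal numpy sketch (function names are illustrative, not the fork's):

```python
import numpy as np

def fwht(x):
    """Unnormalized fast Walsh-Hadamard transform (butterfly), O(d log d)."""
    x = x.copy()
    h = 1
    while h < len(x):
        for i in range(0, len(x), 2 * h):
            a = x[i:i + h].copy()
            b = x[i + h:i + 2 * h].copy()
            x[i:i + h] = a + b
            x[i + h:i + 2 * h] = a - b
        h *= 2
    return x

rng = np.random.default_rng(0)
d = 128
signs = rng.choice([-1.0, 1.0], size=d)    # fixed random sign flips

def rotate_forward(x):
    return fwht(signs * x) / np.sqrt(d)    # orthonormal rotation

def rotate_inverse(y):
    return signs * fwht(y) / np.sqrt(d)    # same signs, same butterfly

x = rng.standard_normal(d)
assert np.allclose(rotate_inverse(rotate_forward(x)), x)
```

If the fork's forward and inverse paths use different sign arrays or a non-self-inverse transform, this round trip fails immediately, which is cheaper to check than a PPL run.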

Acceptance Criteria

  • All 6 checklist items verified YES/NO with evidence
  • Any deviations from paper documented
Timmy added this to the Phase 1 — PolarQuant MVP milestone 2026-03-30 17:11:07 +00:00
Timmy added the research, phase-1, owner:cid labels 2026-03-30 17:11:07 +00:00
Member

🔬 Deep Dive: PolarQuant Algorithm (Phase 1 Core)

What PolarQuant Actually Does

PolarQuant is NOT standard quantization. It's a vector quantization algorithm using random rotation + optimal scalar quantization.

Input: Vector x ∈ R^d (e.g., 128-dim attention head)
       │
       ▼
┌─────────────────────────────────────┐
│ 1. EXTRACT NORM                     │
│    norm = ||x||₂                    │
│    x_unit = x / norm                │
├─────────────────────────────────────┤
│ 2. RANDOM ROTATION (WHT)            │
│    y = Π @ x_unit                   │
│    (Walsh-Hadamard + random signs)  │
├─────────────────────────────────────┤
│ 3. OPTIMAL SCALAR QUANTIZATION      │
│    After rotation, coordinates      │
│    follow Beta(d/2, d/2) dist       │
│    → known optimal centroids        │
├─────────────────────────────────────┤
│ 4. INDEX ENCODING                   │
│    Store: indices (b-bit) + norm    │
└─────────────────────────────────────┘
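The four stages above can be sketched end to end in numpy. This is an illustrative toy, not the fork's code: it assumes the turbo2-style 2-bit centroids ±0.453/√d, ±1.51/√d and builds the WHT as an explicit (Sylvester-constructed, self-inverse) matrix rather than a butterfly kernel.

```python
import numpy as np

d = 128
rng = np.random.default_rng(1)
signs = rng.choice([-1.0, 1.0], size=d)

# Sylvester construction: orthonormal, symmetric (hence self-inverse) WHT matrix
H = np.array([[1.0]])
for _ in range(7):                  # 2^7 = 128
    H = np.block([[H, H], [H, -H]])
H /= np.sqrt(d)

# illustrative 2-bit codebook (assumed turbo2-style values)
centroids = np.array([-1.51, -0.453, 0.453, 1.51]) / np.sqrt(d)

def quantize(x):
    norm = np.linalg.norm(x)                             # 1. extract norm
    y = H @ (signs * (x / norm))                         # 2. random rotation
    idx = np.abs(y[:, None] - centroids).argmin(axis=1)  # 3. nearest centroid
    return idx.astype(np.uint8), norm                    # 4. indices + norm

def dequantize(idx, norm):
    y_hat = centroids[idx]
    return signs * (H @ y_hat) * norm                    # inverse rotation + rescale

x = rng.standard_normal(d)
idx, norm = quantize(x)
x_hat = dequantize(idx, norm)
```

Note that the same `signs` and `H` appear in both paths; that is exactly checklist item 2.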

Key Mathematical Insight

Why rotation helps:

  • Raw KV cache coordinates are correlated and non-Gaussian
  • Random rotation Gaussianizes the distribution (by CLT)
  • After rotation, each coordinate is i.i.d. Beta(d/2, d/2)
  • Beta distribution has KNOWN optimal quantization points
  • This enables per-coordinate optimal quantization

Bit-width Formats (The "turbo" Family)

Format  Bits   Compression  Use Case
turbo2  2-bit  6.4x         Extreme memory pressure
turbo3  3-bit  4.6x         Balanced (recommended)
turbo4  4-bit  3.8x         Quality-critical

Optimal Centroids (High-d Limit)

For unit-norm vectors in d dimensions:

# turbo2 (2-bit = 4 centroids)
centroids = ±0.453/√d, ±1.51/√d

# turbo3 (3-bit = 8 centroids)
centroids = computed via Lloyd's algorithm on Beta(d/2, d/2)

# turbo4 (4-bit = 16 centroids)
centroids = computed via Lloyd's algorithm on Beta(d/2, d/2)
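The 8- and 16-entry codebooks can be reproduced numerically. A sketch of scalar Lloyd-Max on N(0, 1/128) samples, used here as a stand-in for the post-rotation coordinate distribution (the helper name is illustrative):

```python
import numpy as np

def lloyd_max_1d(samples, k, iters=40):
    """Scalar Lloyd-Max: alternate nearest-centroid assignment with
    centroid update (conditional mean of each quantization cell)."""
    c = np.quantile(samples, (np.arange(k) + 0.5) / k)   # quantile init
    for _ in range(iters):
        idx = np.abs(samples[:, None] - c).argmin(axis=1)
        for j in range(k):
            cell = samples[idx == j]
            if cell.size:
                c[j] = cell.mean()
    return np.sort(c)

rng = np.random.default_rng(2)
# stand-in for the post-WHT coordinate distribution: N(0, 1/128)
samples = rng.normal(0.0, 1.0 / np.sqrt(128), size=100_000)
codebook8 = lloyd_max_1d(samples, k=8)    # a turbo3-style 3-bit codebook
```

Running the same routine with k=4 recovers the 2-bit values quoted above (±0.453/√d, ±1.51/√d), which is a useful cross-check against a fork's hard-coded tables.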

Norm Correction (Critical Detail)

The paper stores norms separately and rescales. BUT there's a subtlety:

# After dequantization, the reconstructed y_hat may have 
# norm ≠ 1 due to quantization error. Re-normalizing before
# inverse rotation improves quality significantly.

if norm_correction:
    y_hat = y_hat / ||y_hat||  # Re-normalize
    
x_hat = Π.T @ y_hat * norm    # Inverse rotation + rescale

Impact: -1.17% PPL on CUDA, +1.1% on Metal vs no correction.

Memory Layout

For turbo3 (3-bit) on 128-dim head:
- Indices: 128 × 3 bits = 48 bytes
- Norm: 4 bytes (fp32)
- Total: 52 bytes vs 256 bytes (fp16) = 4.9x compression

With 4-mag LUT optimization:
- Store block-max magnitude (4 values)
- Reduces norm overhead
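The layout arithmetic above can be checked directly (numbers from this section):

```python
d = 128
idx_bytes  = d * 3 // 8               # 3-bit indices, packed: 48 bytes
norm_bytes = 4                        # one fp32 norm per group
total      = idx_bytes + norm_bytes   # 52 bytes
fp16_bytes = d * 2                    # 256 bytes at fp16
ratio      = fp16_bytes / total       # ~4.9x compression
```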

🎯 Implications for M4 Max Build

1. Metal Kernel Requirements

Need kernels for:

  • WHT rotation (forward and inverse)
  • Centroid lookup (table-driven dequant)
  • Block-wise norm computation
  • 4-mag LUT (M1/M2/M3/M4 optimization)

2. Build Configuration

# Critical CMake flags
cmake -B build \
  -DLLAMA_METAL=ON \
  -DLLAMA_METAL_EMBED_LIBRARY=ON \
  -DCMAKE_BUILD_TYPE=Release \
  -DCMAKE_OSX_ARCHITECTURES=arm64

3. Expected Performance (from paper/benchmarks)

On M4 Max (extrapolated from M5/M1 data):

Config  Prefill (tok/s)  Decode (tok/s)  Context
q8_0    ~1600            ~65             32K
turbo4  ~1600 (+0%)      ~60 (-7%)       32K
turbo3  ~1700 (+6%)      ~50 (-23%)      32K

The tradeoff: turbo3 uses less memory but turbo4 has better decode speed on Apple Silicon.

4. The "Sparse V" Optimization

Not strictly TurboQuant, but bundled in the repo:

  • Skip V dequantization where attention weight < 1e-6
  • At long context, ~50% of positions are negligible
  • +22.8% decode speed at 32K context
  • No PPL degradation (validated)
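A minimal numpy sketch of the idea, with hypothetical helper names (the repo's actual kernel does this inside the attention loop, not as a separate pass):

```python
import numpy as np

def sparse_v_mix(attn, dequant_row, d, threshold=1e-6):
    """Attention-weighted sum over V that calls dequant_row(i) only for
    positions whose weight clears the threshold (hypothetical helper)."""
    out = np.zeros(d)
    for i in np.nonzero(attn > threshold)[0]:
        out += attn[i] * dequant_row(i)   # dequantization happens lazily here
    return out

rng = np.random.default_rng(3)
n, d = 4096, 128
V = rng.standard_normal((n, d))
logits = rng.normal(0.0, 4.0, size=n)     # peaked attention -> many tiny weights
attn = np.exp(logits - logits.max())
attn /= attn.sum()

dense  = attn @ V
sparse = sparse_v_mix(attn, lambda i: V[i], d)
```

Because the dropped positions carry at most `n * threshold` of the total attention mass, the output is numerically indistinguishable from the dense sum, which is why no PPL degradation is expected.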

⚠️ Unasked Questions

Q: Why not just use Q4_0 (standard 4-bit)?

A: TurboQuant (polar) beats Q4_0 on quality at same compression:

  • turbo4: +0.23% PPL vs q8_0
  • q4_0: +0.52% PPL vs q8_0

Because optimal quantization on rotated coordinates > uniform quantization on raw coordinates.
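The gap can be illustrated with a stylized scalar experiment (not the repo's benchmark): heavy-tailed "raw" coordinates quantized on an absmax-scaled uniform 2-bit grid, versus Gaussianized coordinates quantized with the 2-bit Lloyd-Max codebook.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 512_000
raw = rng.laplace(scale=1 / np.sqrt(2), size=n)  # heavy-tailed, unit variance
rot = rng.standard_normal(n)                     # post-rotation: ~Gaussian (CLT)

# uniform 2-bit grid with per-block absmax scaling on the raw samples
blocks = raw.reshape(-1, 32)
scale = np.abs(blocks).max(axis=1, keepdims=True)
grid = np.array([-1.0, -1 / 3, 1 / 3, 1.0])      # 4 evenly spaced levels in [-1, 1]
idx = np.abs(blocks[:, :, None] / scale[:, :, None] - grid).argmin(axis=2)
mse_uniform = np.mean((blocks - grid[idx] * scale) ** 2)

# Lloyd-Max 2-bit codebook on the Gaussianized samples
lloyd = np.array([-1.510, -0.453, 0.453, 1.510])
idx2 = np.abs(rot[:, None] - lloyd).argmin(axis=1)
mse_lloyd = np.mean((rot - lloyd[idx2]) ** 2)    # markedly lower MSE
```

The uniform grid wastes levels on outliers that set the absmax scale, while Lloyd-Max places levels where the Gaussianized mass actually sits.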

Q: What's the catch?

A: Computational overhead:

  • Must do WHT rotation every forward pass
  • Extra memory bandwidth for rotation matrices
  • Worth it for long context (bandwidth-bound), less so for short

Q: Should we use symmetric or asymmetric K/V?

A: Test results from repo:

  • Symmetric (turbo3/turbo3): Best for Q8_0+ models
  • Asymmetric (q8_0-K / turbo4-V): Best for Q4_K_M models

K controls attention routing → higher precision matters more.


📋 Build Verification Checklist

Once Mac build is ready, verify:

  1. WHT rotation produces Gaussianized output (kurtosis ~3)
  2. Centroid lookup matches Python reference
  3. Norm correction active (check PPL diff)
  4. 4-mag LUT auto-detected on M4
  5. Sparse V provides speedup at 16K+ context
  6. PPL regression < 1.0% vs q8_0
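Item 1 can be sanity-checked offline: rotating heavy-tailed input with a sign-flipped WHT should pull the pooled coordinate kurtosis to ~3. A numpy sketch, using a Sylvester-constructed Hadamard matrix as a stand-in for the fork's butterfly kernel:

```python
import numpy as np

d = 128
H = np.array([[1.0]])
for _ in range(7):                  # Sylvester construction: 2^7 = 128
    H = np.block([[H, H], [H, -H]])
H /= np.sqrt(d)                     # orthonormal (and symmetric) WHT matrix

rng = np.random.default_rng(5)
signs = rng.choice([-1.0, 1.0], size=d)
X = rng.laplace(size=(4000, d))     # heavy-tailed input: kurtosis ~6
Y = (X * signs) @ H                 # rotated coordinates

def kurtosis(a):
    a = a.ravel()
    return np.mean((a - a.mean()) ** 4) / np.var(a) ** 2

k_before, k_after = kurtosis(X), kurtosis(Y)   # roughly 6 -> 3
```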

Research compiled from turboquant_plus repo and ICLR 2026 paper

Author
Owner

PolarQuant Verification Checklist

  1. WHT rotation (not learned/dense): PARTIAL PASS
     Evidence: the Metal GPU path uses the WHT butterfly (turbo_fwht_128); the CPU turbo4 reference uses a dense random matrix (legacy, not the production path).
  2. Same rotation for quant and dequant: PASS
     Evidence: turbo_rotate_forward() ↔ turbo_rotate_inverse() use the same sign arrays, and the butterfly is self-inverse.
  3. Lloyd-Max codebook (not uniform): PASS
     Evidence: centroids are non-uniform (dense near 0, sparse at the tails); source comment reads "Lloyd-Max for N(0, 1/128)".
  4. Radius stored at FP16+: PASS
     Evidence: block_turboN_0.norm is ggml_half (FP16), one per 128-element group.
  5. No per-vector normalization constants: PASS
     Evidence: only the group-level L2 norm is stored; no scale/offset/min/max. Static asserts enforce block sizes.
  6. Dequant matches quant in Metal: PASS
     Evidence: SET_ROWS quantize → centroid LUT → WHT inverse dequant, using the same centroids, signs, and butterfly.

Result: 5/6 PASS, 1 PARTIAL PASS

⚠️ Item 1 caveat: CPU reference quantizer for turbo4 uses dense random orthogonal matrix (O(d²)) instead of WHT. This is legacy code. The Metal GPU production path correctly uses WHT for ALL types. If CPU fallback were invoked for turbo4, it would produce data incompatible with Metal dequant.

Recommendation: Flag the CPU turbo4 ref path as incompatible. For Metal inference (our use case), all items PASS.

Timmy closed this issue 2026-03-30 20:09:53 +00:00