[P1-S1] PolarQuant verification checklist #5

Closed
opened 2026-03-30 17:11:07 +00:00 by Timmy · 2 comments
Owner

Parent: #1 | Depends on: #4 (build)

Verify the fork implements PolarQuant correctly BEFORE trusting any benchmark numbers. A fork that gets the rotation wrong will compress successfully but degrade quality in ways short PPL benchmarks may miss.

Checklist (from spec Section 1a)

  • Rotation is WHT or equivalent structured orthogonal (NOT learned, NOT dense random)
  • Same rotation matrix used for quantization and dequantization
  • Codebook is Lloyd-Max (NOT uniform) — boundaries precomputed for post-WHT distribution
  • Radius stored separately at FP16+ precision
  • No per-vector normalization constants stored (this is the whole point of PolarQuant)
  • Dequant path in Metal shader matches the quantization path exactly

How to Verify

  • Read the fork's PolarQuant implementation source
  • Compare against turboquant_plus reference implementation (511+ tests)
  • Check Metal kernel dequant matches CPU path
  • If using uniform quantization instead of Lloyd-Max: flag as quality regression
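Items 1 and 2 also admit a quick property test: a sign-flipped Walsh-Hadamard rotation is orthonormal and self-inverse, so forward followed by inverse must reconstruct the input exactly. A minimal numpy sketch (function names are illustrative, not the fork's):

```python
import numpy as np

def fwht(x):
    """Unnormalized fast Walsh-Hadamard transform (butterfly), O(d log d)."""
    x = x.copy()
    h = 1
    while h < len(x):
        for i in range(0, len(x), 2 * h):
            a = x[i:i + h].copy()
            b = x[i + h:i + 2 * h].copy()
            x[i:i + h] = a + b
            x[i + h:i + 2 * h] = a - b
        h *= 2
    return x

rng = np.random.default_rng(0)
d = 128
signs = rng.choice([-1.0, 1.0], size=d)    # fixed random sign flips

def rotate_forward(x):
    return fwht(signs * x) / np.sqrt(d)    # orthonormal rotation

def rotate_inverse(y):
    return signs * fwht(y) / np.sqrt(d)    # same signs, same butterfly

x = rng.standard_normal(d)
assert np.allclose(rotate_inverse(rotate_forward(x)), x)
```

If the fork's forward and inverse paths use different sign arrays or a non-self-inverse transform, this round trip fails immediately, which is cheaper to check than a PPL run.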

Acceptance Criteria

  • All 6 checklist items verified YES/NO with evidence
  • Any deviations from paper documented
Timmy added this to the Phase 1 — PolarQuant MVP milestone 2026-03-30 17:11:07 +00:00
Timmy added the research, phase-1, owner:cid labels 2026-03-30 17:11:07 +00:00
Member

🔬 Deep Dive: PolarQuant Algorithm (Phase 1 Core)

What PolarQuant Actually Does

PolarQuant is NOT standard quantization. It's a vector quantization algorithm using random rotation + optimal scalar quantization.

Input: Vector x ∈ R^d (e.g., 128-dim attention head)
       │
       ▼
┌─────────────────────────────────────┐
│ 1. EXTRACT NORM                     │
│    norm = ||x||₂                    │
│    x_unit = x / norm                │
├─────────────────────────────────────┤
│ 2. RANDOM ROTATION (WHT)            │
│    y = Π @ x_unit                   │
│    (Walsh-Hadamard + random signs)  │
├─────────────────────────────────────┤
│ 3. OPTIMAL SCALAR QUANTIZATION      │
│    After rotation, coordinates      │
│    follow Beta(d/2, d/2) dist       │
│    → known optimal centroids        │
├─────────────────────────────────────┤
│ 4. INDEX ENCODING                   │
│    Store: indices (b-bit) + norm    │
└─────────────────────────────────────┘
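The four stages above can be sketched end to end in numpy. This is an illustrative toy, not the fork's code: it assumes the turbo2-style 2-bit centroids ±0.453/√d, ±1.51/√d and builds the WHT as an explicit (Sylvester-constructed, self-inverse) matrix rather than a butterfly kernel.

```python
import numpy as np

d = 128
rng = np.random.default_rng(1)
signs = rng.choice([-1.0, 1.0], size=d)

# Sylvester construction: orthonormal, symmetric (hence self-inverse) WHT matrix
H = np.array([[1.0]])
for _ in range(7):                  # 2^7 = 128
    H = np.block([[H, H], [H, -H]])
H /= np.sqrt(d)

# illustrative 2-bit codebook (assumed turbo2-style values)
centroids = np.array([-1.51, -0.453, 0.453, 1.51]) / np.sqrt(d)

def quantize(x):
    norm = np.linalg.norm(x)                             # 1. extract norm
    y = H @ (signs * (x / norm))                         # 2. random rotation
    idx = np.abs(y[:, None] - centroids).argmin(axis=1)  # 3. nearest centroid
    return idx.astype(np.uint8), norm                    # 4. indices + norm

def dequantize(idx, norm):
    y_hat = centroids[idx]
    return signs * (H @ y_hat) * norm                    # inverse rotation + rescale

x = rng.standard_normal(d)
idx, norm = quantize(x)
x_hat = dequantize(idx, norm)
```

Note that the same `signs` and `H` appear in both paths; that is exactly checklist item 2.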

Key Mathematical Insight

Why rotation helps:

  • Raw KV cache coordinates are correlated and non-Gaussian
  • Random rotation Gaussianizes the distribution (by CLT)
  • After rotation, each coordinate is i.i.d. Beta(d/2, d/2)
  • Beta distribution has KNOWN optimal quantization points
  • This enables per-coordinate optimal quantization

Bit-width Formats (The "turbo" Family)

Format  Bits   Compression  Use Case
turbo2  2-bit  6.4x         Extreme memory pressure
turbo3  3-bit  4.6x         Balanced (recommended)
turbo4  4-bit  3.8x         Quality-critical

Optimal Centroids (High-d Limit)

For unit-norm vectors in d dimensions:

# turbo2 (2-bit = 4 centroids)
centroids = ±0.453/√d, ±1.51/√d

# turbo3 (3-bit = 8 centroids)
centroids = computed via Lloyd's algorithm on Beta(d/2, d/2)

# turbo4 (4-bit = 16 centroids)
centroids = computed via Lloyd's algorithm on Beta(d/2, d/2)
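The 8- and 16-entry codebooks can be reproduced numerically. A sketch of scalar Lloyd-Max on N(0, 1/128) samples, used here as a stand-in for the post-rotation coordinate distribution (the helper name is illustrative):

```python
import numpy as np

def lloyd_max_1d(samples, k, iters=40):
    """Scalar Lloyd-Max: alternate nearest-centroid assignment with
    centroid update (conditional mean of each quantization cell)."""
    c = np.quantile(samples, (np.arange(k) + 0.5) / k)   # quantile init
    for _ in range(iters):
        idx = np.abs(samples[:, None] - c).argmin(axis=1)
        for j in range(k):
            cell = samples[idx == j]
            if cell.size:
                c[j] = cell.mean()
    return np.sort(c)

rng = np.random.default_rng(2)
# stand-in for the post-WHT coordinate distribution: N(0, 1/128)
samples = rng.normal(0.0, 1.0 / np.sqrt(128), size=100_000)
codebook8 = lloyd_max_1d(samples, k=8)    # a turbo3-style 3-bit codebook
```

Running the same routine with k=4 recovers the 2-bit values quoted above (±0.453/√d, ±1.51/√d), which is a useful cross-check against a fork's hard-coded tables.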

Norm Correction (Critical Detail)

The paper stores norms separately and rescales. BUT there's a subtlety:

# After dequantization, the reconstructed y_hat may have 
# norm ≠ 1 due to quantization error. Re-normalizing before
# inverse rotation improves quality significantly.

if norm_correction:
    y_hat = y_hat / ||y_hat||  # Re-normalize
    
x_hat = Π.T @ y_hat * norm    # Inverse rotation + rescale

Impact: -1.17% PPL on CUDA, +1.1% on Metal vs no correction.

Memory Layout

For turbo3 (3-bit) on 128-dim head:
- Indices: 128 × 3 bits = 48 bytes
- Norm: 4 bytes (fp32)
- Total: 52 bytes vs 256 bytes (fp16) = 4.9x compression

With 4-mag LUT optimization:
- Store block-max magnitude (4 values)
- Reduces norm overhead
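The layout arithmetic above can be checked directly (numbers from this section):

```python
d = 128
idx_bytes  = d * 3 // 8               # 3-bit indices, packed: 48 bytes
norm_bytes = 4                        # one fp32 norm per group
total      = idx_bytes + norm_bytes   # 52 bytes
fp16_bytes = d * 2                    # 256 bytes at fp16
ratio      = fp16_bytes / total       # ~4.9x compression
```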

🎯 Implications for M4 Max Build

1. Metal Kernel Requirements

Need kernels for:

  • WHT rotation (forward and inverse)
  • Centroid lookup (table-driven dequant)
  • Block-wise norm computation
  • 4-mag LUT (M1/M2/M3/M4 optimization)

2. Build Configuration

# Critical CMake flags
cmake -B build \
  -DLLAMA_METAL=ON \
  -DLLAMA_METAL_EMBED_LIBRARY=ON \
  -DCMAKE_BUILD_TYPE=Release \
  -DCMAKE_OSX_ARCHITECTURES=arm64

3. Expected Performance (from paper/benchmarks)

On M4 Max (extrapolated from M5/M1 data):

Config  Prefill (tok/s)  Decode (tok/s)  Context
q8_0    ~1600            ~65             32K
turbo4  ~1600 (+0%)      ~60 (-7%)       32K
turbo3  ~1700 (+6%)      ~50 (-23%)      32K

The tradeoff: turbo3 uses less memory but turbo4 has better decode speed on Apple Silicon.

4. The "Sparse V" Optimization

Not strictly TurboQuant, but bundled in the repo:

  • Skip V dequantization where attention weight < 1e-6
  • At long context, ~50% of positions are negligible
  • +22.8% decode speed at 32K context
  • No PPL degradation (validated)
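A minimal numpy sketch of the idea, with hypothetical helper names (the repo's actual kernel does this inside the attention loop, not as a separate pass):

```python
import numpy as np

def sparse_v_mix(attn, dequant_row, d, threshold=1e-6):
    """Attention-weighted sum over V that calls dequant_row(i) only for
    positions whose weight clears the threshold (hypothetical helper)."""
    out = np.zeros(d)
    for i in np.nonzero(attn > threshold)[0]:
        out += attn[i] * dequant_row(i)   # dequantization happens lazily here
    return out

rng = np.random.default_rng(3)
n, d = 4096, 128
V = rng.standard_normal((n, d))
logits = rng.normal(0.0, 4.0, size=n)     # peaked attention -> many tiny weights
attn = np.exp(logits - logits.max())
attn /= attn.sum()

dense  = attn @ V
sparse = sparse_v_mix(attn, lambda i: V[i], d)
```

Because the dropped positions carry at most `n * threshold` of the total attention mass, the output is numerically indistinguishable from the dense sum, which is why no PPL degradation is expected.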

⚠️ Unasked Questions

Q: Why not just use Q4_0 (standard 4-bit)?

A: TurboQuant (polar) beats Q4_0 on quality at same compression:

  • turbo4: +0.23% PPL vs q8_0
  • q4_0: +0.52% PPL vs q8_0

Because optimal quantization on rotated coordinates > uniform quantization on raw coordinates.
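The gap can be illustrated with a stylized scalar experiment (not the repo's benchmark): heavy-tailed "raw" coordinates quantized on an absmax-scaled uniform 2-bit grid, versus Gaussianized coordinates quantized with the 2-bit Lloyd-Max codebook.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 512_000
raw = rng.laplace(scale=1 / np.sqrt(2), size=n)  # heavy-tailed, unit variance
rot = rng.standard_normal(n)                     # post-rotation: ~Gaussian (CLT)

# uniform 2-bit grid with per-block absmax scaling on the raw samples
blocks = raw.reshape(-1, 32)
scale = np.abs(blocks).max(axis=1, keepdims=True)
grid = np.array([-1.0, -1 / 3, 1 / 3, 1.0])      # 4 evenly spaced levels in [-1, 1]
idx = np.abs(blocks[:, :, None] / scale[:, :, None] - grid).argmin(axis=2)
mse_uniform = np.mean((blocks - grid[idx] * scale) ** 2)

# Lloyd-Max 2-bit codebook on the Gaussianized samples
lloyd = np.array([-1.510, -0.453, 0.453, 1.510])
idx2 = np.abs(rot[:, None] - lloyd).argmin(axis=1)
mse_lloyd = np.mean((rot - lloyd[idx2]) ** 2)    # markedly lower MSE
```

The uniform grid wastes levels on outliers that set the absmax scale, while Lloyd-Max places levels where the Gaussianized mass actually sits.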

Q: What's the catch?

A: Computational overhead:

  • Must do WHT rotation every forward pass
  • Extra memory bandwidth for rotation matrices
  • Worth it for long context (bandwidth-bound), less so for short

Q: Should we use symmetric or asymmetric K/V?

A: Test results from repo:

  • Symmetric (turbo3/turbo3): Best for Q8_0+ models
  • Asymmetric (q8_0-K / turbo4-V): Best for Q4_K_M models

K controls attention routing → higher precision matters more.


📋 Build Verification Checklist

Once Mac build is ready, verify:

  1. WHT rotation produces Gaussianized output (kurtosis ~3)
  2. Centroid lookup matches Python reference
  3. Norm correction active (check PPL diff)
  4. 4-mag LUT auto-detected on M4
  5. Sparse V provides speedup at 16K+ context
  6. PPL regression < 1.0% vs q8_0
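Item 1 can be sanity-checked offline: rotating heavy-tailed input with a sign-flipped WHT should pull the pooled coordinate kurtosis to ~3. A numpy sketch, using a Sylvester-constructed Hadamard matrix as a stand-in for the fork's butterfly kernel:

```python
import numpy as np

d = 128
H = np.array([[1.0]])
for _ in range(7):                  # Sylvester construction: 2^7 = 128
    H = np.block([[H, H], [H, -H]])
H /= np.sqrt(d)                     # orthonormal (and symmetric) WHT matrix

rng = np.random.default_rng(5)
signs = rng.choice([-1.0, 1.0], size=d)
X = rng.laplace(size=(4000, d))     # heavy-tailed input: kurtosis ~6
Y = (X * signs) @ H                 # rotated coordinates

def kurtosis(a):
    a = a.ravel()
    return np.mean((a - a.mean()) ** 4) / np.var(a) ** 2

k_before, k_after = kurtosis(X), kurtosis(Y)   # roughly 6 -> 3
```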

Research compiled from turboquant_plus repo and ICLR 2026 paper

Author
Owner

PolarQuant Verification Checklist

  1. WHT rotation (not learned/dense): PARTIAL PASS
     Evidence: the Metal GPU path uses the WHT butterfly (turbo_fwht_128); the CPU turbo4 reference uses a dense random matrix (legacy, not the production path).
  2. Same rotation for quant and dequant: PASS
     Evidence: turbo_rotate_forward() ↔ turbo_rotate_inverse() use the same sign arrays, and the butterfly is self-inverse.
  3. Lloyd-Max codebook (not uniform): PASS
     Evidence: centroids are non-uniform (dense near 0, sparse at the tails); source comment reads "Lloyd-Max for N(0, 1/128)".
  4. Radius stored at FP16+: PASS
     Evidence: block_turboN_0.norm is ggml_half (FP16), one per 128-element group.
  5. No per-vector normalization constants: PASS
     Evidence: only the group-level L2 norm is stored; no scale/offset/min/max. Static asserts enforce block sizes.
  6. Dequant matches quant in Metal: PASS
     Evidence: SET_ROWS quantize → centroid LUT → WHT inverse dequant, using the same centroids, signs, and butterfly.

Result: 5/6 PASS, 1 PARTIAL PASS

⚠️ Item 1 caveat: CPU reference quantizer for turbo4 uses dense random orthogonal matrix (O(d²)) instead of WHT. This is legacy code. The Metal GPU production path correctly uses WHT for ALL types. If CPU fallback were invoked for turbo4, it would produce data incompatible with Metal dequant.

Recommendation: Flag the CPU turbo4 ref path as incompatible. For Metal inference (our use case), all items PASS.

Timmy closed this issue 2026-03-30 20:09:53 +00:00