[P3] QJL residual correction — Metal port #14

Closed
opened 2026-03-30 17:11:18 +00:00 by Timmy · 1 comment
Parent: #1 | Only if PolarQuant insufficient at < 3 bits/channel

Add QJL 1-bit residual correction for full TurboQuant behavior.

Source: amirzandieh/QJL (CUDA -> Metal port needed)

Estimated Time: 30-60 min (real engineering work)

Decision Gate

Only proceed if PolarQuant alone doesn't meet quality bar at target compression.

Acceptance Criteria

  • QJL Metal kernels implemented
  • Benchmarked vs PolarQuant-only
  • Quality improvement measured at 2.5 bits/channel
Timmy added this to the Phase 3+ — Optimization & QJL milestone 2026-03-30 17:11:18 +00:00
Timmy added the phase-3, build, owner:locke, owner:cid labels 2026-03-30 17:11:18 +00:00

QJL Residual Correction — Assessment Complete

Critical Finding

turbo4 is NOT using QJL. turbo4 = pure 4-bit PolarQuant (16 centroids, nibble-packed).

The TURBO4_USE_4BIT flag defaults to 1 in ggml-common.h:291. With this flag:

  • turbo4 = 4-bit PolarQuant only
  • Legacy 3-bit+QJL path exists in an #else block but is DISABLED
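
To make "16 centroids, nibble-packed" concrete, here is a minimal sketch of storing two 4-bit codebook indices per byte. The evenly spaced `CENTROIDS` table and all function names are assumptions for illustration; the actual PolarQuant codebook, its polar transform, and the ggml block layout are not shown in this issue.

```python
# Hypothetical 16-entry codebook: evenly spaced values in [-1, 1].
# The real PolarQuant centroids are NOT given in this issue.
CENTROIDS = [(i - 7.5) / 7.5 for i in range(16)]


def nearest_index(x):
    """Return the index (0..15) of the centroid closest to x."""
    return min(range(16), key=lambda i: abs(CENTROIDS[i] - x))


def pack_nibbles(values):
    """Quantize an even-length list and pack two 4-bit indices per byte.

    The first value of each pair goes in the low nibble, the second in the
    high nibble (an assumed ordering).
    """
    assert len(values) % 2 == 0
    out = bytearray()
    for lo, hi in zip(values[0::2], values[1::2]):
        out.append(nearest_index(lo) | (nearest_index(hi) << 4))
    return bytes(out)


def unpack_nibbles(packed):
    """Expand each byte back into two centroid values."""
    out = []
    for b in packed:
        out.append(CENTROIDS[b & 0x0F])
        out.append(CENTROIDS[b >> 4])
    return out
```

With this layout, four input values occupy two bytes, and each reconstructed value is off by at most half the centroid spacing as long as the input lies inside the codebook range.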

QJL Infrastructure Status

Present but unused:

  • turbo_qjl_wht_signs1/2 — 128 floats each (Metal + CPU)
  • turbo_qjl_mtl — 128x128 dense QJL projection matrix
  • turbo_qjl_t_mtl — transposed version
  • None referenced by any active kernel — dead code
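
For context on what the WHT sign tables are typically used for, the sketch below applies a sign-randomized Walsh-Hadamard transform to a 128-element vector and inverts it. It is a generic illustration, not the code in this repository: the function names, the seed, and the use of a single sign vector are assumptions, and how `turbo_qjl_wht_signs1/2` are actually combined is not stated here.

```python
import math
import random


def fwht(x):
    """Orthonormal fast Walsh-Hadamard transform; len(x) must be a power of two."""
    n = len(x)
    assert n > 0 and n & (n - 1) == 0
    y = list(x)
    h = 1
    while h < n:
        # Butterfly over pairs that are h apart.
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = y[i], y[i + h]
                y[i], y[i + h] = a + b, a - b
        h *= 2
    scale = 1.0 / math.sqrt(n)
    return [v * scale for v in y]


def make_signs(n, seed):
    """Random +/-1 vector (a stand-in for a precomputed sign table)."""
    rng = random.Random(seed)
    return [rng.choice((-1.0, 1.0)) for _ in range(n)]


def rotate(x, signs):
    """Compute y = H * diag(signs) * x."""
    return fwht([s * v for s, v in zip(signs, x)])


def unrotate(y, signs):
    """Compute x = diag(signs) * H * y; the orthonormal H is its own inverse."""
    return [s * v for s, v in zip(signs, fwht(y))]
```

The transform costs O(n log n) instead of the O(n^2) of a dense 128x128 matrix multiply, and it preserves vector norms, which is why it is a common building block for randomized rotations.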

What Would Be Needed for Standalone QJL

  1. New GGML type (e.g., GGML_TYPE_QJL1_0)
  2. New Metal kernels: quantize (JL transform + sign extraction), dequant, FA integration
  3. Reference: amirzandieh/QJL has CUDA kernels (~1500 lines total)
  4. WHT infrastructure already in Metal is a head start
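
The quantize step in item 2 (JL transform + sign extraction) can be sketched on the CPU as follows. This is a reference-style illustration under stated assumptions, not the amirzandieh/QJL kernels: it uses a dense Gaussian projection, keeps the query unquantized, and stores one sign bit per projected coordinate plus the key norm. All names are hypothetical. For residual correction, the same routine would presumably be applied to the residual left after PolarQuant rather than to the raw key.

```python
import math
import random


def make_projection(m, d, seed=0):
    """m x d matrix with i.i.d. standard normal entries (the JL transform)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(m)]


def project(S, x):
    """Dense matrix-vector product S @ x."""
    return [sum(s * v for s, v in zip(row, x)) for row in S]


def qjl_quantize(S, k):
    """Keep one sign bit per projected coordinate, plus the norm of k."""
    bits = [1 if p >= 0.0 else 0 for p in project(S, k)]
    norm = math.sqrt(sum(v * v for v in k))
    return bits, norm


def qjl_inner_product(S, q, bits, norm):
    """Estimate <q, k> from sign(S k), ||k||, and the unquantized S q.

    Uses E[sign(s.k) * (s.q)] = sqrt(2/pi) * <q, k> / ||k|| for Gaussian s.
    """
    sq = project(S, q)
    acc = sum(p if b else -p for p, b in zip(sq, bits))
    return math.sqrt(math.pi / 2.0) * norm * acc / len(S)
```

The estimate is unbiased and its error shrinks roughly as 1/sqrt(m), so the number of projected coordinates sets the trade-off between bits stored and accuracy. A Metal port would replace `project` with a kernel and pack `bits` into machine words.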

Recommendation

Not needed for current goals. turbo4 (pure PolarQuant, 4-bit) already delivers 73% KV savings with minimal quality impact. QJL would only matter if pushing below 3 bits/channel, which isn't needed on 36GB hardware.
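
As a rough check on the bits/channel reasoning, the helper below computes KV-cache savings relative to a 16-bit baseline. The baseline width, the block size of 128, and the 16 bits of per-block metadata are assumptions chosen for illustration; the actual turbo4 block layout is not given in this issue.

```python
def kv_savings(bits_per_value, block_size, overhead_bits, baseline_bits=16):
    """Fraction of KV-cache memory saved versus baseline_bits per value.

    overhead_bits is per-block metadata (for example a scale or norm);
    its size here is an assumption, not the real turbo4 layout.
    """
    quantized = block_size * bits_per_value + overhead_bits
    baseline = block_size * baseline_bits
    return 1.0 - quantized / baseline


# 4 bits/value with no metadata is the 75% upper bound.
print(kv_savings(4, 128, 0))
# Per-block metadata pulls the figure slightly below 75%.
print(kv_savings(4, 128, 16))
# The 2.5 bits/channel target from the acceptance criteria.
print(kv_savings(2.5, 128, 16))
```

Under these assumptions, moving from 4 to 2.5 bits/channel adds roughly nine percentage points of savings, which is the gain to weigh against the cost of a QJL port.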

Defer to Phase 4 and watch upstream for changes.

Timmy closed this issue 2026-03-30 21:04:08 +00:00