[P3] QJL residual correction — Metal port #14

New Issue

Timmy · 2026-03-30T17:11:18Z

Timmy commented

2026-03-30 17:11:18 +00:00

Parent: #1 | Only if PolarQuant insufficient at < 3 bits/channel

Add QJL 1-bit residual correction for full TurboQuant behavior.

Source: amirzandieh/QJL (CUDA -> Metal port needed)

Estimated Time: 30-60 min (real engineering work)

Decision Gate

Only proceed if PolarQuant alone doesn't meet quality bar at target compression.

Acceptance Criteria

QJL Metal kernels implemented
Benchmarked vs PolarQuant-only
Quality improvement measured at 2.5 bits/channel

## Parent: #1 | Only if PolarQuant insufficient at < 3 bits/channel Add QJL 1-bit residual correction for full TurboQuant behavior. ## Source: amirzandieh/QJL (CUDA -> Metal port needed) ## Estimated Time: 30-60 min (real engineering work) ## Decision Gate Only proceed if PolarQuant alone doesn't meet quality bar at target compression. ## Acceptance Criteria - [ ] QJL Metal kernels implemented - [ ] Benchmarked vs PolarQuant-only - [ ] Quality improvement measured at 2.5 bits/channel

Timmy added this to the Phase 3+ — Optimization & QJL milestone 2026-03-30 17:11:18 +00:00

Timmy added the phase-3 build owner:locke owner:cid labels 2026-03-30 17:11:18 +00:00

allegro referenced this issue

2026-03-30 17:45:24 +00:00

[P1-S0] Fork assessment — age, conflicts, build path estimate #3

Timmy commented

2026-03-30 21:04:07 +00:00

QJL Residual Correction — Assessment Complete

Critical Finding

turbo4 is NOT using QJL. turbo4 = pure 4-bit PolarQuant (16 centroids, nibble-packed).

The TURBO4_USE_4BIT flag defaults to 1 in ggml-common.h:291. With this flag:

turbo4 = 4-bit PolarQuant only
Legacy 3-bit+QJL path exists in an #else block but is DISABLED

QJL Infrastructure Status

Present but unused:

turbo_qjl_wht_signs1/2 — 128 floats each (Metal + CPU)
turbo_qjl_mtl — 128x128 dense QJL projection matrix
turbo_qjl_t_mtl — transposed version
None referenced by any active kernel — dead code

What Would Be Needed for Standalone QJL

New GGML type (e.g., GGML_TYPE_QJL1_0)
New Metal kernels: quantize (JL transform + sign extraction), dequant, FA integration
Reference: amirzandieh/QJL has CUDA kernels (~1500 lines total)
WHT infrastructure already in Metal is a head start

Recommendation

Not needed for current goals. turbo4 (pure PolarQuant, 4-bit) already delivers 73% KV savings with minimal quality impact. QJL would only matter if pushing below 3 bits/channel, which isn't needed on 36GB hardware.

Save for Phase 4 / upstream watch.

## QJL Residual Correction — Assessment Complete ### Critical Finding **turbo4 is NOT using QJL.** turbo4 = pure 4-bit PolarQuant (16 centroids, nibble-packed). The `TURBO4_USE_4BIT` flag defaults to 1 in `ggml-common.h:291`. With this flag: - turbo4 = 4-bit PolarQuant only - Legacy 3-bit+QJL path exists in an `#else` block but is **DISABLED** ### QJL Infrastructure Status Present but unused: - `turbo_qjl_wht_signs1/2` — 128 floats each (Metal + CPU) - `turbo_qjl_mtl` — 128x128 dense QJL projection matrix - `turbo_qjl_t_mtl` — transposed version - **None referenced by any active kernel — dead code** ### What Would Be Needed for Standalone QJL 1. New GGML type (e.g., GGML_TYPE_QJL1_0) 2. New Metal kernels: quantize (JL transform + sign extraction), dequant, FA integration 3. Reference: `amirzandieh/QJL` has CUDA kernels (~1500 lines total) 4. WHT infrastructure already in Metal is a head start ### Recommendation **Not needed for current goals.** turbo4 (pure PolarQuant, 4-bit) already delivers 73% KV savings with minimal quality impact. QJL would only matter if pushing below 3 bits/channel, which isn't needed on 36GB hardware. Save for Phase 4 / upstream watch.

Timmy closed this issue

2026-03-30 21:04:08 +00:00

Timmy referenced this issue

2026-04-04 01:18:41 +00:00

TurboQuant — KV Cache Compression for Local Inference on M4 Max #1

Sign in to join this conversation.

1 Participants

Notifications

Due Date

No due date set.

Dependencies

No dependencies set.

Reference: Timmy_Foundation/turboquant#14