[P3] QJL residual correction — Metal port #14
Reference in New Issue
Block a user
Delete Branch "%!s()"
Deleting a branch is permanent. Although the deleted branch may continue to exist for a short time before it actually gets removed, it CANNOT be undone in most cases. Continue?
Parent: #1 | Only if PolarQuant insufficient at < 3 bits/channel
Add QJL 1-bit residual correction for full TurboQuant behavior.
Source: amirzandieh/QJL (CUDA -> Metal port needed)
Estimated Time: 30-60 min (real engineering work)
Decision Gate
Only proceed if PolarQuant alone doesn't meet quality bar at target compression.
Acceptance Criteria
QJL Residual Correction — Assessment Complete
Critical Finding
turbo4 is NOT using QJL. turbo4 = pure 4-bit PolarQuant (16 centroids, nibble-packed).
The
TURBO4_USE_4BITflag defaults to 1 inggml-common.h:291. With this flag:#elseblock but is DISABLEDQJL Infrastructure Status
Present but unused:
turbo_qjl_wht_signs1/2— 128 floats each (Metal + CPU)turbo_qjl_mtl— 128x128 dense QJL projection matrixturbo_qjl_t_mtl— transposed versionWhat Would Be Needed for Standalone QJL
amirzandieh/QJLhas CUDA kernels (~1500 lines total)Recommendation
Not needed for current goals. turbo4 (pure PolarQuant, 4-bit) already delivers 73% KV savings with minimal quality impact. QJL would only matter if pushing below 3 bits/channel, which isn't needed on 36GB hardware.
Save for Phase 4 / upstream watch.