TurboQuant
KV cache compression for local inference on an M4 Max MacBook Pro.
What
TurboQuant (Google, ICLR 2026) is a two-stage KV cache compression method:
- PolarQuant — WHT rotation + polar coordinates + Lloyd-Max codebook (~4.2x compression)
- QJL — 1-bit quantized Johnson-Lindenstrauss residual correction
- TurboQuant — PolarQuant + QJL = ~3.5 bits/channel, zero accuracy loss
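How the two stages fit together, as a minimal NumPy sketch. This is illustrative only: uniform grids stand in for the trained Lloyd-Max codebooks, and the dimension, grid sizes, and sketch width are made-up parameters, not the paper's bit allocation.

```python
import numpy as np

def hadamard(n: int) -> np.ndarray:
    """Sylvester-construction Hadamard matrix, scaled to be orthonormal (n a power of 2)."""
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(n)

def polar_quant(x, r_levels, theta_bits=3):
    """Stage 1 (PolarQuant-style): rotate with the WHT, pair up channels, and
    quantize each pair's (radius, angle). A uniform grid stands in here for the
    trained Lloyd-Max codebook."""
    z = hadamard(x.shape[0]) @ x                    # rotation spreads energy across channels
    pairs = z.reshape(-1, 2)
    r = np.linalg.norm(pairs, axis=1)
    theta = np.arctan2(pairs[:, 1], pairs[:, 0])
    r_idx = np.clip(np.searchsorted(r_levels, r), 0, len(r_levels) - 1)
    n_theta = 2 ** theta_bits
    t_idx = np.round((theta + np.pi) / (2 * np.pi) * n_theta).astype(int) % n_theta
    return r_idx, t_idx

def polar_dequant(r_idx, t_idx, r_levels, theta_bits=3):
    n_theta = 2 ** theta_bits
    theta = t_idx * 2 * np.pi / n_theta - np.pi
    pairs = np.stack([r_levels[r_idx] * np.cos(theta),
                      r_levels[r_idx] * np.sin(theta)], axis=1)
    z = pairs.reshape(-1)
    return hadamard(z.shape[0]).T @ z               # orthonormal, so H.T inverts H

def qjl_sketch(residual, m=64, seed=0):
    """Stage 2 (QJL-style): keep only the signs of a random JL projection of the
    stage-1 quantization residual (1 bit each) plus the residual norm."""
    S = np.random.default_rng(seed).standard_normal((m, residual.shape[0]))
    return np.sign(S @ residual), np.linalg.norm(residual)

d = 128
x = np.random.default_rng(1).standard_normal(d)     # stand-in for one key vector
r_levels = np.linspace(0.1, 3.0, 16)                # placeholder 4-bit radius grid
r_idx, t_idx = polar_quant(x, r_levels)
x_hat = polar_dequant(r_idx, t_idx, r_levels)
signs, res_norm = qjl_sketch(x - x_hat)             # correction signal kept per token
print("relative error after stage 1:", np.linalg.norm(x - x_hat) / np.linalg.norm(x))
```

At inference time the dequantized keys plus the sign sketch would be used to correct attention scores; see the Source Repos below for the real implementations.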
Why
Unlock 64K-128K context on qwen3.5:27b within 32GB unified memory. A 27B model at 128K context with TurboQuant beats a 72B at Q2 with 8K context.
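Why the numbers work out, as a back-of-envelope sketch. The layer and head counts below are assumptions for illustration, not the actual qwen3.5:27b config:

```python
def kv_cache_gib(seq_len, n_layers=48, n_kv_heads=8, head_dim=128, bits_per_channel=16.0):
    """KV cache size: 2 tensors (K and V) x layers x kv heads x head dim x tokens.
    Layer/head counts are illustrative guesses; adjust to match the real model."""
    elems = 2 * n_layers * n_kv_heads * head_dim * seq_len
    return elems * bits_per_channel / 8 / 2**30

for bits, label in [(16.0, "fp16"), (3.5, "TurboQuant")]:
    print(f"{label:10s} @ 128K ctx: {kv_cache_gib(128 * 1024, bits_per_channel=bits):5.1f} GiB")
```

With these assumed dims, fp16 KV at 128K context alone takes ~24 GiB, leaving no room for the 27B weights in 32 GB of unified memory; at ~3.5 bits/channel the cache drops to ~5 GiB.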
Status
See issues for current progress.
Roles
- Strago: Build spec author
- Cid: Implementation, benchmarks, deployment
- Locke: Research support, upstream watch
- John: Quality review
- Frankie: Coordination
Source Repos
- TheTom/llama-cpp-turboquant — llama.cpp fork with Metal support
- TheTom/turboquant_plus — Reference impl, 511+ tests
- amirzandieh/QJL — Original author's QJL implementation (CUDA)
- rachittshah/mlx-turboquant — MLX fallback
Docs
- Project Status — Full project status and build specification