GENOME.md — TurboQuant (Timmy_Foundation/turboquant)
Codebase Genome v1.1 | Refreshed 2026-04-18 | Repo 12/16 | Ref: #679
Project Overview
TurboQuant is a KV cache compression system for local inference on Apple Silicon. It implements Google's ICLR 2026 paper to unlock 64K-128K context on 27B models within 32 GB of unified memory.
Three-stage compression:
- PolarQuant — WHT rotation + polar coordinates + Lloyd-Max codebook (~4.2x compression)
- QJL — 1-bit quantized Johnson-Lindenstrauss residual correction
- TurboQuant — PolarQuant + QJL = ~3.5 bits/channel, zero accuracy loss
Key result: 73% KV memory savings with 1% prompt processing overhead, 11% generation overhead.
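
To make the headline savings concrete, the sketch below computes the KV cache size for a given model shape and applies the 73% savings figure quoted above. The layer count, KV head count, and head dimension are hypothetical placeholders, not the dimensions of any particular 27B model, and `kv_cache_bytes` is an illustrative helper that does not exist in the repo.

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   context_len: int, bits_per_channel: float) -> float:
    """Bytes needed to hold K and V for every layer at a given precision."""
    channels = 2 * n_layers * n_kv_heads * head_dim * context_len  # 2 = K and V
    return channels * bits_per_channel / 8


# Hypothetical model shape (assumption, not taken from the repo).
N_LAYERS, N_KV_HEADS, HEAD_DIM = 64, 8, 128
GIB = 1024 ** 3

for ctx in (64 * 1024, 128 * 1024):
    fp16 = kv_cache_bytes(N_LAYERS, N_KV_HEADS, HEAD_DIM, ctx, 16)
    compressed = fp16 * (1 - 0.73)  # 73% savings, as quoted in this document
    print(f"ctx={ctx}: fp16 {fp16 / GIB:.1f} GiB -> compressed {compressed / GIB:.1f} GiB")
```

Under these assumed dimensions, an fp16 cache at 128K context would need 32 GiB by itself, which is why compression matters on a 32 GB machine. Real figures depend on the actual model shape.
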
Architecture
Entry Points
| Entry Point | File | Purpose |
|---|---|---|
| polar_quant_encode_turbo4() | llama-turbo.cpp | Encode float KV → 4-bit packed |
| polar_quant_decode_turbo4() | llama-turbo.cpp | Decode 4-bit packed → float KV |
| cmake -S . -B build -DTURBOQUANT_BUILD_TESTS=ON | CMakeLists.txt | Build static library + CTest suite |
| ctest --test-dir build --output-on-failure | build/ | Run C++ roundtrip tests |
| run_benchmarks.py | benchmarks/ | Run perplexity benchmarks |
| quant_selector.py | quant_selector/ | Hardware-aware quantization selection |
Key Abstractions
| Symbol | File | Purpose |
|---|---|---|
| polar_quant_encode_turbo4() | llama-turbo.h/.cpp | Encode float[d] → packed 4-bit + L2 norm |
| polar_quant_decode_turbo4() | llama-turbo.h/.cpp | Decode packed 4-bit + norm → float[d] |
| turbo_dequantize_k() | ggml-metal-turbo.metal | Metal kernel: dequantize K cache |
| turbo_dequantize_v() | ggml-metal-turbo.metal | Metal kernel: dequantize V cache |
| turbo_fwht_128() | ggml-metal-turbo.metal | Fast Walsh-Hadamard Transform |
| run_perplexity.py | benchmarks/ | Measure perplexity impact |
| run_benchmarks.py | benchmarks/ | Full benchmark suite (speed + quality) |
| select_quantization() | quant_selector.py | Pick quant scheme from hardware profile |
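
For readers unfamiliar with the transform behind `turbo_fwht_128()`, here is a minimal reference sketch of a Fast Walsh-Hadamard Transform in Python. It shows the standard butterfly algorithm only. It is not a port of the Metal kernel, and the orthonormal 1/sqrt(n) scaling is an assumption about the convention in use.

```python
import math


def fwht(values: list[float]) -> list[float]:
    """Orthonormal Fast Walsh-Hadamard Transform (length must be a power of two)."""
    n = len(values)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a non-zero power of two")
    out = list(values)
    h = 1
    while h < n:
        # Butterfly: combine pairs that are h apart within blocks of size 2h.
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = out[i], out[i + h]
                out[i], out[i + h] = a + b, a - b
        h *= 2
    scale = 1.0 / math.sqrt(n)
    return [v * scale for v in out]


# With orthonormal scaling the transform is its own inverse.
x = [float(i % 7) - 3.0 for i in range(128)]
roundtrip = fwht(fwht(x))
assert all(abs(a - b) < 1e-9 for a, b in zip(x, roundtrip))
```

Because the orthonormal transform preserves the L2 norm, a rotation like this can be applied before quantization without changing the norm that the encoder stores alongside the packed data.
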
Data Flow
API Surface
| Function | Signature | Notes |
|---|---|---|
| polar_quant_encode_turbo4 | (const float*, uint8_t*, float*, int) | Core encode path |
| polar_quant_decode_turbo4 | (const uint8_t*, float, float*, int) | Core decode path |
| select_quantization | (HardwareProfile) -> QuantConfig | Python quant selector |
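
The encode and decode signatures suggest a simple contract: encode produces packed 4-bit data plus one norm, and decode takes both back to floats. The Python sketch below mirrors that contract to show how two 4-bit indices share a byte. It rests on several assumptions: a uniform 16-level grid stands in for the Lloyd-Max codebook, the WHT rotation and polar-coordinate step are omitted, and `encode_4bit` / `decode_4bit` are hypothetical names, not the repo's API.

```python
import math


def encode_4bit(src: list[float]) -> tuple[bytes, float]:
    """Return (packed 4-bit indices, L2 norm). Uniform grid, not Lloyd-Max."""
    if len(src) % 2:
        raise ValueError("length must be even to pack two indices per byte")
    norm = math.sqrt(sum(v * v for v in src))
    inv = 1.0 / norm if norm > 0 else 0.0
    # Unit-vector components lie in [-1, 1]; map them onto indices 0..15.
    idx = [min(15, max(0, round((v * inv + 1.0) * 7.5))) for v in src]
    # Low nibble holds the even-position index, high nibble the odd one.
    packed = bytes(idx[i] | (idx[i + 1] << 4) for i in range(0, len(idx), 2))
    return packed, norm


def decode_4bit(packed: bytes, norm: float) -> list[float]:
    """Inverse of encode_4bit, up to quantization error."""
    out = []
    for byte in packed:
        for idx in (byte & 0x0F, byte >> 4):
            out.append((idx / 7.5 - 1.0) * norm)
    return out


packed, norm = encode_4bit([3.0, -4.0, 0.0, 1.0])
restored = decode_4bit(packed, norm)
```

In this sketch the packed buffer is half as many bytes as there are input channels, and each restored value differs from the original by at most half a grid step scaled by the norm.
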
File Index
| File | LOC | Purpose |
|---|---|---|
| llama-turbo.h | 24 | C API: encode/decode function declarations |
| llama-turbo.cpp | 78 | Implementation: PolarQuant encode/decode |
| ggml-metal-turbo.metal | 76 | Metal shader: dequantize + FWHT kernels |
| CMakeLists.txt | 42 | Standalone build: lib + test targets |
| quant_selector.py | ~120 | Python: hardware profile → quant decision |
| tests/test_quant_selector.py | ~90 | Pytest: quant selector (currently failing) |
| benchmarks/run_benchmarks.py | ~85 | Perplexity + speed benchmarking |
CI / Runtime Drift
| Dimension | Status | Notes |
|---|---|---|
| CMake/CTest standalone build | ✅ Passing | cmake -S . -B build -DTURBOQUANT_BUILD_TESTS=ON && ctest --test-dir build works on current main |
| Python quant selector tests | ❌ Failing | tests/test_quant_selector.py fails on current main (tracked in turboquant #139) |
| CI lane: quant_selector | ❌ Broken | The quant selector CI lane is non-blocking due to persistent failures |
| CI lane: cmake roundtrip | ✅ Green | C++ roundtrip test passes in CI |
| Metal shader compilation | ⚠️ Apple Silicon only | Cannot be tested in CI runners; validated manually on M-series hardware |
Test Coverage Gaps
- tests/test_quant_selector.py is currently broken: the selector returns the wrong quantization for edge-case hardware profiles (see turboquant #139)
- No CI coverage for Metal shader correctness (Apple Silicon only)
- Benchmark regression detection is manual; no automated threshold enforcement
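
One possible shape for automated threshold enforcement is to compare each run's metrics against a stored baseline and fail when quality or speed regresses beyond a tolerance. The sketch below is only an illustration: the metric names, the tolerances, and `find_regressions` itself are hypothetical, since the output format of `run_benchmarks.py` is not described here.

```python
def find_regressions(baseline: dict[str, float], current: dict[str, float],
                     max_ppl_increase: float = 0.01,
                     max_speed_drop: float = 0.05) -> list[str]:
    """Return human-readable failures; an empty list means the run passes."""
    failures = []
    # Perplexity: lower is better, so flag relative increases.
    ppl_delta = current["perplexity"] / baseline["perplexity"] - 1.0
    if ppl_delta > max_ppl_increase:
        failures.append(f"perplexity up {ppl_delta:.1%} (limit {max_ppl_increase:.0%})")
    # Throughput: higher is better, so flag relative drops.
    speed_delta = 1.0 - current["tokens_per_sec"] / baseline["tokens_per_sec"]
    if speed_delta > max_speed_drop:
        failures.append(f"tokens/sec down {speed_delta:.1%} (limit {max_speed_drop:.0%})")
    return failures


baseline = {"perplexity": 6.00, "tokens_per_sec": 40.0}  # made-up numbers
current = {"perplexity": 6.30, "tokens_per_sec": 39.0}   # made-up numbers
print(find_regressions(baseline, current))
```

A CI job could exit non-zero whenever the returned list is non-empty, which would turn the manual comparison into a gate.
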
Security Considerations
- C API operates on caller-allocated buffers, with no internal bounds checking on the d parameter
- Python quant selector reads hardware profile from filesystem; path traversal risk if profile dir is user-controllable
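
Both risks above can be reduced with small guards at the call site. The sketch below shows one way to do that in Python: resolve a profile name inside a fixed directory and reject anything that escapes it, and validate buffer sizes before handing a length to an encode routine. The function names are hypothetical, and the assumption that a packed buffer needs `d // 2` bytes follows from 4 bits per channel rather than from the repo's headers.

```python
from pathlib import Path


def resolve_profile_path(profile_dir: Path, name: str) -> Path:
    """Resolve name inside profile_dir, rejecting anything that escapes it."""
    base = profile_dir.resolve()
    candidate = (base / name).resolve()
    if not candidate.is_relative_to(base):  # Path.is_relative_to: Python 3.9+
        raise ValueError(f"profile path escapes {base}: {name!r}")
    return candidate


def check_encode_buffers(src_len: int, dst_len: int, d: int) -> None:
    """Validate buffer sizes before calling an encode routine with length d."""
    if d <= 0 or d % 2:
        raise ValueError("d must be a positive even number")
    # Assumption: 4 bits per channel, so d channels need d // 2 packed bytes.
    if src_len < d or dst_len < d // 2:
        raise ValueError("buffers too small for d channels at 4 bits each")
```

Resolving both paths before comparing them means `..` segments and absolute names are normalized first, so a name such as `../other.json` is rejected rather than silently followed.
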
Dependencies
| Dependency | Version | Purpose |
|---|---|---|
| CMake | ≥3.20 | Build system |
| Python | ≥3.10 | Benchmarks + quant selector |
| pytest | any | Test runner for Python tests |
| Metal (macOS) | 14+ | GPU shader compilation |
| llama.cpp | fork | Integration layer |
Deployment
- Static library turboquant.a linked into llama.cpp fork
- Python quant selector invoked at model-load time to pick compression scheme
- No standalone server component; embedded in inference runtime
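
As a rough illustration of the load-time selection step, the sketch below follows the `(HardwareProfile) -> QuantConfig` shape listed in the API surface. Everything inside it is assumed for the example: the dataclass fields, the scheme names, and the memory and context thresholds are hypothetical and do not describe the real policy in `quant_selector.py`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HardwareProfile:
    unified_memory_gb: int  # hypothetical field
    has_metal: bool         # hypothetical field


@dataclass(frozen=True)
class QuantConfig:
    scheme: str             # hypothetical values: "turbo4" or "f16"
    max_context: int


def select_quantization(profile: HardwareProfile) -> QuantConfig:
    """Toy policy: compress the KV cache only where the Metal kernels can run."""
    if not profile.has_metal:
        return QuantConfig(scheme="f16", max_context=8 * 1024)
    if profile.unified_memory_gb >= 32:
        return QuantConfig(scheme="turbo4", max_context=128 * 1024)
    return QuantConfig(scheme="turbo4", max_context=64 * 1024)


# At model-load time the caller would build a profile and apply the result.
config = select_quantization(HardwareProfile(unified_memory_gb=32, has_metal=True))
print(config)
```

Keeping the selector a pure function of the profile makes edge-case profiles easy to cover with table-driven pytest cases.
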
Technical Debt
- turboquant #139: quant selector test failures not yet resolved; CI lane is non-blocking
- No automated benchmark regression detection
- Metal shaders untestable in CI — manual validation on Apple Silicon required
- Stale genome (v1.0, 2026-04-15) did not reflect quant selector addition or CI drift