# GENOME.md — TurboQuant (Timmy_Foundation/turboquant)

Codebase Genome v1.0 | Generated 2026-04-15 | Repo 12/16

## Project Overview
TurboQuant is a KV cache compression system for local inference on Apple Silicon. It implements Google's ICLR 2026 paper to unlock 64K–128K context on 27B models within 32 GB of unified memory.
Three-stage compression:
- PolarQuant — WHT rotation + polar coordinates + Lloyd-Max codebook (~4.2x compression)
- QJL — 1-bit quantized Johnson-Lindenstrauss residual correction
- TurboQuant — PolarQuant + QJL = ~3.5 bits/channel, zero accuracy loss
Key result: 73% KV memory savings, with 1% prompt-processing overhead and 11% generation overhead.
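The 73% figure follows directly from the storage layout described under Data Flow. A quick back-of-the-envelope check, assuming d = 128 per head and the stated per-vector layout (QJL residual bits folded into the quoted ~3.5 bits/channel):

```python
# Back-of-the-envelope check of the quoted 73% KV memory savings.
# Assumes the layout from the Data Flow section: per 128-dim head vector,
# 4-bit packed indices (d/2 bytes) plus one fp32 L2 norm.
d = 128
fp16_bytes = d * 2                  # 256 bytes per raw fp16 vector
packed_bytes = d // 2 + 4           # 64 bytes of indices + 4-byte norm = 68
savings = 1 - packed_bytes / fp16_bytes
print(f"{savings:.1%}")             # → 73.4%
```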
## Architecture

```mermaid
graph TD
    subgraph "Compression Pipeline"
        KV[Raw KV Cache fp16] --> WHT[WHT Rotation]
        WHT --> POLAR[PolarQuant 4-bit]
        POLAR --> QJL[QJL Residual]
        QJL --> PACKED[Packed KV ~3.5bit]
    end
    subgraph "Metal Shaders"
        PACKED --> DECODE[Polar Decode Kernel]
        DECODE --> ATTEN[Flash Attention]
        ATTEN --> OUTPUT[Model Output]
    end
    subgraph "Build System"
        CMAKE[CMakeLists.txt] --> LIB[turboquant.a]
        LIB --> TEST[turboquant_roundtrip_test]
        LIB --> LLAMA[llama.cpp fork integration]
    end
```
## Entry Points

| Entry Point | File | Purpose |
|---|---|---|
| `polar_quant_encode_turbo4()` | llama-turbo.cpp | Encode float KV → 4-bit packed |
| `polar_quant_decode_turbo4()` | llama-turbo.cpp | Decode 4-bit packed → float KV |
| `cmake` build | CMakeLists.txt | Build static library + tests |
| `run_benchmarks.py` | benchmarks/ | Run perplexity benchmarks |
## Key Abstractions

| Symbol | File | Purpose |
|---|---|---|
| `polar_quant_encode_turbo4()` | llama-turbo.h/.cpp | Encode float[d] → packed 4-bit + L2 norm |
| `polar_quant_decode_turbo4()` | llama-turbo.h/.cpp | Decode packed 4-bit + norm → float[d] |
| `turbo_dequantize_k()` | ggml-metal-turbo.metal | Metal kernel: dequantize K cache |
| `turbo_dequantize_v()` | ggml-metal-turbo.metal | Metal kernel: dequantize V cache |
| `turbo_fwht_128()` | ggml-metal-turbo.metal | Fast Walsh-Hadamard Transform |
| `run_perplexity.py` | benchmarks/ | Measure perplexity impact |
| `run_benchmarks.py` | benchmarks/ | Full benchmark suite (speed + quality) |
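The encoder stores two 4-bit indices per byte (the `d/2 bytes` noted under Data Flow). A minimal Python sketch of that packing; the helper names and the low-nibble-first ordering are illustrative assumptions, not the library's API:

```python
def pack_nibbles(indices):
    """Pack pairs of 4-bit indices (0..15) into bytes, low nibble first.
    Hypothetical helper illustrating the d/2-byte layout, not the real API."""
    assert len(indices) % 2 == 0 and all(0 <= i < 16 for i in indices)
    return bytes(lo | (hi << 4) for lo, hi in zip(indices[::2], indices[1::2]))

def unpack_nibbles(data):
    """Inverse of pack_nibbles: recover the 4-bit index stream."""
    out = []
    for b in data:
        out.append(b & 0x0F)   # low nibble first
        out.append(b >> 4)
    return out

idx = [3, 15, 0, 9]
packed = pack_nibbles(idx)          # 2 bytes for 4 indices
assert unpack_nibbles(packed) == idx
```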
## Data Flow

```
Input: float KV vectors [d=128 per head]
  ↓
1. WHT rotation (in-place, O(d log d))
  ↓
2. Convert to polar coords (radius, angles)
  ↓
3. Lloyd-Max quantize angles → 4-bit indices
  ↓
4. Store: packed indices [d/2 bytes] + float norm [4 bytes]
  ↓
Decode: indices → codebook lookup → polar → cartesian → inverse WHT
  ↓
Output: reconstructed float KV [d=128]
```
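The flow above can be sketched end to end in plain Python. This is an illustration only: a uniform angle quantizer stands in for the Lloyd-Max codebook, radii are kept in float rather than packed, and d = 16 for brevity:

```python
import math

def fwht(v):
    """Orthonormal fast Walsh-Hadamard transform, O(d log d).
    The orthonormal WHT is its own inverse, so the same call undoes the rotation."""
    d, h = len(v), 1
    while h < d:
        for i in range(0, d, 2 * h):
            for j in range(i, i + h):
                v[j], v[j + h] = v[j] + v[j + h], v[j] - v[j + h]
        h *= 2
    s = 1.0 / math.sqrt(d)
    return [x * s for x in v]

BITS = 4
LEVELS = (1 << BITS) - 1  # 15 steps for a 4-bit uniform quantizer

def encode(vec):
    # 1. WHT rotation spreads energy evenly across channels
    r = fwht(list(vec))
    # 2. polar coordinates per channel pair: radius + angle
    norm = math.sqrt(sum(x * x for x in r))
    pairs = [(r[2 * i], r[2 * i + 1]) for i in range(len(r) // 2)]
    radii = [math.hypot(x, y) / norm for x, y in pairs]
    angles = [math.atan2(y, x) for x, y in pairs]
    # 3. uniform 4-bit angle quantization (Lloyd-Max codebook in the real code)
    idx = [round((a + math.pi) / (2 * math.pi) * LEVELS) for a in angles]
    return idx, radii, norm

def decode(idx, radii, norm):
    out = []
    for q, r in zip(idx, radii):
        a = q / LEVELS * 2 * math.pi - math.pi
        out.extend([r * norm * math.cos(a), r * norm * math.sin(a)])
    return fwht(out)  # inverse rotation

vec = [math.sin(0.7 * i) + 0.1 for i in range(16)]
rec = decode(*encode(vec))
err = math.sqrt(sum((a - b) ** 2 for a, b in zip(vec, rec)))
scale = math.sqrt(sum(a * a for a in vec))
print(err / scale)  # small relative error from angle quantization alone
```

Because the rotation is orthonormal, the reconstruction error is exactly the angle-quantization error, bounded by half a quantizer step per pair.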
## File Index

| File | LOC | Purpose |
|---|---|---|
| llama-turbo.h | 24 | C API: encode/decode function declarations |
| llama-turbo.cpp | 78 | Implementation: PolarQuant encode/decode |
| ggml-metal-turbo.metal | 76 | Metal shaders: dequantize + flash attention |
| CMakeLists.txt | 44 | Build system: static lib + tests |
| tests/roundtrip_test.cpp | 104 | Roundtrip encode→decode validation |
| benchmarks/run_benchmarks.py | 227 | Benchmark suite |
| benchmarks/run_perplexity.py | ~100 | Perplexity measurement |
| evolution/hardware_optimizer.py | 5 | Hardware detection stub |
Total: ~660 LOC | C++ core: 206 LOC | Python benchmarks: 232 LOC
## Dependencies
| Dependency | Purpose |
|---|---|
| CMake 3.16+ | Build system |
| C++17 compiler | Core implementation |
| Metal (macOS) | GPU shader execution |
| Python 3.11+ | Benchmarks |
| llama.cpp fork | Integration target |
## Source Repos (Upstream)
| Repo | Role |
|---|---|
| TheTom/llama-cpp-turboquant | llama.cpp fork with Metal shaders |
| TheTom/turboquant_plus | Reference impl, 511+ tests |
| amirzandieh/QJL | Author QJL code (CUDA) |
| rachittshah/mlx-turboquant | MLX fallback |
## Test Coverage

| Test | File | Validates |
|---|---|---|
| `turboquant_roundtrip` | tests/roundtrip_test.cpp | Encode→decode roundtrip fidelity |
| Perplexity benchmarks | benchmarks/run_perplexity.py | Quality preservation across prompts |
| Speed benchmarks | benchmarks/run_benchmarks.py | Compression overhead measurement |
## Security Considerations
- No network calls — Pure local computation, no telemetry
- Memory safety — C++ code uses raw pointers; roundtrip tests validate correctness
- Build isolation — CMake builds static library; no dynamic linking
## Sovereignty Assessment
- Fully local — No cloud dependencies, no API calls
- Open source — All code on Gitea, upstream repos public
- No telemetry — Pure computation
- Hardware-specific — Metal shaders target Apple Silicon; CUDA upstream for other GPUs
Verdict: Fully sovereign. No corporate lock-in. Pure local inference enhancement.
> "A 27B model at 128K context with TurboQuant beats a 72B at Q2 with 8K context."