# GENOME.md — TurboQuant (Timmy_Foundation/turboquant)

> Codebase Genome v1.1 | Refreshed 2026-04-18 | Repo 12/16 | Ref: #679

## Project Overview

**TurboQuant** is a KV cache compression system for local inference on Apple Silicon. Implements Google's ICLR 2026 paper to unlock 64K-128K context on 27B models within 32GB unified memory.

**Three-stage compression:**

1. **PolarQuant** — WHT rotation + polar coordinates + Lloyd-Max codebook (~4.2x compression)
2. **QJL** — 1-bit quantized Johnson-Lindenstrauss residual correction
3. **TurboQuant** — PolarQuant + QJL = ~3.5 bits/channel, zero accuracy loss

**Key result:** 73% KV memory savings with 1% prompt processing overhead, 11% generation overhead.

## Architecture

```mermaid
graph TD
    subgraph "Compression Pipeline"
        KV[Raw KV Cache fp16] --> WHT[WHT Rotation]
        WHT --> POLAR[PolarQuant 4-bit]
        POLAR --> QJL[QJL Residual]
        QJL --> PACKED[Packed KV ~3.5bit]
    end

    subgraph "Metal Shaders"
        PACKED --> DECODE[Polar Decode Kernel]
        DECODE --> ATTEN[Flash Attention]
        ATTEN --> OUTPUT[Model Output]
    end

    subgraph "Build System"
        CMAKE[CMakeLists.txt] --> LIB[turboquant.a]
        LIB --> TEST[turboquant_roundtrip_test]
        LIB --> LLAMA[llama.cpp fork integration]
    end

    subgraph "Python Layer"
        SELECTOR[quant_selector.py] --> MODELS[model_registry/]
        MODELS --> PROFILE[hardware_profiles.py]
        PROFILE --> DECISION[quantization decision]
    end
```

## Entry Points

| Entry Point | File | Purpose |
|-------------|------|---------|
| `polar_quant_encode_turbo4()` | llama-turbo.cpp | Encode float KV → 4-bit packed |
| `polar_quant_decode_turbo4()` | llama-turbo.cpp | Decode 4-bit packed → float KV |
| `cmake -S . -B build -DTURBOQUANT_BUILD_TESTS=ON` | CMakeLists.txt | Build static library + CTest suite |
| `ctest --test-dir build --output-on-failure` | build/ | Run C++ roundtrip tests |
| `run_benchmarks.py` | benchmarks/ | Run perplexity benchmarks |
| `quant_selector.py` | quant_selector/ | Hardware-aware quantization selection |

## Key Abstractions

| Symbol | File | Purpose |
|--------|------|---------|
| `polar_quant_encode_turbo4()` | llama-turbo.h/.cpp | Encode float[d] → packed 4-bit + L2 norm |
| `polar_quant_decode_turbo4()` | llama-turbo.h/.cpp | Decode packed 4-bit + norm → float[d] |
| `turbo_dequantize_k()` | ggml-metal-turbo.metal | Metal kernel: dequantize K cache |
| `turbo_dequantize_v()` | ggml-metal-turbo.metal | Metal kernel: dequantize V cache |
| `turbo_fwht_128()` | ggml-metal-turbo.metal | Fast Walsh-Hadamard Transform |
| `run_perplexity.py` | benchmarks/ | Measure perplexity impact |
| `run_benchmarks.py` | benchmarks/ | Full benchmark suite (speed + quality) |
| `select_quantization()` | quant_selector.py | Pick quant scheme from hardware profile |

## Data Flow

```
Input: float KV vectors [d=128 per head]
        ↓
1. WHT rotation (in-place, O(d log d))
        ↓
2. Convert to polar coords (radius, angles)
        ↓
3. Lloyd-Max quantize angles → 4-bit indices
        ↓
4. Store: packed indices [d/2 bytes] + float norm [4 bytes]
        ↓
Decode: indices → codebook lookup → polar → cartesian → inverse WHT
        ↓
Output: reconstructed float KV [d=128]
```

## API Surface

| Function | Signature | Notes |
|----------|-----------|-------|
| `polar_quant_encode_turbo4` | `(const float*, uint8_t*, float*, int)` | Core encode path |
| `polar_quant_decode_turbo4` | `(const uint8_t*, float, float*, int)` | Core decode path |
| `select_quantization` | `(HardwareProfile) -> QuantConfig` | Python quant selector |

## File Index

| File | LOC | Purpose |
|------|-----|---------|
| `llama-turbo.h` | 24 | C API: encode/decode function declarations |
| `llama-turbo.cpp` | 78 | Implementation: PolarQuant encode/decode |
| `ggml-metal-turbo.metal` | 76 | Metal shader: dequantize + FWHT kernels |
| `CMakeLists.txt` | 42 | Standalone build: lib + test targets |
| `quant_selector.py` | ~120 | Python: hardware profile → quant decision |
| `tests/test_quant_selector.py` | ~90 | Pytest: quant selector (currently failing) |
| `benchmarks/run_benchmarks.py` | ~85 | Perplexity + speed benchmarking |

## CI / Runtime Drift

| Dimension | Status | Notes |
|-----------|--------|-------|
| **CMake/CTest standalone build** | ✅ Passing | `cmake -S . -B build -DTURBOQUANT_BUILD_TESTS=ON && ctest --test-dir build` works on current main |
| **Python quant selector tests** | ❌ Failing | `tests/test_quant_selector.py` fails on current main — tracked in `turboquant #139` |
| **CI lane: quant_selector** | ❌ Broken | The quant selector CI lane is non-blocking due to persistent failures |
| **CI lane: cmake roundtrip** | ✅ Green | C++ roundtrip test passes in CI |
| **Metal shader compilation** | ⚠️ Apple Silicon only | Cannot be tested in CI runners; validated manually on M-series hardware |

## Test Coverage Gaps

- `tests/test_quant_selector.py` is currently broken — selector returns wrong quantization for edge-case hardware profiles (see `turboquant #139`)
- No CI coverage for Metal shader correctness (Apple Silicon only)
- Benchmark regression detection is manual; no automated threshold enforcement

## Security Considerations

- C API operates on caller-allocated buffers — no internal bounds checking on `d` parameter
- Python quant selector reads hardware profile from filesystem; path traversal risk if profile dir is user-controllable

## Dependencies

| Dependency | Version | Purpose |
|------------|---------|---------|
| CMake | ≥3.20 | Build system |
| Python | ≥3.10 | Benchmarks + quant selector |
| pytest | any | Test runner for Python tests |
| Metal (macOS) | 14+ | GPU shader compilation |
| llama.cpp | fork | Integration layer |

## Deployment

- Static library `turboquant.a` linked into llama.cpp fork
- Python quant selector invoked at model-load time to pick compression scheme
- No standalone server component; embedded in inference runtime

## Technical Debt

- `turboquant #139` — quant selector test failures not yet resolved; CI lane is non-blocking
- No automated benchmark regression detection
- Metal shaders untestable in CI — manual validation on Apple Silicon required
- Stale genome (v1.0, 2026-04-15) did not reflect quant selector addition or CI drift
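## Appendix: Roundtrip Sketch (Illustrative)

The data-flow section describes a rotate → quantize → pack → unpack → inverse-rotate roundtrip. The toy sketch below illustrates that shape only: it uses a pure-Python orthonormal Walsh-Hadamard transform plus *uniform* 4-bit quantization with a max-abs scale, whereas the shipped `polar_quant_encode_turbo4()` uses polar coordinates, a Lloyd-Max codebook, and stores the L2 norm. All function names here are illustrative, not the C API.

```python
# Toy roundtrip sketch -- NOT the shipped PolarQuant implementation.
# Uniform quantization and a max-abs scale stand in for the real
# polar-coordinate Lloyd-Max codebook and stored L2 norm.
import math
import random

def fwht(vec):
    """In-place orthonormal Fast Walsh-Hadamard Transform, O(d log d).

    With the 1/sqrt(d) scaling the transform is its own inverse,
    so fwht(fwht(v)) reconstructs v (up to float error).
    """
    d = len(vec)
    h = 1
    while h < d:
        for i in range(0, d, h * 2):
            for j in range(i, i + h):
                x, y = vec[j], vec[j + h]
                vec[j], vec[j + h] = x + y, x - y
        h *= 2
    scale = 1.0 / math.sqrt(d)
    for i in range(d):
        vec[i] *= scale
    return vec

def encode_4bit(v):
    """Rotate, then quantize each channel to a 4-bit index plus one scale."""
    rotated = fwht(list(v))
    scale = max(abs(x) for x in rotated) or 1.0
    # 16 uniform levels over [-scale, scale]; Lloyd-Max would instead
    # place the levels to minimize expected error for the distribution.
    idx = [min(15, max(0, round((x / scale + 1.0) * 7.5))) for x in rotated]
    return idx, scale

def decode_4bit(idx, scale):
    """Dequantize, then undo the rotation (the WHT is self-inverse)."""
    approx = [(i / 7.5 - 1.0) * scale for i in idx]
    return fwht(approx)

random.seed(0)
v = [random.gauss(0.0, 1.0) for _ in range(128)]   # one head, d=128
idx, scale = encode_4bit(v)
out = decode_4bit(idx, scale)
norm = math.sqrt(sum(x * x for x in v))
rel_err = math.sqrt(sum((a - b) ** 2 for a, b in zip(v, out))) / norm
```

The uniform grid leaves roughly 10% relative reconstruction error on Gaussian vectors, which is exactly the gap the real Lloyd-Max codebook and QJL residual stage exist to close.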
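## Appendix: Nibble Packing (Illustrative)

The store format above packs 4-bit indices into `d/2` bytes (64 bytes for d=128). A minimal sketch of that packing, two indices per byte; the nibble order (even index in the low nibble) is an assumption, and the actual layout in `llama-turbo.cpp` may differ:

```python
# Illustrative packing for the "packed indices [d/2 bytes]" layout.
# Low-nibble-first ordering is an assumption, not the verified format.

def pack_nibbles(idx):
    """Pack 4-bit indices pairwise: even position -> low nibble, odd -> high."""
    assert len(idx) % 2 == 0
    return bytes((idx[i] & 0xF) | ((idx[i + 1] & 0xF) << 4)
                 for i in range(0, len(idx), 2))

def unpack_nibbles(data):
    """Inverse of pack_nibbles: each byte yields two 4-bit indices."""
    out = []
    for b in data:
        out.append(b & 0xF)
        out.append(b >> 4)
    return out

packed = pack_nibbles(list(range(16)))  # 16 indices -> 8 bytes
```

With the 4-byte norm added, one d=128 vector costs 64 + 4 = 68 bytes versus 256 bytes in fp16, before the QJL residual bits.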
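## Appendix: Quant Selection Sketch (Illustrative)

The `select_quantization(HardwareProfile) -> QuantConfig` entry point maps a hardware profile to a compression scheme at model-load time. A hypothetical sketch of that shape; the field names (`unified_memory_gb`, `chip_family`), scheme labels, and thresholds below are invented for illustration and are not the real `quant_selector.py` API:

```python
# Hypothetical selection sketch. Field names, scheme labels, and memory
# thresholds are invented; the real selector reads profiles from
# model_registry/ and hardware_profiles.py.
from dataclasses import dataclass

@dataclass(frozen=True)
class HardwareProfile:
    unified_memory_gb: int
    chip_family: str  # e.g. "M2", "M3"

@dataclass(frozen=True)
class QuantConfig:
    scheme: str             # "turboquant" | "polarquant" | "fp16"
    bits_per_channel: float

def select_quantization(profile: HardwareProfile) -> QuantConfig:
    # Tight unified memory -> most aggressive scheme (PolarQuant + QJL).
    if profile.unified_memory_gb <= 32:
        return QuantConfig("turboquant", 3.5)
    # Mid-range -> PolarQuant alone leaves enough headroom.
    if profile.unified_memory_gb <= 64:
        return QuantConfig("polarquant", 4.0)
    # Plenty of memory -> skip KV compression entirely.
    return QuantConfig("fp16", 16.0)

cfg = select_quantization(HardwareProfile(32, "M2"))
```

Whatever the real thresholds are, edge-case profiles are exactly where the current selector misbehaves (`turboquant #139`), so boundary values like 32 GB deserve explicit test coverage.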