2026-04-15 20:59:55 -04:00
# GENOME.md — TurboQuant (Timmy_Foundation/turboquant)
2026-04-20 21:42:33 -04:00
> Codebase Genome v1.1 | Refreshed 2026-04-18 | Repo 12/16 | Ref: #679
## Project Overview
**TurboQuant** is a KV cache compression system for local inference on Apple Silicon. It implements the method from Google's ICLR 2026 paper to enable 64K-128K context on 27B models within 32 GB of unified memory.
**Three-stage compression:**
1. **PolarQuant**: WHT rotation + polar coordinates + Lloyd-Max codebook (~4.2x compression)
2. **QJL**: 1-bit quantized Johnson-Lindenstrauss residual correction
3. **TurboQuant**: PolarQuant + QJL = ~3.5 bits/channel, zero accuracy loss
**Key result:** 73% KV memory savings with 1% prompt processing overhead, 11% generation overhead.
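
The 73% figure can be sanity-checked against the storage layout listed under Data Flow (d/2 bytes of packed indices plus a 4-byte norm per vector). The sketch below assumes an fp16 baseline of 2 bytes per element and ignores any QJL residual bits, whose storage cost this document does not specify; the function name is illustrative and not part of the repo.

```python
def kv_vector_footprint(d: int = 128) -> dict:
    """Per-vector storage for the 4-bit packed layout vs. an fp16 baseline."""
    fp16_bytes = 2 * d          # assumption: baseline stores 2 bytes per element
    packed_bytes = d // 2 + 4   # d/2 bytes of 4-bit indices + one 4-byte float norm
    return {
        "fp16_bytes": fp16_bytes,
        "packed_bytes": packed_bytes,
        "bits_per_channel": 8 * packed_bytes / d,
        "savings": 1 - packed_bytes / fp16_bytes,
    }


stats = kv_vector_footprint(128)
print(stats)
```

For d = 128 this works out to 68 bytes against 256, roughly 73% savings, or 4.25 bits per channel for the packed indices and norm alone.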
## Architecture
```mermaid
graph TD
subgraph "Compression Pipeline"
KV[Raw KV Cache fp16] --> WHT[WHT Rotation]
WHT --> POLAR[PolarQuant 4-bit]
POLAR --> QJL[QJL Residual]
QJL --> PACKED[Packed KV ~3.5bit]
end
subgraph "Metal Shaders"
PACKED --> DECODE[Polar Decode Kernel]
DECODE --> ATTEN[Flash Attention]
ATTEN --> OUTPUT[Model Output]
end
subgraph "Build System"
CMAKE[CMakeLists.txt] --> LIB[turboquant.a]
LIB --> TEST[turboquant_roundtrip_test]
LIB --> LLAMA[llama.cpp fork integration]
end
subgraph "Python Layer"
SELECTOR[quant_selector.py] --> MODELS[model_registry/]
MODELS --> PROFILE[hardware_profiles.py]
PROFILE --> DECISION[quantization decision]
end
```
## Entry Points
| Entry Point | File | Purpose |
|-------------|------|---------|
| `polar_quant_encode_turbo4()` | llama-turbo.cpp | Encode float KV → 4-bit packed |
| `polar_quant_decode_turbo4()` | llama-turbo.cpp | Decode 4-bit packed → float KV |
| `cmake -S . -B build -DTURBOQUANT_BUILD_TESTS=ON` | CMakeLists.txt | Build static library + CTest suite |
| `ctest --test-dir build --output-on-failure` | build/ | Run C++ roundtrip tests |
| `run_benchmarks.py` | benchmarks/ | Run perplexity benchmarks |
| `quant_selector.py` | quant_selector/ | Hardware-aware quantization selection |
## Key Abstractions
| Symbol | File | Purpose |
|--------|------|---------|
| `polar_quant_encode_turbo4()` | llama-turbo.h/.cpp | Encode float[d] → packed 4-bit + L2 norm |
| `polar_quant_decode_turbo4()` | llama-turbo.h/.cpp | Decode packed 4-bit + norm → float[d] |
| `turbo_dequantize_k()` | ggml-metal-turbo.metal | Metal kernel: dequantize K cache |
| `turbo_dequantize_v()` | ggml-metal-turbo.metal | Metal kernel: dequantize V cache |
| `turbo_fwht_128()` | ggml-metal-turbo.metal | Fast Walsh-Hadamard Transform |
| `run_perplexity.py` | benchmarks/ | Measure perplexity impact |
| `run_benchmarks.py` | benchmarks/ | Full benchmark suite (speed + quality) |
| `select_quantization()` | quant_selector.py | Pick quant scheme from hardware profile |
## Data Flow
```
Input: float KV vectors [d=128 per head]
↓
1. WHT rotation (in-place, O(d log d))
↓
2. Convert to polar coords (radius, angles)
↓
3. Lloyd-Max quantize angles → 4-bit indices
↓
4. Store: packed indices [d/2 bytes] + float norm [4 bytes]
↓
Decode: indices → codebook lookup → polar → cartesian → inverse WHT
↓
Output: reconstructed float KV [d=128]
```
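
Step 1 of the flow above can be illustrated with a textbook fast Walsh-Hadamard transform. This is a minimal sketch, not the repo's `turbo_fwht_128()` kernel: it works on a copy rather than in place, and the orthonormal 1/sqrt(d) scaling is an assumption (the actual kernel may normalize or order its outputs differently). With that scaling the transform is its own inverse, which is what lets the decode path undo the rotation.

```python
import math


def fwht(values: list[float]) -> list[float]:
    """Orthonormal fast Walsh-Hadamard transform; length must be a power of two."""
    n = len(values)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    out = list(values)  # work on a copy for clarity
    h = 1
    while h < n:
        # Butterfly pass: combine elements that are h apart, O(n) per pass,
        # log2(n) passes in total.
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = out[i], out[i + h]
                out[i], out[i + h] = a + b, a - b
        h *= 2
    scale = 1.0 / math.sqrt(n)  # assumed orthonormal scaling
    return [x * scale for x in out]


vec = [float(i % 7) - 3.0 for i in range(128)]
roundtrip = fwht(fwht(vec))  # applying the transform twice restores the input
max_err = max(abs(a - b) for a, b in zip(vec, roundtrip))
```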
## API Surface
| Function | Signature | Notes |
|----------|-----------|-------|
| `polar_quant_encode_turbo4` | `(const float*, uint8_t*, float*, int)` | Core encode path |
| `polar_quant_decode_turbo4` | `(const uint8_t*, float, float*, int)` | Core decode path |
| `select_quantization` | `(HardwareProfile) -> QuantConfig` | Python quant selector |
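
The `uint8_t*` buffer in the encode and decode signatures holds the packed 4-bit indices, d/2 bytes per vector according to the Data Flow section. The sketch below shows one way two indices can share a byte. The nibble order (even-position index in the low nibble) and both function names are assumptions for illustration; the actual layout in `llama-turbo.cpp` may differ.

```python
def pack_nibbles(indices: list[int]) -> bytes:
    """Pack 4-bit indices two per byte (assumed order: even index in the low nibble)."""
    if len(indices) % 2:
        raise ValueError("need an even number of indices")
    if any(not 0 <= i <= 15 for i in indices):
        raise ValueError("indices must fit in 4 bits")
    return bytes(lo | (hi << 4) for lo, hi in zip(indices[0::2], indices[1::2]))


def unpack_nibbles(packed: bytes) -> list[int]:
    """Inverse of pack_nibbles: low nibble first, then high nibble."""
    out = []
    for byte in packed:
        out.append(byte & 0x0F)
        out.append(byte >> 4)
    return out


indices = [i % 16 for i in range(128)]  # stand-in for codebook indices, d = 128
packed = pack_nibbles(indices)          # 64 bytes, i.e. d/2
```

Validating the index count and range up front mirrors the check a caller would need to do before handing buffers to the C API, since that API does no bounds checking of its own.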
## File Index
| File | LOC | Purpose |
|------|-----|---------|
| `llama-turbo.h` | 24 | C API: encode/decode function declarations |
| `llama-turbo.cpp` | 78 | Implementation: PolarQuant encode/decode |
| `ggml-metal-turbo.metal` | 76 | Metal shader: dequantize + FWHT kernels |
| `CMakeLists.txt` | 42 | Standalone build: lib + test targets |
| `quant_selector.py` | ~120 | Python: hardware profile → quant decision |
| `tests/test_quant_selector.py` | ~90 | Pytest: quant selector (currently failing) |
| `benchmarks/run_benchmarks.py` | ~85 | Perplexity + speed benchmarking |
## CI / Runtime Drift
| Dimension | Status | Notes |
|-----------|--------|-------|
| **CMake/CTest standalone build** | ✅ Passing | `cmake -S . -B build -DTURBOQUANT_BUILD_TESTS=ON && ctest --test-dir build` works on current main |
| **Python quant selector tests** | ❌ Failing | `tests/test_quant_selector.py` fails on current main; tracked in `turboquant #139` |
| **CI lane: quant_selector** | ❌ Broken | The quant selector CI lane is non-blocking due to persistent failures |
| **CI lane: cmake roundtrip** | ✅ Green | C++ roundtrip test passes in CI |
| **Metal shader compilation** | ⚠️ Apple Silicon only | Cannot be tested in CI runners; validated manually on M-series hardware |
## Test Coverage Gaps
- `tests/test_quant_selector.py` is currently broken: the selector returns the wrong quantization for edge-case hardware profiles (see `turboquant #139`)
- No CI coverage for Metal shader correctness (Apple Silicon only)
- Benchmark regression detection is manual; no automated threshold enforcement
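
As an illustration of what automated threshold enforcement for the last gap could look like, here is a minimal sketch. Nothing in it exists in the repo: the function, the metric names, and the 2% tolerance are hypothetical, and the numbers are placeholders rather than measurements.

```python
def find_regressions(
    baseline: dict[str, float],
    current: dict[str, float],
    max_relative_increase: float = 0.02,
) -> list[str]:
    """Return metrics (lower is better) that worsened by more than the tolerance."""
    regressions = []
    for name, base in baseline.items():
        if name not in current:
            regressions.append(name)  # a missing metric counts as a failure
            continue
        if current[name] > base * (1 + max_relative_increase):
            regressions.append(name)
    return regressions


# Placeholder values for illustration only, not benchmark results.
baseline = {"perplexity": 6.00, "ms_per_token": 40.0}
current = {"perplexity": 6.05, "ms_per_token": 45.0}
failed = find_regressions(baseline, current)
```

A CI step could exit non-zero whenever the returned list is non-empty, which would turn the manual check into a blocking one.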
## Security Considerations
- C API operates on caller-allocated buffers — no internal bounds checking on `d` parameter
- Python quant selector reads hardware profile from filesystem; path traversal risk if profile dir is user-controllable
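
One common mitigation for the path traversal risk is to resolve the requested profile path and confirm it still sits under the profile directory. The helper below is a sketch under that assumption; `resolve_profile_path` and its parameters are hypothetical and do not describe how `quant_selector.py` currently loads profiles.

```python
from pathlib import Path


def resolve_profile_path(profile_dir: str, profile_name: str) -> Path:
    """Resolve profile_name inside profile_dir, rejecting paths that escape it."""
    root = Path(profile_dir).resolve()
    candidate = (root / profile_name).resolve()  # collapses ".." and symlinks
    if not candidate.is_relative_to(root):       # Path.is_relative_to: Python 3.9+
        raise ValueError(f"profile path escapes {root}: {profile_name!r}")
    return candidate
```

A name such as `../outside.json`, or an absolute path, resolves outside the root and is rejected before any file is opened.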
## Dependencies
| Dependency | Version | Purpose |
|------------|---------|---------|
| CMake | ≥3.20 | Build system |
| Python | ≥3.10 | Benchmarks + quant selector |
| pytest | any | Test runner for Python tests |
| Metal (macOS) | 14+ | GPU shader compilation |
| llama.cpp | fork | Integration layer |
## Deployment
- Static library `turboquant.a` linked into llama.cpp fork
- Python quant selector invoked at model-load time to pick compression scheme
- No standalone server component; embedded in inference runtime
## Technical Debt
- `turboquant #139` — quant selector test failures not yet resolved; CI lane is non-blocking
- No automated benchmark regression detection
- Metal shaders untestable in CI — manual validation on Apple Silicon required
- Stale genome (v1.0, 2026-04-15) did not reflect quant selector addition or CI drift