# GENOME.md — TurboQuant (Timmy_Foundation/turboquant)

> Codebase Genome v1.1 | Refreshed 2026-04-18 | Repo 12/16 | Ref: #679

## Project Overview

**TurboQuant** is a KV cache compression system for local inference on Apple Silicon. It implements Google's ICLR 2026 paper to enable 64K-128K context on 27B models within 32 GB of unified memory.

**Three-stage compression:**
1. **PolarQuant** — WHT rotation + polar coordinates + Lloyd-Max codebook (~4.2x compression)
2. **QJL** — 1-bit quantized Johnson-Lindenstrauss residual correction
3. **TurboQuant** — PolarQuant + QJL = ~3.5 bits/channel, zero accuracy loss

**Key result:** 73% KV memory savings with 1% prompt processing overhead, 11% generation overhead.

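To make the Lloyd-Max codebook in stage 1 concrete, here is a minimal sketch of the generic one-dimensional Lloyd-Max iteration. It is an illustration under stated assumptions, not the repository's trainer: the names `lloyd_max`, `quantize`, and `mean_squared_error`, the quantile initialization, and the synthetic Gaussian samples are all hypothetical, and the sketch quantizes plain scalars rather than the polar angles TurboQuant encodes.

```python
import bisect
import random


def lloyd_max(samples, levels=16, iterations=50):
    """Fit a scalar codebook by alternating nearest-level assignment and centroid updates."""
    data = sorted(samples)
    n = len(data)
    if n < levels:
        raise ValueError("need at least one sample per level")
    # Assumption: start from evenly spaced sample quantiles.
    codebook = [data[(2 * k + 1) * n // (2 * levels)] for k in range(levels)]
    for _ in range(iterations):
        # Decision boundaries sit halfway between neighboring levels.
        bounds = [(a + b) / 2 for a, b in zip(codebook, codebook[1:])]
        sums = [0.0] * levels
        counts = [0] * levels
        for x in data:
            k = bisect.bisect_right(bounds, x)
            sums[k] += x
            counts[k] += 1
        # Each level moves to the mean of its cell; empty cells keep their level.
        updated = [sums[k] / counts[k] if counts[k] else codebook[k] for k in range(levels)]
        if updated == codebook:
            break
        codebook = updated
    return codebook


def quantize(x, codebook):
    """Return the index of the codebook level nearest to x (4 bits for 16 levels)."""
    bounds = [(a + b) / 2 for a, b in zip(codebook, codebook[1:])]
    return bisect.bisect_right(bounds, x)


def mean_squared_error(samples, codebook):
    """Average squared reconstruction error over the samples."""
    return sum((x - codebook[quantize(x, codebook)]) ** 2 for x in samples) / len(samples)


rng = random.Random(0)
samples = [rng.gauss(0.0, 1.0) for _ in range(2000)]  # synthetic data, not real KV values
codebook = lloyd_max(samples)
print(len(codebook), mean_squared_error(samples, codebook))
```

Each iteration can only lower (or keep) the error on the training samples, which is why a fitted codebook is preferred over a uniform grid when the value distribution is known in advance.
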
## Architecture

```mermaid
graph TD
    subgraph "Compression Pipeline"
        KV[Raw KV Cache fp16] --> WHT[WHT Rotation]
        WHT --> POLAR[PolarQuant 4-bit]
        POLAR --> QJL[QJL Residual]
        QJL --> PACKED[Packed KV ~3.5bit]
    end

    subgraph "Metal Shaders"
        PACKED --> DECODE[Polar Decode Kernel]
        DECODE --> ATTEN[Flash Attention]
        ATTEN --> OUTPUT[Model Output]
    end

    subgraph "Build System"
        CMAKE[CMakeLists.txt] --> LIB[turboquant.a]
        LIB --> TEST[turboquant_roundtrip_test]
        LIB --> LLAMA[llama.cpp fork integration]
    end

    subgraph "Python Layer"
        SELECTOR[quant_selector.py] --> MODELS[model_registry/]
        MODELS --> PROFILE[hardware_profiles.py]
        PROFILE --> DECISION[quantization decision]
    end
```

## Entry Points

| Entry Point | File | Purpose |
|-------------|------|---------|
| `polar_quant_encode_turbo4()` | llama-turbo.cpp | Encode float KV → 4-bit packed |
| `polar_quant_decode_turbo4()` | llama-turbo.cpp | Decode 4-bit packed → float KV |
| `cmake -S . -B build -DTURBOQUANT_BUILD_TESTS=ON` | CMakeLists.txt | Build static library + CTest suite |
| `ctest --test-dir build --output-on-failure` | build/ | Run C++ roundtrip tests |
| `run_benchmarks.py` | benchmarks/ | Run perplexity benchmarks |
| `quant_selector.py` | quant_selector/ | Hardware-aware quantization selection |

## Key Abstractions

| Symbol | File | Purpose |
|--------|------|---------|
| `polar_quant_encode_turbo4()` | llama-turbo.h/.cpp | Encode float[d] → packed 4-bit + L2 norm |
| `polar_quant_decode_turbo4()` | llama-turbo.h/.cpp | Decode packed 4-bit + norm → float[d] |
| `turbo_dequantize_k()` | ggml-metal-turbo.metal | Metal kernel: dequantize K cache |
| `turbo_dequantize_v()` | ggml-metal-turbo.metal | Metal kernel: dequantize V cache |
| `turbo_fwht_128()` | ggml-metal-turbo.metal | Fast Walsh-Hadamard Transform |
| `run_perplexity.py` | benchmarks/ | Measure perplexity impact |
| `run_benchmarks.py` | benchmarks/ | Full benchmark suite (speed + quality) |
| `select_quantization()` | quant_selector.py | Pick quant scheme from hardware profile |

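For readers unfamiliar with the transform behind `turbo_fwht_128()`, the following is a minimal reference sketch of the standard in-place butterfly form of the Fast Walsh-Hadamard Transform. The Python function `fwht` is hypothetical and is not a port of the Metal kernel; in particular, the orthonormal `1/sqrt(n)` scaling is an assumption, since the kernel's scaling convention is not documented here.

```python
import math


def fwht(values):
    """Return the orthonormal Walsh-Hadamard transform of a power-of-two-length sequence."""
    n = len(values)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    out = list(values)
    h = 1
    while h < n:
        # One butterfly pass: combine elements that are h apart.
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = out[i], out[i + h]
                out[i], out[i + h] = a + b, a - b
        h *= 2
    # Assumed scaling: 1/sqrt(n) makes the transform its own inverse.
    scale = 1.0 / math.sqrt(n)
    return [x * scale for x in out]


vector = [float(i % 5) for i in range(128)]
rotated = fwht(vector)
restored = fwht(rotated)  # applying the transform twice recovers the input
print(max(abs(a - b) for a, b in zip(vector, restored)))
```

With this scaling the transform preserves the L2 norm, so the norm stored alongside the packed indices is the same whether it is measured before or after the rotation. The loop runs log2(n) passes over n elements, which gives O(n log n) work.
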
## Data Flow

```
Input: float KV vectors [d=128 per head]
↓
1. WHT rotation (in-place, O(d log d))
↓
2. Convert to polar coords (radius, angles)
↓
3. Lloyd-Max quantize angles → 4-bit indices
↓
4. Store: packed indices [d/2 bytes] + float norm [4 bytes]
↓
Decode: indices → codebook lookup → polar → cartesian → inverse WHT
↓
Output: reconstructed float KV [d=128]
```

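The storage layout in step 4 can be checked with a little arithmetic. The sketch below computes the per-vector footprint and shows one way to pack two 4-bit indices into each byte. The helper names are hypothetical, and the nibble order (low nibble first) is an assumption; the order actually used in `llama-turbo.cpp` is not documented here.

```python
def turbo4_bytes_per_vector(d, norm_bytes=4):
    """Bytes for one encoded vector: d/2 bytes of 4-bit indices plus the float norm."""
    if d <= 0 or d % 2:
        raise ValueError("d must be a positive even number")
    return d // 2 + norm_bytes


def pack_nibbles(indices):
    """Pack 4-bit indices two per byte; low nibble first is an assumed order."""
    if len(indices) % 2:
        raise ValueError("need an even number of indices")
    if any(not 0 <= i <= 15 for i in indices):
        raise ValueError("indices must fit in 4 bits")
    return bytes(lo | (hi << 4) for lo, hi in zip(indices[0::2], indices[1::2]))


def unpack_nibbles(packed):
    """Inverse of pack_nibbles."""
    out = []
    for byte in packed:
        out.extend((byte & 0x0F, byte >> 4))
    return out


d = 128
encoded = turbo4_bytes_per_vector(d)     # 64 index bytes + 4 norm bytes
baseline = 2 * d                         # fp16 baseline: 2 bytes per element
print(encoded, baseline)                 # prints "68 256"
print(f"{1 - encoded / baseline:.1%}")   # prints "73.4%"
print(8 * encoded / d)                   # prints "4.25" (bits per channel)
```

The 73.4% figure from this layout arithmetic is consistent with the 73% savings quoted in the Project Overview, assuming that number refers to the 4-bit layout measured against an fp16 baseline.
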
## API Surface

| Function | Signature | Notes |
|----------|-----------|-------|
| `polar_quant_encode_turbo4` | `(const float*, uint8_t*, float*, int)` | Core encode path |
| `polar_quant_decode_turbo4` | `(const uint8_t*, float, float*, int)` | Core decode path |
| `select_quantization` | `(HardwareProfile) -> QuantConfig` | Python quant selector |

## File Index

| File | LOC | Purpose |
|------|-----|---------|
| `llama-turbo.h` | 24 | C API: encode/decode function declarations |
| `llama-turbo.cpp` | 78 | Implementation: PolarQuant encode/decode |
| `ggml-metal-turbo.metal` | 76 | Metal shader: dequantize + FWHT kernels |
| `CMakeLists.txt` | 42 | Standalone build: lib + test targets |
| `quant_selector.py` | ~120 | Python: hardware profile → quant decision |
| `tests/test_quant_selector.py` | ~90 | Pytest: quant selector (currently failing) |
| `benchmarks/run_benchmarks.py` | ~85 | Perplexity + speed benchmarking |

## CI / Runtime Drift

| Dimension | Status | Notes |
|-----------|--------|-------|
| **CMake/CTest standalone build** | ✅ Passing | `cmake -S . -B build -DTURBOQUANT_BUILD_TESTS=ON && ctest --test-dir build` works on current main |
| **Python quant selector tests** | ❌ Failing | `tests/test_quant_selector.py` fails on current main — tracked in `turboquant #139` |
| **CI lane: quant_selector** | ❌ Broken | The quant selector CI lane is non-blocking due to persistent failures |
| **CI lane: cmake roundtrip** | ✅ Green | C++ roundtrip test passes in CI |
| **Metal shader compilation** | ⚠️ Apple Silicon only | Cannot be tested in CI runners; validated manually on M-series hardware |

## Test Coverage Gaps

- `tests/test_quant_selector.py` is currently broken — selector returns wrong quantization for edge-case hardware profiles (see `turboquant #139`)
- No CI coverage for Metal shader correctness (Apple Silicon only)
- Benchmark regression detection is manual; no automated threshold enforcement

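As one possible shape for the missing threshold enforcement, the sketch below compares a current benchmark run against a stored baseline and reports any metric that grew past a relative tolerance. Everything in it is hypothetical: the function `find_regressions`, the metric names, the 1% default tolerance, and the sample numbers are placeholders, and the real output format of `run_benchmarks.py` is not documented here.

```python
def find_regressions(baseline, current, max_relative_increase=0.01):
    """Return messages for lower-is-better metrics that grew beyond the tolerance."""
    failures = []
    for name, base_value in sorted(baseline.items()):
        if name not in current:
            failures.append(f"{name}: missing from current results")
            continue
        limit = base_value * (1.0 + max_relative_increase)
        if current[name] > limit:
            failures.append(f"{name}: {current[name]:.4f} exceeds limit {limit:.4f}")
    return failures


# Placeholder numbers for illustration only, not measurements from this repository.
baseline = {"perplexity": 10.0, "ms_per_token": 20.0}
current = {"perplexity": 10.05, "ms_per_token": 21.0}
for message in find_regressions(baseline, current):
    print(message)
```

A CI job could exit non-zero whenever the returned list is non-empty, which would turn the manual check into an enforced gate.
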
## Security Considerations

- C API operates on caller-allocated buffers — no internal bounds checking on `d` parameter
- Python quant selector reads hardware profile from filesystem; path traversal risk if profile dir is user-controllable

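Both risks above can be reduced with checks on the calling side. The following sketch shows one way to do that and is not code from the repository: `check_turbo4_buffers` and `resolve_profile_path` are hypothetical helpers, the power-of-two requirement on `d` is an assumption based on the standard FWHT, and the `d / 2` packed size comes from the Data Flow section.

```python
from pathlib import Path


def check_turbo4_buffers(d, src_len, packed_len):
    """Validate sizes before handing caller-allocated buffers to an encode call."""
    if d < 2 or d & (d - 1):
        raise ValueError("d must be a power of two and at least 2")
    if src_len < d:
        raise ValueError(f"source holds {src_len} floats, need {d}")
    if packed_len < d // 2:
        raise ValueError(f"packed buffer holds {packed_len} bytes, need {d // 2}")


def resolve_profile_path(profile_dir, name):
    """Resolve name inside profile_dir, rejecting paths that escape it."""
    base = Path(profile_dir).resolve()
    candidate = (base / name).resolve()
    if not candidate.is_relative_to(base):  # Path.is_relative_to needs Python 3.9+
        raise ValueError(f"profile path escapes {base}: {name!r}")
    return candidate


check_turbo4_buffers(128, src_len=128, packed_len=64)  # passes silently
print(resolve_profile_path("profiles", "example.json"))  # hypothetical directory and file
```

Resolving both paths before comparing them means `..` segments and symlinks are collapsed first, so a name such as `../other.json` is rejected instead of being read.
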
## Dependencies

| Dependency | Version | Purpose |
|------------|---------|---------|
| CMake | ≥3.20 | Build system |
| Python | ≥3.10 | Benchmarks + quant selector |
| pytest | any | Test runner for Python tests |
| Metal (macOS) | 14+ | GPU shader compilation |
| llama.cpp | fork | Integration layer |

## Deployment

- Static library `turboquant.a` linked into llama.cpp fork
- Python quant selector invoked at model-load time to pick compression scheme
- No standalone server component; embedded in inference runtime

## Technical Debt

- `turboquant #139` — quant selector test failures not yet resolved; CI lane is non-blocking
- No automated benchmark regression detection
- Metal shaders untestable in CI — manual validation on Apple Silicon required
- Stale genome (v1.0, 2026-04-15) did not reflect quant selector addition or CI drift