# GENOME.md — TurboQuant (Timmy_Foundation/turboquant)

> Codebase Genome v1.0 | Generated 2026-04-15 | Repo 12/16
## Project Overview

**TurboQuant** is a KV cache compression system for local inference on Apple Silicon. It implements Google's ICLR 2026 paper to unlock 64K–128K context on 27B models within 32 GB of unified memory.

**Three-stage compression:**

1. **PolarQuant** — WHT rotation + polar coordinates + Lloyd-Max codebook (~4.2x compression)
2. **QJL** — 1-bit quantized Johnson-Lindenstrauss residual correction
3. **TurboQuant** — PolarQuant + QJL = ~3.5 bits/channel, zero accuracy loss

**Key result:** 73% KV memory savings, at a cost of 1% prompt-processing overhead and 11% generation overhead.
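The headline 73% figure can be sanity-checked with back-of-envelope arithmetic. The sketch below assumes a per-head dimension of d = 128, 4-bit packed indices, and one fp32 norm per vector (illustrative assumptions; the exact layout lives in `llama-turbo.cpp`):

```python
# Back-of-envelope check of the headline memory saving.
# Assumptions (illustrative): head dim d = 128, 4-bit packed
# indices (two per byte), one fp32 L2 norm per vector.
d = 128
fp16_bytes = d * 2            # raw fp16 KV vector: 256 bytes
packed_bytes = d // 2 + 4     # 64 bytes of indices + 4-byte norm = 68 bytes
savings = 1 - packed_bytes / fp16_bytes
print(f"{savings:.1%}")       # prints "73.4%"
```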
## Architecture

```mermaid
graph TD
    subgraph "Compression Pipeline"
        KV[Raw KV Cache fp16] --> WHT[WHT Rotation]
        WHT --> POLAR[PolarQuant 4-bit]
        POLAR --> QJL[QJL Residual]
        QJL --> PACKED[Packed KV ~3.5bit]
    end

    subgraph "Metal Shaders"
        PACKED --> DECODE[Polar Decode Kernel]
        DECODE --> ATTEN[Flash Attention]
        ATTEN --> OUTPUT[Model Output]
    end

    subgraph "Build System"
        CMAKE[CMakeLists.txt] --> LIB[turboquant.a]
        LIB --> TEST[turboquant_roundtrip_test]
        LIB --> LLAMA[llama.cpp fork integration]
    end
```
## Entry Points

| Entry Point | File | Purpose |
|-------------|------|---------|
| `polar_quant_encode_turbo4()` | llama-turbo.cpp | Encode float KV → 4-bit packed |
| `polar_quant_decode_turbo4()` | llama-turbo.cpp | Decode 4-bit packed → float KV |
| `cmake build` | CMakeLists.txt | Build static library + tests |
| `run_benchmarks.py` | benchmarks/ | Run the full benchmark suite (speed + quality) |
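The `*_turbo4()` entry points exchange 4-bit indices packed two per byte. A minimal sketch of that packing (the nibble order here is an assumption for illustration, not taken from the code):

```python
def pack_nibbles(indices):
    """Pack 4-bit indices two per byte (low nibble first; order assumed)."""
    assert len(indices) % 2 == 0 and all(0 <= i < 16 for i in indices)
    return bytes(indices[i] | (indices[i + 1] << 4)
                 for i in range(0, len(indices), 2))

def unpack_nibbles(buf):
    """Inverse of pack_nibbles."""
    out = []
    for b in buf:
        out.extend((b & 0xF, b >> 4))
    return out

packed = pack_nibbles([1, 15, 0, 7])
print(list(packed))              # [241, 112]
print(unpack_nibbles(packed))    # [1, 15, 0, 7]
```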
## Key Abstractions

| Symbol | File | Purpose |
|--------|------|---------|
| `polar_quant_encode_turbo4()` | llama-turbo.h/.cpp | Encode float[d] → packed 4-bit + L2 norm |
| `polar_quant_decode_turbo4()` | llama-turbo.h/.cpp | Decode packed 4-bit + norm → float[d] |
| `turbo_dequantize_k()` | ggml-metal-turbo.metal | Metal kernel: dequantize K cache |
| `turbo_dequantize_v()` | ggml-metal-turbo.metal | Metal kernel: dequantize V cache |
| `turbo_fwht_128()` | ggml-metal-turbo.metal | Fast Walsh-Hadamard Transform |
| `run_perplexity.py` | benchmarks/ | Measure perplexity impact |
| `run_benchmarks.py` | benchmarks/ | Full benchmark suite (speed + quality) |
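`turbo_fwht_128()` implements the fast Walsh-Hadamard transform used for rotation. A pure-Python reference (not the Metal kernel) showing the O(d log d) butterfly and the self-inverse property H·H = d·I:

```python
def fwht(x):
    """Fast Walsh-Hadamard transform; len(x) must be a power of two."""
    x = list(x)
    h = 1
    while h < len(x):
        for i in range(0, len(x), h * 2):
            for j in range(i, i + h):
                a, b = x[j], x[j + h]
                x[j], x[j + h] = a + b, a - b  # butterfly step
        h *= 2
    return x

v = [1.0, -2.0, 3.0, 0.5]
round_trip = [t / len(v) for t in fwht(fwht(v))]  # H(Hv)/d == v
print(round_trip)   # [1.0, -2.0, 3.0, 0.5]
```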
## Data Flow

```
Input: float KV vectors [d=128 per head]
        ↓
1. WHT rotation (in-place, O(d log d))
        ↓
2. Convert to polar coords (radius, angles)
        ↓
3. Lloyd-Max quantize angles → 4-bit indices
        ↓
4. Store: packed indices [d/2 bytes] + float norm [4 bytes]
        ↓
Decode: indices → codebook lookup → polar → cartesian → inverse WHT
        ↓
Output: reconstructed float KV [d=128]
```
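The polar quantize/dequantize steps of this flow can be sketched with a simplified pairwise-polar variant. Note the real PolarQuant uses a hyperspherical parameterization and a trained Lloyd-Max codebook; uniform angle quantization stands in for the codebook here, and the WHT stage is omitted:

```python
import math

BITS = 4
LEVELS = (1 << BITS) - 1  # 15 quantization steps across [-pi, pi]

def encode(v):
    """Pairwise polar: keep each pair's radius, quantize its angle to 4 bits."""
    out = []
    for i in range(0, len(v), 2):
        r = math.hypot(v[i], v[i + 1])
        theta = math.atan2(v[i + 1], v[i])              # angle in [-pi, pi]
        q = round((theta + math.pi) / (2 * math.pi) * LEVELS)
        out.append((r, q))
    return out

def decode(pairs):
    """Map each (radius, angle index) back to a Cartesian pair."""
    v = []
    for r, q in pairs:
        theta = q / LEVELS * 2 * math.pi - math.pi
        v.extend((r * math.cos(theta), r * math.sin(theta)))
    return v

v = [0.5, -1.25, 2.0, 0.75, -0.3, 1.1, 0.0, -2.2]
rec = decode(encode(v))
# Reconstruction error is bounded by half an angle-quantization step.
rel_err = math.dist(v, rec) / math.hypot(*v)
assert rel_err < 0.21
```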
## File Index

| File | LOC | Purpose |
|------|-----|---------|
| `llama-turbo.h` | 24 | C API: encode/decode function declarations |
| `llama-turbo.cpp` | 78 | Implementation: PolarQuant encode/decode |
| `ggml-metal-turbo.metal` | 76 | Metal shaders: dequantize + flash attention |
| `CMakeLists.txt` | 44 | Build system: static lib + tests |
| `tests/roundtrip_test.cpp` | 104 | Roundtrip encode→decode validation |
| `benchmarks/run_benchmarks.py` | 227 | Benchmark suite |
| `benchmarks/run_perplexity.py` | ~100 | Perplexity measurement |
| `evolution/hardware_optimizer.py` | 5 | Hardware detection stub |

**Total: ~660 LOC | C++ core: 206 LOC | Python benchmarks: 232 LOC**
## Dependencies

| Dependency | Purpose |
|------------|---------|
| CMake 3.16+ | Build system |
| C++17 compiler | Core implementation |
| Metal (macOS) | GPU shader execution |
| Python 3.11+ | Benchmarks |
| llama.cpp fork | Integration target |
## Source Repos (Upstream)

| Repo | Role |
|------|------|
| TheTom/llama-cpp-turboquant | llama.cpp fork with Metal shaders |
| TheTom/turboquant_plus | Reference implementation, 511+ tests |
| amirzandieh/QJL | QJL authors' reference code (CUDA) |
| rachittshah/mlx-turboquant | MLX fallback |
## Test Coverage

| Test | File | Validates |
|------|------|-----------|
| `turboquant_roundtrip` | tests/roundtrip_test.cpp | Encode→decode roundtrip fidelity |
| Perplexity benchmarks | benchmarks/run_perplexity.py | Quality preservation across prompts |
| Speed benchmarks | benchmarks/run_benchmarks.py | Compression overhead measurement |
## Security Considerations

1. **No network calls** — Pure local computation, no telemetry
2. **Memory safety** — The C++ core uses raw pointers, so bounds are not checked at runtime; the roundtrip tests validate correctness
3. **Build isolation** — CMake builds a static library; no dynamic linking
## Sovereignty Assessment

- **Fully local** — No cloud dependencies, no API calls
- **Open source** — All code on Gitea, upstream repos public
- **No telemetry** — Pure computation
- **Hardware-specific** — Metal shaders target Apple Silicon; CUDA upstream for other GPUs

**Verdict: Fully sovereign. No corporate lock-in. Pure local inference enhancement.**
---

*"A 27B model at 128K context with TurboQuant beats a 72B at Q2 with 8K context."*