# GENOME.md: turboquant

**Generated:** 2026-04-14
**Repo:** Timmy_Foundation/turboquant
**Phase:** 1 Complete (PolarQuant MVP)
**Status:** Production-ready Metal shaders, benchmarks passing

---

## Project Overview

TurboQuant is a KV cache compression system for local LLM inference on Apple Silicon. It enables 64K-128K context windows on 27B-parameter models within 36GB unified memory.

Three-stage compression pipeline:

1. **PolarQuant** -- WHT rotation + polar coordinates + Lloyd-Max codebook (~4.2x compression)
2. **QJL** -- 1-bit quantized Johnson-Lindenstrauss residual correction
3. **TurboQuant** -- PolarQuant + QJL combined = ~3.5 bits/channel, zero accuracy loss

**Key result:** turbo4 delivers 73% KV memory savings with 1% prompt processing overhead and 11% generation overhead.

## Architecture

```mermaid
graph TD
    A[LLM Inference] --> B[KV Cache Layer]
    B --> C{TurboQuant Mode}
    C -->|turbo2| D[2-bit PolarQuant]
    C -->|turbo3| E[3-bit PolarQuant + QJL]
    C -->|turbo4| F[4-bit PolarQuant]
    D --> G[Metal Shader: encode/decode]
    E --> G
    F --> G
    G --> H[Compressed KV Storage]
    H --> I[Decompress on Attention]
    I --> A
    J[llama-turbo.h] --> K[polar_quant_encode_turbo4]
    K --> L[WHT Rotation]
    L --> M[Polar Transform]
    M --> N[Lloyd-Max Quantize]
    N --> O[Packed 4-bit Output]
```

## Entry Points

| Entry Point | Type | Purpose |
|-------------|------|---------|
| `llama-turbo.cpp` | C++ library | Core encode/decode functions |
| `llama-turbo.h` | C header | Public API: `polar_quant_encode_turbo4`, `polar_quant_decode_turbo4` |
| `ggml-metal-turbo.metal` | Metal shader | GPU-accelerated encode/decode for Apple Silicon |
| `benchmarks/run_benchmarks.py` | Python | Benchmark suite: perplexity, speed, memory |
| `benchmarks/run_perplexity.py` | Python | Perplexity evaluation across context lengths |
| `evolution/hardware_optimizer.py` | Python | Hardware-aware parameter tuning |
| `.gitea/workflows/smoke.yml` | CI | Smoke test on push |

## Data Flow

```
Input: float array [d] (d=128, one KV head)
        |
        v
WHT Rotation (structured orthogonal transform)
        |
        v
Polar Transform (cartesian -> polar coordinates)
        |
        v
Lloyd-Max Quantization (non-uniform codebook, 4-bit)
        |
        v
Output: packed uint8_t [d/2] + float norm (radius)
```

Decode is the inverse: unpack -> dequantize -> inverse polar -> inverse WHT. A hedged CPU sketch of the encode path appears after the Key Abstractions table below.

## Key Abstractions

| Abstraction | Description |
|-------------|-------------|
| **Turbo4** | 4-bit PolarQuant mode. Best quality. 4.2x compression. |
| **Turbo3** | 3-bit mode with QJL residual. ~3.5 bits/channel. |
| **Turbo2** | 2-bit mode. Maximum compression. Quality tradeoff. |
| **WHT** | Walsh-Hadamard Transform. Structured orthogonal rotation. |
| **Lloyd-Max** | Non-uniform codebook optimized for N(0, 1/sqrt(128)) distribution. |
| **QJL** | Quantized Johnson-Lindenstrauss. 1-bit residual correction. |
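To make the Data Flow and Key Abstractions sections concrete, here is a minimal CPU sketch of a turbo4-style encode. It is illustrative only: `tq_fwht`, `tq_nearest`, `tq_sketch_encode`, and the `mag_codebook` argument are hypothetical names, the angle is coded uniformly rather than with the repo's Lloyd-Max table, and the real bit layout, codebooks, and per-mode details live in `llama-turbo.cpp` and `ggml-metal-turbo.metal` and may differ.

```c
#include <math.h>
#include <stdint.h>

#define TQ_D       128   /* channels per KV head (see Data Flow)            */
#define TQ_NLEVELS 16    /* 2^4 codebook entries for turbo4                 */
#define TQ_PI      3.14159265358979f

/* In-place fast Walsh-Hadamard transform, scaled by 1/sqrt(n) so the
 * transform is orthonormal (and therefore its own inverse).
 * n must be a power of two. */
static void tq_fwht(float *x, int n) {
    for (int h = 1; h < n; h <<= 1) {
        for (int i = 0; i < n; i += h << 1) {
            for (int j = i; j < i + h; j++) {
                float a = x[j], b = x[j + h];
                x[j]     = a + b;
                x[j + h] = a - b;
            }
        }
    }
    float s = 1.0f / sqrtf((float)n);
    for (int i = 0; i < n; i++) x[i] *= s;
}

/* Nearest-centroid lookup: stand-in for the repo's Lloyd-Max table, whose
 * actual centroids are defined in llama-turbo.cpp / the Metal shader. */
static uint8_t tq_nearest(float v, const float *centroids, int k) {
    int best = 0;
    float best_d = fabsf(v - centroids[0]);
    for (int i = 1; i < k; i++) {
        float d = fabsf(v - centroids[i]);
        if (d < best_d) { best_d = d; best = i; }
    }
    return (uint8_t)best;
}

/* Sketch of a turbo4-style encode: WHT rotation -> per-pair polar transform
 * -> 4-bit magnitude code (codebook) + 4-bit angle code (uniform, for
 * simplicity) -> one packed byte per channel pair, plus a float block norm. */
static void tq_sketch_encode(const float *src, uint8_t *dst, float *norm,
                             const float *mag_codebook) {
    float x[TQ_D];
    for (int i = 0; i < TQ_D; i++) x[i] = src[i];

    tq_fwht(x, TQ_D);                              /* structured rotation    */

    float r2 = 0.0f;                               /* block radius           */
    for (int i = 0; i < TQ_D; i++) r2 += x[i] * x[i];
    *norm = sqrtf(r2);
    float inv = (*norm > 0.0f) ? 1.0f / *norm : 0.0f;

    for (int p = 0; p < TQ_D / 2; p++) {           /* one byte per pair      */
        float a = x[2 * p]     * inv;
        float b = x[2 * p + 1] * inv;
        float mag   = sqrtf(a * a + b * b);        /* polar radius           */
        float theta = atan2f(b, a);                /* polar angle            */
        uint8_t qm = tq_nearest(mag, mag_codebook, TQ_NLEVELS);
        int qt = (int)((theta + TQ_PI) * (TQ_NLEVELS / (2.0f * TQ_PI)));
        if (qt > TQ_NLEVELS - 1) qt = TQ_NLEVELS - 1;
        if (qt < 0) qt = 0;
        dst[p] = (uint8_t)((qm << 4) | (uint8_t)qt);
    }
}
```

Decode would invert each step: unpack the byte, look up magnitude and angle, rebuild each (x, y) pair, rescale by the stored norm, and apply `tq_fwht` again, since with the 1/sqrt(n) scaling the transform is its own inverse.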
## API Surface

### C API (llama-turbo.h)

```c
// Encode: float [d] -> packed 4-bit [d/2] + norm
void polar_quant_encode_turbo4(const float* src, uint8_t* dst, float* norm, int d);

// Decode: packed 4-bit [d/2] + norm -> float [d]
void polar_quant_decode_turbo4(const uint8_t* src, float* dst, float norm, int d);
```

### Integration

TurboQuant integrates with llama.cpp via:

- Mixed quantization pairs: `q8_0 x turbo` for K/V asymmetric compression
- Metal shader dispatch in `ggml-metal.metal` (turbo kernels)
- Build flag: `-DGGML_TURBOQUANT=ON`

### Hermes Profile

`profiles/hermes-profile-gemma4-turboquant.yaml` defines the deployment config:

- Model: gemma4 with turbo4 KV compression
- Target hardware: M3/M4 Max, 36GB+
- Context window: up to 128K with compression

## Test Coverage

| Area | Coverage | Notes |
|------|----------|-------|
| WHT rotation | Partial | Metal GPU uses WHT. CPU ref uses dense random rotation (legacy). |
| Encode/decode symmetry | Full | `turbo_rotate_inverse()` verified as the exact inverse of `turbo_rotate_forward()` |
| Lloyd-Max codebook | Full | Non-uniform centroids verified |
| Radius precision | Full | FP16+ norm per 128-element block |
| Metal shader correctness | Full | All dk32-dk576 variants tested |
| Perplexity benchmarks | Full | WikiText results in `benchmarks/perplexity_results.json` |

### Gaps

- No CI integration for Metal shader tests (the smoke test only covers the build)
- CPU reference implementation uses dense random rotation, not WHT (legacy)
- No long-session stress tests beyond 128K
- QJL implementation not yet verified against a CUDA reference

## Security Considerations

- **No network access.** All inference is local.
- **No user data in repo.** Benchmarks use the public WikiText corpus.
- **Binary blobs.** `llama-turbo.cpp` compiles to native code. No sandboxing.
- **Upstream dependency.** Fork of TheTom/llama-cpp-turboquant. Trust boundary at upstream.

## Dependencies

| Dependency | Type | Source |
|------------|------|--------|
| llama.cpp | Fork | TheTom/llama-cpp-turboquant |
| Metal | System | Apple GPU framework |
| CMake | Build | Standard build system |
| Python 3.10+ | Scripts | Benchmarks and optimizer |

## Key Files

```
turboquant/
  llama-turbo.h            # C API header
  llama-turbo.cpp          # Core encode/decode implementation
  ggml-metal-turbo.metal   # Metal GPU shaders
  benchmarks/              # Perplexity and speed benchmarks
  evolution/               # Hardware optimizer
  profiles/                # Hermes deployment profile
  docs/                    # Project status and build spec
  .gitea/workflows/        # CI smoke test
```
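## Appendix: C API Usage Sketch

A minimal round-trip sketch of the public C API declared in `llama-turbo.h` (see API Surface above). The function signatures are copied from this document; the buffer sizes (128 floats in, d/2 packed bytes plus one float norm out) follow the Data Flow section, the input vector is arbitrary test data, and error handling is omitted.

```c
#include <math.h>
#include <stdint.h>
#include <stdio.h>

#include "llama-turbo.h"  /* polar_quant_encode_turbo4 / polar_quant_decode_turbo4 */

int main(void) {
    enum { D = 128 };            /* one KV head, per the Data Flow section   */
    float   src[D], dst[D];
    uint8_t packed[D / 2];       /* packed 4-bit codes                       */
    float   norm = 0.0f;

    /* Arbitrary deterministic test vector standing in for a KV head. */
    for (int i = 0; i < D; i++) {
        src[i] = sinf(0.1f * (float)i) / sqrtf((float)D);
    }

    /* Compress with 4-bit PolarQuant (turbo4), then reconstruct. */
    polar_quant_encode_turbo4(src, packed, &norm, D);
    polar_quant_decode_turbo4(packed, dst, norm, D);

    /* Report relative reconstruction error; turbo4 is lossy, so expect a
     * small but non-zero value rather than an exact round trip. */
    float err = 0.0f, ref = 0.0f;
    for (int i = 0; i < D; i++) {
        err += (dst[i] - src[i]) * (dst[i] - src[i]);
        ref += src[i] * src[i];
    }
    printf("relative L2 error: %.4f\n", sqrtf(err / ref));
    return 0;
}
```

The perplexity and speed benchmarks under `benchmarks/` remain the authoritative quality check; this sketch only shows how the two exported functions fit together.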