timmy-home/genomes/turboquant/GENOME.md

# GENOME.md — TurboQuant (Timmy_Foundation/turboquant)

> Codebase Genome v1.1 | Refreshed 2026-04-18 | Repo 12/16 | Ref: #679

## Project Overview

**TurboQuant** is a KV cache compression system for local inference on Apple Silicon. It implements Google's ICLR 2026 paper to enable 64K-128K context on 27B models within 32GB of unified memory.

**Three-stage compression:**
1. **PolarQuant** — WHT rotation + polar coordinates + Lloyd-Max codebook (~4.2x compression)
2. **QJL** — 1-bit quantized Johnson-Lindenstrauss residual correction
3. **TurboQuant** — PolarQuant + QJL = ~3.5 bits/channel, zero accuracy loss

**Key result:** 73% KV memory savings, with 1% prompt-processing overhead and 11% generation overhead.

## Architecture

```mermaid
graph TD
    subgraph "Compression Pipeline"
        KV[Raw KV Cache fp16] --> WHT[WHT Rotation]
        WHT --> POLAR[PolarQuant 4-bit]
        POLAR --> QJL[QJL Residual]
        QJL --> PACKED[Packed KV ~3.5bit]
    end

    subgraph "Metal Shaders"
        PACKED --> DECODE[Polar Decode Kernel]
        DECODE --> ATTEN[Flash Attention]
        ATTEN --> OUTPUT[Model Output]
    end

    subgraph "Build System"
        CMAKE[CMakeLists.txt] --> LIB[turboquant.a]
        LIB --> TEST[turboquant_roundtrip_test]
        LIB --> LLAMA[llama.cpp fork integration]
    end

    subgraph "Python Layer"
        SELECTOR[quant_selector.py] --> MODELS[model_registry/]
        MODELS --> PROFILE[hardware_profiles.py]
        PROFILE --> DECISION[quantization decision]
    end
```

## Entry Points

| Entry Point | File | Purpose |
|-------------|------|---------|
| `polar_quant_encode_turbo4()` | llama-turbo.cpp | Encode float KV → 4-bit packed |
| `polar_quant_decode_turbo4()` | llama-turbo.cpp | Decode 4-bit packed → float KV |
| `cmake -S . -B build -DTURBOQUANT_BUILD_TESTS=ON` | CMakeLists.txt | Build static library + CTest suite |
| `ctest --test-dir build --output-on-failure` | build/ | Run C++ roundtrip tests |
| `run_benchmarks.py` | benchmarks/ | Run perplexity benchmarks |
| `quant_selector.py` | quant_selector/ | Hardware-aware quantization selection |

## Key Abstractions

| Symbol | File | Purpose |
|--------|------|---------|
| `polar_quant_encode_turbo4()` | llama-turbo.h/.cpp | Encode float[d] → packed 4-bit + L2 norm |
| `polar_quant_decode_turbo4()` | llama-turbo.h/.cpp | Decode packed 4-bit + norm → float[d] |
| `turbo_dequantize_k()` | ggml-metal-turbo.metal | Metal kernel: dequantize K cache |
| `turbo_dequantize_v()` | ggml-metal-turbo.metal | Metal kernel: dequantize V cache |
| `turbo_fwht_128()` | ggml-metal-turbo.metal | Fast Walsh-Hadamard Transform |
| `run_perplexity.py` | benchmarks/ | Measure perplexity impact |
| `run_benchmarks.py` | benchmarks/ | Full benchmark suite (speed + quality) |
| `select_quantization()` | quant_selector.py | Pick quant scheme from hardware profile |

## Data Flow

```
Input: float KV vectors [d=128 per head]
  ↓
1. WHT rotation (in-place, O(d log d))
  ↓
2. Convert to polar coords (radius, angles)
  ↓
3. Lloyd-Max quantize angles → 4-bit indices
  ↓
4. Store: packed indices [d/2 bytes] + float norm [4 bytes]
  ↓
Decode: indices → codebook lookup → polar → cartesian → inverse WHT
  ↓
Output: reconstructed float KV [d=128]
```
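
Steps 1 and the final inverse step of the flow above rely on the Walsh-Hadamard transform. The following is a minimal pure-Python sketch of the butterfly structure, not the project's C++ or Metal code. The `1/sqrt(d)` normalization (which makes the transform its own inverse) and the function names are assumptions made for illustration.

```python
import math


def fwht_inplace(x):
    """Unnormalized fast Walsh-Hadamard transform, computed in place.

    len(x) must be a power of two; the loop does O(d log d) additions.
    """
    d = len(x)
    if d == 0 or d & (d - 1):
        raise ValueError("length must be a power of two")
    h = 1
    while h < d:
        # Butterfly over blocks of size 2*h: (a, b) -> (a + b, a - b).
        for start in range(0, d, 2 * h):
            for i in range(start, start + h):
                a, b = x[i], x[i + h]
                x[i], x[i + h] = a + b, a - b
        h *= 2
    return x


def rotate(x):
    """Orthonormal WHT on a copy of x (assumed 1/sqrt(d) scaling)."""
    scale = 1.0 / math.sqrt(len(x))
    return [v * scale for v in fwht_inplace(list(x))]


# Illustrative input with d=128, matching the head dimension in the flow.
vec = [float(i % 7) - 3.0 for i in range(128)]
rotated = rotate(vec)
restored = rotate(rotated)  # applying the orthonormal WHT twice undoes it
```

With this scaling the rotation preserves the L2 norm, which is why a single stored norm per vector can survive the round trip.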

## API Surface

| Function | Signature | Notes |
|----------|-----------|-------|
| `polar_quant_encode_turbo4` | `(const float*, uint8_t*, float*, int)` | Core encode path |
| `polar_quant_decode_turbo4` | `(const uint8_t*, float, float*, int)` | Core decode path |
| `select_quantization` | `(HardwareProfile) -> QuantConfig` | Python quant selector |
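
Because the encode and decode paths work on caller-allocated buffers, it helps to see how large the packed buffer is. The sketch below models the layout given in the Data Flow section (`d/2` bytes of packed indices plus a 4-byte float norm) in Python. The nibble order (low nibble first) and all function names are assumptions, not taken from `llama-turbo.cpp`.

```python
def pack_nibbles(indices):
    """Pack 4-bit indices two per byte (low nibble first: an assumption)."""
    if len(indices) % 2:
        raise ValueError("need an even number of indices")
    if any(not 0 <= i <= 15 for i in indices):
        raise ValueError("indices must fit in 4 bits")
    return bytes(lo | (hi << 4) for lo, hi in zip(indices[0::2], indices[1::2]))


def unpack_nibbles(packed):
    """Inverse of pack_nibbles: two 4-bit indices per byte."""
    out = []
    for byte in packed:
        out.append(byte & 0x0F)
        out.append(byte >> 4)
    return out


d = 128
indices = [i % 16 for i in range(d)]      # stand-in for codebook indices
packed = pack_nibbles(indices)

stored_bytes = len(packed) + 4            # packed indices + one float norm
fp16_bytes = 2 * d                        # same vector stored as fp16
bits_per_channel = 8 * stored_bytes / d
savings = 1 - stored_bytes / fp16_bytes
```

For `d=128` this gives 68 bytes per vector against 256 bytes in fp16, or 4.25 bits per channel and about 73% savings, which appears consistent with the key result quoted in the overview.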

## File Index

| File | LOC | Purpose |
|------|-----|---------|
| `llama-turbo.h` | 24 | C API: encode/decode function declarations |
| `llama-turbo.cpp` | 78 | Implementation: PolarQuant encode/decode |
| `ggml-metal-turbo.metal` | 76 | Metal shader: dequantize + FWHT kernels |
| `CMakeLists.txt` | 42 | Standalone build: lib + test targets |
| `quant_selector.py` | ~120 | Python: hardware profile → quant decision |
| `tests/test_quant_selector.py` | ~90 | Pytest: quant selector (currently failing) |
| `benchmarks/run_benchmarks.py` | ~85 | Perplexity + speed benchmarking |

## CI / Runtime Drift

| Dimension | Status | Notes |
|-----------|--------|-------|
| **CMake/CTest standalone build** | ✅ Passing | `cmake -S . -B build -DTURBOQUANT_BUILD_TESTS=ON && ctest --test-dir build` works on current main |
| **Python quant selector tests** | ❌ Failing | `tests/test_quant_selector.py` fails on current main — tracked in `turboquant #139` |
| **CI lane: quant_selector** | ❌ Broken | Lane is marked non-blocking because it fails persistently |
| **CI lane: cmake roundtrip** | ✅ Green | C++ roundtrip test passes in CI |
| **Metal shader compilation** | ⚠️ Apple Silicon only | Cannot be tested in CI runners; validated manually on M-series hardware |

## Test Coverage Gaps

- `tests/test_quant_selector.py` is currently broken — selector returns wrong quantization for edge-case hardware profiles (see `turboquant #139`)
- No CI coverage for Metal shader correctness (Apple Silicon only)
- Benchmark regression detection is manual; no automated threshold enforcement
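
One way the missing threshold enforcement could look is sketched below. It is a hypothetical helper, not part of `run_benchmarks.py`: the result keys (`perplexity`, `tokens_per_sec`), the `Limits` fields, and the numbers are all placeholders chosen for illustration, not measured results.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Limits:
    """Hypothetical regression budget, in percent relative to baseline."""
    max_perplexity_increase_pct: float
    max_generation_slowdown_pct: float


def pct_change(old, new):
    """Signed percentage change from old to new."""
    return 100.0 * (new - old) / old


def check_regression(baseline, candidate, limits):
    """Return a list of violations; an empty list means the run passes."""
    failures = []
    ppl = pct_change(baseline["perplexity"], candidate["perplexity"])
    if ppl > limits.max_perplexity_increase_pct:
        failures.append(f"perplexity up {ppl:.2f}%")
    slowdown = -pct_change(baseline["tokens_per_sec"], candidate["tokens_per_sec"])
    if slowdown > limits.max_generation_slowdown_pct:
        failures.append(f"generation {slowdown:.2f}% slower")
    return failures


# Placeholder numbers for illustration only, not benchmark output.
limits = Limits(max_perplexity_increase_pct=1.0, max_generation_slowdown_pct=15.0)
baseline = {"perplexity": 10.0, "tokens_per_sec": 100.0}
ok_run = {"perplexity": 10.05, "tokens_per_sec": 90.0}
bad_run = {"perplexity": 10.5, "tokens_per_sec": 80.0}
```

A CI job could exit non-zero whenever `check_regression` returns a non-empty list, turning the manual comparison into an enforced gate.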

## Security Considerations

- The C API operates on caller-allocated buffers and does no internal bounds checking on the `d` parameter, so callers must size buffers correctly
- The Python quant selector reads the hardware profile from the filesystem, which creates a path traversal risk if the profile directory is user-controllable
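
The path traversal concern can be mitigated by resolving the requested profile path and rejecting anything that lands outside the profile directory. The helper below is a sketch under that assumption. `resolve_profile_path` and the file names are hypothetical and do not exist in `quant_selector.py`.

```python
import tempfile
from pathlib import Path


def resolve_profile_path(profile_dir, name):
    """Return the resolved path for `name`, refusing to leave profile_dir."""
    base = Path(profile_dir).resolve()
    candidate = (base / name).resolve()
    # resolve() collapses ".." and symlinks, so this check sees the real target.
    if not candidate.is_relative_to(base):
        raise ValueError(f"profile path escapes {base}: {name!r}")
    return candidate


with tempfile.TemporaryDirectory() as tmp:
    inside = resolve_profile_path(tmp, "m2_max.json")  # hypothetical file name
    try:
        resolve_profile_path(tmp, "../secrets.json")
        rejected = False
    except ValueError:
        rejected = True
```

`Path.is_relative_to` requires Python 3.9 or newer, which fits the Python ≥3.10 requirement listed under Dependencies.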

## Dependencies

| Dependency | Version | Purpose |
|------------|---------|---------|
| CMake | ≥3.20 | Build system |
| Python | ≥3.10 | Benchmarks + quant selector |
| pytest | any | Test runner for Python tests |
| Metal (macOS) | 14+ | GPU shader compilation |
| llama.cpp | fork | Integration layer |

## Deployment

- Static library `turboquant.a` linked into llama.cpp fork
- Python quant selector invoked at model-load time to pick compression scheme
- No standalone server component; embedded in inference runtime
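
To make the load-time decision concrete, here is a deliberately simplified sketch of a memory-budget rule. It is not the logic in `quant_selector.py`: the dataclass fields, the scheme names `"fp16"` and `"turbo4"`, and the extra `kv_fp16_gb` argument (which the documented `select_quantization(HardwareProfile)` signature does not take) are all assumptions. The compression ratio reuses the 68-byte versus 256-byte layout from the Data Flow section.

```python
from dataclasses import dataclass

FP16_BYTES_PER_VECTOR = 2 * 128           # d=128 values at 2 bytes each
TURBO4_BYTES_PER_VECTOR = 128 // 2 + 4    # packed indices + float norm


@dataclass(frozen=True)
class HardwareProfile:                    # hypothetical fields
    unified_memory_gb: float
    reserved_gb: float                    # weights, activations, OS headroom


@dataclass(frozen=True)
class QuantConfig:                        # hypothetical fields
    kv_scheme: str                        # "fp16" or "turbo4"


def select_quantization_sketch(profile, kv_fp16_gb):
    """Pick the least lossy scheme whose KV cache fits the memory budget."""
    budget = profile.unified_memory_gb - profile.reserved_gb
    if budget <= 0:
        raise ValueError("no memory left for the KV cache")
    if kv_fp16_gb <= budget:
        return QuantConfig("fp16")
    turbo4_gb = kv_fp16_gb * TURBO4_BYTES_PER_VECTOR / FP16_BYTES_PER_VECTOR
    if turbo4_gb <= budget:
        return QuantConfig("turbo4")
    raise ValueError("KV cache does not fit even when compressed")


profile = HardwareProfile(unified_memory_gb=32.0, reserved_gb=20.0)
small = select_quantization_sketch(profile, kv_fp16_gb=8.0)
large = select_quantization_sketch(profile, kv_fp16_gb=30.0)
```

Raising on the no-fit case, rather than silently returning a default, makes edge-case hardware profiles fail loudly, which is the kind of behavior a selector test suite can pin down.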

## Technical Debt

- `turboquant #139` — quant selector test failures not yet resolved; CI lane is non-blocking
- No automated benchmark regression detection
- Metal shaders untestable in CI — manual validation on Apple Silicon required
- The previous genome (v1.0, 2026-04-15) was stale: it did not reflect the quant selector addition or the CI drift