# GENOME.md — TurboQuant (Timmy_Foundation/turboquant)

> Codebase Genome v1.0 | Generated 2026-04-15 | Repo 12/16
## Project Overview

**TurboQuant** is a KV cache compression system for local inference on Apple Silicon. It implements Google's ICLR 2026 paper to unlock 64K–128K context on 27B models within 32 GB of unified memory.

**Three-stage compression:**

1. **PolarQuant** — WHT rotation + polar coordinates + Lloyd-Max codebook (~4.2x compression)
2. **QJL** — 1-bit quantized Johnson-Lindenstrauss residual correction
3. **TurboQuant** — PolarQuant + QJL = ~3.5 bits/channel, zero accuracy loss

**Key result:** 73% KV memory savings, at a cost of 1% prompt-processing overhead and 11% generation overhead.
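The headline 73% figure can be sanity-checked with back-of-envelope arithmetic. The sketch below assumes a per-head dimension of d = 128, 4-bit packed indices, and one fp32 norm per vector (illustrative assumptions; the exact layout lives in `llama-turbo.cpp`):

```python
# Back-of-envelope check of the headline memory saving.
# Assumptions (illustrative): head dim d = 128, 4-bit packed
# indices (two per byte), one fp32 L2 norm per vector.
d = 128
fp16_bytes = d * 2            # raw fp16 KV vector: 256 bytes
packed_bytes = d // 2 + 4     # 64 bytes of indices + 4-byte norm = 68 bytes
savings = 1 - packed_bytes / fp16_bytes
print(f"{savings:.1%}")       # prints "73.4%"
```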
## Architecture

```mermaid
graph TD
    subgraph "Compression Pipeline"
        KV[Raw KV Cache fp16] --> WHT[WHT Rotation]
        WHT --> POLAR[PolarQuant 4-bit]
        POLAR --> QJL[QJL Residual]
        QJL --> PACKED[Packed KV ~3.5bit]
    end

    subgraph "Metal Shaders"
        PACKED --> DECODE[Polar Decode Kernel]
        DECODE --> ATTEN[Flash Attention]
        ATTEN --> OUTPUT[Model Output]
    end

    subgraph "Build System"
        CMAKE[CMakeLists.txt] --> LIB[turboquant.a]
        LIB --> TEST[turboquant_roundtrip_test]
        LIB --> LLAMA[llama.cpp fork integration]
    end
```
## Entry Points

| Entry Point | File | Purpose |
|-------------|------|---------|
| `polar_quant_encode_turbo4()` | llama-turbo.cpp | Encode float KV → 4-bit packed |
| `polar_quant_decode_turbo4()` | llama-turbo.cpp | Decode 4-bit packed → float KV |
| `cmake build` | CMakeLists.txt | Build static library + tests |
| `run_benchmarks.py` | benchmarks/ | Run the full benchmark suite (speed + quality) |
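The `*_turbo4()` entry points exchange 4-bit indices packed two per byte. A minimal sketch of that packing (the nibble order here is an assumption for illustration, not taken from the code):

```python
def pack_nibbles(indices):
    """Pack 4-bit indices two per byte (low nibble first; order assumed)."""
    assert len(indices) % 2 == 0 and all(0 <= i < 16 for i in indices)
    return bytes(indices[i] | (indices[i + 1] << 4)
                 for i in range(0, len(indices), 2))

def unpack_nibbles(buf):
    """Inverse of pack_nibbles."""
    out = []
    for b in buf:
        out.extend((b & 0xF, b >> 4))
    return out

packed = pack_nibbles([1, 15, 0, 7])
print(list(packed))              # [241, 112]
print(unpack_nibbles(packed))    # [1, 15, 0, 7]
```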
## Key Abstractions

| Symbol | File | Purpose |
|--------|------|---------|
| `polar_quant_encode_turbo4()` | llama-turbo.h/.cpp | Encode float[d] → packed 4-bit + L2 norm |
| `polar_quant_decode_turbo4()` | llama-turbo.h/.cpp | Decode packed 4-bit + norm → float[d] |
| `turbo_dequantize_k()` | ggml-metal-turbo.metal | Metal kernel: dequantize K cache |
| `turbo_dequantize_v()` | ggml-metal-turbo.metal | Metal kernel: dequantize V cache |
| `turbo_fwht_128()` | ggml-metal-turbo.metal | Fast Walsh-Hadamard Transform |
| `run_perplexity.py` | benchmarks/ | Measure perplexity impact |
| `run_benchmarks.py` | benchmarks/ | Full benchmark suite (speed + quality) |
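`turbo_fwht_128()` implements the fast Walsh-Hadamard transform used for rotation. A pure-Python reference (not the Metal kernel) showing the O(d log d) butterfly and the self-inverse property H·H = d·I:

```python
def fwht(x):
    """Fast Walsh-Hadamard transform; len(x) must be a power of two."""
    x = list(x)
    h = 1
    while h < len(x):
        for i in range(0, len(x), h * 2):
            for j in range(i, i + h):
                a, b = x[j], x[j + h]
                x[j], x[j + h] = a + b, a - b  # butterfly step
        h *= 2
    return x

v = [1.0, -2.0, 3.0, 0.5]
round_trip = [t / len(v) for t in fwht(fwht(v))]  # H(Hv)/d == v
print(round_trip)   # [1.0, -2.0, 3.0, 0.5]
```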
## Data Flow

```
Input: float KV vectors [d=128 per head]
        ↓
1. WHT rotation (in-place, O(d log d))
        ↓
2. Convert to polar coords (radius, angles)
        ↓
3. Lloyd-Max quantize angles → 4-bit indices
        ↓
4. Store: packed indices [d/2 bytes] + float norm [4 bytes]
        ↓
Decode: indices → codebook lookup → polar → cartesian → inverse WHT
        ↓
Output: reconstructed float KV [d=128]
```
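The polar quantize/dequantize steps of this flow can be sketched with a simplified pairwise-polar variant. Note the real PolarQuant uses a hyperspherical parameterization and a trained Lloyd-Max codebook; uniform angle quantization stands in for the codebook here, and the WHT stage is omitted:

```python
import math

BITS = 4
LEVELS = (1 << BITS) - 1  # 15 quantization steps across [-pi, pi]

def encode(v):
    """Pairwise polar: keep each pair's radius, quantize its angle to 4 bits."""
    out = []
    for i in range(0, len(v), 2):
        r = math.hypot(v[i], v[i + 1])
        theta = math.atan2(v[i + 1], v[i])              # angle in [-pi, pi]
        q = round((theta + math.pi) / (2 * math.pi) * LEVELS)
        out.append((r, q))
    return out

def decode(pairs):
    """Map each (radius, angle index) back to a Cartesian pair."""
    v = []
    for r, q in pairs:
        theta = q / LEVELS * 2 * math.pi - math.pi
        v.extend((r * math.cos(theta), r * math.sin(theta)))
    return v

v = [0.5, -1.25, 2.0, 0.75, -0.3, 1.1, 0.0, -2.2]
rec = decode(encode(v))
# Reconstruction error is bounded by half an angle-quantization step.
rel_err = math.dist(v, rec) / math.hypot(*v)
assert rel_err < 0.21
```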
## File Index

| File | LOC | Purpose |
|------|-----|---------|
| `llama-turbo.h` | 24 | C API: encode/decode function declarations |
| `llama-turbo.cpp` | 78 | Implementation: PolarQuant encode/decode |
| `ggml-metal-turbo.metal` | 76 | Metal shaders: dequantize + flash attention |
| `CMakeLists.txt` | 44 | Build system: static lib + tests |
| `tests/roundtrip_test.cpp` | 104 | Roundtrip encode→decode validation |
| `benchmarks/run_benchmarks.py` | 227 | Benchmark suite |
| `benchmarks/run_perplexity.py` | ~100 | Perplexity measurement |
| `evolution/hardware_optimizer.py` | 5 | Hardware detection stub |

**Total: ~660 LOC | C++ core: 206 LOC | Python benchmarks: 232 LOC**
## Dependencies

| Dependency | Purpose |
|------------|---------|
| CMake 3.16+ | Build system |
| C++17 compiler | Core implementation |
| Metal (macOS) | GPU shader execution |
| Python 3.11+ | Benchmarks |
| llama.cpp fork | Integration target |
## Source Repos (Upstream)

| Repo | Role |
|------|------|
| TheTom/llama-cpp-turboquant | llama.cpp fork with Metal shaders |
| TheTom/turboquant_plus | Reference implementation, 511+ tests |
| amirzandieh/QJL | QJL authors' reference code (CUDA) |
| rachittshah/mlx-turboquant | MLX fallback |
## Test Coverage

| Test | File | Validates |
|------|------|-----------|
| `turboquant_roundtrip` | tests/roundtrip_test.cpp | Encode→decode roundtrip fidelity |
| Perplexity benchmarks | benchmarks/run_perplexity.py | Quality preservation across prompts |
| Speed benchmarks | benchmarks/run_benchmarks.py | Compression overhead measurement |
## Security Considerations

1. **No network calls** — Pure local computation, no telemetry
2. **Memory safety** — The C++ core uses raw pointers, so bounds are not checked at runtime; the roundtrip tests validate correctness
3. **Build isolation** — CMake builds a static library; no dynamic linking
## Sovereignty Assessment

- **Fully local** — No cloud dependencies, no API calls
- **Open source** — All code on Gitea, upstream repos public
- **No telemetry** — Pure computation
- **Hardware-specific** — Metal shaders target Apple Silicon; CUDA upstream for other GPUs

**Verdict: Fully sovereign. No corporate lock-in. Pure local inference enhancement.**
---

*"A 27B model at 128K context with TurboQuant beats a 72B at Q2 with 8K context."*