
# GENOME.md — TurboQuant (Timmy_Foundation/turboquant)

Codebase Genome v1.0 | Generated 2026-04-15 | Repo 12/16

## Project Overview

TurboQuant is a KV cache compression system for local inference on Apple Silicon. It implements Google's ICLR 2026 paper to unlock 64K-128K context on 27B models within 32 GB of unified memory.

Three-stage compression:

  1. PolarQuant — WHT rotation + polar coordinates + Lloyd-Max codebook (~4.2x compression)
  2. QJL — 1-bit quantized Johnson-Lindenstrauss residual correction
  3. TurboQuant — PolarQuant + QJL = ~3.5 bits/channel, zero accuracy loss

Key result: 73% KV memory savings, at the cost of 1% prompt-processing overhead and 11% generation overhead.
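Stage 1's Lloyd-Max codebook is, in essence, a 1-D k-means over scalar samples (here, the polar angles). The sketch below is illustrative only: the function name `lloyd_max`, the uniform initialization, and the fixed iteration count are assumptions, not the project's implementation.

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

// Illustrative 1-D Lloyd-Max codebook trainer (hypothetical helper).
// Trains 2^bits reconstruction levels for scalar samples by alternating:
//   1. assign each sample to its nearest level,
//   2. move each level to the mean of its assigned samples.
std::vector<double> lloyd_max(const std::vector<double>& samples,
                              int bits, int iters = 50) {
    const size_t k = 1u << bits;
    auto [lo, hi] = std::minmax_element(samples.begin(), samples.end());
    // Initialize levels uniformly over the sample range.
    std::vector<double> levels(k);
    for (size_t i = 0; i < k; ++i)
        levels[i] = *lo + (*hi - *lo) * (i + 0.5) / k;

    for (int it = 0; it < iters; ++it) {
        std::vector<double> sum(k, 0.0);
        std::vector<size_t> cnt(k, 0);
        for (double x : samples) {
            size_t best = 0;
            for (size_t i = 1; i < k; ++i)
                if (std::abs(x - levels[i]) < std::abs(x - levels[best]))
                    best = i;
            sum[best] += x;
            cnt[best] += 1;
        }
        for (size_t i = 0; i < k; ++i)
            if (cnt[i] > 0) levels[i] = sum[i] / cnt[i];
    }
    return levels;
}
```

Unlike a uniform grid, the trained levels concentrate where the angle distribution has mass, which is what lets 4-bit indices preserve accuracy.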

## Architecture

```mermaid
graph TD
    subgraph "Compression Pipeline"
        KV[Raw KV Cache fp16] --> WHT[WHT Rotation]
        WHT --> POLAR[PolarQuant 4-bit]
        POLAR --> QJL[QJL Residual]
        QJL --> PACKED[Packed KV ~3.5bit]
    end

    subgraph "Metal Shaders"
        PACKED --> DECODE[Polar Decode Kernel]
        DECODE --> ATTEN[Flash Attention]
        ATTEN --> OUTPUT[Model Output]
    end

    subgraph "Build System"
        CMAKE[CMakeLists.txt] --> LIB[turboquant.a]
        LIB --> TEST[turboquant_roundtrip_test]
        LIB --> LLAMA[llama.cpp fork integration]
    end
```

## Entry Points

| Entry Point | File | Purpose |
|---|---|---|
| `polar_quant_encode_turbo4()` | `llama-turbo.cpp` | Encode float KV → 4-bit packed |
| `polar_quant_decode_turbo4()` | `llama-turbo.cpp` | Decode 4-bit packed → float KV |
| `cmake` build | `CMakeLists.txt` | Build static library + tests |
| `run_benchmarks.py` | `benchmarks/` | Run full benchmark suite |

## Key Abstractions

| Symbol | File | Purpose |
|---|---|---|
| `polar_quant_encode_turbo4()` | `llama-turbo.h`/`.cpp` | Encode float[d] → packed 4-bit + L2 norm |
| `polar_quant_decode_turbo4()` | `llama-turbo.h`/`.cpp` | Decode packed 4-bit + norm → float[d] |
| `turbo_dequantize_k()` | `ggml-metal-turbo.metal` | Metal kernel: dequantize K cache |
| `turbo_dequantize_v()` | `ggml-metal-turbo.metal` | Metal kernel: dequantize V cache |
| `turbo_fwht_128()` | `ggml-metal-turbo.metal` | Fast Walsh-Hadamard Transform |
| `run_perplexity.py` | `benchmarks/` | Measure perplexity impact |
| `run_benchmarks.py` | `benchmarks/` | Full benchmark suite (speed + quality) |
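The "packed 4-bit + L2 norm" layout produced by the encoder can be sketched as follows. This is a hedged simplification: it quantizes each channel uniformly against the vector's L2 norm rather than running the real WHT + polar + Lloyd-Max path, and `PackedKV`, `encode_demo`, and `decode_demo` are hypothetical names, not the project's API.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <vector>

// Sketch of the storage layout: two 4-bit codes per byte plus a float norm.
// Assumes an even head dimension d (the project uses d = 128).
struct PackedKV {
    std::vector<uint8_t> indices;  // d/2 bytes: two 4-bit codes per byte
    float norm;                    // 4 bytes: L2 norm of the original vector
};

PackedKV encode_demo(const std::vector<float>& v) {
    PackedKV out;
    double s = 0.0;
    for (float x : v) s += double(x) * x;
    out.norm = float(std::sqrt(s));
    out.indices.assign(v.size() / 2, 0);
    for (size_t i = 0; i < v.size(); ++i) {
        // Map x/norm in [-1, 1] to a 4-bit code in [0, 15].
        float t = out.norm > 0 ? v[i] / out.norm : 0.0f;
        int code = int(std::lround((t + 1.0f) * 7.5f));
        code = std::min(15, std::max(0, code));
        if (i % 2 == 0) out.indices[i / 2] = uint8_t(code);
        else            out.indices[i / 2] |= uint8_t(code) << 4;
    }
    return out;
}

std::vector<float> decode_demo(const PackedKV& p, size_t d) {
    std::vector<float> v(d);
    for (size_t i = 0; i < d; ++i) {
        int code = (i % 2 == 0) ? (p.indices[i / 2] & 0x0F)
                                : (p.indices[i / 2] >> 4);
        v[i] = (code / 7.5f - 1.0f) * p.norm;
    }
    return v;
}
```

For d = 128 this layout stores 64 index bytes + 4 norm bytes per vector, versus 256 bytes in fp16, matching step 4 of the data flow below.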

## Data Flow

```
Input: float KV vectors [d=128 per head]
  ↓
1. WHT rotation (in-place, O(d log d))
  ↓
2. Convert to polar coords (radius, angles)
  ↓
3. Lloyd-Max quantize angles → 4-bit indices
  ↓
4. Store: packed indices [d/2 bytes] + float norm [4 bytes]
  ↓
Decode: indices → codebook lookup → polar → cartesian → inverse WHT
  ↓
Output: reconstructed float KV [d=128]
```
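Step 1's in-place, O(d log d) rotation can be sketched as a standard butterfly Fast Walsh-Hadamard Transform. This is a generic textbook implementation, not the project's `turbo_fwht_128()` Metal kernel, and it assumes d is a power of two.

```cpp
#include <cstddef>
#include <vector>

// Minimal in-place Fast Walsh-Hadamard Transform for d a power of two.
// Each pass combines pairs (a, b) into (a + b, a - b), doubling the stride:
// O(d log d) total work.
void fwht(std::vector<float>& v) {
    const size_t d = v.size();
    for (size_t h = 1; h < d; h <<= 1) {
        for (size_t i = 0; i < d; i += h << 1) {
            for (size_t j = i; j < i + h; ++j) {
                float a = v[j], b = v[j + h];
                v[j]     = a + b;
                v[j + h] = a - b;
            }
        }
    }
}
```

The Hadamard matrix is its own inverse up to scaling: applying `fwht` twice multiplies every entry by d, so dividing by d (or scaling each pass by 1/√2) undoes the rotation exactly, which is why the decode path can invert step 1 losslessly.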

## File Index

| File | LOC | Purpose |
|---|---|---|
| `llama-turbo.h` | 24 | C API: encode/decode function declarations |
| `llama-turbo.cpp` | 78 | Implementation: PolarQuant encode/decode |
| `ggml-metal-turbo.metal` | 76 | Metal shaders: dequantize + flash attention |
| `CMakeLists.txt` | 44 | Build system: static lib + tests |
| `tests/roundtrip_test.cpp` | 104 | Roundtrip encode→decode validation |
| `benchmarks/run_benchmarks.py` | 227 | Benchmark suite |
| `benchmarks/run_perplexity.py` | ~100 | Perplexity measurement |
| `evolution/hardware_optimizer.py` | 5 | Hardware detection stub |

Total: ~660 LOC | C++ core: 206 LOC | Python benchmarks: 232 LOC

## Dependencies

| Dependency | Purpose |
|---|---|
| CMake 3.16+ | Build system |
| C++17 compiler | Core implementation |
| Metal (macOS) | GPU shader execution |
| Python 3.11+ | Benchmarks |
| llama.cpp fork | Integration target |

## Source Repos (Upstream)

| Repo | Role |
|---|---|
| `TheTom/llama-cpp-turboquant` | llama.cpp fork with Metal shaders |
| `TheTom/turboquant_plus` | Reference implementation, 511+ tests |
| `amirzandieh/QJL` | QJL authors' original code (CUDA) |
| `rachittshah/mlx-turboquant` | MLX fallback |

## Test Coverage

| Test | File | Validates |
|---|---|---|
| `turboquant_roundtrip` | `tests/roundtrip_test.cpp` | Encode→decode roundtrip fidelity |
| Perplexity benchmarks | `benchmarks/run_perplexity.py` | Quality preservation across prompts |
| Speed benchmarks | `benchmarks/run_benchmarks.py` | Compression overhead measurement |

## Security Considerations

  1. No network calls — Pure local computation, no telemetry
  2. Memory safety — C++ code uses raw pointers; roundtrip tests validate correctness
  3. Build isolation — CMake builds static library; no dynamic linking

## Sovereignty Assessment

- Fully local — No cloud dependencies, no API calls
- Open source — All code on Gitea, upstream repos public
- No telemetry — Pure computation
- Hardware-specific — Metal shaders target Apple Silicon; CUDA upstream for other GPUs

Verdict: Fully sovereign. No corporate lock-in. Pure local inference enhancement.


> "A 27B model at 128K context with TurboQuant beats a 72B at Q2 with 8K context."