# GENOME.md — TurboQuant (Timmy_Foundation/turboquant)

Codebase Genome v1.0 | Generated 2026-04-15 | Repo 12/16

## Project Overview
TurboQuant is a KV cache compression system for local inference on Apple Silicon. It implements Google's ICLR 2026 paper to unlock 64K–128K context on 27B models within 32 GB of unified memory.
Three-stage compression:
- PolarQuant — WHT rotation + polar coordinates + Lloyd-Max codebook (~4.2x compression)
- QJL — 1-bit quantized Johnson-Lindenstrauss residual correction
- TurboQuant — PolarQuant + QJL = ~3.5 bits/channel, zero accuracy loss
Key result: 73% KV memory savings, with 1% prompt-processing overhead and 11% generation overhead.
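The 73% figure follows directly from the storage layout described under Data Flow. A quick back-of-the-envelope check, assuming d = 128 per head and the stated per-vector layout (QJL residual bits folded into the quoted ~3.5 bits/channel):

```python
# Back-of-the-envelope check of the quoted 73% KV memory savings.
# Assumes the layout from the Data Flow section: per 128-dim head vector,
# 4-bit packed indices (d/2 bytes) plus one fp32 L2 norm.
d = 128
fp16_bytes = d * 2                  # 256 bytes per raw fp16 vector
packed_bytes = d // 2 + 4           # 64 bytes of indices + 4-byte norm = 68
savings = 1 - packed_bytes / fp16_bytes
print(f"{savings:.1%}")             # → 73.4%
```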
## Architecture

```mermaid
graph TD
    subgraph "Compression Pipeline"
        KV[Raw KV Cache fp16] --> WHT[WHT Rotation]
        WHT --> POLAR[PolarQuant 4-bit]
        POLAR --> QJL[QJL Residual]
        QJL --> PACKED[Packed KV ~3.5bit]
    end
    subgraph "Metal Shaders"
        PACKED --> DECODE[Polar Decode Kernel]
        DECODE --> ATTEN[Flash Attention]
        ATTEN --> OUTPUT[Model Output]
    end
    subgraph "Build System"
        CMAKE[CMakeLists.txt] --> LIB[turboquant.a]
        LIB --> TEST[turboquant_roundtrip_test]
        LIB --> LLAMA[llama.cpp fork integration]
    end
```
## Entry Points

| Entry Point | File | Purpose |
|---|---|---|
| `polar_quant_encode_turbo4()` | llama-turbo.cpp | Encode float KV → 4-bit packed |
| `polar_quant_decode_turbo4()` | llama-turbo.cpp | Decode 4-bit packed → float KV |
| `cmake` build | CMakeLists.txt | Build static library + tests |
| `run_benchmarks.py` | benchmarks/ | Run perplexity benchmarks |
## Key Abstractions

| Symbol | File | Purpose |
|---|---|---|
| `polar_quant_encode_turbo4()` | llama-turbo.h/.cpp | Encode float[d] → packed 4-bit + L2 norm |
| `polar_quant_decode_turbo4()` | llama-turbo.h/.cpp | Decode packed 4-bit + norm → float[d] |
| `turbo_dequantize_k()` | ggml-metal-turbo.metal | Metal kernel: dequantize K cache |
| `turbo_dequantize_v()` | ggml-metal-turbo.metal | Metal kernel: dequantize V cache |
| `turbo_fwht_128()` | ggml-metal-turbo.metal | Fast Walsh-Hadamard Transform |
| `run_perplexity.py` | benchmarks/ | Measure perplexity impact |
| `run_benchmarks.py` | benchmarks/ | Full benchmark suite (speed + quality) |
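The encoder stores two 4-bit indices per byte (the `d/2 bytes` noted under Data Flow). A minimal Python sketch of that packing; the helper names and the low-nibble-first ordering are illustrative assumptions, not the library's API:

```python
def pack_nibbles(indices):
    """Pack pairs of 4-bit indices (0..15) into bytes, low nibble first.
    Hypothetical helper illustrating the d/2-byte layout, not the real API."""
    assert len(indices) % 2 == 0 and all(0 <= i < 16 for i in indices)
    return bytes(lo | (hi << 4) for lo, hi in zip(indices[::2], indices[1::2]))

def unpack_nibbles(data):
    """Inverse of pack_nibbles: recover the 4-bit index stream."""
    out = []
    for b in data:
        out.append(b & 0x0F)   # low nibble first
        out.append(b >> 4)
    return out

idx = [3, 15, 0, 9]
packed = pack_nibbles(idx)          # 2 bytes for 4 indices
assert unpack_nibbles(packed) == idx
```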
## Data Flow

```
Input: float KV vectors [d=128 per head]
  ↓
1. WHT rotation (in-place, O(d log d))
  ↓
2. Convert to polar coords (radius, angles)
  ↓
3. Lloyd-Max quantize angles → 4-bit indices
  ↓
4. Store: packed indices [d/2 bytes] + float norm [4 bytes]
  ↓
Decode: indices → codebook lookup → polar → cartesian → inverse WHT
  ↓
Output: reconstructed float KV [d=128]
```
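The flow above can be sketched end to end in plain Python. This is an illustration only: a uniform angle quantizer stands in for the Lloyd-Max codebook, radii are kept in float rather than packed, and d = 16 for brevity:

```python
import math

def fwht(v):
    """Orthonormal fast Walsh-Hadamard transform, O(d log d).
    The orthonormal WHT is its own inverse, so the same call undoes the rotation."""
    d, h = len(v), 1
    while h < d:
        for i in range(0, d, 2 * h):
            for j in range(i, i + h):
                v[j], v[j + h] = v[j] + v[j + h], v[j] - v[j + h]
        h *= 2
    s = 1.0 / math.sqrt(d)
    return [x * s for x in v]

BITS = 4
LEVELS = (1 << BITS) - 1  # 15 steps for a 4-bit uniform quantizer

def encode(vec):
    # 1. WHT rotation spreads energy evenly across channels
    r = fwht(list(vec))
    # 2. polar coordinates per channel pair: radius + angle
    norm = math.sqrt(sum(x * x for x in r))
    pairs = [(r[2 * i], r[2 * i + 1]) for i in range(len(r) // 2)]
    radii = [math.hypot(x, y) / norm for x, y in pairs]
    angles = [math.atan2(y, x) for x, y in pairs]
    # 3. uniform 4-bit angle quantization (Lloyd-Max codebook in the real code)
    idx = [round((a + math.pi) / (2 * math.pi) * LEVELS) for a in angles]
    return idx, radii, norm

def decode(idx, radii, norm):
    out = []
    for q, r in zip(idx, radii):
        a = q / LEVELS * 2 * math.pi - math.pi
        out.extend([r * norm * math.cos(a), r * norm * math.sin(a)])
    return fwht(out)  # inverse rotation

vec = [math.sin(0.7 * i) + 0.1 for i in range(16)]
rec = decode(*encode(vec))
err = math.sqrt(sum((a - b) ** 2 for a, b in zip(vec, rec)))
scale = math.sqrt(sum(a * a for a in vec))
print(err / scale)  # small relative error from angle quantization alone
```

Because the rotation is orthonormal, the reconstruction error is exactly the angle-quantization error, bounded by half a quantizer step per pair.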
## File Index

| File | LOC | Purpose |
|---|---|---|
| llama-turbo.h | 24 | C API: encode/decode function declarations |
| llama-turbo.cpp | 78 | Implementation: PolarQuant encode/decode |
| ggml-metal-turbo.metal | 76 | Metal shaders: dequantize + flash attention |
| CMakeLists.txt | 44 | Build system: static lib + tests |
| tests/roundtrip_test.cpp | 104 | Roundtrip encode→decode validation |
| benchmarks/run_benchmarks.py | 227 | Benchmark suite |
| benchmarks/run_perplexity.py | ~100 | Perplexity measurement |
| evolution/hardware_optimizer.py | 5 | Hardware detection stub |
Total: ~660 LOC | C++ core: 206 LOC | Python benchmarks: 232 LOC
## Dependencies
| Dependency | Purpose |
|---|---|
| CMake 3.16+ | Build system |
| C++17 compiler | Core implementation |
| Metal (macOS) | GPU shader execution |
| Python 3.11+ | Benchmarks |
| llama.cpp fork | Integration target |
## Source Repos (Upstream)
| Repo | Role |
|---|---|
| TheTom/llama-cpp-turboquant | llama.cpp fork with Metal shaders |
| TheTom/turboquant_plus | Reference impl, 511+ tests |
| amirzandieh/QJL | Author QJL code (CUDA) |
| rachittshah/mlx-turboquant | MLX fallback |
## Test Coverage

| Test | File | Validates |
|---|---|---|
| `turboquant_roundtrip` | tests/roundtrip_test.cpp | Encode→decode roundtrip fidelity |
| Perplexity benchmarks | benchmarks/run_perplexity.py | Quality preservation across prompts |
| Speed benchmarks | benchmarks/run_benchmarks.py | Compression overhead measurement |
## Security Considerations
- No network calls — Pure local computation, no telemetry
- Memory safety — C++ code uses raw pointers; roundtrip tests validate correctness
- Build isolation — CMake builds static library; no dynamic linking
## Sovereignty Assessment
- Fully local — No cloud dependencies, no API calls
- Open source — All code on Gitea, upstream repos public
- No telemetry — Pure computation
- Hardware-specific — Metal shaders target Apple Silicon; CUDA upstream for other GPUs
Verdict: Fully sovereign. No corporate lock-in. Pure local inference enhancement.
> "A 27B model at 128K context with TurboQuant beats a 72B at Q2 with 8K context."