# GENOME.md: turboquant

- Generated: 2026-04-14
- Repo: Timmy_Foundation/turboquant
- Phase: 1 Complete (PolarQuant MVP)
- Status: Production-ready Metal shaders, benchmarks passing
## Project Overview
TurboQuant is a KV cache compression system for local LLM inference on Apple Silicon. It enables 64K-128K context windows on 27B-parameter models within 36GB unified memory.
Three-stage compression pipeline:
- PolarQuant -- WHT rotation + polar coordinates + Lloyd-Max codebook (~4.2x compression)
- QJL -- 1-bit quantized Johnson-Lindenstrauss residual correction
- TurboQuant -- PolarQuant + QJL combined = ~3.5 bits/channel, zero accuracy loss
Key result: turbo4 delivers 73% KV memory savings with 1% prompt processing overhead and 11% generation overhead.
## Architecture
```mermaid
graph TD
    A[LLM Inference] --> B[KV Cache Layer]
    B --> C{TurboQuant Mode}
    C -->|turbo2| D[2-bit PolarQuant]
    C -->|turbo3| E[3-bit PolarQuant + QJL]
    C -->|turbo4| F[4-bit PolarQuant]
    D --> G[Metal Shader: encode/decode]
    E --> G
    F --> G
    G --> H[Compressed KV Storage]
    H --> I[Decompress on Attention]
    I --> A
    J[llama-turbo.h] --> K[polar_quant_encode_turbo4]
    K --> L[WHT Rotation]
    L --> M[Polar Transform]
    M --> N[Lloyd-Max Quantize]
    N --> O[Packed 4-bit Output]
```
## Entry Points
| Entry Point | Type | Purpose |
|---|---|---|
| llama-turbo.cpp | C++ library | Core encode/decode functions |
| llama-turbo.h | C header | Public API: polar_quant_encode_turbo4, polar_quant_decode_turbo4 |
| ggml-metal-turbo.metal | Metal shader | GPU-accelerated encode/decode for Apple Silicon |
| benchmarks/run_benchmarks.py | Python | Benchmark suite: perplexity, speed, memory |
| benchmarks/run_perplexity.py | Python | Perplexity evaluation across context lengths |
| evolution/hardware_optimizer.py | Python | Hardware-aware parameter tuning |
| .gitea/workflows/smoke.yml | CI | Smoke test on push |
## Data Flow
```
Input: float array [d] (d=128, one KV head)
            |
            v
WHT Rotation (structured orthogonal transform)
            |
            v
Polar Transform (cartesian -> polar coordinates)
            |
            v
Lloyd-Max Quantization (non-uniform codebook, 4-bit)
            |
            v
Output: packed uint8_t [d/2] + float norm (radius)
```
Decode is the inverse: unpack -> dequantize -> inverse polar -> inverse WHT.
## Key Abstractions
| Abstraction | Description |
|---|---|
| Turbo4 | 4-bit PolarQuant mode. Best quality. 4.2x compression. |
| Turbo3 | 3-bit mode with QJL residual. ~3.5 bits/channel. |
| Turbo2 | 2-bit mode. Maximum compression. Quality tradeoff. |
| WHT | Walsh-Hadamard Transform. Structured orthogonal rotation. |
| Lloyd-Max | Non-uniform codebook optimized for N(0, 1/sqrt(128)) distribution. |
| QJL | Quantized Johnson-Lindenstrauss. 1-bit residual correction. |
## API Surface

### C API (llama-turbo.h)
```c
// Encode: float [d] -> packed 4-bit [d/2] + norm
void polar_quant_encode_turbo4(const float* src, uint8_t* dst, float* norm, int d);

// Decode: packed 4-bit [d/2] + norm -> float [d]
void polar_quant_decode_turbo4(const uint8_t* src, float* dst, float norm, int d);
```
## Integration
TurboQuant integrates with llama.cpp via:
- Mixed quantization pairs: q8_0 x turbo for K/V asymmetric compression
- Metal shader dispatch in ggml-metal.metal (turbo kernels)
- Build flag: -DGGML_TURBOQUANT=ON
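A hypothetical build invocation, assuming the fork keeps the standard llama.cpp CMake layout; only the -DGGML_TURBOQUANT=ON flag comes from this document.

```shell
# Configure with TurboQuant kernels enabled, then build (layout assumed).
cmake -B build -DGGML_TURBOQUANT=ON
cmake --build build
```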
## Hermes Profile
profiles/hermes-profile-gemma4-turboquant.yaml defines deployment config:
- Model: gemma4 with turbo4 KV compression
- Target hardware: M3/M4 Max, 36GB+
- Context window: up to 128K with compression
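As a rough illustration of what such a profile might contain, a hypothetical YAML sketch; the key names are invented, and only the values are taken from the bullets above.

```yaml
# Hypothetical shape of profiles/hermes-profile-gemma4-turboquant.yaml;
# key names are illustrative, not the actual schema.
model: gemma4
kv_compression: turbo4
target_hardware: "M3/M4 Max, 36GB+ unified memory"
context_window: 131072   # up to 128K with compression
```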
## Test Coverage
| Area | Coverage | Notes |
|---|---|---|
| WHT rotation | Partial | Metal GPU uses WHT. CPU ref uses dense random (legacy). |
| Encode/decode symmetry | Full | turbo_rotate_inverse() verified as the exact inverse of turbo_rotate_forward() |
| Lloyd-Max codebook | Full | Non-uniform centroids verified |
| Radius precision | Full | FP16+ norm per 128-element block |
| Metal shader correctness | Full | All dk32-dk576 variants tested |
| Perplexity benchmarks | Full | WikiText results in benchmarks/perplexity_results.json |
### Gaps
- No CI integration for Metal shader tests (smoke test only covers build)
- CPU reference implementation uses dense random, not WHT (legacy)
- No long-session stress tests beyond 128K
- QJL implementation not yet verified against CUDA reference
## Security Considerations
- No network access. All inference is local.
- No user data in repo. Benchmarks use public WikiText corpus.
- Binary blobs. llama-turbo.cpp compiles to native code. No sandboxing.
- Upstream dependency. Fork of TheTom/llama-cpp-turboquant. Trust boundary at upstream.
## Dependencies
| Dependency | Type | Source |
|---|---|---|
| llama.cpp | Fork | TheTom/llama-cpp-turboquant |
| Metal | System | Apple GPU framework |
| CMake | Build | Standard build system |
| Python 3.10+ | Scripts | Benchmarks and optimizer |
## Key Files
```
turboquant/
  llama-turbo.h            # C API header
  llama-turbo.cpp          # Core encode/decode implementation
  ggml-metal-turbo.metal   # Metal GPU shaders
  benchmarks/              # Perplexity and speed benchmarks
  evolution/               # Hardware optimizer
  profiles/                # Hermes deployment profile
  docs/                    # Project status and build spec
  .gitea/workflows/        # CI smoke test
```