# GENOME.md: turboquant

**Generated:** 2026-04-14
**Repo:** Timmy_Foundation/turboquant
**Phase:** 1 Complete (PolarQuant MVP)
**Status:** Production-ready Metal shaders, benchmarks passing

---

## Project Overview

TurboQuant is a KV cache compression system for local LLM inference on Apple Silicon. It enables 64K-128K context windows on 27B-parameter models within 36GB unified memory.

Three-stage compression pipeline:
1. **PolarQuant** -- WHT rotation + polar coordinates + Lloyd-Max codebook (~4.2x compression)
2. **QJL** -- 1-bit quantized Johnson-Lindenstrauss residual correction
3. **TurboQuant** -- PolarQuant + QJL combined = ~3.5 bits/channel, zero accuracy loss

**Key result:** turbo4 delivers 73% KV memory savings with 1% prompt processing overhead and 11% generation overhead.

## Architecture

```mermaid
graph TD
    A[LLM Inference] --> B[KV Cache Layer]
    B --> C{TurboQuant Mode}
    C -->|turbo2| D[2-bit PolarQuant]
    C -->|turbo3| E[3-bit PolarQuant + QJL]
    C -->|turbo4| F[4-bit PolarQuant]
    D --> G[Metal Shader: encode/decode]
    E --> G
    F --> G
    G --> H[Compressed KV Storage]
    H --> I[Decompress on Attention]
    I --> A

    J[llama-turbo.h] --> K[polar_quant_encode_turbo4]
    K --> L[WHT Rotation]
    L --> M[Polar Transform]
    M --> N[Lloyd-Max Quantize]
    N --> O[Packed 4-bit Output]
```

## Entry Points

| Entry Point | Type | Purpose |
|-------------|------|---------|
| `llama-turbo.cpp` | C++ library | Core encode/decode functions |
| `llama-turbo.h` | C header | Public API: `polar_quant_encode_turbo4`, `polar_quant_decode_turbo4` |
| `ggml-metal-turbo.metal` | Metal shader | GPU-accelerated encode/decode for Apple Silicon |
| `benchmarks/run_benchmarks.py` | Python | Benchmark suite: perplexity, speed, memory |
| `benchmarks/run_perplexity.py` | Python | Perplexity evaluation across context lengths |
| `evolution/hardware_optimizer.py` | Python | Hardware-aware parameter tuning |
| `.gitea/workflows/smoke.yml` | CI | Smoke test on push |

## Data Flow

```
Input: float array [d] (d=128, one KV head)
        |
        v
WHT Rotation (structured orthogonal transform)
        |
        v
Polar Transform (cartesian -> polar coordinates)
        |
        v
Lloyd-Max Quantization (non-uniform codebook, 4-bit)
        |
        v
Output: packed uint8_t [d/2] + float norm (radius)
```

Decode is the inverse: unpack -> dequantize -> inverse polar -> inverse WHT.

## Key Abstractions

| Abstraction | Description |
|-------------|-------------|
| **Turbo4** | 4-bit PolarQuant mode. Best quality. 4.2x compression. |
| **Turbo3** | 3-bit mode with QJL residual. ~3.5 bits/channel. |
| **Turbo2** | 2-bit mode. Maximum compression. Quality tradeoff. |
| **WHT** | Walsh-Hadamard Transform. Structured orthogonal rotation. |
| **Lloyd-Max** | Non-uniform codebook optimized for the N(0, 1/sqrt(128)) distribution. |
| **QJL** | Quantized Johnson-Lindenstrauss. 1-bit residual correction. |

## API Surface

### C API (llama-turbo.h)

```c
// Encode: float [d] -> packed 4-bit [d/2] + norm
void polar_quant_encode_turbo4(const float* src, uint8_t* dst, float* norm, int d);

// Decode: packed 4-bit [d/2] + norm -> float [d]
void polar_quant_decode_turbo4(const uint8_t* src, float* dst, float norm, int d);
```

### Integration

TurboQuant integrates with llama.cpp via:

- Mixed quantization pairs: `q8_0 x turbo` for K/V asymmetric compression
- Metal shader dispatch in `ggml-metal.metal` (turbo kernels)
- Build flag: `-DGGML_TURBOQUANT=ON`

### Hermes Profile

`profiles/hermes-profile-gemma4-turboquant.yaml` defines the deployment config:

- Model: gemma4 with turbo4 KV compression
- Target hardware: M3/M4 Max, 36GB+ unified memory
- Context window: up to 128K with compression

## Test Coverage

| Area | Coverage | Notes |
|------|----------|-------|
| WHT rotation | Partial | Metal GPU path uses WHT; CPU reference still uses a dense random rotation (legacy) |
| Encode/decode symmetry | Full | `turbo_rotate_inverse()` verified as the exact inverse of `turbo_rotate_forward()` |
| Lloyd-Max codebook | Full | Non-uniform centroids verified |
| Radius precision | Full | FP16+ norm per 128-element block |
| Metal shader correctness | Full | All dk32-dk576 variants tested |
| Perplexity benchmarks | Full | WikiText results in `benchmarks/perplexity_results.json` |

### Gaps

- No CI integration for Metal shader tests (the smoke test only covers the build)
- CPU reference implementation uses a dense random rotation, not WHT (legacy)
- No long-session stress tests beyond 128K context
- QJL implementation not yet verified against the CUDA reference

## Security Considerations

- **No network access.** All inference is local.
- **No user data in repo.** Benchmarks use the public WikiText corpus.
- **Binary blobs.** `llama-turbo.cpp` compiles to native code. No sandboxing.
- **Upstream dependency.** Fork of TheTom/llama-cpp-turboquant. The trust boundary sits at upstream.

## Dependencies

| Dependency | Type | Source |
|------------|------|--------|
| llama.cpp | Fork | TheTom/llama-cpp-turboquant |
| Metal | System | Apple GPU framework |
| CMake | Build | Standard build system |
| Python 3.10+ | Scripts | Benchmarks and optimizer |

## Key Files

```
turboquant/
  llama-turbo.h            # C API header
  llama-turbo.cpp          # Core encode/decode implementation
  ggml-metal-turbo.metal   # Metal GPU shaders
  benchmarks/              # Perplexity and speed benchmarks
  evolution/               # Hardware optimizer
  profiles/                # Hermes deployment profile
  docs/                    # Project status and build spec
  .gitea/workflows/        # CI smoke test
```