
# GENOME.md — TurboQuant (Timmy_Foundation/turboquant)

> Codebase Genome v1.1 | Refreshed 2026-04-18 | Repo 12/16 | Ref: #679

## Project Overview

**TurboQuant** is a KV cache compression system for local inference on Apple Silicon. It implements Google's ICLR 2026 paper to unlock 64K-128K context on 27B models within 32GB unified memory.

**Three-stage compression:**
1. **PolarQuant** — WHT rotation + polar coordinates + Lloyd-Max codebook (~4.2x compression)
2. **QJL** — 1-bit quantized Johnson-Lindenstrauss residual correction
3. **TurboQuant** — PolarQuant + QJL = ~3.5 bits/channel, zero accuracy loss

**Key result:** 73% KV memory savings with 1% prompt processing overhead, 11% generation overhead.
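
As a rough sanity check on the savings figure, the arithmetic below uses the storage layout listed under Data Flow (`d/2` bytes of packed 4-bit indices plus a 4-byte float norm per vector) and assumes an fp16 baseline of 2 bytes per channel. The helper name `kv_bytes_per_vector` is hypothetical. This covers only the 4-bit PolarQuant layout; it does not derive the ~3.5 bits/channel quoted for the full pipeline.

```python
def kv_bytes_per_vector(d: int, index_bits: int = 4, norm_bytes: int = 4) -> int:
    """Bytes for one compressed vector: packed indices plus one float norm.

    Hypothetical helper; the layout is taken from the Data Flow section.
    """
    return (d * index_bits) // 8 + norm_bytes


d = 128                                   # head dimension used in this document
fp16_bytes = d * 2                        # assumed uncompressed baseline (fp16)
packed_bytes = kv_bytes_per_vector(d)     # 64 index bytes + 4 norm bytes
savings = 1 - packed_bytes / fp16_bytes   # fraction of KV memory saved
bits_per_channel = packed_bytes * 8 / d   # effective bits, norm included

print(f"{packed_bytes} bytes/vector, {savings:.1%} saved, "
      f"{bits_per_channel} bits/channel")
```

Under these assumptions one vector takes 68 bytes instead of 256, which is about 73% smaller and lines up with the key result above.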

## Architecture

```mermaid
graph TD
    subgraph "Compression Pipeline"
        KV[Raw KV Cache fp16] --> WHT[WHT Rotation]
        WHT --> POLAR[PolarQuant 4-bit]
        POLAR --> QJL[QJL Residual]
        QJL --> PACKED[Packed KV ~3.5bit]
    end

    subgraph "Metal Shaders"
        PACKED --> DECODE[Polar Decode Kernel]
        DECODE --> ATTEN[Flash Attention]
        ATTEN --> OUTPUT[Model Output]
    end

    subgraph "Build System"
        CMAKE[CMakeLists.txt] --> LIB[turboquant.a]
        LIB --> TEST[turboquant_roundtrip_test]
        LIB --> LLAMA[llama.cpp fork integration]
    end

    subgraph "Python Layer"
        SELECTOR[quant_selector.py] --> MODELS[model_registry/]
        MODELS --> PROFILE[hardware_profiles.py]
        PROFILE --> DECISION[quantization decision]
    end
```

## Entry Points

| Entry Point | File | Purpose |
|-------------|------|---------|
| `polar_quant_encode_turbo4()` | llama-turbo.cpp | Encode float KV → 4-bit packed |
| `polar_quant_decode_turbo4()` | llama-turbo.cpp | Decode 4-bit packed → float KV |
| `cmake -S . -B build -DTURBOQUANT_BUILD_TESTS=ON` | CMakeLists.txt | Build static library + CTest suite |
| `ctest --test-dir build --output-on-failure` | build/ | Run C++ roundtrip tests |
| `run_benchmarks.py` | benchmarks/ | Run perplexity and speed benchmarks |
| `quant_selector.py` | quant_selector/ | Hardware-aware quantization selection |

## Key Abstractions

| Symbol | File | Purpose |
|--------|------|---------|
| `polar_quant_encode_turbo4()` | llama-turbo.h/.cpp | Encode float[d] → packed 4-bit + L2 norm |
| `polar_quant_decode_turbo4()` | llama-turbo.h/.cpp | Decode packed 4-bit + norm → float[d] |
| `turbo_dequantize_k()` | ggml-metal-turbo.metal | Metal kernel: dequantize K cache |
| `turbo_dequantize_v()` | ggml-metal-turbo.metal | Metal kernel: dequantize V cache |
| `turbo_fwht_128()` | ggml-metal-turbo.metal | Fast Walsh-Hadamard Transform |
| `run_perplexity.py` | benchmarks/ | Measure perplexity impact |
| `run_benchmarks.py` | benchmarks/ | Full benchmark suite (speed + quality) |
| `select_quantization()` | quant_selector.py | Pick quant scheme from hardware profile |

## Data Flow

```
Input: float KV vectors [d=128 per head]
  ↓
1. WHT rotation (in-place, O(d log d))
  ↓
2. Convert to polar coords (radius, angles)
  ↓
3. Lloyd-Max quantize angles → 4-bit indices
  ↓
4. Store: packed indices [d/2 bytes] + float norm [4 bytes]
  ↓
Decode: indices → codebook lookup → polar → cartesian → inverse WHT
  ↓
Output: reconstructed float KV [d=128]
```
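
Two of the steps above can be illustrated in isolation: the O(d log d) Walsh-Hadamard rotation and the packing of 4-bit indices into `d/2` bytes. The following is a minimal Python sketch, not the repository's implementation. The function names, the orthonormal `1/sqrt(d)` scaling, and the low-nibble-first byte order are all assumptions. The polar conversion and Lloyd-Max codebook are omitted because their details are not given here.

```python
import math


def fwht(values):
    """Orthonormal fast Walsh-Hadamard transform of a power-of-two-length list."""
    n = len(values)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    out = list(values)
    h = 1
    while h < n:                          # log2(n) butterfly stages
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = out[i], out[i + h]
                out[i], out[i + h] = a + b, a - b
        h *= 2
    scale = 1.0 / math.sqrt(n)            # assumed scaling; makes the transform self-inverse
    return [v * scale for v in out]


def pack_nibbles(indices):
    """Pack 4-bit indices two per byte (assumed order: low nibble first)."""
    if len(indices) % 2:
        raise ValueError("need an even number of indices")
    return bytes((indices[i] & 0xF) | ((indices[i + 1] & 0xF) << 4)
                 for i in range(0, len(indices), 2))


def unpack_nibbles(packed):
    """Inverse of pack_nibbles."""
    out = []
    for byte in packed:
        out.append(byte & 0xF)
        out.append(byte >> 4)
    return out
```

With the orthonormal scaling, applying `fwht` twice returns the input up to floating-point error, so the same routine can serve as the inverse rotation in the decode path of this sketch.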

## API Surface

| Function | Signature | Notes |
|----------|-----------|-------|
| `polar_quant_encode_turbo4` | `(const float*, uint8_t*, float*, int)` | Core encode path |
| `polar_quant_decode_turbo4` | `(const uint8_t*, float, float*, int)` | Core decode path |
| `select_quantization` | `(HardwareProfile) -> QuantConfig` | Python quant selector |
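
The C signatures above take raw pointers and an `int` length, so buffer sizing falls to the caller. Below is a small sketch of a caller-side check, written in Python for brevity. The helper `turbo4_buffer_sizes` is hypothetical. The power-of-two requirement is an assumption based on the WHT step, and the `d/2` packed size comes from the Data Flow section.

```python
def turbo4_buffer_sizes(d: int) -> tuple[int, int]:
    """Return (packed_bytes, float_count) needed for one vector of length d.

    Hypothetical helper: rejects lengths that the 4-bit packed layout
    (two indices per byte) or a Walsh-Hadamard transform could not handle.
    """
    if isinstance(d, bool) or not isinstance(d, int):
        raise TypeError("d must be an int")
    if d < 2 or d & (d - 1):
        raise ValueError("d must be a power of two and at least 2")
    return d // 2, d


packed_bytes, float_count = turbo4_buffer_sizes(128)
print(packed_bytes, float_count)
```

A binding layer could run a check like this before allocating the `uint8_t` and `float` buffers that are passed to the encode and decode calls.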

## File Index

| File | LOC | Purpose |
|------|-----|---------|
| `llama-turbo.h` | 24 | C API: encode/decode function declarations |
| `llama-turbo.cpp` | 78 | Implementation: PolarQuant encode/decode |
| `ggml-metal-turbo.metal` | 76 | Metal shader: dequantize + FWHT kernels |
| `CMakeLists.txt` | 42 | Standalone build: lib + test targets |
| `quant_selector.py` | ~120 | Python: hardware profile → quant decision |
| `tests/test_quant_selector.py` | ~90 | Pytest: quant selector (currently failing) |
| `benchmarks/run_benchmarks.py` | ~85 | Perplexity + speed benchmarking |

## CI / Runtime Drift

| Dimension | Status | Notes |
|-----------|--------|-------|
| **CMake/CTest standalone build** | ✅ Passing | `cmake -S . -B build -DTURBOQUANT_BUILD_TESTS=ON && ctest --test-dir build` works on current main |
| **Python quant selector tests** | ❌ Failing | `tests/test_quant_selector.py` fails on current main — tracked in `turboquant #139` |
| **CI lane: quant_selector** | ❌ Broken | The quant selector CI lane is non-blocking due to persistent failures |
| **CI lane: cmake roundtrip** | ✅ Green | C++ roundtrip test passes in CI |
| **Metal shader compilation** | ⚠️ Apple Silicon only | Cannot be tested in CI runners; validated manually on M-series hardware |

## Test Coverage Gaps

- `tests/test_quant_selector.py` is currently broken — selector returns wrong quantization for edge-case hardware profiles (see `turboquant #139`)
- No CI coverage for Metal shader correctness (Apple Silicon only)
- Benchmark regression detection is manual; no automated threshold enforcement
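
One way to close the last gap is a threshold check that a CI job could run on benchmark output. The sketch below is only illustrative. The metric names, the limits, and the sample numbers are made up for the example and are not measurements from this repository, and `check_regression` is a hypothetical helper.

```python
def check_regression(baseline: dict, current: dict, max_relative_increase: dict) -> list:
    """Return one message per metric whose relative increase exceeds its limit."""
    failures = []
    for metric, limit in max_relative_increase.items():
        base, cur = baseline[metric], current[metric]
        increase = (cur - base) / base    # relative change; lower is better for both metrics
        if increase > limit:
            failures.append(f"{metric}: +{increase:.1%} exceeds allowed +{limit:.1%}")
    return failures


# Illustrative numbers only (not real benchmark results).
baseline = {"perplexity": 6.00, "ms_per_token": 20.0}
current = {"perplexity": 6.03, "ms_per_token": 23.0}
limits = {"perplexity": 0.01, "ms_per_token": 0.12}

failures = check_regression(baseline, current, limits)
for message in failures:
    print(message)
```

A CI step could exit non-zero whenever `failures` is non-empty, which would turn the manual comparison into an enforced gate.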

## Security Considerations

- C API operates on caller-allocated buffers — no internal bounds checking on `d` parameter
- Python quant selector reads hardware profile from filesystem; path traversal risk if profile dir is user-controllable
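
The path traversal risk can be reduced by resolving the requested profile path and rejecting anything that escapes the profile directory. This is a minimal sketch using `pathlib`. The function name `resolve_profile_path` and the file name in the usage note are hypothetical, and it is not taken from `quant_selector.py`.

```python
from pathlib import Path


def resolve_profile_path(profile_dir, name):
    """Return the resolved path of `name` inside `profile_dir`.

    Raises ValueError if the resolved path falls outside the directory,
    for example via "../" segments, an absolute path, or a symlink.
    """
    base = Path(profile_dir).resolve()
    candidate = (base / name).resolve()
    if not candidate.is_relative_to(base):    # available since Python 3.9
        raise ValueError(f"profile path escapes {base}: {name!r}")
    return candidate
```

For example, `resolve_profile_path("profiles", "m2.json")` returns a path under the resolved `profiles` directory, while `resolve_profile_path("profiles", "../secrets.json")` raises `ValueError`.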

## Dependencies

| Dependency | Version | Purpose |
|------------|---------|---------|
| CMake | ≥3.20 | Build system |
| Python | ≥3.10 | Benchmarks + quant selector |
| pytest | any | Test runner for Python tests |
| Metal (macOS) | 14+ | GPU shader compilation |
| llama.cpp | fork | Integration layer |

## Deployment

- Static library `turboquant.a` linked into llama.cpp fork
- Python quant selector invoked at model-load time to pick compression scheme
- No standalone server component; embedded in inference runtime

## Technical Debt

- `turboquant #139` — quant selector test failures not yet resolved; CI lane is non-blocking
- No automated benchmark regression detection
- Metal shaders untestable in CI — manual validation on Apple Silicon required
- Stale genome (v1.0, 2026-04-15) did not reflect quant selector addition or CI drift