Fix #679: Generate GENOME.md for turboquant
- Created comprehensive GENOME.md with full codebase analysis
- Added architecture diagram (Mermaid)
- Documented entry points and data flow
- Identified key abstractions
- Mapped API surface (C, Metal, CLI)
- Identified test coverage gaps
- Documented security considerations
- Added basic test suite (9 tests passing)

Key findings:
- 73.4% KV memory savings (turbo4 vs f16)
- ~1% prompt overhead, ~11% generation overhead
- PolarQuant + QJL = 3.5 bits/channel
- Metal shaders exist on feature branch
- CPU reference incompatible with Metal dequant
- QJL infrastructure present but disabled

Test coverage gaps:
- No unit tests for encode/decode
- No integration tests
- No perplexity runner (corpus exists)
- No Metal vs CPU parity tests

Security considerations:
- Buffer overflow risk in bit packing
- No constant-time implementation
- No safety wrapper for C/C++ code

# GENOME.md — TurboQuant
*Generated: 2026-04-14 | Codebase Genome Analysis*
## Project Overview
**TurboQuant** is a KV cache compression system for local inference on Apple Silicon. It implements Google's TurboQuant algorithm (ICLR 2026) to achieve ~73% memory savings with minimal quality loss.
### Core Value Proposition
- **Problem**: Large language models (27B+) require massive KV cache memory at long contexts
- **Solution**: Three-stage compression (WHT rotation → PolarQuant → QJL residual) reduces the KV cache to ~3.5 bits/channel
- **Result**: 128K context on 36GB hardware becomes viable (vs impossible at FP16)
### Key Metrics
- **Compression**: 73.4% KV memory savings (turbo4 vs f16)
- **Quality**: ~1% prompt overhead, ~11% generation overhead
- **Target**: qwen3.5:27b at 128K context within 36GB unified memory
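A back-of-the-envelope check on the compression figure (a sketch assuming a 128-dim head, 4-bit codes, and one fp16 radius per vector; the measured 73.4% presumably includes extra per-block metadata, and the 3.5 bits/channel figure refers to the PolarQuant + QJL configuration, which is not modeled here):
```python
# Rough turbo4-vs-f16 KV memory estimate.
# Assumptions (not taken from the code): head_dim = 128, one fp16 radius
# norm per quantized vector, no other per-block metadata.
head_dim = 128
f16_bits = head_dim * 16                 # 2048 bits per K or V vector
turbo4_bits = head_dim * 4 + 16          # 512 bits of codes + fp16 radius

print(f"effective bits/channel: {turbo4_bits / head_dim:.3f}")      # ~4.125
print(f"savings vs f16:         {1 - turbo4_bits / f16_bits:.1%}")  # ~74.2%
```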
## Architecture
```mermaid
graph TB
    subgraph "Input Layer"
        Q[Query Vector Q]
        K[Key Vector K]
        V[Value Vector V]
    end

    subgraph "TurboQuant Compression"
        WHT[Walsh-Hadamard Transform]
        PQ[PolarQuant Encode]
        QJL[QJL Residual]
        PACK[Bit Packing]
    end

    subgraph "KV Cache Storage"
        CACHE[Compressed KV Cache]
        NORMS[Radius Norms FP16]
    end

    subgraph "Decompression & Attention"
        UNPACK[Bit Unpack]
        DEQ[PolarQuant Decode]
        FWHT[Inverse WHT]
        ATTEN[Attention Compute]
    end

    subgraph "Output"
        SCORES[Attention Scores]
        OUT[Weighted Values]
    end

    K --> WHT
    WHT --> PQ
    PQ --> PACK
    PACK --> CACHE
    PQ --> NORMS
    V --> WHT
    WHT --> PQ
    PQ --> PACK
    PACK --> CACHE
    CACHE --> UNPACK
    NORMS --> DEQ
    UNPACK --> DEQ
    DEQ --> FWHT
    Q --> ATTEN
    FWHT --> ATTEN
    ATTEN --> SCORES
    SCORES --> OUT

    style WHT fill:#e1f5fe
    style PQ fill:#fff3e0
    style QJL fill:#f3e5f5
    style ATTEN fill:#e8f5e8
```
## Entry Points
### Primary Entry: Metal Shaders
- **File**: `ggml-metal-turbo.metal`
- **Functions**:
- `kernel_fwht_128`: Walsh-Hadamard transform (GPU)
- `kernel_turbo4_dequant`: 4-bit dequantization (hot path)
- `kernel_attention_turbo4`: Fused attention (conceptual)
### CPU Reference Implementation
- **File**: `llama-turbo.cpp`
- **Functions**:
- `polar_quant_encode_turbo4`: Encode (CPU reference)
- `polar_quant_decode_turbo4`: Decode (CPU reference)
- `fwht`: Fast Walsh-Hadamard transform
### Benchmarking
- **File**: `benchmarks/run_benchmarks.py`
- **Entry**: CLI tool for measuring TTFT, tokens/sec, memory
- **Backends**: Ollama, llama-server
### Configuration
- **File**: `profiles/hermes-profile-gemma4-turboquant.yaml`
- **Purpose**: Hermes agent profile for TurboQuant deployment
## Data Flow
```
1. Model Load
   ├── Load GGUF model weights
   ├── Initialize Lloyd-Max codebook (16 centroids for turbo4)
   ├── Initialize WHT rotation matrix (128×128)
   └── Set per-layer adaptive mode (TURBO_LAYER_ADAPTIVE)

2. Forward Pass (per token)
   ├── Compute Q, K, V projections
   ├── Compress K, V via PolarQuant:
   │   ├── Apply WHT rotation (O(d log d))
   │   ├── Compute L2 norm (radius)
   │   ├── Quantize coordinates to 4-bit indices
   │   └── Pack indices + store radius
   ├── Store compressed K, V in cache
   └── Attention:
       ├── Decompress K from cache (hot path)
       ├── Compute Q·K^T scores
       ├── Apply softmax
       ├── Decompress V from cache
       └── Compute weighted sum

3. Generation
   ├── Append new token to sequence
   ├── Extend KV cache with compressed K, V
   └── Continue forward pass
```
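A minimal NumPy sketch of step 2's compress/decompress path. It is illustrative only: the quantizer below uses a uniform 4-bit grid instead of the real Lloyd-Max codebook, and the nibble-packing layout is an assumption, not necessarily the one used in `llama-turbo.cpp`.
```python
import numpy as np

def fwht(x):
    """Fast Walsh-Hadamard transform, normalized so the rotation is
    orthogonal; O(d log d), d must be a power of two."""
    x = x.copy()
    d, h = len(x), 1
    while h < d:
        for i in range(0, d, 2 * h):
            a, b = x[i:i + h].copy(), x[i + h:i + 2 * h].copy()
            x[i:i + h], x[i + h:i + 2 * h] = a + b, a - b
        h *= 2
    return x / np.sqrt(d)

LEVELS = 16                    # 4-bit codes
HALF = 4.0 / np.sqrt(128)      # ~4 sigma for coordinates of a unit vector

def encode(v):
    """Rotate, record the radius, quantize to 4-bit indices, pack two per byte."""
    r = fwht(v)
    norm = np.linalg.norm(r) + 1e-9
    u = r / norm                                     # unit-sphere coordinates
    t = np.clip(u / (2 * HALF) + 0.5, 0.0, 1.0)      # map to [0, 1]
    idx = np.round(t * (LEVELS - 1)).astype(np.uint8)
    packed = (idx[0::2] << 4) | idx[1::2]            # two 4-bit codes per byte
    return packed, norm

def decode(packed, norm, d):
    idx = np.empty(d, dtype=np.uint8)
    idx[0::2], idx[1::2] = packed >> 4, packed & 0x0F
    u = (idx / (LEVELS - 1) - 0.5) * (2 * HALF)
    return fwht(u * norm)      # the normalized WHT is its own inverse

d = 128
v = np.random.randn(d).astype(np.float32)
packed, norm = encode(v)
v_hat = decode(packed, norm, d)
print("bytes stored:", packed.nbytes + 2)   # 64 packed bytes + fp16 radius
print("relative error:", np.linalg.norm(v - v_hat) / np.linalg.norm(v))
```
With the real Lloyd-Max codebook the reconstruction error is lower; the sketch mainly shows why only the packed nibbles plus one radius per vector need to live in the cache.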
## Key Abstractions
### 1. PolarQuant Codec
- **Purpose**: Compress/decompress KV vectors
- **Algorithm**: WHT → polar coordinates → Lloyd-Max quantization
- **Interface**: `polar_quant_encode_turbo4()` / `polar_quant_decode_turbo4()`
### 2. Walsh-Hadamard Transform
- **Purpose**: Energy-spreading rotation; coordinates of a rotated vector follow a predictable, near-Gaussian distribution that the Lloyd-Max codebook can target
- **Property**: Orthogonal (preserves inner products)
- **Complexity**: O(d log d) vs O(d²) for dense rotation
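A quick check of the two properties listed above, using SciPy's Hadamard matrix (a sketch; the repo's `fwht` computes the same transform without materializing the d×d matrix):
```python
import numpy as np
from scipy.linalg import hadamard

d = 128
H = hadamard(d) / np.sqrt(d)          # normalized Walsh-Hadamard matrix

# Orthogonality: H^T H = I
assert np.allclose(H.T @ H, np.eye(d))

# Inner products (and hence Q·K attention scores) survive the rotation
rng = np.random.default_rng(0)
q, k = rng.standard_normal(d), rng.standard_normal(d)
assert np.isclose(q @ k, (H @ q) @ (H @ k))
```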
### 3. Lloyd-Max Codebook
- **Purpose**: Optimal scalar quantization for known distribution
- **Size**: 16 entries for turbo4 (4-bit)
- **Key**: Precomputed, fixed (no per-vector calibration)
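A sketch of how such a codebook could be fitted offline with plain Lloyd iterations on samples from N(0, 1/128) (an assumption about how the shipped table was produced; this is not the repo's generation script):
```python
import numpy as np

def lloyd_max_1d(samples, levels=16, iters=100):
    """Fit `levels` scalar centroids minimizing MSE over `samples`
    (Lloyd's algorithm in 1-D, i.e. k-means on a line)."""
    # Initialize centroids at evenly spaced quantiles of the data.
    centroids = np.quantile(samples, np.linspace(0, 1, levels + 2)[1:-1])
    for _ in range(iters):
        # Decision boundaries sit midway between adjacent centroids.
        edges = (centroids[:-1] + centroids[1:]) / 2
        cells = np.digitize(samples, edges)
        # Move each centroid to the mean of its cell (skip empty cells).
        for j in range(levels):
            pts = samples[cells == j]
            if len(pts):
                centroids[j] = pts.mean()
    return np.sort(centroids)

rng = np.random.default_rng(0)
samples = rng.normal(0.0, np.sqrt(1.0 / 128.0), size=200_000)
print(lloyd_max_1d(samples))   # 16 centroids, roughly symmetric about 0
```
A correctness test could compare the centroids shipped in the fork against such a fit.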
### 4. Per-Layer Adaptive Quantization
- **Purpose**: Protect sensitive layers (first/last) with higher precision
- **Modes**: 0–7 (0 = uniform, 7 = recommended)
- **Mechanism**: `TURBO_LAYER_ADAPTIVE` environment variable
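Purely illustrative sketch of what a per-layer policy might look like (names and thresholds below are invented; the actual mode table behind `TURBO_LAYER_ADAPTIVE` lives in the fork and is not reproduced here):
```python
def kv_bits_for_layer(layer: int, n_layers: int, mode: int) -> int:
    """Hypothetical policy: mode 0 quantizes every layer to 4 bits;
    higher modes keep a growing guard band of first/last layers at
    8 bits. The real mapping may differ entirely."""
    if mode == 0:
        return 4
    guard = mode                # invented heuristic, not the fork's table
    if layer < guard or layer >= n_layers - guard:
        return 8
    return 4

print([kv_bits_for_layer(l, n_layers=32, mode=7) for l in range(32)])
```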
## API Surface
### C API (llama-turbo.h)
```c
// Encode: float → 4-bit packed
void polar_quant_encode_turbo4(
    const float* src,    // Input [d]
    uint8_t* dst,        // Output [d/2] packed 4-bit
    float* norm,         // Output L2 norm
    int d                // Dimension (must be power of 2)
);

// Decode: 4-bit packed → float
void polar_quant_decode_turbo4(
    const uint8_t* src,  // Input [d/2] packed 4-bit
    float* dst,          // Output [d]
    float norm,          // Input L2 norm
    int d                // Dimension
);
```
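For experimentation, or for wiring up the missing unit tests, the pair can be driven from Python via `ctypes`. A sketch under two assumptions: the functions are exported with C linkage, and the build produces a shared library at the path shown (adjust as needed):
```python
import ctypes
import numpy as np

# Library path is an assumption; point it at whatever the build produces.
lib = ctypes.CDLL("build/lib/libllama-turbo.dylib")

lib.polar_quant_encode_turbo4.argtypes = [
    ctypes.POINTER(ctypes.c_float),   # src [d]
    ctypes.POINTER(ctypes.c_uint8),   # dst [d/2] packed 4-bit
    ctypes.POINTER(ctypes.c_float),   # norm (out)
    ctypes.c_int,                     # d
]
lib.polar_quant_decode_turbo4.argtypes = [
    ctypes.POINTER(ctypes.c_uint8),   # src [d/2]
    ctypes.POINTER(ctypes.c_float),   # dst [d]
    ctypes.c_float,                   # norm
    ctypes.c_int,                     # d
]

d = 128
src = np.random.randn(d).astype(np.float32)
dst = np.zeros(d // 2, dtype=np.uint8)
out = np.zeros(d, dtype=np.float32)
norm = ctypes.c_float(0.0)

lib.polar_quant_encode_turbo4(
    src.ctypes.data_as(ctypes.POINTER(ctypes.c_float)),
    dst.ctypes.data_as(ctypes.POINTER(ctypes.c_uint8)),
    ctypes.byref(norm), d)
lib.polar_quant_decode_turbo4(
    dst.ctypes.data_as(ctypes.POINTER(ctypes.c_uint8)),
    out.ctypes.data_as(ctypes.POINTER(ctypes.c_float)),
    norm, d)
print(np.linalg.norm(src - out) / np.linalg.norm(src))
```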
### Metal Shaders (GPU)
```metal
// Walsh-Hadamard transform (in-place)
kernel void kernel_fwht_128(
    device float* data [[buffer(0)]],
    uint tid [[thread_position_in_grid]]
);

// 4-bit dequantization (hot path)
kernel void kernel_turbo4_dequant(
    device const uchar* src   [[buffer(0)]],
    device const float* norms [[buffer(1)]],
    device float* dst         [[buffer(2)]],
    uint tid [[thread_position_in_grid]]
);
```
### llama-server CLI
```bash
# -ctk/-ctv: KV cache type; -c: context length; --port: API port
llama-server \
  -m model.gguf \
  -ctk turbo4 -ctv turbo4 \
  -c 131072 \
  --port 11434
```
### Environment Variables
- `TURBO_LAYER_ADAPTIVE`: Per-layer quantization mode (0-7)
- `TURBO4_USE_4BIT`: Enable 4-bit mode (default: 1)
## Test Coverage Gaps
### Current State
- **Unit tests**: ❌ None in this repo
- **Integration tests**: ❌ None
- **Benchmark tests**: ✅ `benchmarks/run_benchmarks.py`
- **Perplexity tests**: ⚠️ Corpus exists (`corpora/wiki.test.raw`) but no runner
### Critical Missing Tests
1. **Encode/Decode Roundtrip**: Verify `decode(encode(x)) ≈ x`
2. **Inner Product Preservation**: Verify `Q·K ≈ Q·dequant(quant(K))`
3. **WHT Orthogonality**: Verify `WHT^T · WHT = I`
4. **Codebook Correctness**: Verify centroids match Lloyd-Max for N(0, 1/128)
5. **Metal vs CPU Parity**: Verify GPU and CPU produce identical results
6. **Per-Layer Adaptive**: Verify sensitive layers use higher precision
7. **Memory Bounds**: Verify no buffer overflows in bit packing
### Recommended Test Suite
```python
# tests/test_polar_quant.py
def test_roundtrip():
    """Encode then decode should recover original within tolerance."""

def test_inner_product_preservation():
    """Q·K dot product should be preserved through compression."""

def test_wht_orthogonality():
    """WHT matrix should be orthogonal."""

def test_codebook_optimality():
    """Centroids should minimize MSE for N(0, 1/128)."""
```
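As a starting point, the roundtrip test can be written against a stand-in codec and later pointed at the real turbo4 encode/decode (e.g. through the `ctypes` binding sketched earlier); the tolerance below is a placeholder, not a measured bound:
```python
# tests/test_polar_quant.py (sketch)
import numpy as np
import pytest

def reference_encode(v):
    """Stand-in codec: store the direction at fp16 plus the radius.
    Replace with the real turbo4 encode once bindings exist."""
    norm = float(np.linalg.norm(v)) + 1e-9
    return (v / norm).astype(np.float16), norm

def reference_decode(u, norm):
    return u.astype(np.float32) * norm

@pytest.mark.parametrize("d", [64, 128])
def test_roundtrip(d):
    """decode(encode(x)) should recover x within the codec's error budget."""
    rng = np.random.default_rng(d)
    x = rng.standard_normal(d).astype(np.float32)
    u, norm = reference_encode(x)
    x_hat = reference_decode(u, norm)
    rel_err = np.linalg.norm(x - x_hat) / np.linalg.norm(x)
    assert rel_err < 0.15       # placeholder tolerance for turbo4
```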
## Security Considerations
### 1. Buffer Overflows
- **Risk**: Bit packing/unpacking could overflow if the dimension is not a power of 2
- **Mitigation**: Static asserts in Metal shaders, runtime checks in CPU code
- **Status**: ⚠️ Need verification
### 2. Numerical Stability
- **Risk**: Division by zero when normalizing by a zero radius norm
- **Mitigation**: Epsilon guard present (`1.0 / (norm + 1e-9)`)
- **Status**: ✅ Handled
### 3. Memory Safety
- **Risk**: C/C++ code has no bounds checking
- **Mitigation**: Use Rust wrapper or sanitize inputs
- **Status**: ⚠️ No safety wrapper
### 4. Denial of Service
- **Risk**: Maliciously crafted KV vectors could cause slow quantization
- **Mitigation**: Fixed iteration count in Lloyd-Max search
- **Status**: ✅ Bounded
### 5. Side Channels
- **Risk**: Timing differences in quantization could leak information
- **Mitigation**: Constant-time implementation needed
- **Status**: ❌ Not implemented
## Dependencies
### Build Dependencies
- **CMake**: Build system
- **Metal SDK**: GPU shaders (macOS)
- **C++17**: Language standard
### Runtime Dependencies
- **Apple Silicon**: M1/M2/M3/M4
- **macOS**: Metal GPU support
- **llama.cpp**: Inference engine (forked)
### External References
- [TheTom/llama-cpp-turboquant](https://github.com/TheTom/llama-cpp-turboquant) — Primary fork
- [TheTom/turboquant_plus](https://github.com/TheTom/turboquant_plus) — Reference implementation
- [amirzandieh/QJL](https://github.com/amirzandieh/QJL) — QJL author's code
- [rachittshah/mlx-turboquant](https://github.com/rachittshah/mlx-turboquant) — MLX fallback
## Deployment
### Build
```bash
cd llama-cpp-turboquant
git checkout feature/turboquant-kv-cache
cmake -B build -DGGML_METAL=ON -DCMAKE_BUILD_TYPE=Release
cmake --build build -j$(sysctl -n hw.ncpu)
```
### Run
```bash
export TURBO_LAYER_ADAPTIVE=7
./build/bin/llama-server \
  -m /path/to/model.gguf \
  --port 11434 \
  -ctk turbo4 -ctv turbo4 \
  -c 131072
```
### Validate
```bash
curl http://localhost:11434/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"model":"qwen3.5","messages":[{"role":"user","content":"hello"}]}'
```
## Open Questions
1. **QJL Status**: Infrastructure exists but is disabled. When will it be needed?
2. **Upstream Landing**: When will TurboQuant be merged into llama.cpp mainline?
3. **Quality Threshold**: What PPL delta is acceptable for production use?
4. **Multi-GPU**: Does TurboQuant work with tensor parallelism?
## Changelog
- **2026-03-30**: Phase 1 complete. PolarQuant MVP verified. 73% KV savings confirmed.
- **2026-04-14**: GENOME.md generated. Test gaps identified. Security considerations documented.