Alexander Whitestone f60604ddcc

Fix #679: Generate GENOME.md for turboquant

- Created comprehensive GENOME.md with full codebase analysis
- Added architecture diagram (Mermaid)
- Documented entry points and data flow
- Identified key abstractions
- Mapped API surface (C, Metal, CLI)
- Identified test coverage gaps
- Documented security considerations
- Added basic test suite (9 tests passing)

Key findings:
- 73.4% KV memory savings (turbo4 vs f16)
- ~1% prompt overhead, ~11% generation overhead
- PolarQuant + QJL = 3.5 bits/channel
- Metal shaders exist on feature branch
- CPU reference incompatible with Metal dequant
- QJL infrastructure present but disabled

Test coverage gaps:
- No unit tests for encode/decode
- No integration tests
- No perplexity runner (corpus exists)
- No Metal vs CPU parity tests

Security considerations:
- Buffer overflow risk in bit packing
- No constant-time implementation
- No safety wrapper for C/C++ code

2026-04-14 19:03:21 -04:00

GENOME.md — TurboQuant

Generated: 2026-04-14 | Codebase Genome Analysis

Project Overview

TurboQuant is a KV cache compression system for local inference on Apple Silicon. It implements Google's TurboQuant algorithm (ICLR 2026) to achieve ~73% KV cache memory savings with minimal quality loss.

Core Value Proposition

Problem: Large language models (27B+) require massive KV cache memory at long contexts
Solution: Three-stage compression (WHT rotation, then PolarQuant, then a QJL residual) reduces KV cache to ~3.5 bits/channel. The QJL stage exists in the code but is currently disabled.
Result: 128K context on 36GB hardware becomes viable (vs impossible at FP16)

Key Metrics

Compression: 73.4% KV memory savings (turbo4 vs f16)
Performance cost: ~1% prompt overhead, ~11% generation overhead
Target: qwen3.5:27b at 128K context within 36GB unified memory
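
The 73.4% figure can be reproduced with simple arithmetic. The sketch below assumes 4-bit indices plus one 32-bit norm shared by each 128-channel block; that layout is an assumption chosen because it matches the reported number and the `float` norm in the C API. If norms were stored as FP16 instead, the same formula would give 4.125 bits/channel and roughly 74.2% savings.

```python
def bits_per_channel(index_bits: int, norm_bits: int, block_size: int) -> float:
    """Average storage per channel: packed indices plus one norm shared by the block."""
    return index_bits + norm_bits / block_size


def savings_vs_f16(bpc: float) -> float:
    """Fraction of memory saved relative to 16-bit storage."""
    return 1.0 - bpc / 16.0


# Assumption: 4-bit indices and one 32-bit norm per 128-channel block.
turbo4 = bits_per_channel(index_bits=4, norm_bits=32, block_size=128)
print(f"{turbo4} bits/channel, {savings_vs_f16(turbo4):.1%} saved")
# prints "4.25 bits/channel, 73.4% saved"
```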

Architecture

graph TB
    subgraph "Input Layer"
        Q[Query Vector Q]
        K[Key Vector K]
        V[Value Vector V]
    end
    
    subgraph "TurboQuant Compression"
        WHT[Walsh-Hadamard Transform]
        PQ[PolarQuant Encode]
        QJL[QJL Residual]
        PACK[Bit Packing]
    end
    
    subgraph "KV Cache Storage"
        CACHE[Compressed KV Cache]
        NORMS[Radius Norms FP16]
    end
    
    subgraph "Decompression & Attention"
        UNPACK[Bit Unpack]
        DEQ[PolarQuant Decode]
        FWHT[Inverse WHT]
        ATTEN[Attention Compute]
    end
    
    subgraph "Output"
        SCORES[Attention Scores]
        OUT[Weighted Values]
    end
    
    K --> WHT
    WHT --> PQ
    PQ --> PACK
    PACK --> CACHE
    PQ --> NORMS
    
    V --> WHT
    WHT --> PQ
    PQ --> PACK
    PACK --> CACHE
    
    CACHE --> UNPACK
    NORMS --> DEQ
    UNPACK --> DEQ
    DEQ --> FWHT
    
    Q --> ATTEN
    FWHT --> ATTEN
    ATTEN --> SCORES
    SCORES --> OUT
    
    style WHT fill:#e1f5fe
    style PQ fill:#fff3e0
    style QJL fill:#f3e5f5
    style ATTEN fill:#e8f5e8

Entry Points

Primary Entry: Metal Shaders

File: ggml-metal-turbo.metal
Functions:
- kernel_fwht_128: Walsh-Hadamard transform (GPU)
- kernel_turbo4_dequant: 4-bit dequantization (hot path)
- kernel_attention_turbo4: Fused attention (conceptual)

CPU Reference Implementation

File: llama-turbo.cpp
Functions:
- polar_quant_encode_turbo4: Encode (CPU reference)
- polar_quant_decode_turbo4: Decode (CPU reference)
- fwht: Fast Walsh-Hadamard transform

Benchmarking

File: benchmarks/run_benchmarks.py
Entry: CLI tool for measuring TTFT, tokens/sec, memory
Backends: Ollama, llama-server

Configuration

File: profiles/hermes-profile-gemma4-turboquant.yaml
Purpose: Hermes agent profile for TurboQuant deployment

Data Flow

1. Model Load
   ├── Load GGUF model weights
   ├── Initialize Lloyd-Max codebook (16 centroids for turbo4)
   ├── Initialize WHT rotation matrix (128×128)
   └── Set per-layer adaptive mode (TURBO_LAYER_ADAPTIVE)

2. Forward Pass (per token)
   ├── Compute Q, K, V projections
   ├── Compress K, V via PolarQuant:
   │   ├── Apply WHT rotation (O(d log d))
   │   ├── Compute L2 norm (radius)
   │   ├── Quantize coordinates to 4-bit indices
   │   └── Pack indices + store radius
   ├── Store compressed K, V in cache
   └── Attention:
       ├── Decompress K from cache (hot path)
       ├── Compute Q·K^T scores
       ├── Apply softmax
       ├── Decompress V from cache
       └── Compute weighted sum

3. Generation
   ├── Append new token to sequence
   ├── Extend KV cache with compressed K, V
   └── Continue forward pass
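
The compression steps in stage 2 can be sketched end to end in plain Python. This is a toy model, not the code in llama-turbo.cpp: the codebook is a uniform placeholder rather than the real Lloyd-Max table, the low-nibble-first packing order is an assumption, and the names `make_codebook`, `encode`, and `decode` are hypothetical.

```python
import math
import random


def fwht(x):
    """Normalized fast Walsh-Hadamard transform; len(x) must be a power of 2."""
    y = list(x)
    n = len(y)
    h = 1
    while h < n:
        for i in range(0, n, 2 * h):
            for j in range(i, i + h):
                a, b = y[j], y[j + h]
                y[j], y[j + h] = a + b, a - b
        h *= 2
    scale = 1.0 / math.sqrt(n)
    return [v * scale for v in y]


def make_codebook(d, levels=16, span=3.0):
    """Placeholder uniform codebook covering +/- span standard deviations.

    Assumption: stands in for the real Lloyd-Max table, which is not shown here.
    """
    sigma = 1.0 / math.sqrt(d)
    step = 2 * span * sigma / (levels - 1)
    return [-span * sigma + k * step for k in range(levels)]


def encode(x, codebook):
    """Rotate, record the L2 norm, quantize to nearest centroid, pack two indices per byte."""
    rotated = fwht(x)
    norm = math.sqrt(sum(v * v for v in rotated))
    unit = [v / (norm + 1e-9) for v in rotated]
    idx = [min(range(len(codebook)), key=lambda k: abs(codebook[k] - u)) for u in unit]
    packed = bytes(idx[i] | (idx[i + 1] << 4) for i in range(0, len(idx), 2))
    return packed, norm


def decode(packed, norm, codebook):
    """Unpack indices, look up centroids, rescale by the norm, undo the rotation."""
    idx = []
    for byte in packed:
        idx.extend((byte & 0x0F, byte >> 4))
    rotated = [codebook[k] * norm for k in idx]
    return fwht(rotated)  # the normalized WHT is its own inverse


random.seed(0)
x = [random.gauss(0.0, 1.0) for _ in range(128)]
codebook = make_codebook(128)
packed, norm = encode(x, codebook)
x_hat = decode(packed, norm, codebook)
rel_err = math.sqrt(sum((a - b) ** 2 for a, b in zip(x, x_hat))) / math.sqrt(sum(a * a for a in x))
print(f"{len(packed)} bytes, relative error {rel_err:.3f}")
```

A 128-dimensional vector packs into 64 bytes plus one norm. The reconstruction error printed here reflects the placeholder codebook and says nothing about the quality of the real one.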

Key Abstractions

1. PolarQuant Codec

Purpose: Compress/decompress KV vectors
Algorithm: WHT → polar coordinates → Lloyd-Max quantization
Interface: polar_quant_encode_turbo4() / polar_quant_decode_turbo4()

2. Walsh-Hadamard Transform

Purpose: Energy-spreading rotation (makes distribution predictable)
Property: Orthogonal (preserves inner products)
Complexity: O(d log d) vs O(d²) for dense rotation
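
Both properties can be checked on a small case. The sketch below builds the dense normalized Hadamard matrix with the Sylvester construction (the O(d²) form; a butterfly implementation computes the same product in O(d log d)) and confirms orthogonality and inner-product preservation. The helper names are illustrative and not taken from the codebase.

```python
import math
import random


def hadamard(d):
    """Normalized d x d Hadamard matrix via the Sylvester construction (d a power of 2)."""
    if d < 1 or d & (d - 1):
        raise ValueError("d must be a power of 2")
    h = [[1.0]]
    while len(h) < d:
        h = [row + row for row in h] + [row + [-v for v in row] for row in h]
    s = 1.0 / math.sqrt(d)
    return [[v * s for v in row] for row in h]


def matvec(m, x):
    return [sum(a * b for a, b in zip(row, x)) for row in m]


def dot(a, b):
    return sum(p * q for p, q in zip(a, b))


d = 8
H = hadamard(d)
# Orthogonality: rows have unit length and are mutually perpendicular.
is_orthogonal = all(
    abs(dot(H[i], H[j]) - (1.0 if i == j else 0.0)) < 1e-12
    for i in range(d) for j in range(d)
)

random.seed(1)
q = [random.gauss(0, 1) for _ in range(d)]
k = [random.gauss(0, 1) for _ in range(d)]
preserves_dot = abs(dot(q, k) - dot(matvec(H, q), matvec(H, k))) < 1e-9
print(is_orthogonal, preserves_dot)
```

Because the rotation preserves Q·K exactly, any loss in attention scores comes from the quantization step that follows it, not from the transform.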

3. Lloyd-Max Codebook

Purpose: Optimal scalar quantization for known distribution
Size: 16 entries for turbo4 (4-bit)
Key: Precomputed, fixed (no per-vector calibration)
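
For intuition, a 16-entry table of this kind can be fitted offline with Lloyd's iteration: assign each sample to its nearest centroid, then move each centroid to the mean of its cell. The sketch below fits to random draws from N(0, 1/128); it illustrates the technique and is not the procedure used to produce the shipped codebook, which is not shown in this document.

```python
import math
import random


def lloyd_max(samples, levels=16, iterations=30):
    """Fit scalar quantizer centroids by alternating assignment and mean update."""
    data = sorted(samples)
    # Start from evenly spaced quantiles of the data.
    centroids = [data[(2 * k + 1) * len(data) // (2 * levels)] for k in range(levels)]
    for _ in range(iterations):
        # Decision boundaries are midpoints between neighboring centroids.
        bounds = [(a + b) / 2 for a, b in zip(centroids, centroids[1:])]
        sums = [0.0] * levels
        counts = [0] * levels
        cell = 0
        for v in data:  # data is sorted, so the cell index only moves forward
            while cell < levels - 1 and v > bounds[cell]:
                cell += 1
            sums[cell] += v
            counts[cell] += 1
        # Empty cells keep their previous centroid.
        centroids = [s / c if c else old for s, c, old in zip(sums, counts, centroids)]
    return centroids


def mse(samples, centroids):
    """Mean squared error when each sample snaps to its nearest centroid."""
    return sum(min((v - c) ** 2 for c in centroids) for v in samples) / len(samples)


random.seed(0)
sigma = 1.0 / math.sqrt(128)  # per-coordinate standard deviation for N(0, 1/128)
samples = [random.gauss(0.0, sigma) for _ in range(5000)]
start = lloyd_max(samples, iterations=0)  # quantile initialization only
fitted = lloyd_max(samples, iterations=30)
print(f"MSE/sigma^2: {mse(samples, start) / sigma**2:.4f} -> {mse(samples, fitted) / sigma**2:.4f}")
```

Each iteration can only lower the error on the fitting samples, which is why a fixed iteration count is enough to produce a usable table.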

4. Per-Layer Adaptive Quantization

Purpose: Protect sensitive layers (first/last) with higher precision
Modes: 8 modes, numbered 0-7 (0=uniform, 7=recommended)
Mechanism: TURBO_LAYER_ADAPTIVE environment variable
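
A launcher script could validate the variable before starting the server. The helper below is hypothetical: the function name and the fallback to mode 0 when the variable is unset are assumptions, and the fork's actual parsing and default are not documented here.

```python
import os


def read_layer_adaptive_mode(env=os.environ, default=0):
    """Parse TURBO_LAYER_ADAPTIVE and reject values outside the documented 0-7 range.

    Assumption: an unset variable falls back to `default`; the real default is unknown.
    """
    raw = env.get("TURBO_LAYER_ADAPTIVE")
    if raw is None:
        return default
    try:
        mode = int(raw)
    except ValueError:
        raise ValueError(f"TURBO_LAYER_ADAPTIVE must be an integer, got {raw!r}") from None
    if not 0 <= mode <= 7:
        raise ValueError(f"TURBO_LAYER_ADAPTIVE must be in 0-7, got {mode}")
    return mode


print(read_layer_adaptive_mode({"TURBO_LAYER_ADAPTIVE": "7"}))  # prints 7
```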

API Surface

C API (llama-turbo.h)

// Encode: float → 4-bit packed
void polar_quant_encode_turbo4(
    const float* src,    // Input [d]
    uint8_t* dst,        // Output [d/2] packed 4-bit
    float* norm,         // Output L2 norm
    int d                // Dimension (must be power of 2)
);

// Decode: 4-bit packed → float
void polar_quant_decode_turbo4(
    const uint8_t* src,  // Input [d/2] packed 4-bit
    float* dst,          // Output [d]
    float norm,          // Input L2 norm
    int d                // Dimension
);
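
The `[d/2]` buffer sizes follow from storing two 4-bit indices per byte. A Python model of that layout, with the length checks a caller would want before handing buffers to the C functions, might look like the sketch below. The low-nibble-first order and the function names are assumptions, not the layout confirmed for llama-turbo.cpp.

```python
def pack_nibbles(indices):
    """Pack 4-bit indices two per byte (low nibble first; the order is an assumption)."""
    if len(indices) % 2:
        raise ValueError("need an even number of indices")
    if any(not 0 <= i <= 15 for i in indices):
        raise ValueError("indices must fit in 4 bits")
    return bytes(indices[i] | (indices[i + 1] << 4) for i in range(0, len(indices), 2))


def unpack_nibbles(packed, d):
    """Inverse of pack_nibbles; checks the buffer length against d before reading."""
    if d % 2 or len(packed) != d // 2:
        raise ValueError(f"expected {d // 2} bytes for d={d}, got {len(packed)}")
    out = []
    for byte in packed:
        out.append(byte & 0x0F)
        out.append(byte >> 4)
    return out


print(pack_nibbles([1, 2, 15, 0]).hex())  # prints "210f"
```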

Metal Shaders (GPU)

// Walsh-Hadamard transform (in-place)
kernel void kernel_fwht_128(
    device float* data [[buffer(0)]],
    uint tid [[thread_position_in_grid]]
);

// 4-bit dequantization (hot path)
kernel void kernel_turbo4_dequant(
    device const uchar* src [[buffer(0)]],
    device const float* norms [[buffer(1)]],
    device float* dst [[buffer(2)]],
    uint tid [[thread_position_in_grid]]
);

llama-server CLI

# -ctk/-ctv set the KV cache type, -c the context length, --port the API port
llama-server \
  -m model.gguf \
  -ctk turbo4 -ctv turbo4 \
  -c 131072 \
  --port 11434

Environment Variables

TURBO_LAYER_ADAPTIVE: Per-layer quantization mode (0-7)
TURBO4_USE_4BIT: Enable 4-bit mode (default: 1)

Test Coverage Gaps

Current State

Unit tests: ❌ None in this repo
Integration tests: ❌ None
Benchmark tests: ✅ benchmarks/run_benchmarks.py
Perplexity tests: ⚠️ Corpus exists (corpora/wiki.test.raw) but no runner

Critical Missing Tests

Encode/Decode Roundtrip: Verify decode(encode(x)) ≈ x
Inner Product Preservation: Verify Q·K ≈ Q·dequant(quant(K))
WHT Orthogonality: Verify WHT^T · WHT = I (for the normalized transform; the unnormalized Hadamard matrix gives d·I)
Codebook Correctness: Verify centroids match Lloyd-Max for N(0, 1/128)
Metal vs CPU Parity: Verify GPU and CPU produce matching results within floating-point tolerance
Per-Layer Adaptive: Verify sensitive layers use higher precision
Memory Bounds: Verify no buffer overflows in bit packing

Recommended Test Suite

# tests/test_polar_quant.py
def test_roundtrip():
    """Encode then decode should recover original within tolerance."""
    
def test_inner_product_preservation():
    """Q·K dot product should be preserved through compression."""
    
def test_wht_orthogonality():
    """WHT matrix should be orthogonal."""
    
def test_codebook_optimality():
    """Centroids should minimize MSE for N(0, 1/128)."""

Security Considerations

1. Buffer Overflows

Risk: Bit packing/unpacking could overflow if the dimension is not a power of 2
Mitigation: Static asserts in Metal shaders, runtime checks in CPU code
Status: ⚠️ Need verification

2. Numerical Stability

Risk: Division by zero in 1.0 / (norm + 1e-9)
Mitigation: Epsilon guard present
Status: ✅ Handled
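
A small model of the guard shows what it does and does not cover. The function name is illustrative; only the `1.0 / (norm + 1e-9)` expression comes from the text above.

```python
import math

EPS = 1e-9  # the guard value quoted above


def normalize(x):
    """Scale x toward unit length using the guarded reciprocal; returns (unit, norm)."""
    norm = math.sqrt(sum(v * v for v in x))
    inv = 1.0 / (norm + EPS)
    return [v * inv for v in x], norm


zero_unit, zero_norm = normalize([0.0, 0.0, 0.0, 0.0])  # no ZeroDivisionError
unit, norm = normalize([3.0, 4.0])
tiny_unit, _ = normalize([1e-9, 0.0])  # norm comparable to EPS
print(zero_unit, norm, tiny_unit[0])
```

The all-zero vector normalizes to zeros without raising. One caveat: when the norm is comparable to the epsilon, as in the last call, the result is noticeably shorter than unit length, so the guard prevents a crash but is not a substitute for handling near-zero vectors explicitly.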

3. Memory Safety

Risk: C/C++ code has no bounds checking
Mitigation: Use Rust wrapper or sanitize inputs
Status: ⚠️ No safety wrapper

4. Denial of Service

Risk: Maliciously crafted KV vectors could cause slow quantization
Mitigation: Fixed iteration count in Lloyd-Max search
Status: ✅ Bounded

5. Side Channels

Risk: Timing differences in quantization could leak information
Mitigation: Constant-time implementation needed
Status: ❌ Not implemented

Dependencies

Build Dependencies

CMake: Build system
Metal SDK: GPU shaders (macOS)
C++17: Language standard

Runtime Dependencies

Apple Silicon: M1/M2/M3/M4
macOS: Metal GPU support
llama.cpp: Inference engine (forked)

External References

TheTom/llama-cpp-turboquant — Primary fork
TheTom/turboquant_plus — Reference implementation
amirzandieh/QJL — QJL author's code
rachittshah/mlx-turboquant — MLX fallback

Deployment

Build

cd llama-cpp-turboquant
git checkout feature/turboquant-kv-cache
cmake -B build -DGGML_METAL=ON -DCMAKE_BUILD_TYPE=Release
cmake --build build -j$(sysctl -n hw.ncpu)

Run

export TURBO_LAYER_ADAPTIVE=7
./build/bin/llama-server \
  -m /path/to/model.gguf \
  --port 11434 \
  -ctk turbo4 -ctv turbo4 \
  -c 131072

Validate

curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"qwen3.5","messages":[{"role":"user","content":"hello"}]}'

Open Questions

QJL Status: Infrastructure exists but is disabled. When will it be needed?
Upstream Landing: When will TurboQuant be merged into llama.cpp mainline?
Quality Threshold: What PPL delta is acceptable for production use?
Multi-GPU: Does TurboQuant work with tensor parallelism?

Changelog

2026-03-30: Phase 1 complete. PolarQuant MVP verified. 73% KV savings confirmed.
2026-04-14: GENOME.md generated. Test gaps identified. Security considerations documented.
