GENOME.md — TurboQuant

Generated: 2026-04-14 | Codebase Genome Analysis

Project Overview

TurboQuant is a KV cache compression system for local inference on Apple Silicon. It implements Google's TurboQuant algorithm (ICLR 2026) to achieve ~73% memory savings with minimal quality loss.

Core Value Proposition

  • Problem: Large language models (27B+) require massive KV cache memory at long contexts
  • Solution: Three-stage compression (WHT rotation → PolarQuant → QJL residual) reduces the KV cache to ~3.5 bits/channel
  • Result: 128K context on 36GB hardware becomes viable (vs impossible at FP16)

Key Metrics

  • Compression: 73.4% KV memory savings (turbo4 vs f16)
  • Speed: ~1% overhead on prompt processing, ~11% overhead on generation
  • Target: qwen3.5:27b at 128K context within 36GB unified memory
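
A quick back-of-the-envelope check of those numbers (a sketch only, assuming a 128-dim head, 4-bit indices, and one fp16 radius per vector; the repo's exact bookkeeping may differ):

d = 128                                  # head dimension
fp16_bytes = d * 2                       # 256 bytes per K or V vector at f16
turbo4_bytes = d // 2 + 2                # 64 bytes of packed 4-bit indices + 2-byte radius
print(1 - turbo4_bytes / fp16_bytes)     # ≈ 0.742, in the ballpark of the measured 73.4%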

Architecture

graph TB
    subgraph "Input Layer"
        Q[Query Vector Q]
        K[Key Vector K]
        V[Value Vector V]
    end
    
    subgraph "TurboQuant Compression"
        WHT[Walsh-Hadamard Transform]
        PQ[PolarQuant Encode]
        QJL[QJL Residual]
        PACK[Bit Packing]
    end
    
    subgraph "KV Cache Storage"
        CACHE[Compressed KV Cache]
        NORMS[Radius Norms FP16]
    end
    
    subgraph "Decompression & Attention"
        UNPACK[Bit Unpack]
        DEQ[PolarQuant Decode]
        FWHT[Inverse WHT]
        ATTEN[Attention Compute]
    end
    
    subgraph "Output"
        SCORES[Attention Scores]
        OUT[Weighted Values]
    end
    
    K --> WHT
    WHT --> PQ
    PQ --> PACK
    PACK --> CACHE
    PQ --> NORMS
    
    V --> WHT
    WHT --> PQ
    PQ --> PACK
    PACK --> CACHE
    
    CACHE --> UNPACK
    NORMS --> DEQ
    UNPACK --> DEQ
    DEQ --> FWHT
    
    Q --> ATTEN
    FWHT --> ATTEN
    ATTEN --> SCORES
    SCORES --> OUT
    
    style WHT fill:#e1f5fe
    style PQ fill:#fff3e0
    style QJL fill:#f3e5f5
    style ATTEN fill:#e8f5e8

Entry Points

Primary Entry: Metal Shaders

  • File: ggml-metal-turbo.metal
  • Functions:
    • kernel_fwht_128: Walsh-Hadamard transform (GPU)
    • kernel_turbo4_dequant: 4-bit dequantization (hot path)
    • kernel_attention_turbo4: Fused attention (conceptual)

CPU Reference Implementation

  • File: llama-turbo.cpp
  • Functions:
    • polar_quant_encode_turbo4: Encode (CPU reference)
    • polar_quant_decode_turbo4: Decode (CPU reference)
    • fwht: Fast Walsh-Hadamard transform

Benchmarking

  • File: benchmarks/run_benchmarks.py
  • Entry: CLI tool for measuring TTFT, tokens/sec, memory
  • Backends: Ollama, llama-server

Configuration

  • File: profiles/hermes-profile-gemma4-turboquant.yaml
  • Purpose: Hermes agent profile for TurboQuant deployment

Data Flow

1. Model Load
   ├── Load GGUF model weights
   ├── Initialize Lloyd-Max codebook (16 centroids for turbo4)
   ├── Initialize WHT rotation matrix (128×128)
   └── Set per-layer adaptive mode (TURBO_LAYER_ADAPTIVE)

2. Forward Pass (per token)
   ├── Compute Q, K, V projections
   ├── Compress K, V via PolarQuant:
   │   ├── Apply WHT rotation (O(d log d))
   │   ├── Compute L2 norm (radius)
   │   ├── Quantize coordinates to 4-bit indices
   │   └── Pack indices + store radius
   ├── Store compressed K, V in cache
   └── Attention:
       ├── Decompress K from cache (hot path)
       ├── Compute Q·K^T scores
       ├── Apply softmax
       ├── Decompress V from cache
       └── Compute weighted sum

3. Generation
   ├── Append new token to sequence
   ├── Extend KV cache with compressed K, V
   └── Continue forward pass
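
As an illustration of this flow, here is a minimal NumPy sketch of the per-vector encode/decode path. It is not the repo's implementation: the 16-entry codebook below is a uniform placeholder rather than the actual Lloyd-Max table, and all names are invented for the example.

import numpy as np

D = 128                                             # head dimension (power of 2)
# Placeholder codebook: 16 evenly spaced levels covering ±4 standard deviations
# of the expected N(0, 1/D) coordinate distribution (the repo uses Lloyd-Max centroids).
CODEBOOK = np.linspace(-4.0, 4.0, 16) / np.sqrt(D)

def fwht(x):
    """Fast Walsh-Hadamard transform with orthonormal scaling, O(d log d)."""
    x = x.copy()
    h = 1
    while h < len(x):
        for i in range(0, len(x), 2 * h):
            a, b = x[i:i + h].copy(), x[i + h:i + 2 * h].copy()
            x[i:i + h], x[i + h:i + 2 * h] = a + b, a - b
        h *= 2
    return x / np.sqrt(len(x))

def encode_turbo4(v):
    rotated = fwht(v)                               # spread energy across channels
    norm = np.linalg.norm(rotated) + 1e-9           # radius (stored as fp16 in the cache)
    idx = np.abs((rotated / norm)[:, None] - CODEBOOK[None, :]).argmin(axis=1)
    packed = ((idx[0::2] << 4) | idx[1::2]).astype(np.uint8)   # two 4-bit indices per byte
    return packed, norm

def decode_turbo4(packed, norm, d=D):
    idx = np.empty(d, dtype=np.int64)
    idx[0::2], idx[1::2] = packed >> 4, packed & 0x0F
    return fwht(CODEBOOK[idx] * norm)               # orthonormal FWHT is its own inverse

x = np.random.randn(D).astype(np.float32)
x_hat = decode_turbo4(*encode_turbo4(x))            # roundtrip: x_hat approximates x (quality depends on the codebook)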

Key Abstractions

1. PolarQuant Codec

  • Purpose: Compress/decompress KV vectors
  • Algorithm: WHT → polar coordinates → Lloyd-Max quantization
  • Interface: polar_quant_encode_turbo4() / polar_quant_decode_turbo4()

2. Walsh-Hadamard Transform

  • Purpose: Energy-spreading rotation (makes distribution predictable)
  • Property: Orthogonal (preserves inner products)
  • Complexity: O(d log d) vs O(d²) for dense rotation
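
The orthogonality claim is easy to sanity-check numerically; a self-contained sketch using the explicit 128×128 Hadamard matrix (Sylvester construction):

import numpy as np

d = 128
H = np.array([[1.0]])
while H.shape[0] < d:
    H = np.block([[H, H], [H, -H]])           # Sylvester construction of the Hadamard matrix
H /= np.sqrt(d)                               # orthonormal scaling

assert np.allclose(H @ H.T, np.eye(d))        # WHT^T · WHT = I
a, b = np.random.randn(d), np.random.randn(d)
assert np.isclose((H @ a) @ (H @ b), a @ b)   # inner products are preserved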

3. Lloyd-Max Codebook

  • Purpose: Optimal scalar quantization for known distribution
  • Size: 16 entries for turbo4 (4-bit)
  • Key: Precomputed, fixed (no per-vector calibration)
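
For intuition, a sketch of the Lloyd-Max iteration such a table comes from; the repo's table is precomputed and fixed, this only shows how 16 MSE-optimal centroids for N(0, 1/128) coordinates can be derived:

import numpy as np

rng = np.random.default_rng(0)
samples = rng.normal(0.0, 1.0 / np.sqrt(128), size=1_000_000)   # variance 1/128
centroids = np.quantile(samples, (np.arange(16) + 0.5) / 16)    # equal-mass initialization

for _ in range(50):                                   # fixed iteration count
    edges = (centroids[:-1] + centroids[1:]) / 2      # nearest-centroid decision boundaries
    bins = np.digitize(samples, edges)
    centroids = np.array([samples[bins == k].mean() for k in range(16)])

print(np.round(centroids, 4))                         # 16-entry codebook candidate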

4. Per-Layer Adaptive Quantization

  • Purpose: Protect sensitive layers (first/last) with higher precision
  • Modes: 0-7 (0 = uniform, 7 = recommended)
  • Mechanism: TURBO_LAYER_ADAPTIVE environment variable
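
The mode semantics beyond what is stated above are not documented here, so the following is only an illustrative guess at the described behaviour, not the actual mode table:

import os

def kv_cache_type(layer_idx: int, n_layers: int) -> str:
    """Hypothetical per-layer precision selection driven by TURBO_LAYER_ADAPTIVE."""
    mode = int(os.environ.get("TURBO_LAYER_ADAPTIVE", "0"))
    if mode == 0:
        return "turbo4"                    # mode 0: uniform quantization everywhere
    if layer_idx == 0 or layer_idx == n_layers - 1:
        return "f16"                       # protect sensitive first/last layers
    return "turbo4"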

API Surface

C API (llama-turbo.h)

// Encode: float → 4-bit packed
void polar_quant_encode_turbo4(
    const float* src,    // Input [d]
    uint8_t* dst,        // Output [d/2] packed 4-bit
    float* norm,         // Output L2 norm
    int d                // Dimension (must be power of 2)
);

// Decode: 4-bit packed → float
void polar_quant_decode_turbo4(
    const uint8_t* src,  // Input [d/2] packed 4-bit
    float* dst,          // Output [d]
    float norm,          // Input L2 norm
    int d                // Dimension
);
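
A hedged sketch of driving this API from Python via ctypes, mainly to show the buffer sizing (d/2 packed bytes plus a separately stored norm); the shared-library name and path are assumptions and depend on how the fork is built:

import ctypes
import numpy as np

lib = ctypes.CDLL("build/libllama-turbo.dylib")        # assumed artifact name/path
lib.polar_quant_encode_turbo4.argtypes = [
    ctypes.POINTER(ctypes.c_float), ctypes.POINTER(ctypes.c_uint8),
    ctypes.POINTER(ctypes.c_float), ctypes.c_int]
lib.polar_quant_decode_turbo4.argtypes = [
    ctypes.POINTER(ctypes.c_uint8), ctypes.POINTER(ctypes.c_float),
    ctypes.c_float, ctypes.c_int]

d = 128                                                # must be a power of 2
src = np.random.randn(d).astype(np.float32)
packed = np.zeros(d // 2, dtype=np.uint8)              # two 4-bit indices per byte
norm = ctypes.c_float(0.0)

lib.polar_quant_encode_turbo4(
    src.ctypes.data_as(ctypes.POINTER(ctypes.c_float)),
    packed.ctypes.data_as(ctypes.POINTER(ctypes.c_uint8)),
    ctypes.byref(norm), d)

out = np.zeros(d, dtype=np.float32)
lib.polar_quant_decode_turbo4(
    packed.ctypes.data_as(ctypes.POINTER(ctypes.c_uint8)),
    out.ctypes.data_as(ctypes.POINTER(ctypes.c_float)),
    norm, d)                                           # out should now approximate src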

Metal Shaders (GPU)

// Walsh-Hadamard transform (in-place)
kernel void kernel_fwht_128(
    device float* data [[buffer(0)]],
    uint tid [[thread_position_in_grid]]
);

// 4-bit dequantization (hot path)
kernel void kernel_turbo4_dequant(
    device const uchar* src [[buffer(0)]],
    device const float* norms [[buffer(1)]],
    device float* dst [[buffer(2)]],
    uint tid [[thread_position_in_grid]]
);

llama-server CLI

# -ctk/-ctv: KV cache type, -c: context length, --port: API port
llama-server \
  -m model.gguf \
  -ctk turbo4 -ctv turbo4 \
  -c 131072 \
  --port 11434

Environment Variables

  • TURBO_LAYER_ADAPTIVE: Per-layer quantization mode (0-7)
  • TURBO4_USE_4BIT: Enable 4-bit mode (default: 1)

Test Coverage Gaps

Current State

  • Unit tests: None in this repo
  • Integration tests: None
  • Benchmark tests: benchmarks/run_benchmarks.py
  • Perplexity tests: ⚠️ Corpus exists (corpora/wiki.test.raw) but no runner
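
A sketch of what the missing perplexity runner could look like, assuming llama.cpp's perplexity tool is built alongside llama-server (the binary name and flags vary across llama.cpp versions, so treat this as a starting point):

import subprocess

def run_perplexity(model: str, kv_type: str = "turbo4") -> None:
    """Run the llama.cpp perplexity tool over the bundled corpus."""
    subprocess.run(
        ["./build/bin/llama-perplexity",
         "-m", model,
         "-f", "corpora/wiki.test.raw",
         "-ctk", kv_type, "-ctv", kv_type],
        check=True)

run_perplexity("/path/to/model.gguf")        # compare against a kv_type="f16" baseline run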

Critical Missing Tests

  1. Encode/Decode Roundtrip: Verify decode(encode(x)) ≈ x
  2. Inner Product Preservation: Verify Q·K ≈ Q·dequant(quant(K))
  3. WHT Orthogonality: Verify WHT^T · WHT = I
  4. Codebook Correctness: Verify centroids match Lloyd-Max for N(0, 1/128)
  5. Metal vs CPU Parity: Verify GPU and CPU produce identical results
  6. Per-Layer Adaptive: Verify sensitive layers use higher precision
  7. Memory Bounds: Verify no buffer overflows in bit packing
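
A minimal skeleton for the first few of these follows (docstring-only stubs; the assertions still need a Python binding or a ctypes shim against the C API):
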
# tests/test_polar_quant.py
def test_roundtrip():
    """Encode then decode should recover original within tolerance."""
    
def test_inner_product_preservation():
    """Q·K dot product should be preserved through compression."""
    
def test_wht_orthogonality():
    """WHT matrix should be orthogonal."""
    
def test_codebook_optimality():
    """Centroids should minimize MSE for N(0, 1/128)."""

Security Considerations

1. Buffer Overflows

  • Risk: Bit packing/unpacking could overflow if dimension not power of 2
  • Mitigation: Static asserts in Metal shaders, runtime checks in CPU code
  • Status: ⚠️ Need verification
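
The invariant those checks need to enforce is small enough to state directly; an illustrative version of the runtime check (exact placement in the C/Metal code is not verified here):

def packed_buffer_size(d: int) -> int:
    """Bytes needed for the 4-bit packed output of a d-dimensional vector."""
    assert d > 0 and (d & (d - 1)) == 0, "dimension must be a power of 2"
    return d // 2                          # two 4-bit indices per byte; radius stored separately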

2. Numerical Stability

  • Risk: Division by zero when normalizing a zero-norm vector
  • Mitigation: Epsilon guard present (1.0 / (norm + 1e-9))
  • Status: Handled

3. Memory Safety

  • Risk: C/C++ code has no bounds checking
  • Mitigation: Use Rust wrapper or sanitize inputs
  • Status: ⚠️ No safety wrapper

4. Denial of Service

  • Risk: Maliciously crafted KV vectors could cause slow quantization
  • Mitigation: Fixed iteration count in Lloyd-Max search
  • Status: Bounded

5. Side Channels

  • Risk: Timing differences in quantization could leak information
  • Mitigation: Constant-time implementation needed
  • Status: Not implemented

Dependencies

Build Dependencies

  • CMake: Build system
  • Metal SDK: GPU shaders (macOS)
  • C++17: Language standard

Runtime Dependencies

  • Apple Silicon: M1/M2/M3/M4
  • macOS: Metal GPU support
  • llama.cpp: Inference engine (forked)

External References

Deployment

Build

cd llama-cpp-turboquant
git checkout feature/turboquant-kv-cache
cmake -B build -DGGML_METAL=ON -DCMAKE_BUILD_TYPE=Release
cmake --build build -j$(sysctl -n hw.ncpu)

Run

export TURBO_LAYER_ADAPTIVE=7
./build/bin/llama-server \
  -m /path/to/model.gguf \
  --port 11434 \
  -ctk turbo4 -ctv turbo4 \
  -c 131072

Validate

curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"qwen3.5","messages":[{"role":"user","content":"hello"}]}'

Open Questions

  1. QJL Status: Infrastructure exists but is disabled. When will it be needed?
  2. Upstream Landing: When will TurboQuant be merged into llama.cpp mainline?
  3. Quality Threshold: What PPL delta is acceptable for production use?
  4. Multi-GPU: Does TurboQuant work with tensor parallelism?

Changelog

  • 2026-03-30: Phase 1 complete. PolarQuant MVP verified. 73% KV savings confirmed.
  • 2026-04-14: GENOME.md generated. Test gaps identified. Security considerations documented.