# GENOME.md: turboquant

- **Generated:** 2026-04-14
- **Repo:** Timmy_Foundation/turboquant
- **Phase:** 1 Complete (PolarQuant MVP)
- **Status:** Production-ready Metal shaders, benchmarks passing


## Project Overview

TurboQuant is a KV cache compression system for local LLM inference on Apple Silicon. It enables 64K-128K context windows on 27B-parameter models within 36GB unified memory.

Three-stage compression pipeline:

1. **PolarQuant** -- WHT rotation + polar coordinates + Lloyd-Max codebook (~4.2x compression)
2. **QJL** -- 1-bit quantized Johnson-Lindenstrauss residual correction
3. **TurboQuant** -- PolarQuant + QJL combined = ~3.5 bits/channel, zero accuracy loss

**Key result:** turbo4 delivers 73% KV memory savings with 1% prompt-processing overhead and 11% generation overhead.
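To put these numbers in context, the sketch below sizes a 128K-token KV cache at FP16 against turbo4's effective bit rate. The layer and KV-head counts are illustrative assumptions for a 27B-class model, and the per-block norm is assumed stored as float32; none of these values come from this repo's benchmarks.

```c
/* Back-of-envelope KV cache sizing. The model dimensions below are
 * illustrative assumptions for a 27B-class model, not values from
 * this repo. */
#include <stdio.h>

int main(void) {
    const double n_layers   = 46;          /* assumed */
    const double n_kv_heads = 16;          /* assumed */
    const double head_dim   = 128;         /* matches the d=128 block size */
    const double n_ctx      = 128 * 1024;  /* 128K tokens */

    /* K and V at FP16: 16 bits per channel. */
    double channels = 2.0 * n_layers * n_kv_heads * head_dim * n_ctx;
    double fp16_gib = channels * 16.0 / 8.0 / (1024.0 * 1024.0 * 1024.0);

    /* turbo4: 4-bit codes plus a float32 norm amortized over each
       128-channel block -> 4 + 32/128 = 4.25 effective bits. */
    double turbo_bits = 4.0 + 32.0 / head_dim;
    double turbo_gib  = channels * turbo_bits / 8.0 / (1024.0 * 1024.0 * 1024.0);

    printf("FP16 KV cache : %.1f GiB\n", fp16_gib);   /* ~46 GiB */
    printf("turbo4 KV     : %.1f GiB\n", turbo_gib);  /* ~12 GiB */
    printf("savings       : %.0f%%\n", 100.0 * (1.0 - turbo_bits / 16.0));
    return 0;
}
```

Under these assumptions the FP16 cache alone would overflow a 36GB machine at 128K context, which is the gap turbo4 closes.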

## Architecture

```mermaid
graph TD
    A[LLM Inference] --> B[KV Cache Layer]
    B --> C{TurboQuant Mode}
    C -->|turbo2| D[2-bit PolarQuant]
    C -->|turbo3| E[3-bit PolarQuant + QJL]
    C -->|turbo4| F[4-bit PolarQuant]
    D --> G[Metal Shader: encode/decode]
    E --> G
    F --> G
    G --> H[Compressed KV Storage]
    H --> I[Decompress on Attention]
    I --> A

    J[llama-turbo.h] --> K[polar_quant_encode_turbo4]
    K --> L[WHT Rotation]
    L --> M[Polar Transform]
    M --> N[Lloyd-Max Quantize]
    N --> O[Packed 4-bit Output]
```
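The mode fan-out in the diagram reduces to a small parameter table. The sketch below is purely illustrative; the enum and struct names are invented here and are not the repo's actual types.

```c
/* Hypothetical mode table mirroring the diagram above; these names are
 * illustrative, not the repo's actual types. */
typedef enum { TURBO2, TURBO3, TURBO4 } turbo_mode;

typedef struct {
    int bits;     /* PolarQuant code width per channel */
    int use_qjl;  /* 1-bit JL residual correction on top */
} turbo_params;

static turbo_params turbo_params_for(turbo_mode m) {
    switch (m) {
        case TURBO2: return (turbo_params){ 2, 0 };  /* max compression */
        case TURBO3: return (turbo_params){ 3, 1 };  /* ~3.5 bits/channel */
        default:     return (turbo_params){ 4, 0 };  /* best quality */
    }
}
```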

## Entry Points

| Entry Point | Type | Purpose |
| --- | --- | --- |
| `llama-turbo.cpp` | C++ library | Core encode/decode functions |
| `llama-turbo.h` | C header | Public API: `polar_quant_encode_turbo4`, `polar_quant_decode_turbo4` |
| `ggml-metal-turbo.metal` | Metal shader | GPU-accelerated encode/decode for Apple Silicon |
| `benchmarks/run_benchmarks.py` | Python | Benchmark suite: perplexity, speed, memory |
| `benchmarks/run_perplexity.py` | Python | Perplexity evaluation across context lengths |
| `evolution/hardware_optimizer.py` | Python | Hardware-aware parameter tuning |
| `.gitea/workflows/smoke.yml` | CI | Smoke test on push |

## Data Flow

```
Input: float array [d] (d=128, one KV head)
  |
  v
WHT Rotation (structured orthogonal transform)
  |
  v
Polar Transform (cartesian -> polar coordinates)
  |
  v
Lloyd-Max Quantization (non-uniform codebook, 4-bit)
  |
  v
Output: packed uint8_t [d/2] + float norm (radius)
```

Decode is the inverse: unpack -> dequantize -> inverse polar -> inverse WHT.
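A minimal CPU sketch of the encode path, assuming the polar step splits each block into a scalar radius plus a unit direction, and substituting a uniform 16-level grid for the trained Lloyd-Max codebook; the function names here are illustrative, not the repo's.

```c
/* Simplified turbo4-style encoder for one d=128 block. The uniform grid
 * below is a placeholder for the trained Lloyd-Max codebook. */
#include <math.h>
#include <stdint.h>

#define D 128

/* In-place fast Walsh-Hadamard transform, scaled to be orthonormal. */
static void fwht(float *x, int d) {
    for (int len = 1; len < d; len <<= 1)
        for (int i = 0; i < d; i += len << 1)
            for (int j = i; j < i + len; j++) {
                float a = x[j], b = x[j + len];
                x[j]       = a + b;
                x[j + len] = a - b;
            }
    float s = 1.0f / sqrtf((float)d);
    for (int i = 0; i < d; i++) x[i] *= s;
}

static void encode_turbo4_sketch(const float *src, uint8_t *dst, float *norm) {
    float rot[D];
    for (int i = 0; i < D; i++) rot[i] = src[i];
    fwht(rot, D);                                /* 1. WHT rotation */

    float r = 0.0f;                              /* 2. polar split: radius */
    for (int i = 0; i < D; i++) r += rot[i] * rot[i];
    r = sqrtf(r);
    *norm = r;

    const float sigma = 1.0f / sqrtf((float)D);  /* unit-vector channel scale */
    for (int i = 0; i < D; i += 2) {             /* 3. quantize direction to 4 bits */
        float u0 = (r > 0.0f) ? rot[i]     / r : 0.0f;
        float u1 = (r > 0.0f) ? rot[i + 1] / r : 0.0f;
        /* placeholder: uniform 16-level grid over [-3*sigma, 3*sigma];
           trained Lloyd-Max centroids would replace this mapping */
        int q0 = (int)lroundf((u0 / (3.0f * sigma) + 1.0f) * 7.5f);
        int q1 = (int)lroundf((u1 / (3.0f * sigma) + 1.0f) * 7.5f);
        if (q0 < 0) q0 = 0; else if (q0 > 15) q0 = 15;
        if (q1 < 0) q1 = 0; else if (q1 > 15) q1 = 15;
        dst[i / 2] = (uint8_t)(q0 | (q1 << 4));  /* 4. pack two codes per byte */
    }
}
```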

## Key Abstractions

| Abstraction | Description |
| --- | --- |
| Turbo4 | 4-bit PolarQuant mode. Best quality. 4.2x compression. |
| Turbo3 | 3-bit mode with QJL residual. ~3.5 bits/channel. |
| Turbo2 | 2-bit mode. Maximum compression. Quality tradeoff. |
| WHT | Walsh-Hadamard Transform. Structured orthogonal rotation. |
| Lloyd-Max | Non-uniform codebook optimized for the N(0, 1/sqrt(128)) distribution. |
| QJL | Quantized Johnson-Lindenstrauss. 1-bit residual correction. |
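A codebook like this can be fit offline with plain Lloyd iterations over samples drawn from the target distribution. The sketch below is a sample-based stand-in for however the repo actually trains its centroids:

```c
/* Fit a 16-level Lloyd-Max codebook to samples from N(0, 1/sqrt(128)).
 * Sample-based Lloyd iterations; a sketch, not the repo's trainer. */
#include <math.h>

#define LEVELS 16

void fit_lloyd_max(const float *samples, int n, float *centroids) {
    /* init: uniform spread over [-3*sigma, 3*sigma] */
    const float sigma = 1.0f / sqrtf(128.0f);
    for (int k = 0; k < LEVELS; k++)
        centroids[k] = -3.0f * sigma + 6.0f * sigma * (k + 0.5f) / LEVELS;

    for (int iter = 0; iter < 50; iter++) {
        double sum[LEVELS] = {0};
        int    cnt[LEVELS] = {0};
        for (int i = 0; i < n; i++) {
            /* assign each sample to its nearest centroid */
            int best = 0;
            float bd = fabsf(samples[i] - centroids[0]);
            for (int k = 1; k < LEVELS; k++) {
                float d = fabsf(samples[i] - centroids[k]);
                if (d < bd) { bd = d; best = k; }
            }
            sum[best] += samples[i];
            cnt[best]++;
        }
        /* update: each centroid moves to the mean of its cell */
        for (int k = 0; k < LEVELS; k++)
            if (cnt[k] > 0) centroids[k] = (float)(sum[k] / cnt[k]);
    }
}
```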

## API Surface

### C API (`llama-turbo.h`)

```c
// Encode: float [d] -> packed 4-bit [d/2] + norm
void polar_quant_encode_turbo4(const float* src, uint8_t* dst, float* norm, int d);

// Decode: packed 4-bit [d/2] + norm -> float [d]
void polar_quant_decode_turbo4(const uint8_t* src, float* dst, float norm, int d);
```
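A round-trip usage sketch against this API; the synthetic input and the expectation of a small nonzero MSE are illustrative, not documented guarantees.

```c
/* Round-trip check against the public API (llama-turbo.h). */
#include <math.h>
#include <stdint.h>
#include <stdio.h>
#include "llama-turbo.h"

int main(void) {
    enum { D = 128 };
    float src[D], out[D], norm;
    uint8_t packed[D / 2];

    for (int i = 0; i < D; i++)                /* synthetic input block */
        src[i] = sinf(0.1f * (float)i);

    polar_quant_encode_turbo4(src, packed, &norm, D);
    polar_quant_decode_turbo4(packed, out, norm, D);

    double mse = 0.0;
    for (int i = 0; i < D; i++) {
        double e = (double)out[i] - (double)src[i];
        mse += e * e;
    }
    printf("round-trip MSE: %g\n", mse / D);   /* expect small, nonzero */
    return 0;
}
```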

## Integration

TurboQuant integrates with llama.cpp via:

- Mixed quantization pairs: `q8_0` x turbo for K/V asymmetric compression
- Metal shader dispatch in `ggml-metal.metal` (turbo kernels)
- Build flag: `-DGGML_TURBOQUANT=ON`

## Hermes Profile

`profiles/hermes-profile-gemma4-turboquant.yaml` defines deployment config:

- Model: gemma4 with turbo4 KV compression
- Target hardware: M3/M4 Max, 36GB+
- Context window: up to 128K with compression

## Test Coverage

| Area | Coverage | Notes |
| --- | --- | --- |
| WHT rotation | Partial | Metal GPU path uses the WHT; the CPU reference still uses a dense random rotation (legacy). |
| Encode/decode symmetry | Full | `turbo_rotate_forward()` verified as the inverse of `turbo_rotate_inverse()` |
| Lloyd-Max codebook | Full | Non-uniform centroids verified |
| Radius precision | Full | FP16+ norm per 128-element block |
| Metal shader correctness | Full | All dk32-dk576 variants tested |
| Perplexity benchmarks | Full | WikiText results in `benchmarks/perplexity_results.json` |
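The symmetry row has a neat explanation for the WHT path: an orthonormal Walsh-Hadamard transform is an involution, so applying it twice recovers the input exactly (up to float epsilon). A self-contained check, using a local `fwht()` as a stand-in for the repo's `turbo_rotate_*` pair, whose exact signatures are not shown in this document:

```c
/* The orthonormal WHT is its own inverse; round-trip should be exact
 * up to float epsilon. fwht() stands in for turbo_rotate_*. */
#include <math.h>
#include <stdio.h>

static void fwht(float *x, int d) {
    for (int len = 1; len < d; len <<= 1)
        for (int i = 0; i < d; i += len << 1)
            for (int j = i; j < i + len; j++) {
                float a = x[j], b = x[j + len];
                x[j]       = a + b;
                x[j + len] = a - b;
            }
    float s = 1.0f / sqrtf((float)d);
    for (int i = 0; i < d; i++) x[i] *= s;
}

int main(void) {
    enum { D = 128 };
    float x[D], y[D];
    for (int i = 0; i < D; i++) y[i] = x[i] = cosf(0.3f * (float)i);

    fwht(y, D);  /* forward rotation */
    fwht(y, D);  /* applying it again inverts it (involution) */

    float max_err = 0.0f;
    for (int i = 0; i < D; i++) {
        float e = fabsf(y[i] - x[i]);
        if (e > max_err) max_err = e;
    }
    printf("max round-trip error: %g\n", max_err);  /* ~1e-7 expected */
    return 0;
}
```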

### Gaps

- No CI integration for Metal shader tests (the smoke test only covers the build)
- CPU reference implementation uses a dense random rotation, not the WHT (legacy)
- No long-session stress tests beyond 128K context
- QJL implementation not yet verified against the CUDA reference

## Security Considerations

- **No network access.** All inference is local.
- **No user data in repo.** Benchmarks use the public WikiText corpus.
- **Binary blobs.** `llama-turbo.cpp` compiles to native code. No sandboxing.
- **Upstream dependency.** Fork of `TheTom/llama-cpp-turboquant`. Trust boundary at upstream.

## Dependencies

| Dependency | Type | Source |
| --- | --- | --- |
| llama.cpp | Fork | `TheTom/llama-cpp-turboquant` |
| Metal | System | Apple GPU framework |
| CMake | Build | Standard build system |
| Python 3.10+ | Scripts | Benchmarks and optimizer |

## Key Files

```
turboquant/
  llama-turbo.h            # C API header
  llama-turbo.cpp          # Core encode/decode implementation
  ggml-metal-turbo.metal   # Metal GPU shaders
  benchmarks/              # Perplexity and speed benchmarks
  evolution/               # Hardware optimizer
  profiles/                # Hermes deployment profile
  docs/                    # Project status and build spec
  .gitea/workflows/        # CI smoke test
```