# GENOME.md — TurboQuant (Timmy_Foundation/turboquant)

Codebase Genome v1.1 | Refreshed 2026-04-18 | Repo 12/16 | Ref: #679

## Project Overview

TurboQuant is a KV cache compression system for local inference on Apple Silicon. It implements Google's ICLR 2026 paper to unlock 64K-128K context on 27B models within 32 GB of unified memory.

Three-stage compression:

  1. PolarQuant — WHT rotation + polar coordinates + Lloyd-Max codebook (~4.2x compression)
  2. QJL — 1-bit quantized Johnson-Lindenstrauss residual correction
  3. TurboQuant — PolarQuant + QJL = ~3.5 bits/channel, zero accuracy loss

Key result: 73% KV memory savings, with 1% prompt processing overhead and 11% generation overhead.

## Architecture

```mermaid
graph TD
    subgraph "Compression Pipeline"
        KV[Raw KV Cache fp16] --> WHT[WHT Rotation]
        WHT --> POLAR[PolarQuant 4-bit]
        POLAR --> QJL[QJL Residual]
        QJL --> PACKED[Packed KV ~3.5bit]
    end

    subgraph "Metal Shaders"
        PACKED --> DECODE[Polar Decode Kernel]
        DECODE --> ATTEN[Flash Attention]
        ATTEN --> OUTPUT[Model Output]
    end

    subgraph "Build System"
        CMAKE[CMakeLists.txt] --> LIB[turboquant.a]
        LIB --> TEST[turboquant_roundtrip_test]
        LIB --> LLAMA[llama.cpp fork integration]
    end

    subgraph "Python Layer"
        SELECTOR[quant_selector.py] --> MODELS[model_registry/]
        MODELS --> PROFILE[hardware_profiles.py]
        PROFILE --> DECISION[quantization decision]
    end
```

## Entry Points

| Entry Point | File | Purpose |
| --- | --- | --- |
| `polar_quant_encode_turbo4()` | `llama-turbo.cpp` | Encode float KV → 4-bit packed |
| `polar_quant_decode_turbo4()` | `llama-turbo.cpp` | Decode 4-bit packed → float KV |
| `cmake -S . -B build -DTURBOQUANT_BUILD_TESTS=ON` | `CMakeLists.txt` | Build static library + CTest suite |
| `ctest --test-dir build --output-on-failure` | `build/` | Run C++ roundtrip tests |
| `run_benchmarks.py` | `benchmarks/` | Run perplexity benchmarks |
| `quant_selector.py` | `quant_selector/` | Hardware-aware quantization selection |

## Key Abstractions

| Symbol | File | Purpose |
| --- | --- | --- |
| `polar_quant_encode_turbo4()` | `llama-turbo.h`/`.cpp` | Encode float[d] → packed 4-bit + L2 norm |
| `polar_quant_decode_turbo4()` | `llama-turbo.h`/`.cpp` | Decode packed 4-bit + norm → float[d] |
| `turbo_dequantize_k()` | `ggml-metal-turbo.metal` | Metal kernel: dequantize K cache |
| `turbo_dequantize_v()` | `ggml-metal-turbo.metal` | Metal kernel: dequantize V cache |
| `turbo_fwht_128()` | `ggml-metal-turbo.metal` | Fast Walsh-Hadamard Transform |
| `run_perplexity.py` | `benchmarks/` | Measure perplexity impact |
| `run_benchmarks.py` | `benchmarks/` | Full benchmark suite (speed + quality) |
| `select_quantization()` | `quant_selector.py` | Pick quant scheme from hardware profile |
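The core operation behind `turbo_fwht_128()` can be illustrated in plain Python. This is a generic in-place fast Walsh-Hadamard transform sketch, not the Metal kernel itself (the kernel is assumed here to follow the standard butterfly structure; the demo uses d=4 for brevity, where the kernel works on d=128):

```python
def fwht_inplace(x):
    """In-place fast Walsh-Hadamard transform, O(d log d).
    Length must be a power of two."""
    d = len(x)
    assert d and (d & (d - 1)) == 0, "length must be a power of two"
    h = 1
    while h < d:
        # Butterfly pass: combine elements h apart.
        for i in range(0, d, h * 2):
            for j in range(i, i + h):
                a, b = x[j], x[j + h]
                x[j], x[j + h] = a + b, a - b
        h *= 2

# Applying the transform twice recovers the input scaled by d,
# so the rotation is trivially invertible (divide by d on the way back).
v = [1.0, 2.0, 3.0, 4.0]
fwht_inplace(v)
fwht_inplace(v)
print([c / 4 for c in v])  # -> [1.0, 2.0, 3.0, 4.0]
```

The self-inverse property (up to the 1/d scale) is why the pipeline can rotate in-place on encode and undo the rotation cheaply on decode.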

## Data Flow

```
Input: float KV vectors [d=128 per head]
  ↓
1. WHT rotation (in-place, O(d log d))
  ↓
2. Convert to polar coords (radius, angles)
  ↓
3. Lloyd-Max quantize angles → 4-bit indices
  ↓
4. Store: packed indices [d/2 bytes] + float norm [4 bytes]
  ↓
Decode: indices → codebook lookup → polar → cartesian → inverse WHT
  ↓
Output: reconstructed float KV [d=128]
```
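As a sanity check, the byte layout in step 4 reproduces the headline memory figure (assuming an fp16 baseline for the raw cache):

```python
d = 128                       # channels per head (from the flow above)
fp16_bytes = d * 2            # raw fp16 KV vector: 256 bytes
packed_bytes = d // 2 + 4     # 4-bit indices (d/2 bytes) + fp32 norm (4 bytes)
savings = 1 - packed_bytes / fp16_bytes
print(f"{packed_bytes} B vs {fp16_bytes} B -> {savings:.1%} saved")
# -> 68 B vs 256 B -> 73.4% saved
```

This is consistent with the 73% KV memory savings quoted in the overview.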

## API Surface

| Function | Signature | Notes |
| --- | --- | --- |
| `polar_quant_encode_turbo4` | `(const float*, uint8_t*, float*, int)` | Core encode path |
| `polar_quant_decode_turbo4` | `(const uint8_t*, float, float*, int)` | Core decode path |
| `select_quantization` | `(HardwareProfile) -> QuantConfig` | Python quant selector |
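The encode/decode contract can be mimicked in Python to make the buffer layout concrete. This is a deliberately simplified stand-in: it keeps the packed format (two 4-bit indices per byte plus an L2 norm) but replaces the WHT/polar/Lloyd-Max machinery with a uniform scalar quantizer, so it illustrates the API shape, not the real codec:

```python
import math

def encode_turbo4(src):
    """Simplified stand-in for polar_quant_encode_turbo4: normalize by the
    L2 norm, uniformly quantize each channel to 4 bits, pack two indices
    per byte.  The real library quantizes angles with a Lloyd-Max codebook
    after a WHT rotation; this uniform scalar version is illustrative only."""
    d = len(src)
    norm = math.sqrt(sum(v * v for v in src)) or 1.0
    # Map each normalized channel from [-1, 1] onto the 16 code points 0..15.
    idx = [min(15, max(0, round((v / norm + 1.0) * 7.5))) for v in src]
    packed = bytes(idx[i] | (idx[i + 1] << 4) for i in range(0, d, 2))
    return packed, norm          # d/2 bytes of indices + one float norm

def decode_turbo4(packed, norm, d):
    """Simplified stand-in for polar_quant_decode_turbo4: unpack the 4-bit
    indices and rescale by the stored norm."""
    out = []
    for b in packed:
        for q in (b & 0x0F, b >> 4):
            out.append((q / 7.5 - 1.0) * norm)
    return out[:d]
```

Unlike the C functions, which write into caller-allocated `uint8_t*`/`float*` buffers, this sketch returns fresh objects; the argument roles otherwise mirror the signatures above.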

## File Index

| File | LOC | Purpose |
| --- | --- | --- |
| `llama-turbo.h` | 24 | C API: encode/decode function declarations |
| `llama-turbo.cpp` | 78 | Implementation: PolarQuant encode/decode |
| `ggml-metal-turbo.metal` | 76 | Metal shader: dequantize + FWHT kernels |
| `CMakeLists.txt` | 42 | Standalone build: lib + test targets |
| `quant_selector.py` | ~120 | Python: hardware profile → quant decision |
| `tests/test_quant_selector.py` | ~90 | Pytest: quant selector (currently failing) |
| `benchmarks/run_benchmarks.py` | ~85 | Perplexity + speed benchmarking |

## CI / Runtime Drift

| Dimension | Status | Notes |
| --- | --- | --- |
| CMake/CTest standalone build | Passing | `cmake -S . -B build -DTURBOQUANT_BUILD_TESTS=ON && ctest --test-dir build` works on current main |
| Python quant selector tests | Failing | `tests/test_quant_selector.py` fails on current main; tracked in turboquant #139 |
| CI lane: quant_selector | Broken | Non-blocking due to persistent failures |
| CI lane: cmake roundtrip | Passing | C++ roundtrip test passes in CI |
| Metal shader compilation | ⚠️ Apple Silicon only | Cannot be tested in CI runners; validated manually on M-series hardware |

## Test Coverage Gaps

- `tests/test_quant_selector.py` is currently broken — selector returns the wrong quantization for edge-case hardware profiles (see turboquant #139)
- No CI coverage for Metal shader correctness (Apple Silicon only)
- Benchmark regression detection is manual; no automated threshold enforcement

## Security Considerations

- C API operates on caller-allocated buffers — no internal bounds checking on the `d` parameter
- Python quant selector reads hardware profiles from the filesystem; path traversal risk if the profile dir is user-controllable
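The path traversal risk above is typically mitigated by resolving candidate paths before reading them. A minimal sketch, assuming a hypothetical profile directory and loader name (neither is the actual `quant_selector.py` API):

```python
from pathlib import Path

PROFILE_DIR = Path("/etc/turboquant/profiles")   # hypothetical location

def resolve_profile_path(name: str) -> Path:
    """Resolve a profile name to a file strictly inside PROFILE_DIR,
    rejecting traversal sequences such as '../../etc/passwd'."""
    candidate = (PROFILE_DIR / name).resolve()
    # After resolution, the profile dir must be an ancestor of the result.
    if PROFILE_DIR.resolve() not in candidate.parents:
        raise ValueError(f"profile escapes profile dir: {name!r}")
    return candidate
```

`Path.resolve()` normalizes `..` components and symlinks first, so the ancestry check cannot be bypassed with a crafted relative name.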

## Dependencies

| Dependency | Version | Purpose |
| --- | --- | --- |
| CMake | ≥3.20 | Build system |
| Python | ≥3.10 | Benchmarks + quant selector |
| pytest | any | Test runner for Python tests |
| Metal (macOS) | 14+ | GPU shader compilation |
| llama.cpp | fork | Integration layer |

## Deployment

- Static library `turboquant.a` linked into the llama.cpp fork
- Python quant selector invoked at model-load time to pick the compression scheme
- No standalone server component; embedded in the inference runtime

## Technical Debt

- turboquant #139 — quant selector test failures not yet resolved; CI lane is non-blocking
- No automated benchmark regression detection
- Metal shaders untestable in CI — manual validation on Apple Silicon required
- Stale genome (v1.0, 2026-04-15) did not reflect the quant selector addition or CI drift