# GENOME.md: turboquant

- Generated: 2026-04-14
- Repo: Timmy_Foundation/turboquant
- Phase: 1 Complete (PolarQuant MVP)
- Status: Production-ready Metal shaders, benchmarks passing
## Project Overview
TurboQuant is a KV cache compression system for local LLM inference on Apple Silicon. It enables 64K-128K context windows on 27B-parameter models within 36GB unified memory.
Three-stage compression pipeline:
- PolarQuant -- WHT rotation + polar coordinates + Lloyd-Max codebook (~4.2x compression)
- QJL -- 1-bit quantized Johnson-Lindenstrauss residual correction
- TurboQuant -- PolarQuant + QJL combined = ~3.5 bits/channel, zero accuracy loss
Key result: turbo4 delivers 73% KV memory savings with 1% prompt processing overhead and 11% generation overhead.
## Architecture
```mermaid
graph TD
    A[LLM Inference] --> B[KV Cache Layer]
    B --> C{TurboQuant Mode}
    C -->|turbo2| D[2-bit PolarQuant]
    C -->|turbo3| E[3-bit PolarQuant + QJL]
    C -->|turbo4| F[4-bit PolarQuant]
    D --> G[Metal Shader: encode/decode]
    E --> G
    F --> G
    G --> H[Compressed KV Storage]
    H --> I[Decompress on Attention]
    I --> A
    J[llama-turbo.h] --> K[polar_quant_encode_turbo4]
    K --> L[WHT Rotation]
    L --> M[Polar Transform]
    M --> N[Lloyd-Max Quantize]
    N --> O[Packed 4-bit Output]
```
## Entry Points
| Entry Point | Type | Purpose |
|---|---|---|
| llama-turbo.cpp | C++ library | Core encode/decode functions |
| llama-turbo.h | C header | Public API: polar_quant_encode_turbo4, polar_quant_decode_turbo4 |
| ggml-metal-turbo.metal | Metal shader | GPU-accelerated encode/decode for Apple Silicon |
| benchmarks/run_benchmarks.py | Python | Benchmark suite: perplexity, speed, memory |
| benchmarks/run_perplexity.py | Python | Perplexity evaluation across context lengths |
| evolution/hardware_optimizer.py | Python | Hardware-aware parameter tuning |
| .gitea/workflows/smoke.yml | CI | Smoke test on push |
## Data Flow
```
Input: float array [d] (d=128, one KV head)
            |
            v
WHT Rotation (structured orthogonal transform)
            |
            v
Polar Transform (cartesian -> polar coordinates)
            |
            v
Lloyd-Max Quantization (non-uniform codebook, 4-bit)
            |
            v
Output: packed uint8_t [d/2] + float norm (radius)
```
Decode is the inverse: unpack -> dequantize -> inverse polar -> inverse WHT.
## Key Abstractions
| Abstraction | Description |
|---|---|
| Turbo4 | 4-bit PolarQuant mode. Best quality. 4.2x compression. |
| Turbo3 | 3-bit mode with QJL residual. ~3.5 bits/channel. |
| Turbo2 | 2-bit mode. Maximum compression. Quality tradeoff. |
| WHT | Walsh-Hadamard Transform. Structured orthogonal rotation. |
| Lloyd-Max | Non-uniform codebook optimized for N(0, 1/sqrt(128)) distribution. |
| QJL | Quantized Johnson-Lindenstrauss. 1-bit residual correction. |
## API Surface

### C API (llama-turbo.h)
```c
// Encode: float [d] -> packed 4-bit [d/2] + norm
void polar_quant_encode_turbo4(const float* src, uint8_t* dst, float* norm, int d);

// Decode: packed 4-bit [d/2] + norm -> float [d]
void polar_quant_decode_turbo4(const uint8_t* src, float* dst, float norm, int d);
```
## Integration
TurboQuant integrates with llama.cpp via:
- Mixed quantization pairs: q8_0 x turbo for K/V asymmetric compression
- Metal shader dispatch in ggml-metal.metal (turbo kernels)
- Build flag: -DGGML_TURBOQUANT=ON
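A hypothetical build invocation, assuming the fork keeps the standard llama.cpp CMake layout; only the -DGGML_TURBOQUANT=ON flag comes from this document.

```shell
# Configure with TurboQuant kernels enabled, then build (layout assumed).
cmake -B build -DGGML_TURBOQUANT=ON
cmake --build build
```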
## Hermes Profile
profiles/hermes-profile-gemma4-turboquant.yaml defines deployment config:
- Model: gemma4 with turbo4 KV compression
- Target hardware: M3/M4 Max, 36GB+
- Context window: up to 128K with compression
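As a rough illustration of what such a profile might contain, a hypothetical YAML sketch; the key names are invented, and only the values are taken from the bullets above.

```yaml
# Hypothetical shape of profiles/hermes-profile-gemma4-turboquant.yaml;
# key names are illustrative, not the actual schema.
model: gemma4
kv_compression: turbo4
target_hardware: "M3/M4 Max, 36GB+ unified memory"
context_window: 131072   # up to 128K with compression
```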
## Test Coverage
| Area | Coverage | Notes |
|---|---|---|
| WHT rotation | Partial | Metal GPU uses WHT. CPU ref uses dense random (legacy). |
| Encode/decode symmetry | Full | turbo_rotate_inverse() verified as the exact inverse of turbo_rotate_forward() |
| Lloyd-Max codebook | Full | Non-uniform centroids verified |
| Radius precision | Full | FP16+ norm per 128-element block |
| Metal shader correctness | Full | All dk32-dk576 variants tested |
| Perplexity benchmarks | Full | WikiText results in benchmarks/perplexity_results.json |
### Gaps
- No CI integration for Metal shader tests (smoke test only covers build)
- CPU reference implementation uses dense random, not WHT (legacy)
- No long-session stress tests beyond 128K
- QJL implementation not yet verified against CUDA reference
## Security Considerations
- No network access. All inference is local.
- No user data in repo. Benchmarks use public WikiText corpus.
- Binary blobs. llama-turbo.cpp compiles to native code. No sandboxing.
- Upstream dependency. Fork of TheTom/llama-cpp-turboquant. Trust boundary at upstream.
## Dependencies
| Dependency | Type | Source |
|---|---|---|
| llama.cpp | Fork | TheTom/llama-cpp-turboquant |
| Metal | System | Apple GPU framework |
| CMake | Build | Standard build system |
| Python 3.10+ | Scripts | Benchmarks and optimizer |
## Key Files
```
turboquant/
  llama-turbo.h            # C API header
  llama-turbo.cpp          # Core encode/decode implementation
  ggml-metal-turbo.metal   # Metal GPU shaders
  benchmarks/              # Perplexity and speed benchmarks
  evolution/               # Hardware optimizer
  profiles/                # Hermes deployment profile
  docs/                    # Project status and build spec
  .gitea/workflows/        # CI smoke test
```