# GENOME.md — TurboQuant (Timmy_Foundation/turboquant)

Codebase Genome v1.1 | Refreshed 2026-04-18 | Repo 12/16 | Ref: #679

## Project Overview

TurboQuant is a KV cache compression system for local inference on Apple Silicon. It implements Google's ICLR 2026 paper to unlock 64K-128K context on 27B models within 32 GB of unified memory.

Three-stage compression:

  1. PolarQuant — WHT rotation + polar coordinates + Lloyd-Max codebook (~4.2x compression)
  2. QJL — 1-bit quantized Johnson-Lindenstrauss residual correction
  3. TurboQuant — PolarQuant + QJL = ~3.5 bits/channel, zero accuracy loss

Key result: 73% KV memory savings, with 1% prompt processing overhead and 11% generation overhead.

## Architecture

```mermaid
graph TD
    subgraph "Compression Pipeline"
        KV[Raw KV Cache fp16] --> WHT[WHT Rotation]
        WHT --> POLAR[PolarQuant 4-bit]
        POLAR --> QJL[QJL Residual]
        QJL --> PACKED[Packed KV ~3.5bit]
    end

    subgraph "Metal Shaders"
        PACKED --> DECODE[Polar Decode Kernel]
        DECODE --> ATTEN[Flash Attention]
        ATTEN --> OUTPUT[Model Output]
    end

    subgraph "Build System"
        CMAKE[CMakeLists.txt] --> LIB[turboquant.a]
        LIB --> TEST[turboquant_roundtrip_test]
        LIB --> LLAMA[llama.cpp fork integration]
    end

    subgraph "Python Layer"
        SELECTOR[quant_selector.py] --> MODELS[model_registry/]
        MODELS --> PROFILE[hardware_profiles.py]
        PROFILE --> DECISION[quantization decision]
    end
```

## Entry Points

| Entry Point | File | Purpose |
| --- | --- | --- |
| `polar_quant_encode_turbo4()` | `llama-turbo.cpp` | Encode float KV → 4-bit packed |
| `polar_quant_decode_turbo4()` | `llama-turbo.cpp` | Decode 4-bit packed → float KV |
| `cmake -S . -B build -DTURBOQUANT_BUILD_TESTS=ON` | `CMakeLists.txt` | Build static library + CTest suite |
| `ctest --test-dir build --output-on-failure` | `build/` | Run C++ roundtrip tests |
| `run_benchmarks.py` | `benchmarks/` | Run perplexity benchmarks |
| `quant_selector.py` | `quant_selector/` | Hardware-aware quantization selection |

## Key Abstractions

| Symbol | File | Purpose |
| --- | --- | --- |
| `polar_quant_encode_turbo4()` | `llama-turbo.h`/`.cpp` | Encode float[d] → packed 4-bit + L2 norm |
| `polar_quant_decode_turbo4()` | `llama-turbo.h`/`.cpp` | Decode packed 4-bit + norm → float[d] |
| `turbo_dequantize_k()` | `ggml-metal-turbo.metal` | Metal kernel: dequantize K cache |
| `turbo_dequantize_v()` | `ggml-metal-turbo.metal` | Metal kernel: dequantize V cache |
| `turbo_fwht_128()` | `ggml-metal-turbo.metal` | Fast Walsh-Hadamard Transform |
| `run_perplexity.py` | `benchmarks/` | Measure perplexity impact |
| `run_benchmarks.py` | `benchmarks/` | Full benchmark suite (speed + quality) |
| `select_quantization()` | `quant_selector.py` | Pick quant scheme from hardware profile |
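The core operation behind `turbo_fwht_128()` can be illustrated in plain Python. This is a generic in-place fast Walsh-Hadamard transform sketch, not the Metal kernel itself (the kernel is assumed here to follow the standard butterfly structure; the demo uses d=4 for brevity, where the kernel works on d=128):

```python
def fwht_inplace(x):
    """In-place fast Walsh-Hadamard transform, O(d log d).
    Length must be a power of two."""
    d = len(x)
    assert d and (d & (d - 1)) == 0, "length must be a power of two"
    h = 1
    while h < d:
        # Butterfly pass: combine elements h apart.
        for i in range(0, d, h * 2):
            for j in range(i, i + h):
                a, b = x[j], x[j + h]
                x[j], x[j + h] = a + b, a - b
        h *= 2

# Applying the transform twice recovers the input scaled by d,
# so the rotation is trivially invertible (divide by d on the way back).
v = [1.0, 2.0, 3.0, 4.0]
fwht_inplace(v)
fwht_inplace(v)
print([c / 4 for c in v])  # -> [1.0, 2.0, 3.0, 4.0]
```

The self-inverse property (up to the 1/d scale) is why the pipeline can rotate in-place on encode and undo the rotation cheaply on decode.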

## Data Flow

```
Input: float KV vectors [d=128 per head]
  ↓
1. WHT rotation (in-place, O(d log d))
  ↓
2. Convert to polar coords (radius, angles)
  ↓
3. Lloyd-Max quantize angles → 4-bit indices
  ↓
4. Store: packed indices [d/2 bytes] + float norm [4 bytes]
  ↓
Decode: indices → codebook lookup → polar → cartesian → inverse WHT
  ↓
Output: reconstructed float KV [d=128]
```
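As a sanity check, the byte layout in step 4 reproduces the headline memory figure (assuming an fp16 baseline for the raw cache):

```python
d = 128                       # channels per head (from the flow above)
fp16_bytes = d * 2            # raw fp16 KV vector: 256 bytes
packed_bytes = d // 2 + 4     # 4-bit indices (d/2 bytes) + fp32 norm (4 bytes)
savings = 1 - packed_bytes / fp16_bytes
print(f"{packed_bytes} B vs {fp16_bytes} B -> {savings:.1%} saved")
# -> 68 B vs 256 B -> 73.4% saved
```

This is consistent with the 73% KV memory savings quoted in the overview.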

## API Surface

| Function | Signature | Notes |
| --- | --- | --- |
| `polar_quant_encode_turbo4` | `(const float*, uint8_t*, float*, int)` | Core encode path |
| `polar_quant_decode_turbo4` | `(const uint8_t*, float, float*, int)` | Core decode path |
| `select_quantization` | `(HardwareProfile) -> QuantConfig` | Python quant selector |
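The encode/decode contract can be mimicked in Python to make the buffer layout concrete. This is a deliberately simplified stand-in: it keeps the packed format (two 4-bit indices per byte plus an L2 norm) but replaces the WHT/polar/Lloyd-Max machinery with a uniform scalar quantizer, so it illustrates the API shape, not the real codec:

```python
import math

def encode_turbo4(src):
    """Simplified stand-in for polar_quant_encode_turbo4: normalize by the
    L2 norm, uniformly quantize each channel to 4 bits, pack two indices
    per byte.  The real library quantizes angles with a Lloyd-Max codebook
    after a WHT rotation; this uniform scalar version is illustrative only."""
    d = len(src)
    norm = math.sqrt(sum(v * v for v in src)) or 1.0
    # Map each normalized channel from [-1, 1] onto the 16 code points 0..15.
    idx = [min(15, max(0, round((v / norm + 1.0) * 7.5))) for v in src]
    packed = bytes(idx[i] | (idx[i + 1] << 4) for i in range(0, d, 2))
    return packed, norm          # d/2 bytes of indices + one float norm

def decode_turbo4(packed, norm, d):
    """Simplified stand-in for polar_quant_decode_turbo4: unpack the 4-bit
    indices and rescale by the stored norm."""
    out = []
    for b in packed:
        for q in (b & 0x0F, b >> 4):
            out.append((q / 7.5 - 1.0) * norm)
    return out[:d]
```

Unlike the C functions, which write into caller-allocated `uint8_t*`/`float*` buffers, this sketch returns fresh objects; the argument roles otherwise mirror the signatures above.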

## File Index

| File | LOC | Purpose |
| --- | --- | --- |
| `llama-turbo.h` | 24 | C API: encode/decode function declarations |
| `llama-turbo.cpp` | 78 | Implementation: PolarQuant encode/decode |
| `ggml-metal-turbo.metal` | 76 | Metal shader: dequantize + FWHT kernels |
| `CMakeLists.txt` | 42 | Standalone build: lib + test targets |
| `quant_selector.py` | ~120 | Python: hardware profile → quant decision |
| `tests/test_quant_selector.py` | ~90 | Pytest: quant selector (currently failing) |
| `benchmarks/run_benchmarks.py` | ~85 | Perplexity + speed benchmarking |

## CI / Runtime Drift

| Dimension | Status | Notes |
| --- | --- | --- |
| CMake/CTest standalone build | Passing | `cmake -S . -B build -DTURBOQUANT_BUILD_TESTS=ON && ctest --test-dir build` works on current main |
| Python quant selector tests | Failing | `tests/test_quant_selector.py` fails on current main; tracked in turboquant #139 |
| CI lane: quant_selector | Broken | Non-blocking due to persistent failures |
| CI lane: cmake roundtrip | Passing | C++ roundtrip test passes in CI |
| Metal shader compilation | ⚠️ Apple Silicon only | Cannot be tested in CI runners; validated manually on M-series hardware |

## Test Coverage Gaps

- `tests/test_quant_selector.py` is currently broken — selector returns the wrong quantization for edge-case hardware profiles (see turboquant #139)
- No CI coverage for Metal shader correctness (Apple Silicon only)
- Benchmark regression detection is manual; no automated threshold enforcement

## Security Considerations

- C API operates on caller-allocated buffers — no internal bounds checking on the `d` parameter
- Python quant selector reads hardware profiles from the filesystem; path traversal risk if the profile dir is user-controllable
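The path traversal risk above is typically mitigated by resolving candidate paths before reading them. A minimal sketch, assuming a hypothetical profile directory and loader name (neither is the actual `quant_selector.py` API):

```python
from pathlib import Path

PROFILE_DIR = Path("/etc/turboquant/profiles")   # hypothetical location

def resolve_profile_path(name: str) -> Path:
    """Resolve a profile name to a file strictly inside PROFILE_DIR,
    rejecting traversal sequences such as '../../etc/passwd'."""
    candidate = (PROFILE_DIR / name).resolve()
    # After resolution, the profile dir must be an ancestor of the result.
    if PROFILE_DIR.resolve() not in candidate.parents:
        raise ValueError(f"profile escapes profile dir: {name!r}")
    return candidate
```

`Path.resolve()` normalizes `..` components and symlinks first, so the ancestry check cannot be bypassed with a crafted relative name.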

## Dependencies

| Dependency | Version | Purpose |
| --- | --- | --- |
| CMake | ≥3.20 | Build system |
| Python | ≥3.10 | Benchmarks + quant selector |
| pytest | any | Test runner for Python tests |
| Metal (macOS) | 14+ | GPU shader compilation |
| llama.cpp | fork | Integration layer |

## Deployment

- Static library `turboquant.a` linked into the llama.cpp fork
- Python quant selector invoked at model-load time to pick the compression scheme
- No standalone server component; embedded in the inference runtime

## Technical Debt

- turboquant #139 — quant selector test failures not yet resolved; CI lane is non-blocking
- No automated benchmark regression detection
- Metal shaders untestable in CI — manual validation on Apple Silicon required
- Stale genome (v1.0, 2026-04-15) did not reflect the quant selector addition or CI drift