Commit ac25f2f9d4: fix(#679): Codebase Genome for turboquant (2026-04-15 03:11:49 +00:00)
Smoke Test / smoke (pull_request): failing after 18s

# GENOME.md: turboquant
**Generated:** 2026-04-14
**Repo:** Timmy_Foundation/turboquant
**Phase:** 1 Complete (PolarQuant MVP)
**Status:** Production-ready Metal shaders, benchmarks passing
---
## Project Overview
TurboQuant is a KV cache compression system for local LLM inference on Apple Silicon. It enables 64K-128K context windows on 27B-parameter models within 36GB unified memory.
Three-stage compression pipeline:
1. **PolarQuant** -- WHT rotation + polar coordinates + Lloyd-Max codebook (~4.2x compression)
2. **QJL** -- 1-bit quantized Johnson-Lindenstrauss residual correction
3. **TurboQuant** -- PolarQuant + QJL combined = ~3.5 bits/channel, zero accuracy loss
**Key result:** turbo4 delivers 73% KV memory savings with 1% prompt processing overhead and 11% generation overhead.
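The 73% figure is easy to sanity-check with back-of-envelope arithmetic. The model shape below (layers, KV heads, head dim for a 27B-class model) is an illustrative assumption, not read from this repo, and `kv_bytes` is a hypothetical helper:

```python
# Back-of-envelope KV cache sizing. The model shape is an assumed
# 27B-class configuration, NOT taken from this repo.
n_layers, n_kv_heads, head_dim = 46, 16, 128
ctx = 128 * 1024  # 128K context window

def kv_bytes(bits_per_channel, norm_bytes_per_block=4, block=128):
    elems = 2 * n_layers * n_kv_heads * head_dim * ctx  # K and V
    norms = elems / block * norm_bytes_per_block        # one norm per block
    return elems * bits_per_channel / 8 + norms

fp16 = kv_bytes(16, norm_bytes_per_block=0)  # uncompressed baseline
turbo4 = kv_bytes(4)                         # 4-bit codes + fp32 block norms
savings = 1 - turbo4 / fp16
print(f"fp16 {fp16 / 2**30:.1f} GiB, turbo4 {turbo4 / 2**30:.1f} GiB, "
      f"{savings:.0%} saved")  # ~73% saved under these assumptions
```

The ratio is independent of the model shape: 4/16 bits plus one fp32 norm per 128 channels gives 0.265625x the fp16 footprint, i.e. ~73% savings, matching the headline number.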
## Architecture
```mermaid
graph TD
A[LLM Inference] --> B[KV Cache Layer]
B --> C{TurboQuant Mode}
C -->|turbo2| D[2-bit PolarQuant]
C -->|turbo3| E[3-bit PolarQuant + QJL]
C -->|turbo4| F[4-bit PolarQuant]
D --> G[Metal Shader: encode/decode]
E --> G
F --> G
G --> H[Compressed KV Storage]
H --> I[Decompress on Attention]
I --> A
J[llama-turbo.h] --> K[polar_quant_encode_turbo4]
K --> L[WHT Rotation]
L --> M[Polar Transform]
M --> N[Lloyd-Max Quantize]
N --> O[Packed 4-bit Output]
```
## Entry Points
| Entry Point | Type | Purpose |
|-------------|------|---------|
| `llama-turbo.cpp` | C++ library | Core encode/decode functions |
| `llama-turbo.h` | C header | Public API: `polar_quant_encode_turbo4`, `polar_quant_decode_turbo4` |
| `ggml-metal-turbo.metal` | Metal shader | GPU-accelerated encode/decode for Apple Silicon |
| `benchmarks/run_benchmarks.py` | Python | Benchmark suite: perplexity, speed, memory |
| `benchmarks/run_perplexity.py` | Python | Perplexity evaluation across context lengths |
| `evolution/hardware_optimizer.py` | Python | Hardware-aware parameter tuning |
| `.gitea/workflows/smoke.yml` | CI | Smoke test on push |
## Data Flow
```
Input: float array [d] (d=128, one KV head)
|
v
WHT Rotation (structured orthogonal transform)
|
v
Polar Transform (cartesian -> polar coordinates)
|
v
Lloyd-Max Quantization (non-uniform codebook, 4-bit)
|
v
Output: packed uint8_t [d/2] + float norm (radius)
```
Decode is the inverse: unpack -> dequantize -> inverse polar -> inverse WHT.
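The pipeline above can be sketched end to end. Two loudly labeled simplifications: the polar pairing step is omitted, and a uniform codebook stands in for Lloyd-Max, so this demonstrates the rotate / quantize / inverse round trip rather than the exact turbo4 math. All names (`wht`, `encode_sketch`, `decode_sketch`) are illustrative, not the repo's API:

```python
import math
import random

def wht(x):
    """In-place orthonormal fast Walsh-Hadamard transform (len = power of 2).
    The orthonormal WHT is its own inverse."""
    h, n = 1, len(x)
    while h < n:
        for i in range(0, n, 2 * h):
            for j in range(i, i + h):
                a, b = x[j], x[j + h]
                x[j], x[j + h] = a + b, a - b
        h *= 2
    s = 1 / math.sqrt(n)
    for i in range(n):
        x[i] *= s
    return x

def encode_sketch(src, levels=16):
    # Rotate, record the block norm, quantize each channel uniformly.
    # (Real turbo4: polar pairing + Lloyd-Max codebook + 4-bit packing.)
    x = wht(list(src))
    norm = math.sqrt(sum(v * v for v in x)) or 1.0
    codes = [max(0, min(levels - 1, int((v / norm + 1) / 2 * levels)))
             for v in x]
    return codes, norm

def decode_sketch(codes, norm, levels=16):
    centers = [((c + 0.5) / levels * 2 - 1) * norm for c in codes]
    return wht(centers)  # inverse rotation

random.seed(0)
d = 128
v = [random.gauss(0, 1 / math.sqrt(d)) for _ in range(d)]
codes, norm = encode_sketch(v)
rec = decode_sketch(codes, norm)
err = max(abs(a - b) for a, b in zip(v, rec))
```

Because the orthonormal WHT preserves L2 norms, quantization error in the rotated domain maps directly to reconstruction error in the original domain.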
## Key Abstractions
| Abstraction | Description |
|-------------|-------------|
| **Turbo4** | 4-bit PolarQuant mode. Best quality. 4.2x compression. |
| **Turbo3** | 3-bit mode with QJL residual. ~3.5 bits/channel. |
| **Turbo2** | 2-bit mode. Maximum compression. Quality tradeoff. |
| **WHT** | Walsh-Hadamard Transform. Structured orthogonal rotation. |
| **Lloyd-Max** | Non-uniform codebook optimized for N(0, 1/sqrt(128)) distribution. |
| **QJL** | Quantized Johnson-Lindenstrauss. 1-bit residual correction. |
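The Lloyd-Max entry above can be made concrete by running Lloyd's algorithm on samples from the stated distribution. This is a generic sketch (hypothetical `lloyd_max` helper), not the repo's codebook trainer, and it treats the 1/sqrt(128) in the table as the channel standard deviation:

```python
import math
import random

def lloyd_max(samples, levels=16, iters=50):
    # Lloyd's algorithm for a scalar quantizer: alternate nearest-centroid
    # assignment (cell boundaries = midpoints) with centroid = cell mean.
    samples = sorted(samples)
    lo, hi = samples[0], samples[-1]
    centers = [lo + (hi - lo) * (i + 0.5) / levels for i in range(levels)]
    for _ in range(iters):
        bounds = [(a + b) / 2 for a, b in zip(centers, centers[1:])]
        cells = [[] for _ in range(levels)]
        k = 0
        for s in samples:  # samples are sorted, so sweep boundaries once
            while k < levels - 1 and s > bounds[k]:
                k += 1
            cells[k].append(s)
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(cells)]
    return centers

random.seed(1)
sigma = 1 / math.sqrt(128)  # assumed channel std after rotation
train = [random.gauss(0, sigma) for _ in range(20000)]
codebook = lloyd_max(train)
```

The resulting centroids are non-uniformly spaced, dense near zero where the Gaussian mass concentrates and sparse in the tails, which is exactly why a Lloyd-Max codebook beats uniform quantization at 4 bits.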
## API Surface
### C API (llama-turbo.h)
```c
// Encode: float [d] -> packed 4-bit [d/2] + norm
void polar_quant_encode_turbo4(const float* src, uint8_t* dst, float* norm, int d);
// Decode: packed 4-bit [d/2] + norm -> float [d]
void polar_quant_decode_turbo4(const uint8_t* src, float* dst, float norm, int d);
```
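The "packed 4-bit `[d/2]`" layout in the signatures above can be illustrated in a few lines. The nibble order (even index in the low nibble) is an assumption for illustration, not read from `llama-turbo.cpp`:

```python
def pack4(codes):
    # Two 4-bit codes per byte; even index -> low nibble (assumed order).
    assert len(codes) % 2 == 0
    return bytes((codes[i] & 0xF) | ((codes[i + 1] & 0xF) << 4)
                 for i in range(0, len(codes), 2))

def unpack4(packed):
    out = []
    for b in packed:
        out.extend((b & 0xF, b >> 4))
    return out

codes = [3, 14, 0, 15, 7, 8]
assert unpack4(pack4(codes)) == codes       # lossless round trip
assert len(pack4(codes)) == len(codes) // 2  # the d/2 in the C signatures
```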
### Integration
TurboQuant integrates with llama.cpp via:
- Mixed quantization pairs: `q8_0 x turbo` for K/V asymmetric compression
- Metal shader dispatch in `ggml-metal.metal` (turbo kernels)
- Build flag: `-DGGML_TURBOQUANT=ON`
### Hermes Profile
`profiles/hermes-profile-gemma4-turboquant.yaml` defines deployment config:
- Model: gemma4 with turbo4 KV compression
- Target hardware: M3/M4 Max, 36GB+
- Context window: up to 128K with compression
## Test Coverage
| Area | Coverage | Notes |
|------|----------|-------|
| WHT rotation | Partial | Metal GPU uses WHT. CPU ref uses dense random (legacy). |
| Encode/decode symmetry | Full | Round trip verified: `turbo_rotate_inverse()` inverts `turbo_rotate_forward()` |
| Lloyd-Max codebook | Full | Non-uniform centroids verified |
| Radius precision | Full | FP16+ norm per 128-element block |
| Metal shader correctness | Full | All dk32-dk576 variants tested |
| Perplexity benchmarks | Full | WikiText results in `benchmarks/perplexity_results.json` |
### Gaps
- No CI integration for Metal shader tests (smoke test only covers build)
- CPU reference implementation uses dense random, not WHT (legacy)
- No long-session stress tests beyond 128K
- QJL implementation not yet verified against CUDA reference
## Security Considerations
- **No network access.** All inference is local.
- **No user data in repo.** Benchmarks use public WikiText corpus.
- **Native code.** `llama-turbo.cpp` compiles to unsandboxed native code.
- **Upstream dependency.** Fork of TheTom/llama-cpp-turboquant. Trust boundary at upstream.
## Dependencies
| Dependency | Type | Source |
|------------|------|--------|
| llama.cpp | Fork | TheTom/llama-cpp-turboquant |
| Metal | System | Apple GPU framework |
| CMake | Build | Standard build system |
| Python 3.10+ | Scripts | Benchmarks and optimizer |
## Key Files
```
turboquant/
llama-turbo.h # C API header
llama-turbo.cpp # Core encode/decode implementation
ggml-metal-turbo.metal # Metal GPU shaders
benchmarks/ # Perplexity and speed benchmarks
evolution/ # Hardware optimizer
profiles/ # Hermes deployment profile
docs/ # Project status and build spec
.gitea/workflows/ # CI smoke test
```