Compare commits: 2 commits (f152dd1ea0 → a075d3855f)

# GENOME.md — TurboQuant (Timmy_Foundation/turboquant)

> Codebase Genome v1.1 | Refreshed 2026-04-21 for timmy-home #679

## Project Overview

**TurboQuant** is a KV-cache compression system for local inference on Apple Silicon, implementing Google's ICLR 2026 paper to unlock 64K-128K context on 27B models within 32GB unified memory. It is local-first: its center of gravity is not a web service or product UI but a compact implementation and verification surface around the PolarQuant, QJL, and TurboQuant compression modes for llama.cpp-style runtimes.

**Three-stage compression:**

1. **PolarQuant** — WHT rotation + polar coordinates + Lloyd-Max codebook (~4.2x compression)
2. **QJL** — 1-bit quantized Johnson-Lindenstrauss residual correction
3. **TurboQuant** — PolarQuant + QJL = ~3.5 bits/channel, zero accuracy loss

The repo mixes three lanes:

1. a small native core (`llama-turbo.cpp`, `llama-turbo.h`, `ggml-metal-turbo.metal`)
2. benchmark and evaluation scripts (`benchmarks/`)
3. operational selection/profile helpers (`evolution/`, `profiles/`)

**Key result:** 73% KV memory savings with 1% prompt processing overhead, 11% generation overhead.

Current `main` already builds a standalone CMake target and roundtrip C++ test locally, but the checked-in `.gitea/workflows/smoke.yml` still performs only parse + secret scanning. That mismatch matters: the codebase can be build-healthy while CI remains blind to the most important executable path.
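The 73% headline can be sanity-checked against the storage layout given under Data Flow (packed 4-bit indices in `d/2` bytes plus one 4-byte fp32 norm per `d=128` vector). A minimal back-of-envelope sketch, assuming an fp16 (16 bits/channel) baseline; everything here is plain arithmetic over figures already quoted in this document:

```python
# Back-of-envelope check of the headline compression figure.
# Assumes the per-vector layout from the Data Flow section:
#   d/2 packed bytes of 4-bit indices + one 4-byte fp32 norm, d = 128.
d = 128
packed_bits = (d // 2) * 8      # 4-bit indices, two per byte
norm_bits = 4 * 8               # fp32 L2 norm stored per vector
bits_per_channel = (packed_bits + norm_bits) / d   # 4.25
savings_vs_fp16 = 1 - bits_per_channel / 16        # ~0.734

print(f"{bits_per_channel} bits/channel, {savings_vs_fp16:.0%} savings vs fp16")
```

That 4.25 bits/channel is the PolarQuant stage alone; the ~3.5 bits/channel quoted for full TurboQuant comes from a different bit allocation that this sketch does not model.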

## Architecture

```mermaid
graph TD
    subgraph Compression_Core
        H[llama-turbo.h] --> CPP[llama-turbo.cpp]
        CPP --> PQ[PolarQuant encode/decode]
        CPP --> QJL[QJL residual helpers]
    end

    subgraph Metal_Runtime
        MTL[ggml-metal-turbo.metal]
        PQ --> MTL
        QJL --> MTL
        MTL --> ATTN[KV dequant + attention path]
    end

    subgraph Verification
        CMAKE[CMakeLists.txt] --> ROUNDTRIP[tests/roundtrip_test.cpp]
        ROUNDTRIP --> CTEST[ctest --test-dir build --output-on-failure]
        BENCH[benchmarks/run_benchmarks.py] --> PPL[benchmarks/run_perplexity.py]
        PYTEST[tests/test_quant_selector.py]
        PYTEST2[tests/test_tool_call_integration.py]
    end

    subgraph Ops_Surface
        QS[evolution/quant_selector.py]
        HO[evolution/hardware_optimizer.py]
        PROFILE[profiles/hermes-profile-gemma4-turboquant.yaml]
    end

    ATTN --> PROFILE
    BENCH --> QS
    QS --> PROFILE
```

## Entry Points

| Entry point | File | Purpose |
| --- | --- | --- |
| Native encode/decode API (`polar_quant_encode_turbo4()`, `polar_quant_decode_turbo4()`) | `llama-turbo.h` / `llama-turbo.cpp` | Public compression functions and implementation |
| Metal kernels (`turbo_dequantize_k`, `turbo_dequantize_v`, `turbo_fwht_128`) | `ggml-metal-turbo.metal` | Apple GPU execution path for dequant / rotation helpers |
| Standalone build | `CMakeLists.txt` | Builds static library plus `turboquant_roundtrip_test` |
| Benchmark suite | `benchmarks/run_benchmarks.py` | Speed / quality benchmark runner |
| Perplexity harness | `benchmarks/run_perplexity.py` | Prompt-quality regression measurement |
| Long-session harness | `benchmarks/run_long_session.py` | Extended runtime exercise |
| Quant chooser | `evolution/quant_selector.py` | Memory-aware preset selection |
| Hardware optimizer stub | `evolution/hardware_optimizer.py` | Hardware tuning placeholder/stub |
| Hermes profile | `profiles/hermes-profile-gemma4-turboquant.yaml` | Deployment config for compressed long-context runs |
| Repo CI floor | `.gitea/workflows/smoke.yml` | Parse + secret scan only on current `main` |

## Data Flow

```
Input: float KV vectors [d=128 per head]
  ↓
1. WHT rotation (in-place, O(d log d))
  ↓
2. Convert to polar coords (radius, angles)
  ↓
3. Lloyd-Max quantize angles → 4-bit indices
  ↓
4. Store: packed indices [d/2 bytes] + float norm [4 bytes]
  ↓
Decode: indices → codebook lookup → polar → cartesian → inverse WHT
  ↓
Output: reconstructed float KV [d=128]
```

At repo level, the same pipeline plays out across files:

1. Input KV vectors enter the native compression path in `llama-turbo.cpp`.
2. The encoder performs WHT rotation, transforms into polar/codebook space, and writes packed KV outputs plus norms.
3. `ggml-metal-turbo.metal` supplies the Apple Silicon GPU-side decode and attention helpers for runtime execution.
4. `tests/roundtrip_test.cpp` validates encode→decode fidelity via the standalone CMake/CTest path.
5. Python benchmark scripts consume the native/built surface to measure speed and quality.
6. `evolution/quant_selector.py` translates machine/resource constraints into quantization choices that feed the deployment profile in `profiles/hermes-profile-gemma4-turboquant.yaml`.
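The encode/decode steps above can be sketched end-to-end in a few lines. This is an illustrative NumPy model, not the repo's implementation: the orthonormal WHT is real, but a uniform 4-bit grid with a per-vector scale stands in for the polar-coordinate Lloyd-Max codebook, so the reconstruction error is only indicative.

```python
import numpy as np

def fwht(x):
    """Orthonormal fast Walsh-Hadamard transform, O(d log d); self-inverse."""
    x = x.astype(np.float32).copy()
    n = len(x)  # must be a power of two (d = 128 in this repo)
    h = 1
    while h < n:
        for i in range(0, n, 2 * h):
            a = x[i:i + h].copy()
            b = x[i + h:i + 2 * h].copy()
            x[i:i + h] = a + b
            x[i + h:i + 2 * h] = a - b
        h *= 2
    return x / np.float32(np.sqrt(n))

def encode_4bit(v):
    # Steps 1-4: rotate, then quantize to 4-bit indices plus one float per vector.
    # (Uniform grid here; the real codepath uses polar coords + Lloyd-Max.)
    r = fwht(v)
    scale = np.abs(r).max()
    idx = np.round((r / scale + 1) / 2 * 15).astype(np.uint8)  # 16 levels
    return idx, scale

def decode_4bit(idx, scale):
    # Decode: indices -> grid levels -> inverse rotation (WHT is self-inverse).
    r = (idx.astype(np.float32) / 15 * 2 - 1) * scale
    return fwht(r)

rng = np.random.default_rng(0)
v = rng.standard_normal(128).astype(np.float32)
v_hat = decode_4bit(*encode_4bit(v))
rel_err = float(np.linalg.norm(v - v_hat) / np.linalg.norm(v))
```

With a Gaussian test vector the relative reconstruction error of this naive grid lands on the order of 10%, which is one way to see why the real path spends its 4 bits on a fitted Lloyd-Max codebook over angles instead.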

## Key Abstractions

| Abstraction | File | Role |
| --- | --- | --- |
| `polar_quant_encode_turbo4` / `polar_quant_decode_turbo4` | `llama-turbo.cpp` | Core public encode/decode path for TurboQuant |
| `turboquant_roundtrip_test` | `tests/roundtrip_test.cpp` | Executable spec for standalone correctness |
| `QuantLevel` / `QUANT_LEVELS` | `evolution/quant_selector.py` | Encodes quality/compression presets and fallback ordering |
| `select_quant_for_memory(...)`-style selection logic | `evolution/quant_selector.py` | Converts hardware budgets into quant choices |
| Benchmark runners | `benchmarks/run_benchmarks.py`, `benchmarks/run_perplexity.py` | Measure regression envelope for performance/quality |
| Deployment profile | `profiles/hermes-profile-gemma4-turboquant.yaml` | Operationalizes the repo into a Hermes runtime |
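The `QuantLevel` / `select_quant_for_memory` pairing can be pictured with a small sketch. Everything below — field names, preset bit-widths, head-geometry defaults, and the fallback order — is hypothetical, not taken from `evolution/quant_selector.py` (the real ordering of `turbo2` versus `q4_0` is exactly what `turboquant #139` disputes):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class QuantLevel:
    name: str
    bits_per_channel: float  # effective KV bits; illustrative figures only

# Best-quality-first; later entries are fallbacks for tighter memory budgets.
QUANT_LEVELS = [
    QuantLevel("fp16", 16.0),
    QuantLevel("q8_0", 8.5),
    QuantLevel("q4_0", 4.5),
    QuantLevel("turbo2", 3.5),
]

def select_quant_for_memory(kv_budget_bytes, n_tokens, n_heads=16, head_dim=128):
    """Return the highest-quality preset whose KV cache fits the budget."""
    channels = 2 * n_tokens * n_heads * head_dim  # K and V caches
    for level in QUANT_LEVELS:
        if channels * level.bits_per_channel / 8 <= kv_budget_bytes:
            return level
    return QUANT_LEVELS[-1]  # nothing fits: degrade to the tightest preset
```

Under these made-up figures, a 3 MB budget at 1000 tokens falls through `fp16` and `q8_0` and settles on `q4_0`; a budget too small for anything degrades to `turbo2`.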

## API Surface

### Native API

The repo exposes a narrow C/C++ surface through `llama-turbo.h`:

- encode float KV blocks into packed TurboQuant form
- decode packed blocks back into float KV

### Verification / CLI surface

Operationally, the repo is exercised via command entry points rather than a service API:

- `cmake -S . -B build -DTURBOQUANT_BUILD_TESTS=ON`
- `cmake --build build`
- `ctest --test-dir build --output-on-failure`
- `python3 benchmarks/run_benchmarks.py`
- `python3 benchmarks/run_perplexity.py`

### Python operational surface

The repo also exposes Python analysis / selection code:

- `evolution/quant_selector.py`
- `evolution/hardware_optimizer.py`
- `tests/test_quant_selector.py`
- `tests/test_tool_call_integration.py`

## Repository Surface Snapshot

Current high-signal files on `main`:

- `README.md`
- `docs/PROJECT_STATUS.md`
- `CMakeLists.txt`
- `llama-turbo.cpp`
- `llama-turbo.h`
- `ggml-metal-turbo.metal`
- `tests/roundtrip_test.cpp`
- `tests/test_quant_selector.py`
- `tests/test_tool_call_integration.py`
- `evolution/quant_selector.py`
- `evolution/hardware_optimizer.py`
- `profiles/hermes-profile-gemma4-turboquant.yaml`
- `.gitea/workflows/smoke.yml`

Using `pygount --format=summary`, the current repo footprint is roughly:

- 29 tracked source/doc/config files in the main tree (excluding build output)
- 1535 lines of code
- Python is the largest language block, followed by YAML and C++

## Runtime Verification (2026-04-21)

Verified directly against a fresh clone of `Timmy_Foundation/turboquant`:

```bash
cmake -S . -B build -DTURBOQUANT_BUILD_TESTS=ON
cmake --build build
ctest --test-dir build --output-on-failure
python3 -m pytest tests/test_quant_selector.py -q
```

Observed results:

- standalone CMake configure/build passes
- `ctest --test-dir build --output-on-failure` passes on current `main`
- `tests/test_quant_selector.py` is not fully green on current `main`
- the failing lane is already tracked as `turboquant #139`

## Dependencies

| Dependency | Purpose |
|------------|---------|
| CMake 3.16+ | Build system |
| C++17 compiler | Core implementation |
| Metal (macOS) | GPU shader execution |
| Python 3.11+ | Benchmarks |
| llama.cpp fork | Integration target |

## Source Repos (Upstream)

| Repo | Role |
|------|------|
| TheTom/llama-cpp-turboquant | llama.cpp fork with Metal shaders |
| TheTom/turboquant_plus | Reference impl, 511+ tests |
| amirzandieh/QJL | Author QJL code (CUDA) |
| rachittshah/mlx-turboquant | MLX fallback |

## Test Coverage

Present coverage on the target repo includes:

- `tests/roundtrip_test.cpp` — core encode/decode roundtrip executable validation
- `tests/test_quant_selector.py` — Python quant preset / selection behavior
- `tests/test_tool_call_integration.py` — integration-oriented Python coverage
- benchmark scripts for performance and perplexity measurement

## Test Coverage Gaps

1. `.gitea/workflows/smoke.yml` on current `main` still omits the standalone build/test path.
   - The workflow checks parse/secret hygiene only.
   - It does not execute `cmake --build build` or `ctest --test-dir build --output-on-failure`.
   - This gap is already being addressed separately in turboquant issue/PR work around `#50` / PR `#137`.
2. `tests/test_quant_selector.py` currently has a real failing assertion on current `main`.
   - The failure is tracked in `turboquant #139`.
   - The mismatch is around ordering expectations between the `turbo2` and `q4_0` presets.
3. `evolution/hardware_optimizer.py` remains a very thin stub relative to the repo's ambition.
   - The genome should treat it as a placeholder, not a mature optimizer layer.
4. There is still no broad end-to-end CI lane that combines:
   - native build
   - Python tests
   - benchmark smoke
   - profile validation

## Security Considerations

- Strong privacy posture: the repo is fundamentally local-first and computational, not network-oriented, with no obvious telemetry surface in the core implementation.
- Native code (raw-pointer C++ and Metal shaders) increases the blast radius of correctness bugs; correctness verification matters more here than in pure Python tooling.
- Build artifacts are not sandboxed by default; the CMake build produces a statically linked library, and consumers inherit the trust boundary of local execution.
- Deployment profiles can route this code into larger local inference systems, so stale configuration guidance is an operational risk even when the core compression code is sound.

## Current Drift and Operational Findings

- The refreshed genome here is more current than the older generated snapshot because it explicitly incorporates the present repo surface (`tests/test_quant_selector.py`, `tests/test_tool_call_integration.py`, `evolution/quant_selector.py`, and `profiles/hermes-profile-gemma4-turboquant.yaml`).
- Current `main` still has a CI/runtime visibility gap: `.gitea/workflows/smoke.yml` does not yet run the native build/test path even though local runtime verification passes.
- `turboquant #139` captures the currently failing `tests/test_quant_selector.py` ordering assertion and should remain referenced as live drift until fixed upstream.

## Operational Risk Summary

- No network requirement for core compression logic.
- Benchmarks and profiles are local-file driven.
- The main security/quality risks are correctness drift, stale CI, and native-code trust boundaries rather than data exfiltration.

## Conclusion

TurboQuant is a compact but real codebase: a native compression core, a Metal runtime surface, benchmark harnesses, quant-selection logic, and Hermes deployment configuration. The core standalone path is healthy under manual verification, but the repo still depends on human discipline because CI does not yet enforce the same native build/test contract and the quant-selector lane remains partially red on `main`.

That makes the repo strong on local sovereignty and promising on runtime execution, but still operationally immature in continuous verification. The refreshed genome should be read as both an architecture map and a warning: the compression core is ahead of its automated guardrails.

---

New file in this change: `tests/test_turboquant_genome.py` (44 lines):

```python
from pathlib import Path
import unittest


ROOT = Path(__file__).resolve().parent.parent
GENOME_PATH = ROOT / "genomes" / "turboquant" / "GENOME.md"


class TestTurboquantGenome(unittest.TestCase):
    def test_genome_exists_with_required_sections(self):
        self.assertTrue(GENOME_PATH.exists(), "missing genomes/turboquant/GENOME.md")
        text = GENOME_PATH.read_text(encoding="utf-8")
        required_sections = [
            "# GENOME.md — TurboQuant (Timmy_Foundation/turboquant)",
            "## Project Overview",
            "## Architecture",
            "```mermaid",
            "## Entry Points",
            "## Data Flow",
            "## Key Abstractions",
            "## Test Coverage",
            "## Security Considerations",
        ]
        for section in required_sections:
            self.assertIn(section, text)

    def test_genome_mentions_current_runtime_and_ci_findings(self):
        text = GENOME_PATH.read_text(encoding="utf-8")
        required_snippets = [
            ".gitea/workflows/smoke.yml",
            "ctest --test-dir build --output-on-failure",
            "tests/test_quant_selector.py",
            "tests/test_tool_call_integration.py",
            "evolution/quant_selector.py",
            "profiles/hermes-profile-gemma4-turboquant.yaml",
            "pygount",
            "turboquant #139",
        ]
        for snippet in required_snippets:
            self.assertIn(snippet, text)


if __name__ == "__main__":
    unittest.main()
```