# TurboQuant

KV cache compression for local inference on M4 Max MacBook Pro.

## What

TurboQuant (Google, ICLR 2026) is a three-stage KV cache compression method:

1. **PolarQuant**: WHT rotation + polar coordinates + Lloyd-Max codebook (~4.2x compression)
2. **QJL**: 1-bit quantized Johnson-Lindenstrauss residual correction
3. **TurboQuant**: PolarQuant + QJL = ~3.5 bits/channel, zero accuracy loss

## Why

Unlock 64K-128K context on qwen3.5:27b within 32GB unified memory.

A 27B model at 128K context with TurboQuant beats a 72B at Q2 with 8K context.

## Status

See [issues](http://143.198.27.163:3000/Timmy_Foundation/turboquant/issues) for current progress.

## Building

### Prerequisites

- CMake 3.10+
- C++11 compiler
- Xcode Command Line Tools (for Metal on macOS)

### Build Instructions

```bash
# Clone the repository
git clone https://forge.alexanderwhitestone.com/Timmy_Foundation/turboquant.git
cd turboquant

# Build with CMake
cmake -B build -DCMAKE_BUILD_TYPE=Release
cmake --build build

# Run tests
cd build && ctest
```

### Integration with llama.cpp

See [PR-IMPLEMENTATION-PLAN.md](PR-IMPLEMENTATION-PLAN.md) for integration steps.
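For a rough sense of the memory budget described under Why, the sketch below estimates KV cache size from model shape and bits per channel. It is illustrative only: `kv_cache_bytes` is a hypothetical helper, not part of this repository, and it ignores the per-vector norm that the codec stores alongside the packed indices.

```c
#include <assert.h>

/* Estimate the bytes needed to hold the K and V caches.
 * Hypothetical helper for back-of-envelope sizing; not part of this repo.
 * The factor 2.0 accounts for storing both keys and values. */
static double kv_cache_bytes(int n_layers, int n_kv_heads, int head_dim,
                             long n_ctx, double bits_per_channel)
{
    assert(n_layers > 0 && n_kv_heads > 0 && head_dim > 0 && n_ctx > 0);
    double channels = 2.0 * n_layers * n_kv_heads * head_dim * (double)n_ctx;
    return channels * bits_per_channel / 8.0;
}
```

With an assumed shape of 64 layers, 8 KV heads, and head dimension 128 (placeholder values, not the actual qwen3.5:27b configuration), a 131072-token context needs 32 GiB at 16 bits/channel and 7 GiB at 3.5 bits/channel.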

## API

### CPU Reference Implementation

```c
#include <stdint.h>  // uint8_t

// Encode: Compress float vector to 4-bit packed representation
void polar_quant_encode_turbo4(
    const float* src,   // Input: float array [d]
    uint8_t* dst,       // Output: packed 4-bit indices [d/2]
    float* norm,        // Output: L2 norm (radius)
    int d               // Dimension (must be power of 2, e.g., 128)
);

// Decode: Decompress 4-bit packed representation to float vector
void polar_quant_decode_turbo4(
    const uint8_t* src, // Input: packed 4-bit indices [d/2]
    float* dst,         // Output: float array [d]
    float norm,         // Input: L2 norm (radius)
    int d               // Dimension (must be power of 2, e.g., 128)
);
```

### Metal Shaders

See `ggml-metal-turbo.metal` for GPU-accelerated kernels:

- `kernel_fwht_128`: Fast Walsh-Hadamard Transform
- `kernel_turbo4_dequant`: Dequantization for attention
- `kernel_attention_turbo4`: Fused attention computation
- `kernel_attention_turbo4_softmax`: Fused attention with softmax
- `kernel_turbo4_encode`: Encoding on GPU

## Contributing

### Getting Started

1. Fork the repository
2. Create a feature branch: `git checkout -b feature/your-feature`
3. Make your changes
4. Add tests for new functionality
5. Run the test suite: `cd build && ctest`
6. Submit a pull request

### Code Style

- C++11 standard
- 4-space indentation
- `snake_case` for functions and variables
- `UPPER_CASE` for constants
- Add comments for complex algorithms

### Testing

- All new code must have unit tests
- Run tests before submitting PR: `cd build && ctest`
- Test on both CPU and Metal (if applicable)

### Pull Request Process

1. Update documentation if needed
2. Add tests for new functionality
3. Ensure all tests pass
4. Request review from maintainers

### Issues

- Use issue templates when available
- Tag issues appropriately (`bug`, `enhancement`, `documentation`)
- Include reproduction steps for bugs
- For performance issues, include benchmark results

## Roles

- **Strago:** Build spec author
- **Cid:** Implementation, benchmarks, deployment
- **Locke:** Research support, upstream watch
- **John:** Quality review
- **Frankie:** Coordination

## Source Repos

- [TheTom/llama-cpp-turboquant](https://github.com/TheTom/llama-cpp-turboquant): llama.cpp fork with Metal
- [TheTom/turboquant_plus](https://github.com/TheTom/turboquant_plus): Reference impl, 511+ tests
- [amirzandieh/QJL](https://github.com/amirzandieh/QJL): Author QJL code (CUDA)
- [rachittshah/mlx-turboquant](https://github.com/rachittshah/mlx-turboquant): MLX fallback

## Docs

- [BUILD-SPEC.md](BUILD-SPEC.md): Full build specification (Strago, v2.2)
- [docs/PROJECT_STATUS.md](docs/PROJECT_STATUS.md): Current project status
- [docs/INITIATIVE_REVIEW.md](docs/INITIATIVE_REVIEW.md): Initiative review and feedback
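
As background for the `kernel_fwht_128` entry under Metal Shaders and the power-of-two requirement on `d` in the API, here is a minimal CPU sketch of an in-place fast Walsh-Hadamard transform. `fwht_inplace` is a hypothetical name used for illustration, not a function exported by this repository, and the project's kernels may differ in scaling, ordering, or added sign flips.

```c
#include <assert.h>
#include <stddef.h>

/* In-place fast Walsh-Hadamard transform with orthonormal scaling.
 * Hypothetical illustration; not the project's implementation.
 * d must be a power of two. Each butterfly stage is scaled by 1/sqrt(2),
 * so the full transform preserves the L2 norm (up to float rounding). */
static void fwht_inplace(float *x, size_t d)
{
    const float inv_sqrt2 = 0.70710678f;

    assert(x != NULL);
    assert(d != 0 && (d & (d - 1)) == 0);  /* power-of-two check */

    for (size_t h = 1; h < d; h <<= 1) {          /* log2(d) stages */
        for (size_t i = 0; i < d; i += h << 1) {  /* blocks of size 2h */
            for (size_t j = i; j < i + h; ++j) {  /* butterfly pairs */
                float a = x[j];
                float b = x[j + h];
                x[j]     = (a + b) * inv_sqrt2;
                x[j + h] = (a - b) * inv_sqrt2;
            }
        }
    }
}
```

With this scaling the transform is its own inverse, so calling `fwht_inplace` twice on the same buffer returns the original vector up to rounding error.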