# TurboQuant

KV cache compression for local inference on M4 Max MacBook Pro.

## What

TurboQuant (Google, ICLR 2026) is a three-stage KV cache compression method:

1. **PolarQuant:** Walsh-Hadamard transform (WHT) rotation + polar coordinates + Lloyd-Max codebook (~4.2x compression)
2. **QJL:** 1-bit quantized Johnson-Lindenstrauss residual correction
3. **TurboQuant:** PolarQuant + QJL = ~3.5 bits/channel, zero accuracy loss

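The rotation step in stage 1 can be illustrated with a small CPU sketch. The following is a minimal in-place fast Walsh-Hadamard transform with orthonormal scaling, assuming a power-of-two length. `fwht_orthonormal` is a hypothetical name used for illustration only; it is not taken from this repository's sources and does not cover the polar-coordinate or codebook steps.

```c
#include <assert.h>
#include <stddef.h>

/* In-place fast Walsh-Hadamard transform, scaled to be orthonormal.
 * Hypothetical helper for illustration; n must be a power of two. */
void fwht_orthonormal(float *x, size_t n) {
    const float inv_sqrt2 = 0.70710678f; /* 1/sqrt(2), applied once per stage */
    assert(n != 0 && (n & (n - 1)) == 0);
    for (size_t h = 1; h < n; h <<= 1) {          /* log2(n) butterfly stages */
        for (size_t i = 0; i < n; i += h << 1) {  /* blocks of size 2h */
            for (size_t j = i; j < i + h; ++j) {
                float a = x[j];
                float b = x[j + h];
                x[j]     = (a + b) * inv_sqrt2;
                x[j + h] = (a - b) * inv_sqrt2;
            }
        }
    }
}
```

Because the scaled transform is symmetric and orthonormal, applying it twice returns the input (up to float rounding), which makes a convenient sanity check.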
## Why

Unlock 64K-128K context on qwen3.5:27b within 32GB unified memory.

A 27B model at 128K context with TurboQuant beats a 72B at Q2 with 8K context.

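The memory argument comes down to simple arithmetic: the KV cache stores one key and one value channel per layer, KV head, head dimension, and token. A rough sketch of that calculation follows. `kv_cache_bytes` is a hypothetical helper, and the model shape used in the note below (64 layers, 8 KV heads, head dimension 128) is a made-up example, not the actual qwen3.5:27b configuration. Per-vector metadata such as stored norms is ignored.

```c
#include <assert.h>
#include <stdint.h>

/* Approximate KV cache size in bytes.
 * Hypothetical helper: counts K and V channels only and ignores
 * per-vector metadata (for example, stored norms). */
double kv_cache_bytes(uint64_t n_layers, uint64_t n_kv_heads,
                      uint64_t head_dim, uint64_t n_tokens,
                      double bits_per_channel) {
    assert(bits_per_channel > 0.0);
    /* Factor of 2 accounts for keys and values. */
    double channels = 2.0 * (double)n_layers * (double)n_kv_heads *
                      (double)head_dim * (double)n_tokens;
    return channels * bits_per_channel / 8.0;
}
```

With the made-up shape above and 131072 tokens, 16 bits/channel works out to 32 GiB, while 3.5 bits/channel works out to 7 GiB. Substitute the real model dimensions before drawing conclusions.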
## Status

See [issues](http://143.198.27.163:3000/Timmy_Foundation/turboquant/issues) for current progress.

## Building

### Prerequisites

- CMake 3.10+
- C++11 compiler
- Xcode Command Line Tools (for Metal on macOS)

### Build Instructions

```bash
# Clone the repository
git clone https://forge.alexanderwhitestone.com/Timmy_Foundation/turboquant.git
cd turboquant

# Build with CMake
cmake -B build -DCMAKE_BUILD_TYPE=Release
cmake --build build

# Run tests
cd build && ctest
```

### Integration with llama.cpp

See [PR-IMPLEMENTATION-PLAN.md](PR-IMPLEMENTATION-PLAN.md) for integration steps.

## API

### CPU Reference Implementation

```c
#include <stdint.h>  // uint8_t

// Encode: Compress float vector to 4-bit packed representation
void polar_quant_encode_turbo4(
    const float* src,   // Input: float array [d]
    uint8_t* dst,       // Output: packed 4-bit indices [d/2]
    float* norm,        // Output: L2 norm (radius)
    int d               // Dimension (must be power of 2, e.g., 128)
);

// Decode: Decompress 4-bit packed representation to float vector
void polar_quant_decode_turbo4(
    const uint8_t* src, // Input: packed 4-bit indices [d/2]
    float* dst,         // Output: float array [d]
    float norm,         // Input: L2 norm (radius)
    int d               // Dimension (must be power of 2, e.g., 128)
);
```

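The `[d/2]` buffer size follows from storing two 4-bit indices per byte. The sketch below shows one way such packing could work. `pack_nibbles` and `unpack_nibbles` are hypothetical helpers, and the nibble order (even-numbered element in the low nibble) is an assumption; check the implementation of `polar_quant_encode_turbo4` for the layout actually used.

```c
#include <assert.h>
#include <stdint.h>

/* Pack d 4-bit indices (values 0..15) into d/2 bytes.
 * ASSUMPTION: the even-numbered element goes in the low nibble. */
void pack_nibbles(const uint8_t *idx, uint8_t *dst, int d) {
    assert(d > 0 && d % 2 == 0);
    for (int i = 0; i < d; i += 2) {
        assert(idx[i] < 16 && idx[i + 1] < 16);
        dst[i / 2] = (uint8_t)(idx[i] | (idx[i + 1] << 4));
    }
}

/* Inverse of pack_nibbles: expand d/2 bytes back into d indices. */
void unpack_nibbles(const uint8_t *src, uint8_t *idx, int d) {
    assert(d > 0 && d % 2 == 0);
    for (int i = 0; i < d; i += 2) {
        idx[i]     = src[i / 2] & 0x0F;
        idx[i + 1] = src[i / 2] >> 4;
    }
}
```

For `d = 128`, the packed indices occupy 64 bytes, plus whatever storage the caller uses for the norm.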
### Metal Shaders

See `ggml-metal-turbo.metal` for GPU-accelerated kernels:

- `kernel_fwht_128`: Fast Walsh-Hadamard Transform
- `kernel_turbo4_dequant`: Dequantization for attention
- `kernel_attention_turbo4`: Fused attention computation
- `kernel_attention_turbo4_softmax`: Fused attention with softmax
- `kernel_turbo4_encode`: Encoding on GPU

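One way to read "fused attention" is that the query-key dot product is computed directly from the packed indices, without first writing a dequantized key to memory. The CPU sketch below illustrates that idea only; it does not mirror the actual Metal kernels. It rests on several assumptions not confirmed by this README: the codebook is passed in by the caller, the even-numbered element sits in the low nibble, each decoded channel is a codebook entry scaled by the stored norm, and the query has already been rotated into the same basis as the keys (an orthonormal rotation preserves dot products). `turbo4_dot_sketch` is a hypothetical name.

```c
#include <assert.h>
#include <stdint.h>

/* Dot product between a rotated query and one packed 4-bit key.
 * Hypothetical sketch: codebook layout, nibble order, and norm
 * scaling are all assumptions made for illustration. */
float turbo4_dot_sketch(const float *q_rot, const uint8_t *packed_k,
                        const float *codebook, float norm, int d) {
    assert(d > 0 && d % 2 == 0);
    float acc = 0.0f;
    for (int i = 0; i < d; i += 2) {
        uint8_t byte = packed_k[i / 2];
        acc += q_rot[i]     * codebook[byte & 0x0F]; /* low nibble  */
        acc += q_rot[i + 1] * codebook[byte >> 4];   /* high nibble */
    }
    /* Apply the per-vector scale once, after the accumulation. */
    return norm * acc;
}
```

Factoring the norm out of the loop means one multiply per key instead of one per channel.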
## Contributing

### Getting Started

1. Fork the repository
2. Create a feature branch: `git checkout -b feature/your-feature`
3. Make your changes
4. Add tests for new functionality
5. Run the test suite: `cd build && ctest`
6. Submit a pull request

### Code Style

- C++11 standard
- 4-space indentation
- snake_case for functions and variables
- UPPER_CASE for constants
- Add comments for complex algorithms

### Testing

- All new code must have unit tests
- Run tests before submitting PR: `cd build && ctest`
- Test on both CPU and Metal (if applicable)

### Pull Request Process

1. Update documentation if needed
2. Add tests for new functionality
3. Ensure all tests pass
4. Request review from maintainers

### Issues

- Use issue templates when available
- Tag issues appropriately (`bug`, `enhancement`, `documentation`)
- Include reproduction steps for bugs
- For performance issues, include benchmark results

## Roles

- **Strago:** Build spec author
- **Cid:** Implementation, benchmarks, deployment
- **Locke:** Research support, upstream watch
- **John:** Quality review
- **Frankie:** Coordination

## Source Repos

- [TheTom/llama-cpp-turboquant](https://github.com/TheTom/llama-cpp-turboquant): llama.cpp fork with Metal
- [TheTom/turboquant_plus](https://github.com/TheTom/turboquant_plus): Reference impl, 511+ tests
- [amirzandieh/QJL](https://github.com/amirzandieh/QJL): QJL authors' code (CUDA)
- [rachittshah/mlx-turboquant](https://github.com/rachittshah/mlx-turboquant): MLX fallback

## Docs

- [BUILD-SPEC.md](BUILD-SPEC.md): Full build specification (Strago, v2.2)
- [docs/PROJECT_STATUS.md](docs/PROJECT_STATUS.md): Current project status
- [docs/INITIATIVE_REVIEW.md](docs/INITIATIVE_REVIEW.md): Initiative review and feedback