This commit addresses issue #17 by providing a comprehensive review of the TurboQuant initiative and implementing key improvements.

## Changes

### 1. Initiative Review (docs/INITIATIVE_REVIEW.md)
- Comprehensive assessment of current state
- Code quality findings and recommendations
- Contributor feedback for @manus, @Timmy, @Rockachopa
- Implementation plan with clear milestones

### 2. Code Improvements

#### llama-turbo.cpp
- Added input validation with assertions
- Optimized Lloyd-Max search with binary search (O(log n) vs O(n))
- Added stack allocation for d=128 (avoids heap allocation in hot path)
- Added error handling for edge cases
- Added decision boundaries for efficient quantization

#### ggml-metal-turbo.metal
- Added bounds checking to all kernels
- Added NaN/Inf handling for numerical stability
- Completed fused attention kernel (was stub)
- Added fused attention with softmax kernel
- Added Metal encoding kernel for completeness
- Added binary search for quantization

### 3. Testing (tests/test_turbo.cpp)
- Unit tests for encode/decode round-trip
- Tests for known values (zeros, ones)
- Tests for edge cases (large/small values)
- Error handling tests

### 4. Build System (CMakeLists.txt)
- Added CMake configuration for building library
- Added test executable
- Added install targets

### 5. Documentation (README.md)
- Added build instructions
- Added API documentation
- Added contributing guidelines
- Added code style guide

## Key Improvements
1. **Performance**: Binary search instead of linear search for Lloyd-Max quantization
2. **Memory**: Stack allocation for common case (d=128)
3. **Reliability**: Input validation and error handling
4. **Metal Integration**: Complete fused attention implementation
5. **Testing**: Unit tests for correctness verification
6. **Documentation**: Contributor guidelines and API docs

## Next Steps
1. Run benchmarks to verify performance improvements
2. Test with actual models (qwen3.5:27b)
3. Integrate with llama.cpp fork
4. Deploy to production

Closes #17

122 lines
3.9 KiB
Markdown
# TurboQuant

KV cache compression for local inference on M4 Max MacBook Pro.

## What
TurboQuant (Google, ICLR 2026) is a three-stage KV cache compression method:
1. **PolarQuant**: WHT rotation + polar coordinates + Lloyd-Max codebook (~4.2x compression)
2. **QJL**: 1-bit quantized Johnson-Lindenstrauss residual correction
3. **TurboQuant**: PolarQuant + QJL = ~3.5 bits/channel, zero accuracy loss

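The Lloyd-Max codebook step in PolarQuant maps each rotated value to its nearest centroid. A minimal C++ sketch of that lookup follows, assuming a sorted scalar codebook whose decision boundaries are the midpoints between adjacent centroids. The function names `make_boundaries` and `quantize_index` are illustrative and are not taken from `llama-turbo.cpp`.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Decision boundaries for nearest-centroid quantization: the midpoints between
// adjacent centroids. `centroids` must be sorted in ascending order.
// (Hypothetical helper; the real codebook values are not given in this README.)
inline std::vector<float> make_boundaries(const std::vector<float>& centroids) {
    assert(centroids.size() >= 2);
    assert(std::is_sorted(centroids.begin(), centroids.end()));
    std::vector<float> boundaries(centroids.size() - 1);
    for (std::size_t i = 0; i + 1 < centroids.size(); ++i) {
        boundaries[i] = 0.5f * (centroids[i] + centroids[i + 1]);
    }
    return boundaries;
}

// O(log n) lookup: the number of boundaries that are <= x is the index of the
// nearest centroid (a value exactly on a boundary goes to the upper centroid).
inline std::size_t quantize_index(const std::vector<float>& boundaries, float x) {
    return static_cast<std::size_t>(
        std::upper_bound(boundaries.begin(), boundaries.end(), x) - boundaries.begin());
}
```

For a hypothetical codebook `{-1.5, -0.5, 0.5, 1.5}` the boundaries are `{-1, 0, 1}`, so `quantize_index(boundaries, 0.1f)` selects index 2. A 4-bit codebook would have 16 centroids and 15 boundaries, so each lookup needs at most four comparisons.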
## Why
Unlock 64K-128K context on qwen3.5:27b within 32GB unified memory.
A 27B model at 128K context with TurboQuant beats a 72B at Q2 with 8K context.

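The memory argument can be made concrete with a small calculator. This is a sketch only: `KvShape` and `kv_cache_bytes` are hypothetical names, and the layer, KV-head, and head-dimension counts must come from the real model config, which this README does not state.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical description of a model's KV layout. Fill these in from the
// actual model config; no values for qwen3.5:27b are assumed here.
struct KvShape {
    std::uint64_t layers;
    std::uint64_t kv_heads;
    std::uint64_t head_dim;
};

// Bytes needed to hold keys plus values for `tokens` positions when each
// channel is stored at `bits_per_channel` (16 for fp16, about 3.5 for the
// TurboQuant figure quoted above). Per-vector metadata such as norms is ignored.
inline double kv_cache_bytes(const KvShape& s, std::uint64_t tokens, double bits_per_channel) {
    assert(bits_per_channel > 0.0);
    const double channels = 2.0 * s.layers * s.kv_heads * s.head_dim * tokens;  // K and V
    return channels * bits_per_channel / 8.0;
}
```

Whatever the shape, the ratio between a 16-bit cache and a 3.5 bits/channel cache is 16 / 3.5, or roughly 4.6x, before metadata overhead. Comparing `kv_cache_bytes(shape, 131072, 16.0)` with `kv_cache_bytes(shape, 131072, 3.5)` for the real shape shows whether a 128K context fits in the memory left over after the model weights.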
## Status
See [issues](http://143.198.27.163:3000/Timmy_Foundation/turboquant/issues) for current progress.

## Building

### Prerequisites
- CMake 3.10+
- C++11 compiler
- Xcode Command Line Tools (for Metal on macOS)

### Build Instructions
```bash
# Clone the repository
git clone https://forge.alexanderwhitestone.com/Timmy_Foundation/turboquant.git
cd turboquant

# Build with CMake
cmake -B build -DCMAKE_BUILD_TYPE=Release
cmake --build build

# Run tests
cd build && ctest
```

### Integration with llama.cpp
See [PR-IMPLEMENTATION-PLAN.md](PR-IMPLEMENTATION-PLAN.md) for integration steps.

## API

### CPU Reference Implementation
```c
#include <stdint.h>  // uint8_t

// Encode: Compress float vector to 4-bit packed representation
void polar_quant_encode_turbo4(
    const float* src,   // Input: float array [d]
    uint8_t* dst,       // Output: packed 4-bit indices [d/2]
    float* norm,        // Output: L2 norm (radius)
    int d               // Dimension (must be power of 2, e.g., 128)
);

// Decode: Decompress 4-bit packed representation to float vector
void polar_quant_decode_turbo4(
    const uint8_t* src, // Input: packed 4-bit indices [d/2]
    float* dst,         // Output: float array [d]
    float norm,         // Input: L2 norm (radius)
    int d               // Dimension (must be power of 2, e.g., 128)
);
```

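The `[d/2]` size of the packed buffer follows from storing two 4-bit indices per byte. The sketch below illustrates one possible layout, with the even-numbered element in the low nibble and the odd-numbered element in the high nibble. That ordering is an assumption, and `pack_nibbles` / `unpack_nibbles` are hypothetical helpers rather than functions from this library.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Pack 4-bit indices two per byte. Assumed layout (not confirmed by the
// source): element 2*i in the low nibble, element 2*i+1 in the high nibble.
inline std::vector<std::uint8_t> pack_nibbles(const std::vector<std::uint8_t>& idx) {
    assert(idx.size() % 2 == 0);
    std::vector<std::uint8_t> packed(idx.size() / 2);
    for (std::size_t i = 0; i < packed.size(); ++i) {
        assert(idx[2 * i] < 16 && idx[2 * i + 1] < 16);
        packed[i] = static_cast<std::uint8_t>(idx[2 * i] | (idx[2 * i + 1] << 4));
    }
    return packed;
}

// Inverse of pack_nibbles under the same assumed layout.
inline std::vector<std::uint8_t> unpack_nibbles(const std::vector<std::uint8_t>& packed) {
    std::vector<std::uint8_t> idx(packed.size() * 2);
    for (std::size_t i = 0; i < packed.size(); ++i) {
        idx[2 * i] = static_cast<std::uint8_t>(packed[i] & 0x0F);
        idx[2 * i + 1] = static_cast<std::uint8_t>(packed[i] >> 4);
    }
    return idx;
}
```

With `d = 128` this gives 64 bytes of indices per vector, plus the separately stored `norm`.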
### Metal Shaders
See `ggml-metal-turbo.metal` for GPU-accelerated kernels:
- `kernel_fwht_128`: Fast Walsh-Hadamard Transform
- `kernel_turbo4_dequant`: Dequantization for attention
- `kernel_attention_turbo4`: Fused attention computation
- `kernel_attention_turbo4_softmax`: Fused attention with softmax
- `kernel_turbo4_encode`: Encoding on GPU

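For readers unfamiliar with the transform behind `kernel_fwht_128`, here is a generic CPU sketch of an in-place fast Walsh-Hadamard transform with orthonormal scaling. It is not a port of the Metal kernel: whether the kernel uses the same scaling or output ordering is an assumption, and `fwht_orthonormal` is a hypothetical name.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// In-place fast Walsh-Hadamard transform (natural/Hadamard ordering).
// With the 1/sqrt(n) scaling the transform is its own inverse and
// preserves the L2 norm of the input.
inline void fwht_orthonormal(std::vector<float>& x) {
    const std::size_t n = x.size();
    assert(n != 0 && (n & (n - 1)) == 0);  // length must be a power of two
    for (std::size_t half = 1; half < n; half *= 2) {
        for (std::size_t block = 0; block < n; block += 2 * half) {
            for (std::size_t i = block; i < block + half; ++i) {
                const float a = x[i];
                const float b = x[i + half];
                x[i] = a + b;         // butterfly: sum
                x[i + half] = a - b;  // butterfly: difference
            }
        }
    }
    const float scale = 1.0f / std::sqrt(static_cast<float>(n));
    for (std::size_t i = 0; i < n; ++i) {
        x[i] *= scale;
    }
}
```

Applying the function twice returns the original vector up to rounding, which makes it a convenient CPU-side check when comparing against GPU output. For `n = 128` the loop runs seven butterfly stages.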
## Contributing

### Getting Started
1. Fork the repository
2. Create a feature branch: `git checkout -b feature/your-feature`
3. Make your changes
4. Add tests for new functionality
5. Run the test suite: `cd build && ctest`
6. Submit a pull request

### Code Style
- C++11 standard
- 4-space indentation
- snake_case for functions and variables
- UPPER_CASE for constants
- Add comments for complex algorithms

### Testing
- All new code must have unit tests
- Run tests before submitting PR: `cd build && ctest`
- Test on both CPU and Metal (if applicable)

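A round-trip test for a lossy codec usually compares the decoded vector against the original with a tolerance instead of exact equality. The helper below is a sketch of such a metric. `relative_l2_error` is a hypothetical name, not part of `tests/test_turbo.cpp`, and a suitable tolerance for 4-bit quantization has to be chosen empirically.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Relative L2 reconstruction error: ||original - decoded|| / ||original||.
// Accumulates in double to limit rounding error on long vectors.
inline float relative_l2_error(const std::vector<float>& original,
                               const std::vector<float>& decoded) {
    assert(original.size() == decoded.size());
    double diff_sq = 0.0;
    double norm_sq = 0.0;
    for (std::size_t i = 0; i < original.size(); ++i) {
        const double d = static_cast<double>(original[i]) - decoded[i];
        diff_sq += d * d;
        norm_sq += static_cast<double>(original[i]) * original[i];
    }
    // An all-zero input has no meaningful relative error; fall back to the
    // absolute error so a perfect reconstruction of zeros still reports 0.
    if (norm_sq == 0.0) {
        return static_cast<float>(std::sqrt(diff_sq));
    }
    return static_cast<float>(std::sqrt(diff_sq / norm_sq));
}
```

A test could encode and decode a vector with the functions from the API section and then assert that `relative_l2_error(input, output)` stays below the chosen threshold.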
### Pull Request Process
1. Update documentation if needed
2. Add tests for new functionality
3. Ensure all tests pass
4. Request review from maintainers

### Issues
- Use issue templates when available
- Tag issues appropriately (`bug`, `enhancement`, `documentation`)
- Include reproduction steps for bugs
- For performance issues, include benchmark results

## Roles
- **Strago:** Build spec author
- **Cid:** Implementation, benchmarks, deployment
- **Locke:** Research support, upstream watch
- **John:** Quality review
- **Frankie:** Coordination

## Source Repos
- [TheTom/llama-cpp-turboquant](https://github.com/TheTom/llama-cpp-turboquant): llama.cpp fork with Metal
- [TheTom/turboquant_plus](https://github.com/TheTom/turboquant_plus): Reference impl, 511+ tests
- [amirzandieh/QJL](https://github.com/amirzandieh/QJL): Author QJL code (CUDA)
- [rachittshah/mlx-turboquant](https://github.com/rachittshah/mlx-turboquant): MLX fallback

## Docs
- [BUILD-SPEC.md](BUILD-SPEC.md): Full build specification (Strago, v2.2)
- [docs/PROJECT_STATUS.md](docs/PROJECT_STATUS.md): Current project status
- [docs/INITIATIVE_REVIEW.md](docs/INITIATIVE_REVIEW.md): Initiative review and feedback