This commit addresses issue #17 by providing a comprehensive review of the TurboQuant initiative and implementing key improvements.

## Changes

### 1. Initiative Review (docs/INITIATIVE_REVIEW.md)
- Comprehensive assessment of current state
- Code quality findings and recommendations
- Contributor feedback for @manus, @Timmy, @Rockachopa
- Implementation plan with clear milestones

### 2. Code Improvements

#### llama-turbo.cpp
- Added input validation with assertions
- Optimized Lloyd-Max search with binary search (O(log n) vs O(n))
- Added stack allocation for d=128 (avoids heap allocation in hot path)
- Added error handling for edge cases
- Added decision boundaries for efficient quantization

#### ggml-metal-turbo.metal
- Added bounds checking to all kernels
- Added NaN/Inf handling for numerical stability
- Completed fused attention kernel (was stub)
- Added fused attention with softmax kernel
- Added Metal encoding kernel for completeness
- Added binary search for quantization

### 3. Testing (tests/test_turbo.cpp)
- Unit tests for encode/decode round-trip
- Tests for known values (zeros, ones)
- Tests for edge cases (large/small values)
- Error handling tests

### 4. Build System (CMakeLists.txt)
- Added CMake configuration for building library
- Added test executable
- Added install targets

### 5. Documentation (README.md)
- Added build instructions
- Added API documentation
- Added contributing guidelines
- Added code style guide

## Key Improvements
1. **Performance**: Binary search instead of linear search for Lloyd-Max quantization
2. **Memory**: Stack allocation for common case (d=128)
3. **Reliability**: Input validation and error handling
4. **Metal Integration**: Complete fused attention implementation
5. **Testing**: Unit tests for correctness verification
6. **Documentation**: Contributor guidelines and API docs

## Next Steps
1. Run benchmarks to verify performance improvements
2. Test with actual models (qwen3.5:27b)
3. Integrate with llama.cpp fork
4. Deploy to production

Closes #17
TurboQuant
KV cache compression for local inference on M4 Max MacBook Pro.
What
TurboQuant (Google, ICLR 2026) is a three-stage KV cache compression method:
- PolarQuant — WHT rotation + polar coordinates + Lloyd-Max codebook (~4.2x compression)
- QJL — 1-bit quantized Johnson-Lindenstrauss residual correction
- TurboQuant — PolarQuant + QJL = ~3.5 bits/channel, zero accuracy loss
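The Lloyd-Max codebook step in PolarQuant maps each scalar to the nearest of 16 centroids (4 bits). A minimal sketch of that lookup using a binary search over decision boundaries is shown below. The codebook here is a hypothetical, evenly spaced placeholder, not the trained table used by this project, and the names `build_codebook` and `quantize` are illustrative only.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Number of quantization levels for a 4-bit index.
static const int kLevels = 16;

// Fill a HYPOTHETICAL codebook: 16 evenly spaced centroids in (-1, 1).
// A real Lloyd-Max table would be fitted to the data distribution instead.
// boundaries[i] is the midpoint between centroids[i] and centroids[i + 1].
static void build_codebook(float* centroids, float* boundaries) {
    for (int i = 0; i < kLevels; ++i) {
        centroids[i] = -1.0f + (2.0f * i + 1.0f) / kLevels;
    }
    for (int i = 0; i + 1 < kLevels; ++i) {
        boundaries[i] = 0.5f * (centroids[i] + centroids[i + 1]);
    }
}

// Return the index of the centroid nearest to x.
// std::upper_bound finds the first boundary greater than x in O(log n);
// the number of boundaries at or below x is exactly the cell index.
static uint8_t quantize(float x, const float* boundaries) {
    const float* end = boundaries + (kLevels - 1);
    const float* it = std::upper_bound(boundaries, end, x);
    return static_cast<uint8_t>(it - boundaries);
}
```

Because the boundaries are sorted midpoints, the search returns the same index as a linear nearest-centroid scan, and out-of-range inputs clamp to the first or last level.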
Why
Unlock 64K-128K context on qwen3.5:27b within 32GB unified memory. A 27B model at 128K context with TurboQuant beats a 72B at Q2 with 8K context.
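To make the memory argument concrete, the KV cache size can be estimated as 2 (keys and values) × layers × KV heads × head dimension × context length × bits per channel / 8. The sketch below is a rough calculator under that formula. The model shape used in the usage note (64 layers, 8 KV heads, head dimension 128) is a hypothetical placeholder, not the actual qwen3.5:27b configuration, and the estimate ignores per-vector overhead such as stored norms.

```cpp
#include <cstdint>

// Approximate KV cache size in bytes for a single sequence.
// The factor 2 accounts for storing both keys and values.
// All shape parameters are supplied by the caller; nothing here is
// specific to any real model.
static double kv_cache_bytes(int64_t layers, int64_t kv_heads,
                             int64_t head_dim, int64_t context,
                             double bits_per_channel) {
    const double channels = 2.0 * layers * kv_heads * head_dim * context;
    return channels * bits_per_channel / 8.0;
}
```

With the hypothetical shape above at 131072 tokens, a 16-bit cache comes to 32 GiB, while 3.5 bits/channel comes to 7 GiB, which illustrates why compression of this order matters on a 32GB machine.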
Status
See issues for current progress.
Building
Prerequisites
- CMake 3.10+
- C++11 compiler
- Xcode Command Line Tools (for Metal on macOS)
Build Instructions
# Clone the repository
git clone https://forge.alexanderwhitestone.com/Timmy_Foundation/turboquant.git
cd turboquant
# Build with CMake
cmake -B build -DCMAKE_BUILD_TYPE=Release
cmake --build build
# Run tests
cd build && ctest
Integration with llama.cpp
See PR-IMPLEMENTATION-PLAN.md for integration steps.
API
CPU Reference Implementation
// Encode: Compress float vector to 4-bit packed representation
void polar_quant_encode_turbo4(
    const float* src,   // Input: float array [d]
    uint8_t* dst,       // Output: packed 4-bit indices [d/2]
    float* norm,        // Output: L2 norm (radius)
    int d               // Dimension (must be power of 2, e.g., 128)
);

// Decode: Decompress 4-bit packed representation to float vector
void polar_quant_decode_turbo4(
    const uint8_t* src, // Input: packed 4-bit indices [d/2]
    float* dst,         // Output: float array [d]
    float norm,         // Input: L2 norm (radius)
    int d               // Dimension (must be power of 2, e.g., 128)
);
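The `[d/2]` buffer sizes above follow from storing two 4-bit indices per byte. A minimal sketch of such packing is given below. The nibble order (even index in the low nibble) is an assumption for illustration; check llama-turbo.cpp for the layout the library actually uses. The helper names `pack_nibbles` and `unpack_nibbles` are hypothetical and not part of the API.

```cpp
#include <cassert>
#include <cstdint>

// Pack d 4-bit indices (one per input byte) into d/2 output bytes.
// ASSUMED layout: idx[2*i] goes in the low nibble, idx[2*i + 1] in the high.
// d must be even.
static void pack_nibbles(const uint8_t* idx, uint8_t* dst, int d) {
    for (int i = 0; i < d / 2; ++i) {
        const unsigned lo = idx[2 * i] & 0x0Fu;
        const unsigned hi = idx[2 * i + 1] & 0x0Fu;
        dst[i] = static_cast<uint8_t>(lo | (hi << 4));
    }
}

// Inverse of pack_nibbles: expand d/2 packed bytes back into d indices.
static void unpack_nibbles(const uint8_t* src, uint8_t* idx, int d) {
    for (int i = 0; i < d / 2; ++i) {
        idx[2 * i] = static_cast<uint8_t>(src[i] & 0x0Fu);
        idx[2 * i + 1] = static_cast<uint8_t>(src[i] >> 4);
    }
}
```

For d = 128 this gives a 64-byte index buffer per vector, which is why a caller of the encode function would allocate `d / 2` bytes for `dst` plus one float for `norm`.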
Metal Shaders
See ggml-metal-turbo.metal for GPU-accelerated kernels:
- kernel_fwht_128: Fast Walsh-Hadamard Transform
- kernel_turbo4_dequant: Dequantization for attention
- kernel_attention_turbo4: Fused attention computation
- kernel_attention_turbo4_softmax: Fused attention with softmax
- kernel_turbo4_encode: Encoding on GPU
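For readers unfamiliar with the Fast Walsh-Hadamard Transform, the following CPU sketch shows the standard in-place butterfly that a kernel such as kernel_fwht_128 would compute for n = 128. It is a generic textbook version, not a transcription of the Metal shader, and it leaves the output unnormalized. Whether the project scales by 1/sqrt(n) is not stated here, so treat that as an open assumption.

```cpp
#include <cassert>

// In-place, unnormalized fast Walsh-Hadamard transform.
// n must be a power of 2. Runs in O(n log n) additions.
static void fwht(float* x, int n) {
    for (int len = 1; len < n; len <<= 1) {
        // Each block of 2*len elements is split into two halves
        // that are combined with a sum/difference butterfly.
        for (int i = 0; i < n; i += 2 * len) {
            for (int j = i; j < i + len; ++j) {
                const float a = x[j];
                const float b = x[j + len];
                x[j] = a + b;
                x[j + len] = a - b;
            }
        }
    }
}
```

Applying the unnormalized transform twice returns the input scaled by n, so the same routine serves as its own inverse after dividing by n.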
Contributing
Getting Started
- Fork the repository
- Create a feature branch: git checkout -b feature/your-feature
- Make your changes
- Add tests for new functionality
- Run the test suite: cd build && ctest
- Submit a pull request
Code Style
- C++11 standard
- 4-space indentation
- snake_case for functions and variables
- UPPER_CASE for constants
- Add comments for complex algorithms
Testing
- All new code must have unit tests
- Run tests before submitting PR: cd build && ctest
- Test on both CPU and Metal (if applicable)
Pull Request Process
- Update documentation if needed
- Add tests for new functionality
- Ensure all tests pass
- Request review from maintainers
Issues
- Use issue templates when available
- Tag issues appropriately (bug, enhancement, documentation)
- Include reproduction steps for bugs
- For performance issues, include benchmark results
Roles
- Strago: Build spec author
- Cid: Implementation, benchmarks, deployment
- Locke: Research support, upstream watch
- John: Quality review
- Frankie: Coordination
Source Repos
- TheTom/llama-cpp-turboquant — llama.cpp fork with Metal
- TheTom/turboquant_plus — Reference impl, 511+ tests
- amirzandieh/QJL — Author QJL code (CUDA)
- rachittshah/mlx-turboquant — MLX fallback
Docs
- BUILD-SPEC.md — Full build specification (Strago, v2.2)
- docs/PROJECT_STATUS.md — Current project status
- docs/INITIATIVE_REVIEW.md — Initiative review and feedback