README.md

# TurboQuant

KV cache compression for local inference on M4 Max MacBook Pro.

## What
TurboQuant (Google, ICLR 2026) is a three-stage KV cache compression method:
1. **PolarQuant** — WHT rotation + polar coordinates + Lloyd-Max codebook (~4.2x compression)
2. **QJL** — 1-bit quantized Johnson-Lindenstrauss residual correction
3. **TurboQuant** — PolarQuant + QJL = ~3.5 bits/channel, zero accuracy loss

## Why
Unlock 64K-128K context on qwen3.5:27b within 32GB unified memory.
A 27B model at 128K context with TurboQuant beats a 72B at Q2 with 8K context.

## Status
See [issues](http://143.198.27.163:3000/Timmy_Foundation/turboquant/issues) for current progress.

## Building

### Prerequisites
- CMake 3.10+
- C++11 compiler
- Xcode Command Line Tools (for Metal on macOS)

### Build Instructions
```bash
# Clone the repository
git clone https://forge.alexanderwhitestone.com/Timmy_Foundation/turboquant.git
cd turboquant

# Build with CMake
cmake -B build -DCMAKE_BUILD_TYPE=Release
cmake --build build

# Run tests
cd build && ctest
```

### Integration with llama.cpp
See [PR-IMPLEMENTATION-PLAN.md](PR-IMPLEMENTATION-PLAN.md) for integration steps.

## API

### CPU Reference Implementation
```c
// Encode: Compress float vector to 4-bit packed representation
void polar_quant_encode_turbo4(
    const float* src,    // Input: float array [d]
    uint8_t* dst,        // Output: packed 4-bit indices [d/2]
    float* norm,         // Output: L2 norm (radius)
    int d                // Dimension (must be power of 2, e.g., 128)
);

// Decode: Decompress 4-bit packed representation to float vector
void polar_quant_decode_turbo4(
    const uint8_t* src,  // Input: packed 4-bit indices [d/2]
    float* dst,          // Output: float array [d]
    float norm,          // Input: L2 norm (radius)
    int d                // Dimension (must be power of 2, e.g., 128)
);
```

### Metal Shaders
See `ggml-metal-turbo.metal` for GPU-accelerated kernels:
- `kernel_fwht_128`: Fast Walsh-Hadamard Transform
- `kernel_turbo4_dequant`: Dequantization for attention
- `kernel_attention_turbo4`: Fused attention computation
- `kernel_attention_turbo4_softmax`: Fused attention with softmax
- `kernel_turbo4_encode`: Encoding on GPU

## Contributing

### Getting Started
1. Fork the repository
2. Create a feature branch: `git checkout -b feature/your-feature`
3. Make your changes
4. Add tests for new functionality
5. Run the test suite: `cd build && ctest`
6. Submit a pull request

### Code Style
- C++11 standard
- 4-space indentation
- Snake_case for functions and variables
- UPPER_CASE for constants
- Add comments for complex algorithms

### Testing
- All new code must have unit tests
- Run tests before submitting PR: `cd build && ctest`
- Test on both CPU and Metal (if applicable)

### Pull Request Process
1. Update documentation if needed
2. Add tests for new functionality
3. Ensure all tests pass
4. Request review from maintainers

### Issues
- Use issue templates when available
- Tag issues appropriately (`bug`, `enhancement`, `documentation`)
- Include reproduction steps for bugs
- For performance issues, include benchmark results

## Roles
- **Strago:** Build spec author
- **Cid:** Implementation, benchmarks, deployment
- **Locke:** Research support, upstream watch
- **John:** Quality review
- **Frankie:** Coordination

## Source Repos
- [TheTom/llama-cpp-turboquant](https://github.com/TheTom/llama-cpp-turboquant) — llama.cpp fork with Metal
- [TheTom/turboquant_plus](https://github.com/TheTom/turboquant_plus) — Reference impl, 511+ tests
- [amirzandieh/QJL](https://github.com/amirzandieh/QJL) — Author QJL code (CUDA)
- [rachittshah/mlx-turboquant](https://github.com/rachittshah/mlx-turboquant) — MLX fallback

## Docs
- [BUILD-SPEC.md](BUILD-SPEC.md) — Full build specification (Strago, v2.2)
- [docs/PROJECT_STATUS.md](docs/PROJECT_STATUS.md) — Current project status
- [docs/INITIATIVE_REVIEW.md](docs/INITIATIVE_REVIEW.md) — Initiative review and feedback
Add build spec v2.2 and README TurboQuant KV cache compression for M4 Max local inference. Spec by Strago, triaged into 16 issues across 4 phases. Ref #1 2026-03-30 13:11:45 -04:00			`# TurboQuant`
Initial commit 2026-03-30 17:08:45 +00:00
Add build spec v2.2 and README TurboQuant KV cache compression for M4 Max local inference. Spec by Strago, triaged into 16 issues across 4 phases. Ref #1 2026-03-30 13:11:45 -04:00			`KV cache compression for local inference on M4 Max MacBook Pro.`

			`## What`
			`TurboQuant (Google, ICLR 2026) is a three-stage KV cache compression method:`
			`1. PolarQuant — WHT rotation + polar coordinates + Lloyd-Max codebook (~4.2x compression)`
			`2. QJL — 1-bit quantized Johnson-Lindenstrauss residual correction`
			`3. TurboQuant — PolarQuant + QJL = ~3.5 bits/channel, zero accuracy loss`

			`## Why`
			`Unlock 64K-128K context on qwen3.5:27b within 32GB unified memory.`
			`A 27B model at 128K context with TurboQuant beats a 72B at Q2 with 8K context.`

			`## Status`
			`See [issues](http://143.198.27.163:3000/Timmy_Foundation/turboquant/issues) for current progress.`

feat: Comprehensive review and improvements for TurboQuant (#17) This commit addresses issue #17 by providing a comprehensive review of the TurboQuant initiative and implementing key improvements. ## Changes ### 1. Initiative Review (docs/INITIATIVE_REVIEW.md) - Comprehensive assessment of current state - Code quality findings and recommendations - Contributor feedback for @manus, @Timmy, @Rockachopa - Implementation plan with clear milestones ### 2. Code Improvements #### llama-turbo.cpp - Added input validation with assertions - Optimized Lloyd-Max search with binary search (O(log n) vs O(n)) - Added stack allocation for d=128 (avoids heap allocation in hot path) - Added error handling for edge cases - Added decision boundaries for efficient quantization #### ggml-metal-turbo.metal - Added bounds checking to all kernels - Added NaN/Inf handling for numerical stability - Completed fused attention kernel (was stub) - Added fused attention with softmax kernel - Added Metal encoding kernel for completeness - Added binary search for quantization ### 3. Testing (tests/test_turbo.cpp) - Unit tests for encode/decode round-trip - Tests for known values (zeros, ones) - Tests for edge cases (large/small values) - Error handling tests ### 4. Build System (CMakeLists.txt) - Added CMake configuration for building library - Added test executable - Added install targets ### 5. Documentation (README.md) - Added build instructions - Added API documentation - Added contributing guidelines - Added code style guide ## Key Improvements 1. Performance: Binary search instead of linear search for Lloyd-Max quantization 2. Memory: Stack allocation for common case (d=128) 3. Reliability: Input validation and error handling 4. Metal Integration: Complete fused attention implementation 5. Testing: Unit tests for correctness verification 6. Documentation: Contributor guidelines and API docs ## Next Steps 1. Run benchmarks to verify performance improvements 2. Test with actual models (qwen3.5:27b) 3. Integrate with llama.cpp fork 4. Deploy to production Closes #17 2026-04-14 22:07:21 -04:00			`## Building`

			`### Prerequisites`
			`- CMake 3.10+`
			`- C++11 compiler`
			`- Xcode Command Line Tools (for Metal on macOS)`

			`### Build Instructions`
			```bash
			`# Clone the repository`
			`git clone https://forge.alexanderwhitestone.com/Timmy_Foundation/turboquant.git`
			`cd turboquant`

			`# Build with CMake`
			`cmake -B build -DCMAKE_BUILD_TYPE=Release`
			`cmake --build build`

			`# Run tests`
			`cd build && ctest`
			```

			`### Integration with llama.cpp`
			`See [PR-IMPLEMENTATION-PLAN.md](PR-IMPLEMENTATION-PLAN.md) for integration steps.`

			`## API`

			`### CPU Reference Implementation`
			```c
			`// Encode: Compress float vector to 4-bit packed representation`
			`void polar_quant_encode_turbo4(`
			`const float* src, // Input: float array [d]`
			`uint8_t* dst, // Output: packed 4-bit indices [d/2]`
			`float* norm, // Output: L2 norm (radius)`
			`int d // Dimension (must be power of 2, e.g., 128)`
			`);`

			`// Decode: Decompress 4-bit packed representation to float vector`
			`void polar_quant_decode_turbo4(`
			`const uint8_t* src, // Input: packed 4-bit indices [d/2]`
			`float* dst, // Output: float array [d]`
			`float norm, // Input: L2 norm (radius)`
			`int d // Dimension (must be power of 2, e.g., 128)`
			`);`
			```

			`### Metal Shaders`
			See `ggml-metal-turbo.metal` for GPU-accelerated kernels:
			- `kernel_fwht_128`: Fast Walsh-Hadamard Transform
			- `kernel_turbo4_dequant`: Dequantization for attention
			- `kernel_attention_turbo4`: Fused attention computation
			- `kernel_attention_turbo4_softmax`: Fused attention with softmax
			- `kernel_turbo4_encode`: Encoding on GPU

			`## Contributing`

			`### Getting Started`
			`1. Fork the repository`
			2. Create a feature branch: `git checkout -b feature/your-feature`
			`3. Make your changes`
			`4. Add tests for new functionality`
			5. Run the test suite: `cd build && ctest`
			`6. Submit a pull request`

			`### Code Style`
			`- C++11 standard`
			`- 4-space indentation`
			`- Snake_case for functions and variables`
			`- UPPER_CASE for constants`
			`- Add comments for complex algorithms`

			`### Testing`
			`- All new code must have unit tests`
			- Run tests before submitting PR: `cd build && ctest`
			`- Test on both CPU and Metal (if applicable)`

			`### Pull Request Process`
			`1. Update documentation if needed`
			`2. Add tests for new functionality`
			`3. Ensure all tests pass`
			`4. Request review from maintainers`

			`### Issues`
			`- Use issue templates when available`
			- Tag issues appropriately (`bug`, `enhancement`, `documentation`)
			`- Include reproduction steps for bugs`
			`- For performance issues, include benchmark results`

Add build spec v2.2 and README TurboQuant KV cache compression for M4 Max local inference. Spec by Strago, triaged into 16 issues across 4 phases. Ref #1 2026-03-30 13:11:45 -04:00			`## Roles`
			`- Strago: Build spec author`
			`- Cid: Implementation, benchmarks, deployment`
			`- Locke: Research support, upstream watch`
			`- John: Quality review`
			`- Frankie: Coordination`

			`## Source Repos`
			`- [TheTom/llama-cpp-turboquant](https://github.com/TheTom/llama-cpp-turboquant) — llama.cpp fork with Metal`
			`- [TheTom/turboquant_plus](https://github.com/TheTom/turboquant_plus) — Reference impl, 511+ tests`
			`- [amirzandieh/QJL](https://github.com/amirzandieh/QJL) — Author QJL code (CUDA)`
			`- [rachittshah/mlx-turboquant](https://github.com/rachittshah/mlx-turboquant) — MLX fallback`

			`## Docs`
			`- [BUILD-SPEC.md](BUILD-SPEC.md) — Full build specification (Strago, v2.2)`
feat: Comprehensive review and improvements for TurboQuant (#17) This commit addresses issue #17 by providing a comprehensive review of the TurboQuant initiative and implementing key improvements. ## Changes ### 1. Initiative Review (docs/INITIATIVE_REVIEW.md) - Comprehensive assessment of current state - Code quality findings and recommendations - Contributor feedback for @manus, @Timmy, @Rockachopa - Implementation plan with clear milestones ### 2. Code Improvements #### llama-turbo.cpp - Added input validation with assertions - Optimized Lloyd-Max search with binary search (O(log n) vs O(n)) - Added stack allocation for d=128 (avoids heap allocation in hot path) - Added error handling for edge cases - Added decision boundaries for efficient quantization #### ggml-metal-turbo.metal - Added bounds checking to all kernels - Added NaN/Inf handling for numerical stability - Completed fused attention kernel (was stub) - Added fused attention with softmax kernel - Added Metal encoding kernel for completeness - Added binary search for quantization ### 3. Testing (tests/test_turbo.cpp) - Unit tests for encode/decode round-trip - Tests for known values (zeros, ones) - Tests for edge cases (large/small values) - Error handling tests ### 4. Build System (CMakeLists.txt) - Added CMake configuration for building library - Added test executable - Added install targets ### 5. Documentation (README.md) - Added build instructions - Added API documentation - Added contributing guidelines - Added code style guide ## Key Improvements 1. Performance: Binary search instead of linear search for Lloyd-Max quantization 2. Memory: Stack allocation for common case (d=128) 3. Reliability: Input validation and error handling 4. Metal Integration: Complete fused attention implementation 5. Testing: Unit tests for correctness verification 6. Documentation: Contributor guidelines and API docs ## Next Steps 1. Run benchmarks to verify performance improvements 2. Test with actual models (qwen3.5:27b) 3. Integrate with llama.cpp fork 4. Deploy to production Closes #17 2026-04-14 22:07:21 -04:00			`- [docs/PROJECT_STATUS.md](docs/PROJECT_STATUS.md) — Current project status`
			`- [docs/INITIATIVE_REVIEW.md](docs/INITIATIVE_REVIEW.md) — Initiative review and feedback`