# TurboQuant

KV cache compression for local inference on an M4 Max MacBook Pro.

## What
TurboQuant (Google, ICLR 2026) is a three-stage KV cache compression method:
1. **PolarQuant** — WHT rotation + polar coordinates + Lloyd-Max codebook (~4.2x compression)
2. **QJL** — 1-bit quantized Johnson-Lindenstrauss residual correction
3. **TurboQuant** — PolarQuant + QJL combined: ~3.5 bits/channel, which the paper reports as quality-neutral
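
The WHT rotation in stage 1 can be illustrated with a minimal in-place fast Walsh-Hadamard transform. This is a sketch under assumptions, not this repository's kernel: the name `fwht_inplace` is hypothetical, and the per-stage 1/sqrt(2) scaling (which makes the transform orthonormal and its own inverse) is one common convention. The actual implementation may normalize differently or combine the transform with sign flips.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: in-place fast Walsh-Hadamard transform.
 * Each butterfly stage is scaled by 1/sqrt(2), so the full transform
 * preserves the L2 norm and applying it twice returns the input.
 * n must be a power of two (e.g., 128). */
void fwht_inplace(float *x, size_t n)
{
    const float inv_sqrt2 = 0.70710678118654752f;

    assert(n != 0 && (n & (n - 1)) == 0);

    for (size_t h = 1; h < n; h <<= 1) {          /* log2(n) stages */
        for (size_t i = 0; i < n; i += 2 * h) {   /* blocks of size 2h */
            for (size_t j = i; j < i + h; ++j) {  /* butterflies in a block */
                float a = x[j];
                float b = x[j + h];
                x[j]     = (a + b) * inv_sqrt2;
                x[j + h] = (a - b) * inv_sqrt2;
            }
        }
    }
}
```

The transform costs O(n log n) additions and spreads the energy of a single large coordinate across all outputs, which is the property a rotation step before scalar quantization relies on.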

## Why
Unlock 64K-128K context on qwen3.5:27b within 32 GB of unified memory.
The premise is that a 27B model at 128K context with TurboQuant is more useful than a 72B model at Q2 with only 8K context.
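
The memory argument can be made concrete with a rough KV cache size estimate. The sketch below is illustrative only: `kv_cache_bytes` is a hypothetical helper, and the model shape in the usage note is an assumed example, not the real dimensions of qwen3.5:27b. It also ignores per-vector norms and any other metadata the compressed format stores.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: approximate KV cache size in bytes.
 * Counts one K and one V channel per (layer, KV head, head dim, token)
 * and multiplies by the storage width in bits per channel. */
double kv_cache_bytes(int n_layers, int n_kv_heads, int head_dim,
                      int64_t n_ctx, double bits_per_channel)
{
    assert(n_layers > 0 && n_kv_heads > 0 && head_dim > 0);
    assert(n_ctx > 0 && bits_per_channel > 0.0);

    /* Factor 2.0 accounts for storing both keys and values. */
    double channels = 2.0 * n_layers * n_kv_heads * head_dim * (double)n_ctx;
    return channels * bits_per_channel / 8.0;
}
```

For an assumed shape of 64 layers, 8 KV heads, and head dimension 128 at 131072 tokens, this gives 32 GiB at 16 bits/channel and 7 GiB at 3.5 bits/channel. Under those assumptions, the compressed cache leaves room for model weights inside a 32 GB budget, while the uncompressed cache alone would not fit.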

## Status
See [issues](http://143.198.27.163:3000/Timmy_Foundation/turboquant/issues) for current progress.

## Building

### Prerequisites
- CMake 3.13+ (the `cmake -B` invocation below requires 3.13 or newer)
- C++11 compiler
- Xcode Command Line Tools (for Metal on macOS)

### Build Instructions
```bash
# Clone the repository
git clone https://forge.alexanderwhitestone.com/Timmy_Foundation/turboquant.git
cd turboquant

# Build with CMake
cmake -B build -DCMAKE_BUILD_TYPE=Release
cmake --build build

# Run tests
cd build && ctest
```

### Integration with llama.cpp
See [PR-IMPLEMENTATION-PLAN.md](PR-IMPLEMENTATION-PLAN.md) for integration steps.

## API

### CPU Reference Implementation
```c
#include <stdint.h>  // uint8_t

// Encode: Compress float vector to 4-bit packed representation
void polar_quant_encode_turbo4(
    const float* src,    // Input: float array [d]
    uint8_t* dst,        // Output: packed 4-bit indices [d/2]
    float* norm,         // Output: L2 norm (radius)
    int d                // Dimension (must be power of 2, e.g., 128)
);

// Decode: Decompress 4-bit packed representation to float vector
void polar_quant_decode_turbo4(
    const uint8_t* src,  // Input: packed 4-bit indices [d/2]
    float* dst,          // Output: float array [d]
    float norm,          // Input: L2 norm (radius)
    int d                // Dimension (must be power of 2, e.g., 128)
);
```
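
The `[d/2]` output size follows from storing two 4-bit codebook indices per byte. The sketch below shows one way to pack and unpack such indices. The helper names `pack_nibbles` and `unpack_nibbles` are hypothetical, and the nibble order (even index in the low nibble, odd index in the high nibble) is an assumption; check the reference implementation for the layout it actually uses.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: pack d 4-bit indices (each 0..15) into d/2 bytes.
 * ASSUMED layout: idx[2k] in the low nibble, idx[2k+1] in the high nibble. */
void pack_nibbles(const uint8_t *idx, uint8_t *dst, size_t d)
{
    assert(d % 2 == 0);
    for (size_t i = 0; i < d; i += 2) {
        assert(idx[i] < 16 && idx[i + 1] < 16);
        dst[i / 2] = (uint8_t)(idx[i] | (idx[i + 1] << 4));
    }
}

/* Hypothetical helper: inverse of pack_nibbles under the same layout. */
void unpack_nibbles(const uint8_t *src, uint8_t *idx, size_t d)
{
    assert(d % 2 == 0);
    for (size_t i = 0; i < d; i += 2) {
        idx[i]     = (uint8_t)(src[i / 2] & 0x0F);
        idx[i + 1] = (uint8_t)(src[i / 2] >> 4);
    }
}
```

With `d = 128`, the packed indices occupy 64 bytes. Going by the signatures above, a caller would allocate `d / 2` bytes for `dst` and keep the returned `norm` alongside it for decoding.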

### Metal Shaders
See `ggml-metal-turbo.metal` for GPU-accelerated kernels:
- `kernel_fwht_128`: Fast Walsh-Hadamard Transform
- `kernel_turbo4_dequant`: Dequantization for attention
- `kernel_attention_turbo4`: Fused attention computation
- `kernel_attention_turbo4_softmax`: Fused attention with softmax
- `kernel_turbo4_encode`: Encoding on GPU

## Contributing

### Getting Started
1. Fork the repository
2. Create a feature branch: `git checkout -b feature/your-feature`
3. Make your changes
4. Add tests for new functionality
5. Run the test suite: `cd build && ctest`
6. Submit a pull request

### Code Style
- C++11 standard
- 4-space indentation
- `snake_case` for functions and variables
- `UPPER_CASE` for constants
- Add comments for complex algorithms

### Testing
- All new code must have unit tests
- Run tests before submitting PR: `cd build && ctest`
- Test on both CPU and Metal (if applicable)

### Pull Request Process
1. Update documentation if needed
2. Add tests for new functionality
3. Ensure all tests pass
4. Request review from maintainers

### Issues
- Use issue templates when available
- Tag issues appropriately (`bug`, `enhancement`, `documentation`)
- Include reproduction steps for bugs
- For performance issues, include benchmark results

## Roles
- **Strago:** Build spec author
- **Cid:** Implementation, benchmarks, deployment
- **Locke:** Research support, upstream watch
- **John:** Quality review
- **Frankie:** Coordination

## Source Repos
- [TheTom/llama-cpp-turboquant](https://github.com/TheTom/llama-cpp-turboquant) — llama.cpp fork with Metal
- [TheTom/turboquant_plus](https://github.com/TheTom/turboquant_plus) — Reference impl, 511+ tests
- [amirzandieh/QJL](https://github.com/amirzandieh/QJL) — Author QJL code (CUDA)
- [rachittshah/mlx-turboquant](https://github.com/rachittshah/mlx-turboquant) — MLX fallback

## Docs
- [BUILD-SPEC.md](BUILD-SPEC.md) — Full build specification (Strago, v2.2)
- [docs/PROJECT_STATUS.md](docs/PROJECT_STATUS.md) — Current project status
- [docs/INITIATIVE_REVIEW.md](docs/INITIATIVE_REVIEW.md) — Initiative review and feedback