Alexander Whitestone d2ef914edd
All checks were successful
Smoke Test / smoke (pull_request) Successful in 24s
feat: Comprehensive review and improvements for TurboQuant (#17)
This commit addresses issue #17 by providing a comprehensive review
of the TurboQuant initiative and implementing key improvements.

## Changes

### 1. Initiative Review (docs/INITIATIVE_REVIEW.md)
- Comprehensive assessment of current state
- Code quality findings and recommendations
- Contributor feedback for @manus, @Timmy, @Rockachopa
- Implementation plan with clear milestones

### 2. Code Improvements

#### llama-turbo.cpp
- Added input validation with assertions
- Optimized Lloyd-Max search with binary search (O(log n) vs O(n))
- Added stack allocation for d=128 (avoids heap allocation in hot path)
- Added error handling for edge cases
- Added decision boundaries for efficient quantization

#### ggml-metal-turbo.metal
- Added bounds checking to all kernels
- Added NaN/Inf handling for numerical stability
- Completed fused attention kernel (was stub)
- Added fused attention with softmax kernel
- Added Metal encoding kernel for completeness
- Added binary search for quantization

### 3. Testing (tests/test_turbo.cpp)
- Unit tests for encode/decode round-trip
- Tests for known values (zeros, ones)
- Tests for edge cases (large/small values)
- Error handling tests

### 4. Build System (CMakeLists.txt)
- Added CMake configuration for building library
- Added test executable
- Added install targets

### 5. Documentation (README.md)
- Added build instructions
- Added API documentation
- Added contributing guidelines
- Added code style guide

## Key Improvements

1. **Performance**: Binary search instead of linear search for Lloyd-Max quantization
2. **Memory**: Stack allocation for common case (d=128)
3. **Reliability**: Input validation and error handling
4. **Metal Integration**: Complete fused attention implementation
5. **Testing**: Unit tests for correctness verification
6. **Documentation**: Contributor guidelines and API docs

## Next Steps

1. Run benchmarks to verify performance improvements
2. Test with actual models (qwen3.5:27b)
3. Integrate with llama.cpp fork
4. Deploy to production

Closes #17
2026-04-14 22:07:21 -04:00
2026-03-30 17:08:45 +00:00
2026-03-30 21:06:49 +00:00

TurboQuant

KV cache compression for local inference on M4 Max MacBook Pro.

What

TurboQuant (Google, ICLR 2026) is a three-stage KV cache compression method:

  1. PolarQuant — WHT rotation + polar coordinates + Lloyd-Max codebook (~4.2x compression)
  2. QJL — 1-bit quantized Johnson-Lindenstrauss residual correction
  3. TurboQuant — PolarQuant + QJL = ~3.5 bits/channel, zero accuracy loss

Why

Unlock 64K-128K context on qwen3.5:27b within 32GB unified memory. A 27B model at 128K context with TurboQuant beats a 72B at Q2 with 8K context.

Status

See issues for current progress.

Building

Prerequisites

  • CMake 3.10+
  • C++11 compiler
  • Xcode Command Line Tools (for Metal on macOS)

Build Instructions

# Clone the repository
git clone https://forge.alexanderwhitestone.com/Timmy_Foundation/turboquant.git
cd turboquant

# Build with CMake
cmake -B build -DCMAKE_BUILD_TYPE=Release
cmake --build build

# Run tests
cd build && ctest

Integration with llama.cpp

See PR-IMPLEMENTATION-PLAN.md for integration steps.

API

CPU Reference Implementation

// Encode: Compress float vector to 4-bit packed representation
void polar_quant_encode_turbo4(
    const float* src,    // Input: float array [d]
    uint8_t* dst,        // Output: packed 4-bit indices [d/2]
    float* norm,         // Output: L2 norm (radius)
    int d                // Dimension (must be power of 2, e.g., 128)
);

// Decode: Decompress 4-bit packed representation to float vector
void polar_quant_decode_turbo4(
    const uint8_t* src,  // Input: packed 4-bit indices [d/2]
    float* dst,          // Output: float array [d]
    float norm,          // Input: L2 norm (radius)
    int d                // Dimension (must be power of 2, e.g., 128)
);

Metal Shaders

See ggml-metal-turbo.metal for GPU-accelerated kernels:

  • kernel_fwht_128: Fast Walsh-Hadamard Transform
  • kernel_turbo4_dequant: Dequantization for attention
  • kernel_attention_turbo4: Fused attention computation
  • kernel_attention_turbo4_softmax: Fused attention with softmax
  • kernel_turbo4_encode: Encoding on GPU

Contributing

Getting Started

  1. Fork the repository
  2. Create a feature branch: git checkout -b feature/your-feature
  3. Make your changes
  4. Add tests for new functionality
  5. Run the test suite: cd build && ctest
  6. Submit a pull request

Code Style

  • C++11 standard
  • 4-space indentation
  • Snake_case for functions and variables
  • UPPER_CASE for constants
  • Add comments for complex algorithms

Testing

  • All new code must have unit tests
  • Run tests before submitting PR: cd build && ctest
  • Test on both CPU and Metal (if applicable)

Pull Request Process

  1. Update documentation if needed
  2. Add tests for new functionality
  3. Ensure all tests pass
  4. Request review from maintainers

Issues

  • Use issue templates when available
  • Tag issues appropriately (bug, enhancement, documentation)
  • Include reproduction steps for bugs
  • For performance issues, include benchmark results

Roles

  • Strago: Build spec author
  • Cid: Implementation, benchmarks, deployment
  • Locke: Research support, upstream watch
  • John: Quality review
  • Frankie: Coordination

Source Repos

Docs

Description
TurboQuant KV cache compression for local inference — PolarQuant + QJL on M4 Max via llama.cpp/Ollama. Build spec from Strago, build by Cid, coordination by Frankie.
Readme MIT 28 MiB
Languages
Python 90.5%
C++ 6.2%
Metal 2.4%
CMake 0.9%