
Alexander Whitestone d2ef914edd


feat: Comprehensive review and improvements for TurboQuant (#17)

This commit addresses issue #17 by providing a comprehensive review
of the TurboQuant initiative and implementing key improvements.

## Changes

### 1. Initiative Review (docs/INITIATIVE_REVIEW.md)
- Comprehensive assessment of current state
- Code quality findings and recommendations
- Contributor feedback for @manus, @Timmy, @Rockachopa
- Implementation plan with clear milestones

### 2. Code Improvements

#### llama-turbo.cpp
- Added input validation with assertions
- Optimized Lloyd-Max search with binary search (O(log n) vs O(n))
- Added stack allocation for d=128 (avoids heap allocation in hot path)
- Added error handling for edge cases
- Added decision boundaries for efficient quantization
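
A minimal sketch of how a binary search over decision boundaries can replace a linear scan of the codebook. The names (`Codebook`, `make_boundaries`, `quantize`, `quantize_linear`) and the 16-level size are assumptions for illustration and are not taken from llama-turbo.cpp. The boundaries are the midpoints between adjacent sorted centroids, so the index of the nearest centroid equals the number of boundaries that do not exceed the input.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cmath>
#include <cstddef>

// Hypothetical 4-bit codebook size (16 levels); not taken from the project.
constexpr std::size_t kLevels = 16;

using Codebook = std::array<float, kLevels>;
using Boundaries = std::array<float, kLevels - 1>;

// Decision boundaries are the midpoints between adjacent centroids
// (the nearest-neighbour condition of a Lloyd-Max quantizer).
Boundaries make_boundaries(const Codebook& c) {
    Boundaries b{};
    for (std::size_t i = 0; i + 1 < kLevels; ++i) {
        assert(c[i] < c[i + 1]);  // centroids must be strictly ascending
        b[i] = 0.5f * (c[i] + c[i + 1]);
    }
    return b;
}

// O(log n): count the boundaries that are <= x with a binary search.
// A value exactly on a boundary goes to the upper centroid.
int quantize(float x, const Boundaries& b) {
    assert(!std::isnan(x));
    return static_cast<int>(std::upper_bound(b.begin(), b.end(), x) - b.begin());
}

// O(n) reference: scan every centroid and keep the closest one.
// A value exactly on a boundary stays with the lower centroid.
int quantize_linear(float x, const Codebook& c) {
    std::size_t best = 0;
    for (std::size_t i = 1; i < kLevels; ++i) {
        if (std::fabs(x - c[i]) < std::fabs(x - c[best])) best = i;
    }
    return static_cast<int>(best);
}
```

Away from exact ties both functions return the same index; out-of-range inputs clamp to the first or last level in either version.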

#### ggml-metal-turbo.metal
- Added bounds checking to all kernels
- Added NaN/Inf handling for numerical stability
- Completed fused attention kernel (was stub)
- Added fused attention with softmax kernel
- Added Metal encoding kernel for completeness
- Added binary search for quantization
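
The NaN/Inf handling and the softmax step can be illustrated on the CPU. This is a sketch under stated assumptions, not the Metal kernel code: the sanitization policy (NaN becomes 0, infinities clamp to a finite limit) and the names `sanitize` and `softmax_inplace` are hypothetical. Subtracting the maximum score before exponentiating keeps every `exp` argument at or below zero, which avoids overflow.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <vector>

// Hypothetical policy: NaN -> 0, +/-Inf (or anything larger) -> +/-limit.
inline float sanitize(float x, float limit = 65504.0f) {
    if (std::isnan(x)) return 0.0f;
    if (x > limit) return limit;
    if (x < -limit) return -limit;
    return x;
}

// In-place softmax over attention scores with max-subtraction.
void softmax_inplace(std::vector<float>& scores) {
    assert(!scores.empty());
    float max_score = sanitize(scores[0]);
    for (float& v : scores) {
        v = sanitize(v);
        max_score = std::max(max_score, v);
    }
    float sum = 0.0f;
    for (float& v : scores) {
        v = std::exp(v - max_score);  // argument is always <= 0
        sum += v;
    }
    // sum >= 1 because the maximum element contributes exp(0) == 1.
    for (float& v : scores) v /= sum;
}
```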

### 3. Testing (tests/test_turbo.cpp)
- Unit tests for encode/decode round-trip
- Tests for known values (zeros, ones)
- Tests for edge cases (large/small values)
- Error handling tests

### 4. Build System (CMakeLists.txt)
- Added CMake configuration for building library
- Added test executable
- Added install targets

### 5. Documentation (README.md)
- Added build instructions
- Added API documentation
- Added contributing guidelines
- Added code style guide

## Key Improvements

1. **Performance**: Binary search instead of linear search for Lloyd-Max quantization
2. **Memory**: Stack allocation for common case (d=128)
3. **Reliability**: Input validation and error handling
4. **Metal Integration**: Complete fused attention implementation
5. **Testing**: Unit tests for correctness verification
6. **Documentation**: Contributor guidelines and API docs
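
One way to keep the common case off the heap is to use a fixed-size stack buffer when the dimension fits and fall back to a heap buffer otherwise. The helper below is a hypothetical sketch (`with_scratch` and `kStackDim` are not names from the commit), written in C++11 to match the project's stated standard.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Assumed threshold: the common head dimension named in the commit message.
constexpr int kStackDim = 128;

// Calls body(scratch) with a float scratch buffer holding at least d elements.
// The buffer lives on the stack when d <= kStackDim, on the heap otherwise.
template <typename F>
void with_scratch(int d, F&& body) {
    assert(d > 0);
    if (d <= kStackDim) {
        float stack_buf[kStackDim];
        body(stack_buf);
    } else {
        std::vector<float> heap_buf(static_cast<std::size_t>(d));
        body(heap_buf.data());
    }
}
```

The buffer is only valid inside the callback, so callers must copy out anything they need before returning.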

## Next Steps

1. Run benchmarks to verify performance improvements
2. Test with actual models (qwen3.5:27b)
3. Integrate with llama.cpp fork
4. Deploy to production

Closes #17

2026-04-14 22:07:21 -04:00


TurboQuant

KV cache compression for local inference on an M4 Max MacBook Pro.

What

TurboQuant (Google, ICLR 2026) is a three-stage KV cache compression method:

PolarQuant — WHT rotation + polar coordinates + Lloyd-Max codebook (~4.2x compression)
QJL — 1-bit quantized Johnson-Lindenstrauss residual correction
TurboQuant — PolarQuant + QJL = ~3.5 bits/channel, zero accuracy loss
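
The WHT rotation in the first stage can be sketched as an in-place fast Walsh-Hadamard transform. This is an illustrative CPU version with orthonormal scaling, not the project's implementation; the function name `fwht` is an assumption, and any random sign flips applied before the transform are omitted.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// In-place fast Walsh-Hadamard transform with 1/sqrt(n) scaling,
// which makes the transform orthonormal and its own inverse.
void fwht(std::vector<float>& x) {
    const std::size_t n = x.size();
    assert(n != 0 && (n & (n - 1)) == 0);  // length must be a power of two
    for (std::size_t h = 1; h < n; h <<= 1) {
        for (std::size_t i = 0; i < n; i += 2 * h) {
            for (std::size_t j = i; j < i + h; ++j) {
                const float a = x[j];
                const float b = x[j + h];
                x[j] = a + b;      // butterfly: sum
                x[j + h] = a - b;  // butterfly: difference
            }
        }
    }
    const float scale = 1.0f / std::sqrt(static_cast<float>(n));
    for (float& v : x) v *= scale;
}
```

The cost is O(n log n) additions, and with this scaling the transform preserves the L2 norm, so the radius can be computed either before or after the rotation.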

Why

Unlock 64K-128K context on qwen3.5:27b within 32GB unified memory. The premise is that a 27B model with 128K context under TurboQuant is more useful than a 72B model at Q2 limited to 8K context.
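
The memory argument can be made concrete with a back-of-the-envelope estimate. The sketch below assumes a hypothetical model shape; `KvShape`, `kv_cache_gib`, and the numbers in the usage note are illustrative and are not the actual qwen3.5:27b configuration. It also ignores per-vector overhead such as stored norms.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical description of a model's KV cache layout.
struct KvShape {
    std::int64_t layers;    // number of transformer layers
    std::int64_t kv_heads;  // number of key/value heads per layer
    std::int64_t head_dim;  // channels per head
};

// KV cache size in GiB:
//   channels = 2 (K and V) * layers * kv_heads * head_dim * context_tokens
//   bytes    = channels * bits_per_channel / 8
double kv_cache_gib(const KvShape& s, std::int64_t context_tokens,
                    double bits_per_channel) {
    assert(s.layers > 0 && s.kv_heads > 0 && s.head_dim > 0);
    assert(context_tokens > 0 && bits_per_channel > 0.0);
    const double channels =
        2.0 * s.layers * s.kv_heads * s.head_dim * context_tokens;
    return channels * bits_per_channel / 8.0 / (1024.0 * 1024.0 * 1024.0);
}
```

For an assumed shape of 64 layers, 8 KV heads, and head dimension 128, a 131072-token context needs 32 GiB at 16 bits per channel and 7 GiB at 3.5 bits per channel.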

Status

See issues for current progress.

Building

Prerequisites

CMake 3.10+
C++11 compiler
Xcode Command Line Tools (for Metal on macOS)

Build Instructions

# Clone the repository
git clone https://forge.alexanderwhitestone.com/Timmy_Foundation/turboquant.git
cd turboquant

# Build with CMake
cmake -B build -DCMAKE_BUILD_TYPE=Release
cmake --build build

# Run tests
cd build && ctest

Integration with llama.cpp

See PR-IMPLEMENTATION-PLAN.md for integration steps.

API

CPU Reference Implementation

#include <cstdint>  // uint8_t

// Encode: Compress float vector to 4-bit packed representation
void polar_quant_encode_turbo4(
    const float* src,    // Input: float array [d]
    uint8_t* dst,        // Output: packed 4-bit indices [d/2]
    float* norm,         // Output: L2 norm (radius)
    int d                // Dimension (must be power of 2, e.g., 128)
);

// Decode: Decompress 4-bit packed representation to float vector
void polar_quant_decode_turbo4(
    const uint8_t* src,  // Input: packed 4-bit indices [d/2]
    float* dst,          // Output: float array [d]
    float norm,          // Input: L2 norm (radius)
    int d                // Dimension (must be power of 2, e.g., 128)
);
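
The `[d/2]` buffer size follows from storing two 4-bit indices per byte. The sketch below shows one possible packing; the nibble order (even element in the low nibble) and the helper names `pack_nibbles` and `unpack_nibbles` are assumptions, so check llama-turbo.cpp for the layout actually used.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Pack d 4-bit indices into d/2 bytes.
// Assumed layout: element 2i in the low nibble, element 2i+1 in the high nibble.
std::vector<std::uint8_t> pack_nibbles(const std::vector<std::uint8_t>& idx) {
    assert(idx.size() % 2 == 0);
    std::vector<std::uint8_t> packed(idx.size() / 2);
    for (std::size_t i = 0; i < packed.size(); ++i) {
        assert(idx[2 * i] < 16 && idx[2 * i + 1] < 16);  // must fit in 4 bits
        packed[i] = static_cast<std::uint8_t>(idx[2 * i] | (idx[2 * i + 1] << 4));
    }
    return packed;
}

// Inverse of pack_nibbles: expand d/2 bytes back into d indices.
std::vector<std::uint8_t> unpack_nibbles(const std::vector<std::uint8_t>& packed) {
    std::vector<std::uint8_t> idx(packed.size() * 2);
    for (std::size_t i = 0; i < packed.size(); ++i) {
        idx[2 * i] = static_cast<std::uint8_t>(packed[i] & 0x0F);
        idx[2 * i + 1] = static_cast<std::uint8_t>(packed[i] >> 4);
    }
    return idx;
}
```

With d = 128 this gives 64 bytes of indices per vector, plus the separately stored norm.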

Metal Shaders

See ggml-metal-turbo.metal for GPU-accelerated kernels:

kernel_fwht_128: Fast Walsh-Hadamard Transform
kernel_turbo4_dequant: Dequantization for attention
kernel_attention_turbo4: Fused attention computation
kernel_attention_turbo4_softmax: Fused attention with softmax
kernel_turbo4_encode: Encoding on GPU

Contributing

Getting Started

Fork the repository
Create a feature branch: git checkout -b feature/your-feature
Make your changes
Add tests for new functionality
Run the test suite: cd build && ctest
Submit a pull request

Code Style

C++11 standard
4-space indentation
snake_case for functions and variables
UPPER_CASE for constants
Add comments for complex algorithms

Testing

All new code must have unit tests
Run tests before submitting PR: cd build && ctest
Test on both CPU and Metal (if applicable)

Pull Request Process

Update documentation if needed
Add tests for new functionality
Ensure all tests pass
Request review from maintainers

Issues

Use issue templates when available
Tag issues appropriately (bug, enhancement, documentation)
Include reproduction steps for bugs
For performance issues, include benchmark results

Roles

Strago: Build spec author
Cid: Implementation, benchmarks, deployment
Locke: Research support, upstream watch
John: Quality review
Frankie: Coordination

Source Repos

TheTom/llama-cpp-turboquant — llama.cpp fork with Metal
TheTom/turboquant_plus — Reference impl, 511+ tests
amirzandieh/QJL — Author QJL code (CUDA)
rachittshah/mlx-turboquant — MLX fallback

Docs

BUILD-SPEC.md — Full build specification (Strago, v2.2)
docs/PROJECT_STATUS.md — Current project status
docs/INITIATIVE_REVIEW.md — Initiative review and feedback
