turboquant

Timmy_Foundation/turboquant

Fork 0

Commit Graph

Author	SHA1	Message	Date
Alexander Whitestone	d2ef914edd	feat: Comprehensive review and improvements for TurboQuant (#17 ) All checks were successful Smoke Test / smoke (pull_request) Successful in 24s Details This commit addresses issue #17 by providing a comprehensive review of the TurboQuant initiative and implementing key improvements. ## Changes ### 1. Initiative Review (docs/INITIATIVE_REVIEW.md) - Comprehensive assessment of current state - Code quality findings and recommendations - Contributor feedback for @manus, @Timmy, @Rockachopa - Implementation plan with clear milestones ### 2. Code Improvements #### llama-turbo.cpp - Added input validation with assertions - Optimized Lloyd-Max search with binary search (O(log n) vs O(n)) - Added stack allocation for d=128 (avoids heap allocation in hot path) - Added error handling for edge cases - Added decision boundaries for efficient quantization #### ggml-metal-turbo.metal - Added bounds checking to all kernels - Added NaN/Inf handling for numerical stability - Completed fused attention kernel (was stub) - Added fused attention with softmax kernel - Added Metal encoding kernel for completeness - Added binary search for quantization ### 3. Testing (tests/test_turbo.cpp) - Unit tests for encode/decode round-trip - Tests for known values (zeros, ones) - Tests for edge cases (large/small values) - Error handling tests ### 4. Build System (CMakeLists.txt) - Added CMake configuration for building library - Added test executable - Added install targets ### 5. Documentation (README.md) - Added build instructions - Added API documentation - Added contributing guidelines - Added code style guide ## Key Improvements 1. Performance: Binary search instead of linear search for Lloyd-Max quantization 2. Memory: Stack allocation for common case (d=128) 3. Reliability: Input validation and error handling 4. Metal Integration: Complete fused attention implementation 5. Testing: Unit tests for correctness verification 6. Documentation: Contributor guidelines and API docs ## Next Steps 1. Run benchmarks to verify performance improvements 2. Test with actual models (qwen3.5:27b) 3. Integrate with llama.cpp fork 4. Deploy to production Closes #17	2026-04-14 22:07:21 -04:00
Timmy	cefaa6e778	Add build spec v2.2 and README TurboQuant KV cache compression for M4 Max local inference. Spec by Strago, triaged into 16 issues across 4 phases. Ref #1	2026-03-30 13:11:45 -04:00
Timmy Time	0b62c72737	Initial commit	2026-03-30 17:08:45 +00:00

Author

SHA1

Message

Date

Alexander Whitestone

d2ef914edd

feat: Comprehensive review and improvements for TurboQuant (#17 )

Smoke Test / smoke (pull_request) Successful in 24s

Details

This commit addresses issue #17 by providing a comprehensive review
of the TurboQuant initiative and implementing key improvements.

## Changes

### 1. Initiative Review (docs/INITIATIVE_REVIEW.md)
- Comprehensive assessment of current state
- Code quality findings and recommendations
- Contributor feedback for @manus, @Timmy, @Rockachopa
- Implementation plan with clear milestones

### 2. Code Improvements

#### llama-turbo.cpp
- Added input validation with assertions
- Optimized Lloyd-Max search with binary search (O(log n) vs O(n))
- Added stack allocation for d=128 (avoids heap allocation in hot path)
- Added error handling for edge cases
- Added decision boundaries for efficient quantization

#### ggml-metal-turbo.metal
- Added bounds checking to all kernels
- Added NaN/Inf handling for numerical stability
- Completed fused attention kernel (was stub)
- Added fused attention with softmax kernel
- Added Metal encoding kernel for completeness
- Added binary search for quantization

### 3. Testing (tests/test_turbo.cpp)
- Unit tests for encode/decode round-trip
- Tests for known values (zeros, ones)
- Tests for edge cases (large/small values)
- Error handling tests

### 4. Build System (CMakeLists.txt)
- Added CMake configuration for building library
- Added test executable
- Added install targets

### 5. Documentation (README.md)
- Added build instructions
- Added API documentation
- Added contributing guidelines
- Added code style guide

## Key Improvements

1. **Performance**: Binary search instead of linear search for Lloyd-Max quantization
2. **Memory**: Stack allocation for common case (d=128)
3. **Reliability**: Input validation and error handling
4. **Metal Integration**: Complete fused attention implementation
5. **Testing**: Unit tests for correctness verification
6. **Documentation**: Contributor guidelines and API docs

## Next Steps

1. Run benchmarks to verify performance improvements
2. Test with actual models (qwen3.5:27b)
3. Integrate with llama.cpp fork
4. Deploy to production

Closes #17

2026-04-14 22:07:21 -04:00

Timmy

cefaa6e778

Add build spec v2.2 and README

TurboQuant KV cache compression for M4 Max local inference.
Spec by Strago, triaged into 16 issues across 4 phases.

Ref #1

2026-03-30 13:11:45 -04:00

Timmy Time

0b62c72737

Initial commit

2026-03-30 17:08:45 +00:00

3 Commits