Adds a comprehensive regression test suite for TurboQuant-compressed models
to verify that Hermes tool-calling functionality remains intact after quantization.
- New test: tests/tool_call_regression.py
* Schema contract tests for 5 core tools (read_file, web_search,
terminal, execute_code, delegate_task)
* Parallel tool calling validation
* Profile configuration validation (TurboQuant settings, server flags)
* Live integration tests (skipped unless TURBOQUANT_SERVER_URL set)
* Results matrix generator (benchmarks/tool-call-regression.md)
* Enforces 95% accuracy threshold via pytest assertion
- New results matrix: benchmarks/tool-call-regression.md
* Markdown table logging model/preset/accuracy/per-tool results
* Auto-updates when tests run with --generate-matrix
- CI gate: .gitea/workflows/smoke.yml
* Runs tool call regression suite on every push/PR
* Live tests fail the pipeline if accuracy drops below 95%
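A schema contract check of the kind described above can be sketched as follows. The tool argument names, helper functions, and sample call are illustrative assumptions, not the actual contents of tests/tool_call_regression.py.

```python
# Hypothetical schema contract check for the five core tools.
# Argument names below are assumptions for illustration only.
CORE_TOOLS = {
    "read_file": {"path"},
    "web_search": {"query"},
    "terminal": {"command"},
    "execute_code": {"code"},
    "delegate_task": {"task"},
}

def validate_tool_call(call: dict) -> bool:
    """True if a parsed tool call names a known tool and supplies
    exactly that tool's required arguments."""
    tool = call.get("name")
    if tool not in CORE_TOOLS:
        return False
    return set(call.get("arguments", {})) == CORE_TOOLS[tool]

def accuracy(calls: list[dict]) -> float:
    """Fraction of parsed tool calls that pass the schema contract."""
    if not calls:
        return 0.0
    return sum(validate_tool_call(c) for c in calls) / len(calls)

# Mirrors the 95% gate enforced via a pytest assertion.
sample = [{"name": "read_file", "arguments": {"path": "README"}}]
assert accuracy(sample) >= 0.95
```

In the real suite the calls would come from parsing model output; here a hand-built sample stands in for that step.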
Closes #96
TurboQuant
KV cache compression for local inference on M4 Max MacBook Pro.
What
TurboQuant (Google, ICLR 2026) is a three-stage KV cache compression method:
- PolarQuant — WHT rotation + polar coordinates + Lloyd-Max codebook (~4.2x compression)
- QJL — 1-bit quantized Johnson-Lindenstrauss residual correction
- TurboQuant — PolarQuant + QJL = ~3.5 bits/channel, zero accuracy loss
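The rotate-then-polar-quantize idea behind the PolarQuant stage can be sketched in NumPy. This is a toy sketch, not the paper's method: uniform angle bins stand in for the Lloyd-Max codebook, the radius is kept at full precision, and the QJL residual correction is omitted; all shapes and bit widths are illustrative.

```python
import numpy as np

def wht(x):
    """Orthonormal fast Walsh-Hadamard transform (len must be a power of 2).
    Orthonormal WHT is its own inverse."""
    h = x.astype(np.float64).copy()
    n = len(h)
    step = 1
    while step < n:
        for i in range(0, n, step * 2):
            a = h[i:i + step].copy()
            b = h[i + step:i + step * 2].copy()
            h[i:i + step] = a + b
            h[i + step:i + step * 2] = a - b
        step *= 2
    return h / np.sqrt(n)

def polar_quantize(x, angle_bits=3):
    """Rotate with the WHT, pair up channels, keep each pair's radius
    plus a coarsely quantized angle (uniform bins, not Lloyd-Max)."""
    pairs = wht(x).reshape(-1, 2)
    radius = np.hypot(pairs[:, 0], pairs[:, 1])
    theta = np.arctan2(pairs[:, 1], pairs[:, 0])
    levels = 2 ** angle_bits
    codes = np.round((theta + np.pi) / (2 * np.pi) * (levels - 1)).astype(int)
    return radius, codes

def polar_dequantize(radius, codes, angle_bits=3):
    """Reconstruct angles from codes, rebuild pairs, undo the rotation."""
    levels = 2 ** angle_bits
    theta = codes / (levels - 1) * 2 * np.pi - np.pi
    pairs = np.stack([radius * np.cos(theta), radius * np.sin(theta)], axis=1)
    return wht(pairs.reshape(-1))

x = np.random.default_rng(0).standard_normal(16)
radius, codes = polar_quantize(x)
x_hat = polar_dequantize(radius, codes)
```

More angle bits shrink the reconstruction error roughly in proportion to the bin width, which is where the learned Lloyd-Max codebook earns its keep over uniform bins.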
Why
Unlock 64K-128K context on qwen3.5:27b within 32GB unified memory. A 27B model at 128K context with TurboQuant beats a 72B at Q2 with 8K context.
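The memory argument can be made concrete with back-of-envelope KV cache sizing. The architecture numbers below (layers, KV heads, head dimension) are placeholder assumptions for illustration, not the actual qwen3.5:27b configuration; only the 16-bit vs ~3.5-bit comparison comes from the text above.

```python
def kv_cache_gib(seq_len, n_layers=62, n_kv_heads=8, head_dim=128,
                 bits_per_channel=16):
    """KV cache size in GiB: keys + values (factor of 2), across all
    layers and KV heads, at the given per-channel bit width.
    Architecture defaults are illustrative assumptions."""
    total_bits = 2 * n_layers * n_kv_heads * head_dim * seq_len * bits_per_channel
    return total_bits / 8 / 2**30

fp16 = kv_cache_gib(128_000, bits_per_channel=16)
tq = kv_cache_gib(128_000, bits_per_channel=3.5)
print(f"128K context KV cache: FP16 {fp16:.1f} GiB vs TurboQuant {tq:.1f} GiB")
```

Whatever the exact architecture, the cache shrinks by the same 16 / 3.5 ≈ 4.6x factor, which is what pulls a 128K context inside a 32GB unified-memory budget.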
Status
See issues for current progress.
Roles
- Strago: Build spec author
- Cid: Implementation, benchmarks, deployment
- Locke: Research support, upstream watch
- John: Quality review
- Frankie: Coordination
Source Repos
- TheTom/llama-cpp-turboquant — llama.cpp fork with Metal support
- TheTom/turboquant_plus — Reference impl, 511+ tests
- amirzandieh/QJL — Authors' QJL code (CUDA)
- rachittshah/mlx-turboquant — MLX fallback
Docs
- Project Status — Full project status and build specification
Languages
- Python 90.5%
- C++ 6.2%
- Metal 2.4%
- CMake 0.9%