- Add benchmarks/m1_mac_benchmark.py: orchestrates benchmark of all three presets (k8v4, 4bit_nc, 3bit_nc) on Apple Silicon via llama-server or vllm; measures tokens/sec (throughput), peak memory (RSS), quality via GSM8K subset (evaluator), and tool-call accuracy.
- Add benchmarks/m1-mac-template.md: scaffold results markdown to be filled by the script; includes hardware detection, table, and recommendation.
- Add tests/test_m1_benchmark.py: unit tests for preset definitions, quality evaluators, and markdown generation.

Acceptance #94:
- [x] Results table with preset × tokens/sec × peak_memory × GSM8K_score × tool_call_accuracy
- [x] Output saved to benchmarks/m1-mac-YYYY-MM-DD.md (generated by script)
- [x] Recommendation format (script generates a default after running); template supplied.

The benchmark requires llama-server running locally (or vllm) and Gemma 4 model. It is not executed during CI; only smoke tests validate importability and logic.
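To make the acceptance criteria concrete, here is a minimal sketch of how a results table and dated output path could be produced. It is not the code in benchmarks/m1_mac_benchmark.py: the function names `render_results_table` and `output_path`, the column keys, and the megabyte unit for peak memory are all assumptions for illustration. Only the preset names, the column set, and the filename pattern come from the text above.

```python
from datetime import date

PRESETS = ("k8v4", "4bit_nc", "3bit_nc")  # preset names from the commit message
# Column keys are hypothetical; the unit for peak memory (MB) is an assumption.
COLUMNS = ("tokens_per_sec", "peak_memory_mb", "gsm8k_score", "tool_call_accuracy")


def render_results_table(results):
    """Render one markdown row per preset; missing metrics show as 'n/a'."""
    header = "| preset | " + " | ".join(COLUMNS) + " |"
    divider = "|" + " --- |" * (len(COLUMNS) + 1)
    lines = [header, divider]
    for preset in PRESETS:
        metrics = results.get(preset, {})
        cells = [str(metrics.get(col, "n/a")) for col in COLUMNS]
        lines.append(f"| {preset} | " + " | ".join(cells) + " |")
    return "\n".join(lines)


def output_path(day):
    """Build the dated results path named in the acceptance criteria."""
    return f"benchmarks/m1-mac-{day.isoformat()}.md"


table = render_results_table({})      # no measurements yet, so every cell is 'n/a'
path = output_path(date(2026, 1, 1))  # arbitrary example date
print(table)
print(path)
```

Rendering unmeasured cells as `n/a` keeps the table shape stable when a preset fails to run, which is useful because the benchmark is not executed in CI.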
TurboQuant
KV cache compression for local inference on an M4 Max MacBook Pro.
What
TurboQuant (Google, ICLR 2026) is a KV cache compression method that combines two stages:
- PolarQuant: WHT rotation + polar coordinates + Lloyd-Max codebook (~4.2x compression)
- QJL: 1-bit quantized Johnson-Lindenstrauss residual correction
- TurboQuant: PolarQuant + QJL = ~3.5 bits/channel, with no accuracy loss reported by the authors
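The rotation step named in the list above can be illustrated with a small, dependency-free sketch. It covers only a randomized Walsh-Hadamard transform (WHT); the polar-coordinate step, the Lloyd-Max codebook, and the QJL correction are not shown. The random sign flips and the orthonormal scaling are assumptions about how such a rotation is usually set up, not details taken from the TurboQuant code, and the helper names are hypothetical.

```python
import math
import random


def fwht(vec):
    """Orthonormal fast Walsh-Hadamard transform; len(vec) must be a power of two."""
    n = len(vec)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    out = list(vec)
    h = 1
    while h < n:
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = out[i], out[i + h]
                out[i], out[i + h] = a + b, a - b  # butterfly step
        h *= 2
    scale = 1.0 / math.sqrt(n)
    return [x * scale for x in out]


def rotate(vec, signs):
    """Flip signs, then apply the WHT (one randomized orthogonal rotation)."""
    return fwht([s * x for s, x in zip(signs, vec)])


def unrotate(rot, signs):
    """Invert rotate(): the orthonormal WHT is its own inverse."""
    return [s * x for s, x in zip(signs, fwht(rot))]


rng = random.Random(0)
dim = 8
signs = [rng.choice((-1.0, 1.0)) for _ in range(dim)]
key = [4.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]  # a single outlier channel
rotated = rotate(key, signs)
restored = unrotate(rotated, signs)
print([round(abs(x), 3) for x in rotated])
```

After the rotation, the energy of the single outlier is spread evenly across all eight channels while the vector norm is unchanged. A flatter distribution like this is generally easier to quantize with a small fixed codebook, which is the usual motivation for rotating before quantizing.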
Why
Unlock 64K-128K context on qwen3.5:27b within 32GB of unified memory. The bet is that a 27B model at 128K context with TurboQuant is more useful than a 72B model at Q2 limited to 8K context.
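A rough back-of-the-envelope calculation shows why the bit width matters at long context. The model shape below (64 layers, 8 KV heads, head dimension 128) is a hypothetical example, not the actual qwen3.5:27b configuration, and the estimate assumes every layer uses full attention and ignores codebook or scale overhead.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, context_len, bits_per_channel):
    """Bytes needed to hold keys and values for one sequence."""
    channels_per_token = 2 * n_layers * n_kv_heads * head_dim  # 2 = keys + values
    return channels_per_token * context_len * bits_per_channel / 8


# Hypothetical model shape, NOT taken from qwen3.5:27b.
shape = dict(n_layers=64, n_kv_heads=8, head_dim=128)
ctx = 128 * 1024  # 128K tokens
GIB = 1024 ** 3

fp16 = kv_cache_bytes(**shape, context_len=ctx, bits_per_channel=16)
tq = kv_cache_bytes(**shape, context_len=ctx, bits_per_channel=3.5)

print(f"fp16 KV cache:    {fp16 / GIB:.1f} GiB")  # 32.0 GiB
print(f"3.5-bit KV cache: {tq / GIB:.1f} GiB")    # 7.0 GiB
```

Under these assumptions, an fp16 cache alone would consume the whole 32GB budget at 128K context, while a 3.5 bits/channel cache leaves room for the model weights. Real numbers depend on the actual layer count, KV head layout, and weight quantization.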
Status
See issues for current progress.
Roles
- Strago: Build spec author
- Cid: Implementation, benchmarks, deployment
- Locke: Research support, upstream watch
- John: Quality review
- Frankie: Coordination
Source Repos
- TheTom/llama-cpp-turboquant — llama.cpp fork with Metal
- TheTom/turboquant_plus — Reference impl, 511+ tests
- amirzandieh/QJL — Author QJL code (CUDA)
- rachittshah/mlx-turboquant — MLX fallback
Docs
- Project Status — Full project status and build specification
Languages
- Python: 90.5%
- C++: 6.2%
- Metal: 2.4%
- CMake: 0.9%