- Add benchmarks/m1_mac_benchmark.py: orchestrates benchmark of all three presets (k8v4, 4bit_nc, 3bit_nc) on Apple Silicon via llama-server or vllm; measures tokens/sec (throughput), peak memory (RSS), quality via GSM8K subset (evaluator), and tool-call accuracy.
- Add benchmarks/m1-mac-template.md: scaffold results markdown to be filled by the script; includes hardware detection, table, and recommendation.
- Add tests/test_m1_benchmark.py: unit tests for preset definitions, quality evaluators, and markdown generation.

Acceptance #94:
- [x] Results table with preset × tokens/sec × peak_memory × GSM8K_score × tool_call_accuracy
- [x] Output saved to benchmarks/m1-mac-YYYY-MM-DD.md (generated by script)
- [x] Recommendation format (script generates a default after running); template supplied.

The benchmark requires llama-server running locally (or vllm) and Gemma 4 model. It is not executed during CI; only smoke tests validate importability and logic.
TurboQuant M1 Mac Benchmark — 2026-04-15
Status: Template — run benchmarks/m1_mac_benchmark.py on M1 Mac to populate.
Issue: #94
Hardware
| Spec | Value |
|---|---|
| Chip | Apple M1 (or M1 Pro/Max/Ultra) |
| Memory | 8/16/32/64/128 GB unified |
| P-cores | 4/6/8/16 |
| E-cores | 2/4 |
| GPU cores | 7/8/14/16/24/32/48/64 |
| macOS | 14.x |
Results
| Preset | KV Type | Bits/ch | Compression | Avg tok/s | Peak Memory | GSM8K | Tool Call |
|---|---|---|---|---|---|---|---|
| turboquant_k8v4 | turbo4 | 3.5 | 4.2x | TBD | TBD | TBD | TBD |
| turboquant_4bit_nc | q4_0 | 4.0 | 3.5x | TBD | TBD | TBD | TBD |
| turboquant_3bit_nc | q3_k | 3.0 | 5.0x | TBD | TBD | TBD | TBD |
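As a rough illustration of how the TBD cells could be filled in, the sketch below renders one table row from a dictionary of measurements. The names `fmt` and `render_row` and the measurement keys (`tok_s`, `peak_mem_gb`, `gsm8k`, `tool_call`) are hypothetical and are not taken from benchmarks/m1_mac_benchmark.py; only the preset metadata is copied from the table above.

```python
from typing import Dict, Optional

# Preset metadata copied from the Results table: (preset, KV type, bits/ch, compression).
PRESETS = [
    ("turboquant_k8v4", "turbo4", 3.5, 4.2),
    ("turboquant_4bit_nc", "q4_0", 4.0, 3.5),
    ("turboquant_3bit_nc", "q3_k", 3.0, 5.0),
]


def fmt(value: Optional[float], pattern: str) -> str:
    """Format a measurement, or return 'TBD' if it has not been collected yet."""
    return "TBD" if value is None else pattern.format(value)


def render_row(preset: str, kv_type: str, bits: float, compression: float,
               measured: Dict[str, float]) -> str:
    """Render one markdown table row; missing measurements stay as 'TBD'."""
    cells = [
        preset,
        kv_type,
        f"{bits:.1f}",
        f"{compression:.1f}x",
        fmt(measured.get("tok_s"), "{:.1f}"),           # average tokens/sec
        fmt(measured.get("peak_mem_gb"), "{:.2f} GB"),  # peak RSS in GB
        fmt(measured.get("gsm8k"), "{:.1%}"),           # accuracy as a fraction
        fmt(measured.get("tool_call"), "{:.1%}"),       # accuracy as a fraction
    ]
    return "| " + " | ".join(cells) + " |"


if __name__ == "__main__":
    # With no measurements, the rows match the template above.
    for preset in PRESETS:
        print(render_row(*preset, measured={}))
```

Keeping unmeasured cells as `TBD` means a partially completed run still produces a valid table.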
How to Run
# 1. Start llama-server with each preset
# turboquant_k8v4
llama-server -m ~/models/gemma-4-q4_k_m.gguf --port 8081 -ctk turbo4 -ctv turbo4 -c 4096
# 2. Run benchmark
cd turboquant
python3 benchmarks/m1_mac_benchmark.py \
  --url http://localhost:8081 \
  --model gemma-4 \
  --eval gsm8k \
  --output benchmarks/m1-mac-$(date +%Y-%m-%d).md
# 3. Repeat for other presets (change -ctk/-ctv)
# turboquant_4bit_nc: -ctk q4_0 -ctv q4_0
# turboquant_3bit_nc: -ctk q3_k -ctv q3_k
# 4. Or use vLLM
vllm serve google/gemma-4-31b-it --kv-cache-dtype turboquant_k8v4
python3 benchmarks/m1_mac_benchmark.py --backend vllm --eval gsm8k
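Because the llama-server invocation differs between presets only in the `-ctk`/`-ctv` values, the per-preset commands can be generated rather than typed by hand. The following is a minimal sketch, not part of the benchmark script: `KV_TYPES` and `server_command` are hypothetical names, and the flags, model path, port, and context size are the ones shown in the steps above.

```python
import os
import shlex
from typing import List

# KV cache type per preset, as listed in steps 1 and 3 above.
KV_TYPES = {
    "turboquant_k8v4": "turbo4",
    "turboquant_4bit_nc": "q4_0",
    "turboquant_3bit_nc": "q3_k",
}


def server_command(preset: str,
                   model_path: str = "~/models/gemma-4-q4_k_m.gguf",
                   port: int = 8081,
                   ctx: int = 4096) -> List[str]:
    """Build the llama-server argument list for one preset."""
    kv = KV_TYPES[preset]
    return [
        "llama-server",
        "-m", os.path.expanduser(model_path),  # expand "~" ourselves; quoting would block it
        "--port", str(port),
        "-ctk", kv,
        "-ctv", kv,
        "-c", str(ctx),
    ]


if __name__ == "__main__":
    for preset in KV_TYPES:
        # shlex.join produces a copy-pasteable, shell-quoted command line.
        print(f"# {preset}")
        print(shlex.join(server_command(preset)))
```

The list form can also be passed directly to `subprocess.Popen` if the server is to be started from Python; this sketch only prints the commands.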
Recommendation
Default: TBD after benchmarks complete.
Decision criteria:
- If turboquant_k8v4 GSM8K ≥ turboquant_4bit_nc GSM8K: use turboquant_k8v4 (better compression at equal or better quality)
- If turboquant_3bit_nc GSM8K drops by more than 10%: do not use it as the default
- Memory headroom: model + KV cache must fit within 70% of unified memory
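The decision criteria above can be expressed as a few small checks. This is a sketch under stated assumptions, not the logic in benchmarks/m1_mac_benchmark.py: the function names are hypothetical, and since the criteria do not say what the 10% drop is measured against, the code assumes a relative drop against the turboquant_4bit_nc score. The numbers in the usage example are made up for illustration.

```python
from typing import Dict


def fits_in_memory(model_gb: float, kv_gb: float, unified_gb: float,
                   budget: float = 0.70) -> bool:
    """Memory headroom: model + KV cache must fit within 70% of unified memory."""
    return model_gb + kv_gb <= budget * unified_gb


def evaluate_criteria(gsm8k: Dict[str, float],
                      max_drop: float = 0.10) -> Dict[str, bool]:
    """Apply the two GSM8K criteria to per-preset scores (fractions in [0, 1])."""
    k8v4 = gsm8k["turboquant_k8v4"]
    q4 = gsm8k["turboquant_4bit_nc"]
    q3 = gsm8k["turboquant_3bit_nc"]

    # ASSUMPTION: the drop is relative and the baseline is turboquant_4bit_nc.
    drop = (q4 - q3) / q4 if q4 > 0 else 0.0

    return {
        # k8v4 is preferred when it matches or beats 4bit_nc on GSM8K.
        "prefer_k8v4": k8v4 >= q4,
        # 3bit_nc may be considered as a default only if the drop is at most 10%.
        "3bit_default_allowed": drop <= max_drop,
    }


if __name__ == "__main__":
    # Made-up scores, for illustration only.
    scores = {
        "turboquant_k8v4": 0.80,
        "turboquant_4bit_nc": 0.78,
        "turboquant_3bit_nc": 0.60,
    }
    print(evaluate_criteria(scores))
    print(fits_in_memory(model_gb=8.0, kv_gb=2.0, unified_gb=16.0))
```

If the intended baseline is an unquantized KV cache, or the threshold means 10 percentage points, only the `drop` line needs to change.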