Files
turboquant/benchmarks/m1-mac-template.md
STEP35 CLI 89bf027780
All checks were successful
Smoke Test / smoke (pull_request) Successful in 10s
4.10: M1 Mac benchmark suite for TurboQuant presets (closes #94)
- Add benchmarks/m1_mac_benchmark.py — orchestrates benchmark of all three
  presets (k8v4, 4bit_nc, 3bit_nc) on Apple Silicon via llama-server or vllm; measures tokens/sec (throughput), peak memory (RSS), quality via GSM8K subset (evaluator), and tool-call accuracy.
- Add benchmarks/m1-mac-template.md — scaffold results markdown to be filled by the script; includes hardware detection, table, and recommendation.
- Add tests/test_m1_benchmark.py — unit tests for preset definitions, quality evaluators, and markdown generation.

Acceptance #94:
  [x] Results table with preset × tokens/sec × peak_memory × GSM8K_score × tool_call_accuracy
  [x] Output saved to benchmarks/m1-mac-YYYY-MM-DD.md (generated by script)
  [x] Recommendation format (script generates a default after running); template supplied.

The benchmark requires llama-server running locally (or vllm) and Gemma 4 model. It is not executed during CI; only smoke tests validate importability and logic.
2026-04-26 07:13:23 -04:00

1.7 KiB

TurboQuant M1 Mac Benchmark — 2026-04-15

Status: Template — run benchmarks/m1_mac_benchmark.py on M1 Mac to populate. Issue: #94

Hardware

Spec Value
Chip Apple M1 (or M1 Pro/Max/Ultra)
Memory 8/16/32/64 GB unified
P-cores 4/6/8
E-cores 2
GPU cores 7/8/14/16/24/32
macOS 14.x

Results

Preset KV Type Bits/ch Compression Avg tok/s Peak Memory GSM8K Tool Call
turboquant_k8v4 turbo4 3.5 4.2x TBD TBD TBD TBD
turboquant_4bit_nc q4_0 4.0 3.5x TBD TBD TBD TBD
turboquant_3bit_nc q3_k 3.0 5.0x TBD TBD TBD TBD

How to Run

# 1. Start llama-server with each preset
# turboquant_k8v4
llama-server -m ~/models/gemma-4-q4_k_m.gguf --port 8081 -ctk turbo4 -ctv turbo4 -c 4096

# 2. Run benchmark
cd turboquant
python3 benchmarks/m1_mac_benchmark.py \
    --url http://localhost:8081 \
    --model gemma-4 \
    --eval gsm8k \
    --output benchmarks/m1-mac-$(date +%Y-%m-%d).md

# 3. Repeat for other presets (change -ctk/-ctv)
# turboquant_4bit_nc: -ctk q4_0 -ctv q4_0
# turboquant_3bit_nc: -ctk q3_k -ctv q3_k

# 4. Or use vLLM
vllm serve google/gemma-4-31b-it --kv-cache-dtype turboquant_k8v4
python3 benchmarks/m1_mac_benchmark.py --backend vllm --eval gsm8k

Recommendation

Default: TBD after benchmarks complete.

Decision criteria:

  • If turboquant_k8v4 GSM8K ≥ turboquant_4bit_nc GSM8K: use k8v4 (better compression, same quality)
  • If 3bit GSM8K drops >10%: don't use as default
  • Memory headroom: must fit model + KV within 70% of unified memory