Files
turboquant/benchmarks/m1-mac-template.md
Alexander Whitestone 8a5070dbf6
All checks were successful
Smoke Test / smoke (pull_request) Successful in 19s
docs: M1 Mac benchmark results template (#94)
2026-04-16 01:53:42 +00:00

1.7 KiB

TurboQuant M1 Mac Benchmark — 2026-04-15

Status: Template — run benchmarks/m1_mac_benchmark.py on M1 Mac to populate. Issue: #94

Hardware

Spec Value
Chip Apple M1 (or M1 Pro/Max/Ultra)
Memory 8/16/32/64 GB unified
P-cores 4/6/8
E-cores 2
GPU cores 7/8/14/16/24/32
macOS 14.x

Results

Preset KV Type Bits/ch Compression Avg tok/s Peak Memory GSM8K Tool Call
turboquant_k8v4 turbo4 3.5 4.2x TBD TBD TBD TBD
turboquant_4bit_nc q4_0 4.0 3.5x TBD TBD TBD TBD
turboquant_3bit_nc q3_k 3.0 5.0x TBD TBD TBD TBD

How to Run

# 1. Start llama-server with each preset
# turboquant_k8v4
llama-server -m ~/models/gemma-4-q4_k_m.gguf --port 8081 -ctk turbo4 -ctv turbo4 -c 4096

# 2. Run benchmark
cd turboquant
python3 benchmarks/m1_mac_benchmark.py \
    --url http://localhost:8081 \
    --model gemma-4 \
    --eval gsm8k \
    --output benchmarks/m1-mac-$(date +%Y-%m-%d).md

# 3. Repeat for other presets (change -ctk/-ctv)
# turboquant_4bit_nc: -ctk q4_0 -ctv q4_0
# turboquant_3bit_nc: -ctk q3_k -ctv q3_k

# 4. Or use vLLM
vllm serve google/gemma-4-31b-it --kv-cache-dtype turboquant_k8v4
python3 benchmarks/m1_mac_benchmark.py --backend vllm --eval gsm8k

Recommendation

Default: TBD after benchmarks complete.

Decision criteria:

  • If turboquant_k8v4 GSM8K ≥ turboquant_4bit_nc GSM8K: use k8v4 (better compression, same quality)
  • If 3bit GSM8K drops >10%: don't use as default
  • Memory headroom: must fit model + KV within 70% of unified memory