From 8a5070dbf6437c6cc0b9f1c3fe1e01bf6b63fef6 Mon Sep 17 00:00:00 2001
From: Alexander Whitestone
Date: Thu, 16 Apr 2026 01:53:42 +0000
Subject: [PATCH] docs: M1 Mac benchmark results template (#94)

---
 benchmarks/m1-mac-template.md | 56 +++++++++++++++++++++++++++++++++++
 1 file changed, 56 insertions(+)
 create mode 100644 benchmarks/m1-mac-template.md

diff --git a/benchmarks/m1-mac-template.md b/benchmarks/m1-mac-template.md
new file mode 100644
index 00000000..1617a01c
--- /dev/null
+++ b/benchmarks/m1-mac-template.md
@@ -0,0 +1,56 @@
+# TurboQuant M1 Mac Benchmark (2026-04-15)
+
+**Status:** Template. Run `benchmarks/m1_mac_benchmark.py` on an M1 Mac to populate.
+**Issue:** #94
+
+## Hardware
+
+| Spec | Value |
+|------|-------|
+| Chip | Apple M1 (or M1 Pro/Max/Ultra) |
+| Memory | 8/16/32/64/128 GB unified |
+| P-cores | 4/6/8/16 |
+| E-cores | 2/4 |
+| GPU cores | 7/8/14/16/24/32/48/64 |
+| macOS | 14.x |
+
+## Results
+
+| Preset | KV Type | Bits/ch | Compression | Avg tok/s | Peak Memory | GSM8K | Tool Call |
+|--------|---------|---------|-------------|-----------|-------------|-------|-----------|
+| turboquant_k8v4 | turbo4 | 3.5 | 4.2x | TBD | TBD | TBD | TBD |
+| turboquant_4bit_nc | q4_0 | 4.0 | 3.5x | TBD | TBD | TBD | TBD |
+| turboquant_3bit_nc | q3_k | 3.0 | 5.0x | TBD | TBD | TBD | TBD |
+
+## How to Run
+
+```bash
+# 1. Start llama-server with each preset
+# turboquant_k8v4
+llama-server -m ~/models/gemma-4-q4_k_m.gguf --port 8081 -ctk turbo4 -ctv turbo4 -c 4096
+
+# 2. Run benchmark
+cd turboquant
+python3 benchmarks/m1_mac_benchmark.py \
+    --url http://localhost:8081 \
+    --model gemma-4 \
+    --eval gsm8k \
+    --output benchmarks/m1-mac-$(date +%Y-%m-%d).md
+
+# 3. Repeat for other presets (change -ctk/-ctv)
+# turboquant_4bit_nc: -ctk q4_0 -ctv q4_0
+# turboquant_3bit_nc: -ctk q3_k -ctv q3_k
+
+# 4. Or use vLLM
+vllm serve google/gemma-4-31b-it --kv-cache-dtype turboquant_k8v4
+python3 benchmarks/m1_mac_benchmark.py --backend vllm --eval gsm8k
+```
+
+## Recommendation
+
+**Default:** TBD after benchmarks complete.
+
+Decision criteria:
+- If turboquant_k8v4 GSM8K ≥ turboquant_4bit_nc GSM8K: use k8v4 (better compression, same quality)
+- If turboquant_3bit_nc GSM8K drops >10%: don't use it as the default
+- Memory headroom: model + KV cache must fit within 70% of unified memory
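
The decision criteria above could be checked mechanically once the results table is filled in. The following is a minimal sketch, not part of the patch: `PresetResult`, `relative_drop`, and `check_criteria` are hypothetical names, the ">10%" rule is assumed to mean a relative drop against a baseline GSM8K score supplied by the caller (the template does not name the baseline), and the "Peak Memory" column is assumed to cover model weights plus KV cache.

```python
from dataclasses import dataclass


@dataclass
class PresetResult:
    """One filled-in row of the Results table (hypothetical field names)."""
    preset: str
    gsm8k: float           # GSM8K accuracy as a fraction, e.g. 0.62
    peak_memory_gb: float  # assumed to cover model weights + KV cache


def relative_drop(score: float, baseline: float) -> float:
    """Fractional drop of `score` below `baseline` (negative if it improved)."""
    return (baseline - score) / baseline


def check_criteria(k8v4: PresetResult, q4: PresetResult, q3: PresetResult,
                   baseline_gsm8k: float, unified_memory_gb: float) -> dict:
    """Evaluate the three decision criteria; does not pick a default by itself."""
    # Criterion 3: model + KV cache must fit within 70% of unified memory.
    budget_gb = 0.70 * unified_memory_gb
    return {
        # Criterion 1: k8v4 is preferred if it matches or beats 4-bit on GSM8K.
        "prefer_k8v4_over_4bit": k8v4.gsm8k >= q4.gsm8k,
        # Criterion 2: 3-bit is ruled out as default if GSM8K drops by more
        # than 10% (assumption: relative to the caller-supplied baseline).
        "3bit_allowed_as_default": relative_drop(q3.gsm8k, baseline_gsm8k) <= 0.10,
        "fits_memory": {
            r.preset: r.peak_memory_gb <= budget_gb for r in (k8v4, q4, q3)
        },
    }
```

Returning the individual checks rather than a single winner keeps the sketch within what the template states: the criteria rule presets in or out, while the final default is still marked TBD.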