docs: M1 Mac benchmark results template (#94)

2026-04-16 01:53:42 +00:00
parent 4efd2a6b48
commit 8a5070dbf6
1 changed files with 56 additions and 0 deletions
--- a/benchmarks/m1-mac-template.md
+++ b/benchmarks/m1-mac-template.md
@@ -0,0 +1,56 @@
+# TurboQuant M1 Mac Benchmark — 2026-04-15
+
+**Status:** Template. Run `benchmarks/m1_mac_benchmark.py` on an M1 Mac to populate.
+**Issue:** #94
+
+## Hardware
+
+| Spec | Value |
+|------|-------|
+| Chip | Apple M1 (or M1 Pro/Max/Ultra) |
+| Memory | 8/16/32/64/128 GB unified |
+| P-cores | 4/6/8/16 |
+| E-cores | 2/4 |
+| GPU cores | 7/8/14/16/24/32/48/64 |
+| macOS | 14.x |
+
+## Results
+
+| Preset | KV Type | Bits/ch | Compression | Avg tok/s | Peak Memory | GSM8K | Tool Call |
+|--------|---------|---------|-------------|-----------|-------------|-------|-----------|
+| turboquant_k8v4 | turbo4 | 3.5 | 4.2x | TBD | TBD | TBD | TBD |
+| turboquant_4bit_nc | q4_0 | 4.0 | 3.5x | TBD | TBD | TBD | TBD |
+| turboquant_3bit_nc | q3_k | 3.0 | 5.0x | TBD | TBD | TBD | TBD |
+
+## How to Run
+
+```bash
+# 1. Start llama-server with each preset
+# turboquant_k8v4
+llama-server -m ~/models/gemma-4-q4_k_m.gguf --port 8081 -ctk turbo4 -ctv turbo4 -c 4096
+
+# 2. Run benchmark
+cd turboquant
+python3 benchmarks/m1_mac_benchmark.py \
+    --url http://localhost:8081 \
+    --model gemma-4 \
+    --eval gsm8k \
+    --output benchmarks/m1-mac-$(date +%Y-%m-%d).md
+
+# 3. Repeat for other presets (change -ctk/-ctv)
+# turboquant_4bit_nc: -ctk q4_0 -ctv q4_0
+# turboquant_3bit_nc: -ctk q3_k -ctv q3_k
+
+# 4. Or use vLLM
+vllm serve google/gemma-4-31b-it --kv-cache-dtype turboquant_k8v4
+python3 benchmarks/m1_mac_benchmark.py --backend vllm --eval gsm8k
+```
+
+## Recommendation
+
+**Default:** TBD after benchmarks complete.
+
+Decision criteria:
+- If turboquant_k8v4 GSM8K ≥ turboquant_4bit_nc GSM8K: use turboquant_k8v4 (better compression, equal or better quality)
+- If turboquant_3bit_nc GSM8K drops >10%: don't use it as the default
+- Memory headroom: must fit model + KV within 70% of unified memory
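
The decision criteria above can be checked mechanically once the TBD cells are filled in. The following is a minimal sketch, not part of `benchmarks/m1_mac_benchmark.py`: the names `PresetResult` and `check_criteria` are hypothetical, and the template does not say what the 3-bit drop is measured against, so the sketch takes a caller-supplied `reference_gsm8k` and assumes the ">10%" is a relative drop. It only reports whether each criterion holds and does not choose a default.

```python
from dataclasses import dataclass


@dataclass
class PresetResult:
    """One filled-in row of the results table (hypothetical container)."""
    name: str
    gsm8k: float           # GSM8K accuracy as a fraction in [0, 1]
    peak_memory_gb: float  # peak memory for model + KV cache, in GB


def relative_drop(score: float, reference: float) -> float:
    """Fractional drop of `score` below `reference` (assumed interpretation)."""
    return (reference - score) / reference


def fits_memory(result: PresetResult, unified_memory_gb: float,
                budget: float = 0.70) -> bool:
    """True if peak memory stays within `budget` of unified memory."""
    return result.peak_memory_gb <= budget * unified_memory_gb


def check_criteria(k8v4: PresetResult, q4: PresetResult, q3: PresetResult,
                   reference_gsm8k: float, unified_memory_gb: float) -> dict:
    """Evaluate the three criteria from the template, one key per criterion."""
    return {
        # Criterion 1: k8v4 is preferred when its GSM8K is at least the 4-bit score.
        "prefer_k8v4_over_4bit": k8v4.gsm8k >= q4.gsm8k,
        # Criterion 2: 3-bit is ruled out as default if it drops more than 10%.
        "3bit_allowed_as_default": relative_drop(q3.gsm8k, reference_gsm8k) <= 0.10,
        # Criterion 3: model + KV must fit within 70% of unified memory.
        "fits_memory": {
            r.name: fits_memory(r, unified_memory_gb) for r in (k8v4, q4, q3)
        },
    }


if __name__ == "__main__":
    # Placeholder numbers for illustration only; these are NOT benchmark results.
    report = check_criteria(
        PresetResult("turboquant_k8v4", 0.80, 9.0),
        PresetResult("turboquant_4bit_nc", 0.79, 9.5),
        PresetResult("turboquant_3bit_nc", 0.70, 12.0),
        reference_gsm8k=0.80,
        unified_memory_gb=16.0,
    )
    print(report)
```

Keeping the three checks as separate report fields leaves the final choice of default to whoever fills in the template, since the criteria do not say what to pick when several presets pass.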