TurboQuant M1 Mac Benchmark — 2026-04-15

Status: Template — run benchmarks/m1_mac_benchmark.py on M1 Mac to populate. Issue: #94

Hardware

Spec	Value
Chip	Apple M1 (or M1 Pro/Max/Ultra)
Memory	8/16/32/64/128 GB unified
P-cores	4/6/8/16
E-cores	2/4
GPU cores	7/8/14/16/24/32/48/64
macOS	14.x

Results

Preset	KV Type	Bits/ch	Compression	Avg tok/s	Peak Memory	GSM8K	Tool Call
turboquant_k8v4	turbo4	3.5	4.2x	TBD	TBD	TBD	TBD
turboquant_4bit_nc	q4_0	4.0	3.5x	TBD	TBD	TBD	TBD
turboquant_3bit_nc	q3_k	3.0	5.0x	TBD	TBD	TBD	TBD
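
To relate the Bits/ch column to expected KV cache memory, the following is a rough sketch, not part of the benchmark script. The model dimensions are hypothetical placeholders (not gemma-4's real shape), and the helper name kv_cache_bytes is made up for illustration. It counts only the stored channel bits against an assumed fp16 (16 bits/ch) baseline and ignores any per-block scale or metadata overhead, so its nominal ratios will not necessarily match the Compression column.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_ctx, bits_per_channel):
    """Approximate KV cache size in bytes, ignoring scale/metadata overhead."""
    # Factor 2 covers keys and values; one channel per head dimension per position.
    channels = 2 * n_layers * n_kv_heads * head_dim * n_ctx
    return channels * bits_per_channel / 8


# Hypothetical model shape, for illustration only.
shape = dict(n_layers=32, n_kv_heads=8, head_dim=128, n_ctx=4096)

# Bits/ch values taken from the Results table.
presets = [
    ("turboquant_k8v4", 3.5),
    ("turboquant_4bit_nc", 4.0),
    ("turboquant_3bit_nc", 3.0),
]

fp16 = kv_cache_bytes(bits_per_channel=16, **shape)  # assumed baseline
for name, bits in presets:
    size = kv_cache_bytes(bits_per_channel=bits, **shape)
    print(f"{name}: {size / 2**20:.0f} MiB, nominal {fp16 / size:.2f}x vs fp16")
```

Substituting the real layer count, KV head count, and head dimension of the model under test gives a first estimate to compare against the measured Peak Memory.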

How to Run

# 1. Start llama-server with one preset at a time (leave it running in a separate terminal)
# turboquant_k8v4
llama-server -m ~/models/gemma-4-q4_k_m.gguf --port 8081 -ctk turbo4 -ctv turbo4 -c 4096

# 2. Run benchmark
cd turboquant
python3 benchmarks/m1_mac_benchmark.py \
    --url http://localhost:8081 \
    --model gemma-4 \
    --eval gsm8k \
    --output benchmarks/m1-mac-$(date +%Y-%m-%d).md

# 3. Repeat for other presets (change -ctk/-ctv)
# turboquant_4bit_nc: -ctk q4_0 -ctv q4_0
# turboquant_3bit_nc: -ctk q3_k -ctv q3_k

# 4. Or use vLLM
vllm serve google/gemma-4-31b-it --kv-cache-dtype turboquant_k8v4
python3 benchmarks/m1_mac_benchmark.py --backend vllm --eval gsm8k
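
Since step 3 repeats the same llama-server command with different -ctk/-ctv values, the per-preset commands can be generated rather than edited by hand. This is a minimal sketch: the helper server_command and the PRESET_KV_TYPES mapping are illustrative names, not part of the repository, and the flags are only those already used in the commands above.

```python
import os
import shlex

# Preset name -> KV cache type, as listed in the Results table.
PRESET_KV_TYPES = {
    "turboquant_k8v4": "turbo4",
    "turboquant_4bit_nc": "q4_0",
    "turboquant_3bit_nc": "q3_k",
}


def server_command(preset, model="~/models/gemma-4-q4_k_m.gguf", port=8081, ctx=4096):
    """Build the llama-server argument list for one preset."""
    kv_type = PRESET_KV_TYPES[preset]
    return [
        "llama-server",
        "-m", os.path.expanduser(model),  # expand ~ here so quoting cannot block it
        "--port", str(port),
        "-ctk", kv_type,
        "-ctv", kv_type,
        "-c", str(ctx),
    ]


# Print one copy-pasteable command per preset.
for preset in PRESET_KV_TYPES:
    print(f"# {preset}")
    print(shlex.join(server_command(preset)))
```

The returned list can also be passed directly to subprocess.Popen if the server should be started and stopped from a driver script.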

Recommendation

Default: TBD after benchmarks complete.

Decision criteria:

If turboquant_k8v4 GSM8K ≥ turboquant_4bit_nc GSM8K: use turboquant_k8v4 (better compression at no loss in quality)
If turboquant_3bit_nc GSM8K drops by more than 10%: do not use it as the default
Memory headroom: model + KV cache must fit within 70% of unified memory
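
Once the table is populated, the criteria above could be applied mechanically. The sketch below is one possible reading, under assumptions the criteria do not spell out: the 10% drop is taken as a relative drop against turboquant_4bit_nc, and turboquant_4bit_nc is the fallback when turboquant_k8v4 scores lower. The function names are hypothetical.

```python
def fits_in_memory(model_gib, kv_gib, unified_gib, max_fraction=0.70):
    """Memory headroom rule: model + KV cache within 70% of unified memory."""
    return model_gib + kv_gib <= max_fraction * unified_gib


def recommend_default(gsm8k, max_drop=0.10):
    """Apply the GSM8K criteria.

    gsm8k maps preset name -> accuracy in [0, 1].
    Returns (default_preset, three_bit_allowed).
    """
    base = gsm8k["turboquant_4bit_nc"]  # assumed reference for the 10% rule

    # Criterion 1: prefer k8v4 when its score is at least as good as 4bit_nc.
    if gsm8k["turboquant_k8v4"] >= base:
        default = "turboquant_k8v4"
    else:
        default = "turboquant_4bit_nc"  # assumed fallback, not stated above

    # Criterion 2: rule out 3bit as a default if it drops more than max_drop.
    relative_drop = (base - gsm8k["turboquant_3bit_nc"]) / base
    three_bit_allowed = relative_drop <= max_drop

    return default, three_bit_allowed


# Made-up placeholder scores to show the call shape; these are not benchmark results.
placeholder = {
    "turboquant_k8v4": 0.80,
    "turboquant_4bit_nc": 0.80,
    "turboquant_3bit_nc": 0.70,
}
print(recommend_default(placeholder))
print(fits_in_memory(model_gib=5, kv_gib=1, unified_gib=16))
```

If the intended baseline for the 10% rule is an unquantized KV cache run instead, only the base lookup needs to change.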
