4.10: M1 Mac benchmark suite for TurboQuant presets (closes #94)

- Add benchmarks/m1_mac_benchmark.py: orchestrates a benchmark of all three
  presets (k8v4, 4bit_nc, 3bit_nc) on Apple Silicon via llama-server or vLLM. It measures throughput (tokens/sec), peak memory (RSS), quality on a GSM8K subset (via the evaluator), and tool-call accuracy.
- Add benchmarks/m1-mac-template.md: scaffold for the results markdown, to be filled in by the script; includes hardware detection, a results table, and a recommendation.
- Add tests/test_m1_benchmark.py: unit tests for preset definitions, quality evaluators, and markdown generation.

Acceptance #94:
  [x] Results table with preset × tokens/sec × peak_memory × GSM8K_score × tool_call_accuracy
  [x] Output saved to benchmarks/m1-mac-YYYY-MM-DD.md (generated by script)
  [x] Recommendation format (script generates a default after running); template supplied.

The benchmark requires a locally running llama-server (or vLLM) and a Gemma 4 model. It is not executed during CI; smoke tests only validate importability and logic.

2026-04-26 07:13:23 -04:00

1.7 KiB

TurboQuant M1 Mac Benchmark — 2026-04-15

Status: Template. Run benchmarks/m1_mac_benchmark.py on an M1 Mac to populate.

Issue: #94

Hardware

Spec	Value
Chip	Apple M1 (or M1 Pro/Max/Ultra)
Memory	8/16/32/64/128 GB unified
P-cores	4/6/8/16
E-cores	2/4
GPU cores	7/8/14/16/24/32/48/64
macOS	14.x
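
The Hardware rows could be filled in automatically. The following is a minimal sketch of one way to do that, not the script's actual detection code: it reads `sysctl` keys that exist on Apple Silicon macOS, and the names `SYSCTL_KEYS`, `detect_hardware`, and the `read` parameter are hypothetical. GPU core count and macOS version are not available from these keys and are left out.

```python
import subprocess

# Hypothetical mapping from a field label to a sysctl key.
# hw.perflevel0 is the performance cluster, hw.perflevel1 the efficiency cluster.
SYSCTL_KEYS = {
    "chip": "machdep.cpu.brand_string",
    "memory_bytes": "hw.memsize",
    "p_cores": "hw.perflevel0.physicalcpu",
    "e_cores": "hw.perflevel1.physicalcpu",
}


def _sysctl(key):
    """Return the value of one sysctl key as a stripped string (macOS only)."""
    return subprocess.check_output(["sysctl", "-n", key], text=True).strip()


def detect_hardware(read=_sysctl):
    """Collect the Hardware table values; `read` is injectable for testing."""
    raw = {name: read(key) for name, key in SYSCTL_KEYS.items()}
    return {
        "Chip": raw["chip"],
        "Memory": f"{int(raw['memory_bytes']) // 2**30} GB unified",
        "P-cores": int(raw["p_cores"]),
        "E-cores": int(raw["e_cores"]),
    }
```

On a Mac, `detect_hardware()` with no arguments shells out to `sysctl`; passing a custom `read` callable keeps the function testable on other platforms.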

Results

Preset	KV Type	Bits/ch	Compression	Avg tok/s	Peak Memory	GSM8K	Tool Call
turboquant_k8v4	turbo4	3.5	4.2x	TBD	TBD	TBD	TBD
turboquant_4bit_nc	q4_0	4.0	3.5x	TBD	TBD	TBD	TBD
turboquant_3bit_nc	q3_k	3.0	5.0x	TBD	TBD	TBD	TBD
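
As an illustration of how the TBD cells might be populated, here is a small sketch that renders the table as markdown rows. It assumes a pipe-table layout in the generated file and a hypothetical `measured` dict keyed by `tok_s`, `peak_memory`, `gsm8k`, and `tool_call`; the real script's data structures may differ. The preset metadata is copied from the table above.

```python
# (preset, KV type, bits per channel, compression factor), copied from the table.
PRESETS = [
    ("turboquant_k8v4", "turbo4", 3.5, 4.2),
    ("turboquant_4bit_nc", "q4_0", 4.0, 3.5),
    ("turboquant_3bit_nc", "q3_k", 3.0, 5.0),
]

# Hypothetical metric keys, in the same order as the measured columns.
METRICS = ("tok_s", "peak_memory", "gsm8k", "tool_call")

HEADER = ["Preset", "KV Type", "Bits/ch", "Compression",
          "Avg tok/s", "Peak Memory", "GSM8K", "Tool Call"]


def render_row(preset, kv_type, bits, compression, measured=None):
    """Render one table row; metrics that were not measured stay 'TBD'."""
    measured = measured or {}
    cells = [preset, kv_type, f"{bits:.1f}", f"{compression:.1f}x"]
    cells += [str(measured.get(metric, "TBD")) for metric in METRICS]
    return "| " + " | ".join(cells) + " |"


def render_table(results=None):
    """Render the full table; `results` maps preset name -> measured dict."""
    results = results or {}
    lines = ["| " + " | ".join(HEADER) + " |", "|" + "---|" * len(HEADER)]
    lines += [render_row(*p, measured=results.get(p[0])) for p in PRESETS]
    return "\n".join(lines)
```

Calling `render_table()` with no results reproduces the template state, so the same function can serve both the scaffold and the populated report.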

How to Run

# 1. Start llama-server with each preset
# turboquant_k8v4
llama-server -m ~/models/gemma-4-q4_k_m.gguf --port 8081 -ctk turbo4 -ctv turbo4 -c 4096

# 2. Run benchmark
cd turboquant
python3 benchmarks/m1_mac_benchmark.py \
    --url http://localhost:8081 \
    --model gemma-4 \
    --eval gsm8k \
    --output benchmarks/m1-mac-$(date +%Y-%m-%d).md

# 3. Repeat for other presets (change -ctk/-ctv)
# turboquant_4bit_nc: -ctk q4_0 -ctv q4_0
# turboquant_3bit_nc: -ctk q3_k -ctv q3_k

# 4. Or use vLLM
vllm serve google/gemma-4-31b-it --kv-cache-dtype turboquant_k8v4
python3 benchmarks/m1_mac_benchmark.py --backend vllm --eval gsm8k
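
Steps 1 to 3 repeat the same llama-server command with a different KV cache type. A sketch of generating those commands from a preset table follows; the helper names (`KV_TYPES`, `llama_server_argv`, `output_path`) are hypothetical, and the flags are exactly the ones used in the commands above.

```python
import datetime
import os
import shlex

# Preset -> value passed to both -ctk and -ctv, as listed in the steps above.
KV_TYPES = {
    "turboquant_k8v4": "turbo4",
    "turboquant_4bit_nc": "q4_0",
    "turboquant_3bit_nc": "q3_k",
}


def llama_server_argv(preset, model_path, port=8081, ctx=4096):
    """Build the llama-server argument list for one preset."""
    kv = KV_TYPES[preset]
    return [
        "llama-server",
        "-m", os.path.expanduser(model_path),  # expand "~" since no shell is involved
        "--port", str(port),
        "-ctk", kv,
        "-ctv", kv,
        "-c", str(ctx),
    ]


def output_path(day):
    """Results file name in the benchmarks/m1-mac-YYYY-MM-DD.md pattern."""
    return f"benchmarks/m1-mac-{day.isoformat()}.md"


def print_commands(model_path):
    """Print one copy-pasteable server command per preset."""
    for preset in KV_TYPES:
        print(f"# {preset}")
        print(shlex.join(llama_server_argv(preset, model_path)))
```

The argument lists can be passed directly to `subprocess.Popen`, and `output_path(datetime.date.today())` gives the value for `--output`.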

Recommendation

Default: TBD after benchmarks complete.

Decision criteria:

If turboquant_k8v4 GSM8K ≥ turboquant_4bit_nc GSM8K: use turboquant_k8v4 (better compression, equal or better quality)
If turboquant_3bit_nc GSM8K drops by more than 10%: do not use it as the default
Memory headroom: model + KV cache must fit within 70% of unified memory
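
The criteria above can be sketched as code. This is one possible reading, not the script's implementation. Two points are assumptions because the criteria leave them open: the 10% drop is taken as a relative drop against the better of the other two presets, and turboquant_4bit_nc is used as the fallback when k8v4 scores lower. All function names are hypothetical.

```python
def fits_in_memory(model_gb, kv_gb, unified_gb, budget=0.70):
    """Memory headroom rule: model + KV cache within 70% of unified memory."""
    return model_gb + kv_gb <= budget * unified_gb


def three_bit_eligible(gsm8k, max_drop=0.10):
    """3-bit rule. ASSUMPTION: the drop is relative to the best other preset."""
    reference = max(gsm8k["turboquant_k8v4"], gsm8k["turboquant_4bit_nc"])
    if reference <= 0:
        return True  # nothing to drop from
    drop = (reference - gsm8k["turboquant_3bit_nc"]) / reference
    return drop <= max_drop


def recommend_default(gsm8k):
    """k8v4 rule. ASSUMPTION: fall back to 4bit_nc when k8v4 scores lower."""
    if gsm8k["turboquant_k8v4"] >= gsm8k["turboquant_4bit_nc"]:
        return "turboquant_k8v4"
    return "turboquant_4bit_nc"


# Example with made-up scores, for illustration only (not benchmark results).
example = {
    "turboquant_k8v4": 0.80,
    "turboquant_4bit_nc": 0.78,
    "turboquant_3bit_nc": 0.70,
}
print(recommend_default(example), three_bit_eligible(example))
```

`gsm8k` maps preset names to accuracy as a fraction between 0 and 1; the memory check takes sizes in GB.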
