DFlash on Apple Silicon
This repo now carries a Gitea-first benchmark harness for evaluating whether upstream DFlash on MLX is worth adding to the local Apple Silicon inference stack.
Why
The headline Kimi K2.6 + DFlash benchmark was measured on 8x MI300X GPUs with far more memory than any laptop, plus ROCm patches. That exact recipe is not a fit for a 36 GB Apple Silicon Mac.
What is relevant locally is the upstream z-lab/dflash MLX path, which can benchmark smaller matched target/draft pairs that fit on Apple Silicon.
Current repo entry point
Use:
python3 benchmarks/dflash_apple_silicon.py --machine-label "M3 Max 36GB"
This prints a benchmark report template with:
- the selected model/draft pair
- exact setup commands
- the upstream MLX benchmark command
- baseline comparison guidance
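The report template is plain text assembled from the selected plan. A minimal sketch of that assembly, where the function name and plan keys are illustrative assumptions rather than the harness's actual API:

```python
# Hypothetical sketch; field names (machine, pair, base, draft) are
# assumptions, not the real structure in benchmarks/dflash_apple_silicon.py.

def render_report(plan: dict) -> str:
    """Render a benchmark report template from a plan dict."""
    lines = [
        f"DFlash benchmark report ({plan['machine']})",
        "",
        f"Pair: {plan['pair']}",
        f"Base model: {plan['base']}",
        f"Draft model: {plan['draft']}",
        "",
        "Benchmark command:",
        f"  python -m dflash.benchmark --backend mlx "
        f"--model {plan['base']} --draft-model {plan['draft']}",
        "",
        "Baseline: compare against plain MLX decoding on the same prompts.",
    ]
    return "\n".join(lines)

report = render_report({
    "machine": "M3 Max 36GB",
    "pair": "qwen35-9b",
    "base": "Qwen/Qwen3.5-9B",
    "draft": "z-lab/Qwen3.5-9B-DFlash",
})
```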
Write the template to a file:
python3 benchmarks/dflash_apple_silicon.py \
--machine-label "M3 Max 36GB" \
--output benchmarks/reports/dflash_m3max_36gb.md
Emit the underlying plan as JSON:
python3 benchmarks/dflash_apple_silicon.py --format json
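The JSON form is the same plan, serialized. A sketch of what a plan payload could look like, with keys that are assumptions rather than the harness's actual schema:

```python
import json

# Hypothetical plan structure; the real --format json output may differ.
plan = {
    "machine_label": "M3 Max 36GB",
    "pair": "qwen35-9b",
    "base_model": "Qwen/Qwen3.5-9B",
    "draft_model": "z-lab/Qwen3.5-9B-DFlash",
    "dataset": "gsm8k",
    "max_samples": 128,
}
print(json.dumps(plan, indent=2))
```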
Selection logic
Today the planner uses two upstream-supported MLX pairs:
- qwen35-9b
  - base: Qwen/Qwen3.5-9B
  - draft: z-lab/Qwen3.5-9B-DFlash
  - chosen for ~28 GB+ machines
- qwen35-4b
  - base: Qwen/Qwen3.5-4B
  - draft: z-lab/Qwen3.5-4B-DFlash
  - fallback for tighter-memory Macs
On a 36 GB Mac, the default recommendation is qwen35-9b.
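That selection rule reduces to a memory threshold; a small sketch, where the 28 GB cutoff comes from the pair descriptions above and the function name is illustrative:

```python
# Sketch of the pair-selection rule described above; names are illustrative.
PAIRS = {
    "qwen35-9b": ("Qwen/Qwen3.5-9B", "z-lab/Qwen3.5-9B-DFlash"),
    "qwen35-4b": ("Qwen/Qwen3.5-4B", "z-lab/Qwen3.5-4B-DFlash"),
}

def select_pair(ram_gb: float) -> str:
    # The 9B pair targets ~28 GB+ of unified memory; otherwise fall back to 4B.
    return "qwen35-9b" if ram_gb >= 28 else "qwen35-4b"
```

On a 36 GB machine this yields `qwen35-9b`, matching the default recommendation above.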
Upstream benchmark command
The harness uses the upstream MLX benchmark syntax from z-lab/dflash:
python -m dflash.benchmark --backend mlx \
--model Qwen/Qwen3.5-9B \
--draft-model z-lab/Qwen3.5-9B-DFlash \
--dataset gsm8k \
--max-samples 128 \
--enable-thinking \
--draft-sliding-window-size 4096
What remains
This PR adds the planner + report template so the benchmark is reproducible from the repo. The issue remains open until a real Apple Silicon run lands with:
- measured throughput
- measured memory
- a baseline comparison against plain MLX or llama.cpp speculative decoding
- a recommendation on whether to operationalize DFlash locally
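Once a real run lands, the baseline comparison is a simple throughput ratio. A sketch of the arithmetic, using placeholder numbers rather than measurements:

```python
def speedup(dflash_tok_s: float, baseline_tok_s: float) -> float:
    """Tokens/sec ratio of the DFlash run over the plain-decoding baseline."""
    return dflash_tok_s / baseline_tok_s

# Placeholder values only; real numbers must come from an actual
# Apple Silicon run and a matched plain-MLX (or llama.cpp) baseline.
print(f"speedup: {speedup(45.0, 30.0):.2f}x")  # 1.50x
```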