DFlash on Apple Silicon
This repo now carries a Gitea-first benchmark harness for evaluating whether upstream DFlash on MLX is worth adding to the local Apple Silicon inference stack.
Why
The headline Kimi K2.6 + DFlash benchmark was measured on 8x MI300X GPUs with ROCm patches and far more memory than any Mac. That exact recipe is not a fit for a 36 GB Apple Silicon machine.
What is relevant locally is the upstream z-lab/dflash MLX path, which can benchmark smaller matched target/draft pairs that fit on Apple Silicon.
Current repo entry point
Use:
python3 benchmarks/dflash_apple_silicon.py --machine-label "M3 Max 36GB"
This prints a benchmark report template with:
- the selected model/draft pair
- exact setup commands
- the upstream MLX benchmark command
- baseline comparison guidance
Write the template to a file:
python3 benchmarks/dflash_apple_silicon.py \
--machine-label "M3 Max 36GB" \
--output benchmarks/reports/dflash_m3max_36gb.md
Emit the underlying plan as JSON:
python3 benchmarks/dflash_apple_silicon.py --format json
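The JSON output makes the plan scriptable. A minimal sketch of the kind of plan object such a command could emit; the field names here are illustrative placeholders, not the script's actual schema:

```python
import json

# Hypothetical plan shape; real field names may differ in the repo script.
plan = {
    "machine_label": "M3 Max 36GB",
    "pair": "qwen35-9b",
    "base_model": "Qwen/Qwen3.5-9B",
    "draft_model": "z-lab/Qwen3.5-9B-DFlash",
    "dataset": "gsm8k",
}
print(json.dumps(plan, indent=2))
```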
Selection logic
Today the planner uses two upstream-supported MLX pairs:
- qwen35-9b: base Qwen/Qwen3.5-9B, draft z-lab/Qwen3.5-9B-DFlash, chosen for ~28 GB+ machines
- qwen35-4b: base Qwen/Qwen3.5-4B, draft z-lab/Qwen3.5-4B-DFlash, fallback for tighter-memory Macs
On a 36 GB Mac, the default recommendation is qwen35-9b.
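The selection rule above can be sketched as a small helper. The ~28 GB threshold comes from the text; the function and names are illustrative, not the planner's actual implementation:

```python
# Matched target/draft pairs described in this doc.
PAIRS = {
    "qwen35-9b": ("Qwen/Qwen3.5-9B", "z-lab/Qwen3.5-9B-DFlash"),
    "qwen35-4b": ("Qwen/Qwen3.5-4B", "z-lab/Qwen3.5-4B-DFlash"),
}

def select_pair(unified_memory_gb: float) -> str:
    """Pick the 9B pair on ~28 GB+ machines, else fall back to 4B."""
    return "qwen35-9b" if unified_memory_gb >= 28 else "qwen35-4b"
```

On a 36 GB machine this picks qwen35-9b, matching the default recommendation stated above.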
Pilot result already landed
A first live Apple Silicon run has already been captured in:
benchmarks/reports/dflash_m3max_36gb_qwen35_4b_pilot.md
Pilot command:
python -m dflash.benchmark --backend mlx \
--model Qwen/Qwen3.5-4B \
--draft-model z-lab/Qwen3.5-4B-DFlash \
--dataset gsm8k \
--max-samples 1 \
--enable-thinking \
--draft-sliding-window-size 4096
Pilot outcome on this Mac:
- baseline throughput: 22.35 tok/s
- DFlash throughput: 46.78 tok/s
- decoding speedup: 2.09x
Treat that as a directional proof, not a final decision benchmark. The next step is the fuller comparison slice against plain MLX or llama.cpp speculative decoding.
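The reported 2.09x figure follows directly from the two throughput numbers; a quick arithmetic check:

```python
baseline_tps = 22.35  # baseline throughput from the pilot report
dflash_tps = 46.78    # DFlash throughput from the pilot report

# Decoding speedup is the ratio of the two throughputs.
speedup = dflash_tps / baseline_tps
print(f"{speedup:.2f}x")  # 2.09x
```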
Known 9B failure on this machine
A follow-up live run with Qwen/Qwen3.5-9B as the base model and z-lab/Qwen3.5-9B-DFlash as the draft failed on the same M3 Max 36GB Mac with:
[METAL] Command buffer execution failed:
Caused GPU Timeout Error (00000002:kIOGPUCommandBufferCallbackErrorTimeout)
That failure is recorded in:
benchmarks/reports/dflash_m3max_36gb_qwen35_9b_timeout.md
So the current guidance is:
- treat qwen35-9b as experimental on this machine
- treat qwen35-4b as the current known-working local proof path
- keep the issue open until we either stabilize the 9B path or clearly rule it out for this hardware tier
Upstream benchmark command
The harness uses the upstream MLX benchmark syntax from z-lab/dflash:
python -m dflash.benchmark --backend mlx \
--model Qwen/Qwen3.5-9B \
--draft-model z-lab/Qwen3.5-9B-DFlash \
--dataset gsm8k \
--max-samples 128 \
--enable-thinking \
--draft-sliding-window-size 4096
What remains
This PR adds the planner + report template so the benchmark is reproducible from the repo. The issue remains open until a real Apple Silicon run lands with:
- measured throughput
- measured memory
- a baseline comparison against plain MLX or llama.cpp speculative decoding
- a recommendation on whether to operationalize DFlash locally
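Once those measurements land, the go/no-go call could be scripted as a simple gate. The thresholds below are illustrative placeholders, not agreed acceptance criteria:

```python
def recommend_dflash(speedup: float, peak_mem_gb: float,
                     mem_budget_gb: float = 36.0,
                     min_speedup: float = 1.5) -> bool:
    """Hypothetical gate: operationalize DFlash locally only if the
    measured decoding speedup clears a floor and peak memory fits
    within the machine's unified memory budget."""
    return speedup >= min_speedup and peak_mem_gb <= mem_budget_gb
```

For example, the pilot's 2.09x speedup would pass this gate as long as peak memory stays within budget; a run that times out or exceeds memory would not.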