turboquant/docs/DFLASH_APPLE_SILICON.md
Alexander Whitestone 636d294896
feat: add Apple Silicon DFlash benchmark planner (refs #152)
2026-04-21 22:00:22 -04:00


DFlash on Apple Silicon

This repo now carries a Gitea-first benchmark harness for evaluating whether upstream DFlash on MLX is worth adding to the local Apple Silicon inference stack.

Why

The headline Kimi K2.6 + DFlash benchmark was measured on 8x MI300X GPUs with far more memory than any Mac, plus ROCm patches. That exact recipe is not a fit for a 36 GB Apple Silicon machine.

What is relevant locally is the upstream z-lab/dflash MLX path, which can benchmark smaller matched target/draft pairs that fit on Apple Silicon.

Current repo entry point

Use:

python3 benchmarks/dflash_apple_silicon.py --machine-label "M3 Max 36GB"

This prints a benchmark report template with:

  • the selected model/draft pair
  • exact setup commands
  • the upstream MLX benchmark command
  • baseline comparison guidance
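
Assembling those four sections into a report template can be pictured as a small formatter. This is a sketch, not the repo's actual implementation; the function name, parameters, and layout are illustrative, with only the section names taken from the list above:

```python
def render_report(machine_label: str, pair: str, setup_cmds: list[str],
                  bench_cmd: str, baseline_note: str) -> str:
    """Assemble a markdown benchmark report template.

    Hypothetical helper: the real planner's field names and layout may differ.
    """
    lines = [
        f"# DFlash benchmark report: {machine_label}",
        "",
        "## Selected model/draft pair",
        pair,
        "",
        "## Setup commands",
        *setup_cmds,
        "",
        "## Upstream MLX benchmark command",
        bench_cmd,
        "",
        "## Baseline comparison guidance",
        baseline_note,
    ]
    return "\n".join(lines)
```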

Write the template to a file:

python3 benchmarks/dflash_apple_silicon.py \
  --machine-label "M3 Max 36GB" \
  --output benchmarks/reports/dflash_m3max_36gb.md

Emit the underlying plan as JSON:

python3 benchmarks/dflash_apple_silicon.py --format json
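
The JSON plan can then be consumed by other tooling. The planner's actual schema is not documented here, so the keys in this sketch are purely illustrative:

```python
import json

# Illustrative plan payload; the real planner's keys and values may differ.
plan_json = json.dumps({
    "machine_label": "M3 Max 36GB",
    "pair": "qwen35-9b",
    "base_model": "Qwen/Qwen3.5-9B",
    "draft_model": "z-lab/Qwen3.5-9B-DFlash",
})

plan = json.loads(plan_json)
print(plan["pair"])  # field names above are assumptions, not the real schema
```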

Selection logic

Today the planner uses two upstream-supported MLX pairs:

  • qwen35-9b
    • base: Qwen/Qwen3.5-9B
    • draft: z-lab/Qwen3.5-9B-DFlash
    • recommended for machines with roughly 28 GB or more of unified memory
  • qwen35-4b
    • base: Qwen/Qwen3.5-4B
    • draft: z-lab/Qwen3.5-4B-DFlash
    • fallback for tighter-memory Macs

On a 36 GB Mac, the default recommendation is qwen35-9b.
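
The selection rule above reduces to a single threshold check. This is a sketch of the documented rule only; the real planner may weigh more signals than total memory:

```python
def pick_pair(unified_memory_gb: float) -> str:
    """Choose the MLX target/draft pair by available unified memory.

    Sketch of the documented rule: qwen35-9b for roughly 28 GB or more,
    qwen35-4b as the fallback for tighter-memory Macs.
    """
    return "qwen35-9b" if unified_memory_gb >= 28 else "qwen35-4b"
```

On a 36 GB machine this yields qwen35-9b, matching the default recommendation.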

Upstream benchmark command

The harness uses the upstream MLX benchmark syntax from z-lab/dflash:

python -m dflash.benchmark --backend mlx \
    --model Qwen/Qwen3.5-9B \
    --draft-model z-lab/Qwen3.5-9B-DFlash \
    --dataset gsm8k \
    --max-samples 128 \
    --enable-thinking \
    --draft-sliding-window-size 4096
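
For scripting multiple runs, that command can be assembled programmatically. A minimal sketch, with the flag names copied from the command above and the defaults matching it; the helper itself is hypothetical, not part of the repo:

```python
def build_benchmark_argv(model: str, draft_model: str,
                         dataset: str = "gsm8k",
                         max_samples: int = 128,
                         draft_window: int = 4096) -> list[str]:
    """Build the argv list for the upstream z-lab/dflash MLX benchmark."""
    return [
        "python", "-m", "dflash.benchmark",
        "--backend", "mlx",
        "--model", model,
        "--draft-model", draft_model,
        "--dataset", dataset,
        "--max-samples", str(max_samples),
        "--enable-thinking",
        "--draft-sliding-window-size", str(draft_window),
    ]
```

The returned list can be handed directly to `subprocess.run`, avoiding shell quoting issues with model names.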

What remains

This PR adds the planner + report template so the benchmark is reproducible from the repo. The issue remains open until a real Apple Silicon run lands with:

  • measured throughput
  • measured memory
  • a baseline comparison against plain MLX or llama.cpp speculative decoding
  • a recommendation on whether to operationalize DFlash locally