turboquant/docs/DFLASH_APPLE_SILICON.md
Alexander Whitestone 636d294896
feat: add Apple Silicon DFlash benchmark planner (refs #152)
2026-04-21 22:00:22 -04:00


DFlash on Apple Silicon

This repo now carries a Gitea-first benchmark harness for evaluating whether upstream DFlash on MLX is worth adding to the local Apple Silicon inference stack.

Why

The headline Kimi K2.6 + DFlash benchmark was measured on 8x MI300X GPUs with far more memory than any Mac, plus ROCm patches. That exact recipe is not a fit for a 36 GB Apple Silicon machine.

What is relevant locally is the upstream z-lab/dflash MLX path, which can benchmark smaller matched target/draft pairs that fit on Apple Silicon.

Current repo entry point

Use:

python3 benchmarks/dflash_apple_silicon.py --machine-label "M3 Max 36GB"

This prints a benchmark report template with:

  • the selected model/draft pair
  • exact setup commands
  • the upstream MLX benchmark command
  • baseline comparison guidance
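
Assembling those four sections into a report template can be pictured as a small formatter. This is a sketch, not the repo's actual implementation; the function name, parameters, and layout are illustrative, with only the section names taken from the list above:

```python
def render_report(machine_label: str, pair: str, setup_cmds: list[str],
                  bench_cmd: str, baseline_note: str) -> str:
    """Assemble a markdown benchmark report template.

    Hypothetical helper: the real planner's field names and layout may differ.
    """
    lines = [
        f"# DFlash benchmark report: {machine_label}",
        "",
        "## Selected model/draft pair",
        pair,
        "",
        "## Setup commands",
        *setup_cmds,
        "",
        "## Upstream MLX benchmark command",
        bench_cmd,
        "",
        "## Baseline comparison guidance",
        baseline_note,
    ]
    return "\n".join(lines)
```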

Write the template to a file:

python3 benchmarks/dflash_apple_silicon.py \
  --machine-label "M3 Max 36GB" \
  --output benchmarks/reports/dflash_m3max_36gb.md

Emit the underlying plan as JSON:

python3 benchmarks/dflash_apple_silicon.py --format json
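
The JSON plan can then be consumed by other tooling. The planner's actual schema is not documented here, so the keys in this sketch are purely illustrative:

```python
import json

# Illustrative plan payload; the real planner's keys and values may differ.
plan_json = json.dumps({
    "machine_label": "M3 Max 36GB",
    "pair": "qwen35-9b",
    "base_model": "Qwen/Qwen3.5-9B",
    "draft_model": "z-lab/Qwen3.5-9B-DFlash",
})

plan = json.loads(plan_json)
print(plan["pair"])  # field names above are assumptions, not the real schema
```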

Selection logic

Today the planner uses two upstream-supported MLX pairs:

  • qwen35-9b
    • base: Qwen/Qwen3.5-9B
    • draft: z-lab/Qwen3.5-9B-DFlash
    • recommended for machines with roughly 28 GB or more of unified memory
  • qwen35-4b
    • base: Qwen/Qwen3.5-4B
    • draft: z-lab/Qwen3.5-4B-DFlash
    • fallback for tighter-memory Macs

On a 36 GB Mac, the default recommendation is qwen35-9b.
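
The selection rule above reduces to a single threshold check. This is a sketch of the documented rule only; the real planner may weigh more signals than total memory:

```python
def pick_pair(unified_memory_gb: float) -> str:
    """Choose the MLX target/draft pair by available unified memory.

    Sketch of the documented rule: qwen35-9b for roughly 28 GB or more,
    qwen35-4b as the fallback for tighter-memory Macs.
    """
    return "qwen35-9b" if unified_memory_gb >= 28 else "qwen35-4b"
```

On a 36 GB machine this yields qwen35-9b, matching the default recommendation.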

Upstream benchmark command

The harness uses the upstream MLX benchmark syntax from z-lab/dflash:

python -m dflash.benchmark --backend mlx \
    --model Qwen/Qwen3.5-9B \
    --draft-model z-lab/Qwen3.5-9B-DFlash \
    --dataset gsm8k \
    --max-samples 128 \
    --enable-thinking \
    --draft-sliding-window-size 4096
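
For scripting multiple runs, that command can be assembled programmatically. A minimal sketch, with the flag names copied from the command above and the defaults matching it; the helper itself is hypothetical, not part of the repo:

```python
def build_benchmark_argv(model: str, draft_model: str,
                         dataset: str = "gsm8k",
                         max_samples: int = 128,
                         draft_window: int = 4096) -> list[str]:
    """Build the argv list for the upstream z-lab/dflash MLX benchmark."""
    return [
        "python", "-m", "dflash.benchmark",
        "--backend", "mlx",
        "--model", model,
        "--draft-model", draft_model,
        "--dataset", dataset,
        "--max-samples", str(max_samples),
        "--enable-thinking",
        "--draft-sliding-window-size", str(draft_window),
    ]
```

The returned list can be handed directly to `subprocess.run`, avoiding shell quoting issues with model names.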

What remains

This PR adds the planner + report template so the benchmark is reproducible from the repo. The issue remains open until a real Apple Silicon run lands with:

  • measured throughput
  • measured memory
  • a baseline comparison against plain MLX or llama.cpp speculative decoding
  • a recommendation on whether to operationalize DFlash locally