turboquant/docs/DFLASH_APPLE_SILICON.md
Alexander Whitestone dabb96d315
docs: record Qwen3.5-9B DFlash Metal timeout (refs #152, #154)
2026-04-21 22:25:25 -04:00

DFlash on Apple Silicon

This repo now carries a Gitea-first benchmark harness for evaluating whether upstream DFlash on MLX is worth adding to the local Apple Silicon inference stack.

Why

The headline Kimi K2.6 + DFlash benchmark was measured on 8x MI300X with far more memory than any Mac, plus ROCm-specific patches. That exact recipe is not a fit for a 36 GB Apple Silicon Mac.

What is relevant locally is the upstream z-lab/dflash MLX path, which can benchmark smaller matched target/draft pairs that fit on Apple Silicon.

Current repo entry point

Use:

python3 benchmarks/dflash_apple_silicon.py --machine-label "M3 Max 36GB"

This prints a benchmark report template with:

  • the selected model/draft pair
  • exact setup commands
  • the upstream MLX benchmark command
  • baseline comparison guidance

Write the template to a file:

python3 benchmarks/dflash_apple_silicon.py \
  --machine-label "M3 Max 36GB" \
  --output benchmarks/reports/dflash_m3max_36gb.md

Emit the underlying plan as JSON:

python3 benchmarks/dflash_apple_silicon.py --format json
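The JSON plan can be consumed by other tooling with an ordinary parser. The sketch below is hypothetical: the field names (`pair`, `base`, `draft`) are placeholder assumptions, not the harness's actual schema — inspect the real `--format json` output before wiring anything to it.

```python
import json

# Hypothetical plan payload -- the keys here are illustrative assumptions,
# not the harness's real schema. Inspect `--format json` output to confirm.
example_plan = json.loads("""
{
  "pair": "qwen35-9b",
  "base": "Qwen/Qwen3.5-9B",
  "draft": "z-lab/Qwen3.5-9B-DFlash"
}
""")

# Pull out the model/draft identifiers for downstream tooling.
print(example_plan["base"], "->", example_plan["draft"])
```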

Selection logic

Today the planner uses two upstream-supported MLX pairs:

  • qwen35-9b
    • base: Qwen/Qwen3.5-9B
    • draft: z-lab/Qwen3.5-9B-DFlash
    • chosen for ~28 GB+ machines
  • qwen35-4b
    • base: Qwen/Qwen3.5-4B
    • draft: z-lab/Qwen3.5-4B-DFlash
    • fallback for tighter-memory Macs

On a 36 GB Mac, the default recommendation is qwen35-9b.
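The selection rule above amounts to a memory threshold. A minimal sketch, assuming the planner keys off unified memory in GB — the names here (`Pair`, `select_pair`, `MIN_9B_GB`) are illustrative, not the repo's actual API:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pair:
    name: str
    base: str
    draft: str


QWEN35_9B = Pair("qwen35-9b", "Qwen/Qwen3.5-9B", "z-lab/Qwen3.5-9B-DFlash")
QWEN35_4B = Pair("qwen35-4b", "Qwen/Qwen3.5-4B", "z-lab/Qwen3.5-4B-DFlash")

MIN_9B_GB = 28  # the doc's ~28 GB+ threshold for the 9B pair (assumed cutoff)


def select_pair(unified_memory_gb: float) -> Pair:
    """Prefer the 9B pair on big-memory Macs; fall back to 4B otherwise."""
    return QWEN35_9B if unified_memory_gb >= MIN_9B_GB else QWEN35_4B


print(select_pair(36).name)  # -> qwen35-9b, matching the 36 GB default
```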

Pilot result already landed

A first live Apple Silicon run has already been captured in:

  • benchmarks/reports/dflash_m3max_36gb_qwen35_4b_pilot.md

Pilot command:

python -m dflash.benchmark --backend mlx \
    --model Qwen/Qwen3.5-4B \
    --draft-model z-lab/Qwen3.5-4B-DFlash \
    --dataset gsm8k \
    --max-samples 1 \
    --enable-thinking \
    --draft-sliding-window-size 4096

Pilot outcome on this Mac:

  • baseline throughput: 22.35 tok/s
  • DFlash throughput: 46.78 tok/s
  • decoding speedup: 2.09x
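The reported speedup is just the ratio of the two throughput numbers, which is easy to sanity-check:

```python
# Sanity-check the pilot's reported 2.09x speedup from its raw numbers.
baseline_tps = 22.35   # plain decoding throughput, tok/s
dflash_tps = 46.78     # DFlash-accelerated throughput, tok/s

speedup = dflash_tps / baseline_tps
print(f"decoding speedup: {speedup:.2f}x")  # -> decoding speedup: 2.09x
```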

Treat that as a directional proof, not a final decision benchmark. The next step is the fuller comparison slice against plain MLX or llama.cpp speculative decoding.

Known 9B failure on this machine

A follow-up live run with:

  • Qwen/Qwen3.5-9B
  • z-lab/Qwen3.5-9B-DFlash

failed on this same M3 Max 36GB Mac with:

[METAL] Command buffer execution failed:
Caused GPU Timeout Error (00000002:kIOGPUCommandBufferCallbackErrorTimeout)

That failure is recorded in:

  • benchmarks/reports/dflash_m3max_36gb_qwen35_9b_timeout.md

So the current guidance is:

  • treat qwen35-9b as experimental on this machine
  • treat qwen35-4b as the current known-working local proof path
  • keep the issue open until we either stabilize the 9B path or clearly rule it out for this hardware tier

Upstream benchmark command

The harness uses the upstream MLX benchmark syntax from z-lab/dflash:

python -m dflash.benchmark --backend mlx \
    --model Qwen/Qwen3.5-9B \
    --draft-model z-lab/Qwen3.5-9B-DFlash \
    --dataset gsm8k \
    --max-samples 128 \
    --enable-thinking \
    --draft-sliding-window-size 4096

What remains

This PR adds the planner + report template so the benchmark is reproducible from the repo. The 4B pilot above is directional only; the issue remains open until a full-scale Apple Silicon run lands with:

  • measured throughput
  • measured memory
  • a baseline comparison against plain MLX or llama.cpp speculative decoding
  • a recommendation on whether to operationalize DFlash locally