turboquant/benchmarks/reports/dflash_m3max_36gb.md
Alexander Whitestone 636d294896
feat: add Apple Silicon DFlash benchmark planner (refs #152)
2026-04-21 22:00:22 -04:00


DFlash Apple Silicon Benchmark Report

Machine

  • Label: M3 Max 36GB
  • Selected pair: qwen35-9b
  • Base model: Qwen/Qwen3.5-9B
  • Draft model: z-lab/Qwen3.5-9B-DFlash
  • Estimated total weight footprint: 19.93 GB
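The footprint figure above is consistent with loading both models at bf16 (2 bytes per parameter). A minimal sketch of the arithmetic, assuming hypothetical parameter counts for the base and draft models (the exact counts are not stated in this report):

```python
def weight_footprint_gb(param_counts, bytes_per_param=2):
    """Estimated weight size in decimal GB.

    bytes_per_param=2 assumes bf16/fp16 weights; a quantized
    checkpoint would use fewer bytes per parameter.
    """
    return sum(param_counts) * bytes_per_param / 1e9

# A 9B-parameter base model alone at bf16 is ~18 GB, so the
# ~19.93 GB total implies the draft adds roughly 1B parameters'
# worth of weights (a hypothetical split, not a measured one).
base_only = weight_footprint_gb([9.0e9])        # 18.0
base_plus_draft = weight_footprint_gb([9.0e9, 1.0e9])  # 20.0
```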

Setup

python3 -m venv .venv-dflash
source .venv-dflash/bin/activate
git clone https://github.com/z-lab/dflash.git
cd dflash
pip install -e '.[mlx]'   # quote the extra so zsh does not glob the brackets
python -m dflash.benchmark --backend mlx \
    --model Qwen/Qwen3.5-9B \
    --draft-model z-lab/Qwen3.5-9B-DFlash \
    --dataset gsm8k \
    --max-samples 128 \
    --enable-thinking \
    --draft-sliding-window-size 4096
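Before launching, it is worth confirming the machine has headroom for the ~19.93 GB of weights plus KV cache and activations. A small sketch (the `sysctl` call is macOS-specific, and the 1.3 margin is an arbitrary assumption, not a DFlash requirement):

```python
import subprocess

def total_ram_gb_macos():
    """Physical RAM in decimal GB, read via macOS sysctl."""
    raw = subprocess.check_output(["sysctl", "-n", "hw.memsize"])
    return int(raw) / 1e9

def has_headroom(total_gb, weights_gb, margin=1.3):
    """True if RAM covers the weights plus a rough margin for
    KV cache, activations, and the OS. margin=1.3 is a guess."""
    return total_gb >= weights_gb * margin

# usage (on macOS): has_headroom(total_ram_gb_macos(), 19.93)
```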

Baseline comparison

Compare against plain MLX generation (no draft model) and against llama.cpp speculative decoding on the same prompt set, with identical sampling settings, so the measured difference isolates the DFlash speedup.

Results

  • Throughput (tok/s):
  • Peak memory (GB):
  • Notes on acceptance / behavior:
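When filling in the acceptance notes, the standard speculative-decoding identity is a useful sanity check on throughput: with per-token acceptance rate α and draft length k, each verification pass yields on average (1 − α^(k+1)) / (1 − α) tokens. A minimal sketch, where α and k are illustrative values rather than measurements from this run:

```python
def expected_tokens_per_pass(alpha: float, k: int) -> float:
    """Mean tokens emitted per target-model verification pass,
    assuming an i.i.d. per-token acceptance rate `alpha` and k
    drafted tokens. Always >= 1, because the target model emits
    one token even when every draft token is rejected."""
    if alpha >= 1.0:
        return float(k + 1)      # every draft token accepted
    return (1 - alpha ** (k + 1)) / (1 - alpha)

# If the draft proposes 4 tokens and ~70% are accepted, each
# verification pass yields about 2.77 tokens on average.
print(expected_tokens_per_pass(0.7, 4))
```

Dividing measured speculative throughput by baseline throughput should land in the same ballpark as this figure, minus draft-model overhead.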

Verdict

Worth operationalizing locally?

  • Yes
  • No
  • Needs more data

Recommendation

Explain whether this configuration should become part of the local inference stack, citing the measured results above.