From 69cef8a90f87f39f0df3c5cef8c276a3b1d0fe21 Mon Sep 17 00:00:00 2001
From: Alexander Whitestone
Date: Tue, 21 Apr 2026 22:20:15 -0400
Subject: [PATCH] bench: record Apple Silicon DFlash pilot result (refs #152)
---
 .../dflash_m3max_36gb_qwen35_4b_pilot.md | 46 +++++++++++++++++++
 docs/DFLASH_APPLE_SILICON.md             | 26 +++++++++++
 2 files changed, 72 insertions(+)
 create mode 100644 benchmarks/reports/dflash_m3max_36gb_qwen35_4b_pilot.md

diff --git a/benchmarks/reports/dflash_m3max_36gb_qwen35_4b_pilot.md b/benchmarks/reports/dflash_m3max_36gb_qwen35_4b_pilot.md
new file mode 100644
index 0000000..cf0e274
--- /dev/null
+++ b/benchmarks/reports/dflash_m3max_36gb_qwen35_4b_pilot.md
@@ -0,0 +1,46 @@

# DFlash Apple Silicon Pilot — Qwen3.5-4B on M3 Max 36GB

Date: 2026-04-21
Machine: Apple M3 Max, 36 GB unified memory
Repo issue: #152

## Command

```bash
source /tmp/dflash-venv/bin/activate
cd /tmp/dflash-upstream
python -m dflash.benchmark --backend mlx \
    --model Qwen/Qwen3.5-4B \
    --draft-model z-lab/Qwen3.5-4B-DFlash \
    --dataset gsm8k \
    --max-samples 1 \
    --enable-thinking \
    --draft-sliding-window-size 4096
```

## Result

- Dataset: `gsm8k`
- Samples: `1`
- Baseline throughput: `22.35 tok/s`
- DFlash throughput: `46.78 tok/s`
- Decoding speedup: `2.09x`
- Average acceptance length: `6.48`

Acceptance length histogram:

```text
['0.3%', '11.1%', '12.7%', '10.4%', '11.7%', '7.6%', '7.0%', '3.8%', '5.1%', '6.3%', '2.8%', '3.8%', '2.2%', '1.9%', '0.9%', '2.5%', '9.8%']
```

## Caveats

- This is a **pilot**, not a decision-grade benchmark.
- Only `1` sample was run, so the throughput number is directional.
- No apples-to-apples baseline against plain MLX or llama.cpp speculative decoding is included yet.
- The planner still recommends trying `Qwen/Qwen3.5-9B + z-lab/Qwen3.5-9B-DFlash` on this machine for the more meaningful fit test.
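The reported numbers above can be cross-checked against each other. The sketch below is standalone and not part of the `dflash` harness; it assumes the histogram buckets are per-step shares of accepted lengths starting at `0`, which is consistent with the reported average up to percentage rounding:

```python
# Cross-check the pilot report's numbers. Standalone sketch, not part of
# the dflash harness; the histogram is assumed to start at length 0.

# Acceptance-length histogram from the report, in percent.
hist_pct = [0.3, 11.1, 12.7, 10.4, 11.7, 7.6, 7.0, 3.8, 5.1, 6.3,
            2.8, 3.8, 2.2, 1.9, 0.9, 2.5, 9.8]

# Weighted mean acceptance length over buckets assumed to be lengths 0..16.
mean_len = sum(i * p for i, p in enumerate(hist_pct)) / sum(hist_pct)
print(f"mean acceptance length ~ {mean_len:.2f}")  # ~6.47 vs reported 6.48

# The decoding speedup is just the ratio of the two throughputs.
speedup = 46.78 / 22.35
print(f"speedup = {speedup:.2f}x")  # 2.09x, matching the report
```

The weighted mean lands within rounding distance of the reported `6.48`, which supports the 0-indexed-bucket reading of the histogram.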
## Interim takeaway

DFlash is **real on Apple Silicon** and already shows a meaningful local speedup on a small matched model pair.
A `2.09x` pilot speedup on `Qwen3.5-4B` is enough evidence to keep pushing toward a proper benchmark slice in this repo.

diff --git a/docs/DFLASH_APPLE_SILICON.md b/docs/DFLASH_APPLE_SILICON.md
index be15ef3..d866cc4 100644
--- a/docs/DFLASH_APPLE_SILICON.md
+++ b/docs/DFLASH_APPLE_SILICON.md
@@ -51,6 +51,32 @@ Today the planner uses two upstream-supported MLX pairs:

On a `36 GB` Mac, the default recommendation is `qwen35-9b`.

## Pilot result already landed

A first live Apple Silicon run has already been captured in:

- `benchmarks/reports/dflash_m3max_36gb_qwen35_4b_pilot.md`

Pilot command:

```bash
python -m dflash.benchmark --backend mlx \
    --model Qwen/Qwen3.5-4B \
    --draft-model z-lab/Qwen3.5-4B-DFlash \
    --dataset gsm8k \
    --max-samples 1 \
    --enable-thinking \
    --draft-sliding-window-size 4096
```

Pilot outcome on this Mac:

- Baseline throughput: `22.35 tok/s`
- DFlash throughput: `46.78 tok/s`
- Decoding speedup: `2.09x`

Treat that as **directional proof**, not a final decision benchmark. The next step is the fuller comparison slice against plain MLX and llama.cpp speculative decoding.

## Upstream benchmark command

The harness uses the upstream MLX benchmark syntax from `z-lab/dflash`: