bench: record Apple Silicon DFlash pilot result (refs #152)

2026-04-21 22:20:15 -04:00
parent 636d294896
commit 69cef8a90f
2 changed files with 72 additions and 0 deletions
--- a/docs/DFLASH_APPLE_SILICON.md
+++ b/docs/DFLASH_APPLE_SILICON.md
@@ -51,6 +51,32 @@ Today the planner uses two upstream-supported MLX pairs:

 On a `36 GB` Mac, the default recommendation is `qwen35-9b`.

+## Pilot result already landed
+
+A first live Apple Silicon run has already been captured in:
+
+- `benchmarks/reports/dflash_m3max_36gb_qwen35_4b_pilot.md`
+
+Pilot command:
+
+```bash
+python -m dflash.benchmark --backend mlx \
+    --model Qwen/Qwen3.5-4B \
+    --draft-model z-lab/Qwen3.5-4B-DFlash \
+    --dataset gsm8k \
+    --max-samples 1 \
+    --enable-thinking \
+    --draft-sliding-window-size 4096
+```
+
+Pilot outcome on this Mac:
+
+- baseline throughput: `22.35 tok/s`
+- DFlash throughput: `46.78 tok/s`
+- decoding speedup: `2.09x`
+
+Treat that as a **directional proof**, not a final decision benchmark. The next step is the fuller comparison slice against plain MLX or llama.cpp speculative decoding.
+
 ## Upstream benchmark command

 The harness uses the upstream MLX benchmark syntax from `z-lab/dflash`: