From 69cef8a90f87f39f0df3c5cef8c276a3b1d0fe21 Mon Sep 17 00:00:00 2001
From: Alexander Whitestone
Date: Tue, 21 Apr 2026 22:20:15 -0400
Subject: [PATCH] bench: record Apple Silicon DFlash pilot result (refs #152)
---
 .../dflash_m3max_36gb_qwen35_4b_pilot.md | 46 +++++++++++++++++++
 docs/DFLASH_APPLE_SILICON.md             | 26 +++++++++++
 2 files changed, 72 insertions(+)
 create mode 100644 benchmarks/reports/dflash_m3max_36gb_qwen35_4b_pilot.md

diff --git a/benchmarks/reports/dflash_m3max_36gb_qwen35_4b_pilot.md b/benchmarks/reports/dflash_m3max_36gb_qwen35_4b_pilot.md
new file mode 100644
index 0000000..cf0e274
--- /dev/null
+++ b/benchmarks/reports/dflash_m3max_36gb_qwen35_4b_pilot.md
@@ -0,0 +1,46 @@

# DFlash Apple Silicon Pilot — Qwen3.5-4B on M3 Max 36GB

Date: 2026-04-21
Machine: Apple M3 Max, 36 GB unified memory
Repo issue: #152

## Command

```bash
source /tmp/dflash-venv/bin/activate
cd /tmp/dflash-upstream
python -m dflash.benchmark --backend mlx \
    --model Qwen/Qwen3.5-4B \
    --draft-model z-lab/Qwen3.5-4B-DFlash \
    --dataset gsm8k \
    --max-samples 1 \
    --enable-thinking \
    --draft-sliding-window-size 4096
```

## Result

- Dataset: `gsm8k`
- Samples: `1`
- Baseline throughput: `22.35 tok/s`
- DFlash throughput: `46.78 tok/s`
- Decoding speedup: `2.09x`
- Average acceptance length: `6.48`

Acceptance length histogram:

```text
['0.3%', '11.1%', '12.7%', '10.4%', '11.7%', '7.6%', '7.0%', '3.8%', '5.1%', '6.3%', '2.8%', '3.8%', '2.2%', '1.9%', '0.9%', '2.5%', '9.8%']
```

## Caveats

- This is a **pilot**, not a decision-grade benchmark.
- Only `1` sample was run, so the throughput number is directional.
- No apples-to-apples baseline against plain MLX or llama.cpp speculative decoding is included yet.
- The planner still recommends trying `Qwen/Qwen3.5-9B + z-lab/Qwen3.5-9B-DFlash` on this machine for the more meaningful fit test.
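The reported numbers above can be cross-checked against each other. The sketch below is standalone and not part of the `dflash` harness; it assumes the histogram buckets are per-step shares of accepted lengths starting at `0`, which is consistent with the reported average up to percentage rounding:

```python
# Cross-check the pilot report's numbers. Standalone sketch, not part of
# the dflash harness; the histogram is assumed to start at length 0.

# Acceptance-length histogram from the report, in percent.
hist_pct = [0.3, 11.1, 12.7, 10.4, 11.7, 7.6, 7.0, 3.8, 5.1, 6.3,
            2.8, 3.8, 2.2, 1.9, 0.9, 2.5, 9.8]

# Weighted mean acceptance length over buckets assumed to be lengths 0..16.
mean_len = sum(i * p for i, p in enumerate(hist_pct)) / sum(hist_pct)
print(f"mean acceptance length ~ {mean_len:.2f}")  # ~6.47 vs reported 6.48

# The decoding speedup is just the ratio of the two throughputs.
speedup = 46.78 / 22.35
print(f"speedup = {speedup:.2f}x")  # 2.09x, matching the report
```

The weighted mean lands within rounding distance of the reported `6.48`, which supports the 0-indexed-bucket reading of the histogram.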
## Interim takeaway

DFlash is **real on Apple Silicon** and already shows a meaningful local speedup on a small matched model pair.
A `2.09x` pilot speedup on `Qwen3.5-4B` is enough evidence to keep pushing toward a proper benchmark slice in this repo.

diff --git a/docs/DFLASH_APPLE_SILICON.md b/docs/DFLASH_APPLE_SILICON.md
index be15ef3..d866cc4 100644
--- a/docs/DFLASH_APPLE_SILICON.md
+++ b/docs/DFLASH_APPLE_SILICON.md
@@ -51,6 +51,32 @@ Today the planner uses two upstream-supported MLX pairs:

On a `36 GB` Mac, the default recommendation is `qwen35-9b`.

## Pilot result already landed

A first live Apple Silicon run has already been captured in:

- `benchmarks/reports/dflash_m3max_36gb_qwen35_4b_pilot.md`

Pilot command:

```bash
python -m dflash.benchmark --backend mlx \
    --model Qwen/Qwen3.5-4B \
    --draft-model z-lab/Qwen3.5-4B-DFlash \
    --dataset gsm8k \
    --max-samples 1 \
    --enable-thinking \
    --draft-sliding-window-size 4096
```

Pilot outcome on this Mac:

- Baseline throughput: `22.35 tok/s`
- DFlash throughput: `46.78 tok/s`
- Decoding speedup: `2.09x`

Treat that as **directional proof**, not a final decision benchmark. The next step is the fuller comparison slice against plain MLX and llama.cpp speculative decoding.

## Upstream benchmark command

The harness uses the upstream MLX benchmark syntax from `z-lab/dflash`: