bench: record Apple Silicon DFlash pilot result (refs #152)
All checks were successful
Smoke Test / smoke (pull_request) Successful in 18s

This commit is contained in:
Alexander Whitestone
2026-04-21 22:20:15 -04:00
parent 636d294896
commit 69cef8a90f
2 changed files with 72 additions and 0 deletions

View File

@@ -51,6 +51,32 @@ Today the planner uses two upstream-supported MLX pairs:
On a `36 GB` Mac, the default recommendation is `qwen35-9b`.
## Pilot result already landed
A first live Apple Silicon run has already been captured in:
- `benchmarks/reports/dflash_m3max_36gb_qwen35_4b_pilot.md`
Pilot command:
```bash
python -m dflash.benchmark --backend mlx \
--model Qwen/Qwen3.5-4B \
--draft-model z-lab/Qwen3.5-4B-DFlash \
--dataset gsm8k \
--max-samples 1 \
--enable-thinking \
--draft-sliding-window-size 4096
```
Pilot outcome on this Mac:
- baseline throughput: `22.35 tok/s`
- DFlash throughput: `46.78 tok/s`
- decoding speedup: `2.09x`
Treat that as a **directional proof**, not a final decision benchmark. The next step is the fuller comparison slice against plain MLX or llama.cpp speculative decoding.
## Upstream benchmark command
The harness uses the upstream MLX benchmark syntax from `z-lab/dflash`: