bench: record Apple Silicon DFlash pilot result (refs #152)
All checks were successful
Smoke Test / smoke (pull_request) Successful in 18s
46
benchmarks/reports/dflash_m3max_36gb_qwen35_4b_pilot.md
Normal file
@@ -0,0 +1,46 @@
# DFlash Apple Silicon Pilot — Qwen3.5-4B on M3 Max 36GB

Date: 2026-04-21
Machine: Apple M3 Max, 36 GB unified memory
Repo issue: #152

## Command

```bash
source /tmp/dflash-venv/bin/activate
cd /tmp/dflash-upstream
python -m dflash.benchmark --backend mlx \
  --model Qwen/Qwen3.5-4B \
  --draft-model z-lab/Qwen3.5-4B-DFlash \
  --dataset gsm8k \
  --max-samples 1 \
  --enable-thinking \
  --draft-sliding-window-size 4096
```

## Result

- Dataset: `gsm8k`
- Samples: `1`
- Baseline throughput: `22.35 tok/s`
- DFlash throughput: `46.78 tok/s`
- Decoding speedup: `2.09x`
- Average acceptance length: `6.48`
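
The reported speedup is the ratio of the two throughput figures. A minimal check, using only the numbers listed above (the variable names are illustrative):

```python
baseline_tok_s = 22.35  # baseline throughput from the result list
dflash_tok_s = 46.78    # DFlash throughput from the result list

# Speedup is the ratio of decoding throughputs.
speedup = dflash_tok_s / baseline_tok_s

# Equivalent view: fraction of per-token decode time saved.
time_saved = 1 - baseline_tok_s / dflash_tok_s

print(f"speedup: {speedup:.2f}x")                        # speedup: 2.09x
print(f"per-token decode time saved: {time_saved:.1%}")  # 52.2%
```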

Acceptance length histogram:

```text
['0.3%', '11.1%', '12.7%', '10.4%', '11.7%', '7.6%', '7.0%', '3.8%', '5.1%', '6.3%', '2.8%', '3.8%', '2.2%', '1.9%', '0.9%', '2.5%', '9.8%']
```
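
The report does not say what each histogram bin represents. One plausible reading, which is an assumption here, is that the 17 entries are the shares of decoding steps with acceptance length 0 through 16, in order. Under that assumption the weighted mean lands close to the reported average of `6.48`, which can be checked with a short sketch:

```python
hist_pct = ['0.3%', '11.1%', '12.7%', '10.4%', '11.7%', '7.6%', '7.0%',
            '3.8%', '5.1%', '6.3%', '2.8%', '3.8%', '2.2%', '1.9%',
            '0.9%', '2.5%', '9.8%']

# Convert percentage strings to fractions.
probs = [float(p.rstrip('%')) / 100 for p in hist_pct]

# The rounded shares do not sum to exactly 1, so normalize by the total.
total = sum(probs)

# ASSUMPTION: bin index k corresponds to acceptance length k (0..16).
mean_len = sum(k * p for k, p in enumerate(probs)) / total

print(f"bins: {len(probs)}, total mass: {total:.3f}, mean: {mean_len:.2f}")
```

The small gap between this estimate and `6.48` is consistent with the shares having been rounded to one decimal place, but it does not confirm the bin interpretation.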

## Caveats

- This is a **pilot**, not a decision-grade benchmark.
- Only `1` sample was run, so the throughput and speedup figures are directional only.
- No apples-to-apples baseline against plain MLX or llama.cpp speculative decoding is included yet.
- The planner still recommends trying `Qwen/Qwen3.5-9B + z-lab/Qwen3.5-9B-DFlash` on this machine for the more meaningful fit test.
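
Since a single sample gives no estimate of variance, a fuller run would need to report a spread alongside the mean speedup. The sketch below is one hypothetical way to summarize per-sample results. It assumes per-sample throughputs are available as lists of floats, which is not something the benchmark output shown here confirms.

```python
from statistics import mean, stdev


def summarize_speedup(baseline_tok_s, dflash_tok_s):
    """Per-sample speedup ratios with mean and sample standard deviation."""
    if len(baseline_tok_s) != len(dflash_tok_s) or not baseline_tok_s:
        raise ValueError("need two equally sized, non-empty lists")
    ratios = [d / b for b, d in zip(baseline_tok_s, dflash_tok_s)]
    # A standard deviation needs at least two samples.
    spread = stdev(ratios) if len(ratios) > 1 else None
    return {"n": len(ratios), "mean_speedup": mean(ratios), "stdev": spread}


# With the single pilot sample there is no spread to report.
print(summarize_speedup([22.35], [46.78]))
```

This averages per-sample ratios. Dividing total tokens by total time across samples is an equally valid summary and weights long generations more heavily, so a real benchmark slice should state which one it uses.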

## Interim takeaway

DFlash **runs on Apple Silicon** through the MLX backend and already shows a meaningful local speedup on a small matched target/draft pair.
A `2.09x` pilot speedup on `Qwen3.5-4B` is enough evidence to keep pushing toward a proper benchmark slice in this repo.
@@ -51,6 +51,32 @@ Today the planner uses two upstream-supported MLX pairs:

On a `36 GB` Mac, the default recommendation is `qwen35-9b`.

## Pilot result already landed

A first live Apple Silicon run has already been captured in:

- `benchmarks/reports/dflash_m3max_36gb_qwen35_4b_pilot.md`

Pilot command:

```bash
python -m dflash.benchmark --backend mlx \
  --model Qwen/Qwen3.5-4B \
  --draft-model z-lab/Qwen3.5-4B-DFlash \
  --dataset gsm8k \
  --max-samples 1 \
  --enable-thinking \
  --draft-sliding-window-size 4096
```

Pilot outcome on this Mac:

- baseline throughput: `22.35 tok/s`
- DFlash throughput: `46.78 tok/s`
- decoding speedup: `2.09x`

Treat that as **directional evidence** from a single sample, not a decision-grade benchmark. The next step is the fuller comparison slice against plain MLX or llama.cpp speculative decoding.

## Upstream benchmark command

The harness uses the upstream MLX benchmark syntax from `z-lab/dflash`: