# DFlash on Apple Silicon

This repo now carries a **Gitea-first benchmark harness** for evaluating whether upstream **DFlash on MLX** is worth adding to the local Apple Silicon inference stack.

## Why

The headline `Kimi K2.6 + DFlash` benchmark was measured on `8x MI300X` with huge RAM and ROCm patches. That exact recipe is not a fit for a `36 GB` Apple Silicon Mac.

What *is* relevant locally is the upstream `z-lab/dflash` MLX path, which can benchmark smaller matched target/draft pairs that fit on Apple Silicon.

## Current repo entry point

Use:

```bash
python3 benchmarks/dflash_apple_silicon.py --machine-label "M3 Max 36GB"
```

This prints a benchmark report template with:

- the selected model/draft pair
- exact setup commands
- the upstream MLX benchmark command
- baseline comparison guidance

Write the template to a file:

```bash
python3 benchmarks/dflash_apple_silicon.py \
  --machine-label "M3 Max 36GB" \
  --output benchmarks/reports/dflash_m3max_36gb.md
```

Emit the underlying plan as JSON:

```bash
python3 benchmarks/dflash_apple_silicon.py --format json
```
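
Since the plan is machine-readable, other scripts can key off it. A minimal consumption sketch (the `pair` field name is an assumption about the JSON schema, not a documented contract; inspect the actual output first):

```bash
# Extract one field from the emitted plan.
# "pair" is a hypothetical field name; adjust to the real schema.
python3 benchmarks/dflash_apple_silicon.py --format json \
  | python3 -c 'import json, sys; print(json.load(sys.stdin)["pair"])'
```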

## Selection logic

Today the planner uses two upstream-supported MLX pairs:

- `qwen35-9b`
  - base: `Qwen/Qwen3.5-9B`
  - draft: `z-lab/Qwen3.5-9B-DFlash`
  - chosen for ~28 GB+ machines
- `qwen35-4b`
  - base: `Qwen/Qwen3.5-4B`
  - draft: `z-lab/Qwen3.5-4B-DFlash`
  - fallback for tighter-memory Macs

On a `36 GB` Mac, the default recommendation is `qwen35-9b`.
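
A minimal sketch of that rule (the ~28 GiB cutoff is inferred from the pair notes above, and `sysctl -n hw.memsize` is the macOS way to read total physical memory; this is not the planner's actual code):

```bash
# Pick the DFlash pair the way the notes above suggest:
# the 9B pair on machines with roughly 28 GiB of RAM or more, else the 4B pair.
mem_gib=$(( $(sysctl -n hw.memsize) / 1073741824 ))
if [ "$mem_gib" -ge 28 ]; then
  echo "qwen35-9b: Qwen/Qwen3.5-9B + z-lab/Qwen3.5-9B-DFlash"
else
  echo "qwen35-4b: Qwen/Qwen3.5-4B + z-lab/Qwen3.5-4B-DFlash"
fi
```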

## Pilot result already landed

A first live Apple Silicon run has already been captured in:

- `benchmarks/reports/dflash_m3max_36gb_qwen35_4b_pilot.md`

Pilot command:

```bash
python -m dflash.benchmark --backend mlx \
  --model Qwen/Qwen3.5-4B \
  --draft-model z-lab/Qwen3.5-4B-DFlash \
  --dataset gsm8k \
  --max-samples 1 \
  --enable-thinking \
  --draft-sliding-window-size 4096
```

Pilot outcome on this Mac:

- baseline throughput: `22.35 tok/s`
- DFlash throughput: `46.78 tok/s`
- decoding speedup: `2.09x`

Treat that as a **directional proof**, not a final decision benchmark. The next step is the fuller comparison slice against plain MLX or llama.cpp speculative decoding.
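
For the llama.cpp side of that comparison, the baseline run would look something like the following (GGUF paths are placeholders, and flag spellings vary across llama.cpp versions; check `llama-speculative --help` on your build):

```bash
# Hypothetical llama.cpp speculative-decoding baseline on a matched
# target/draft pair. Model paths below are placeholders, not real files.
./llama-speculative \
  -m models/target-q4_k_m.gguf \
  -md models/draft-q4_k_m.gguf \
  -p "Janet has 12 apples and buys 17 more..." \
  -n 256
```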

## Upstream benchmark command

The harness uses the upstream MLX benchmark syntax from `z-lab/dflash`:

```bash
python -m dflash.benchmark --backend mlx \
  --model Qwen/Qwen3.5-9B \
  --draft-model z-lab/Qwen3.5-9B-DFlash \
  --dataset gsm8k \
  --max-samples 128 \
  --enable-thinking \
  --draft-sliding-window-size 4096
```
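
A 128-sample GSM8K pass on a 9B target is a long run on laptop-class hardware, so it can be worth smoke-testing the same command first with a small sample count, mirroring what the pilot did (only `--max-samples` changes; everything else is the upstream syntax above):

```bash
# Quick smoke test: confirm both models load and decode within memory
# before committing to the full 128-sample benchmark.
python -m dflash.benchmark --backend mlx \
  --model Qwen/Qwen3.5-9B \
  --draft-model z-lab/Qwen3.5-9B-DFlash \
  --dataset gsm8k \
  --max-samples 4 \
  --enable-thinking \
  --draft-sliding-window-size 4096
```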

## What remains

This PR adds the **planner + report template** so the benchmark is reproducible from the repo. The issue remains open until a real Apple Silicon run lands with:

- measured throughput
- measured memory (one way to capture this is sketched below)
- a baseline comparison against plain MLX or llama.cpp speculative decoding
- a recommendation on whether to operationalize DFlash locally
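
For the memory number, macOS's bundled `/usr/bin/time -l` reports the peak resident set size of a process, which is a low-ceremony way to wrap the benchmark run (the dflash flags are the upstream ones from above; only the `time` wrapper is new here):

```bash
# macOS: "-l" makes /usr/bin/time print "maximum resident set size".
# Note this is the utility at /usr/bin/time, not the shell's time keyword.
/usr/bin/time -l python -m dflash.benchmark --backend mlx \
  --model Qwen/Qwen3.5-9B \
  --draft-model z-lab/Qwen3.5-9B-DFlash \
  --dataset gsm8k \
  --max-samples 128 \
  --enable-thinking \
  --draft-sliding-window-size 4096
```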