Files
turboquant/docs/DFLASH_APPLE_SILICON.md

125 lines
3.6 KiB
Markdown
Raw Normal View History

# DFlash on Apple Silicon
This repo now carries a **Gitea-first benchmark harness** for evaluating whether upstream **DFlash on MLX** is worth adding to the local Apple Silicon inference stack.
## Why
The headline `Kimi K2.6 + DFlash` benchmark was measured on `8x MI300X` with huge RAM and ROCm patches. That exact recipe is not a fit for a `36 GB` Apple Silicon Mac.
What *is* relevant locally is the upstream `z-lab/dflash` MLX path, which can benchmark smaller matched target/draft pairs that fit on Apple Silicon.
## Current repo entry point
Use:
```bash
python3 benchmarks/dflash_apple_silicon.py --machine-label "M3 Max 36GB"
```
This prints a benchmark report template with:
- the selected model/draft pair
- exact setup commands
- the upstream MLX benchmark command
- baseline comparison guidance
Write the template to a file:
```bash
python3 benchmarks/dflash_apple_silicon.py \
--machine-label "M3 Max 36GB" \
--output benchmarks/reports/dflash_m3max_36gb.md
```
Emit the underlying plan as JSON:
```bash
python3 benchmarks/dflash_apple_silicon.py --format json
```
## Selection logic
Today the planner uses two upstream-supported MLX pairs:
- `qwen35-9b`
- base: `Qwen/Qwen3.5-9B`
- draft: `z-lab/Qwen3.5-9B-DFlash`
- chosen for ~28 GB+ machines
- `qwen35-4b`
- base: `Qwen/Qwen3.5-4B`
- draft: `z-lab/Qwen3.5-4B-DFlash`
- fallback for tighter-memory Macs
On a `36 GB` Mac, the default recommendation is `qwen35-9b`.
**Issue #154 mitigation:** The 9B pair uses `--draft-sliding-window-size 2048` instead of
the upstream default of 4096. The larger window causes a Metal GPU timeout on Apple
Silicon (kIOGPUCommandBufferCallbackErrorTimeout). See issue #154 for details.
## Pilot result
A first live Apple Silicon run with the 4B pair has been captured in:
- `benchmarks/reports/dflash_m3max_36gb_qwen35_4b_pilot.md`
Pilot command:
```bash
python -m dflash.benchmark --backend mlx \
--model Qwen/Qwen3.5-4B \
--draft-model z-lab/Qwen3.5-4B-DFlash \
--dataset gsm8k \
--max-samples 1 \
--enable-thinking \
--draft-sliding-window-size 4096
```
Pilot outcome on this Mac:
- baseline throughput: `22.35 tok/s`
- DFlash throughput: `46.78 tok/s`
- decoding speedup: `2.09x`
Treat that as a **directional proof**, not a final decision benchmark. The next step is the fuller comparison slice against plain MLX or llama.cpp speculative decoding.
## Upstream benchmark command
The harness uses the upstream MLX benchmark syntax from `z-lab/dflash`:
```bash
python -m dflash.benchmark --backend mlx \
--model Qwen/Qwen3.5-9B \
--draft-model z-lab/Qwen3.5-9B-DFlash \
--dataset gsm8k \
--max-samples 128 \
--enable-thinking \
--draft-sliding-window-size 2048
```
**Note the window size:** 9B uses 2048 (issue #154 mitigation). 4B can use 4096.
## Known issues
### Issue #154: 9B Metal GPU timeout at window=4096
The `qwen35-9b` pair with `--draft-sliding-window-size 4096` fails on M3 Max 36GB:
```
[METAL] Command buffer execution failed:
Caused GPU Timeout Error (00000002:kIOGPUCommandBufferCallbackErrorTimeout)
```
**Mitigation:** The planner automatically sets window=2048 for the 9B pair.
**Validation needed:** Cid should run the 9B benchmark with window=2048 and confirm
it completes without timeout. If 2048 still fails, try window=1024.
## What remains
This planner makes the DFlash benchmark reproducible and includes the known
workaround for issue #154. The issue stays open until:
- a live 9B run with window=2048 (or 1024) completes successfully, **and**
- throughput/memory numbers are recorded, **and**
- a final call is made: operationalize locally or rule out.