# DFlash on Apple Silicon

This repo now carries a **Gitea-first benchmark harness** for evaluating whether upstream **DFlash on MLX** is worth adding to the local Apple Silicon inference stack.

## Why

The headline `Kimi K2.6 + DFlash` benchmark was measured on `8x MI300X` with huge RAM and ROCm patches. That exact recipe is not a fit for a `36 GB` Apple Silicon Mac. What *is* relevant locally is the upstream `z-lab/dflash` MLX path, which can benchmark smaller matched target/draft pairs that fit on Apple Silicon.

## Current repo entry point

Use:

```bash
python3 benchmarks/dflash_apple_silicon.py --machine-label "M3 Max 36GB"
```

This prints a benchmark report template with:

- the selected model/draft pair
- exact setup commands
- the upstream MLX benchmark command
- baseline comparison guidance

Write the template to a file:

```bash
python3 benchmarks/dflash_apple_silicon.py \
  --machine-label "M3 Max 36GB" \
  --output benchmarks/reports/dflash_m3max_36gb.md
```

Emit the underlying plan as JSON:

```bash
python3 benchmarks/dflash_apple_silicon.py --format json
```

## Selection logic

Today the planner chooses between two upstream-supported MLX pairs:

- `qwen35-9b`
  - base: `Qwen/Qwen3.5-9B`
  - draft: `z-lab/Qwen3.5-9B-DFlash`
  - chosen for ~28 GB+ machines
- `qwen35-4b`
  - base: `Qwen/Qwen3.5-4B`
  - draft: `z-lab/Qwen3.5-4B-DFlash`
  - fallback for tighter-memory Macs

On a `36 GB` Mac, the default recommendation is `qwen35-9b`. (A hedged sketch of this heuristic is in the appendix at the end of this README.)

## Pilot result already landed

A first live Apple Silicon run has already been captured in:

- `benchmarks/reports/dflash_m3max_36gb_qwen35_4b_pilot.md`

Pilot command:

```bash
python -m dflash.benchmark --backend mlx \
  --model Qwen/Qwen3.5-4B \
  --draft-model z-lab/Qwen3.5-4B-DFlash \
  --dataset gsm8k \
  --max-samples 1 \
  --enable-thinking \
  --draft-sliding-window-size 4096
```

Pilot outcome on this Mac:

- baseline throughput: `22.35 tok/s`
- DFlash throughput: `46.78 tok/s`
- decoding speedup: `2.09x` (re-derived in the appendix)

Treat that as a **directional proof**, not a final decision benchmark. The next step is the fuller comparison slice against plain MLX or llama.cpp speculative decoding.

## Known 9B failure on this machine

A follow-up live run with:

- `Qwen/Qwen3.5-9B`
- `z-lab/Qwen3.5-9B-DFlash`

failed on this same M3 Max 36GB Mac with:

```text
[METAL] Command buffer execution failed: Caused GPU Timeout Error (00000002:kIOGPUCommandBufferCallbackErrorTimeout)
```

That failure is recorded in:

- `benchmarks/reports/dflash_m3max_36gb_qwen35_9b_timeout.md`

So the current guidance is:

- treat `qwen35-9b` as **experimental** on this machine
- treat `qwen35-4b` as the current **known-working local proof path**
- keep the issue open until we either stabilize the 9B path or clearly rule it out for this hardware tier

## Upstream benchmark command

The harness uses the upstream MLX benchmark syntax from `z-lab/dflash`:

```bash
python -m dflash.benchmark --backend mlx \
  --model Qwen/Qwen3.5-9B \
  --draft-model z-lab/Qwen3.5-9B-DFlash \
  --dataset gsm8k \
  --max-samples 128 \
  --enable-thinking \
  --draft-sliding-window-size 4096
```

## What remains

This PR adds the **planner + report template** so the benchmark is reproducible from the repo. The issue remains open until a real Apple Silicon run lands with:

- measured throughput
- measured memory
- a baseline comparison against plain MLX or llama.cpp speculative decoding
- a recommendation on whether to operationalize DFlash locally
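
## Appendix: selection heuristic sketch

The pair selection described under "Selection logic" can be illustrated with a minimal sketch. This is an illustration under assumptions, not the actual internals of `benchmarks/dflash_apple_silicon.py`: the names `PAIRS` and `select_pair` are hypothetical, and the `28` GB cutoff is taken from the "chosen for ~28 GB+ machines" note above.

```python
# Hypothetical sketch of the planner's pair selection. Names and the exact
# threshold are illustrative, not the harness's real internals.
PAIRS = {
    # Ordered largest-first: Python dicts preserve insertion order,
    # so the first pair whose memory floor fits wins.
    "qwen35-9b": {
        "base": "Qwen/Qwen3.5-9B",
        "draft": "z-lab/Qwen3.5-9B-DFlash",
        "min_ram_gb": 28,  # "chosen for ~28 GB+ machines"
    },
    "qwen35-4b": {
        "base": "Qwen/Qwen3.5-4B",
        "draft": "z-lab/Qwen3.5-4B-DFlash",
        "min_ram_gb": 0,   # fallback for tighter-memory Macs
    },
}

def select_pair(ram_gb: float) -> str:
    """Return the largest upstream-supported pair that fits in unified memory."""
    for name, pair in PAIRS.items():
        if ram_gb >= pair["min_ram_gb"]:
            return name
    return "qwen35-4b"

# On a 36 GB Mac this picks the 9B pair, matching the default above.
assert select_pair(36) == "qwen35-9b"
assert select_pair(16) == "qwen35-4b"
```

Given the GPU timeout recorded above, a fuller version of this heuristic might also gate `qwen35-9b` behind its experimental status on this hardware tier rather than selecting it on memory alone.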
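
## Appendix: pilot speedup arithmetic

The pilot's `2.09x` figure is simply the ratio of the two measured throughputs, which is easy to re-derive from the report:

```python
# Re-derive the pilot's decoding speedup from the reported throughputs.
baseline_tps = 22.35  # baseline throughput, tok/s (from the pilot report)
dflash_tps = 46.78    # DFlash throughput, tok/s (from the pilot report)

speedup = dflash_tps / baseline_tps
print(f"{speedup:.2f}x")  # -> 2.09x, matching the reported decoding speedup
```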