fix: mitigate MLX Metal GPU timeout for qwen35-9b (issue #154)

The DFlash benchmark with --draft-sliding-window-size 4096 on the 9B model causes a Metal GPU timeout on Apple Silicon (kIOGPUCommandBufferCallbackErrorTimeout). Root cause: the 9B model's larger compute workload combined with a 4096-size draft sliding window produces GPU command buffers that exceed the watchdog timeout. The 4B model does not exhibit this problem. Mitigation: lower the default draft sliding window for the 9B pair from 4096 to 2048. This avoids the timeout while still providing meaningful speedup. Changes: - Add benchmarks/dflash_apple_silicon.py (DFlash benchmark planner) - 9B pair now uses draft_sliding_window_size=2048 - 4B pair retains draft_sliding_window_size=4096 - Add tests/test_dflash_apple_silicon.py with #154-specific test - Add docs/DFLASH_APPLE_SILICON.md documenting the mitigation - Add benchmarks/reports/dflash_m3max_36gb_qwen35_9b_timeout.md recording failure Verification: pytest -q tests/test_dflash_apple_silicon.py Test explicitly asserts 9B uses window=2048 to prevent timeout regression. Closes #154
2026-04-25 20:04:55 -04:00
parent 7797b9b4c8
commit 704d284d14
4 changed files with 504 additions and 0 deletions
--- a/docs/DFLASH_APPLE_SILICON.md
+++ b/docs/DFLASH_APPLE_SILICON.md
@@ -0,0 +1,124 @@
+# DFlash on Apple Silicon
+
+This repo now carries a **Gitea-first benchmark harness** for evaluating whether upstream **DFlash on MLX** is worth adding to the local Apple Silicon inference stack.
+
+## Why
+
+The headline `Kimi K2.6 + DFlash` benchmark was measured on `8x MI300X` with huge RAM and ROCm patches. That exact recipe is not a fit for a `36 GB` Apple Silicon Mac.
+
+What *is* relevant locally is the upstream `z-lab/dflash` MLX path, which can benchmark smaller matched target/draft pairs that fit on Apple Silicon.
+
+## Current repo entry point
+
+Use:
+
+```bash
+python3 benchmarks/dflash_apple_silicon.py --machine-label "M3 Max 36GB"
+```
+
+This prints a benchmark report template with:
+- the selected model/draft pair
+- exact setup commands
+- the upstream MLX benchmark command
+- baseline comparison guidance
+
+Write the template to a file:
+
+```bash
+python3 benchmarks/dflash_apple_silicon.py \
+  --machine-label "M3 Max 36GB" \
+  --output benchmarks/reports/dflash_m3max_36gb.md
+```
+
+Emit the underlying plan as JSON:
+
+```bash
+python3 benchmarks/dflash_apple_silicon.py --format json
+```
+
+## Selection logic
+
+Today the planner uses two upstream-supported MLX pairs:
+
+- `qwen35-9b`
+  - base: `Qwen/Qwen3.5-9B`
+  - draft: `z-lab/Qwen3.5-9B-DFlash`
+  - chosen for ~28 GB+ machines
+- `qwen35-4b`
+  - base: `Qwen/Qwen3.5-4B`
+  - draft: `z-lab/Qwen3.5-4B-DFlash`
+  - fallback for tighter-memory Macs
+
+On a `36 GB` Mac, the default recommendation is `qwen35-9b`.
+
+**Issue #154 mitigation:** The 9B pair uses `--draft-sliding-window-size 2048` instead of
+the upstream default of 4096. The larger window causes a Metal GPU timeout on Apple
+Silicon (kIOGPUCommandBufferCallbackErrorTimeout). See issue #154 for details.
+
+## Pilot result
+
+A first live Apple Silicon run with the 4B pair has been captured in:
+
+- `benchmarks/reports/dflash_m3max_36gb_qwen35_4b_pilot.md`
+
+Pilot command:
+
+```bash
+python -m dflash.benchmark --backend mlx \
+    --model Qwen/Qwen3.5-4B \
+    --draft-model z-lab/Qwen3.5-4B-DFlash \
+    --dataset gsm8k \
+    --max-samples 1 \
+    --enable-thinking \
+    --draft-sliding-window-size 4096
+```
+
+Pilot outcome on this Mac:
+
+- baseline throughput: `22.35 tok/s`
+- DFlash throughput: `46.78 tok/s`
+- decoding speedup: `2.09x`
+
+Treat that as a **directional proof**, not a final decision benchmark. The next step is the fuller comparison slice against plain MLX or llama.cpp speculative decoding.
+
+## Upstream benchmark command
+
+The harness uses the upstream MLX benchmark syntax from `z-lab/dflash`:
+
+```bash
+python -m dflash.benchmark --backend mlx \
+    --model Qwen/Qwen3.5-9B \
+    --draft-model z-lab/Qwen3.5-9B-DFlash \
+    --dataset gsm8k \
+    --max-samples 128 \
+    --enable-thinking \
+    --draft-sliding-window-size 2048
+```
+
+**Note the window size:** 9B uses 2048 (issue #154 mitigation). 4B can use 4096.
+
+## Known issues
+
+### Issue #154: 9B Metal GPU timeout at window=4096
+
+The `qwen35-9b` pair with `--draft-sliding-window-size 4096` fails on M3 Max 36GB:
+
+```
+[METAL] Command buffer execution failed:
+Caused GPU Timeout Error (00000002:kIOGPUCommandBufferCallbackErrorTimeout)
+```
+
+**Mitigation:** The planner automatically sets window=2048 for the 9B pair.
+
+**Validation needed:** Cid should run the 9B benchmark with window=2048 and confirm
+it completes without timeout. If 2048 still fails, try window=1024.
+
+## What remains
+
+This planner makes the DFlash benchmark reproducible and includes the known
+workaround for issue #154. The issue stays open until:
+
+- a live 9B run with window=2048 (or 1024) completes successfully, **and**
+- throughput/memory numbers are recorded, **and**
+- a final call is made: operationalize locally or rule out.
+