docs: record Qwen3.5-9B DFlash Metal timeout (refs #152, #154)

2026-04-21 22:25:25 -04:00
parent 69cef8a90f
commit dabb96d315
2 changed files with 82 additions and 0 deletions
--- a/benchmarks/reports/dflash_m3max_36gb_qwen35_9b_timeout.md
+++ b/benchmarks/reports/dflash_m3max_36gb_qwen35_9b_timeout.md
@@ -0,0 +1,59 @@
 # DFlash on Apple Silicon Failure Report — Qwen3.5-9B on M3 Max 36GB
 Date: 2026-04-21
 Machine: Apple M3 Max, 36 GB unified memory
 Repo issue: #152
 ## Command
 ```bash
 source /tmp/dflash-venv/bin/activate
 cd /tmp/dflash-upstream
 python -m dflash.benchmark --backend mlx \
    --model Qwen/Qwen3.5-9B \
    --draft-model z-lab/Qwen3.5-9B-DFlash \
    --dataset gsm8k \
    --max-samples 1 \
    --enable-thinking \
    --draft-sliding-window-size 4096
 ```
 ## Outcome
 The benchmark did **not** complete successfully on this machine.
 ### Failure signature
 ```text
 libc++abi: terminating due to uncaught exception of type std::runtime_error:
 [METAL] Command buffer execution failed:
 Caused GPU Timeout Error (00000002:kIOGPUCommandBufferCallbackErrorTimeout)
 ```
 Additional shutdown noise:
 ```text
 bash: [11285: 1] tcsetattr: Inappropriate ioctl for device
 resource_tracker: There appear to be 1 leaked semaphore objects to clean up at shutdown
 ```
 ## Interpretation
 This is strong evidence that the `Qwen/Qwen3.5-9B + z-lab/Qwen3.5-9B-DFlash` pair is **not currently stable** on an M3 Max 36GB Mac under the upstream MLX benchmark path, at least with the default settings used here.
 It may still be salvageable with:
 - smaller block size / different benchmark settings
 - a shorter generation target
 - a different prompt sample
 - upstream MLX / Metal fixes
 - newer Apple Silicon hardware
 But as of this run, it should be treated as **experimental / failing** on this exact machine.
 ## Recommendation
 For this Mac, the working local proof path is still:
 - `Qwen/Qwen3.5-4B`
 - `z-lab/Qwen3.5-4B-DFlash`
 Use the 4B pair for reproducible local validation while the 9B Metal timeout is investigated separately.
--- a/docs/DFLASH_APPLE_SILICON.md
+++ b/docs/DFLASH_APPLE_SILICON.md
@@ -77,6 +77,29 @@ Pilot outcome on this Mac:
 Treat that as a **directional proof**, not a final decision benchmark. The next step is the fuller comparison slice against plain MLX or llama.cpp speculative decoding.
 ## Known 9B failure on this machine
 A follow-up live run with:
 - `Qwen/Qwen3.5-9B`
 - `z-lab/Qwen3.5-9B-DFlash`
 failed on this same M3 Max 36GB Mac with:
 ```text
 [METAL] Command buffer execution failed:
 Caused GPU Timeout Error (00000002:kIOGPUCommandBufferCallbackErrorTimeout)
 ```
 That failure is recorded in:
 - `benchmarks/reports/dflash_m3max_36gb_qwen35_9b_timeout.md`
 So the current guidance is:
 - treat `qwen35-9b` as **experimental** on this machine
 - treat `qwen35-4b` as the current **known-working local proof path**
 - keep the issue open until we either stabilize the 9B path or clearly rule it out for this hardware tier
 ## Upstream benchmark command
 The harness uses the upstream MLX benchmark syntax from `z-lab/dflash`: