Files

Alexander Payne 704d284d14

Smoke Test / smoke (pull_request) Successful in 10s

Details

fix: mitigate MLX Metal GPU timeout for qwen35-9b (issue #154 )

The DFlash benchmark with --draft-sliding-window-size 4096 on the 9B model
causes a Metal GPU timeout on Apple Silicon (kIOGPUCommandBufferCallbackErrorTimeout).

Root cause: the 9B model's larger compute workload combined with a 4096-size
draft sliding window produces GPU command buffers that exceed the watchdog
timeout. The 4B model does not exhibit this problem.

Mitigation: lower the default draft sliding window for the 9B pair from 4096
to 2048. This avoids the timeout while still providing meaningful speedup.

Changes:
- Add benchmarks/dflash_apple_silicon.py (DFlash benchmark planner)
  - 9B pair now uses draft_sliding_window_size=2048
  - 4B pair retains draft_sliding_window_size=4096
- Add tests/test_dflash_apple_silicon.py with #154-specific test
- Add docs/DFLASH_APPLE_SILICON.md documenting the mitigation
- Add benchmarks/reports/dflash_m3max_36gb_qwen35_9b_timeout.md recording failure

Verification: pytest -q tests/test_dflash_apple_silicon.py
Test explicitly asserts 9B uses window=2048 to prevent timeout regression.

Closes #154

2026-04-25 20:04:55 -04:00

3.6 KiB

Raw Blame History

DFlash on Apple Silicon

This repo now carries a Gitea-first benchmark harness for evaluating whether upstream DFlash on MLX is worth adding to the local Apple Silicon inference stack.

Why

The headline Kimi K2.6 + DFlash benchmark was measured on 8x MI300X with huge RAM and ROCm patches. That exact recipe is not a fit for a 36 GB Apple Silicon Mac.

What is relevant locally is the upstream z-lab/dflash MLX path, which can benchmark smaller matched target/draft pairs that fit on Apple Silicon.

Current repo entry point

Use:

python3 benchmarks/dflash_apple_silicon.py --machine-label "M3 Max 36GB"

This prints a benchmark report template with:

the selected model/draft pair
exact setup commands
the upstream MLX benchmark command
baseline comparison guidance

Write the template to a file:

python3 benchmarks/dflash_apple_silicon.py \
  --machine-label "M3 Max 36GB" \
  --output benchmarks/reports/dflash_m3max_36gb.md

Emit the underlying plan as JSON:

python3 benchmarks/dflash_apple_silicon.py --format json

Selection logic

Today the planner uses two upstream-supported MLX pairs:

qwen35-9b
- base: Qwen/Qwen3.5-9B
- draft: z-lab/Qwen3.5-9B-DFlash
- chosen for ~28 GB+ machines
qwen35-4b
- base: Qwen/Qwen3.5-4B
- draft: z-lab/Qwen3.5-4B-DFlash
- fallback for tighter-memory Macs

On a 36 GB Mac, the default recommendation is qwen35-9b.

Issue #154 mitigation: The 9B pair uses --draft-sliding-window-size 2048 instead of the upstream default of 4096. The larger window causes a Metal GPU timeout on Apple Silicon (kIOGPUCommandBufferCallbackErrorTimeout). See issue #154 for details.

Pilot result

A first live Apple Silicon run with the 4B pair has been captured in:

benchmarks/reports/dflash_m3max_36gb_qwen35_4b_pilot.md

Pilot command:

python -m dflash.benchmark --backend mlx \
    --model Qwen/Qwen3.5-4B \
    --draft-model z-lab/Qwen3.5-4B-DFlash \
    --dataset gsm8k \
    --max-samples 1 \
    --enable-thinking \
    --draft-sliding-window-size 4096

Pilot outcome on this Mac:

baseline throughput: 22.35 tok/s
DFlash throughput: 46.78 tok/s
decoding speedup: 2.09x

Treat that as a directional proof, not a final decision benchmark. The next step is the fuller comparison slice against plain MLX or llama.cpp speculative decoding.

Upstream benchmark command

The harness uses the upstream MLX benchmark syntax from z-lab/dflash:

python -m dflash.benchmark --backend mlx \
    --model Qwen/Qwen3.5-9B \
    --draft-model z-lab/Qwen3.5-9B-DFlash \
    --dataset gsm8k \
    --max-samples 128 \
    --enable-thinking \
    --draft-sliding-window-size 2048

Note the window size: 9B uses 2048 (issue #154 mitigation). 4B can use 4096.

Known issues

Issue #154: 9B Metal GPU timeout at window=4096

The qwen35-9b pair with --draft-sliding-window-size 4096 fails on M3 Max 36GB:

[METAL] Command buffer execution failed:
Caused GPU Timeout Error (00000002:kIOGPUCommandBufferCallbackErrorTimeout)

Mitigation: The planner automatically sets window=2048 for the 9B pair.

Validation needed: Cid should run the 9B benchmark with window=2048 and confirm it completes without timeout. If 2048 still fails, try window=1024.

What remains

This planner makes the DFlash benchmark reproducible and includes the known workaround for issue #154. The issue stays open until:

a live 9B run with window=2048 (or 1024) completes successfully, and
throughput/memory numbers are recorded, and
a final call is made: operationalize locally or rule out.

3.6 KiB Raw Blame History