turboquant/docs/DFLASH_APPLE_SILICON.md
Alexander Whitestone dabb96d315
docs: record Qwen3.5-9B DFlash Metal timeout (refs #152, #154)
2026-04-21 22:25:25 -04:00

DFlash on Apple Silicon

This repo now carries a Gitea-first benchmark harness for evaluating whether upstream DFlash on MLX is worth adding to the local Apple Silicon inference stack.

Why

The headline Kimi K2.6 + DFlash benchmark was measured on 8x MI300X with far more memory than any Mac, plus ROCm-specific patches. That exact recipe is not a fit for a 36 GB Apple Silicon Mac.

What is relevant locally is the upstream z-lab/dflash MLX path, which can benchmark smaller matched target/draft pairs that fit on Apple Silicon.

Current repo entry point

Use:

python3 benchmarks/dflash_apple_silicon.py --machine-label "M3 Max 36GB"

This prints a benchmark report template with:

  • the selected model/draft pair
  • exact setup commands
  • the upstream MLX benchmark command
  • baseline comparison guidance

Write the template to a file:

python3 benchmarks/dflash_apple_silicon.py \
  --machine-label "M3 Max 36GB" \
  --output benchmarks/reports/dflash_m3max_36gb.md

Emit the underlying plan as JSON:

python3 benchmarks/dflash_apple_silicon.py --format json
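The JSON plan can be consumed by other tooling with an ordinary parser. The sketch below is hypothetical: the field names (`pair`, `base`, `draft`) are placeholder assumptions, not the harness's actual schema — inspect the real `--format json` output before wiring anything to it.

```python
import json

# Hypothetical plan payload -- the keys here are illustrative assumptions,
# not the harness's real schema. Inspect `--format json` output to confirm.
example_plan = json.loads("""
{
  "pair": "qwen35-9b",
  "base": "Qwen/Qwen3.5-9B",
  "draft": "z-lab/Qwen3.5-9B-DFlash"
}
""")

# Pull out the model/draft identifiers for downstream tooling.
print(example_plan["base"], "->", example_plan["draft"])
```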

Selection logic

Today the planner uses two upstream-supported MLX pairs:

  • qwen35-9b
    • base: Qwen/Qwen3.5-9B
    • draft: z-lab/Qwen3.5-9B-DFlash
    • chosen for ~28 GB+ machines
  • qwen35-4b
    • base: Qwen/Qwen3.5-4B
    • draft: z-lab/Qwen3.5-4B-DFlash
    • fallback for tighter-memory Macs

On a 36 GB Mac, the default recommendation is qwen35-9b.
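The selection rule above amounts to a memory threshold. A minimal sketch, assuming the planner keys off unified memory in GB — the names here (`Pair`, `select_pair`, `MIN_9B_GB`) are illustrative, not the repo's actual API:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pair:
    name: str
    base: str
    draft: str


QWEN35_9B = Pair("qwen35-9b", "Qwen/Qwen3.5-9B", "z-lab/Qwen3.5-9B-DFlash")
QWEN35_4B = Pair("qwen35-4b", "Qwen/Qwen3.5-4B", "z-lab/Qwen3.5-4B-DFlash")

MIN_9B_GB = 28  # the doc's ~28 GB+ threshold for the 9B pair (assumed cutoff)


def select_pair(unified_memory_gb: float) -> Pair:
    """Prefer the 9B pair on big-memory Macs; fall back to 4B otherwise."""
    return QWEN35_9B if unified_memory_gb >= MIN_9B_GB else QWEN35_4B


print(select_pair(36).name)  # -> qwen35-9b, matching the 36 GB default
```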

Pilot result already landed

A first live Apple Silicon run has already been captured in:

  • benchmarks/reports/dflash_m3max_36gb_qwen35_4b_pilot.md

Pilot command:

python -m dflash.benchmark --backend mlx \
    --model Qwen/Qwen3.5-4B \
    --draft-model z-lab/Qwen3.5-4B-DFlash \
    --dataset gsm8k \
    --max-samples 1 \
    --enable-thinking \
    --draft-sliding-window-size 4096

Pilot outcome on this Mac:

  • baseline throughput: 22.35 tok/s
  • DFlash throughput: 46.78 tok/s
  • decoding speedup: 2.09x
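The reported speedup is just the ratio of the two throughput numbers, which is easy to sanity-check:

```python
# Sanity-check the pilot's reported 2.09x speedup from its raw numbers.
baseline_tps = 22.35   # plain decoding throughput, tok/s
dflash_tps = 46.78     # DFlash-accelerated throughput, tok/s

speedup = dflash_tps / baseline_tps
print(f"decoding speedup: {speedup:.2f}x")  # -> decoding speedup: 2.09x
```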

Treat that as a directional proof, not a final decision benchmark. The next step is the fuller comparison slice against plain MLX or llama.cpp speculative decoding.

Known 9B failure on this machine

A follow-up live run with:

  • Qwen/Qwen3.5-9B
  • z-lab/Qwen3.5-9B-DFlash

failed on this same M3 Max 36GB Mac with:

[METAL] Command buffer execution failed:
Caused GPU Timeout Error (00000002:kIOGPUCommandBufferCallbackErrorTimeout)

That failure is recorded in:

  • benchmarks/reports/dflash_m3max_36gb_qwen35_9b_timeout.md

So the current guidance is:

  • treat qwen35-9b as experimental on this machine
  • treat qwen35-4b as the current known-working local proof path
  • keep the issue open until we either stabilize the 9B path or clearly rule it out for this hardware tier

Upstream benchmark command

The harness uses the upstream MLX benchmark syntax from z-lab/dflash:

python -m dflash.benchmark --backend mlx \
    --model Qwen/Qwen3.5-9B \
    --draft-model z-lab/Qwen3.5-9B-DFlash \
    --dataset gsm8k \
    --max-samples 128 \
    --enable-thinking \
    --draft-sliding-window-size 4096

What remains

This PR adds the planner + report template so the benchmark is reproducible from the repo. The 4B pilot above is directional only; the issue remains open until a full-scale Apple Silicon run lands with:

  • measured throughput
  • measured memory
  • a baseline comparison against plain MLX or llama.cpp speculative decoding
  • a recommendation on whether to operationalize DFlash locally