docs/DFLASH_APPLE_SILICON.md

# DFlash on Apple Silicon

This repo now carries a **Gitea-first benchmark harness** for evaluating whether upstream **DFlash on MLX** is worth adding to the local Apple Silicon inference stack.

## Why

The headline `Kimi K2.6 + DFlash` benchmark was measured on `8x MI300X` with far more memory than a Mac offers and with ROCm-specific patches. That exact recipe is not a fit for a `36 GB` Apple Silicon Mac.

What *is* relevant locally is the upstream `z-lab/dflash` MLX path, which can benchmark smaller matched target/draft pairs that fit on Apple Silicon.

## Current repo entry point

Use:

```bash
python3 benchmarks/dflash_apple_silicon.py --machine-label "M3 Max 36GB"
```

This prints a benchmark report template with:
- the selected model/draft pair
- exact setup commands
- the upstream MLX benchmark command
- baseline comparison guidance

Write the template to a file:

```bash
python3 benchmarks/dflash_apple_silicon.py \
  --machine-label "M3 Max 36GB" \
  --output benchmarks/reports/dflash_m3max_36gb.md
```

Emit the underlying plan as JSON:

```bash
python3 benchmarks/dflash_apple_silicon.py --format json
```

## Selection logic

Today the planner uses two upstream-supported MLX pairs:

- `qwen35-9b`
  - base: `Qwen/Qwen3.5-9B`
  - draft: `z-lab/Qwen3.5-9B-DFlash`
  - chosen for ~28 GB+ machines
- `qwen35-4b`
  - base: `Qwen/Qwen3.5-4B`
  - draft: `z-lab/Qwen3.5-4B-DFlash`
  - fallback for tighter-memory Macs

On a `36 GB` Mac, the default recommendation is `qwen35-9b`.

**Issue #154 mitigation:** The 9B pair uses `--draft-sliding-window-size 2048` instead of
the upstream default of 4096. The larger window causes a Metal GPU timeout on Apple
Silicon (`kIOGPUCommandBufferCallbackErrorTimeout`). See issue #154 for details.
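
The selection rule above can be sketched in a few lines of Python. This is an illustrative sketch, not the code in `benchmarks/dflash_apple_silicon.py`: the names `PairPlan`, `select_pair`, and `NINE_B_MIN_MEMORY_GB` are hypothetical, and treating "~28 GB+" as a hard `>= 28` cutoff is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PairPlan:
    """One matched target/draft pair plus its draft sliding window (hypothetical type)."""
    name: str
    base: str
    draft: str
    draft_sliding_window_size: int


# Model IDs and window sizes are taken from this document.
QWEN35_9B = PairPlan("qwen35-9b", "Qwen/Qwen3.5-9B", "z-lab/Qwen3.5-9B-DFlash", 2048)
QWEN35_4B = PairPlan("qwen35-4b", "Qwen/Qwen3.5-4B", "z-lab/Qwen3.5-4B-DFlash", 4096)

# Assumption: the "~28 GB+" guidance is modelled as a hard threshold.
NINE_B_MIN_MEMORY_GB = 28.0


def select_pair(memory_gb: float) -> PairPlan:
    """Pick the 9B pair when memory allows, otherwise fall back to the 4B pair."""
    return QWEN35_9B if memory_gb >= NINE_B_MIN_MEMORY_GB else QWEN35_4B


plan = select_pair(36)
print(plan.name, plan.draft_sliding_window_size)
```

Keeping the window size on the pair itself, rather than as a global default, is what lets the 9B pair carry the issue #154 workaround without changing the 4B settings.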

## Pilot result

A first live Apple Silicon run with the 4B pair has been captured in:

- `benchmarks/reports/dflash_m3max_36gb_qwen35_4b_pilot.md`

Pilot command:

```bash
python -m dflash.benchmark --backend mlx \
    --model Qwen/Qwen3.5-4B \
    --draft-model z-lab/Qwen3.5-4B-DFlash \
    --dataset gsm8k \
    --max-samples 1 \
    --enable-thinking \
    --draft-sliding-window-size 4096
```

Pilot outcome on this Mac:

- baseline throughput: `22.35 tok/s`
- DFlash throughput: `46.78 tok/s`
- decoding speedup: `2.09x`

Because the pilot ran a single sample (`--max-samples 1`), treat it as a **directional proof**, not a final decision benchmark. The next step is the fuller comparison slice against plain MLX or llama.cpp speculative decoding.
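
The reported speedup is simply the ratio of the two throughput figures. A minimal check, using a hypothetical helper `decoding_speedup` that is not part of the harness:

```python
def decoding_speedup(baseline_tok_s: float, dflash_tok_s: float) -> float:
    """Return DFlash throughput divided by baseline throughput."""
    if baseline_tok_s <= 0:
        raise ValueError("baseline throughput must be positive")
    return dflash_tok_s / baseline_tok_s


# Figures from the 4B pilot run above.
speedup = decoding_speedup(22.35, 46.78)
print(f"{speedup:.2f}x")  # 2.09x
```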

## Upstream benchmark command

The harness uses the upstream MLX benchmark syntax from `z-lab/dflash`:

```bash
python -m dflash.benchmark --backend mlx \
    --model Qwen/Qwen3.5-9B \
    --draft-model z-lab/Qwen3.5-9B-DFlash \
    --dataset gsm8k \
    --max-samples 128 \
    --enable-thinking \
    --draft-sliding-window-size 2048
```

**Note the window size:** 9B uses 2048 (issue #154 mitigation). 4B can use 4096.
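
For illustration, the command above can be assembled from a pair's settings as an argument list. The function name `build_benchmark_command` is hypothetical and this is not the harness's actual code; the flags are exactly the ones shown in this document.

```python
import shlex


def build_benchmark_command(
    model: str,
    draft_model: str,
    window: int,
    max_samples: int = 128,
    dataset: str = "gsm8k",
) -> list[str]:
    """Build the upstream MLX benchmark invocation as an argv list."""
    return [
        "python", "-m", "dflash.benchmark",
        "--backend", "mlx",
        "--model", model,
        "--draft-model", draft_model,
        "--dataset", dataset,
        "--max-samples", str(max_samples),
        "--enable-thinking",
        "--draft-sliding-window-size", str(window),
    ]


cmd = build_benchmark_command(
    "Qwen/Qwen3.5-9B", "z-lab/Qwen3.5-9B-DFlash", window=2048
)
# shlex.join quotes arguments safely for copy-pasting into a shell.
print(shlex.join(cmd))
```

Building an argv list rather than a single string avoids shell-quoting mistakes if the command is later passed to `subprocess.run`.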

## Known issues

### Issue #154: 9B Metal GPU timeout at window=4096

The `qwen35-9b` pair with `--draft-sliding-window-size 4096` fails on M3 Max 36GB:

```
[METAL] Command buffer execution failed:
Caused GPU Timeout Error (00000002:kIOGPUCommandBufferCallbackErrorTimeout)
```

**Mitigation:** The planner automatically sets window=2048 for the 9B pair.

**Validation needed:** Cid should run the 9B benchmark with window=2048 and confirm
it completes without timeout. If 2048 still fails, try window=1024.
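
The validation step can be made mechanical with a small helper that recognizes the timeout in captured output and suggests the next window to try. This is a sketch under assumptions: `hit_metal_timeout`, `next_window`, and the `FALLBACK_WINDOWS` ladder are hypothetical names, and matching on the error constant assumes the message text stays as shown above.

```python
from typing import Optional

# Error constant as it appears in the Metal failure message above.
TIMEOUT_MARKER = "kIOGPUCommandBufferCallbackErrorTimeout"

# Windows to try for the 9B pair, per the validation note above.
FALLBACK_WINDOWS = (2048, 1024)


def hit_metal_timeout(output: str) -> bool:
    """Return True if captured benchmark output contains the GPU timeout."""
    return TIMEOUT_MARKER in output


def next_window(current: int) -> Optional[int]:
    """Return the largest fallback window below `current`, or None if exhausted."""
    smaller = [w for w in FALLBACK_WINDOWS if w < current]
    return max(smaller) if smaller else None


log = (
    "[METAL] Command buffer execution failed:\n"
    "Caused GPU Timeout Error (00000002:kIOGPUCommandBufferCallbackErrorTimeout)"
)
if hit_metal_timeout(log):
    print("retry with window =", next_window(2048))
```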

## What remains

This planner makes the DFlash benchmark reproducible and includes the known
workaround for issue #154. The issue stays open until:

- a live 9B run with window=2048 (or 1024) completes successfully, **and**
- throughput/memory numbers are recorded, **and**
- a final call is made: operationalize locally or rule out.

**Commit:** `fix: mitigate MLX Metal GPU timeout for qwen35-9b (issue #154)` (2026-04-25 20:04:55 -04:00)

The DFlash benchmark with `--draft-sliding-window-size 4096` on the 9B model causes a Metal GPU timeout on Apple Silicon (`kIOGPUCommandBufferCallbackErrorTimeout`).

**Root cause:** the 9B model's larger compute workload combined with a 4096-size draft sliding window produces GPU command buffers that exceed the watchdog timeout. The 4B model does not exhibit this problem.

**Mitigation:** lower the default draft sliding window for the 9B pair from 4096 to 2048. This avoids the timeout while still providing meaningful speedup.

Changes:
- Add `benchmarks/dflash_apple_silicon.py` (DFlash benchmark planner)
  - 9B pair now uses `draft_sliding_window_size=2048`
  - 4B pair retains `draft_sliding_window_size=4096`
- Add `tests/test_dflash_apple_silicon.py` with #154-specific test
- Add `docs/DFLASH_APPLE_SILICON.md` documenting the mitigation
- Add `benchmarks/reports/dflash_m3max_36gb_qwen35_9b_timeout.md` recording failure

**Verification:** `pytest -q tests/test_dflash_apple_silicon.py`. The test explicitly asserts 9B uses window=2048 to prevent timeout regression.

Closes #154
