turboquant/docs/DFLASH_APPLE_SILICON.md

# DFlash on Apple Silicon

This repo now carries a **Gitea-first benchmark harness** for evaluating whether upstream **DFlash on MLX** is worth adding to the local Apple Silicon inference stack.

## Why

The headline `Kimi K2.6 + DFlash` benchmark was measured on `8x MI300X` with huge RAM and ROCm patches. That exact recipe is not a fit for a `36 GB` Apple Silicon Mac.

What *is* relevant locally is the upstream `z-lab/dflash` MLX path, which can benchmark smaller matched target/draft pairs that fit on Apple Silicon.

## Current repo entry point

Use:

```bash
python3 benchmarks/dflash_apple_silicon.py --machine-label "M3 Max 36GB"
```

This prints a benchmark report template with:
- the selected model/draft pair
- exact setup commands
- the upstream MLX benchmark command
- baseline comparison guidance

Write the template to a file:

```bash
python3 benchmarks/dflash_apple_silicon.py \
  --machine-label "M3 Max 36GB" \
  --output benchmarks/reports/dflash_m3max_36gb.md
```

Emit the underlying plan as JSON:

```bash
python3 benchmarks/dflash_apple_silicon.py --format json
```

## Selection logic

Today the planner uses two upstream-supported MLX pairs:

- `qwen35-9b`
  - base: `Qwen/Qwen3.5-9B`
  - draft: `z-lab/Qwen3.5-9B-DFlash`
  - chosen for ~28 GB+ machines
- `qwen35-4b`
  - base: `Qwen/Qwen3.5-4B`
  - draft: `z-lab/Qwen3.5-4B-DFlash`
  - fallback for tighter-memory Macs

On a `36 GB` Mac, the default recommendation is `qwen35-9b`.

## Pilot result already landed

A first live Apple Silicon run has already been captured in:

- `benchmarks/reports/dflash_m3max_36gb_qwen35_4b_pilot.md`

Pilot command:

```bash
python -m dflash.benchmark --backend mlx \
    --model Qwen/Qwen3.5-4B \
    --draft-model z-lab/Qwen3.5-4B-DFlash \
    --dataset gsm8k \
    --max-samples 1 \
    --enable-thinking \
    --draft-sliding-window-size 4096
```

Pilot outcome on this Mac:

- baseline throughput: `22.35 tok/s`
- DFlash throughput: `46.78 tok/s`
- decoding speedup: `2.09x`

Treat that as a **directional proof**, not a final decision benchmark. The next step is the fuller comparison slice against plain MLX or llama.cpp speculative decoding.

## Known 9B failure on this machine

A follow-up live run with:

- `Qwen/Qwen3.5-9B`
- `z-lab/Qwen3.5-9B-DFlash`

failed on this same M3 Max 36GB Mac with:

```text
[METAL] Command buffer execution failed:
Caused GPU Timeout Error (00000002:kIOGPUCommandBufferCallbackErrorTimeout)
```

That failure is recorded in:

- `benchmarks/reports/dflash_m3max_36gb_qwen35_9b_timeout.md`

So the current guidance is:
- treat `qwen35-9b` as **experimental** on this machine
- treat `qwen35-4b` as the current **known-working local proof path**
- keep the issue open until we either stabilize the 9B path or clearly rule it out for this hardware tier

## Upstream benchmark command

The harness uses the upstream MLX benchmark syntax from `z-lab/dflash`:

```bash
python -m dflash.benchmark --backend mlx \
    --model Qwen/Qwen3.5-9B \
    --draft-model z-lab/Qwen3.5-9B-DFlash \
    --dataset gsm8k \
    --max-samples 128 \
    --enable-thinking \
    --draft-sliding-window-size 4096
```

## What remains

This PR adds the **planner + report template** so the benchmark is reproducible from the repo.
The issue remains open until a real Apple Silicon run lands with:

- measured throughput
- measured memory
- a baseline comparison against plain MLX or llama.cpp speculative decoding
- a recommendation on whether to operationalize DFlash locally