# DFlash Apple Silicon Benchmark Report

## Machine

- Label: M3 Max 36GB
- Selected pair: qwen35-9b
- Base model: Qwen/Qwen3.5-9B
- Draft model: z-lab/Qwen3.5-9B-DFlash
- Estimated total weight footprint: 19.93 GB

## Setup

```bash
python3 -m venv .venv-dflash
source .venv-dflash/bin/activate
git clone https://github.com/z-lab/dflash.git
cd dflash
pip install -e '.[mlx]'   # quoted so zsh does not try to glob the [mlx] extra
python -m dflash.benchmark --backend mlx \
    --model Qwen/Qwen3.5-9B \
    --draft-model z-lab/Qwen3.5-9B-DFlash \
    --dataset gsm8k \
    --max-samples 128 \
    --enable-thinking \
    --draft-sliding-window-size 4096
```
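
To fill in the peak-memory row later, one option is to wrap the benchmark in macOS's `/usr/bin/time -l`, which prints rusage including "maximum resident set size" (in bytes; divide by 2^30 for GB). This is a sketch with a reduced `--max-samples` for a quick check, and note that Metal/unified-memory allocations may not all be reflected in RSS:

```shell
# macOS only: -l reports rusage, including peak resident set size in bytes.
# Smaller sample count here just to get a quick memory reading.
/usr/bin/time -l python -m dflash.benchmark --backend mlx \
    --model Qwen/Qwen3.5-9B \
    --draft-model z-lab/Qwen3.5-9B-DFlash \
    --dataset gsm8k \
    --max-samples 8
```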

## Baseline comparison

Compare against **plain MLX or llama.cpp speculative decoding** on the same prompt set.
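
For the llama.cpp side, a run along these lines keeps prompt and generation length comparable. This is a sketch only: the GGUF paths are placeholders (it assumes GGUF conversions of both the base and draft models exist), and flag names vary between llama.cpp releases, so check `llama-speculative --help` in your build:

```shell
# Speculative-decoding baseline with llama.cpp.
# Model paths are placeholders; verify flag names against your build.
./llama-speculative \
    -m models/qwen3.5-9b-q8_0.gguf \
    -md models/qwen3.5-9b-draft-q8_0.gguf \
    -p "$(cat prompt.txt)" \
    -n 512 \
    --draft 8 \
    -ngl 99
```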

## Results

- Throughput (tok/s):
- Peak memory (GB):
- Notes on acceptance / behavior:
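
When filling in the throughput row, tok/s is just generated tokens divided by wall-clock generation time; an awk one-liner does the division (the 4096 tokens / 85 s here are illustrative numbers, not measurements):

```shell
# Convert a run's token count and wall time into tok/s.
# TOKENS and ELAPSED below are made-up example values.
TOKENS=4096
ELAPSED=85
awk -v t="$TOKENS" -v s="$ELAPSED" 'BEGIN { printf "%.1f tok/s\n", t / s }'
# → 48.2 tok/s
```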

## Verdict

Worth operationalizing locally?

- [ ] Yes
- [ ] No
- [ ] Needs more data

## Recommendation

Explain whether this should become part of the local inference stack.