The DFlash benchmark with --draft-sliding-window-size 4096 on the 9B model
causes a Metal GPU timeout on Apple Silicon (kIOGPUCommandBufferCallbackErrorTimeout).
Root cause: the 9B model's larger compute workload combined with a 4096-size
draft sliding window produces GPU command buffers that exceed the watchdog
timeout. The 4B model does not exhibit this problem.
Mitigation: lower the default draft sliding window for the 9B pair from 4096
to 2048. This avoids the timeout while still providing meaningful speedup.
Changes:
- Add benchmarks/dflash_apple_silicon.py (DFlash benchmark planner)
- 9B pair now uses draft_sliding_window_size=2048
- 4B pair retains draft_sliding_window_size=4096
- Add tests/test_dflash_apple_silicon.py with a #154-specific regression test
- Add docs/DFLASH_APPLE_SILICON.md documenting the mitigation
- Add benchmarks/reports/dflash_m3max_36gb_qwen35_9b_timeout.md recording the failure
Verification: pytest -q tests/test_dflash_apple_silicon.py
The test explicitly asserts that the 9B pair uses window=2048, guarding against a timeout regression.
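For illustration, the per-pair defaults and the regression check could look roughly like the sketch below. The names `DRAFT_SLIDING_WINDOW_SIZE` and `draft_sliding_window_size` and the pair keys are hypothetical; the actual planner in benchmarks/dflash_apple_silicon.py may be structured differently.

```python
# Hypothetical sketch: names and keys are assumptions, not the real planner API.
# Only the two window sizes (4096 for 4B, 2048 for 9B) come from this change.
DRAFT_SLIDING_WINDOW_SIZE = {
    "4b": 4096,  # 4B pair does not hit the Metal watchdog timeout
    "9b": 2048,  # lowered from 4096 to avoid the GPU command buffer timeout (#154)
}


def draft_sliding_window_size(pair: str) -> int:
    """Return the default draft sliding window size for a model pair."""
    return DRAFT_SLIDING_WINDOW_SIZE[pair]


def test_9b_pair_uses_reduced_window() -> None:
    # Regression guard: raising the 9B window back to 4096 should fail here.
    assert draft_sliding_window_size("9b") == 2048
    assert draft_sliding_window_size("4b") == 4096
```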
Closes #154
The test `test_levels_ordered_by_quality` asserted strictly descending
`bits_per_channel`, but `q4_0` (4.0 bits) is a non-TurboQuant fallback
placed last regardless of bit width. The design invariant is:
- TurboQuant levels (turbo4→turbo2): ordered by compression_ratio
ascending (more aggressive = more compression)
- Fallback levels (q4_0): placed after all TurboQuant levels as safe
defaults, not part of the quality progression
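A minimal sketch of how the two halves of this invariant can be checked separately. The `QuantLevel` type, the `is_turboquant` flag, and the ratio values are placeholders chosen for illustration; the real level definitions and the exact comparison (strict vs. non-strict) may differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QuantLevel:
    # Hypothetical shape; field names are assumptions for this sketch.
    name: str
    compression_ratio: float
    is_turboquant: bool


# Placeholder ratios, not measured values.
LEVELS = [
    QuantLevel("turbo4", 2.0, True),
    QuantLevel("turbo2", 4.0, True),
    QuantLevel("q4_0", 2.0, False),  # fallback, outside the quality progression
]


def turboquant_ratios_ascending(levels: list[QuantLevel]) -> bool:
    """Check compression_ratio ordering across TurboQuant levels only."""
    ratios = [lvl.compression_ratio for lvl in levels if lvl.is_turboquant]
    return all(a < b for a, b in zip(ratios, ratios[1:]))


def fallbacks_are_last(levels: list[QuantLevel]) -> bool:
    """Check that no TurboQuant level appears after the first fallback."""
    flags = [lvl.is_turboquant for lvl in levels]
    first_fallback = flags.index(False) if False in flags else len(flags)
    return not any(flags[first_fallback:])
```

Splitting the check this way keeps a fallback's bit width from affecting the ordering assertion, which is the failure mode described above.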
Changes:
- `test_levels_ordered_by_quality`: Now validates compression_ratio
ordering for TurboQuant levels only, not across fallbacks
- `test_fallback_quant_is_last`: New test ensuring non-TurboQuant
fallbacks always appear after TurboQuant levels
Closes #138
Closes #139 (duplicate)