- Add benchmarks/m1_mac_benchmark.py — orchestrates a benchmark of all three presets (k8v4, 4bit_nc, 3bit_nc) on Apple Silicon via llama-server or vllm; measures throughput (tokens/sec), peak memory (RSS), quality on a GSM8K subset (evaluator), and tool-call accuracy.
- Add benchmarks/m1-mac-template.md — results-markdown scaffold filled in by the script; includes hardware detection, a results table, and a recommendation.
- Add tests/test_m1_benchmark.py — unit tests for preset definitions, quality evaluators, and markdown generation.
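For reference, the preset table and the throughput measurement could be sketched roughly as below. The `Preset` dataclass, its fields, and the `generate` callable are illustrative assumptions, not the script's actual API; only the preset names come from the change description.

```python
from dataclasses import dataclass
import time


@dataclass(frozen=True)
class Preset:
    """One quantization preset (names taken from the bullet above)."""
    name: str  # k8v4, 4bit_nc, or 3bit_nc


PRESETS = [Preset("k8v4"), Preset("4bit_nc"), Preset("3bit_nc")]


def tokens_per_sec(generate, prompt: str, n_tokens: int) -> float:
    """Time one generation call and return throughput in tokens/sec.

    `generate` is an assumed callable wrapping the local
    llama-server or vllm endpoint.
    """
    start = time.perf_counter()
    generate(prompt, n_tokens)
    elapsed = time.perf_counter() - start
    return n_tokens / elapsed
```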
Acceptance criteria (#94):
[x] Results table with preset × tokens/sec × peak_memory × GSM8K_score × tool_call_accuracy
[x] Output saved to benchmarks/m1-mac-YYYY-MM-DD.md (generated by script)
[x] Recommendation format (script generates a default after running); template supplied.
The benchmark requires llama-server (or vllm) running locally and a Gemma 4 model. It is not executed in CI; only smoke tests validate importability and logic.
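A minimal importability check of the kind the smoke tests perform might look like this; the helper name is hypothetical, and only stdlib machinery is used:

```python
import importlib.util


def can_import(module_name: str) -> bool:
    """Return True if the top-level module can be located on sys.path
    without actually executing it."""
    return importlib.util.find_spec(module_name) is not None


# In CI, a smoke test would guard on something like:
# assert can_import("benchmarks")
```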
The test `test_levels_ordered_by_quality` asserted strictly descending
`bits_per_channel`, but `q4_0` (4.0 bits) is a non-TurboQuant fallback
placed last regardless of bit width. The design invariant is:
- TurboQuant levels (turbo4→turbo2): ordered by compression_ratio
ascending (more aggressive = more compression)
- Fallback levels (q4_0): placed after all TurboQuant levels as safe
defaults, not part of the quality progression
Changes:
- `test_levels_ordered_by_quality`: Now validates compression_ratio
ordering for TurboQuant levels only, not across fallbacks
- `test_fallback_quant_is_last`: New test ensuring non-TurboQuant
fallbacks always appear after TurboQuant levels
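A sketch of the two tests against a stand-in level table is below. The field names (`compression_ratio`, `is_turbo`) and the ratio values are illustrative assumptions, not the real module's data; only the level names and the invariant come from the description above.

```python
# Stand-in for the real quantization level registry; fields are assumed.
LEVELS = [
    {"name": "turbo4", "compression_ratio": 4.0, "is_turbo": True},
    {"name": "turbo3", "compression_ratio": 5.3, "is_turbo": True},
    {"name": "turbo2", "compression_ratio": 8.0, "is_turbo": True},
    {"name": "q4_0",   "compression_ratio": 4.0, "is_turbo": False},  # fallback
]


def test_levels_ordered_by_quality(levels=LEVELS):
    # compression_ratio must ascend across TurboQuant levels only;
    # fallbacks are excluded from the quality progression.
    ratios = [l["compression_ratio"] for l in levels if l["is_turbo"]]
    assert ratios == sorted(ratios)


def test_fallback_quant_is_last(levels=LEVELS):
    # Once a non-TurboQuant fallback appears, no TurboQuant level may follow.
    flags = [l["is_turbo"] for l in levels]
    assert flags == sorted(flags, reverse=True)
```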
Closes #138. Closes #139 (duplicate).