Add scripts/benchmark_local_models.py, which tests all local Ollama models
against the 50 tok/s UX threshold (configurable via --threshold).
Features:
- Auto-discovers all pulled Ollama models, or tests specific ones (--models)
- Configurable rounds, max tokens, threshold
- Per-round timing with prompt_eval/eval token breakdown
- Human-readable table report with PASS/FAIL/ERROR status
- JSON output mode (--json) for CI integration
- Exit code 1 if any model fails threshold
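For reviewers, the timing and pass/fail logic above can be sketched roughly as follows. This is a minimal illustration, not the code in the script. It assumes the timing fields that Ollama's /api/generate response reports (prompt_eval_count, prompt_eval_duration, eval_count, eval_duration, with durations in nanoseconds). The helper names, averaging across rounds, and treating only FAIL as a non-zero exit are assumptions made for the example.

```python
# Hypothetical helpers; names and aggregation are illustrative assumptions.

def tokens_per_second(count, duration_ns):
    """Convert a token count and a nanosecond duration to tokens/second."""
    if not duration_ns:
        return 0.0  # missing or zero duration: avoid division by zero
    return count / (duration_ns / 1e9)


def summarize_round(resp):
    """Split one response's timing into prompt_eval and eval throughput."""
    return {
        "prompt_tok_s": tokens_per_second(
            resp.get("prompt_eval_count", 0), resp.get("prompt_eval_duration", 0)
        ),
        "eval_tok_s": tokens_per_second(
            resp.get("eval_count", 0), resp.get("eval_duration", 0)
        ),
    }


def status_for(rounds, threshold=50.0):
    """PASS/FAIL on mean generation speed; ERROR if no round completed.

    Averaging over rounds is an assumption; the script may aggregate differently.
    """
    if not rounds:
        return "ERROR"
    mean = sum(r["eval_tok_s"] for r in rounds) / len(rounds)
    return "PASS" if mean >= threshold else "FAIL"


def exit_code(statuses):
    """Return 1 if any model failed the threshold, else 0.

    How ERROR results affect the exit code is not stated here, so this
    sketch only counts FAIL.
    """
    return 1 if any(s == "FAIL" for s in statuses) else 0
```

With these helpers, a round that generates 128 tokens in 2 seconds works out to 64 tok/s, which passes the default threshold of 50.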
Usage:
python3 scripts/benchmark_local_models.py # all models, 3 rounds
python3 scripts/benchmark_local_models.py --models qwen2.5:7b # single model
python3 scripts/benchmark_local_models.py --json # CI output
python3 scripts/benchmark_local_models.py --threshold 30 # custom threshold
Tested: gemma3:1b scores 141.8 tok/s (PASS).
Closes #287