hermes-agent/benchmarks/test_images/blue_square.png
Alexander Whitestone fa81831cd2
Some checks failed
Docker Build and Publish / build-and-push (pull_request) Has been skipped
Contributor Attribution Check / check-attribution (pull_request) Failing after 35s
Supply Chain Audit / Scan PR for supply chain risks (pull_request) Successful in 34s
Tests / e2e (pull_request) Successful in 2m45s
Tests / test (pull_request) Failing after 17m0s
fix: local test images for reliable vision benchmark (#868)
The vision benchmark fetched its images from external URLs, which can
become unavailable and cause flaky CI runs.

New benchmarks/test_images.json:
- 5 test images with local paths, descriptions, expected answers
- Categories: shape_color, ocr, counting
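
As a rough illustration of the manifest described above, the sketch below shows one possible entry shape and a presence check. The field names (`id`, `path`, `category`, `description`, `expected`) and the helpers `load_manifest` and `missing_images` are assumptions made for this example, not the actual schema or API of test_images.json or vision_benchmark.py.

```python
import json
from pathlib import Path

# Hypothetical manifest entry; the real test_images.json schema may differ.
EXAMPLE_MANIFEST = [
    {
        "id": "blue_square",
        "path": "test_images/blue_square.png",
        "category": "shape_color",
        "description": "A blue square on a plain background",
        "expected": ["blue", "square"],
    }
]


def load_manifest(manifest_path):
    """Read a JSON manifest file and return its list of entries."""
    with open(manifest_path, encoding="utf-8") as fh:
        return json.load(fh)


def missing_images(entries, base_dir):
    """Return the relative paths of entries whose image file is absent."""
    base = Path(base_dir)
    return [e["path"] for e in entries if not (base / e["path"]).is_file()]
```

Returning the list of missing paths, rather than a bare boolean, lets a CI log name exactly which image is absent.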

New benchmarks/test_images/:
- 5 generated PNG test images (red_circle, blue_square,
  green_triangle, text_hello, mixed_shapes)
- Deterministic, always available, ~1-3KB each
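
The commit does not say how the images were generated. One way to get small, deterministic PNGs without third-party dependencies is to write the file format by hand with the standard library, as sketched below. The function names, colors, and margin are hypothetical choices for this example and are not taken from the repository.

```python
import struct
import zlib

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def png_chunk(tag, data):
    """Build one PNG chunk: length, tag, data, then CRC-32 over tag + data."""
    crc = zlib.crc32(tag + data) & 0xFFFFFFFF
    return struct.pack(">I", len(data)) + tag + data + struct.pack(">I", crc)


def solid_square_png(size=256, square_rgb=(0, 0, 255),
                     bg_rgb=(255, 255, 255), margin=64):
    """Return PNG bytes for a centered square (assumed layout, 8-bit RGB)."""
    rows = bytearray()
    for y in range(size):
        rows.append(0)  # filter type 0 (None) at the start of each scanline
        for x in range(size):
            inside = (margin <= x < size - margin
                      and margin <= y < size - margin)
            rows.extend(square_rgb if inside else bg_rgb)
    # IHDR: width, height, bit depth 8, color type 2 (RGB),
    # compression 0, filter 0, no interlace.
    ihdr = struct.pack(">IIBBBBB", size, size, 8, 2, 0, 0, 0)
    idat = zlib.compress(bytes(rows), 9)
    return (PNG_SIGNATURE + png_chunk(b"IHDR", ihdr)
            + png_chunk(b"IDAT", idat) + png_chunk(b"IEND", b""))
```

Because the pixel data is computed rather than drawn with random or font-dependent operations, repeated calls produce identical bytes for a given zlib build, which is the property that keeps a checked-in fixture stable.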

New benchmarks/vision_benchmark.py:
- load_test_dataset(): loads test_images.json
- verify_images_exist(): checks all images present
- run_vision_test(): single test with base64 image encoding
- evaluate_response(): checks expected keywords in response
- run_benchmark(): full benchmark suite
- format_report(): human-readable results
- --model, --base-url, --json flags
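
The two core steps listed above, base64 image encoding and keyword-based evaluation, might look roughly like the sketch below. It is not the code in vision_benchmark.py: the helper names, the data-URL form, and the case-insensitive substring matching rule are all assumptions for illustration.

```python
import base64


def encode_image_data_url(image_bytes, mime="image/png"):
    """Encode raw image bytes as a base64 data URL (assumed transport form)."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return f"data:{mime};base64,{encoded}"


def keyword_score(response, expected_keywords):
    """Check which expected keywords appear in the response.

    Matching is a case-insensitive substring test, an assumption made
    here; the real evaluate_response() may use a different rule.
    """
    text = response.lower()
    matched = [kw for kw in expected_keywords if kw.lower() in text]
    missing = [kw for kw in expected_keywords if kw not in matched]
    return {"passed": not missing, "matched": matched, "missing": missing}
```

Substring matching is simple and tolerant of phrasing differences between models, but it can also pass a response that mentions a keyword while negating it, so results are best read as a coarse signal.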

Closes #868
2026-04-15 23:36:58 -04:00

779 B
256x256px