eed87e454e
test: Benchmark Gemma 4 vision accuracy vs current approach (#817)
Vision benchmark suite comparing Gemma 4 (google/gemma-4-27b-it) vs
current Gemini 3 Flash Preview (google/gemini-3-flash-preview).
Metrics:
- OCR accuracy (character + word overlap)
- Description completeness (keyword coverage)
- Structural quality (length, sentences, numbers)
- Latency (ms per image)
- Token usage
- Consistency across runs
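The OCR overlap metrics listed above could be computed along these lines. This is an illustrative sketch, not the benchmark's actual implementation; the function names and the choice of `difflib.SequenceMatcher` for character similarity are assumptions:

```python
from difflib import SequenceMatcher

def char_overlap(expected: str, actual: str) -> float:
    """Character-level similarity via longest matching blocks (0.0-1.0)."""
    return SequenceMatcher(None, expected, actual).ratio()

def word_overlap(expected: str, actual: str) -> float:
    """Fraction of expected words that appear in the model output."""
    expected_words = set(expected.lower().split())
    actual_words = set(actual.lower().split())
    if not expected_words:
        return 1.0
    return len(expected_words & actual_words) / len(expected_words)
```

Word overlap is order-insensitive by design, so it tolerates OCR output that reorders text blocks; character overlap penalizes that, which is why both are reported.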
Features:
- 24 diverse test images (screenshots, diagrams, photos, charts)
- Category-specific evaluation prompts
- Automated verdict with composite scoring
- JSON + markdown report output
- 28 unit tests passing
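The automated verdict with composite scoring could work roughly as follows. A hedged sketch only: the weights, the noise margin, and the helper names are placeholders, not the values or API used in the benchmark:

```python
def composite_score(scores: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted mean of normalized per-metric scores (each in 0.0-1.0)."""
    total_weight = sum(weights.get(k, 0.0) for k in scores)
    if total_weight == 0:
        return 0.0
    return sum(scores[k] * weights.get(k, 0.0) for k in scores) / total_weight

def verdict(score_a: float, score_b: float, margin: float = 0.05) -> str:
    """Declare a winner only when the gap exceeds a noise margin."""
    if abs(score_a - score_b) < margin:
        return "tie"
    return "model_a" if score_a > score_b else "model_b"
```

The margin guards against declaring a winner from run-to-run noise, which matters given the consistency-across-runs metric above.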
Usage:
python benchmarks/vision_benchmark.py --images benchmarks/test_images.json
python benchmarks/vision_benchmark.py --url https://example.com/img.png
python benchmarks/vision_benchmark.py --generate-dataset
Closes #817.
2026-04-15 23:02:02 -04:00 |