hermes-agent

Timmy_Foundation/hermes-agent

Commit Graph


Author

SHA1

Message

Date

Hermes Merge Bot

fcc322fb81

Merge PR #867

2026-04-16 02:03:23 -04:00

Timmy

eed87e454e

test: Benchmark Gemma 4 vision accuracy vs current approach (#817)

Contributor Attribution Check / check-attribution (pull_request) Successful in 26s
Docker Build and Publish / build-and-push (pull_request) Has been skipped
Supply Chain Audit / Scan PR for supply chain risks (pull_request) Successful in 26s
Tests / e2e (pull_request) Successful in 2m38s
Tests / test (pull_request) Failing after 47m49s

Vision benchmark suite comparing Gemma 4 (google/gemma-4-27b-it) vs
current Gemini 3 Flash Preview (google/gemini-3-flash-preview).

Metrics:
- OCR accuracy (character + word overlap)
- Description completeness (keyword coverage)
- Structural quality (length, sentences, numbers)
- Latency (ms per image)
- Token usage
- Consistency across runs
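
As an illustration of the first two metrics, the sketch below computes a character overlap, a word overlap, and a keyword coverage score. It is not taken from benchmarks/vision_benchmark.py; the function names and the exact formulas (difflib ratio, set intersection, substring match) are assumptions chosen for clarity.

```python
from difflib import SequenceMatcher


def char_overlap(expected: str, actual: str) -> float:
    """Character-level similarity in [0, 1] (assumed metric: difflib ratio)."""
    return SequenceMatcher(None, expected, actual).ratio()


def word_overlap(expected: str, actual: str) -> float:
    """Fraction of distinct expected words that also appear in the output."""
    expected_words = set(expected.lower().split())
    actual_words = set(actual.lower().split())
    if not expected_words:
        # Nothing to find: perfect only if the model also produced nothing.
        return 1.0 if not actual_words else 0.0
    return len(expected_words & actual_words) / len(expected_words)


def keyword_coverage(description: str, keywords: list[str]) -> float:
    """Fraction of keywords found as case-insensitive substrings."""
    if not keywords:
        return 1.0
    text = description.lower()
    hits = sum(1 for keyword in keywords if keyword.lower() in text)
    return hits / len(keywords)


if __name__ == "__main__":
    print(char_overlap("Total: 42", "Total: 42"))
    print(word_overlap("hello world", "hello there"))
    print(keyword_coverage("A bar chart of sales", ["chart", "sales", "axis"]))
```

Scoring characters and words separately keeps small OCR slips (one wrong glyph) from being penalized as heavily as a missing word.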

Features:
- 24 diverse test images (screenshots, diagrams, photos, charts)
- Category-specific evaluation prompts
- Automated verdict with composite scoring
- JSON + markdown report output
- 28 unit tests passing

Usage:
  python benchmarks/vision_benchmark.py --images benchmarks/test_images.json
  python benchmarks/vision_benchmark.py --url https://example.com/img.png
  python benchmarks/vision_benchmark.py --generate-dataset

Closes #817.

2026-04-15 23:02:02 -04:00

Alexander Whitestone

a244b157be

bench: add Gemma 4 vs mimo-v2-pro tool calling benchmark (#796)

Contributor Attribution Check / check-attribution (pull_request) Successful in 42s
Docker Build and Publish / build-and-push (pull_request) Has been skipped
Supply Chain Audit / Scan PR for supply chain risks (pull_request) Successful in 32s
Tests / e2e (pull_request) Successful in 2m26s
Tests / test (pull_request) Failing after 44m7s

100-call regression test across 7 tool categories:
- File operations (20): read_file, write_file, search_files
- Terminal commands (20): shell execution
- Web search (15): web_search
- Code execution (15): execute_code
- Browser automation (10): browser_navigate
- Delegation (10): delegate_task
- MCP tools (10): mcp_list/read/call
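
The category counts above can be written down as data and checked against the 100-call total. This is only a sketch: the dictionary keys and the helper name are hypothetical and do not come from benchmarks/tool_call_benchmark.py; the counts are the ones listed above.

```python
# Calls per category, as listed in the commit message (key names are hypothetical).
CATEGORY_CALLS = {
    "file_operations": 20,
    "terminal_commands": 20,
    "web_search": 15,
    "code_execution": 15,
    "browser_automation": 10,
    "delegation": 10,
    "mcp_tools": 10,
}


def check_distribution(categories: dict[str, int], expected_total: int = 100) -> int:
    """Return the total call count, raising if it differs from the expected total."""
    total = sum(categories.values())
    if total != expected_total:
        raise ValueError(f"expected {expected_total} calls, got {total}")
    return total


if __name__ == "__main__":
    print(len(CATEGORY_CALLS), "categories,", check_distribution(CATEGORY_CALLS), "calls")
```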

Metrics tracked:
- Schema parse success (valid JSON tool calls)
- Tool name accuracy (correct tool selected)
- Arguments accuracy (required args present)
- Average latency per call
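
A minimal sketch of how the first three metrics might be scored for a single call. The JSON shape assumed here (an object with a "name" string and an "arguments" object) and the function name score_tool_call are assumptions for illustration; the actual benchmark may parse and score calls differently.

```python
import json


def score_tool_call(raw: str, expected_tool: str, required_args: list[str]) -> dict[str, bool]:
    """Score one model response against the expected tool and required arguments."""
    result = {"schema_ok": False, "name_ok": False, "args_ok": False}

    # Schema parse success: the response must be valid JSON with a string tool name.
    try:
        call = json.loads(raw)
    except (json.JSONDecodeError, TypeError):
        return result
    if not isinstance(call, dict) or not isinstance(call.get("name"), str):
        return result
    result["schema_ok"] = True

    # Tool name accuracy: the selected tool must match the expected one.
    result["name_ok"] = call["name"] == expected_tool

    # Arguments accuracy: every required argument must be present.
    args = call.get("arguments")
    if isinstance(args, dict):
        result["args_ok"] = all(key in args for key in required_args)
    return result


if __name__ == "__main__":
    response = '{"name": "read_file", "arguments": {"path": "README.md"}}'
    print(score_tool_call(response, "read_file", ["path"]))
```

Averaging each flag over the 100 calls would give per-metric success rates; latency would be measured separately around the model request.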

Usage:
  python3 benchmarks/tool_call_benchmark.py --model nous:xiaomi/mimo-v2-pro
  python3 benchmarks/tool_call_benchmark.py --model ollama/gemma4:latest
  python3 benchmarks/tool_call_benchmark.py --compare

2026-04-15 18:56:35 -04:00

3 Commits