Timmy_Foundation/hermes-agent

Alexander Whitestone a244b157be

Contributor Attribution Check / check-attribution (pull_request) Successful in 42s

Docker Build and Publish / build-and-push (pull_request) Has been skipped

Supply Chain Audit / Scan PR for supply chain risks (pull_request) Successful in 32s

Tests / e2e (pull_request) Successful in 2m26s

Tests / test (pull_request) Failing after 44m7s

bench: add Gemma 4 vs mimo-v2-pro tool calling benchmark (#796)

100-call regression test across 7 tool categories:
- File operations (20): read_file, write_file, search_files
- Terminal commands (20): terminal (shell execution)
- Web search (15): web_search
- Code execution (15): execute_code
- Browser automation (10): browser_navigate
- Delegation (10): delegate_task
- MCP tools (10): mcp_list/read/call

Metrics tracked:
- Schema parse success (valid JSON tool calls)
- Tool name accuracy (correct tool selected)
- Arguments accuracy (required args present)
- Average latency per call
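
The three correctness metrics above could be scored per call roughly as follows. This is a minimal sketch, not the code in benchmarks/tool_call_benchmark.py: the `Expected` type, the `score_call` and `summarize` helpers, and the assumption that a tool call arrives as a JSON object with `name` and `arguments` keys are all hypothetical.

```python
import json
from dataclasses import dataclass


@dataclass
class Expected:
    """Hypothetical description of the call a test case expects."""
    tool: str
    required_args: tuple


def score_call(raw: str, expected: Expected) -> dict:
    """Score one raw model output on the three correctness metrics.

    Assumes (not confirmed by the source) that a tool call is a JSON
    object shaped like {"name": ..., "arguments": {...}}.
    """
    result = {"schema_ok": False, "name_ok": False, "args_ok": False}
    try:
        call = json.loads(raw)
    except json.JSONDecodeError:
        return result  # not valid JSON: every metric fails
    if not isinstance(call, dict):
        return result  # valid JSON, but not an object
    result["schema_ok"] = True
    result["name_ok"] = call.get("name") == expected.tool
    args = call.get("arguments")
    if isinstance(args, dict):
        # Only presence of required args is checked, not their values.
        result["args_ok"] = all(k in args for k in expected.required_args)
    return result


def summarize(scores: list, latencies: list) -> dict:
    """Aggregate per-call scores into rates plus average latency."""
    if not scores or len(scores) != len(latencies):
        raise ValueError("need one latency per scored call, and at least one call")
    n = len(scores)
    return {
        "schema_parse_success": sum(s["schema_ok"] for s in scores) / n,
        "tool_name_accuracy": sum(s["name_ok"] for s in scores) / n,
        "arguments_accuracy": sum(s["args_ok"] for s in scores) / n,
        "avg_latency_s": sum(latencies) / n,
    }
```

Scoring the three checks independently keeps a wrong tool name from hiding a separate argument problem in the same call.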

Usage:
  python3 benchmarks/tool_call_benchmark.py --model nous:xiaomi/mimo-v2-pro
  python3 benchmarks/tool_call_benchmark.py --model ollama/gemma4:latest
  python3 benchmarks/tool_call_benchmark.py --compare

2026-04-15 18:56:35 -04:00

Tool Call Benchmark: Gemma 4 vs mimo-v2-pro

Date: 2026-04-13
Status: Awaiting execution

Test Design

100 diverse tool calls across 7 categories:

Category	Count	Tools Tested
File operations	20	read_file, write_file, search_files
Terminal commands	20	terminal
Web search	15	web_search
Code execution	15	execute_code
Browser automation	10	browser_navigate
Delegation	10	delegate_task
MCP tools	10	mcp_*
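
A quick sanity check on the distribution above: the per-category counts should add up to the 100 calls the design calls for. The sketch below copies the counts from the table; the `CATEGORY_COUNTS` name and the `check_distribution` helper are hypothetical and not taken from the benchmark script.

```python
# Counts copied from the Test Design table.
CATEGORY_COUNTS = {
    "File operations": 20,
    "Terminal commands": 20,
    "Web search": 15,
    "Code execution": 15,
    "Browser automation": 10,
    "Delegation": 10,
    "MCP tools": 10,
}


def check_distribution(counts: dict, expected_total: int = 100) -> int:
    """Return the total number of calls, or raise if it drifts from the design."""
    total = sum(counts.values())
    if total != expected_total:
        raise ValueError(f"expected {expected_total} calls, found {total}")
    return total


if __name__ == "__main__":
    print(check_distribution(CATEGORY_COUNTS))
```

A guard like this is cheap to run whenever test cases are added or removed, so the "100-call" label stays accurate.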

Metrics

Metric	mimo-v2-pro	Gemma 4
Schema parse success	—	—
Tool execution success	—	—
Parallel tool success	—	—
Avg latency (s)	—	—
Token cost per call	—	—

How to Run

python3 benchmarks/tool_call_benchmark.py --model nous:xiaomi/mimo-v2-pro
python3 benchmarks/tool_call_benchmark.py --model ollama/gemma4:latest
python3 benchmarks/tool_call_benchmark.py --compare

Gemma 4-Specific Failure Modes

To be documented after benchmark execution.
