Some checks failed
Contributor Attribution Check / check-attribution (pull_request) Successful in 42s
Docker Build and Publish / build-and-push (pull_request) Has been skipped
Supply Chain Audit / Scan PR for supply chain risks (pull_request) Successful in 32s
Tests / e2e (pull_request) Successful in 2m26s
Tests / test (pull_request) Failing after 44m7s
100-call regression test across 7 tool categories:
- File operations (20): read_file, write_file, search_files
- Terminal commands (20): shell execution
- Web search (15): web_search
- Code execution (15): execute_code
- Browser automation (10): browser_navigate
- Delegation (10): delegate_task
- MCP tools (10): mcp_list/read/call

Metrics tracked:
- Schema parse success (valid JSON tool calls)
- Tool name accuracy (correct tool selected)
- Arguments accuracy (required args present)
- Average latency per call

Usage:
python3 benchmarks/tool_call_benchmark.py --model nous:xiaomi/mimo-v2-pro
python3 benchmarks/tool_call_benchmark.py --model ollama/gemma4:latest
python3 benchmarks/tool_call_benchmark.py --compare
1.1 KiB
Tool Call Benchmark: Gemma 4 vs mimo-v2-pro
Date: 2026-04-13
Status: Awaiting execution
Test Design
100 diverse tool calls across 7 categories:
| Category | Count | Tools Tested |
|---|---|---|
| File operations | 20 | read_file, write_file, search_files |
| Terminal commands | 20 | terminal |
| Web search | 15 | web_search |
| Code execution | 15 | execute_code |
| Browser automation | 10 | browser_navigate |
| Delegation | 10 | delegate_task |
| MCP tools | 10 | mcp_* |
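As a sanity check on the design, the category split above can be written down as data and summed. This is only a sketch: the `Category` dataclass, the `CATEGORIES` tuple, and `total_calls` are hypothetical names, not taken from `benchmarks/tool_call_benchmark.py`. Only the counts and tool names come from the table.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Category:
    """One row of the test design table (hypothetical structure)."""
    name: str
    count: int
    tools: Tuple[str, ...]


# Counts and tool names copied from the table; the layout is an assumption.
CATEGORIES = (
    Category("File operations", 20, ("read_file", "write_file", "search_files")),
    Category("Terminal commands", 20, ("terminal",)),
    Category("Web search", 15, ("web_search",)),
    Category("Code execution", 15, ("execute_code",)),
    Category("Browser automation", 10, ("browser_navigate",)),
    Category("Delegation", 10, ("delegate_task",)),
    Category("MCP tools", 10, ("mcp_*",)),
)


def total_calls(categories=CATEGORIES) -> int:
    """Sum the per-category call counts."""
    return sum(c.count for c in categories)
```

Keeping the split in one place like this would let the script assert that the suite still totals 100 calls whenever a category is resized.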
Metrics
| Metric | mimo-v2-pro | Gemma 4 |
|---|---|---|
| Schema parse success | — | — |
| Tool execution success | — | — |
| Parallel tool success | — | — |
| Avg latency (s) | — | — |
| Token cost per call | — | — |
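The schema parse and latency rows above, together with the tool name and argument checks listed in the script description, could be computed roughly as follows. This is a minimal sketch under stated assumptions: it assumes each model response is a JSON object shaped like `{"name": ..., "arguments": {...}}`, which the source does not confirm, and `score_call` and `summarize` are hypothetical helpers rather than functions from the benchmark script.

```python
import json
from statistics import mean


def score_call(raw_output, expected_tool, required_args):
    """Score one model response against the expected tool call.

    Assumes a {"name": ..., "arguments": {...}} payload (hypothetical shape).
    """
    result = {"parsed": False, "tool_ok": False, "args_ok": False}
    try:
        call = json.loads(raw_output)
    except json.JSONDecodeError:
        return result  # not valid JSON: schema parse failure
    if not isinstance(call, dict):
        return result  # valid JSON, but not a tool-call object
    result["parsed"] = True
    result["tool_ok"] = call.get("name") == expected_tool
    args = call.get("arguments", {})
    result["args_ok"] = isinstance(args, dict) and all(
        key in args for key in required_args
    )
    return result


def summarize(scores, latencies):
    """Aggregate per-call scores into rates and an average latency in seconds."""
    if not scores:
        raise ValueError("no scored calls to summarize")
    n = len(scores)
    return {
        "schema_parse_success": sum(s["parsed"] for s in scores) / n,
        "tool_name_accuracy": sum(s["tool_ok"] for s in scores) / n,
        "arguments_accuracy": sum(s["args_ok"] for s in scores) / n,
        "avg_latency_s": mean(latencies),
    }
```

Tool execution success, parallel tool success, and token cost are not covered here, since they depend on how the harness actually runs tools and meters tokens.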
How to Run
python3 benchmarks/tool_call_benchmark.py --model nous:xiaomi/mimo-v2-pro
python3 benchmarks/tool_call_benchmark.py --model ollama/gemma4:latest
python3 benchmarks/tool_call_benchmark.py --compare
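For readers who want to see how the two invocation styles might be wired up, here is a hedged `argparse` sketch. The flag names `--model` and `--compare` come from the commands above. Treating them as mutually exclusive, and treating `--compare` as a switch that reads previously saved results, are assumptions. `build_parser` is a hypothetical name.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a CLI parser matching the commands shown above (assumed layout)."""
    parser = argparse.ArgumentParser(
        description="Tool call benchmark (hypothetical CLI sketch)"
    )
    # Assumption: a run either benchmarks one model or compares earlier runs.
    mode = parser.add_mutually_exclusive_group(required=True)
    mode.add_argument(
        "--model",
        help="model identifier, e.g. ollama/gemma4:latest",
    )
    mode.add_argument(
        "--compare",
        action="store_true",
        help="compare results from earlier runs",
    )
    return parser
```

With this layout, `build_parser().parse_args(["--compare"])` yields `compare=True` and `model=None`, so the script can branch on which mode was requested.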
Gemma 4-Specific Failure Modes
To be documented after benchmark execution.