hermes-agent/benchmarks/gemma4-tool-calling-2026-04-13.md

# Tool Call Benchmark: Gemma 4 vs mimo-v2-pro

Date: 2026-04-13
Status: Awaiting execution

## Test Design

The benchmark issues 100 tool calls, split across 7 categories:

| Category | Count | Tools Tested |
|----------|-------|--------------|
| File operations | 20 | read_file, write_file, search_files |
| Terminal commands | 20 | terminal |
| Web search | 15 | web_search |
| Code execution | 15 | execute_code |
| Browser automation | 10 | browser_navigate |
| Delegation | 10 | delegate_task |
| MCP tools | 10 | mcp_* |
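
As a quick sanity check on the split above, the sketch below encodes the category counts and confirms they add up to the stated 100 calls. The dictionary keys and helper names are hypothetical and chosen only for illustration; they are not taken from `tool_call_benchmark.py`.

```python
# Category counts copied from the table above.
# Key names are illustrative assumptions, not identifiers from the benchmark script.
CATEGORY_COUNTS = {
    "file_operations": 20,
    "terminal_commands": 20,
    "web_search": 15,
    "code_execution": 15,
    "browser_automation": 10,
    "delegation": 10,
    "mcp_tools": 10,
}


def total_calls(counts):
    """Return the total number of tool calls across all categories."""
    return sum(counts.values())


def category_shares(counts):
    """Return each category's fraction of the total call count."""
    total = total_calls(counts)
    return {name: n / total for name, n in counts.items()}


if __name__ == "__main__":
    print(total_calls(CATEGORY_COUNTS))
    for name, share in category_shares(CATEGORY_COUNTS).items():
        print(f"{name}: {share:.0%}")
```

File operations and terminal commands together account for 40 of the 100 calls, so results in those two categories will weigh most heavily on any unweighted aggregate.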

## Metrics

| Metric | mimo-v2-pro | Gemma 4 |
|--------|-------------|---------|
| Schema parse success | — | — |
| Tool execution success | — | — |
| Parallel tool success | — | — |
| Avg latency (s) | — | — |
| Token cost per call | — | — |
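
One way the per-model figures in this table could be aggregated from individual call records is sketched below. This is a minimal illustration under stated assumptions: the `CallResult` fields and the `summarize` function are hypothetical, "token cost" is approximated as mean tokens per call, and parallel tool success is left out because this document does not define how it is measured.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class CallResult:
    """One benchmark call. All field names are illustrative assumptions."""
    schema_parsed: bool   # model output parsed against the tool schema
    executed_ok: bool     # the tool ran and returned without error
    latency_s: float      # wall-clock latency in seconds
    tokens: int           # tokens consumed by the call


def summarize(results):
    """Aggregate per-call records into success rates and averages."""
    if not results:
        raise ValueError("no results to summarize")
    n = len(results)
    return {
        "schema_parse_success": sum(r.schema_parsed for r in results) / n,
        "tool_execution_success": sum(r.executed_ok for r in results) / n,
        "avg_latency_s": mean(r.latency_s for r in results),
        "avg_tokens_per_call": mean(r.tokens for r in results),
    }


if __name__ == "__main__":
    # Made-up sample records, only to show the call shape.
    sample = [
        CallResult(True, True, 1.0, 100),
        CallResult(True, False, 3.0, 300),
    ]
    print(summarize(sample))
```

Success rates here are fractions of all calls, so a schema parse failure also counts as an execution failure if the record marks it that way; a real harness may prefer to report execution success only over calls that parsed.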

## How to Run

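```bash
# Run the benchmark against mimo-v2-pro
python3 benchmarks/tool_call_benchmark.py --model nous:xiaomi/mimo-v2-pro
# Run the same benchmark against Gemma 4
python3 benchmarks/tool_call_benchmark.py --model ollama/gemma4:latest
# Compare the results of the two runs
python3 benchmarks/tool_call_benchmark.py --compare
```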

## Gemma 4-Specific Failure Modes

To be documented after benchmark execution.