# TensorZero Evaluation Packet

Issue #860: tensorzero LLMOps platform evaluation

## Scope

This packet evaluates TensorZero as a possible replacement for Hermes' custom provider-routing stack. It is intentionally grounded in the current repo state rather than in a speculative cutover plan.

## Issue requirements being evaluated

  • Deploy tensorzero gateway (Rust binary)
  • Migrate provider routing config
  • Test with canary (10% traffic) before full cutover
  • Feed session data for prompt optimization
  • Evaluation suite for A/B testing models

## Recommendation

Not ready for direct replacement. Recommend a shadow-evaluation phase first: keep Hermes routing live, inventory the migration seams, export SessionDB/trajectory data into an offline TensorZero experiment loop, and only design a canary gateway once percentage-based rollout controls exist.

## Requirement matrix

| Requirement | Status | Evidence labels | Summary |
| --- | --- | --- | --- |
| Gateway replacement scope | partial | `fallback_chain`, `runtime_provider`, `gateway_provider_routing`, `cron_runtime_provider`, `auxiliary_fallback_chain`, `delegate_runtime_provider` | Hermes already spreads provider routing across core-agent, runtime-provider, gateway, cron, auxiliary, and delegation seams; TensorZero would need parity across all of them before it could replace the gateway layer. |
| Config migration | partial | `provider_routing_config`, `runtime_provider`, `smart_model_routing`, `fallback_chain` | Hermes has multiple config concepts to migrate (`provider_routing`, `fallback_providers`, `smart_model_routing`, runtime provider resolution), so TensorZero is not a drop-in config swap. A normalization sketch follows the matrix. |
| 10% traffic canary | gap | (none) | The repo shows semantic routing and fallback, but no grounded 10% traffic-split canary mechanism. A TensorZero cutover would need new percentage-based rollout controls and observability hooks; a splitter sketch follows the matrix. |
| Session data for prompt optimization | partial | `session_db`, `trajectory_export` | Hermes already has SessionDB and trajectory-export surfaces that can feed offline optimization data, but no TensorZero-native ingestion path yet (see the exporter sketch under "Suggested next slice"). |
| Evaluation suite / A/B testing | partial | `benchmark_suite`, `trajectory_export` | Hermes already has benchmark/trajectory machinery that can seed TensorZero A/B evaluation, but no integrated TensorZero experiment runner or live evaluation gateway. |
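
To make the config-migration row concrete, here is a minimal sketch of the normalization Hermes performs across its two fallback formats. It mirrors the guards quoted in the `fallback_chain` touchpoints below (run_agent.py:997-1003) but is a simplified stand-in rather than the actual implementation; the provider and model values in the usage example are illustrative only.

```python
# Sketch of the Hermes-side fallback config shapes a migration would have
# to map. Field names follow the touchpoints below; the TensorZero-side
# target schema is deliberately omitted and would need to be confirmed
# against TensorZero's own config documentation.
from typing import Any, Dict, List


def normalize_fallback_chain(config: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Collapse legacy single-dict `fallback_model` and the newer
    `fallback_providers` list into one ordered provider chain."""
    raw = config.get("fallback_providers") or config.get("fallback_model") or []
    if isinstance(raw, dict):
        raw = [raw]
    # Keep only entries naming both a provider and a model, matching the
    # guard in run_agent.py.
    return [f for f in raw if f.get("provider") and f.get("model")]


# Both formats normalize to the same chain (values illustrative).
legacy = {"fallback_model": {"provider": "openrouter", "model": "gemma-4"}}
modern = {"fallback_providers": [{"provider": "openrouter", "model": "gemma-4"}]}
assert normalize_fallback_chain(legacy) == normalize_fallback_chain(modern)
```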
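
For the canary gap, a minimal sketch of a deterministic percentage-based splitter, assuming sessions are the bucketing unit. Nothing like this exists in the repo today; the `session_id` key and the 10% default are taken from the issue requirement, not from Hermes code.

```python
# Hypothetical percentage-based rollout control; the matrix flags this as
# missing. Hashing keeps a session pinned to one side of the split for its
# whole lifetime, which a gateway cutover comparison needs.
import hashlib


def in_canary(session_id: str, percent: int = 10) -> bool:
    """Deterministically route ~percent% of sessions to the canary gateway."""
    digest = hashlib.sha256(session_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:2], "big") % 100
    return bucket < percent


# e.g. gateway = "tensorzero-canary" if in_canary(sid) else "hermes"
```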

## Grounded Hermes touchpoints

  • run_agent.py:601 — [fallback_chain] fallback_model: Dict[str, Any] = None,
  • run_agent.py:995 — [fallback_chain] # failure). Supports both legacy single-dict fallback_model and
  • run_agent.py:996 — [fallback_chain] # new list fallback_providers format.
  • run_agent.py:997 — [fallback_chain] if isinstance(fallback_model, list):
  • run_agent.py:998 — [fallback_chain] self._fallback_chain = [
  • run_agent.py:999 — [fallback_chain] f for f in fallback_model
  • run_agent.py:1002 — [fallback_chain] elif isinstance(fallback_model, dict) and fallback_model.get("provider") and fallback_model.get("model"):
  • run_agent.py:1003 — [fallback_chain] self._fallback_chain = [fallback_model]
  • run_agent.py:1005 — [fallback_chain] self._fallback_chain = []
  • run_agent.py:1009 — [fallback_chain] self._fallback_model = self._fallback_chain[0] if self._fallback_chain else None
  • run_agent.py:1010 — [fallback_chain] if self._fallback_chain and not self.quiet_mode:
  • run_agent.py:1011 — [fallback_chain] if len(self._fallback_chain) == 1:
  • run_agent.py:1012 — [fallback_chain] fb = self._fallback_chain[0]
  • run_agent.py:1015 — [fallback_chain] print(f"🔄 Fallback chain ({len(self._fallback_chain)} providers): " +
  • run_agent.py:1016 — [fallback_chain] " → ".join(f"{f['model']} ({f['provider']})" for f in self._fallback_chain))
  • run_agent.py:5624 — [fallback_chain] if self._fallback_index >= len(self._fallback_chain):
  • run_agent.py:5627 — [fallback_chain] fb = self._fallback_chain[self._fallback_index]
  • run_agent.py:8559 — [fallback_chain] if self._fallback_index < len(self._fallback_chain):
  • run_agent.py:9355 — [fallback_chain] if is_rate_limited and self._fallback_index < len(self._fallback_chain):
  • run_agent.py:10460 — [fallback_chain] if _truly_empty and self._fallback_chain:
  • run_agent.py:10514 — [fallback_chain] + (" and fallback attempts." if self._fallback_chain else
  • cli.py:241 — [provider_routing_config] "smart_model_routing": {
  • cli.py:370 — [provider_routing_config] # (e.g. platform_toolsets, provider_routing, memory, honcho, etc.)
  • cli.py:1753 — [provider_routing_config] pr = CLI_CONFIG.get("provider_routing", {}) or {}
  • cli.py:1762 — [provider_routing_config] # Supports new list format (fallback_providers) and legacy single-dict (fallback_model).
  • cli.py:1763 — [provider_routing_config] fb = CLI_CONFIG.get("fallback_providers") or CLI_CONFIG.get("fallback_model") or []
  • cli.py:1770 — [provider_routing_config] self._smart_model_routing = CLI_CONFIG.get("smart_model_routing", {}) or {}
  • cli.py:2771 — [provider_routing_config] from agent.smart_model_routing import resolve_turn_route
  • cli.py:2776 — [provider_routing_config] self._smart_model_routing,
  • hermes_cli/runtime_provider.py:209 — [runtime_provider] def resolve_requested_provider(requested: Optional[str] = None) -> str:
  • hermes_cli/runtime_provider.py:649 — [runtime_provider] def resolve_runtime_provider(
  • agent/smart_model_routing.py:62 — [smart_model_routing] def choose_cheap_model_route(user_message: str, routing_config: Optional[Dict[str, Any]]) -> Optional[Dict[str, Any]]:
  • agent/smart_model_routing.py:110 — [smart_model_routing] def resolve_turn_route(user_message: str, routing_config: Optional[Dict[str, Any]], primary: Dict[str, Any]) -> Dict[str, Any]:
  • gateway/run.py:1271 — [gateway_provider_routing] def _load_provider_routing() -> dict:
  • gateway/run.py:1285 — [gateway_provider_routing] def _load_fallback_model() -> list | dict | None:
  • gateway/run.py:1306 — [gateway_provider_routing] def _load_smart_model_routing() -> dict:
  • cron/scheduler.py:684 — [cron_runtime_provider] pr = _cfg.get("provider_routing", {})
  • cron/scheduler.py:688 — [cron_runtime_provider] resolve_runtime_provider,
  • cron/scheduler.py:697 — [cron_runtime_provider] runtime = resolve_runtime_provider(**runtime_kwargs)
  • cron/scheduler.py:702 — [cron_runtime_provider] from agent.smart_model_routing import resolve_turn_route
  • cron/scheduler.py:703 — [cron_runtime_provider] turn_route = resolve_turn_route(
  • cron/scheduler.py:717 — [cron_runtime_provider] fallback_model = _cfg.get("fallback_providers") or _cfg.get("fallback_model") or None
  • cron/scheduler.py:746 — [cron_runtime_provider] fallback_model=fallback_model,
  • agent/auxiliary_client.py:1018 — [auxiliary_fallback_chain] def _get_provider_chain() -> List[tuple]:
  • agent/auxiliary_client.py:1107 — [auxiliary_fallback_chain] for label, try_fn in _get_provider_chain():
  • agent/auxiliary_client.py:1189 — [auxiliary_fallback_chain] # ── Step 2: aggregator / fallback chain ──────────────────────────────
  • agent/auxiliary_client.py:1191 — [auxiliary_fallback_chain] for label, try_fn in _get_provider_chain():
  • agent/auxiliary_client.py:2397 — [auxiliary_fallback_chain] # error, fall through to the fallback chain below.
  • agent/auxiliary_client.py:2417 — [auxiliary_fallback_chain] # auto (the default) = best-effort fallback chain. (#7559)
  • agent/auxiliary_client.py:2589 — [auxiliary_fallback_chain] # error, fall through to the fallback chain below.
  • tools/delegate_tool.py:662 — [delegate_runtime_provider] # bundle (base_url, api_key, api_mode) via the same runtime provider system
  • tools/delegate_tool.py:854 — [delegate_runtime_provider] provider) is resolved via the runtime provider system — the same path used
  • tools/delegate_tool.py:909 — [delegate_runtime_provider] from hermes_cli.runtime_provider import resolve_runtime_provider
  • tools/delegate_tool.py:910 — [delegate_runtime_provider] runtime = resolve_runtime_provider(requested=configured_provider)
  • hermes_state.py:115 — [session_db] class SessionDB:
  • batch_runner.py:320 — [trajectory_export] save_trajectories=False, # We handle saving ourselves
  • batch_runner.py:346 — [trajectory_export] trajectory = agent._convert_to_trajectory_format(
  • batch_runner.py:460 — [trajectory_export] trajectory_entry = {
  • batch_runner.py:474 — [trajectory_export] f.write(json.dumps(trajectory_entry, ensure_ascii=False) + "\n")
  • benchmarks/tool_call_benchmark.py:3 — [benchmark_suite] Tool-Calling Benchmark — Gemma 4 vs mimo-v2-pro regression test.
  • benchmarks/tool_call_benchmark.py:9 — [benchmark_suite] python3 benchmarks/tool_call_benchmark.py # full 100-call suite
  • benchmarks/tool_call_benchmark.py:10 — [benchmark_suite] python3 benchmarks/tool_call_benchmark.py --limit 10 # quick smoke test
  • benchmarks/tool_call_benchmark.py:11 — [benchmark_suite] python3 benchmarks/tool_call_benchmark.py --models nous # single model
  • benchmarks/tool_call_benchmark.py:12 — [benchmark_suite] python3 benchmarks/tool_call_benchmark.py --category file # single category
  • benchmarks/tool_call_benchmark.py:37 — [benchmark_suite] class ToolCall:
  • benchmarks/tool_call_benchmark.py:51 — [benchmark_suite] ToolCall("file-01", "file", "Read the file /tmp/test_bench.txt and show me its contents.",
  • benchmarks/tool_call_benchmark.py:53 — [benchmark_suite] ToolCall("file-02", "file", "Write 'hello benchmark' to /tmp/test_bench_out.txt",
  • benchmarks/tool_call_benchmark.py:55 — [benchmark_suite] ToolCall("file-03", "file", "Search for the word 'import' in all Python files in the current directory.",
  • benchmarks/tool_call_benchmark.py:57 — [benchmark_suite] ToolCall("file-04", "file", "Read lines 1-20 of /etc/hosts",
  • benchmarks/tool_call_benchmark.py:59 — [benchmark_suite] ToolCall("file-05", "file", "Patch /tmp/test_bench_out.txt: replace 'hello' with 'goodbye'",
  • benchmarks/tool_call_benchmark.py:61 — [benchmark_suite] ToolCall("file-06", "file", "Search for files matching *.py in the current directory.",
  • benchmarks/tool_call_benchmark.py:63 — [benchmark_suite] ToolCall("file-07", "file", "Read the first 10 lines of /etc/passwd",
  • benchmarks/tool_call_benchmark.py:65 — [benchmark_suite] ToolCall("file-08", "file", "Write a JSON config to /tmp/bench_config.json with key 'debug': true",
  • benchmarks/tool_call_benchmark.py:67 — [benchmark_suite] ToolCall("file-09", "file", "Search for 'def test_' in Python test files.",
  • benchmarks/tool_call_benchmark.py:69 — [benchmark_suite] ToolCall("file-10", "file", "Read /tmp/bench_config.json and tell me what's in it.",
  • benchmarks/tool_call_benchmark.py:71 — [benchmark_suite] ToolCall("file-11", "file", "Create a file /tmp/bench_readme.md with one line: '# Benchmark'",
  • benchmarks/tool_call_benchmark.py:73 — [benchmark_suite] ToolCall("file-12", "file", "Search for 'TODO' comments in all .py files.",
  • benchmarks/tool_call_benchmark.py:75 — [benchmark_suite] ToolCall("file-13", "file", "Read /tmp/bench_readme.md",
  • benchmarks/tool_call_benchmark.py:77 — [benchmark_suite] ToolCall("file-14", "file", "Patch /tmp/bench_readme.md: replace '# Benchmark' with '# Tool Benchmark'",
  • benchmarks/tool_call_benchmark.py:78 — [benchmark_suite] "patch", "Tool Benchmark"),
  • benchmarks/tool_call_benchmark.py:79 — [benchmark_suite] ToolCall("file-15", "file", "Write a Python one-liner to /tmp/bench_hello.py that prints hello.",
  • benchmarks/tool_call_benchmark.py:81 — [benchmark_suite] ToolCall("file-16", "file", "Search for all .json files in /tmp/.",
  • benchmarks/tool_call_benchmark.py:83 — [benchmark_suite] ToolCall("file-17", "file", "Read /tmp/bench_hello.py and verify it has print('hello').",
  • benchmarks/tool_call_benchmark.py:85 — [benchmark_suite] ToolCall("file-18", "file", "Patch /tmp/bench_hello.py to print 'hello world' instead of 'hello'.",
  • benchmarks/tool_call_benchmark.py:87 — [benchmark_suite] ToolCall("file-19", "file", "List files matching 'bench*' in /tmp/.",
  • benchmarks/tool_call_benchmark.py:89 — [benchmark_suite] ToolCall("file-20", "file", "Read /tmp/test_bench.txt again and summarize its contents.",
  • benchmarks/tool_call_benchmark.py:93 — [benchmark_suite] ToolCall("term-01", "terminal", "Run echo hello world in the terminal.",
  • benchmarks/tool_call_benchmark.py:95 — [benchmark_suite] ToolCall("term-02", "terminal", "Run date to get the current date and time.",
  • benchmarks/tool_call_benchmark.py:97 — [benchmark_suite] ToolCall("term-03", "terminal", "Run uname -a to get system information.",
  • benchmarks/tool_call_benchmark.py:99 — [benchmark_suite] ToolCall("term-04", "terminal", "Run pwd to show the current directory.",
  • benchmarks/tool_call_benchmark.py:101 — [benchmark_suite] ToolCall("term-05", "terminal", "Run ls -la /tmp/ | head -20 to list temp files.",
  • benchmarks/tool_call_benchmark.py:103 — [benchmark_suite] ToolCall("term-06", "terminal", "Run whoami to show the current user.",
  • benchmarks/tool_call_benchmark.py:105 — [benchmark_suite] ToolCall("term-07", "terminal", "Run df -h to show disk usage.",
  • benchmarks/tool_call_benchmark.py:107 — [benchmark_suite] ToolCall("term-08", "terminal", "Run python3 --version to check Python version.",
  • benchmarks/tool_call_benchmark.py:109 — [benchmark_suite] ToolCall("term-09", "terminal", "Run cat /etc/hostname to get the hostname.",
  • benchmarks/tool_call_benchmark.py:111 — [benchmark_suite] ToolCall("term-10", "terminal", "Run uptime to see system uptime.",
  • benchmarks/tool_call_benchmark.py:113 — [benchmark_suite] ToolCall("term-11", "terminal", "Run env | grep PATH to show the PATH variable.",
  • benchmarks/tool_call_benchmark.py:115 — [benchmark_suite] ToolCall("term-12", "terminal", "Run wc -l /etc/passwd to count lines.",
  • benchmarks/tool_call_benchmark.py:117 — [benchmark_suite] ToolCall("term-13", "terminal", "Run echo $SHELL to show the current shell.",
  • benchmarks/tool_call_benchmark.py:119 — [benchmark_suite] ToolCall("term-14", "terminal", "Run free -h || vm_stat to check memory usage.",
  • benchmarks/tool_call_benchmark.py:121 — [benchmark_suite] ToolCall("term-15", "terminal", "Run id to show user and group IDs.",
  • benchmarks/tool_call_benchmark.py:123 — [benchmark_suite] ToolCall("term-16", "terminal", "Run hostname to get the machine hostname.",
  • benchmarks/tool_call_benchmark.py:125 — [benchmark_suite] ToolCall("term-17", "terminal", "Run echo {1..5} to test brace expansion.",
  • benchmarks/tool_call_benchmark.py:127 — [benchmark_suite] ToolCall("term-18", "terminal", "Run seq 1 5 to generate a number sequence.",
  • benchmarks/tool_call_benchmark.py:129 — [benchmark_suite] ToolCall("term-19", "terminal", "Run python3 -c 'print(2+2)' to compute 2+2.",
  • benchmarks/tool_call_benchmark.py:131 — [benchmark_suite] ToolCall("term-20", "terminal", "Run ls -d /tmp/bench* 2>/dev/null | wc -l to count bench files.",
  • benchmarks/tool_call_benchmark.py:135 — [benchmark_suite] ToolCall("code-01", "code", "Execute a Python script that computes factorial of 10.",
  • benchmarks/tool_call_benchmark.py:137 — [benchmark_suite] ToolCall("code-02", "code", "Run Python to read /tmp/test_bench.txt and count its words.",
  • benchmarks/tool_call_benchmark.py:139 — [benchmark_suite] ToolCall("code-03", "code", "Execute Python to generate the first 20 Fibonacci numbers.",
  • benchmarks/tool_call_benchmark.py:141 — [benchmark_suite] ToolCall("code-04", "code", "Run Python to parse JSON from a string and print keys.",
  • benchmarks/tool_call_benchmark.py:143 — [benchmark_suite] ToolCall("code-05", "code", "Execute Python to list all files in /tmp/ matching 'bench*'.",
  • benchmarks/tool_call_benchmark.py:145 — [benchmark_suite] ToolCall("code-06", "code", "Run Python to compute the sum of squares from 1 to 100.",
  • benchmarks/tool_call_benchmark.py:147 — [benchmark_suite] ToolCall("code-07", "code", "Execute Python to check if 'racecar' is a palindrome.",
  • benchmarks/tool_call_benchmark.py:149 — [benchmark_suite] ToolCall("code-08", "code", "Run Python to create a CSV string with 5 rows of sample data.",
  • benchmarks/tool_call_benchmark.py:151 — [benchmark_suite] ToolCall("code-09", "code", "Execute Python to sort a list [5,2,8,1,9] and print the result.",
  • benchmarks/tool_call_benchmark.py:153 — [benchmark_suite] ToolCall("code-10", "code", "Run Python to count lines in /etc/passwd.",
  • benchmarks/tool_call_benchmark.py:155 — [benchmark_suite] ToolCall("code-11", "code", "Execute Python to hash the string 'benchmark' with SHA256.",
  • benchmarks/tool_call_benchmark.py:157 — [benchmark_suite] ToolCall("code-12", "code", "Run Python to get the current UTC timestamp.",
  • benchmarks/tool_call_benchmark.py:159 — [benchmark_suite] ToolCall("code-13", "code", "Execute Python to convert 'hello world' to uppercase and reverse it.",
  • benchmarks/tool_call_benchmark.py:161 — [benchmark_suite] ToolCall("code-14", "code", "Run Python to create a dictionary of system info (platform, python version).",
  • benchmarks/tool_call_benchmark.py:163 — [benchmark_suite] ToolCall("code-15", "code", "Execute Python to check internet connectivity by resolving google.com.",
  • benchmarks/tool_call_benchmark.py:167 — [benchmark_suite] ToolCall("deleg-01", "delegate", "Use a subagent to find all .log files in /tmp/.",
  • benchmarks/tool_call_benchmark.py:169 — [benchmark_suite] ToolCall("deleg-02", "delegate", "Delegate to a subagent: what is 15 * 37?",
  • benchmarks/tool_call_benchmark.py:171 — [benchmark_suite] ToolCall("deleg-03", "delegate", "Use a subagent to check if Python 3 is installed and its version.",
  • benchmarks/tool_call_benchmark.py:173 — [benchmark_suite] ToolCall("deleg-04", "delegate", "Delegate: read /tmp/test_bench.txt and summarize it in one sentence.",
  • benchmarks/tool_call_benchmark.py:175 — [benchmark_suite] ToolCall("deleg-05", "delegate", "Use a subagent to list the contents of /tmp/ directory.",
  • benchmarks/tool_call_benchmark.py:177 — [benchmark_suite] ToolCall("deleg-06", "delegate", "Delegate: count the number of .py files in the current directory.",
  • benchmarks/tool_call_benchmark.py:179 — [benchmark_suite] ToolCall("deleg-07", "delegate", "Use a subagent to check disk space with df -h.",
  • benchmarks/tool_call_benchmark.py:181 — [benchmark_suite] ToolCall("deleg-08", "delegate", "Delegate: what OS are we running on?",
  • benchmarks/tool_call_benchmark.py:183 — [benchmark_suite] ToolCall("deleg-09", "delegate", "Use a subagent to find the hostname of this machine.",
  • benchmarks/tool_call_benchmark.py:185 — [benchmark_suite] ToolCall("deleg-10", "delegate", "Delegate: create a temp file /tmp/bench_deleg.txt with 'done'.",
  • benchmarks/tool_call_benchmark.py:189 — [benchmark_suite] ToolCall("todo-01", "todo", "Add a todo item: 'Run benchmark suite'",
  • benchmarks/tool_call_benchmark.py:190 — [benchmark_suite] "todo", "benchmark"),
  • benchmarks/tool_call_benchmark.py:191 — [benchmark_suite] ToolCall("todo-02", "todo", "Show me the current todo list.",
  • benchmarks/tool_call_benchmark.py:193 — [benchmark_suite] ToolCall("todo-03", "todo", "Mark the first todo item as completed.",
  • benchmarks/tool_call_benchmark.py:195 — [benchmark_suite] ToolCall("todo-04", "todo", "Add a todo: 'Review benchmark results' with status pending.",
  • benchmarks/tool_call_benchmark.py:197 — [benchmark_suite] ToolCall("todo-05", "todo", "Clear all completed todos.",
  • benchmarks/tool_call_benchmark.py:199 — [benchmark_suite] ToolCall("todo-06", "memory", "Save this to memory: 'benchmark ran on {date}'".format(
  • benchmarks/tool_call_benchmark.py:201 — [benchmark_suite] "memory", "benchmark"),
  • benchmarks/tool_call_benchmark.py:202 — [benchmark_suite] ToolCall("todo-07", "memory", "Search memory for 'benchmark'.",
  • benchmarks/tool_call_benchmark.py:203 — [benchmark_suite] "memory", "benchmark"),
  • benchmarks/tool_call_benchmark.py:204 — [benchmark_suite] ToolCall("todo-08", "memory", "Add a memory note: 'test models are gemma-4 and mimo-v2-pro'.",
  • benchmarks/tool_call_benchmark.py:206 — [benchmark_suite] ToolCall("todo-09", "todo", "Add three todo items: 'analyze', 'report', 'cleanup'.",
  • benchmarks/tool_call_benchmark.py:208 — [benchmark_suite] ToolCall("todo-10", "memory", "Search memory for any notes about models.",
  • benchmarks/tool_call_benchmark.py:212 — [benchmark_suite] ToolCall("skill-01", "skills", "List all available skills.",
  • benchmarks/tool_call_benchmark.py:214 — [benchmark_suite] ToolCall("skill-02", "skills", "View the skill called 'test-driven-development'.",
  • benchmarks/tool_call_benchmark.py:216 — [benchmark_suite] ToolCall("skill-03", "skills", "Search for skills related to 'git'.",
  • benchmarks/tool_call_benchmark.py:218 — [benchmark_suite] ToolCall("skill-04", "skills", "View the 'code-review' skill.",
  • benchmarks/tool_call_benchmark.py:220 — [benchmark_suite] ToolCall("skill-05", "skills", "List all skills in the 'devops' category.",
  • benchmarks/tool_call_benchmark.py:222 — [benchmark_suite] ToolCall("skill-06", "skills", "View the 'systematic-debugging' skill.",
  • benchmarks/tool_call_benchmark.py:224 — [benchmark_suite] ToolCall("skill-07", "skills", "Search for skills about 'testing'.",
  • benchmarks/tool_call_benchmark.py:226 — [benchmark_suite] ToolCall("skill-08", "skills", "View the 'writing-plans' skill.",
  • benchmarks/tool_call_benchmark.py:228 — [benchmark_suite] ToolCall("skill-09", "skills", "List skills in 'software-development' category.",
  • benchmarks/tool_call_benchmark.py:230 — [benchmark_suite] ToolCall("skill-10", "skills", "View the 'pr-review-discipline' skill.",
  • benchmarks/tool_call_benchmark.py:234 — [benchmark_suite] ToolCall("file-21", "file", "Write a Python snippet to /tmp/bench_sort.py that sorts [3,1,2].",
  • benchmarks/tool_call_benchmark.py:236 — [benchmark_suite] ToolCall("file-22", "file", "Read /tmp/bench_sort.py back and confirm it exists.",
  • benchmarks/tool_call_benchmark.py:238 — [benchmark_suite] ToolCall("file-23", "file", "Search for 'class' in all .py files in the benchmarks directory.",
  • benchmarks/tool_call_benchmark.py:240 — [benchmark_suite] ToolCall("term-21", "terminal", "Run cat /etc/os-release 2>/dev/null || sw_vers 2>/dev/null for OS info.",
  • benchmarks/tool_call_benchmark.py:242 — [benchmark_suite] ToolCall("term-22", "terminal", "Run nproc 2>/dev/null || sysctl -n hw.ncpu 2>/dev/null for CPU count.",
  • benchmarks/tool_call_benchmark.py:244 — [benchmark_suite] ToolCall("code-16", "code", "Execute Python to flatten a nested list [[1,2],[3,4],[5]].",
  • benchmarks/tool_call_benchmark.py:246 — [benchmark_suite] ToolCall("code-17", "code", "Run Python to check if a number 17 is prime.",
  • benchmarks/tool_call_benchmark.py:248 — [benchmark_suite] ToolCall("deleg-11", "delegate", "Delegate: what is the current working directory?",
  • benchmarks/tool_call_benchmark.py:250 — [benchmark_suite] ToolCall("todo-11", "todo", "Add a todo: 'Finalize benchmark report' status pending.",
  • benchmarks/tool_call_benchmark.py:252 — [benchmark_suite] ToolCall("todo-12", "memory", "Store fact: 'benchmark categories: file, terminal, code, delegate, todo, memory, skills'.",
  • benchmarks/tool_call_benchmark.py:254 — [benchmark_suite] ToolCall("skill-11", "skills", "Search for skills about 'deployment'.",
  • benchmarks/tool_call_benchmark.py:256 — [benchmark_suite] ToolCall("skill-12", "skills", "View the 'gitea-burn-cycle' skill.",
  • benchmarks/tool_call_benchmark.py:258 — [benchmark_suite] ToolCall("skill-13", "skills", "List all available skill categories.",
  • benchmarks/tool_call_benchmark.py:260 — [benchmark_suite] ToolCall("skill-14", "skills", "Search for skills related to 'memory'.",
  • benchmarks/tool_call_benchmark.py:262 — [benchmark_suite] ToolCall("skill-15", "skills", "View the 'mimo-swarm' skill.",
  • benchmarks/tool_call_benchmark.py:311 — [benchmark_suite] """Create prerequisite files for the benchmark."""
  • benchmarks/tool_call_benchmark.py:313 — [benchmark_suite] "This is a benchmark test file.\n"
  • benchmarks/tool_call_benchmark.py:349 — [benchmark_suite] "You are a benchmark test runner. Execute the user's request by calling "
  • benchmarks/tool_call_benchmark.py:406 — [benchmark_suite] """Generate markdown benchmark report."""
  • benchmarks/tool_call_benchmark.py:428 — [benchmark_suite] f"# Tool-Calling Benchmark Report",
  • benchmarks/tool_call_benchmark.py:535 — [benchmark_suite] parser = argparse.ArgumentParser(description="Tool-calling benchmark")
  • benchmarks/tool_call_benchmark.py:544 — [benchmark_suite] help="Output report path (default: benchmarks/gemma4-tool-calling-YYYY-MM-DD.md)")
  • benchmarks/tool_call_benchmark.py:565 — [benchmark_suite] output_path = Path(args.output) if args.output else REPO_ROOT / "benchmarks" / f"gemma4-tool-calling-{date_str}.md"
  • benchmarks/tool_call_benchmark.py:575 — [benchmark_suite] print(f"Benchmark: {len(suite)} tests × {len(model_specs)} models = {len(suite) * len(model_specs)} calls")
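
The `smart_model_routing` entries above pin down the per-turn contract any replacement would have to honor: given the user message, the routing config, and a primary `{provider, model}` route, return the route for this turn. Below is a shape-only sketch of that contract; the cheap-route condition is a hypothetical placeholder, since the real heuristics live in agent/smart_model_routing.py and are not reproduced here.

```python
from typing import Any, Dict, Optional


def resolve_turn_route_sketch(
    user_message: str,
    routing_config: Optional[Dict[str, Any]],
    primary: Dict[str, Any],
) -> Dict[str, Any]:
    """Shape-only stand-in for agent/smart_model_routing.resolve_turn_route.

    The condition below is a made-up placeholder for choose_cheap_model_route;
    only the input/output contract matters for the migration argument.
    """
    if routing_config:
        cheap = routing_config.get("cheap_route")  # e.g. {"provider": ..., "model": ...}
        if cheap and len(user_message) <= routing_config.get("cheap_max_chars", 280):
            return cheap
    # No cheap route applies: fall through to the primary provider/model.
    return primary
```

A TensorZero variant configuration would need an equivalent per-turn selection hook before this seam could move out of Hermes.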

## Suggested next slice

  1. Build an exporter that emits SessionDB + trajectory data into a TensorZero-friendly offline dataset (a minimal sketch follows this list).
  2. Define percentage-based canary controls before attempting any gateway replacement.
  3. Keep Hermes routing authoritative until TensorZero proves parity across CLI, gateway, cron, auxiliary, and delegation surfaces.
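
For slice 1, a minimal exporter sketch. It assumes the JSONL layout written by batch_runner.py (one `json.dumps(trajectory_entry)` per line); the `episode_id`/`messages`/`feedback` field names are illustrative mappings, not batch_runner's actual schema, and the TensorZero-side ingestion format would need to be confirmed against its documentation.

```python
# Re-shape Hermes trajectory JSONL into a generic offline dataset that an
# offline TensorZero experiment loop could consume. All output field names
# here are assumptions for illustration.
import json
from pathlib import Path
from typing import Any, Dict, Iterator


def export_offline_dataset(trajectory_path: Path) -> Iterator[Dict[str, Any]]:
    with trajectory_path.open(encoding="utf-8") as f:
        for line in f:
            entry = json.loads(line)
            yield {
                "episode_id": entry.get("session_id"),
                "messages": entry.get("messages", []),
                "feedback": entry.get("reward"),
            }


# Usage: write the re-shaped records back out as JSONL.
# for rec in export_offline_dataset(Path("trajectories.jsonl")):
#     print(json.dumps(rec, ensure_ascii=False))
```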