All checks were successful
Lint / lint (pull_request) Successful in 37s
- add a script that inventories Hermes routing/evaluation surfaces relevant to a TensorZero cutover
- generate a markdown and JSON evaluation packet for issue #860
- score gateway replacement, config migration, canary rollout, session feedback, and eval-suite readiness
- add focused regression tests for touchpoint scanning, requirement scoring, and report rendering

Refs #860
TensorZero Evaluation Packet
Issue #860: tensorzero LLMOps platform evaluation
Scope
This packet evaluates TensorZero as a possible replacement for Hermes' custom provider-routing stack. It is intentionally grounded in the current repo state rather than a speculative cutover plan.
Issue requirements being evaluated
- Deploy tensorzero gateway (Rust binary)
- Migrate provider routing config
- Test with canary (10% traffic) before full cutover
- Feed session data for prompt optimization
- Evaluation suite for A/B testing models
Recommendation
Not ready for direct replacement. Recommend a shadow-evaluation phase first: keep Hermes routing live, inventory the migration seams, export SessionDB/trajectory data into an offline TensorZero experiment loop, and only design a canary gateway once percentage-based rollout controls exist.
Requirement matrix
| Requirement | Status | Evidence labels | Summary |
|---|---|---|---|
| Gateway replacement scope | partial | fallback_chain, runtime_provider, gateway_provider_routing, cron_runtime_provider, auxiliary_fallback_chain, delegate_runtime_provider | Hermes already spreads provider routing across core agent, runtime provider, gateway, cron, auxiliary, and delegation seams; TensorZero would need parity across all of them before it can replace the gateway layer. |
| Config migration | partial | provider_routing_config, runtime_provider, smart_model_routing, fallback_chain | Hermes has multiple config concepts to migrate (provider_routing, fallback_providers, smart_model_routing, runtime provider resolution), so TensorZero is not a drop-in config swap. |
| 10% traffic canary | gap | — | The repo shows semantic routing and fallback, but no grounded 10% traffic-split canary mechanism. A TensorZero cutover would need new percentage-based rollout controls and observability hooks. |
| Session data for prompt optimization | partial | session_db, trajectory_export | Hermes already has SessionDB and trajectory export surfaces that can feed offline optimization data, but not a TensorZero-native ingestion path yet. |
| Evaluation suite / A/B testing | partial | benchmark_suite, trajectory_export | Hermes already has benchmark/trajectory machinery that can seed TensorZero A/B evaluation, but no integrated TensorZero experiment runner or live evaluation gateway. |
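The canary row is the only outright gap. To make the missing piece concrete, here is a minimal sketch of what deterministic percentage-based bucketing could look like. Every name in it (`in_canary`, `route`, `CANARY_PERCENT`) is hypothetical, not an existing Hermes or TensorZero API; the point is that hashing the session id, rather than rolling `random.random()` per request, keeps a session pinned to one backend across turns, which any A/B comparison would need.

```python
import hashlib

CANARY_PERCENT = 10  # the 10% traffic slice from issue #860


def in_canary(session_id: str, percent: int = CANARY_PERCENT) -> bool:
    """Deterministically assign a session to the canary slice.

    SHA-256 of the session id gives a stable, roughly uniform bucket in
    0..99, so the same session always lands on the same side of the split.
    """
    digest = hashlib.sha256(session_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:2], "big") % 100
    return bucket < percent


def route(session_id: str) -> str:
    # Hypothetical dispatch: canary sessions would hit the TensorZero
    # gateway, everything else stays on the existing Hermes routing stack.
    return "tensorzero" if in_canary(session_id) else "hermes"
```

A real cutover would also need per-bucket observability (latency, error, and fallback-rate metrics split by backend) before the slice is widened.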
Grounded Hermes touchpoints
run_agent.py — [fallback_chain]
- 601: fallback_model: Dict[str, Any] = None,
- 995: # failure). Supports both legacy single-dict `fallback_model` and
- 996: # new list `fallback_providers` format.
- 997: if isinstance(fallback_model, list):
- 998: self._fallback_chain = [
- 999: f for f in fallback_model
- 1002: elif isinstance(fallback_model, dict) and fallback_model.get("provider") and fallback_model.get("model"):
- 1003: self._fallback_chain = [fallback_model]
- 1005: self._fallback_chain = []
- 1009: self._fallback_model = self._fallback_chain[0] if self._fallback_chain else None
- 1010: if self._fallback_chain and not self.quiet_mode:
- 1011: if len(self._fallback_chain) == 1:
- 1012: fb = self._fallback_chain[0]
- 1015: print(f"🔄 Fallback chain ({len(self._fallback_chain)} providers): " +
- 1016: " → ".join(f"{f['model']} ({f['provider']})" for f in self._fallback_chain))
- 5624: if self._fallback_index >= len(self._fallback_chain):
- 5627: fb = self._fallback_chain[self._fallback_index]
- 8559: if self._fallback_index < len(self._fallback_chain):
- 9355: if is_rate_limited and self._fallback_index < len(self._fallback_chain):
- 10460: if _truly_empty and self._fallback_chain:
- 10514: + (" and fallback attempts." if self._fallback_chain else

cli.py — [provider_routing_config]
- 241: "smart_model_routing": {
- 370: # (e.g. platform_toolsets, provider_routing, memory, honcho, etc.)
- 1753: pr = CLI_CONFIG.get("provider_routing", {}) or {}
- 1762: # Supports new list format (fallback_providers) and legacy single-dict (fallback_model).
- 1763: fb = CLI_CONFIG.get("fallback_providers") or CLI_CONFIG.get("fallback_model") or []
- 1770: self._smart_model_routing = CLI_CONFIG.get("smart_model_routing", {}) or {}
- 2771: from agent.smart_model_routing import resolve_turn_route
- 2776: self._smart_model_routing,

hermes_cli/runtime_provider.py — [runtime_provider]
- 209: def resolve_requested_provider(requested: Optional[str] = None) -> str:
- 649: def resolve_runtime_provider(

agent/smart_model_routing.py — [smart_model_routing]
- 62: def choose_cheap_model_route(user_message: str, routing_config: Optional[Dict[str, Any]]) -> Optional[Dict[str, Any]]:
- 110: def resolve_turn_route(user_message: str, routing_config: Optional[Dict[str, Any]], primary: Dict[str, Any]) -> Dict[str, Any]:

gateway/run.py — [gateway_provider_routing]
- 1271: def _load_provider_routing() -> dict:
- 1285: def _load_fallback_model() -> list | dict | None:
- 1306: def _load_smart_model_routing() -> dict:

cron/scheduler.py — [cron_runtime_provider]
- 684: pr = _cfg.get("provider_routing", {})
- 688: resolve_runtime_provider,
- 697: runtime = resolve_runtime_provider(**runtime_kwargs)
- 702: from agent.smart_model_routing import resolve_turn_route
- 703: turn_route = resolve_turn_route(
- 717: fallback_model = _cfg.get("fallback_providers") or _cfg.get("fallback_model") or None
- 746: fallback_model=fallback_model,

agent/auxiliary_client.py — [auxiliary_fallback_chain]
- 1018: def _get_provider_chain() -> List[tuple]:
- 1107: for label, try_fn in _get_provider_chain():
- 1189: # ── Step 2: aggregator / fallback chain ──────────────────────────────
- 1191: for label, try_fn in _get_provider_chain():
- 2397: # error, fall through to the fallback chain below.
- 2417: # auto (the default) = best-effort fallback chain. (#7559)
- 2589: # error, fall through to the fallback chain below.

tools/delegate_tool.py — [delegate_runtime_provider]
- 662: # bundle (base_url, api_key, api_mode) via the same runtime provider system
- 854: provider) is resolved via the runtime provider system — the same path used
- 909: from hermes_cli.runtime_provider import resolve_runtime_provider
- 910: runtime = resolve_runtime_provider(requested=configured_provider)

hermes_state.py — [session_db]
- 115: class SessionDB:

batch_runner.py — [trajectory_export]
- 320: save_trajectories=False,  # We handle saving ourselves
- 346: trajectory = agent._convert_to_trajectory_format(
- 460: trajectory_entry = {
- 474: f.write(json.dumps(trajectory_entry, ensure_ascii=False) + "\n")

benchmarks/tool_call_benchmark.py — [benchmark_suite]
- 3: Tool-Calling Benchmark — Gemma 4 vs mimo-v2-pro regression test.
- 9: python3 benchmarks/tool_call_benchmark.py                  # full 100-call suite
- 10: python3 benchmarks/tool_call_benchmark.py --limit 10      # quick smoke test
- 11: python3 benchmarks/tool_call_benchmark.py --models nous   # single model
- 12: python3 benchmarks/tool_call_benchmark.py --category file # single category
- 37: class ToolCall:
- 51: ToolCall("file-01", "file", "Read the file /tmp/test_bench.txt and show me its contents.",
- 53: ToolCall("file-02", "file", "Write 'hello benchmark' to /tmp/test_bench_out.txt",
- 55: ToolCall("file-03", "file", "Search for the word 'import' in all Python files in the current directory.",
- 57: ToolCall("file-04", "file", "Read lines 1-20 of /etc/hosts",
- 59: ToolCall("file-05", "file", "Patch /tmp/test_bench_out.txt: replace 'hello' with 'goodbye'",
- 61: ToolCall("file-06", "file", "Search for files matching *.py in the current directory.",
- 63: ToolCall("file-07", "file", "Read the first 10 lines of /etc/passwd",
- 65: ToolCall("file-08", "file", "Write a JSON config to /tmp/bench_config.json with key 'debug': true",
- 67: ToolCall("file-09", "file", "Search for 'def test_' in Python test files.",
- 69: ToolCall("file-10", "file", "Read /tmp/bench_config.json and tell me what's in it.",
- 71: ToolCall("file-11", "file", "Create a file /tmp/bench_readme.md with one line: '# Benchmark'",
- 73: ToolCall("file-12", "file", "Search for 'TODO' comments in all .py files.",
- 75: ToolCall("file-13", "file", "Read /tmp/bench_readme.md",
- 77: ToolCall("file-14", "file", "Patch /tmp/bench_readme.md: replace '# Benchmark' with '# Tool Benchmark'",
- 78: "patch", "Tool Benchmark"),
- 79: ToolCall("file-15", "file", "Write a Python one-liner to /tmp/bench_hello.py that prints hello.",
- 81: ToolCall("file-16", "file", "Search for all .json files in /tmp/.",
- 83: ToolCall("file-17", "file", "Read /tmp/bench_hello.py and verify it has print('hello').",
- 85: ToolCall("file-18", "file", "Patch /tmp/bench_hello.py to print 'hello world' instead of 'hello'.",
- 87: ToolCall("file-19", "file", "List files matching 'bench*' in /tmp/.",
- 89: ToolCall("file-20", "file", "Read /tmp/test_bench.txt again and summarize its contents.",
- 93: ToolCall("term-01", "terminal", "Run `echo hello world` in the terminal.",
- 95: ToolCall("term-02", "terminal", "Run `date` to get the current date and time.",
- 97: ToolCall("term-03", "terminal", "Run `uname -a` to get system information.",
- 99: ToolCall("term-04", "terminal", "Run `pwd` to show the current directory.",
- 101: ToolCall("term-05", "terminal", "Run `ls -la /tmp/ | head -20` to list temp files.",
- 103: ToolCall("term-06", "terminal", "Run `whoami` to show the current user.",
- 105: ToolCall("term-07", "terminal", "Run `df -h` to show disk usage.",
- 107: ToolCall("term-08", "terminal", "Run `python3 --version` to check Python version.",
- 109: ToolCall("term-09", "terminal", "Run `cat /etc/hostname` to get the hostname.",
- 111: ToolCall("term-10", "terminal", "Run `uptime` to see system uptime.",
- 113: ToolCall("term-11", "terminal", "Run `env | grep PATH` to show the PATH variable.",
- 115: ToolCall("term-12", "terminal", "Run `wc -l /etc/passwd` to count lines.",
- 117: ToolCall("term-13", "terminal", "Run `echo $SHELL` to show the current shell.",
- 119: ToolCall("term-14", "terminal", "Run `free -h || vm_stat` to check memory usage.",
- 121: ToolCall("term-15", "terminal", "Run `id` to show user and group IDs.",
- 123: ToolCall("term-16", "terminal", "Run `hostname` to get the machine hostname.",
- 125: ToolCall("term-17", "terminal", "Run `echo {1..5}` to test brace expansion.",
- 127: ToolCall("term-18", "terminal", "Run `seq 1 5` to generate a number sequence.",
- 129: ToolCall("term-19", "terminal", "Run `python3 -c 'print(2+2)'` to compute 2+2.",
- 131: ToolCall("term-20", "terminal", "Run `ls -d /tmp/bench* 2>/dev/null | wc -l` to count bench files.",
- 135: ToolCall("code-01", "code", "Execute a Python script that computes factorial of 10.",
- 137: ToolCall("code-02", "code", "Run Python to read /tmp/test_bench.txt and count its words.",
- 139: ToolCall("code-03", "code", "Execute Python to generate the first 20 Fibonacci numbers.",
- 141: ToolCall("code-04", "code", "Run Python to parse JSON from a string and print keys.",
- 143: ToolCall("code-05", "code", "Execute Python to list all files in /tmp/ matching 'bench*'.",
- 145: ToolCall("code-06", "code", "Run Python to compute the sum of squares from 1 to 100.",
- 147: ToolCall("code-07", "code", "Execute Python to check if 'racecar' is a palindrome.",
- 149: ToolCall("code-08", "code", "Run Python to create a CSV string with 5 rows of sample data.",
- 151: ToolCall("code-09", "code", "Execute Python to sort a list [5,2,8,1,9] and print the result.",
- 153: ToolCall("code-10", "code", "Run Python to count lines in /etc/passwd.",
- 155: ToolCall("code-11", "code", "Execute Python to hash the string 'benchmark' with SHA256.",
- 157: ToolCall("code-12", "code", "Run Python to get the current UTC timestamp.",
- 159: ToolCall("code-13", "code", "Execute Python to convert 'hello world' to uppercase and reverse it.",
- 161: ToolCall("code-14", "code", "Run Python to create a dictionary of system info (platform, python version).",
- 163: ToolCall("code-15", "code", "Execute Python to check internet connectivity by resolving google.com.",
- 167: ToolCall("deleg-01", "delegate", "Use a subagent to find all .log files in /tmp/.",
- 169: ToolCall("deleg-02", "delegate", "Delegate to a subagent: what is 15 * 37?",
- 171: ToolCall("deleg-03", "delegate", "Use a subagent to check if Python 3 is installed and its version.",
- 173: ToolCall("deleg-04", "delegate", "Delegate: read /tmp/test_bench.txt and summarize it in one sentence.",
- 175: ToolCall("deleg-05", "delegate", "Use a subagent to list the contents of /tmp/ directory.",
- 177: ToolCall("deleg-06", "delegate", "Delegate: count the number of .py files in the current directory.",
- 179: ToolCall("deleg-07", "delegate", "Use a subagent to check disk space with df -h.",
- 181: ToolCall("deleg-08", "delegate", "Delegate: what OS are we running on?",
- 183: ToolCall("deleg-09", "delegate", "Use a subagent to find the hostname of this machine.",
- 185: ToolCall("deleg-10", "delegate", "Delegate: create a temp file /tmp/bench_deleg.txt with 'done'.",
- 189: ToolCall("todo-01", "todo", "Add a todo item: 'Run benchmark suite'",
- 190: "todo", "benchmark"),
- 191: ToolCall("todo-02", "todo", "Show me the current todo list.",
- 193: ToolCall("todo-03", "todo", "Mark the first todo item as completed.",
- 195: ToolCall("todo-04", "todo", "Add a todo: 'Review benchmark results' with status pending.",
- 197: ToolCall("todo-05", "todo", "Clear all completed todos.",
- 199: ToolCall("todo-06", "memory", "Save this to memory: 'benchmark ran on {date}'".format(
- 201: "memory", "benchmark"),
- 202: ToolCall("todo-07", "memory", "Search memory for 'benchmark'.",
- 203: "memory", "benchmark"),
- 204: ToolCall("todo-08", "memory", "Add a memory note: 'test models are gemma-4 and mimo-v2-pro'.",
- 206: ToolCall("todo-09", "todo", "Add three todo items: 'analyze', 'report', 'cleanup'.",
- 208: ToolCall("todo-10", "memory", "Search memory for any notes about models.",
- 212: ToolCall("skill-01", "skills", "List all available skills.",
- 214: ToolCall("skill-02", "skills", "View the skill called 'test-driven-development'.",
- 216: ToolCall("skill-03", "skills", "Search for skills related to 'git'.",
- 218: ToolCall("skill-04", "skills", "View the 'code-review' skill.",
- 220: ToolCall("skill-05", "skills", "List all skills in the 'devops' category.",
- 222: ToolCall("skill-06", "skills", "View the 'systematic-debugging' skill.",
- 224: ToolCall("skill-07", "skills", "Search for skills about 'testing'.",
- 226: ToolCall("skill-08", "skills", "View the 'writing-plans' skill.",
- 228: ToolCall("skill-09", "skills", "List skills in 'software-development' category.",
- 230: ToolCall("skill-10", "skills", "View the 'pr-review-discipline' skill.",
- 234: ToolCall("file-21", "file", "Write a Python snippet to /tmp/bench_sort.py that sorts [3,1,2].",
- 236: ToolCall("file-22", "file", "Read /tmp/bench_sort.py back and confirm it exists.",
- 238: ToolCall("file-23", "file", "Search for 'class' in all .py files in the benchmarks directory.",
- 240: ToolCall("term-21", "terminal", "Run `cat /etc/os-release 2>/dev/null || sw_vers 2>/dev/null` for OS info.",
- 242: ToolCall("term-22", "terminal", "Run `nproc 2>/dev/null || sysctl -n hw.ncpu 2>/dev/null` for CPU count.",
- 244: ToolCall("code-16", "code", "Execute Python to flatten a nested list [[1,2],[3,4],[5]].",
- 246: ToolCall("code-17", "code", "Run Python to check if a number 17 is prime.",
- 248: ToolCall("deleg-11", "delegate", "Delegate: what is the current working directory?",
- 250: ToolCall("todo-11", "todo", "Add a todo: 'Finalize benchmark report' status pending.",
- 252: ToolCall("todo-12", "memory", "Store fact: 'benchmark categories: file, terminal, code, delegate, todo, memory, skills'.",
- 254: ToolCall("skill-11", "skills", "Search for skills about 'deployment'.",
- 256: ToolCall("skill-12", "skills", "View the 'gitea-burn-cycle' skill.",
- 258: ToolCall("skill-13", "skills", "List all available skill categories.",
- 260: ToolCall("skill-14", "skills", "Search for skills related to 'memory'.",
- 262: ToolCall("skill-15", "skills", "View the 'mimo-swarm' skill.",
- 311: """Create prerequisite files for the benchmark."""
- 313: "This is a benchmark test file.\n"
- 349: "You are a benchmark test runner. Execute the user's request by calling "
- 406: """Generate markdown benchmark report."""
- 428: f"# Tool-Calling Benchmark Report",
- 535: parser = argparse.ArgumentParser(description="Tool-calling benchmark")
- 544: help="Output report path (default: benchmarks/gemma4-tool-calling-YYYY-MM-DD.md)")
- 565: output_path = Path(args.output) if args.output else REPO_ROOT / "benchmarks" / f"gemma4-tool-calling-{date_str}.md"
- 575: print(f"Benchmark: {len(suite)} tests × {len(model_specs)} models = {len(suite) * len(model_specs)} calls")
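The `fallback_chain` evidence above shows Hermes normalizing two config shapes: the legacy single-dict `fallback_model` and the newer `fallback_providers` list. A standalone approximation of that normalization, reconstructed from the quoted lines, is below. The comprehension's filter condition (around run_agent.py:1000) is elided in the scan, so the validity check used here is an assumption, and the surrounding class context is omitted.

```python
from typing import Any, Dict, List, Optional, Union

FallbackConfig = Union[List[Dict[str, Any]], Dict[str, Any], None]


def build_fallback_chain(fallback_model: FallbackConfig) -> List[Dict[str, Any]]:
    """Normalize both fallback config shapes into one ordered chain.

    Approximates run_agent.py:997-1005: list input keeps well-formed
    entries, a legacy dict becomes a one-element chain, anything else
    yields an empty chain.
    """
    if isinstance(fallback_model, list):
        # Assumed filter: keep only entries with both provider and model set.
        return [
            f for f in fallback_model
            if isinstance(f, dict) and f.get("provider") and f.get("model")
        ]
    if (
        isinstance(fallback_model, dict)
        and fallback_model.get("provider")
        and fallback_model.get("model")
    ):
        return [fallback_model]
    return []
```

Hermes then walks this chain via `_fallback_index` on rate limits and empty responses (run_agent.py:5624, 9355, 10460), which is exactly the behavior TensorZero's gateway would need to reproduce for parity.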
Suggested next slice
- Build an exporter that emits SessionDB + trajectory data into a TensorZero-friendly offline dataset.
- Define percentage-based canary controls before attempting any gateway replacement.
- Keep Hermes routing authoritative until TensorZero proves parity across CLI, gateway, cron, auxiliary, and delegation surfaces.
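The exporter in the first bullet could start as a small JSONL re-shaper over the trajectory entries that batch_runner.py already writes. A hedged sketch follows: the input field names (`session_id`, `prompt`, `response`, `provider`, `model`) and the output vocabulary (`episode_id`, `input`, `output`) are assumptions for illustration, not a documented Hermes or TensorZero schema, and `to_tensorzero_rows` is not an existing function in the repo.

```python
import json
from typing import Any, Dict, Iterable, Iterator


def to_tensorzero_rows(trajectories: Iterable[Dict[str, Any]]) -> Iterator[Dict[str, Any]]:
    """Re-shape trajectory entries into flat rows for an offline
    experiment loop. All field names here are assumed placeholders."""
    for entry in trajectories:
        yield {
            "episode_id": entry.get("session_id"),
            "input": entry.get("prompt") or entry.get("messages"),
            "output": entry.get("response"),
            "metadata": {
                "provider": entry.get("provider"),
                "model": entry.get("model"),
            },
        }


def export_jsonl(trajectories: Iterable[Dict[str, Any]], path: str) -> int:
    """Write re-shaped rows as JSONL; returns the row count."""
    count = 0
    with open(path, "w", encoding="utf-8") as f:
        for row in to_tensorzero_rows(trajectories):
            f.write(json.dumps(row, ensure_ascii=False) + "\n")
            count += 1
    return count
```

Starting offline like this keeps Hermes routing authoritative while the exported dataset exercises TensorZero's evaluation loop without touching live traffic.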