# TensorZero Evaluation Packet

Issue #860: tensorzero LLMOps platform evaluation

## Scope

This packet evaluates TensorZero as a possible replacement for Hermes' custom provider-routing stack. It is intentionally grounded in the current repo state rather than in a speculative cutover plan.

## Issue requirements being evaluated

  • Deploy tensorzero gateway (Rust binary)
  • Migrate provider routing config
  • Test with canary (10% traffic) before full cutover
  • Feed session data for prompt optimization
  • Evaluation suite for A/B testing models

## Recommendation

Not ready for direct replacement. Recommend a shadow-evaluation phase first: keep Hermes routing live, inventory the migration seams, export SessionDB/trajectory data into an offline TensorZero experiment loop, and only design a canary gateway once percentage-based rollout controls exist.

## Requirement matrix

| Requirement | Status | Evidence labels | Summary |
| --- | --- | --- | --- |
| Gateway replacement scope | partial | `fallback_chain`, `runtime_provider`, `gateway_provider_routing`, `cron_runtime_provider`, `auxiliary_fallback_chain`, `delegate_runtime_provider` | Hermes already spreads provider routing across core-agent, runtime-provider, gateway, cron, auxiliary, and delegation seams; TensorZero would need parity across all of them before it could replace the gateway layer. |
| Config migration | partial | `provider_routing_config`, `runtime_provider`, `smart_model_routing`, `fallback_chain` | Hermes has multiple config concepts to migrate (`provider_routing`, `fallback_providers`, `smart_model_routing`, runtime provider resolution), so TensorZero is not a drop-in config swap. A normalization sketch follows the matrix. |
| 10% traffic canary | gap | (none) | The repo shows semantic routing and fallback, but no grounded 10% traffic-split canary mechanism. A TensorZero cutover would need new percentage-based rollout controls and observability hooks; a splitter sketch follows the matrix. |
| Session data for prompt optimization | partial | `session_db`, `trajectory_export` | Hermes already has SessionDB and trajectory-export surfaces that can feed offline optimization data, but no TensorZero-native ingestion path yet (see the exporter sketch under "Suggested next slice"). |
| Evaluation suite / A/B testing | partial | `benchmark_suite`, `trajectory_export` | Hermes already has benchmark/trajectory machinery that can seed TensorZero A/B evaluation, but no integrated TensorZero experiment runner or live evaluation gateway. |
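
To make the config-migration row concrete, here is a minimal sketch of the normalization Hermes performs across its two fallback formats. It mirrors the guards quoted in the `fallback_chain` touchpoints below (run_agent.py:997-1003) but is a simplified stand-in rather than the actual implementation; the provider and model values in the usage example are illustrative only.

```python
# Sketch of the Hermes-side fallback config shapes a migration would have
# to map. Field names follow the touchpoints below; the TensorZero-side
# target schema is deliberately omitted and would need to be confirmed
# against TensorZero's own config documentation.
from typing import Any, Dict, List


def normalize_fallback_chain(config: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Collapse legacy single-dict `fallback_model` and the newer
    `fallback_providers` list into one ordered provider chain."""
    raw = config.get("fallback_providers") or config.get("fallback_model") or []
    if isinstance(raw, dict):
        raw = [raw]
    # Keep only entries naming both a provider and a model, matching the
    # guard in run_agent.py.
    return [f for f in raw if f.get("provider") and f.get("model")]


# Both formats normalize to the same chain (values illustrative).
legacy = {"fallback_model": {"provider": "openrouter", "model": "gemma-4"}}
modern = {"fallback_providers": [{"provider": "openrouter", "model": "gemma-4"}]}
assert normalize_fallback_chain(legacy) == normalize_fallback_chain(modern)
```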
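
For the canary gap, a minimal sketch of a deterministic percentage-based splitter, assuming sessions are the bucketing unit. Nothing like this exists in the repo today; the `session_id` key and the 10% default are taken from the issue requirement, not from Hermes code.

```python
# Hypothetical percentage-based rollout control; the matrix flags this as
# missing. Hashing keeps a session pinned to one side of the split for its
# whole lifetime, which a gateway cutover comparison needs.
import hashlib


def in_canary(session_id: str, percent: int = 10) -> bool:
    """Deterministically route ~percent% of sessions to the canary gateway."""
    digest = hashlib.sha256(session_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:2], "big") % 100
    return bucket < percent


# e.g. gateway = "tensorzero-canary" if in_canary(sid) else "hermes"
```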

## Grounded Hermes touchpoints

  • run_agent.py:601 — [fallback_chain] fallback_model: Dict[str, Any] = None,
  • run_agent.py:995 — [fallback_chain] # failure). Supports both legacy single-dict fallback_model and
  • run_agent.py:996 — [fallback_chain] # new list fallback_providers format.
  • run_agent.py:997 — [fallback_chain] if isinstance(fallback_model, list):
  • run_agent.py:998 — [fallback_chain] self._fallback_chain = [
  • run_agent.py:999 — [fallback_chain] f for f in fallback_model
  • run_agent.py:1002 — [fallback_chain] elif isinstance(fallback_model, dict) and fallback_model.get("provider") and fallback_model.get("model"):
  • run_agent.py:1003 — [fallback_chain] self._fallback_chain = [fallback_model]
  • run_agent.py:1005 — [fallback_chain] self._fallback_chain = []
  • run_agent.py:1009 — [fallback_chain] self._fallback_model = self._fallback_chain[0] if self._fallback_chain else None
  • run_agent.py:1010 — [fallback_chain] if self._fallback_chain and not self.quiet_mode:
  • run_agent.py:1011 — [fallback_chain] if len(self._fallback_chain) == 1:
  • run_agent.py:1012 — [fallback_chain] fb = self._fallback_chain[0]
  • run_agent.py:1015 — [fallback_chain] print(f"🔄 Fallback chain ({len(self._fallback_chain)} providers): " +
  • run_agent.py:1016 — [fallback_chain] " → ".join(f"{f['model']} ({f['provider']})" for f in self._fallback_chain))
  • run_agent.py:5624 — [fallback_chain] if self._fallback_index >= len(self._fallback_chain):
  • run_agent.py:5627 — [fallback_chain] fb = self._fallback_chain[self._fallback_index]
  • run_agent.py:8559 — [fallback_chain] if self._fallback_index < len(self._fallback_chain):
  • run_agent.py:9355 — [fallback_chain] if is_rate_limited and self._fallback_index < len(self._fallback_chain):
  • run_agent.py:10460 — [fallback_chain] if _truly_empty and self._fallback_chain:
  • run_agent.py:10514 — [fallback_chain] + (" and fallback attempts." if self._fallback_chain else
  • cli.py:241 — [provider_routing_config] "smart_model_routing": {
  • cli.py:370 — [provider_routing_config] # (e.g. platform_toolsets, provider_routing, memory, honcho, etc.)
  • cli.py:1753 — [provider_routing_config] pr = CLI_CONFIG.get("provider_routing", {}) or {}
  • cli.py:1762 — [provider_routing_config] # Supports new list format (fallback_providers) and legacy single-dict (fallback_model).
  • cli.py:1763 — [provider_routing_config] fb = CLI_CONFIG.get("fallback_providers") or CLI_CONFIG.get("fallback_model") or []
  • cli.py:1770 — [provider_routing_config] self._smart_model_routing = CLI_CONFIG.get("smart_model_routing", {}) or {}
  • cli.py:2771 — [provider_routing_config] from agent.smart_model_routing import resolve_turn_route
  • cli.py:2776 — [provider_routing_config] self._smart_model_routing,
  • hermes_cli/runtime_provider.py:209 — [runtime_provider] def resolve_requested_provider(requested: Optional[str] = None) -> str:
  • hermes_cli/runtime_provider.py:649 — [runtime_provider] def resolve_runtime_provider(
  • agent/smart_model_routing.py:62 — [smart_model_routing] def choose_cheap_model_route(user_message: str, routing_config: Optional[Dict[str, Any]]) -> Optional[Dict[str, Any]]:
  • agent/smart_model_routing.py:110 — [smart_model_routing] def resolve_turn_route(user_message: str, routing_config: Optional[Dict[str, Any]], primary: Dict[str, Any]) -> Dict[str, Any]:
  • gateway/run.py:1271 — [gateway_provider_routing] def _load_provider_routing() -> dict:
  • gateway/run.py:1285 — [gateway_provider_routing] def _load_fallback_model() -> list | dict | None:
  • gateway/run.py:1306 — [gateway_provider_routing] def _load_smart_model_routing() -> dict:
  • cron/scheduler.py:684 — [cron_runtime_provider] pr = _cfg.get("provider_routing", {})
  • cron/scheduler.py:688 — [cron_runtime_provider] resolve_runtime_provider,
  • cron/scheduler.py:697 — [cron_runtime_provider] runtime = resolve_runtime_provider(**runtime_kwargs)
  • cron/scheduler.py:702 — [cron_runtime_provider] from agent.smart_model_routing import resolve_turn_route
  • cron/scheduler.py:703 — [cron_runtime_provider] turn_route = resolve_turn_route(
  • cron/scheduler.py:717 — [cron_runtime_provider] fallback_model = _cfg.get("fallback_providers") or _cfg.get("fallback_model") or None
  • cron/scheduler.py:746 — [cron_runtime_provider] fallback_model=fallback_model,
  • agent/auxiliary_client.py:1018 — [auxiliary_fallback_chain] def _get_provider_chain() -> List[tuple]:
  • agent/auxiliary_client.py:1107 — [auxiliary_fallback_chain] for label, try_fn in _get_provider_chain():
  • agent/auxiliary_client.py:1189 — [auxiliary_fallback_chain] # ── Step 2: aggregator / fallback chain ──────────────────────────────
  • agent/auxiliary_client.py:1191 — [auxiliary_fallback_chain] for label, try_fn in _get_provider_chain():
  • agent/auxiliary_client.py:2397 — [auxiliary_fallback_chain] # error, fall through to the fallback chain below.
  • agent/auxiliary_client.py:2417 — [auxiliary_fallback_chain] # auto (the default) = best-effort fallback chain. (#7559)
  • agent/auxiliary_client.py:2589 — [auxiliary_fallback_chain] # error, fall through to the fallback chain below.
  • tools/delegate_tool.py:662 — [delegate_runtime_provider] # bundle (base_url, api_key, api_mode) via the same runtime provider system
  • tools/delegate_tool.py:854 — [delegate_runtime_provider] provider) is resolved via the runtime provider system — the same path used
  • tools/delegate_tool.py:909 — [delegate_runtime_provider] from hermes_cli.runtime_provider import resolve_runtime_provider
  • tools/delegate_tool.py:910 — [delegate_runtime_provider] runtime = resolve_runtime_provider(requested=configured_provider)
  • hermes_state.py:115 — [session_db] class SessionDB:
  • batch_runner.py:320 — [trajectory_export] save_trajectories=False, # We handle saving ourselves
  • batch_runner.py:346 — [trajectory_export] trajectory = agent._convert_to_trajectory_format(
  • batch_runner.py:460 — [trajectory_export] trajectory_entry = {
  • batch_runner.py:474 — [trajectory_export] f.write(json.dumps(trajectory_entry, ensure_ascii=False) + "\n")
  • benchmarks/tool_call_benchmark.py:3 — [benchmark_suite] Tool-Calling Benchmark — Gemma 4 vs mimo-v2-pro regression test.
  • benchmarks/tool_call_benchmark.py:9 — [benchmark_suite] python3 benchmarks/tool_call_benchmark.py # full 100-call suite
  • benchmarks/tool_call_benchmark.py:10 — [benchmark_suite] python3 benchmarks/tool_call_benchmark.py --limit 10 # quick smoke test
  • benchmarks/tool_call_benchmark.py:11 — [benchmark_suite] python3 benchmarks/tool_call_benchmark.py --models nous # single model
  • benchmarks/tool_call_benchmark.py:12 — [benchmark_suite] python3 benchmarks/tool_call_benchmark.py --category file # single category
  • benchmarks/tool_call_benchmark.py:37 — [benchmark_suite] class ToolCall:
  • benchmarks/tool_call_benchmark.py:51 — [benchmark_suite] ToolCall("file-01", "file", "Read the file /tmp/test_bench.txt and show me its contents.",
  • benchmarks/tool_call_benchmark.py:53 — [benchmark_suite] ToolCall("file-02", "file", "Write 'hello benchmark' to /tmp/test_bench_out.txt",
  • benchmarks/tool_call_benchmark.py:55 — [benchmark_suite] ToolCall("file-03", "file", "Search for the word 'import' in all Python files in the current directory.",
  • benchmarks/tool_call_benchmark.py:57 — [benchmark_suite] ToolCall("file-04", "file", "Read lines 1-20 of /etc/hosts",
  • benchmarks/tool_call_benchmark.py:59 — [benchmark_suite] ToolCall("file-05", "file", "Patch /tmp/test_bench_out.txt: replace 'hello' with 'goodbye'",
  • benchmarks/tool_call_benchmark.py:61 — [benchmark_suite] ToolCall("file-06", "file", "Search for files matching *.py in the current directory.",
  • benchmarks/tool_call_benchmark.py:63 — [benchmark_suite] ToolCall("file-07", "file", "Read the first 10 lines of /etc/passwd",
  • benchmarks/tool_call_benchmark.py:65 — [benchmark_suite] ToolCall("file-08", "file", "Write a JSON config to /tmp/bench_config.json with key 'debug': true",
  • benchmarks/tool_call_benchmark.py:67 — [benchmark_suite] ToolCall("file-09", "file", "Search for 'def test_' in Python test files.",
  • benchmarks/tool_call_benchmark.py:69 — [benchmark_suite] ToolCall("file-10", "file", "Read /tmp/bench_config.json and tell me what's in it.",
  • benchmarks/tool_call_benchmark.py:71 — [benchmark_suite] ToolCall("file-11", "file", "Create a file /tmp/bench_readme.md with one line: '# Benchmark'",
  • benchmarks/tool_call_benchmark.py:73 — [benchmark_suite] ToolCall("file-12", "file", "Search for 'TODO' comments in all .py files.",
  • benchmarks/tool_call_benchmark.py:75 — [benchmark_suite] ToolCall("file-13", "file", "Read /tmp/bench_readme.md",
  • benchmarks/tool_call_benchmark.py:77 — [benchmark_suite] ToolCall("file-14", "file", "Patch /tmp/bench_readme.md: replace '# Benchmark' with '# Tool Benchmark'",
  • benchmarks/tool_call_benchmark.py:78 — [benchmark_suite] "patch", "Tool Benchmark"),
  • benchmarks/tool_call_benchmark.py:79 — [benchmark_suite] ToolCall("file-15", "file", "Write a Python one-liner to /tmp/bench_hello.py that prints hello.",
  • benchmarks/tool_call_benchmark.py:81 — [benchmark_suite] ToolCall("file-16", "file", "Search for all .json files in /tmp/.",
  • benchmarks/tool_call_benchmark.py:83 — [benchmark_suite] ToolCall("file-17", "file", "Read /tmp/bench_hello.py and verify it has print('hello').",
  • benchmarks/tool_call_benchmark.py:85 — [benchmark_suite] ToolCall("file-18", "file", "Patch /tmp/bench_hello.py to print 'hello world' instead of 'hello'.",
  • benchmarks/tool_call_benchmark.py:87 — [benchmark_suite] ToolCall("file-19", "file", "List files matching 'bench*' in /tmp/.",
  • benchmarks/tool_call_benchmark.py:89 — [benchmark_suite] ToolCall("file-20", "file", "Read /tmp/test_bench.txt again and summarize its contents.",
  • benchmarks/tool_call_benchmark.py:93 — [benchmark_suite] ToolCall("term-01", "terminal", "Run echo hello world in the terminal.",
  • benchmarks/tool_call_benchmark.py:95 — [benchmark_suite] ToolCall("term-02", "terminal", "Run date to get the current date and time.",
  • benchmarks/tool_call_benchmark.py:97 — [benchmark_suite] ToolCall("term-03", "terminal", "Run uname -a to get system information.",
  • benchmarks/tool_call_benchmark.py:99 — [benchmark_suite] ToolCall("term-04", "terminal", "Run pwd to show the current directory.",
  • benchmarks/tool_call_benchmark.py:101 — [benchmark_suite] ToolCall("term-05", "terminal", "Run ls -la /tmp/ | head -20 to list temp files.",
  • benchmarks/tool_call_benchmark.py:103 — [benchmark_suite] ToolCall("term-06", "terminal", "Run whoami to show the current user.",
  • benchmarks/tool_call_benchmark.py:105 — [benchmark_suite] ToolCall("term-07", "terminal", "Run df -h to show disk usage.",
  • benchmarks/tool_call_benchmark.py:107 — [benchmark_suite] ToolCall("term-08", "terminal", "Run python3 --version to check Python version.",
  • benchmarks/tool_call_benchmark.py:109 — [benchmark_suite] ToolCall("term-09", "terminal", "Run cat /etc/hostname to get the hostname.",
  • benchmarks/tool_call_benchmark.py:111 — [benchmark_suite] ToolCall("term-10", "terminal", "Run uptime to see system uptime.",
  • benchmarks/tool_call_benchmark.py:113 — [benchmark_suite] ToolCall("term-11", "terminal", "Run env | grep PATH to show the PATH variable.",
  • benchmarks/tool_call_benchmark.py:115 — [benchmark_suite] ToolCall("term-12", "terminal", "Run wc -l /etc/passwd to count lines.",
  • benchmarks/tool_call_benchmark.py:117 — [benchmark_suite] ToolCall("term-13", "terminal", "Run echo $SHELL to show the current shell.",
  • benchmarks/tool_call_benchmark.py:119 — [benchmark_suite] ToolCall("term-14", "terminal", "Run free -h || vm_stat to check memory usage.",
  • benchmarks/tool_call_benchmark.py:121 — [benchmark_suite] ToolCall("term-15", "terminal", "Run id to show user and group IDs.",
  • benchmarks/tool_call_benchmark.py:123 — [benchmark_suite] ToolCall("term-16", "terminal", "Run hostname to get the machine hostname.",
  • benchmarks/tool_call_benchmark.py:125 — [benchmark_suite] ToolCall("term-17", "terminal", "Run echo {1..5} to test brace expansion.",
  • benchmarks/tool_call_benchmark.py:127 — [benchmark_suite] ToolCall("term-18", "terminal", "Run seq 1 5 to generate a number sequence.",
  • benchmarks/tool_call_benchmark.py:129 — [benchmark_suite] ToolCall("term-19", "terminal", "Run python3 -c 'print(2+2)' to compute 2+2.",
  • benchmarks/tool_call_benchmark.py:131 — [benchmark_suite] ToolCall("term-20", "terminal", "Run ls -d /tmp/bench* 2>/dev/null | wc -l to count bench files.",
  • benchmarks/tool_call_benchmark.py:135 — [benchmark_suite] ToolCall("code-01", "code", "Execute a Python script that computes factorial of 10.",
  • benchmarks/tool_call_benchmark.py:137 — [benchmark_suite] ToolCall("code-02", "code", "Run Python to read /tmp/test_bench.txt and count its words.",
  • benchmarks/tool_call_benchmark.py:139 — [benchmark_suite] ToolCall("code-03", "code", "Execute Python to generate the first 20 Fibonacci numbers.",
  • benchmarks/tool_call_benchmark.py:141 — [benchmark_suite] ToolCall("code-04", "code", "Run Python to parse JSON from a string and print keys.",
  • benchmarks/tool_call_benchmark.py:143 — [benchmark_suite] ToolCall("code-05", "code", "Execute Python to list all files in /tmp/ matching 'bench*'.",
  • benchmarks/tool_call_benchmark.py:145 — [benchmark_suite] ToolCall("code-06", "code", "Run Python to compute the sum of squares from 1 to 100.",
  • benchmarks/tool_call_benchmark.py:147 — [benchmark_suite] ToolCall("code-07", "code", "Execute Python to check if 'racecar' is a palindrome.",
  • benchmarks/tool_call_benchmark.py:149 — [benchmark_suite] ToolCall("code-08", "code", "Run Python to create a CSV string with 5 rows of sample data.",
  • benchmarks/tool_call_benchmark.py:151 — [benchmark_suite] ToolCall("code-09", "code", "Execute Python to sort a list [5,2,8,1,9] and print the result.",
  • benchmarks/tool_call_benchmark.py:153 — [benchmark_suite] ToolCall("code-10", "code", "Run Python to count lines in /etc/passwd.",
  • benchmarks/tool_call_benchmark.py:155 — [benchmark_suite] ToolCall("code-11", "code", "Execute Python to hash the string 'benchmark' with SHA256.",
  • benchmarks/tool_call_benchmark.py:157 — [benchmark_suite] ToolCall("code-12", "code", "Run Python to get the current UTC timestamp.",
  • benchmarks/tool_call_benchmark.py:159 — [benchmark_suite] ToolCall("code-13", "code", "Execute Python to convert 'hello world' to uppercase and reverse it.",
  • benchmarks/tool_call_benchmark.py:161 — [benchmark_suite] ToolCall("code-14", "code", "Run Python to create a dictionary of system info (platform, python version).",
  • benchmarks/tool_call_benchmark.py:163 — [benchmark_suite] ToolCall("code-15", "code", "Execute Python to check internet connectivity by resolving google.com.",
  • benchmarks/tool_call_benchmark.py:167 — [benchmark_suite] ToolCall("deleg-01", "delegate", "Use a subagent to find all .log files in /tmp/.",
  • benchmarks/tool_call_benchmark.py:169 — [benchmark_suite] ToolCall("deleg-02", "delegate", "Delegate to a subagent: what is 15 * 37?",
  • benchmarks/tool_call_benchmark.py:171 — [benchmark_suite] ToolCall("deleg-03", "delegate", "Use a subagent to check if Python 3 is installed and its version.",
  • benchmarks/tool_call_benchmark.py:173 — [benchmark_suite] ToolCall("deleg-04", "delegate", "Delegate: read /tmp/test_bench.txt and summarize it in one sentence.",
  • benchmarks/tool_call_benchmark.py:175 — [benchmark_suite] ToolCall("deleg-05", "delegate", "Use a subagent to list the contents of /tmp/ directory.",
  • benchmarks/tool_call_benchmark.py:177 — [benchmark_suite] ToolCall("deleg-06", "delegate", "Delegate: count the number of .py files in the current directory.",
  • benchmarks/tool_call_benchmark.py:179 — [benchmark_suite] ToolCall("deleg-07", "delegate", "Use a subagent to check disk space with df -h.",
  • benchmarks/tool_call_benchmark.py:181 — [benchmark_suite] ToolCall("deleg-08", "delegate", "Delegate: what OS are we running on?",
  • benchmarks/tool_call_benchmark.py:183 — [benchmark_suite] ToolCall("deleg-09", "delegate", "Use a subagent to find the hostname of this machine.",
  • benchmarks/tool_call_benchmark.py:185 — [benchmark_suite] ToolCall("deleg-10", "delegate", "Delegate: create a temp file /tmp/bench_deleg.txt with 'done'.",
  • benchmarks/tool_call_benchmark.py:189 — [benchmark_suite] ToolCall("todo-01", "todo", "Add a todo item: 'Run benchmark suite'",
  • benchmarks/tool_call_benchmark.py:190 — [benchmark_suite] "todo", "benchmark"),
  • benchmarks/tool_call_benchmark.py:191 — [benchmark_suite] ToolCall("todo-02", "todo", "Show me the current todo list.",
  • benchmarks/tool_call_benchmark.py:193 — [benchmark_suite] ToolCall("todo-03", "todo", "Mark the first todo item as completed.",
  • benchmarks/tool_call_benchmark.py:195 — [benchmark_suite] ToolCall("todo-04", "todo", "Add a todo: 'Review benchmark results' with status pending.",
  • benchmarks/tool_call_benchmark.py:197 — [benchmark_suite] ToolCall("todo-05", "todo", "Clear all completed todos.",
  • benchmarks/tool_call_benchmark.py:199 — [benchmark_suite] ToolCall("todo-06", "memory", "Save this to memory: 'benchmark ran on {date}'".format(
  • benchmarks/tool_call_benchmark.py:201 — [benchmark_suite] "memory", "benchmark"),
  • benchmarks/tool_call_benchmark.py:202 — [benchmark_suite] ToolCall("todo-07", "memory", "Search memory for 'benchmark'.",
  • benchmarks/tool_call_benchmark.py:203 — [benchmark_suite] "memory", "benchmark"),
  • benchmarks/tool_call_benchmark.py:204 — [benchmark_suite] ToolCall("todo-08", "memory", "Add a memory note: 'test models are gemma-4 and mimo-v2-pro'.",
  • benchmarks/tool_call_benchmark.py:206 — [benchmark_suite] ToolCall("todo-09", "todo", "Add three todo items: 'analyze', 'report', 'cleanup'.",
  • benchmarks/tool_call_benchmark.py:208 — [benchmark_suite] ToolCall("todo-10", "memory", "Search memory for any notes about models.",
  • benchmarks/tool_call_benchmark.py:212 — [benchmark_suite] ToolCall("skill-01", "skills", "List all available skills.",
  • benchmarks/tool_call_benchmark.py:214 — [benchmark_suite] ToolCall("skill-02", "skills", "View the skill called 'test-driven-development'.",
  • benchmarks/tool_call_benchmark.py:216 — [benchmark_suite] ToolCall("skill-03", "skills", "Search for skills related to 'git'.",
  • benchmarks/tool_call_benchmark.py:218 — [benchmark_suite] ToolCall("skill-04", "skills", "View the 'code-review' skill.",
  • benchmarks/tool_call_benchmark.py:220 — [benchmark_suite] ToolCall("skill-05", "skills", "List all skills in the 'devops' category.",
  • benchmarks/tool_call_benchmark.py:222 — [benchmark_suite] ToolCall("skill-06", "skills", "View the 'systematic-debugging' skill.",
  • benchmarks/tool_call_benchmark.py:224 — [benchmark_suite] ToolCall("skill-07", "skills", "Search for skills about 'testing'.",
  • benchmarks/tool_call_benchmark.py:226 — [benchmark_suite] ToolCall("skill-08", "skills", "View the 'writing-plans' skill.",
  • benchmarks/tool_call_benchmark.py:228 — [benchmark_suite] ToolCall("skill-09", "skills", "List skills in 'software-development' category.",
  • benchmarks/tool_call_benchmark.py:230 — [benchmark_suite] ToolCall("skill-10", "skills", "View the 'pr-review-discipline' skill.",
  • benchmarks/tool_call_benchmark.py:234 — [benchmark_suite] ToolCall("file-21", "file", "Write a Python snippet to /tmp/bench_sort.py that sorts [3,1,2].",
  • benchmarks/tool_call_benchmark.py:236 — [benchmark_suite] ToolCall("file-22", "file", "Read /tmp/bench_sort.py back and confirm it exists.",
  • benchmarks/tool_call_benchmark.py:238 — [benchmark_suite] ToolCall("file-23", "file", "Search for 'class' in all .py files in the benchmarks directory.",
  • benchmarks/tool_call_benchmark.py:240 — [benchmark_suite] ToolCall("term-21", "terminal", "Run cat /etc/os-release 2>/dev/null || sw_vers 2>/dev/null for OS info.",
  • benchmarks/tool_call_benchmark.py:242 — [benchmark_suite] ToolCall("term-22", "terminal", "Run nproc 2>/dev/null || sysctl -n hw.ncpu 2>/dev/null for CPU count.",
  • benchmarks/tool_call_benchmark.py:244 — [benchmark_suite] ToolCall("code-16", "code", "Execute Python to flatten a nested list [[1,2],[3,4],[5]].",
  • benchmarks/tool_call_benchmark.py:246 — [benchmark_suite] ToolCall("code-17", "code", "Run Python to check if a number 17 is prime.",
  • benchmarks/tool_call_benchmark.py:248 — [benchmark_suite] ToolCall("deleg-11", "delegate", "Delegate: what is the current working directory?",
  • benchmarks/tool_call_benchmark.py:250 — [benchmark_suite] ToolCall("todo-11", "todo", "Add a todo: 'Finalize benchmark report' status pending.",
  • benchmarks/tool_call_benchmark.py:252 — [benchmark_suite] ToolCall("todo-12", "memory", "Store fact: 'benchmark categories: file, terminal, code, delegate, todo, memory, skills'.",
  • benchmarks/tool_call_benchmark.py:254 — [benchmark_suite] ToolCall("skill-11", "skills", "Search for skills about 'deployment'.",
  • benchmarks/tool_call_benchmark.py:256 — [benchmark_suite] ToolCall("skill-12", "skills", "View the 'gitea-burn-cycle' skill.",
  • benchmarks/tool_call_benchmark.py:258 — [benchmark_suite] ToolCall("skill-13", "skills", "List all available skill categories.",
  • benchmarks/tool_call_benchmark.py:260 — [benchmark_suite] ToolCall("skill-14", "skills", "Search for skills related to 'memory'.",
  • benchmarks/tool_call_benchmark.py:262 — [benchmark_suite] ToolCall("skill-15", "skills", "View the 'mimo-swarm' skill.",
  • benchmarks/tool_call_benchmark.py:311 — [benchmark_suite] """Create prerequisite files for the benchmark."""
  • benchmarks/tool_call_benchmark.py:313 — [benchmark_suite] "This is a benchmark test file.\n"
  • benchmarks/tool_call_benchmark.py:349 — [benchmark_suite] "You are a benchmark test runner. Execute the user's request by calling "
  • benchmarks/tool_call_benchmark.py:406 — [benchmark_suite] """Generate markdown benchmark report."""
  • benchmarks/tool_call_benchmark.py:428 — [benchmark_suite] f"# Tool-Calling Benchmark Report",
  • benchmarks/tool_call_benchmark.py:535 — [benchmark_suite] parser = argparse.ArgumentParser(description="Tool-calling benchmark")
  • benchmarks/tool_call_benchmark.py:544 — [benchmark_suite] help="Output report path (default: benchmarks/gemma4-tool-calling-YYYY-MM-DD.md)")
  • benchmarks/tool_call_benchmark.py:565 — [benchmark_suite] output_path = Path(args.output) if args.output else REPO_ROOT / "benchmarks" / f"gemma4-tool-calling-{date_str}.md"
  • benchmarks/tool_call_benchmark.py:575 — [benchmark_suite] print(f"Benchmark: {len(suite)} tests × {len(model_specs)} models = {len(suite) * len(model_specs)} calls")
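
The `smart_model_routing` entries above pin down the per-turn contract any replacement would have to honor: given the user message, the routing config, and a primary `{provider, model}` route, return the route for this turn. Below is a shape-only sketch of that contract; the cheap-route condition is a hypothetical placeholder, since the real heuristics live in agent/smart_model_routing.py and are not reproduced here.

```python
from typing import Any, Dict, Optional


def resolve_turn_route_sketch(
    user_message: str,
    routing_config: Optional[Dict[str, Any]],
    primary: Dict[str, Any],
) -> Dict[str, Any]:
    """Shape-only stand-in for agent/smart_model_routing.resolve_turn_route.

    The condition below is a made-up placeholder for choose_cheap_model_route;
    only the input/output contract matters for the migration argument.
    """
    if routing_config:
        cheap = routing_config.get("cheap_route")  # e.g. {"provider": ..., "model": ...}
        if cheap and len(user_message) <= routing_config.get("cheap_max_chars", 280):
            return cheap
    # No cheap route applies: fall through to the primary provider/model.
    return primary
```

A TensorZero variant configuration would need an equivalent per-turn selection hook before this seam could move out of Hermes.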

## Suggested next slice

  1. Build an exporter that emits SessionDB + trajectory data into a TensorZero-friendly offline dataset (a minimal sketch follows this list).
  2. Define percentage-based canary controls before attempting any gateway replacement.
  3. Keep Hermes routing authoritative until TensorZero proves parity across CLI, gateway, cron, auxiliary, and delegation surfaces.
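
For slice 1, a minimal exporter sketch. It assumes the JSONL layout written by batch_runner.py (one `json.dumps(trajectory_entry)` per line); the `episode_id`/`messages`/`feedback` field names are illustrative mappings, not batch_runner's actual schema, and the TensorZero-side ingestion format would need to be confirmed against its documentation.

```python
# Re-shape Hermes trajectory JSONL into a generic offline dataset that an
# offline TensorZero experiment loop could consume. All output field names
# here are assumptions for illustration.
import json
from pathlib import Path
from typing import Any, Dict, Iterator


def export_offline_dataset(trajectory_path: Path) -> Iterator[Dict[str, Any]]:
    with trajectory_path.open(encoding="utf-8") as f:
        for line in f:
            entry = json.loads(line)
            yield {
                "episode_id": entry.get("session_id"),
                "messages": entry.get("messages", []),
                "feedback": entry.get("reward"),
            }


# Usage: write the re-shaped records back out as JSONL.
# for rec in export_offline_dataset(Path("trajectories.jsonl")):
#     print(json.dumps(rec, ensure_ascii=False))
```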