Compare commits
1 commit

| Author | SHA1 | Date |
| --- | --- | --- |
|  | 755e7513a1 |  |
docs/evaluations/tensorzero-860-evaluation.json (new file, 1131 lines)
File diff suppressed because it is too large.
docs/evaluations/tensorzero-860-evaluation.md (new file, 217 lines)
@@ -0,0 +1,217 @@
# TensorZero Evaluation Packet

Issue #860: [tensorzero LLMOps platform evaluation](https://forge.alexanderwhitestone.com/Timmy_Foundation/hermes-agent/issues/860)

## Scope

This packet evaluates TensorZero as a possible replacement for Hermes' custom provider-routing stack. It is intentionally grounded in the current repo state rather than a speculative cutover plan.

## Issue requirements being evaluated

- Deploy tensorzero gateway (Rust binary)
- Migrate provider routing config
- Test with canary (10% traffic) before full cutover
- Feed session data for prompt optimization
- Evaluation suite for A/B testing models

## Recommendation

Not ready for direct replacement. Recommend a shadow-evaluation phase first: keep Hermes routing live, inventory the migration seams, export SessionDB/trajectory data into an offline TensorZero experiment loop, and only design a canary gateway once percentage-based rollout controls exist.

## Requirement matrix

| Requirement | Status | Evidence labels | Summary |
| --- | --- | --- | --- |
| Gateway replacement scope | partial | fallback_chain, runtime_provider, gateway_provider_routing, cron_runtime_provider, auxiliary_fallback_chain, delegate_runtime_provider | Hermes already spreads provider routing across core agent, runtime provider, gateway, cron, auxiliary, and delegation seams; TensorZero would need parity across all of them before it can replace the gateway layer. |
| Config migration | partial | provider_routing_config, runtime_provider, smart_model_routing, fallback_chain | Hermes has multiple config concepts to migrate (`provider_routing`, `fallback_providers`, `smart_model_routing`, runtime provider resolution), so TensorZero is not a drop-in config swap. |
| 10% traffic canary | gap | — | The repo shows semantic routing and fallback, but no grounded 10% traffic-split canary mechanism. A TensorZero cutover would need new percentage-based rollout controls and observability hooks (see the config sketch after this table). |
| Session data for prompt optimization | partial | session_db, trajectory_export | Hermes already has SessionDB and trajectory export surfaces that can feed offline optimization data, but not a TensorZero-native ingestion path yet. |
| Evaluation suite / A/B testing | partial | benchmark_suite, trajectory_export | Hermes already has benchmark/trajectory machinery that can seed TensorZero A/B evaluation, but no integrated TensorZero experiment runner or live evaluation gateway. |
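For scale, here is a hedged sketch of how the two config-side gaps above could map onto TensorZero, assuming its documented `tensorzero.toml` layout. Every model, provider, and function name below is illustrative rather than taken from the Hermes config: the `routing` list is the natural home for a migrated fallback chain, and variant `weight`s are the percentage-based rollout control the canary row flags as missing.

```toml
# Illustrative sketch only; names are hypothetical, not from the Hermes repo.

# Hermes' ordered fallback_providers list maps onto a model routing list:
[models.primary_model]
routing = ["primary_provider", "fallback_provider"]

[models.primary_model.providers.primary_provider]
type = "openai"
model_name = "primary-model-id"

[models.primary_model.providers.fallback_provider]
type = "openai"
model_name = "fallback-model-id"

# The missing 10% canary maps onto weighted function variants:
[functions.hermes_turn]
type = "chat"

[functions.hermes_turn.variants.current]
type = "chat_completion"
model = "primary_model"
weight = 0.9

[functions.hermes_turn.variants.canary]
type = "chat_completion"
model = "primary_model"   # a candidate model would go here in a real canary
weight = 0.1
```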
## Grounded Hermes touchpoints

The `fallback_chain` excerpts below are condensed into a single sketch after this list.

- `run_agent.py:601` — [fallback_chain] fallback_model: Dict[str, Any] = None,
- `run_agent.py:995` — [fallback_chain] # failure). Supports both legacy single-dict ``fallback_model`` and
- `run_agent.py:996` — [fallback_chain] # new list ``fallback_providers`` format.
- `run_agent.py:997` — [fallback_chain] if isinstance(fallback_model, list):
- `run_agent.py:998` — [fallback_chain] self._fallback_chain = [
- `run_agent.py:999` — [fallback_chain] f for f in fallback_model
- `run_agent.py:1002` — [fallback_chain] elif isinstance(fallback_model, dict) and fallback_model.get("provider") and fallback_model.get("model"):
- `run_agent.py:1003` — [fallback_chain] self._fallback_chain = [fallback_model]
- `run_agent.py:1005` — [fallback_chain] self._fallback_chain = []
- `run_agent.py:1009` — [fallback_chain] self._fallback_model = self._fallback_chain[0] if self._fallback_chain else None
- `run_agent.py:1010` — [fallback_chain] if self._fallback_chain and not self.quiet_mode:
- `run_agent.py:1011` — [fallback_chain] if len(self._fallback_chain) == 1:
- `run_agent.py:1012` — [fallback_chain] fb = self._fallback_chain[0]
- `run_agent.py:1015` — [fallback_chain] print(f"🔄 Fallback chain ({len(self._fallback_chain)} providers): " +
- `run_agent.py:1016` — [fallback_chain] " → ".join(f"{f['model']} ({f['provider']})" for f in self._fallback_chain))
- `run_agent.py:5624` — [fallback_chain] if self._fallback_index >= len(self._fallback_chain):
- `run_agent.py:5627` — [fallback_chain] fb = self._fallback_chain[self._fallback_index]
- `run_agent.py:8559` — [fallback_chain] if self._fallback_index < len(self._fallback_chain):
- `run_agent.py:9355` — [fallback_chain] if is_rate_limited and self._fallback_index < len(self._fallback_chain):
- `run_agent.py:10460` — [fallback_chain] if _truly_empty and self._fallback_chain:
- `run_agent.py:10514` — [fallback_chain] + (" and fallback attempts." if self._fallback_chain else
- `cli.py:241` — [provider_routing_config] "smart_model_routing": {
- `cli.py:370` — [provider_routing_config] # (e.g. platform_toolsets, provider_routing, memory, honcho, etc.)
- `cli.py:1753` — [provider_routing_config] pr = CLI_CONFIG.get("provider_routing", {}) or {}
- `cli.py:1762` — [provider_routing_config] # Supports new list format (fallback_providers) and legacy single-dict (fallback_model).
- `cli.py:1763` — [provider_routing_config] fb = CLI_CONFIG.get("fallback_providers") or CLI_CONFIG.get("fallback_model") or []
- `cli.py:1770` — [provider_routing_config] self._smart_model_routing = CLI_CONFIG.get("smart_model_routing", {}) or {}
- `cli.py:2771` — [provider_routing_config] from agent.smart_model_routing import resolve_turn_route
- `cli.py:2776` — [provider_routing_config] self._smart_model_routing,
- `hermes_cli/runtime_provider.py:209` — [runtime_provider] def resolve_requested_provider(requested: Optional[str] = None) -> str:
- `hermes_cli/runtime_provider.py:649` — [runtime_provider] def resolve_runtime_provider(
- `agent/smart_model_routing.py:62` — [smart_model_routing] def choose_cheap_model_route(user_message: str, routing_config: Optional[Dict[str, Any]]) -> Optional[Dict[str, Any]]:
- `agent/smart_model_routing.py:110` — [smart_model_routing] def resolve_turn_route(user_message: str, routing_config: Optional[Dict[str, Any]], primary: Dict[str, Any]) -> Dict[str, Any]:
- `gateway/run.py:1271` — [gateway_provider_routing] def _load_provider_routing() -> dict:
- `gateway/run.py:1285` — [gateway_provider_routing] def _load_fallback_model() -> list | dict | None:
- `gateway/run.py:1306` — [gateway_provider_routing] def _load_smart_model_routing() -> dict:
- `cron/scheduler.py:684` — [cron_runtime_provider] pr = _cfg.get("provider_routing", {})
- `cron/scheduler.py:688` — [cron_runtime_provider] resolve_runtime_provider,
- `cron/scheduler.py:697` — [cron_runtime_provider] runtime = resolve_runtime_provider(**runtime_kwargs)
- `cron/scheduler.py:702` — [cron_runtime_provider] from agent.smart_model_routing import resolve_turn_route
- `cron/scheduler.py:703` — [cron_runtime_provider] turn_route = resolve_turn_route(
- `cron/scheduler.py:717` — [cron_runtime_provider] fallback_model = _cfg.get("fallback_providers") or _cfg.get("fallback_model") or None
- `cron/scheduler.py:746` — [cron_runtime_provider] fallback_model=fallback_model,
- `agent/auxiliary_client.py:1018` — [auxiliary_fallback_chain] def _get_provider_chain() -> List[tuple]:
- `agent/auxiliary_client.py:1107` — [auxiliary_fallback_chain] for label, try_fn in _get_provider_chain():
- `agent/auxiliary_client.py:1189` — [auxiliary_fallback_chain] # ── Step 2: aggregator / fallback chain ──────────────────────────────
- `agent/auxiliary_client.py:1191` — [auxiliary_fallback_chain] for label, try_fn in _get_provider_chain():
- `agent/auxiliary_client.py:2397` — [auxiliary_fallback_chain] # error, fall through to the fallback chain below.
- `agent/auxiliary_client.py:2417` — [auxiliary_fallback_chain] # auto (the default) = best-effort fallback chain. (#7559)
- `agent/auxiliary_client.py:2589` — [auxiliary_fallback_chain] # error, fall through to the fallback chain below.
- `tools/delegate_tool.py:662` — [delegate_runtime_provider] # bundle (base_url, api_key, api_mode) via the same runtime provider system
- `tools/delegate_tool.py:854` — [delegate_runtime_provider] provider) is resolved via the runtime provider system — the same path used
- `tools/delegate_tool.py:909` — [delegate_runtime_provider] from hermes_cli.runtime_provider import resolve_runtime_provider
- `tools/delegate_tool.py:910` — [delegate_runtime_provider] runtime = resolve_runtime_provider(requested=configured_provider)
- `hermes_state.py:115` — [session_db] class SessionDB:
- `batch_runner.py:320` — [trajectory_export] save_trajectories=False, # We handle saving ourselves
- `batch_runner.py:346` — [trajectory_export] trajectory = agent._convert_to_trajectory_format(
- `batch_runner.py:460` — [trajectory_export] trajectory_entry = {
- `batch_runner.py:474` — [trajectory_export] f.write(json.dumps(trajectory_entry, ensure_ascii=False) + "\n")
- `benchmarks/tool_call_benchmark.py:3` — [benchmark_suite] Tool-Calling Benchmark — Gemma 4 vs mimo-v2-pro regression test.
- `benchmarks/tool_call_benchmark.py:9` — [benchmark_suite] python3 benchmarks/tool_call_benchmark.py # full 100-call suite
- `benchmarks/tool_call_benchmark.py:10` — [benchmark_suite] python3 benchmarks/tool_call_benchmark.py --limit 10 # quick smoke test
- `benchmarks/tool_call_benchmark.py:11` — [benchmark_suite] python3 benchmarks/tool_call_benchmark.py --models nous # single model
- `benchmarks/tool_call_benchmark.py:12` — [benchmark_suite] python3 benchmarks/tool_call_benchmark.py --category file # single category
- `benchmarks/tool_call_benchmark.py:37` — [benchmark_suite] class ToolCall:
- `benchmarks/tool_call_benchmark.py:51` — [benchmark_suite] ToolCall("file-01", "file", "Read the file /tmp/test_bench.txt and show me its contents.",
- `benchmarks/tool_call_benchmark.py:53` — [benchmark_suite] ToolCall("file-02", "file", "Write 'hello benchmark' to /tmp/test_bench_out.txt",
- `benchmarks/tool_call_benchmark.py:55` — [benchmark_suite] ToolCall("file-03", "file", "Search for the word 'import' in all Python files in the current directory.",
- `benchmarks/tool_call_benchmark.py:57` — [benchmark_suite] ToolCall("file-04", "file", "Read lines 1-20 of /etc/hosts",
- `benchmarks/tool_call_benchmark.py:59` — [benchmark_suite] ToolCall("file-05", "file", "Patch /tmp/test_bench_out.txt: replace 'hello' with 'goodbye'",
- `benchmarks/tool_call_benchmark.py:61` — [benchmark_suite] ToolCall("file-06", "file", "Search for files matching *.py in the current directory.",
- `benchmarks/tool_call_benchmark.py:63` — [benchmark_suite] ToolCall("file-07", "file", "Read the first 10 lines of /etc/passwd",
- `benchmarks/tool_call_benchmark.py:65` — [benchmark_suite] ToolCall("file-08", "file", "Write a JSON config to /tmp/bench_config.json with key 'debug': true",
- `benchmarks/tool_call_benchmark.py:67` — [benchmark_suite] ToolCall("file-09", "file", "Search for 'def test_' in Python test files.",
- `benchmarks/tool_call_benchmark.py:69` — [benchmark_suite] ToolCall("file-10", "file", "Read /tmp/bench_config.json and tell me what's in it.",
- `benchmarks/tool_call_benchmark.py:71` — [benchmark_suite] ToolCall("file-11", "file", "Create a file /tmp/bench_readme.md with one line: '# Benchmark'",
- `benchmarks/tool_call_benchmark.py:73` — [benchmark_suite] ToolCall("file-12", "file", "Search for 'TODO' comments in all .py files.",
- `benchmarks/tool_call_benchmark.py:75` — [benchmark_suite] ToolCall("file-13", "file", "Read /tmp/bench_readme.md",
- `benchmarks/tool_call_benchmark.py:77` — [benchmark_suite] ToolCall("file-14", "file", "Patch /tmp/bench_readme.md: replace '# Benchmark' with '# Tool Benchmark'",
- `benchmarks/tool_call_benchmark.py:78` — [benchmark_suite] "patch", "Tool Benchmark"),
- `benchmarks/tool_call_benchmark.py:79` — [benchmark_suite] ToolCall("file-15", "file", "Write a Python one-liner to /tmp/bench_hello.py that prints hello.",
- `benchmarks/tool_call_benchmark.py:81` — [benchmark_suite] ToolCall("file-16", "file", "Search for all .json files in /tmp/.",
- `benchmarks/tool_call_benchmark.py:83` — [benchmark_suite] ToolCall("file-17", "file", "Read /tmp/bench_hello.py and verify it has print('hello').",
- `benchmarks/tool_call_benchmark.py:85` — [benchmark_suite] ToolCall("file-18", "file", "Patch /tmp/bench_hello.py to print 'hello world' instead of 'hello'.",
- `benchmarks/tool_call_benchmark.py:87` — [benchmark_suite] ToolCall("file-19", "file", "List files matching 'bench*' in /tmp/.",
- `benchmarks/tool_call_benchmark.py:89` — [benchmark_suite] ToolCall("file-20", "file", "Read /tmp/test_bench.txt again and summarize its contents.",
- `benchmarks/tool_call_benchmark.py:93` — [benchmark_suite] ToolCall("term-01", "terminal", "Run `echo hello world` in the terminal.",
- `benchmarks/tool_call_benchmark.py:95` — [benchmark_suite] ToolCall("term-02", "terminal", "Run `date` to get the current date and time.",
- `benchmarks/tool_call_benchmark.py:97` — [benchmark_suite] ToolCall("term-03", "terminal", "Run `uname -a` to get system information.",
- `benchmarks/tool_call_benchmark.py:99` — [benchmark_suite] ToolCall("term-04", "terminal", "Run `pwd` to show the current directory.",
- `benchmarks/tool_call_benchmark.py:101` — [benchmark_suite] ToolCall("term-05", "terminal", "Run `ls -la /tmp/ | head -20` to list temp files.",
- `benchmarks/tool_call_benchmark.py:103` — [benchmark_suite] ToolCall("term-06", "terminal", "Run `whoami` to show the current user.",
- `benchmarks/tool_call_benchmark.py:105` — [benchmark_suite] ToolCall("term-07", "terminal", "Run `df -h` to show disk usage.",
- `benchmarks/tool_call_benchmark.py:107` — [benchmark_suite] ToolCall("term-08", "terminal", "Run `python3 --version` to check Python version.",
- `benchmarks/tool_call_benchmark.py:109` — [benchmark_suite] ToolCall("term-09", "terminal", "Run `cat /etc/hostname` to get the hostname.",
- `benchmarks/tool_call_benchmark.py:111` — [benchmark_suite] ToolCall("term-10", "terminal", "Run `uptime` to see system uptime.",
- `benchmarks/tool_call_benchmark.py:113` — [benchmark_suite] ToolCall("term-11", "terminal", "Run `env | grep PATH` to show the PATH variable.",
- `benchmarks/tool_call_benchmark.py:115` — [benchmark_suite] ToolCall("term-12", "terminal", "Run `wc -l /etc/passwd` to count lines.",
- `benchmarks/tool_call_benchmark.py:117` — [benchmark_suite] ToolCall("term-13", "terminal", "Run `echo $SHELL` to show the current shell.",
- `benchmarks/tool_call_benchmark.py:119` — [benchmark_suite] ToolCall("term-14", "terminal", "Run `free -h || vm_stat` to check memory usage.",
- `benchmarks/tool_call_benchmark.py:121` — [benchmark_suite] ToolCall("term-15", "terminal", "Run `id` to show user and group IDs.",
- `benchmarks/tool_call_benchmark.py:123` — [benchmark_suite] ToolCall("term-16", "terminal", "Run `hostname` to get the machine hostname.",
- `benchmarks/tool_call_benchmark.py:125` — [benchmark_suite] ToolCall("term-17", "terminal", "Run `echo {1..5}` to test brace expansion.",
- `benchmarks/tool_call_benchmark.py:127` — [benchmark_suite] ToolCall("term-18", "terminal", "Run `seq 1 5` to generate a number sequence.",
- `benchmarks/tool_call_benchmark.py:129` — [benchmark_suite] ToolCall("term-19", "terminal", "Run `python3 -c 'print(2+2)'` to compute 2+2.",
- `benchmarks/tool_call_benchmark.py:131` — [benchmark_suite] ToolCall("term-20", "terminal", "Run `ls -d /tmp/bench* 2>/dev/null | wc -l` to count bench files.",
- `benchmarks/tool_call_benchmark.py:135` — [benchmark_suite] ToolCall("code-01", "code", "Execute a Python script that computes factorial of 10.",
- `benchmarks/tool_call_benchmark.py:137` — [benchmark_suite] ToolCall("code-02", "code", "Run Python to read /tmp/test_bench.txt and count its words.",
- `benchmarks/tool_call_benchmark.py:139` — [benchmark_suite] ToolCall("code-03", "code", "Execute Python to generate the first 20 Fibonacci numbers.",
- `benchmarks/tool_call_benchmark.py:141` — [benchmark_suite] ToolCall("code-04", "code", "Run Python to parse JSON from a string and print keys.",
- `benchmarks/tool_call_benchmark.py:143` — [benchmark_suite] ToolCall("code-05", "code", "Execute Python to list all files in /tmp/ matching 'bench*'.",
- `benchmarks/tool_call_benchmark.py:145` — [benchmark_suite] ToolCall("code-06", "code", "Run Python to compute the sum of squares from 1 to 100.",
- `benchmarks/tool_call_benchmark.py:147` — [benchmark_suite] ToolCall("code-07", "code", "Execute Python to check if 'racecar' is a palindrome.",
- `benchmarks/tool_call_benchmark.py:149` — [benchmark_suite] ToolCall("code-08", "code", "Run Python to create a CSV string with 5 rows of sample data.",
- `benchmarks/tool_call_benchmark.py:151` — [benchmark_suite] ToolCall("code-09", "code", "Execute Python to sort a list [5,2,8,1,9] and print the result.",
- `benchmarks/tool_call_benchmark.py:153` — [benchmark_suite] ToolCall("code-10", "code", "Run Python to count lines in /etc/passwd.",
- `benchmarks/tool_call_benchmark.py:155` — [benchmark_suite] ToolCall("code-11", "code", "Execute Python to hash the string 'benchmark' with SHA256.",
- `benchmarks/tool_call_benchmark.py:157` — [benchmark_suite] ToolCall("code-12", "code", "Run Python to get the current UTC timestamp.",
- `benchmarks/tool_call_benchmark.py:159` — [benchmark_suite] ToolCall("code-13", "code", "Execute Python to convert 'hello world' to uppercase and reverse it.",
- `benchmarks/tool_call_benchmark.py:161` — [benchmark_suite] ToolCall("code-14", "code", "Run Python to create a dictionary of system info (platform, python version).",
- `benchmarks/tool_call_benchmark.py:163` — [benchmark_suite] ToolCall("code-15", "code", "Execute Python to check internet connectivity by resolving google.com.",
- `benchmarks/tool_call_benchmark.py:167` — [benchmark_suite] ToolCall("deleg-01", "delegate", "Use a subagent to find all .log files in /tmp/.",
- `benchmarks/tool_call_benchmark.py:169` — [benchmark_suite] ToolCall("deleg-02", "delegate", "Delegate to a subagent: what is 15 * 37?",
- `benchmarks/tool_call_benchmark.py:171` — [benchmark_suite] ToolCall("deleg-03", "delegate", "Use a subagent to check if Python 3 is installed and its version.",
- `benchmarks/tool_call_benchmark.py:173` — [benchmark_suite] ToolCall("deleg-04", "delegate", "Delegate: read /tmp/test_bench.txt and summarize it in one sentence.",
- `benchmarks/tool_call_benchmark.py:175` — [benchmark_suite] ToolCall("deleg-05", "delegate", "Use a subagent to list the contents of /tmp/ directory.",
- `benchmarks/tool_call_benchmark.py:177` — [benchmark_suite] ToolCall("deleg-06", "delegate", "Delegate: count the number of .py files in the current directory.",
- `benchmarks/tool_call_benchmark.py:179` — [benchmark_suite] ToolCall("deleg-07", "delegate", "Use a subagent to check disk space with df -h.",
- `benchmarks/tool_call_benchmark.py:181` — [benchmark_suite] ToolCall("deleg-08", "delegate", "Delegate: what OS are we running on?",
- `benchmarks/tool_call_benchmark.py:183` — [benchmark_suite] ToolCall("deleg-09", "delegate", "Use a subagent to find the hostname of this machine.",
- `benchmarks/tool_call_benchmark.py:185` — [benchmark_suite] ToolCall("deleg-10", "delegate", "Delegate: create a temp file /tmp/bench_deleg.txt with 'done'.",
- `benchmarks/tool_call_benchmark.py:189` — [benchmark_suite] ToolCall("todo-01", "todo", "Add a todo item: 'Run benchmark suite'",
- `benchmarks/tool_call_benchmark.py:190` — [benchmark_suite] "todo", "benchmark"),
- `benchmarks/tool_call_benchmark.py:191` — [benchmark_suite] ToolCall("todo-02", "todo", "Show me the current todo list.",
- `benchmarks/tool_call_benchmark.py:193` — [benchmark_suite] ToolCall("todo-03", "todo", "Mark the first todo item as completed.",
- `benchmarks/tool_call_benchmark.py:195` — [benchmark_suite] ToolCall("todo-04", "todo", "Add a todo: 'Review benchmark results' with status pending.",
- `benchmarks/tool_call_benchmark.py:197` — [benchmark_suite] ToolCall("todo-05", "todo", "Clear all completed todos.",
- `benchmarks/tool_call_benchmark.py:199` — [benchmark_suite] ToolCall("todo-06", "memory", "Save this to memory: 'benchmark ran on {date}'".format(
- `benchmarks/tool_call_benchmark.py:201` — [benchmark_suite] "memory", "benchmark"),
- `benchmarks/tool_call_benchmark.py:202` — [benchmark_suite] ToolCall("todo-07", "memory", "Search memory for 'benchmark'.",
- `benchmarks/tool_call_benchmark.py:203` — [benchmark_suite] "memory", "benchmark"),
- `benchmarks/tool_call_benchmark.py:204` — [benchmark_suite] ToolCall("todo-08", "memory", "Add a memory note: 'test models are gemma-4 and mimo-v2-pro'.",
- `benchmarks/tool_call_benchmark.py:206` — [benchmark_suite] ToolCall("todo-09", "todo", "Add three todo items: 'analyze', 'report', 'cleanup'.",
- `benchmarks/tool_call_benchmark.py:208` — [benchmark_suite] ToolCall("todo-10", "memory", "Search memory for any notes about models.",
- `benchmarks/tool_call_benchmark.py:212` — [benchmark_suite] ToolCall("skill-01", "skills", "List all available skills.",
- `benchmarks/tool_call_benchmark.py:214` — [benchmark_suite] ToolCall("skill-02", "skills", "View the skill called 'test-driven-development'.",
- `benchmarks/tool_call_benchmark.py:216` — [benchmark_suite] ToolCall("skill-03", "skills", "Search for skills related to 'git'.",
- `benchmarks/tool_call_benchmark.py:218` — [benchmark_suite] ToolCall("skill-04", "skills", "View the 'code-review' skill.",
- `benchmarks/tool_call_benchmark.py:220` — [benchmark_suite] ToolCall("skill-05", "skills", "List all skills in the 'devops' category.",
- `benchmarks/tool_call_benchmark.py:222` — [benchmark_suite] ToolCall("skill-06", "skills", "View the 'systematic-debugging' skill.",
- `benchmarks/tool_call_benchmark.py:224` — [benchmark_suite] ToolCall("skill-07", "skills", "Search for skills about 'testing'.",
- `benchmarks/tool_call_benchmark.py:226` — [benchmark_suite] ToolCall("skill-08", "skills", "View the 'writing-plans' skill.",
- `benchmarks/tool_call_benchmark.py:228` — [benchmark_suite] ToolCall("skill-09", "skills", "List skills in 'software-development' category.",
- `benchmarks/tool_call_benchmark.py:230` — [benchmark_suite] ToolCall("skill-10", "skills", "View the 'pr-review-discipline' skill.",
- `benchmarks/tool_call_benchmark.py:234` — [benchmark_suite] ToolCall("file-21", "file", "Write a Python snippet to /tmp/bench_sort.py that sorts [3,1,2].",
- `benchmarks/tool_call_benchmark.py:236` — [benchmark_suite] ToolCall("file-22", "file", "Read /tmp/bench_sort.py back and confirm it exists.",
- `benchmarks/tool_call_benchmark.py:238` — [benchmark_suite] ToolCall("file-23", "file", "Search for 'class' in all .py files in the benchmarks directory.",
- `benchmarks/tool_call_benchmark.py:240` — [benchmark_suite] ToolCall("term-21", "terminal", "Run `cat /etc/os-release 2>/dev/null || sw_vers 2>/dev/null` for OS info.",
- `benchmarks/tool_call_benchmark.py:242` — [benchmark_suite] ToolCall("term-22", "terminal", "Run `nproc 2>/dev/null || sysctl -n hw.ncpu 2>/dev/null` for CPU count.",
- `benchmarks/tool_call_benchmark.py:244` — [benchmark_suite] ToolCall("code-16", "code", "Execute Python to flatten a nested list [[1,2],[3,4],[5]].",
- `benchmarks/tool_call_benchmark.py:246` — [benchmark_suite] ToolCall("code-17", "code", "Run Python to check if a number 17 is prime.",
- `benchmarks/tool_call_benchmark.py:248` — [benchmark_suite] ToolCall("deleg-11", "delegate", "Delegate: what is the current working directory?",
- `benchmarks/tool_call_benchmark.py:250` — [benchmark_suite] ToolCall("todo-11", "todo", "Add a todo: 'Finalize benchmark report' status pending.",
- `benchmarks/tool_call_benchmark.py:252` — [benchmark_suite] ToolCall("todo-12", "memory", "Store fact: 'benchmark categories: file, terminal, code, delegate, todo, memory, skills'.",
- `benchmarks/tool_call_benchmark.py:254` — [benchmark_suite] ToolCall("skill-11", "skills", "Search for skills about 'deployment'.",
- `benchmarks/tool_call_benchmark.py:256` — [benchmark_suite] ToolCall("skill-12", "skills", "View the 'gitea-burn-cycle' skill.",
- `benchmarks/tool_call_benchmark.py:258` — [benchmark_suite] ToolCall("skill-13", "skills", "List all available skill categories.",
- `benchmarks/tool_call_benchmark.py:260` — [benchmark_suite] ToolCall("skill-14", "skills", "Search for skills related to 'memory'.",
- `benchmarks/tool_call_benchmark.py:262` — [benchmark_suite] ToolCall("skill-15", "skills", "View the 'mimo-swarm' skill.",
- `benchmarks/tool_call_benchmark.py:311` — [benchmark_suite] """Create prerequisite files for the benchmark."""
- `benchmarks/tool_call_benchmark.py:313` — [benchmark_suite] "This is a benchmark test file.\n"
- `benchmarks/tool_call_benchmark.py:349` — [benchmark_suite] "You are a benchmark test runner. Execute the user's request by calling "
- `benchmarks/tool_call_benchmark.py:406` — [benchmark_suite] """Generate markdown benchmark report."""
- `benchmarks/tool_call_benchmark.py:428` — [benchmark_suite] f"# Tool-Calling Benchmark Report",
- `benchmarks/tool_call_benchmark.py:535` — [benchmark_suite] parser = argparse.ArgumentParser(description="Tool-calling benchmark")
- `benchmarks/tool_call_benchmark.py:544` — [benchmark_suite] help="Output report path (default: benchmarks/gemma4-tool-calling-YYYY-MM-DD.md)")
- `benchmarks/tool_call_benchmark.py:565` — [benchmark_suite] output_path = Path(args.output) if args.output else REPO_ROOT / "benchmarks" / f"gemma4-tool-calling-{date_str}.md"
- `benchmarks/tool_call_benchmark.py:575` — [benchmark_suite] print(f"Benchmark: {len(suite)} tests × {len(model_specs)} models = {len(suite) * len(model_specs)} calls")
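To make the `fallback_chain` evidence easier to scan, here is a condensed reconstruction of the normalization logic assembled from the `run_agent.py` excerpts above. The filter condition inside the list branch is elided in the excerpts, so the one shown is an assumption that mirrors the dict branch:

```python
# Condensed from the run_agent.py lines quoted above (simplified sketch).
if isinstance(fallback_model, list):
    # new list ``fallback_providers`` format
    self._fallback_chain = [
        f for f in fallback_model
        # assumed filter (elided in the excerpt), mirroring the dict branch
        if isinstance(f, dict) and f.get("provider") and f.get("model")
    ]
elif isinstance(fallback_model, dict) and fallback_model.get("provider") and fallback_model.get("model"):
    # legacy single-dict ``fallback_model`` format
    self._fallback_chain = [fallback_model]
else:
    self._fallback_chain = []
self._fallback_model = self._fallback_chain[0] if self._fallback_chain else None
```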
## Suggested next slice

1. Build an exporter that emits SessionDB + trajectory data into a TensorZero-friendly offline dataset (a minimal sketch follows this list).
2. Define percentage-based canary controls before attempting any gateway replacement.
3. Keep Hermes routing authoritative until TensorZero proves parity across CLI, gateway, cron, auxiliary, and delegation surfaces.
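A minimal sketch of the exporter in item 1, assuming a JSONL dataset shape. Only `SessionDB` (hermes_state.py) and the JSONL trajectory export (batch_runner.py) are grounded in the touchpoints above; the `iter_sessions` accessor and the record fields are assumptions that would need to be matched to the real SessionDB API first.

```python
import json
from pathlib import Path

from hermes_state import SessionDB  # grounded: hermes_state.py:115


def export_sessions(db: SessionDB, output: Path) -> int:
    """Hypothetical sketch: dump sessions as one JSON record per line."""
    count = 0
    with output.open("w", encoding="utf-8") as f:
        for session in db.iter_sessions():  # hypothetical accessor
            record = {  # assumed fields; align with the real schema first
                "episode_id": session["id"],
                "messages": session["messages"],
                "model": session.get("model"),
                "provider": session.get("provider"),
            }
            f.write(json.dumps(record, ensure_ascii=False) + "\n")
            count += 1
    return count
```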
@@ -1,387 +0,0 @@
# Morning Review Packet

Source epic: [EPIC: Morning review packet — Hermes harness features landed 2026-04-21](https://forge.alexanderwhitestone.com/Timmy_Foundation/hermes-agent/issues/949)

## Epic context

EPIC: Morning review packet — Hermes harness features landed 2026-04-21

Source: git log on upstream/main since 2026-04-21 00:00 EDT, plus the current local branch `burn/921-poka-yoke-hardcoded-paths` for the branch-only path-guard work.

Important review note:
- Validate upstream-landed features on `upstream/main` or a synced branch.
- Validate the path-guard work on `burn/921-poka-yoke-hardcoded-paths`.

This epic is a morning-review packet: one QA issue per feature cluster, each with concrete acceptance criteria and targeted tests or manual checks.

## Success criteria
- [ ] Every issue has a clear PASS / FAIL outcome.
- [ ] Test output or manual evidence is attached to each issue.
- [ ] Any drift between upstream/main and forge/main is called out explicitly.

## Sub-issues
### Upstream/main features landed 2026-04-21
- [ ] #950 [QA] Verify AI Gateway provider UX + attribution headers
- [ ] #951 [QA] Verify transport abstraction + AnthropicTransport wiring
- [ ] #952 [QA] Verify CLI voice beep toggle
- [ ] #953 [QA] Verify bundled skill scripts run out of the box
- [ ] #954 [QA] Verify maps skill guest_house / camp_site / bakery expansion
- [ ] #955 [QA] Verify KittenTTS local provider end-to-end
- [ ] #956 [QA] Verify numbered keyboard shortcuts for approval + clarify prompts
- [ ] #957 [QA] Verify optional adversarial-ux-test skill catalog flow
- [ ] #958 [QA] Verify /usage account limits in CLI + gateway
- [ ] #959 [QA] Verify OpenCode-Go curated catalog additions
- [ ] #960 [QA] Verify patch 'did you mean?' suggestions
- [ ] #961 [QA] Verify web dashboard update/restart action buttons

### Local branch-only work
- [ ] #962 [QA] Verify hardcoded-home path guard on burn/921 branch

## Summary

| Issue | State | Commits | Tests |
| --- | --- | --- | --- |
| #950 | open | 5 | 2 |
| #951 | open | 2 | 2 |
| #952 | open | 1 | 1 |
| #953 | open | 1 | 2 |
| #954 | open | 1 | 0 |
| #955 | open | 2 | 1 |
| #956 | open | 1 | 0 |
| #957 | open | 1 | 0 |
| #958 | open | 2 | 2 |
| #959 | open | 1 | 1 |
| #960 | open | 2 | 1 |
| #961 | closed | 1 | 0 |
| #962 | closed | 1 | 1 |
## #950 — [QA] Verify AI Gateway provider UX + attribution headers

State: open
URL: https://forge.alexanderwhitestone.com/Timmy_Foundation/hermes-agent/issues/950

### Branch / checkout
- Validate on `upstream/main` or an equivalent synced checkout.

### Commits
- `b11753879` — attribution default_headers for ai-gateway provider
- `700437440` — curated picker with live pricing
- `ac26a460f` — promote ai-gateway in provider picker ordering
- `5bb2d11b0` — auto-promote free Moonshot models
- `29f57ec95` — Vercel deep-link for API key creation

### Targeted tests
- `tests/hermes_cli/test_ai_gateway_models.py`
- `tests/run_agent/test_provider_attribution_headers.py`

### Tasks
- [ ] Open `hermes model` and verify `ai-gateway` appears near the top.
- [ ] Verify live pricing appears in the picker.
- [ ] Verify free Moonshot models are promoted.
- [ ] Trigger API-key setup flow and verify the Vercel deep link.
- [ ] Send one ai-gateway request and verify attribution headers are attached.

### Acceptance criteria
- [ ] UI ordering and pricing match the landed behavior.
- [ ] Attribution headers are present on ai-gateway requests.
- [ ] Targeted tests pass.

## #951 — [QA] Verify transport abstraction + AnthropicTransport wiring

State: open
URL: https://forge.alexanderwhitestone.com/Timmy_Foundation/hermes-agent/issues/951

### Branch / checkout
- Validate on `upstream/main` or an equivalent synced checkout.

### Commits
- `7ab5eebd0` — transport types + Anthropic normalize migration
- `731f4fbae` — transport ABC + AnthropicTransport wired to all paths

### Targeted tests
- `tests/agent/transports/test_types.py`
- `tests/agent/test_anthropic_normalize_v2.py`

### Tasks
- [ ] Verify plain-text Anthropic responses normalize correctly.
- [ ] Verify tool-call responses preserve IDs, names, and arguments.
- [ ] Verify reasoning/thinking is preserved separately from visible content.
- [ ] Verify finish_reason mapping remains correct across paths.

### Acceptance criteria
- [ ] Normalized response shape is stable.
- [ ] Tool-call and reasoning payloads survive normalization.
- [ ] Targeted tests pass.

## #952 — [QA] Verify CLI voice beep toggle

State: open
URL: https://forge.alexanderwhitestone.com/Timmy_Foundation/hermes-agent/issues/952

### Branch / checkout
- Validate on `upstream/main` or an equivalent synced checkout.

### Commits
- `b48ea41d2` — voice: add CLI beep toggle

### Targeted tests
- `tests/tools/test_voice_cli_integration.py`

### Tasks
- [ ] Enable the beep option in config and confirm voice mode emits the beep.
- [ ] Disable the option and confirm the same path is silent.
- [ ] Verify voice mode still strips markdown before speech output.
- [ ] Verify voice mode does not pollute conversation history with TTS-only text.

### Acceptance criteria
- [ ] Beep behavior is actually toggled by config.
- [ ] Existing voice/TTS integration behavior is not regressed.
- [ ] Targeted tests pass.

## #953 — [QA] Verify bundled skill scripts run out of the box

State: open
URL: https://forge.alexanderwhitestone.com/Timmy_Foundation/hermes-agent/issues/953

### Branch / checkout
- Validate on `upstream/main` or an equivalent synced checkout.

### Commits
- `328223576` — make bundled skill scripts runnable out of the box

### Targeted tests
- `tests/agent/test_skill_commands.py`
- `tests/tools/test_local_shell_init.py`

### Tasks
- [ ] Pick a bundled skill that ships a script and run it without manual chmod/PATH surgery.
- [ ] Verify local terminal execution resolves the installed skill script correctly.
- [ ] Verify local shell init still behaves correctly.

### Acceptance criteria
- [ ] Bundled skill scripts execute from the installed skill location with no manual prep.
- [ ] Local shell init remains healthy.
- [ ] Targeted tests pass.

## #954 — [QA] Verify maps skill guest_house / camp_site / bakery expansion

State: open
URL: https://forge.alexanderwhitestone.com/Timmy_Foundation/hermes-agent/issues/954

### Branch / checkout
- Validate on `upstream/main` or an equivalent synced checkout.

### Commits
- `c5a814b23` — maps: add guest_house, camp_site, and dual-key bakery lookup

### Tasks
- [ ] Use the maps skill to search for a guest house in a known populated area.
- [ ] Use the maps skill to search for a camp site in a known populated area.
- [ ] Use the maps skill to search for a bakery and verify both supported keys resolve correctly.
- [ ] Confirm results are sensible and non-empty.

### Acceptance criteria
- [ ] All three place types resolve correctly.
- [ ] Bakery lookup works through both supported keys.
- [ ] Manual evidence is attached in the issue.

## #955 — [QA] Verify KittenTTS local provider end-to-end

State: open
URL: https://forge.alexanderwhitestone.com/Timmy_Foundation/hermes-agent/issues/955

### Branch / checkout
- Validate on `upstream/main` or an equivalent synced checkout.

### Commits
- `1830ebfc5` — add KittenTTS provider
- `2d7ff9c5b` — complete KittenTTS integration across tools/setup/docs/tests

### Targeted tests
- `tests/tools/test_tts_kittentts.py`

### Tasks
- [ ] Configure TTS to use `kittentts`.
- [ ] Generate speech to `.wav` and verify playable output.
- [ ] Verify voice / speed / cleaned text are passed correctly.
- [ ] Generate repeated requests and verify model caching behavior.
- [ ] Generate a non-wav output and verify ffmpeg conversion path.
- [ ] Verify missing-package behavior returns a helpful error.

### Acceptance criteria
- [ ] KittenTTS works end-to-end when installed.
- [ ] Failure mode is operator-friendly when not installed.
- [ ] Targeted tests pass.

## #956 — [QA] Verify numbered keyboard shortcuts for approval + clarify prompts

State: open
URL: https://forge.alexanderwhitestone.com/Timmy_Foundation/hermes-agent/issues/956

### Branch / checkout
- Validate on `upstream/main` or an equivalent synced checkout.

### Commits
- `d1ed6f4fb` — CLI: add numbered keyboard shortcuts to approval and clarify prompts

### Tasks
- [ ] Trigger an approval prompt and choose an option with number keys.
- [ ] Trigger a clarify prompt and choose an option with number keys.
- [ ] Verify the correct option is submitted both times.
- [ ] Verify normal keyboard navigation still works.

### Acceptance criteria
- [ ] Number-key selection works for both prompt types.
- [ ] Legacy keyboard navigation is not broken.
- [ ] Manual evidence is attached in the issue.

## #957 — [QA] Verify optional adversarial-ux-test skill catalog flow

State: open
URL: https://forge.alexanderwhitestone.com/Timmy_Foundation/hermes-agent/issues/957

### Branch / checkout
- Validate on `upstream/main` or an equivalent synced checkout.

### Commits
- `e50e7f11b` — skills: add adversarial-ux-test optional skill

### Tasks
- [ ] Verify the optional skill appears in the optional skill catalog.
- [ ] Install or enable the skill.
- [ ] Load it successfully through Hermes.
- [ ] Disable or remove it and verify catalog state updates cleanly.

### Acceptance criteria
- [ ] Catalog listing is correct.
- [ ] Install / load / disable lifecycle works cleanly.
- [ ] Manual evidence is attached in the issue.

## #958 — [QA] Verify /usage account limits in CLI + gateway

State: open
URL: https://forge.alexanderwhitestone.com/Timmy_Foundation/hermes-agent/issues/958

### Branch / checkout
- Validate on `upstream/main` or an equivalent synced checkout.

### Commits
- `8a11b0a20` — per-provider account limits module
- `bcc5d7b67` — append account limits section in CLI and gateway

### Targeted tests
- `tests/test_account_usage.py`
- `tests/gateway/test_usage_command.py`

### Tasks
- [ ] Run `/usage` in CLI for a provider with account limits.
- [ ] Verify provider, remaining quota, total limit, and reset window render correctly.
- [ ] Run `/usage` through the gateway and verify the same section appears.
- [ ] Verify zero-value cache read/write sections stay hidden when appropriate.

### Acceptance criteria
- [ ] CLI and gateway both show the landed account-limits section correctly.
- [ ] Targeted tests pass.

## #959 — [QA] Verify OpenCode-Go curated catalog additions

State: open
URL: https://forge.alexanderwhitestone.com/Timmy_Foundation/hermes-agent/issues/959

### Branch / checkout
- Validate on `upstream/main` or an equivalent synced checkout.

### Commits
- `4fea1769d` — opencode-go: add Kimi K2.6 and Qwen3.5/3.6 Plus to curated catalog

### Targeted tests
- `tests/hermes_cli/test_opencode_go_in_model_list.py`

### Tasks
- [ ] With valid OpenCode-Go credentials, open `hermes model`.
- [ ] Verify Kimi K2.6 appears.
- [ ] Verify Qwen 3.5 Plus and 3.6 Plus appear.
- [ ] Unset credentials and verify the provider/catalog hides correctly.

### Acceptance criteria
- [ ] New curated models are present when credentials exist.
- [ ] Catalog visibility still respects credential gating.
- [ ] Targeted tests pass.
## #960 — [QA] Verify patch 'did you mean?' suggestions

State: open
URL: https://forge.alexanderwhitestone.com/Timmy_Foundation/hermes-agent/issues/960

### Branch / checkout
- Validate on `upstream/main` or an equivalent synced checkout.

### Commits
- `15abf4ed8` — add `did you mean?` feedback when patch fails to match
- `5e6427a42` — gate it to true no-match cases and extend to v4a / skill_manage

### Targeted tests
- `tests/tools/test_fuzzy_match.py`

### Tasks
- [ ] Intentionally run a replace/patch with a near-miss `old_string`.
- [ ] Verify the tool suggests a useful nearby line/context (a generic illustration follows this issue's checklist).
- [ ] Verify suggestions only appear on true no-match failures.
- [ ] Verify the behavior also works via file tools, v4a patching, and skill_manage.

### Acceptance criteria
- [ ] Suggestion quality is helpful, not noisy.
- [ ] Suggestions are correctly gated to no-match cases.
- [ ] Targeted tests pass.
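The repo's fuzzy matcher is exercised by `tests/tools/test_fuzzy_match.py` and is not shown in this packet. As a reviewer aid only, a generic 'did you mean?' check of this shape can be built from the standard library; this is an illustration, not the Hermes implementation:

```python
import difflib


def did_you_mean(old_string: str, file_text: str) -> str | None:
    """Suggest the closest existing line when a patch target has no exact match."""
    if old_string in file_text:
        # A true match exists, so no suggestion (mirrors the gating commit above).
        return None
    candidates = difflib.get_close_matches(
        old_string, file_text.splitlines(), n=1, cutoff=0.6
    )
    return candidates[0] if candidates else None
```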
## #961 — [QA] Verify web dashboard update/restart action buttons

State: closed
URL: https://forge.alexanderwhitestone.com/Timmy_Foundation/hermes-agent/issues/961

### Branch / checkout
- Validate on `upstream/main` or an equivalent synced checkout.

### Commits
- `fc21c1420` — add buttons to update Hermes and restart gateway

### Files touched
- `web/src/pages/StatusPage.tsx`
- `web/src/lib/api.ts`
- `web/src/i18n/en.ts`

### Tasks
- [ ] Open the Web UI status page and verify both buttons are present.
- [ ] Click Restart Gateway in a safe environment and verify running/output/success-or-failure states render.
- [ ] Click Update Hermes and verify the same action lifecycle.
- [ ] Verify the page remains responsive while actions are running.

### Acceptance criteria
- [ ] Both action buttons are present and wired.
- [ ] Action status polling and result rendering work end-to-end.
- [ ] Manual evidence is attached in the issue.
## #962 — [QA] Verify hardcoded-home path guard on burn/921 branch

State: closed
URL: https://forge.alexanderwhitestone.com/Timmy_Foundation/hermes-agent/issues/962

### Branch / checkout
- Validate specifically on `burn/921-poka-yoke-hardcoded-paths` (not upstream/main).

### Commits
- `5dcb90531` — Poka-yoke: prevent hardcoded home-directory paths

### Targeted tests
- `tests/test_path_guard.py`

### Tasks
- [ ] Verify hardcoded `/Users/...` paths are rejected (an illustrative pattern sketch follows this section).
- [ ] Verify hardcoded `~/.hermes/...` paths are rejected in guarded contexts.
- [ ] Verify valid relative paths still pass.
- [ ] Verify appropriate absolute paths still pass where intended.
- [ ] Verify linting catches violations in non-test files.

### Acceptance criteria
- [ ] Guard blocks the dangerous patterns and preserves allowed ones.
- [ ] Targeted tests pass.
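The guard's real rules live in `tests/test_path_guard.py` on the burn/921 branch and are not shown here. As a reviewer aid, the dangerous patterns named in the tasks above can be approximated with a regex of this shape; this is an illustration, not the branch's implementation:

```python
import re

# Illustrative approximation: flag absolute /Users/... home paths and
# literal ~/.hermes/ paths (the guarded patterns named in the tasks above).
HARDCODED_HOME = re.compile(r"(?:/Users/[\w.-]+/|~/\.hermes/)")


def violates_path_guard(line: str) -> bool:
    return bool(HARDCODED_HOME.search(line))


assert violates_path_guard("/Users/alice/project/config.json")
assert violates_path_guard("open('~/.hermes/state.db')")
assert not violates_path_guard("configs/hermes.yaml")
```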
@@ -1,301 +0,0 @@
#!/usr/bin/env python3
"""Build a morning review packet from a Gitea epic and its child QA issues.

This script fetches a parent epic plus its sub-issues, extracts the structured
sections from each QA issue body, and renders a single markdown packet suitable
for morning review.

Usage:
    python scripts/morning_review_packet.py --epic-number 949
    python scripts/morning_review_packet.py --epic-number 949 --children 950-962
    python scripts/morning_review_packet.py --epic-number 949 --output docs/review_packets/hermes-harness-2026-04-21.md
"""

from __future__ import annotations

import argparse
import json
import os
import re
import urllib.request
from dataclasses import dataclass, field
from pathlib import Path
from typing import Iterable

DEFAULT_BASE_URL = "https://forge.alexanderwhitestone.com"
DEFAULT_OWNER = "Timmy_Foundation"
DEFAULT_REPO = "hermes-agent"
DEFAULT_TOKEN_PATH = Path.home() / ".config" / "gitea" / "token"


@dataclass(frozen=True)
class CommitEvidence:
    sha: str
    summary: str


@dataclass
class ReviewIssue:
    number: int
    title: str
    state: str
    url: str
    comments: int = 0
    parent_issue: int | None = None
    checkout_notes: list[str] = field(default_factory=list)
    commits: list[CommitEvidence] = field(default_factory=list)
    targeted_tests: list[str] = field(default_factory=list)
    files_touched: list[str] = field(default_factory=list)
    tasks: list[str] = field(default_factory=list)
    acceptance_criteria: list[str] = field(default_factory=list)


def parse_issue_number_spec(spec: str) -> list[int]:
    """Parse a comma-separated issue list like ``950-952,955,962``."""
    numbers: list[int] = []
    seen: set[int] = set()
    for chunk in (part.strip() for part in spec.split(",")):
        if not chunk:
            continue
        if "-" in chunk:
            start_str, end_str = (part.strip() for part in chunk.split("-", 1))
            start = int(start_str)
            end = int(end_str)
            if end < start:
                raise ValueError(f"Invalid descending issue range: {chunk}")
            for number in range(start, end + 1):
                if number not in seen:
                    numbers.append(number)
                    seen.add(number)
        else:
            number = int(chunk)
            if number not in seen:
                numbers.append(number)
                seen.add(number)
    return numbers


def _parse_sections(body: str) -> dict[str, list[str]]:
    sections: dict[str, list[str]] = {}
    current: str | None = None
    for raw_line in body.splitlines():
        line = raw_line.rstrip()
        if line.startswith("## "):
            current = line[3:].strip()
            sections[current] = []
            continue
        if current is not None:
            sections[current].append(line)
    return sections


def _clean_bullet(line: str) -> str | None:
    stripped = line.strip()
    if not stripped:
        return None
    stripped = re.sub(r"^-\s*\[(?: |x|X)\]\s*", "", stripped)
    stripped = re.sub(r"^-\s*", "", stripped)
    return stripped.strip() or None


def _extract_bullets(lines: Iterable[str]) -> list[str]:
    items: list[str] = []
    for line in lines:
        cleaned = _clean_bullet(line)
        if cleaned:
            items.append(cleaned)
    return items


def _extract_parent_issue(body: str, sections: dict[str, list[str]]) -> int | None:
    parent_lines = sections.get("Parent", [])
    for line in parent_lines:
        match = re.search(r"#(\d+)", line)
        if match:
            return int(match.group(1))
    match = re.search(r"Linked to Epic\s+#(\d+)", body, flags=re.IGNORECASE)
    if match:
        return int(match.group(1))
    return None


def _extract_commits(lines: Iterable[str]) -> list[CommitEvidence]:
    commits: list[CommitEvidence] = []
    for item in _extract_bullets(lines):
        match = re.match(r"`([^`]+)`\s*(.*)", item)
        if match:
            commits.append(CommitEvidence(sha=match.group(1).strip(), summary=match.group(2).strip()))
        else:
            commits.append(CommitEvidence(sha="", summary=item))
    return commits


def _strip_backticks(items: Iterable[str]) -> list[str]:
    cleaned: list[str] = []
    for item in items:
        cleaned.append(item.replace("`", "").strip())
    return cleaned


def discover_child_issue_numbers(epic_body: str) -> list[int]:
    """Discover sub-issue numbers from an epic body."""
    sections = _parse_sections(epic_body)
    sub_lines = sections.get("Sub-issues")
    if not sub_lines:
        return []
    numbers: list[int] = []
    seen: set[int] = set()
    for line in sub_lines:
        for match in re.finditer(r"#(\d+)", line):
            number = int(match.group(1))
            if number not in seen:
                numbers.append(number)
                seen.add(number)
    return numbers


def parse_child_issue(issue: dict) -> ReviewIssue:
    body = issue.get("body") or ""
    sections = _parse_sections(body)
    commit_lines = sections.get("Commits landed today", []) or sections.get("Commit landed today", [])

    return ReviewIssue(
        number=int(issue["number"]),
        title=issue.get("title") or "",
        state=(issue.get("state") or "unknown").lower(),
        url=issue.get("html_url") or issue.get("url") or "",
        comments=int(issue.get("comments") or 0),
        parent_issue=_extract_parent_issue(body, sections),
        checkout_notes=_extract_bullets(sections.get("Branch / checkout", [])),
        commits=_extract_commits(commit_lines),
        targeted_tests=_strip_backticks(_extract_bullets(sections.get("Targeted tests", []))),
        files_touched=_strip_backticks(_extract_bullets(sections.get("Files touched", []))),
        tasks=_extract_bullets(sections.get("Tasks", [])),
        acceptance_criteria=_extract_bullets(sections.get("Acceptance Criteria", [])),
    )


def build_packet_markdown(epic_issue: dict, child_issues: list[ReviewIssue]) -> str:
    title = epic_issue.get("title") or f"Epic #{epic_issue.get('number')}"
    url = epic_issue.get("html_url") or epic_issue.get("url") or ""
    body = epic_issue.get("body") or ""
    children = sorted(child_issues, key=lambda item: item.number)

    lines: list[str] = []
    lines.append("# Morning Review Packet")
    lines.append("")
    lines.append(f"Source epic: [{title}]({url})")
    lines.append("")
    lines.append("## Epic context")
    lines.append("")
    lines.append(title)
    lines.append("")
    for line in body.splitlines():
        if line.strip():
            lines.append(line)
        else:
            lines.append("")
    lines.append("")
    lines.append("## Summary")
    lines.append("")
    lines.append("| Issue | State | Commits | Tests |")
    lines.append("| --- | --- | --- | --- |")
    for child in children:
        lines.append(
            f"| #{child.number} | {child.state} | {len(child.commits)} | {len(child.targeted_tests)} |"
        )
    lines.append("")

    for child in children:
        lines.append(f"## #{child.number} — {child.title}")
        lines.append("")
        lines.append(f"State: {child.state}")
        lines.append(f"URL: {child.url}")
        lines.append("")
        if child.checkout_notes:
            lines.append("### Branch / checkout")
            for note in child.checkout_notes:
                lines.append(f"- {note}")
            lines.append("")
        if child.commits:
            lines.append("### Commits")
            for commit in child.commits:
                if commit.sha:
                    lines.append(f"- `{commit.sha}` — {commit.summary}")
                else:
                    lines.append(f"- {commit.summary}")
            lines.append("")
        if child.targeted_tests:
            lines.append("### Targeted tests")
            for test_path in child.targeted_tests:
                lines.append(f"- `{test_path}`")
            lines.append("")
        if child.files_touched:
            lines.append("### Files touched")
            for file_path in child.files_touched:
                lines.append(f"- `{file_path}`")
            lines.append("")
        if child.tasks:
            lines.append("### Tasks")
            for task in child.tasks:
                lines.append(f"- [ ] {task}")
            lines.append("")
        if child.acceptance_criteria:
            lines.append("### Acceptance criteria")
            for item in child.acceptance_criteria:
                lines.append(f"- [ ] {item}")
            lines.append("")

    return "\n".join(lines).rstrip() + "\n"


def _resolve_token(explicit_token: str | None = None) -> str:
    if explicit_token:
        return explicit_token.strip()
    env_token = os.getenv("GITEA_TOKEN")
    if env_token:
        return env_token.strip()
    if DEFAULT_TOKEN_PATH.exists():
        return DEFAULT_TOKEN_PATH.read_text().strip()
    raise FileNotFoundError(f"No Gitea token found. Set GITEA_TOKEN or create {DEFAULT_TOKEN_PATH}")


def fetch_issue(base_url: str, owner: str, repo: str, number: int, token: str) -> dict:
    url = f"{base_url.rstrip('/')}/api/v1/repos/{owner}/{repo}/issues/{number}"
    request = urllib.request.Request(url, headers={"Authorization": f"token {token}"})
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.loads(response.read().decode())


def collect_child_issues(base_url: str, owner: str, repo: str, epic_issue: dict, token: str, children_spec: str | None = None) -> list[dict]:
    numbers = parse_issue_number_spec(children_spec) if children_spec else discover_child_issue_numbers(epic_issue.get("body") or "")
    return [fetch_issue(base_url, owner, repo, number, token) for number in numbers]


def main(argv: list[str] | None = None) -> int:
    parser = argparse.ArgumentParser(description="Build a markdown morning review packet from a Gitea epic")
    parser.add_argument("--base-url", default=DEFAULT_BASE_URL)
    parser.add_argument("--owner", default=DEFAULT_OWNER)
    parser.add_argument("--repo", default=DEFAULT_REPO)
    parser.add_argument("--epic-number", type=int, required=True)
    parser.add_argument("--children", help="Explicit issue list/ranges, e.g. 950-962")
    parser.add_argument("--token", help="Gitea token (defaults to GITEA_TOKEN or ~/.config/gitea/token)")
    parser.add_argument("--output", help="Write markdown packet to this path instead of stdout")
    args = parser.parse_args(argv)

    token = _resolve_token(args.token)
    epic_issue = fetch_issue(args.base_url, args.owner, args.repo, args.epic_number, token)
    child_issue_dicts = collect_child_issues(args.base_url, args.owner, args.repo, epic_issue, token, args.children)
    packet = build_packet_markdown(epic_issue, [parse_child_issue(issue) for issue in child_issue_dicts])

    if args.output:
        output_path = Path(args.output)
        output_path.parent.mkdir(parents=True, exist_ok=True)
        output_path.write_text(packet)
    else:
        print(packet, end="")
    return 0


if __name__ == "__main__":
    raise SystemExit(main())
scripts/tensorzero_eval_packet.py (new file, 318 lines)
@@ -0,0 +1,318 @@
|
||||
#!/usr/bin/env python3
|
||||
"""Generate a grounded TensorZero evaluation packet for Hermes.
|
||||
|
||||
This script inventories the current Hermes routing/evaluation surfaces, then
|
||||
builds a markdown packet assessing how much of issue #860 can be satisfied by
|
||||
TensorZero and where the migration risk still lives.
|
||||
"""
|
||||
|
||||
from __future__ import annotations
|
||||
|
||||
import argparse
|
||||
import json
|
||||
import re
|
||||
from dataclasses import asdict, dataclass
|
||||
from pathlib import Path
|
||||
from typing import Iterable
|
||||
|
||||
ISSUE_NUMBER = 860
|
||||
ISSUE_TITLE = "tensorzero LLMOps platform evaluation"
|
||||
ISSUE_URL = "https://forge.alexanderwhitestone.com/Timmy_Foundation/hermes-agent/issues/860"
|
||||
DEFAULT_OUTPUT = Path("docs/evaluations/tensorzero-860-evaluation.md")
|
||||
DEFAULT_JSON_OUTPUT = Path("docs/evaluations/tensorzero-860-evaluation.json")
|
||||
|
||||
|
||||
@dataclass(frozen=True)
class TouchpointPattern:
    label: str
    file_path: str
    regex: str
    description: str


@dataclass(frozen=True)
class Touchpoint:
    label: str
    file_path: str
    line_number: int
    matched_text: str


@dataclass(frozen=True)
class RequirementStatus:
    key: str
    name: str
    status: str
    evidence_labels: tuple[str, ...]
    summary: str


@dataclass(frozen=True)
class EvaluationReport:
    issue_number: int
    issue_title: str
    issue_url: str
    recommendation: str
    touchpoints: tuple[Touchpoint, ...]
    requirements: tuple[RequirementStatus, ...]


PATTERNS: tuple[TouchpointPattern, ...] = (
    TouchpointPattern(
        label="fallback_chain",
        file_path="run_agent.py",
        regex=r"_fallback_chain|fallback_providers|fallback_model",
        description="Primary agent fallback-provider chain in the core conversation loop.",
    ),
    TouchpointPattern(
        label="provider_routing_config",
        file_path="cli.py",
        regex=r"provider_routing|fallback_providers|smart_model_routing",
        description="CLI-owned provider routing and fallback configuration surfaces.",
    ),
    TouchpointPattern(
        label="runtime_provider",
        file_path="hermes_cli/runtime_provider.py",
        regex=r"def resolve_runtime_provider|def resolve_requested_provider",
        description="Central runtime provider resolution for CLI, gateway, cron, and helpers.",
    ),
    TouchpointPattern(
        label="smart_model_routing",
        file_path="agent/smart_model_routing.py",
        regex=r"def resolve_turn_route|def choose_cheap_model_route",
        description="Cheap-vs-strong turn routing that TensorZero would need to absorb or replace.",
    ),
    TouchpointPattern(
        label="gateway_provider_routing",
        file_path="gateway/run.py",
        regex=r"def _load_provider_routing|def _load_fallback_model|def _load_smart_model_routing",
        description="Gateway-specific loading of routing, fallback, and smart-model policies.",
    ),
    TouchpointPattern(
        label="cron_runtime_provider",
        file_path="cron/scheduler.py",
        regex=r"resolve_runtime_provider|resolve_turn_route|provider_routing|fallback_model",
        description="Cron execution path that re-resolves providers and routing on every run.",
    ),
    TouchpointPattern(
        label="auxiliary_fallback_chain",
        file_path="agent/auxiliary_client.py",
        regex=r"fallback chain|_get_provider_chain|provider chain",
        description="Auxiliary task routing/fallback chain outside the main inference path.",
    ),
    TouchpointPattern(
        label="delegate_runtime_provider",
        file_path="tools/delegate_tool.py",
        regex=r"runtime provider system|resolve the full credential bundle|resolve_runtime_provider",
        description="Subagent/delegation routing path that would also need TensorZero parity.",
    ),
    TouchpointPattern(
        label="session_db",
        file_path="hermes_state.py",
        regex=r"class SessionDB",
        description="Session persistence surface that could feed TensorZero optimization/eval data.",
    ),
    TouchpointPattern(
        label="trajectory_export",
        file_path="batch_runner.py",
        regex=r"trajectory_entry|save_trajectories|_convert_to_trajectory_format",
        description="Trajectory export surface for offline optimization and replay data.",
    ),
    TouchpointPattern(
        label="benchmark_suite",
        file_path="benchmarks/tool_call_benchmark.py",
        regex=r"ToolCall\(|class ToolCall|benchmark",
        description="Existing benchmark/evaluation harness that could map to TensorZero experiments.",
    ),
)


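# Matching is a line-oriented, case-insensitive regex pass: every line that
# matches a pattern becomes one Touchpoint, so a single file can contribute
# several touchpoints under the same label.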
def _iter_matches(pattern: TouchpointPattern, text: str) -> Iterable[Touchpoint]:
    regex = re.compile(pattern.regex, re.IGNORECASE)
    for line_number, line in enumerate(text.splitlines(), start=1):
        if regex.search(line):
            yield Touchpoint(
                label=pattern.label,
                file_path=pattern.file_path,
                line_number=line_number,
                matched_text=line.strip(),
            )


def scan_touchpoints(repo_root: Path) -> list[Touchpoint]:
    touchpoints: list[Touchpoint] = []
    for pattern in PATTERNS:
        path = repo_root / pattern.file_path
        if not path.exists():
            continue
        text = path.read_text(encoding="utf-8")
        touchpoints.extend(_iter_matches(pattern, text))
    return touchpoints


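# Status heuristic: a requirement is "partial" only when enough of its
# evidence labels showed up in the scan (at least 4 of the 6 gateway seams,
# at least 3 of the 4 config surfaces, both session surfaces, and the
# benchmark harness for the evaluation row). Anything thinner is a "gap";
# the canary row is hard-coded to "gap" because no traffic-split surface
# exists to scan for yet.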
def build_requirement_matrix(touchpoints: list[Touchpoint]) -> list[RequirementStatus]:
    labels = {tp.label for tp in touchpoints}

    matrix: list[RequirementStatus] = []
    gateway_labels = (
        "fallback_chain",
        "runtime_provider",
        "gateway_provider_routing",
        "cron_runtime_provider",
        "auxiliary_fallback_chain",
        "delegate_runtime_provider",
    )
    gateway_hits = tuple(label for label in gateway_labels if label in labels)
    gateway_status = "partial" if len(gateway_hits) >= 4 else "gap"
    gateway_summary = (
        "Hermes already spreads provider routing across core agent, runtime provider, gateway, cron, auxiliary, and delegation seams; "
        "TensorZero would need parity across all of them before it can replace the gateway layer."
        if gateway_hits else
        "No grounded routing surfaces were found for a gateway replacement assessment."
    )
    matrix.append(RequirementStatus("gateway_replacement", "Gateway replacement scope", gateway_status, gateway_hits, gateway_summary))

    config_labels = (
        "provider_routing_config",
        "runtime_provider",
        "smart_model_routing",
        "fallback_chain",
    )
    config_hits = tuple(label for label in config_labels if label in labels)
    config_status = "partial" if len(config_hits) >= 3 else "gap"
    config_summary = (
        "Hermes has multiple config concepts to migrate (`provider_routing`, `fallback_providers`, `smart_model_routing`, runtime provider resolution), "
        "so TensorZero is not a drop-in config swap."
        if config_hits else
        "No current config migration surface was found."
    )
    matrix.append(RequirementStatus("config_migration", "Config migration", config_status, config_hits, config_summary))

    canary_hits: tuple[str, ...] = tuple()
    canary_summary = (
        "The repo shows semantic routing and fallback, but no grounded 10% traffic-split canary mechanism. "
        "A TensorZero cutover would need new percentage-based rollout controls and observability hooks."
    )
    matrix.append(RequirementStatus("canary_rollout", "10% traffic canary", "gap", canary_hits, canary_summary))

    session_labels = ("session_db", "trajectory_export")
    session_hits = tuple(label for label in session_labels if label in labels)
    session_status = "partial" if len(session_hits) == len(session_labels) else "gap"
    session_summary = (
        "Hermes already has SessionDB and trajectory export surfaces that can feed offline optimization data, "
        "but not a TensorZero-native ingestion path yet."
        if session_hits else
        "No session-data surface was found for prompt optimization."
    )
    matrix.append(RequirementStatus("session_feedback", "Session data for prompt optimization", session_status, session_hits, session_summary))

    eval_labels = ("benchmark_suite", "trajectory_export")
    eval_hits = tuple(label for label in eval_labels if label in labels)
    eval_status = "partial" if "benchmark_suite" in eval_hits else "gap"
    eval_summary = (
        "Hermes already has benchmark/trajectory machinery that can seed TensorZero A/B evaluation, "
        "but no integrated TensorZero experiment runner or live evaluation gateway."
        if eval_hits else
        "No evaluation harness was found to support TensorZero A/B testing."
    )
    matrix.append(RequirementStatus("evaluation_suite", "Evaluation suite / A/B testing", eval_status, eval_hits, eval_summary))

    return matrix


def build_report(touchpoints: list[Touchpoint], requirement_matrix: list[RequirementStatus]) -> EvaluationReport:
    recommendation = (
        "Not ready for direct replacement. Recommend a shadow-evaluation phase first: keep Hermes routing live, "
        "inventory the migration seams, export SessionDB/trajectory data into an offline TensorZero experiment loop, "
        "and only design a canary gateway once percentage-based rollout controls exist."
    )
    return EvaluationReport(
        issue_number=ISSUE_NUMBER,
        issue_title=ISSUE_TITLE,
        issue_url=ISSUE_URL,
        recommendation=recommendation,
        touchpoints=tuple(touchpoints),
        requirements=tuple(requirement_matrix),
    )


def build_markdown(report: EvaluationReport) -> str:
    lines: list[str] = []
    lines.append("# TensorZero Evaluation Packet")
    lines.append("")
    lines.append(f"Issue #{report.issue_number}: [{report.issue_title}]({report.issue_url})")
    lines.append("")
    lines.append("## Scope")
    lines.append("")
    lines.append("This packet evaluates TensorZero as a possible replacement for Hermes' custom provider-routing stack.")
    lines.append("It is intentionally grounded in the current repo state rather than a speculative cutover plan.")
    lines.append("")
    lines.append("## Issue requirements being evaluated")
    lines.append("")
    lines.append("- Deploy tensorzero gateway (Rust binary)")
    lines.append("- Migrate provider routing config")
    lines.append("- Test with canary (10% traffic) before full cutover")
    lines.append("- Feed session data for prompt optimization")
    lines.append("- Evaluation suite for A/B testing models")
    lines.append("")
    lines.append("## Recommendation")
    lines.append("")
    lines.append(report.recommendation)
    lines.append("")
    lines.append("## Requirement matrix")
    lines.append("")
    lines.append("| Requirement | Status | Evidence labels | Summary |")
    lines.append("| --- | --- | --- | --- |")
    for row in report.requirements:
        evidence = ", ".join(row.evidence_labels) if row.evidence_labels else "—"
        lines.append(f"| {row.name} | {row.status} | {evidence} | {row.summary} |")
    lines.append("")
    lines.append("## Grounded Hermes touchpoints")
    lines.append("")
    if report.touchpoints:
        for tp in report.touchpoints:
            lines.append(f"- `{tp.file_path}:{tp.line_number}` — [{tp.label}] {tp.matched_text}")
    else:
        lines.append("- No routing/evaluation touchpoints were found.")
    lines.append("")
    lines.append("## Suggested next slice")
    lines.append("")
    lines.append("1. Build an exporter that emits SessionDB + trajectory data into a TensorZero-friendly offline dataset.")
    lines.append("2. Define percentage-based canary controls before attempting any gateway replacement.")
    lines.append("3. Keep Hermes routing authoritative until TensorZero proves parity across CLI, gateway, cron, auxiliary, and delegation surfaces.")
    lines.append("")
    return "\n".join(lines).rstrip() + "\n"


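# The JSON sidecar leans on dataclasses.asdict, which recurses through the
# nested frozen dataclasses and tuples, so no custom JSON encoder is needed.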
def write_outputs(report: EvaluationReport, markdown_path: Path, json_path: Path | None = None) -> None:
    markdown_path.parent.mkdir(parents=True, exist_ok=True)
    markdown_path.write_text(build_markdown(report), encoding="utf-8")
    if json_path is not None:
        json_path.parent.mkdir(parents=True, exist_ok=True)
        json_path.write_text(json.dumps(asdict(report), indent=2), encoding="utf-8")


def parse_args() -> argparse.Namespace:
    parser = argparse.ArgumentParser(description="Generate a grounded TensorZero evaluation packet for Hermes")
    parser.add_argument("--repo-root", default=".", help="Hermes repo root to scan")
    parser.add_argument("--output", default=str(DEFAULT_OUTPUT), help="Markdown output path")
    parser.add_argument("--json-output", default=str(DEFAULT_JSON_OUTPUT), help="Optional JSON output path")
    return parser.parse_args()


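# Example invocation, using the flags and defaults defined in parse_args():
#   python scripts/tensorzero_eval_packet.py --repo-root . \
#       --output docs/evaluations/tensorzero-860-evaluation.md \
#       --json-output docs/evaluations/tensorzero-860-evaluation.json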
def main() -> int:
    args = parse_args()
    repo_root = Path(args.repo_root).resolve()
    touchpoints = scan_touchpoints(repo_root)
    matrix = build_requirement_matrix(touchpoints)
    report = build_report(touchpoints, matrix)
    json_output = Path(args.json_output) if args.json_output else None
    write_outputs(report, Path(args.output), json_output)
    print(f"Wrote {args.output}")
    if json_output is not None:
        print(f"Wrote {json_output}")
    return 0


if __name__ == "__main__":
    raise SystemExit(main())
@@ -1,162 +0,0 @@
from pathlib import Path
import sys

SCRIPT_DIR = Path(__file__).resolve().parents[1] / "scripts"
sys.path.insert(0, str(SCRIPT_DIR))

import morning_review_packet as mrp


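# The fixture bodies below mirror real Gitea issue markdown: an epic with a
# "## Sub-issues" checklist, and child issues with structured QA sections.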
EPIC_BODY = """Source: git log on upstream/main since 2026-04-21 00:00 EDT.

## Success criteria
- [ ] Every issue has a clear PASS / FAIL outcome.

## Sub-issues
- [ ] #950 [QA] Verify AI Gateway provider UX + attribution headers
- [ ] #951 [QA] Verify transport abstraction + AnthropicTransport wiring
- [x] #962 [QA] Verify hardcoded-home path guard on burn/921 branch
"""


CHILD_BODY_PLURAL = """## Parent
#949

## Branch / checkout
- Validate on `upstream/main` or an equivalent synced checkout.

## Commits landed today
- `b11753879` attribution default_headers for ai-gateway provider
- `700437440` curated picker with live pricing

## Targeted tests
- `tests/hermes_cli/test_ai_gateway_models.py`
- `tests/run_agent/test_provider_attribution_headers.py`

## Tasks
- [ ] Verify the picker ordering.
- [ ] Verify attribution headers.

## Acceptance Criteria
- [ ] Picker shows AI Gateway prominently.
- [ ] Headers appear on OpenRouter calls.
"""


CHILD_BODY_SINGULAR = """## Parent
#949

## Branch / checkout
- Validate on `upstream/main` or an equivalent synced checkout.

## Commit landed today
- `fc21c1420` add buttons to update Hermes and restart gateway

## Files touched
- `web/src/pages/StatusPage.tsx`
- `web/src/lib/api.ts`
- `web/src/i18n/en.ts`

## Tasks
- [ ] Open the Web UI status page and verify both buttons are present.
- [ ] Click Restart Gateway in a safe environment.
"""


def test_discover_child_issue_numbers_from_epic_body():
    assert mrp.discover_child_issue_numbers(EPIC_BODY) == [950, 951, 962]


def test_parse_issue_number_spec_supports_ranges_and_lists():
    assert mrp.parse_issue_number_spec("950-952,955,962") == [950, 951, 952, 955, 962]


def test_parse_child_issue_extracts_structured_sections():
    issue = {
        "number": 950,
        "title": "[QA] Verify AI Gateway provider UX + attribution headers",
        "state": "open",
        "html_url": "https://forge.example/950",
        "comments": 0,
        "body": CHILD_BODY_PLURAL,
    }

    parsed = mrp.parse_child_issue(issue)

    assert parsed.number == 950
    assert parsed.parent_issue == 949
    assert parsed.checkout_notes == ["Validate on `upstream/main` or an equivalent synced checkout."]
    assert [c.sha for c in parsed.commits] == ["b11753879", "700437440"]
    assert parsed.targeted_tests == [
        "tests/hermes_cli/test_ai_gateway_models.py",
        "tests/run_agent/test_provider_attribution_headers.py",
    ]
    assert parsed.tasks == [
        "Verify the picker ordering.",
        "Verify attribution headers.",
    ]
    assert parsed.acceptance_criteria == [
        "Picker shows AI Gateway prominently.",
        "Headers appear on OpenRouter calls.",
    ]


def test_parse_child_issue_handles_singular_commit_heading_and_files_touched():
    issue = {
        "number": 961,
        "title": "[QA] Verify web dashboard update/restart action buttons",
        "state": "closed",
        "html_url": "https://forge.example/961",
        "comments": 16,
        "body": CHILD_BODY_SINGULAR,
    }

    parsed = mrp.parse_child_issue(issue)

    assert [c.sha for c in parsed.commits] == ["fc21c1420"]
    assert parsed.files_touched == [
        "web/src/pages/StatusPage.tsx",
        "web/src/lib/api.ts",
        "web/src/i18n/en.ts",
    ]
    assert parsed.tasks == [
        "Open the Web UI status page and verify both buttons are present.",
        "Click Restart Gateway in a safe environment.",
    ]


def test_build_packet_markdown_renders_summary_and_details():
    epic_issue = {
        "number": 949,
        "title": "EPIC: Morning review packet — Hermes harness features landed 2026-04-21",
        "state": "open",
        "html_url": "https://forge.example/949",
        "body": EPIC_BODY,
    }
    child_a = mrp.parse_child_issue({
        "number": 950,
        "title": "[QA] Verify AI Gateway provider UX + attribution headers",
        "state": "open",
        "html_url": "https://forge.example/950",
        "comments": 0,
        "body": CHILD_BODY_PLURAL,
    })
    child_b = mrp.parse_child_issue({
        "number": 961,
        "title": "[QA] Verify web dashboard update/restart action buttons",
        "state": "closed",
        "html_url": "https://forge.example/961",
        "comments": 16,
        "body": CHILD_BODY_SINGULAR,
    })

    markdown = mrp.build_packet_markdown(epic_issue, [child_a, child_b])

    assert "# Morning Review Packet" in markdown
    assert "EPIC: Morning review packet — Hermes harness features landed 2026-04-21" in markdown
    assert "| #950 | open | 2 | 2 |" in markdown
    assert "| #961 | closed | 1 | 0 |" in markdown
    assert "## #950 — [QA] Verify AI Gateway provider UX + attribution headers" in markdown
    assert "## #961 — [QA] Verify web dashboard update/restart action buttons" in markdown
    assert "`b11753879` — attribution default_headers for ai-gateway provider" in markdown
    assert "`web/src/pages/StatusPage.tsx`" in markdown
149
tests/test_tensorzero_eval_packet.py
Normal file
@@ -0,0 +1,149 @@
from pathlib import Path
import sys

SCRIPT_DIR = Path(__file__).resolve().parents[1] / "scripts"
sys.path.insert(0, str(SCRIPT_DIR))

import tensorzero_eval_packet as tz


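# The scanner tests build a synthetic repo tree inside pytest's tmp_path
# fixture, so they never depend on a real Hermes checkout.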
def test_scan_touchpoints_finds_expected_matches(tmp_path):
    (tmp_path / "run_agent.py").write_text(
        "self._fallback_chain = []\n# Provider fallback chain\n"
    )
    (tmp_path / "hermes_cli").mkdir()
    (tmp_path / "hermes_cli" / "runtime_provider.py").write_text(
        "def resolve_runtime_provider():\n return {}\n"
    )
    (tmp_path / "agent").mkdir()
    (tmp_path / "agent" / "smart_model_routing.py").write_text(
        "def resolve_turn_route(user_message, routing_config, primary):\n return primary\n"
    )
    (tmp_path / "gateway").mkdir()
    (tmp_path / "gateway" / "run.py").write_text(
        "def _load_provider_routing():\n return {}\n"
    )
    (tmp_path / "cron").mkdir()
    (tmp_path / "cron" / "scheduler.py").write_text(
        "runtime = resolve_runtime_provider()\nturn_route = resolve_turn_route('x', {}, {})\n"
    )
    (tmp_path / "hermes_state.py").write_text("class SessionDB:\n pass\n")
    (tmp_path / "benchmarks").mkdir()
    (tmp_path / "benchmarks" / "tool_call_benchmark.py").write_text(
        "class ToolCall: ...\n"
    )

    touchpoints = tz.scan_touchpoints(tmp_path)

    labels = {tp.label for tp in touchpoints}
    assert "fallback_chain" in labels
    assert "runtime_provider" in labels
    assert "smart_model_routing" in labels
    assert "gateway_provider_routing" in labels
    assert "cron_runtime_provider" in labels
    assert "session_db" in labels
    assert "benchmark_suite" in labels


def test_build_requirement_matrix_marks_canary_as_gap_without_split_support():
    touchpoints = [
        tz.Touchpoint(
            label="runtime_provider",
            file_path="hermes_cli/runtime_provider.py",
            line_number=10,
            matched_text="def resolve_runtime_provider",
        ),
        tz.Touchpoint(
            label="provider_routing_config",
            file_path="cli.py",
            line_number=20,
            matched_text='provider_routing',
        ),
        tz.Touchpoint(
            label="fallback_chain",
            file_path="run_agent.py",
            line_number=21,
            matched_text='_fallback_chain = []',
        ),
        tz.Touchpoint(
            label="smart_model_routing",
            file_path="agent/smart_model_routing.py",
            line_number=30,
            matched_text='resolve_turn_route',
        ),
        tz.Touchpoint(
            label="gateway_provider_routing",
            file_path="gateway/run.py",
            line_number=35,
            matched_text='def _load_provider_routing',
        ),
        tz.Touchpoint(
            label="cron_runtime_provider",
            file_path="cron/scheduler.py",
            line_number=36,
            matched_text='runtime = resolve_runtime_provider()',
        ),
        tz.Touchpoint(
            label="session_db",
            file_path="hermes_state.py",
            line_number=40,
            matched_text='class SessionDB',
        ),
        tz.Touchpoint(
            label="trajectory_export",
            file_path="batch_runner.py",
            line_number=50,
            matched_text='trajectory_entry',
        ),
        tz.Touchpoint(
            label="benchmark_suite",
            file_path="benchmarks/tool_call_benchmark.py",
            line_number=60,
            matched_text='ToolCall',
        ),
    ]

    matrix = tz.build_requirement_matrix(touchpoints)
    by_key = {row.key: row for row in matrix}

    assert by_key["gateway_replacement"].status == "partial"
    assert by_key["config_migration"].status == "partial"
    assert by_key["canary_rollout"].status == "gap"
    assert by_key["session_feedback"].status == "partial"
    assert by_key["evaluation_suite"].status == "partial"


def test_build_markdown_renders_recommendation_and_touchpoints():
    touchpoints = [
        tz.Touchpoint(
            label="runtime_provider",
            file_path="hermes_cli/runtime_provider.py",
            line_number=10,
            matched_text="def resolve_runtime_provider",
        ),
        tz.Touchpoint(
            label="session_db",
            file_path="hermes_state.py",
            line_number=40,
            matched_text='class SessionDB',
        ),
    ]
    matrix = tz.build_requirement_matrix(touchpoints)
    report = tz.build_report(touchpoints, matrix)
    markdown = tz.build_markdown(report)

    assert "# TensorZero Evaluation Packet" in markdown
    assert "gateway_replacement" not in markdown  # human labels, not raw keys
    assert "Gateway replacement scope" in markdown
    assert "Not ready for direct replacement" in markdown
    assert "hermes_cli/runtime_provider.py:10" in markdown
    assert "hermes_state.py:40" in markdown


def test_issue_context_is_embedded_in_report():
    report = tz.build_report([], [])
    markdown = tz.build_markdown(report)

    assert "Issue #860" in markdown
    assert "tensorzero" in markdown.lower()
    assert "10% traffic" in markdown