Compare commits
1 Commits
| Author | SHA1 | Date | |
|---|---|---|---|
|
|
755e7513a1 |
1131
docs/evaluations/tensorzero-860-evaluation.json
Normal file
1131
docs/evaluations/tensorzero-860-evaluation.json
Normal file
File diff suppressed because it is too large
Load Diff
217
docs/evaluations/tensorzero-860-evaluation.md
Normal file
217
docs/evaluations/tensorzero-860-evaluation.md
Normal file
@@ -0,0 +1,217 @@
|
||||
# TensorZero Evaluation Packet
|
||||
|
||||
Issue #860: [tensorzero LLMOps platform evaluation](https://forge.alexanderwhitestone.com/Timmy_Foundation/hermes-agent/issues/860)
|
||||
|
||||
## Scope
|
||||
|
||||
This packet evaluates TensorZero as a possible replacement for Hermes' custom provider-routing stack.
|
||||
It is intentionally grounded in the current repo state rather than a speculative cutover plan.
|
||||
|
||||
## Issue requirements being evaluated
|
||||
|
||||
- Deploy tensorzero gateway (Rust binary)
|
||||
- Migrate provider routing config
|
||||
- Test with canary (10% traffic) before full cutover
|
||||
- Feed session data for prompt optimization
|
||||
- Evaluation suite for A/B testing models
|
||||
|
||||
## Recommendation
|
||||
|
||||
Not ready for direct replacement. Recommend a shadow-evaluation phase first: keep Hermes routing live, inventory the migration seams, export SessionDB/trajectory data into an offline TensorZero experiment loop, and only design a canary gateway once percentage-based rollout controls exist.
|
||||
|
||||
## Requirement matrix
|
||||
|
||||
| Requirement | Status | Evidence labels | Summary |
|
||||
| --- | --- | --- | --- |
|
||||
| Gateway replacement scope | partial | fallback_chain, runtime_provider, gateway_provider_routing, cron_runtime_provider, auxiliary_fallback_chain, delegate_runtime_provider | Hermes already spreads provider routing across core agent, runtime provider, gateway, cron, auxiliary, and delegation seams; TensorZero would need parity across all of them before it can replace the gateway layer. |
|
||||
| Config migration | partial | provider_routing_config, runtime_provider, smart_model_routing, fallback_chain | Hermes has multiple config concepts to migrate (`provider_routing`, `fallback_providers`, `smart_model_routing`, runtime provider resolution), so TensorZero is not a drop-in config swap. |
|
||||
| 10% traffic canary | gap | — | The repo shows semantic routing and fallback, but no grounded 10% traffic-split canary mechanism. A TensorZero cutover would need new percentage-based rollout controls and observability hooks. |
|
||||
| Session data for prompt optimization | partial | session_db, trajectory_export | Hermes already has SessionDB and trajectory export surfaces that can feed offline optimization data, but not a TensorZero-native ingestion path yet. |
|
||||
| Evaluation suite / A/B testing | partial | benchmark_suite, trajectory_export | Hermes already has benchmark/trajectory machinery that can seed TensorZero A/B evaluation, but no integrated TensorZero experiment runner or live evaluation gateway. |
|
||||
|
||||
## Grounded Hermes touchpoints
|
||||
|
||||
- `run_agent.py:601` — [fallback_chain] fallback_model: Dict[str, Any] = None,
|
||||
- `run_agent.py:995` — [fallback_chain] # failure). Supports both legacy single-dict ``fallback_model`` and
|
||||
- `run_agent.py:996` — [fallback_chain] # new list ``fallback_providers`` format.
|
||||
- `run_agent.py:997` — [fallback_chain] if isinstance(fallback_model, list):
|
||||
- `run_agent.py:998` — [fallback_chain] self._fallback_chain = [
|
||||
- `run_agent.py:999` — [fallback_chain] f for f in fallback_model
|
||||
- `run_agent.py:1002` — [fallback_chain] elif isinstance(fallback_model, dict) and fallback_model.get("provider") and fallback_model.get("model"):
|
||||
- `run_agent.py:1003` — [fallback_chain] self._fallback_chain = [fallback_model]
|
||||
- `run_agent.py:1005` — [fallback_chain] self._fallback_chain = []
|
||||
- `run_agent.py:1009` — [fallback_chain] self._fallback_model = self._fallback_chain[0] if self._fallback_chain else None
|
||||
- `run_agent.py:1010` — [fallback_chain] if self._fallback_chain and not self.quiet_mode:
|
||||
- `run_agent.py:1011` — [fallback_chain] if len(self._fallback_chain) == 1:
|
||||
- `run_agent.py:1012` — [fallback_chain] fb = self._fallback_chain[0]
|
||||
- `run_agent.py:1015` — [fallback_chain] print(f"🔄 Fallback chain ({len(self._fallback_chain)} providers): " +
|
||||
- `run_agent.py:1016` — [fallback_chain] " → ".join(f"{f['model']} ({f['provider']})" for f in self._fallback_chain))
|
||||
- `run_agent.py:5624` — [fallback_chain] if self._fallback_index >= len(self._fallback_chain):
|
||||
- `run_agent.py:5627` — [fallback_chain] fb = self._fallback_chain[self._fallback_index]
|
||||
- `run_agent.py:8559` — [fallback_chain] if self._fallback_index < len(self._fallback_chain):
|
||||
- `run_agent.py:9355` — [fallback_chain] if is_rate_limited and self._fallback_index < len(self._fallback_chain):
|
||||
- `run_agent.py:10460` — [fallback_chain] if _truly_empty and self._fallback_chain:
|
||||
- `run_agent.py:10514` — [fallback_chain] + (" and fallback attempts." if self._fallback_chain else
|
||||
- `cli.py:241` — [provider_routing_config] "smart_model_routing": {
|
||||
- `cli.py:370` — [provider_routing_config] # (e.g. platform_toolsets, provider_routing, memory, honcho, etc.)
|
||||
- `cli.py:1753` — [provider_routing_config] pr = CLI_CONFIG.get("provider_routing", {}) or {}
|
||||
- `cli.py:1762` — [provider_routing_config] # Supports new list format (fallback_providers) and legacy single-dict (fallback_model).
|
||||
- `cli.py:1763` — [provider_routing_config] fb = CLI_CONFIG.get("fallback_providers") or CLI_CONFIG.get("fallback_model") or []
|
||||
- `cli.py:1770` — [provider_routing_config] self._smart_model_routing = CLI_CONFIG.get("smart_model_routing", {}) or {}
|
||||
- `cli.py:2771` — [provider_routing_config] from agent.smart_model_routing import resolve_turn_route
|
||||
- `cli.py:2776` — [provider_routing_config] self._smart_model_routing,
|
||||
- `hermes_cli/runtime_provider.py:209` — [runtime_provider] def resolve_requested_provider(requested: Optional[str] = None) -> str:
|
||||
- `hermes_cli/runtime_provider.py:649` — [runtime_provider] def resolve_runtime_provider(
|
||||
- `agent/smart_model_routing.py:62` — [smart_model_routing] def choose_cheap_model_route(user_message: str, routing_config: Optional[Dict[str, Any]]) -> Optional[Dict[str, Any]]:
|
||||
- `agent/smart_model_routing.py:110` — [smart_model_routing] def resolve_turn_route(user_message: str, routing_config: Optional[Dict[str, Any]], primary: Dict[str, Any]) -> Dict[str, Any]:
|
||||
- `gateway/run.py:1271` — [gateway_provider_routing] def _load_provider_routing() -> dict:
|
||||
- `gateway/run.py:1285` — [gateway_provider_routing] def _load_fallback_model() -> list | dict | None:
|
||||
- `gateway/run.py:1306` — [gateway_provider_routing] def _load_smart_model_routing() -> dict:
|
||||
- `cron/scheduler.py:684` — [cron_runtime_provider] pr = _cfg.get("provider_routing", {})
|
||||
- `cron/scheduler.py:688` — [cron_runtime_provider] resolve_runtime_provider,
|
||||
- `cron/scheduler.py:697` — [cron_runtime_provider] runtime = resolve_runtime_provider(**runtime_kwargs)
|
||||
- `cron/scheduler.py:702` — [cron_runtime_provider] from agent.smart_model_routing import resolve_turn_route
|
||||
- `cron/scheduler.py:703` — [cron_runtime_provider] turn_route = resolve_turn_route(
|
||||
- `cron/scheduler.py:717` — [cron_runtime_provider] fallback_model = _cfg.get("fallback_providers") or _cfg.get("fallback_model") or None
|
||||
- `cron/scheduler.py:746` — [cron_runtime_provider] fallback_model=fallback_model,
|
||||
- `agent/auxiliary_client.py:1018` — [auxiliary_fallback_chain] def _get_provider_chain() -> List[tuple]:
|
||||
- `agent/auxiliary_client.py:1107` — [auxiliary_fallback_chain] for label, try_fn in _get_provider_chain():
|
||||
- `agent/auxiliary_client.py:1189` — [auxiliary_fallback_chain] # ── Step 2: aggregator / fallback chain ──────────────────────────────
|
||||
- `agent/auxiliary_client.py:1191` — [auxiliary_fallback_chain] for label, try_fn in _get_provider_chain():
|
||||
- `agent/auxiliary_client.py:2397` — [auxiliary_fallback_chain] # error, fall through to the fallback chain below.
|
||||
- `agent/auxiliary_client.py:2417` — [auxiliary_fallback_chain] # auto (the default) = best-effort fallback chain. (#7559)
|
||||
- `agent/auxiliary_client.py:2589` — [auxiliary_fallback_chain] # error, fall through to the fallback chain below.
|
||||
- `tools/delegate_tool.py:662` — [delegate_runtime_provider] # bundle (base_url, api_key, api_mode) via the same runtime provider system
|
||||
- `tools/delegate_tool.py:854` — [delegate_runtime_provider] provider) is resolved via the runtime provider system — the same path used
|
||||
- `tools/delegate_tool.py:909` — [delegate_runtime_provider] from hermes_cli.runtime_provider import resolve_runtime_provider
|
||||
- `tools/delegate_tool.py:910` — [delegate_runtime_provider] runtime = resolve_runtime_provider(requested=configured_provider)
|
||||
- `hermes_state.py:115` — [session_db] class SessionDB:
|
||||
- `batch_runner.py:320` — [trajectory_export] save_trajectories=False, # We handle saving ourselves
|
||||
- `batch_runner.py:346` — [trajectory_export] trajectory = agent._convert_to_trajectory_format(
|
||||
- `batch_runner.py:460` — [trajectory_export] trajectory_entry = {
|
||||
- `batch_runner.py:474` — [trajectory_export] f.write(json.dumps(trajectory_entry, ensure_ascii=False) + "\n")
|
||||
- `benchmarks/tool_call_benchmark.py:3` — [benchmark_suite] Tool-Calling Benchmark — Gemma 4 vs mimo-v2-pro regression test.
|
||||
- `benchmarks/tool_call_benchmark.py:9` — [benchmark_suite] python3 benchmarks/tool_call_benchmark.py # full 100-call suite
|
||||
- `benchmarks/tool_call_benchmark.py:10` — [benchmark_suite] python3 benchmarks/tool_call_benchmark.py --limit 10 # quick smoke test
|
||||
- `benchmarks/tool_call_benchmark.py:11` — [benchmark_suite] python3 benchmarks/tool_call_benchmark.py --models nous # single model
|
||||
- `benchmarks/tool_call_benchmark.py:12` — [benchmark_suite] python3 benchmarks/tool_call_benchmark.py --category file # single category
|
||||
- `benchmarks/tool_call_benchmark.py:37` — [benchmark_suite] class ToolCall:
|
||||
- `benchmarks/tool_call_benchmark.py:51` — [benchmark_suite] ToolCall("file-01", "file", "Read the file /tmp/test_bench.txt and show me its contents.",
|
||||
- `benchmarks/tool_call_benchmark.py:53` — [benchmark_suite] ToolCall("file-02", "file", "Write 'hello benchmark' to /tmp/test_bench_out.txt",
|
||||
- `benchmarks/tool_call_benchmark.py:55` — [benchmark_suite] ToolCall("file-03", "file", "Search for the word 'import' in all Python files in the current directory.",
|
||||
- `benchmarks/tool_call_benchmark.py:57` — [benchmark_suite] ToolCall("file-04", "file", "Read lines 1-20 of /etc/hosts",
|
||||
- `benchmarks/tool_call_benchmark.py:59` — [benchmark_suite] ToolCall("file-05", "file", "Patch /tmp/test_bench_out.txt: replace 'hello' with 'goodbye'",
|
||||
- `benchmarks/tool_call_benchmark.py:61` — [benchmark_suite] ToolCall("file-06", "file", "Search for files matching *.py in the current directory.",
|
||||
- `benchmarks/tool_call_benchmark.py:63` — [benchmark_suite] ToolCall("file-07", "file", "Read the first 10 lines of /etc/passwd",
|
||||
- `benchmarks/tool_call_benchmark.py:65` — [benchmark_suite] ToolCall("file-08", "file", "Write a JSON config to /tmp/bench_config.json with key 'debug': true",
|
||||
- `benchmarks/tool_call_benchmark.py:67` — [benchmark_suite] ToolCall("file-09", "file", "Search for 'def test_' in Python test files.",
|
||||
- `benchmarks/tool_call_benchmark.py:69` — [benchmark_suite] ToolCall("file-10", "file", "Read /tmp/bench_config.json and tell me what's in it.",
|
||||
- `benchmarks/tool_call_benchmark.py:71` — [benchmark_suite] ToolCall("file-11", "file", "Create a file /tmp/bench_readme.md with one line: '# Benchmark'",
|
||||
- `benchmarks/tool_call_benchmark.py:73` — [benchmark_suite] ToolCall("file-12", "file", "Search for 'TODO' comments in all .py files.",
|
||||
- `benchmarks/tool_call_benchmark.py:75` — [benchmark_suite] ToolCall("file-13", "file", "Read /tmp/bench_readme.md",
|
||||
- `benchmarks/tool_call_benchmark.py:77` — [benchmark_suite] ToolCall("file-14", "file", "Patch /tmp/bench_readme.md: replace '# Benchmark' with '# Tool Benchmark'",
|
||||
- `benchmarks/tool_call_benchmark.py:78` — [benchmark_suite] "patch", "Tool Benchmark"),
|
||||
- `benchmarks/tool_call_benchmark.py:79` — [benchmark_suite] ToolCall("file-15", "file", "Write a Python one-liner to /tmp/bench_hello.py that prints hello.",
|
||||
- `benchmarks/tool_call_benchmark.py:81` — [benchmark_suite] ToolCall("file-16", "file", "Search for all .json files in /tmp/.",
|
||||
- `benchmarks/tool_call_benchmark.py:83` — [benchmark_suite] ToolCall("file-17", "file", "Read /tmp/bench_hello.py and verify it has print('hello').",
|
||||
- `benchmarks/tool_call_benchmark.py:85` — [benchmark_suite] ToolCall("file-18", "file", "Patch /tmp/bench_hello.py to print 'hello world' instead of 'hello'.",
|
||||
- `benchmarks/tool_call_benchmark.py:87` — [benchmark_suite] ToolCall("file-19", "file", "List files matching 'bench*' in /tmp/.",
|
||||
- `benchmarks/tool_call_benchmark.py:89` — [benchmark_suite] ToolCall("file-20", "file", "Read /tmp/test_bench.txt again and summarize its contents.",
|
||||
- `benchmarks/tool_call_benchmark.py:93` — [benchmark_suite] ToolCall("term-01", "terminal", "Run `echo hello world` in the terminal.",
|
||||
- `benchmarks/tool_call_benchmark.py:95` — [benchmark_suite] ToolCall("term-02", "terminal", "Run `date` to get the current date and time.",
|
||||
- `benchmarks/tool_call_benchmark.py:97` — [benchmark_suite] ToolCall("term-03", "terminal", "Run `uname -a` to get system information.",
|
||||
- `benchmarks/tool_call_benchmark.py:99` — [benchmark_suite] ToolCall("term-04", "terminal", "Run `pwd` to show the current directory.",
|
||||
- `benchmarks/tool_call_benchmark.py:101` — [benchmark_suite] ToolCall("term-05", "terminal", "Run `ls -la /tmp/ | head -20` to list temp files.",
|
||||
- `benchmarks/tool_call_benchmark.py:103` — [benchmark_suite] ToolCall("term-06", "terminal", "Run `whoami` to show the current user.",
|
||||
- `benchmarks/tool_call_benchmark.py:105` — [benchmark_suite] ToolCall("term-07", "terminal", "Run `df -h` to show disk usage.",
|
||||
- `benchmarks/tool_call_benchmark.py:107` — [benchmark_suite] ToolCall("term-08", "terminal", "Run `python3 --version` to check Python version.",
|
||||
- `benchmarks/tool_call_benchmark.py:109` — [benchmark_suite] ToolCall("term-09", "terminal", "Run `cat /etc/hostname` to get the hostname.",
|
||||
- `benchmarks/tool_call_benchmark.py:111` — [benchmark_suite] ToolCall("term-10", "terminal", "Run `uptime` to see system uptime.",
|
||||
- `benchmarks/tool_call_benchmark.py:113` — [benchmark_suite] ToolCall("term-11", "terminal", "Run `env | grep PATH` to show the PATH variable.",
|
||||
- `benchmarks/tool_call_benchmark.py:115` — [benchmark_suite] ToolCall("term-12", "terminal", "Run `wc -l /etc/passwd` to count lines.",
|
||||
- `benchmarks/tool_call_benchmark.py:117` — [benchmark_suite] ToolCall("term-13", "terminal", "Run `echo $SHELL` to show the current shell.",
|
||||
- `benchmarks/tool_call_benchmark.py:119` — [benchmark_suite] ToolCall("term-14", "terminal", "Run `free -h || vm_stat` to check memory usage.",
|
||||
- `benchmarks/tool_call_benchmark.py:121` — [benchmark_suite] ToolCall("term-15", "terminal", "Run `id` to show user and group IDs.",
|
||||
- `benchmarks/tool_call_benchmark.py:123` — [benchmark_suite] ToolCall("term-16", "terminal", "Run `hostname` to get the machine hostname.",
|
||||
- `benchmarks/tool_call_benchmark.py:125` — [benchmark_suite] ToolCall("term-17", "terminal", "Run `echo {1..5}` to test brace expansion.",
|
||||
- `benchmarks/tool_call_benchmark.py:127` — [benchmark_suite] ToolCall("term-18", "terminal", "Run `seq 1 5` to generate a number sequence.",
|
||||
- `benchmarks/tool_call_benchmark.py:129` — [benchmark_suite] ToolCall("term-19", "terminal", "Run `python3 -c 'print(2+2)'` to compute 2+2.",
|
||||
- `benchmarks/tool_call_benchmark.py:131` — [benchmark_suite] ToolCall("term-20", "terminal", "Run `ls -d /tmp/bench* 2>/dev/null | wc -l` to count bench files.",
|
||||
- `benchmarks/tool_call_benchmark.py:135` — [benchmark_suite] ToolCall("code-01", "code", "Execute a Python script that computes factorial of 10.",
|
||||
- `benchmarks/tool_call_benchmark.py:137` — [benchmark_suite] ToolCall("code-02", "code", "Run Python to read /tmp/test_bench.txt and count its words.",
|
||||
- `benchmarks/tool_call_benchmark.py:139` — [benchmark_suite] ToolCall("code-03", "code", "Execute Python to generate the first 20 Fibonacci numbers.",
|
||||
- `benchmarks/tool_call_benchmark.py:141` — [benchmark_suite] ToolCall("code-04", "code", "Run Python to parse JSON from a string and print keys.",
|
||||
- `benchmarks/tool_call_benchmark.py:143` — [benchmark_suite] ToolCall("code-05", "code", "Execute Python to list all files in /tmp/ matching 'bench*'.",
|
||||
- `benchmarks/tool_call_benchmark.py:145` — [benchmark_suite] ToolCall("code-06", "code", "Run Python to compute the sum of squares from 1 to 100.",
|
||||
- `benchmarks/tool_call_benchmark.py:147` — [benchmark_suite] ToolCall("code-07", "code", "Execute Python to check if 'racecar' is a palindrome.",
|
||||
- `benchmarks/tool_call_benchmark.py:149` — [benchmark_suite] ToolCall("code-08", "code", "Run Python to create a CSV string with 5 rows of sample data.",
|
||||
- `benchmarks/tool_call_benchmark.py:151` — [benchmark_suite] ToolCall("code-09", "code", "Execute Python to sort a list [5,2,8,1,9] and print the result.",
|
||||
- `benchmarks/tool_call_benchmark.py:153` — [benchmark_suite] ToolCall("code-10", "code", "Run Python to count lines in /etc/passwd.",
|
||||
- `benchmarks/tool_call_benchmark.py:155` — [benchmark_suite] ToolCall("code-11", "code", "Execute Python to hash the string 'benchmark' with SHA256.",
|
||||
- `benchmarks/tool_call_benchmark.py:157` — [benchmark_suite] ToolCall("code-12", "code", "Run Python to get the current UTC timestamp.",
|
||||
- `benchmarks/tool_call_benchmark.py:159` — [benchmark_suite] ToolCall("code-13", "code", "Execute Python to convert 'hello world' to uppercase and reverse it.",
|
||||
- `benchmarks/tool_call_benchmark.py:161` — [benchmark_suite] ToolCall("code-14", "code", "Run Python to create a dictionary of system info (platform, python version).",
|
||||
- `benchmarks/tool_call_benchmark.py:163` — [benchmark_suite] ToolCall("code-15", "code", "Execute Python to check internet connectivity by resolving google.com.",
|
||||
- `benchmarks/tool_call_benchmark.py:167` — [benchmark_suite] ToolCall("deleg-01", "delegate", "Use a subagent to find all .log files in /tmp/.",
|
||||
- `benchmarks/tool_call_benchmark.py:169` — [benchmark_suite] ToolCall("deleg-02", "delegate", "Delegate to a subagent: what is 15 * 37?",
|
||||
- `benchmarks/tool_call_benchmark.py:171` — [benchmark_suite] ToolCall("deleg-03", "delegate", "Use a subagent to check if Python 3 is installed and its version.",
|
||||
- `benchmarks/tool_call_benchmark.py:173` — [benchmark_suite] ToolCall("deleg-04", "delegate", "Delegate: read /tmp/test_bench.txt and summarize it in one sentence.",
|
||||
- `benchmarks/tool_call_benchmark.py:175` — [benchmark_suite] ToolCall("deleg-05", "delegate", "Use a subagent to list the contents of /tmp/ directory.",
|
||||
- `benchmarks/tool_call_benchmark.py:177` — [benchmark_suite] ToolCall("deleg-06", "delegate", "Delegate: count the number of .py files in the current directory.",
|
||||
- `benchmarks/tool_call_benchmark.py:179` — [benchmark_suite] ToolCall("deleg-07", "delegate", "Use a subagent to check disk space with df -h.",
|
||||
- `benchmarks/tool_call_benchmark.py:181` — [benchmark_suite] ToolCall("deleg-08", "delegate", "Delegate: what OS are we running on?",
|
||||
- `benchmarks/tool_call_benchmark.py:183` — [benchmark_suite] ToolCall("deleg-09", "delegate", "Use a subagent to find the hostname of this machine.",
|
||||
- `benchmarks/tool_call_benchmark.py:185` — [benchmark_suite] ToolCall("deleg-10", "delegate", "Delegate: create a temp file /tmp/bench_deleg.txt with 'done'.",
|
||||
- `benchmarks/tool_call_benchmark.py:189` — [benchmark_suite] ToolCall("todo-01", "todo", "Add a todo item: 'Run benchmark suite'",
|
||||
- `benchmarks/tool_call_benchmark.py:190` — [benchmark_suite] "todo", "benchmark"),
|
||||
- `benchmarks/tool_call_benchmark.py:191` — [benchmark_suite] ToolCall("todo-02", "todo", "Show me the current todo list.",
|
||||
- `benchmarks/tool_call_benchmark.py:193` — [benchmark_suite] ToolCall("todo-03", "todo", "Mark the first todo item as completed.",
|
||||
- `benchmarks/tool_call_benchmark.py:195` — [benchmark_suite] ToolCall("todo-04", "todo", "Add a todo: 'Review benchmark results' with status pending.",
|
||||
- `benchmarks/tool_call_benchmark.py:197` — [benchmark_suite] ToolCall("todo-05", "todo", "Clear all completed todos.",
|
||||
- `benchmarks/tool_call_benchmark.py:199` — [benchmark_suite] ToolCall("todo-06", "memory", "Save this to memory: 'benchmark ran on {date}'".format(
|
||||
- `benchmarks/tool_call_benchmark.py:201` — [benchmark_suite] "memory", "benchmark"),
|
||||
- `benchmarks/tool_call_benchmark.py:202` — [benchmark_suite] ToolCall("todo-07", "memory", "Search memory for 'benchmark'.",
|
||||
- `benchmarks/tool_call_benchmark.py:203` — [benchmark_suite] "memory", "benchmark"),
|
||||
- `benchmarks/tool_call_benchmark.py:204` — [benchmark_suite] ToolCall("todo-08", "memory", "Add a memory note: 'test models are gemma-4 and mimo-v2-pro'.",
|
||||
- `benchmarks/tool_call_benchmark.py:206` — [benchmark_suite] ToolCall("todo-09", "todo", "Add three todo items: 'analyze', 'report', 'cleanup'.",
|
||||
- `benchmarks/tool_call_benchmark.py:208` — [benchmark_suite] ToolCall("todo-10", "memory", "Search memory for any notes about models.",
|
||||
- `benchmarks/tool_call_benchmark.py:212` — [benchmark_suite] ToolCall("skill-01", "skills", "List all available skills.",
|
||||
- `benchmarks/tool_call_benchmark.py:214` — [benchmark_suite] ToolCall("skill-02", "skills", "View the skill called 'test-driven-development'.",
|
||||
- `benchmarks/tool_call_benchmark.py:216` — [benchmark_suite] ToolCall("skill-03", "skills", "Search for skills related to 'git'.",
|
||||
- `benchmarks/tool_call_benchmark.py:218` — [benchmark_suite] ToolCall("skill-04", "skills", "View the 'code-review' skill.",
|
||||
- `benchmarks/tool_call_benchmark.py:220` — [benchmark_suite] ToolCall("skill-05", "skills", "List all skills in the 'devops' category.",
|
||||
- `benchmarks/tool_call_benchmark.py:222` — [benchmark_suite] ToolCall("skill-06", "skills", "View the 'systematic-debugging' skill.",
|
||||
- `benchmarks/tool_call_benchmark.py:224` — [benchmark_suite] ToolCall("skill-07", "skills", "Search for skills about 'testing'.",
|
||||
- `benchmarks/tool_call_benchmark.py:226` — [benchmark_suite] ToolCall("skill-08", "skills", "View the 'writing-plans' skill.",
|
||||
- `benchmarks/tool_call_benchmark.py:228` — [benchmark_suite] ToolCall("skill-09", "skills", "List skills in 'software-development' category.",
|
||||
- `benchmarks/tool_call_benchmark.py:230` — [benchmark_suite] ToolCall("skill-10", "skills", "View the 'pr-review-discipline' skill.",
|
||||
- `benchmarks/tool_call_benchmark.py:234` — [benchmark_suite] ToolCall("file-21", "file", "Write a Python snippet to /tmp/bench_sort.py that sorts [3,1,2].",
|
||||
- `benchmarks/tool_call_benchmark.py:236` — [benchmark_suite] ToolCall("file-22", "file", "Read /tmp/bench_sort.py back and confirm it exists.",
|
||||
- `benchmarks/tool_call_benchmark.py:238` — [benchmark_suite] ToolCall("file-23", "file", "Search for 'class' in all .py files in the benchmarks directory.",
|
||||
- `benchmarks/tool_call_benchmark.py:240` — [benchmark_suite] ToolCall("term-21", "terminal", "Run `cat /etc/os-release 2>/dev/null || sw_vers 2>/dev/null` for OS info.",
|
||||
- `benchmarks/tool_call_benchmark.py:242` — [benchmark_suite] ToolCall("term-22", "terminal", "Run `nproc 2>/dev/null || sysctl -n hw.ncpu 2>/dev/null` for CPU count.",
|
||||
- `benchmarks/tool_call_benchmark.py:244` — [benchmark_suite] ToolCall("code-16", "code", "Execute Python to flatten a nested list [[1,2],[3,4],[5]].",
|
||||
- `benchmarks/tool_call_benchmark.py:246` — [benchmark_suite] ToolCall("code-17", "code", "Run Python to check if a number 17 is prime.",
|
||||
- `benchmarks/tool_call_benchmark.py:248` — [benchmark_suite] ToolCall("deleg-11", "delegate", "Delegate: what is the current working directory?",
|
||||
- `benchmarks/tool_call_benchmark.py:250` — [benchmark_suite] ToolCall("todo-11", "todo", "Add a todo: 'Finalize benchmark report' status pending.",
|
||||
- `benchmarks/tool_call_benchmark.py:252` — [benchmark_suite] ToolCall("todo-12", "memory", "Store fact: 'benchmark categories: file, terminal, code, delegate, todo, memory, skills'.",
|
||||
- `benchmarks/tool_call_benchmark.py:254` — [benchmark_suite] ToolCall("skill-11", "skills", "Search for skills about 'deployment'.",
|
||||
- `benchmarks/tool_call_benchmark.py:256` — [benchmark_suite] ToolCall("skill-12", "skills", "View the 'gitea-burn-cycle' skill.",
|
||||
- `benchmarks/tool_call_benchmark.py:258` — [benchmark_suite] ToolCall("skill-13", "skills", "List all available skill categories.",
|
||||
- `benchmarks/tool_call_benchmark.py:260` — [benchmark_suite] ToolCall("skill-14", "skills", "Search for skills related to 'memory'.",
|
||||
- `benchmarks/tool_call_benchmark.py:262` — [benchmark_suite] ToolCall("skill-15", "skills", "View the 'mimo-swarm' skill.",
|
||||
- `benchmarks/tool_call_benchmark.py:311` — [benchmark_suite] """Create prerequisite files for the benchmark."""
|
||||
- `benchmarks/tool_call_benchmark.py:313` — [benchmark_suite] "This is a benchmark test file.\n"
|
||||
- `benchmarks/tool_call_benchmark.py:349` — [benchmark_suite] "You are a benchmark test runner. Execute the user's request by calling "
|
||||
- `benchmarks/tool_call_benchmark.py:406` — [benchmark_suite] """Generate markdown benchmark report."""
|
||||
- `benchmarks/tool_call_benchmark.py:428` — [benchmark_suite] f"# Tool-Calling Benchmark Report",
|
||||
- `benchmarks/tool_call_benchmark.py:535` — [benchmark_suite] parser = argparse.ArgumentParser(description="Tool-calling benchmark")
|
||||
- `benchmarks/tool_call_benchmark.py:544` — [benchmark_suite] help="Output report path (default: benchmarks/gemma4-tool-calling-YYYY-MM-DD.md)")
|
||||
- `benchmarks/tool_call_benchmark.py:565` — [benchmark_suite] output_path = Path(args.output) if args.output else REPO_ROOT / "benchmarks" / f"gemma4-tool-calling-{date_str}.md"
|
||||
- `benchmarks/tool_call_benchmark.py:575` — [benchmark_suite] print(f"Benchmark: {len(suite)} tests × {len(model_specs)} models = {len(suite) * len(model_specs)} calls")
|
||||
|
||||
## Suggested next slice
|
||||
|
||||
1. Build an exporter that emits SessionDB + trajectory data into a TensorZero-friendly offline dataset.
|
||||
2. Define percentage-based canary controls before attempting any gateway replacement.
|
||||
3. Keep Hermes routing authoritative until TensorZero proves parity across CLI, gateway, cron, auxiliary, and delegation surfaces.
|
||||
@@ -1,66 +0,0 @@
|
||||
# Morning Review Packet Status — #949
|
||||
|
||||
Generated: 2026-04-22T14:57:44.332419+00:00
|
||||
Epic: [EPIC: Morning review packet — Hermes harness features landed 2026-04-21](https://forge.alexanderwhitestone.com/Timmy_Foundation/hermes-agent/issues/949)
|
||||
|
||||
## Summary
|
||||
|
||||
- Child QA issues tracked: 13
|
||||
- Open child issues: 11
|
||||
- Closed child issues: 2
|
||||
- Open child issues already backed by PRs: 7
|
||||
- Open child issues still unowned on forge: 4
|
||||
|
||||
## Child QA Matrix
|
||||
|
||||
| Issue | State | Open PRs | Title |
|
||||
|------:|-------|----------|-------|
|
||||
| #950 | open | — | [QA] Verify AI Gateway provider UX + attribution headers |
|
||||
| #951 | open | — | [QA] Verify transport abstraction + AnthropicTransport wiring |
|
||||
| #952 | open | — | [QA] Verify CLI voice beep toggle |
|
||||
| #953 | open | [#1020](https://forge.alexanderwhitestone.com/Timmy_Foundation/hermes-agent/pulls/1020) | [QA] Verify bundled skill scripts run out of the box |
|
||||
| #954 | open | [#1021](https://forge.alexanderwhitestone.com/Timmy_Foundation/hermes-agent/pulls/1021) | [QA] Verify maps skill guest_house / camp_site / bakery expansion |
|
||||
| #955 | open | — | [QA] Verify KittenTTS local provider end-to-end |
|
||||
| #956 | open | [#1018](https://forge.alexanderwhitestone.com/Timmy_Foundation/hermes-agent/pulls/1018) | [QA] Verify numbered keyboard shortcuts for approval + clarify prompts |
|
||||
| #957 | open | [#1015](https://forge.alexanderwhitestone.com/Timmy_Foundation/hermes-agent/pulls/1015) | [QA] Verify optional adversarial-ux-test skill catalog flow |
|
||||
| #958 | open | [#1016](https://forge.alexanderwhitestone.com/Timmy_Foundation/hermes-agent/pulls/1016) | [QA] Verify /usage account limits in CLI + gateway |
|
||||
| #959 | open | [#1014](https://forge.alexanderwhitestone.com/Timmy_Foundation/hermes-agent/pulls/1014) | [QA] Verify OpenCode-Go curated catalog additions |
|
||||
| #960 | open | [#1017](https://forge.alexanderwhitestone.com/Timmy_Foundation/hermes-agent/pulls/1017) | [QA] Verify patch 'did you mean?' suggestions |
|
||||
| #961 | closed | — | [QA] Verify web dashboard update/restart action buttons |
|
||||
| #962 | closed | — | [QA] Verify hardcoded-home path guard on burn/921 branch |
|
||||
|
||||
## Drift Signals
|
||||
|
||||
forge/main is still catching up to the upstream packet.
|
||||
|
||||
Active PR-backed child lanes:
|
||||
- #953 -> #1020 ([QA] Verify bundled skill scripts run out of the box)
|
||||
- #954 -> #1021 ([QA] Verify maps skill guest_house / camp_site / bakery expansion)
|
||||
- #956 -> #1018 ([QA] Verify numbered keyboard shortcuts for approval + clarify prompts)
|
||||
- #957 -> #1015 ([QA] Verify optional adversarial-ux-test skill catalog flow)
|
||||
- #958 -> #1016 ([QA] Verify /usage account limits in CLI + gateway)
|
||||
- #959 -> #1014 ([QA] Verify OpenCode-Go curated catalog additions)
|
||||
- #960 -> #1017 ([QA] Verify patch 'did you mean?' suggestions)
|
||||
|
||||
## Unowned Open QA Issues
|
||||
|
||||
- #950 [QA] Verify AI Gateway provider UX + attribution headers
|
||||
- #951 [QA] Verify transport abstraction + AnthropicTransport wiring
|
||||
- #952 [QA] Verify CLI voice beep toggle
|
||||
- #955 [QA] Verify KittenTTS local provider end-to-end
|
||||
|
||||
## Decomposition Follow-Ups
|
||||
|
||||
- #965 [open] [EPIC: Morning review packet — Hermes harness features landed 2026-04-21] Phase 1: Landscape Analysis & Scaffolding
|
||||
- #966 [open] [EPIC: Morning review packet — Hermes harness features landed 2026-04-21] Phase 2: Core Logic Implementation
|
||||
- #967 [closed] [EPIC: Morning review packet — Hermes harness features landed 2026-04-21] Phase 3: Poka-yoke Integration & Fleet Verification
|
||||
|
||||
## Conclusion
|
||||
|
||||
Refs #949 only. This epic remains open until every child QA issue has a truthful PASS/FAIL outcome, attached evidence, and any upstream/main versus forge/main drift is resolved or explicitly documented.
|
||||
|
||||
## Regeneration
|
||||
|
||||
```bash
|
||||
python3 scripts/morning_review_packet_status.py --fetch-live --json-out docs/morning-review-packet-2026-04-21.snapshot.json --markdown-out docs/morning-review-packet-2026-04-21-status.md
|
||||
```
|
||||
@@ -1,172 +0,0 @@
|
||||
{
|
||||
"generated_at": "2026-04-22T14:57:44.332419+00:00",
|
||||
"repo": "Timmy_Foundation/hermes-agent",
|
||||
"epic": {
|
||||
"number": 949,
|
||||
"title": "EPIC: Morning review packet \u2014 Hermes harness features landed 2026-04-21",
|
||||
"state": "open",
|
||||
"html_url": "https://forge.alexanderwhitestone.com/Timmy_Foundation/hermes-agent/issues/949"
|
||||
},
|
||||
"children": [
|
||||
{
|
||||
"number": 950,
|
||||
"title": "[QA] Verify AI Gateway provider UX + attribution headers",
|
||||
"state": "open",
|
||||
"html_url": "https://forge.alexanderwhitestone.com/Timmy_Foundation/hermes-agent/issues/950",
|
||||
"open_prs": []
|
||||
},
|
||||
{
|
||||
"number": 951,
|
||||
"title": "[QA] Verify transport abstraction + AnthropicTransport wiring",
|
||||
"state": "open",
|
||||
"html_url": "https://forge.alexanderwhitestone.com/Timmy_Foundation/hermes-agent/issues/951",
|
||||
"open_prs": []
|
||||
},
|
||||
{
|
||||
"number": 952,
|
||||
"title": "[QA] Verify CLI voice beep toggle",
|
||||
"state": "open",
|
||||
"html_url": "https://forge.alexanderwhitestone.com/Timmy_Foundation/hermes-agent/issues/952",
|
||||
"open_prs": []
|
||||
},
|
||||
{
|
||||
"number": 953,
|
||||
"title": "[QA] Verify bundled skill scripts run out of the box",
|
||||
"state": "open",
|
||||
"html_url": "https://forge.alexanderwhitestone.com/Timmy_Foundation/hermes-agent/issues/953",
|
||||
"open_prs": [
|
||||
{
|
||||
"number": 1020,
|
||||
"title": "fix: ship bundled skill scripts executable",
|
||||
"head": "fix/953",
|
||||
"url": "https://forge.alexanderwhitestone.com/Timmy_Foundation/hermes-agent/pulls/1020"
|
||||
}
|
||||
]
|
||||
},
|
||||
{
|
||||
"number": 954,
|
||||
"title": "[QA] Verify maps skill guest_house / camp_site / bakery expansion",
|
||||
"state": "open",
|
||||
"html_url": "https://forge.alexanderwhitestone.com/Timmy_Foundation/hermes-agent/issues/954",
|
||||
"open_prs": [
|
||||
{
|
||||
"number": 1021,
|
||||
"title": "feat: sync maps skill and verify guest_house/camp_site/bakery (#954)",
|
||||
"head": "fix/954",
|
||||
"url": "https://forge.alexanderwhitestone.com/Timmy_Foundation/hermes-agent/pulls/1021"
|
||||
}
|
||||
]
|
||||
},
|
||||
{
|
||||
"number": 955,
|
||||
"title": "[QA] Verify KittenTTS local provider end-to-end",
|
||||
"state": "open",
|
||||
"html_url": "https://forge.alexanderwhitestone.com/Timmy_Foundation/hermes-agent/issues/955",
|
||||
"open_prs": []
|
||||
},
|
||||
{
|
||||
"number": 956,
|
||||
"title": "[QA] Verify numbered keyboard shortcuts for approval + clarify prompts",
|
||||
"state": "open",
|
||||
"html_url": "https://forge.alexanderwhitestone.com/Timmy_Foundation/hermes-agent/issues/956",
|
||||
"open_prs": [
|
||||
{
|
||||
"number": 1018,
|
||||
"title": "fix: add numbered approval and clarify shortcuts (#956)",
|
||||
"head": "fix/956",
|
||||
"url": "https://forge.alexanderwhitestone.com/Timmy_Foundation/hermes-agent/pulls/1018"
|
||||
}
|
||||
]
|
||||
},
|
||||
{
|
||||
"number": 957,
|
||||
"title": "[QA] Verify optional adversarial-ux-test skill catalog flow",
|
||||
"state": "open",
|
||||
"html_url": "https://forge.alexanderwhitestone.com/Timmy_Foundation/hermes-agent/issues/957",
|
||||
"open_prs": [
|
||||
{
|
||||
"number": 1015,
|
||||
"title": "feat(skills): backport adversarial-ux-test optional skill",
|
||||
"head": "fix/957",
|
||||
"url": "https://forge.alexanderwhitestone.com/Timmy_Foundation/hermes-agent/pulls/1015"
|
||||
}
|
||||
]
|
||||
},
|
||||
{
|
||||
"number": 958,
|
||||
"title": "[QA] Verify /usage account limits in CLI + gateway",
|
||||
"state": "open",
|
||||
"html_url": "https://forge.alexanderwhitestone.com/Timmy_Foundation/hermes-agent/issues/958",
|
||||
"open_prs": [
|
||||
{
|
||||
"number": 1016,
|
||||
"title": "fix: restore /usage account limits in CLI + gateway (#958)",
|
||||
"head": "fix/958",
|
||||
"url": "https://forge.alexanderwhitestone.com/Timmy_Foundation/hermes-agent/pulls/1016"
|
||||
}
|
||||
]
|
||||
},
|
||||
{
|
||||
"number": 959,
|
||||
"title": "[QA] Verify OpenCode-Go curated catalog additions",
|
||||
"state": "open",
|
||||
"html_url": "https://forge.alexanderwhitestone.com/Timmy_Foundation/hermes-agent/issues/959",
|
||||
"open_prs": [
|
||||
{
|
||||
"number": 1014,
|
||||
"title": "fix(opencode-go): restore curated catalog additions",
|
||||
"head": "fix/959",
|
||||
"url": "https://forge.alexanderwhitestone.com/Timmy_Foundation/hermes-agent/pulls/1014"
|
||||
}
|
||||
]
|
||||
},
|
||||
{
|
||||
"number": 960,
|
||||
"title": "[QA] Verify patch 'did you mean?' suggestions",
|
||||
"state": "open",
|
||||
"html_url": "https://forge.alexanderwhitestone.com/Timmy_Foundation/hermes-agent/issues/960",
|
||||
"open_prs": [
|
||||
{
|
||||
"number": 1017,
|
||||
"title": "fix(patch): port and verify did-you-mean suggestions (#960)",
|
||||
"head": "fix/960",
|
||||
"url": "https://forge.alexanderwhitestone.com/Timmy_Foundation/hermes-agent/pulls/1017"
|
||||
}
|
||||
]
|
||||
},
|
||||
{
|
||||
"number": 961,
|
||||
"title": "[QA] Verify web dashboard update/restart action buttons",
|
||||
"state": "closed",
|
||||
"html_url": "https://forge.alexanderwhitestone.com/Timmy_Foundation/hermes-agent/issues/961",
|
||||
"open_prs": []
|
||||
},
|
||||
{
|
||||
"number": 962,
|
||||
"title": "[QA] Verify hardcoded-home path guard on burn/921 branch",
|
||||
"state": "closed",
|
||||
"html_url": "https://forge.alexanderwhitestone.com/Timmy_Foundation/hermes-agent/issues/962",
|
||||
"open_prs": []
|
||||
}
|
||||
],
|
||||
"decomposition_issues": [
|
||||
{
|
||||
"number": 965,
|
||||
"title": "[EPIC: Morning review packet \u2014 Hermes harness features landed 2026-04-21] Phase 1: Landscape Analysis & Scaffolding",
|
||||
"state": "open",
|
||||
"html_url": "https://forge.alexanderwhitestone.com/Timmy_Foundation/hermes-agent/issues/965"
|
||||
},
|
||||
{
|
||||
"number": 966,
|
||||
"title": "[EPIC: Morning review packet \u2014 Hermes harness features landed 2026-04-21] Phase 2: Core Logic Implementation",
|
||||
"state": "open",
|
||||
"html_url": "https://forge.alexanderwhitestone.com/Timmy_Foundation/hermes-agent/issues/966"
|
||||
},
|
||||
{
|
||||
"number": 967,
|
||||
"title": "[EPIC: Morning review packet \u2014 Hermes harness features landed 2026-04-21] Phase 3: Poka-yoke Integration & Fleet Verification",
|
||||
"state": "closed",
|
||||
"html_url": "https://forge.alexanderwhitestone.com/Timmy_Foundation/hermes-agent/issues/967"
|
||||
}
|
||||
]
|
||||
}
|
||||
@@ -1,288 +0,0 @@
|
||||
#!/usr/bin/env python3
|
||||
"""Generate a grounded status report for hermes-agent morning review packet epic #949."""
|
||||
|
||||
from __future__ import annotations
|
||||
|
||||
import argparse
|
||||
import base64
|
||||
import json
|
||||
import os
|
||||
import re
|
||||
import ssl
|
||||
import urllib.request
|
||||
from datetime import datetime, timezone
|
||||
from pathlib import Path
|
||||
from typing import Any
|
||||
|
||||
BASE_API = "https://forge.alexanderwhitestone.com/api/v1"
|
||||
REPO = "Timmy_Foundation/hermes-agent"
|
||||
TOKEN_PATH = Path("~/.config/gitea/token").expanduser()
|
||||
DEFAULT_JSON_OUT = Path("docs/morning-review-packet-2026-04-21.snapshot.json")
|
||||
DEFAULT_MARKDOWN_OUT = Path("docs/morning-review-packet-2026-04-21-status.md")
|
||||
|
||||
|
||||
def extract_issue_numbers(text: str) -> list[int]:
|
||||
seen: set[int] = set()
|
||||
numbers: list[int] = []
|
||||
for match in re.finditer(r"#(\d+)", text or ""):
|
||||
num = int(match.group(1))
|
||||
if num not in seen:
|
||||
seen.add(num)
|
||||
numbers.append(num)
|
||||
return numbers
|
||||
|
||||
|
||||
def _auth_headers(token: str) -> list[dict[str, str]]:
|
||||
basic = base64.b64encode(f"{token}:".encode()).decode()
|
||||
return [
|
||||
{"Authorization": f"token {token}", "Accept": "application/json"},
|
||||
{"Authorization": f"Basic {basic}", "Accept": "application/json"},
|
||||
]
|
||||
|
||||
|
||||
def api_get(path: str, *, headers_options: list[dict[str, str]] | None = None) -> Any:
|
||||
token = TOKEN_PATH.read_text(encoding="utf-8").strip()
|
||||
headers_options = headers_options or _auth_headers(token)
|
||||
ctx = ssl.create_default_context()
|
||||
url = f"{BASE_API}{path}"
|
||||
last_error: Exception | None = None
|
||||
for headers in headers_options:
|
||||
try:
|
||||
req = urllib.request.Request(url, headers=headers)
|
||||
with urllib.request.urlopen(req, context=ctx, timeout=30) as resp:
|
||||
return json.loads(resp.read().decode())
|
||||
except Exception as exc: # pragma: no cover - exercised via live CLI use
|
||||
last_error = exc
|
||||
raise RuntimeError(f"GET {url} failed: {last_error}")
|
||||
|
||||
|
||||
def issue_pr_matches(pr: dict[str, Any], issue_num: int) -> bool:
|
||||
title = pr.get("title") or ""
|
||||
body = pr.get("body") or ""
|
||||
head = (pr.get("head") or {}).get("ref") or ""
|
||||
exact_ref = re.compile(rf"(?<!\d)#{issue_num}(?!\d)")
|
||||
body_ref = re.compile(rf"(?i)(closes|close|fixes|fix|resolves|resolve|refs|ref)\s+#?{issue_num}(?!\d)")
|
||||
branch_variants = {
|
||||
f"fix/{issue_num}",
|
||||
f"issue-{issue_num}",
|
||||
f"burn/{issue_num}",
|
||||
f"fix/issue-{issue_num}",
|
||||
}
|
||||
return bool(
|
||||
exact_ref.search(title)
|
||||
or exact_ref.search(body)
|
||||
or body_ref.search(body)
|
||||
or head in branch_variants
|
||||
)
|
||||
|
||||
|
||||
def fetch_open_prs(*, headers_options: list[dict[str, str]]) -> list[dict[str, Any]]:
|
||||
prs: list[dict[str, Any]] = []
|
||||
page = 1
|
||||
while True:
|
||||
batch = api_get(
|
||||
f"/repos/{REPO}/pulls?state=open&limit=100&page={page}",
|
||||
headers_options=headers_options,
|
||||
)
|
||||
if not batch:
|
||||
break
|
||||
prs.extend(batch)
|
||||
if len(batch) < 100:
|
||||
break
|
||||
page += 1
|
||||
return prs
|
||||
|
||||
|
||||
def fetch_live_snapshot(epic_issue_num: int = 949) -> dict[str, Any]:
|
||||
token = TOKEN_PATH.read_text(encoding="utf-8").strip()
|
||||
headers_options = _auth_headers(token)
|
||||
|
||||
epic = api_get(f"/repos/{REPO}/issues/{epic_issue_num}", headers_options=headers_options)
|
||||
comments = api_get(f"/repos/{REPO}/issues/{epic_issue_num}/comments", headers_options=headers_options)
|
||||
child_numbers = [n for n in extract_issue_numbers(epic.get("body") or "") if n != epic_issue_num]
|
||||
decomposition_numbers = [
|
||||
n
|
||||
for comment in comments
|
||||
for n in extract_issue_numbers(comment.get("body") or "")
|
||||
if n not in child_numbers and n != epic_issue_num
|
||||
]
|
||||
|
||||
open_prs = fetch_open_prs(headers_options=headers_options)
|
||||
|
||||
children = []
|
||||
for number in child_numbers:
|
||||
issue = api_get(f"/repos/{REPO}/issues/{number}", headers_options=headers_options)
|
||||
matching_prs = [
|
||||
{
|
||||
"number": pr["number"],
|
||||
"title": pr["title"],
|
||||
"head": pr.get("head", {}).get("ref", ""),
|
||||
"url": pr["html_url"],
|
||||
}
|
||||
for pr in open_prs
|
||||
if issue_pr_matches(pr, number)
|
||||
]
|
||||
children.append(
|
||||
{
|
||||
"number": issue["number"],
|
||||
"title": issue["title"],
|
||||
"state": issue["state"],
|
||||
"html_url": issue["html_url"],
|
||||
"open_prs": matching_prs,
|
||||
}
|
||||
)
|
||||
|
||||
decomposition_issues = []
|
||||
for number in decomposition_numbers:
|
||||
issue = api_get(f"/repos/{REPO}/issues/{number}", headers_options=headers_options)
|
||||
decomposition_issues.append(
|
||||
{
|
||||
"number": issue["number"],
|
||||
"title": issue["title"],
|
||||
"state": issue["state"],
|
||||
"html_url": issue["html_url"],
|
||||
}
|
||||
)
|
||||
|
||||
return {
|
||||
"generated_at": datetime.now(timezone.utc).isoformat(),
|
||||
"repo": REPO,
|
||||
"epic": {
|
||||
"number": epic["number"],
|
||||
"title": epic["title"],
|
||||
"state": epic["state"],
|
||||
"html_url": epic["html_url"],
|
||||
},
|
||||
"children": children,
|
||||
"decomposition_issues": decomposition_issues,
|
||||
}
|
||||
|
||||
|
||||
def summarize_snapshot(snapshot: dict[str, Any]) -> dict[str, int]:
|
||||
children = snapshot.get("children", [])
|
||||
open_children = [issue for issue in children if issue.get("state") == "open"]
|
||||
closed_children = [issue for issue in children if issue.get("state") == "closed"]
|
||||
open_with_pr = [issue for issue in open_children if issue.get("open_prs")]
|
||||
open_without_pr = [issue for issue in open_children if not issue.get("open_prs")]
|
||||
return {
|
||||
"total_children": len(children),
|
||||
"open_children": len(open_children),
|
||||
"closed_children": len(closed_children),
|
||||
"open_with_pr": len(open_with_pr),
|
||||
"open_without_pr": len(open_without_pr),
|
||||
}
|
||||
|
||||
|
||||
def render_markdown(snapshot: dict[str, Any]) -> str:
|
||||
epic = snapshot["epic"]
|
||||
children = snapshot.get("children", [])
|
||||
summary = summarize_snapshot(snapshot)
|
||||
open_with_pr = [issue for issue in children if issue.get("state") == "open" and issue.get("open_prs")]
|
||||
open_without_pr = [issue for issue in children if issue.get("state") == "open" and not issue.get("open_prs")]
|
||||
decomposition = snapshot.get("decomposition_issues", [])
|
||||
|
||||
lines = [
|
||||
f"# Morning Review Packet Status — #{epic['number']}",
|
||||
"",
|
||||
f"Generated: {snapshot.get('generated_at', '')}",
|
||||
f"Epic: [{epic['title']}]({epic.get('html_url', '')})",
|
||||
"",
|
||||
"## Summary",
|
||||
"",
|
||||
f"- Child QA issues tracked: {summary['total_children']}",
|
||||
f"- Open child issues: {summary['open_children']}",
|
||||
f"- Closed child issues: {summary['closed_children']}",
|
||||
f"- Open child issues already backed by PRs: {summary['open_with_pr']}",
|
||||
f"- Open child issues still unowned on forge: {summary['open_without_pr']}",
|
||||
"",
|
||||
"## Child QA Matrix",
|
||||
"",
|
||||
"| Issue | State | Open PRs | Title |",
|
||||
"|------:|-------|----------|-------|",
|
||||
]
|
||||
|
||||
for issue in children:
|
||||
rendered_prs = []
|
||||
for pr in issue.get("open_prs", []):
|
||||
pr_num = pr.get("number", "?")
|
||||
pr_url = pr.get("url") or pr.get("html_url") or ""
|
||||
rendered_prs.append(f"[#{pr_num}]({pr_url})" if pr_url else f"#{pr_num}")
|
||||
pr_text = ", ".join(rendered_prs) or "—"
|
||||
lines.append(
|
||||
f"| #{issue['number']} | {issue['state']} | {pr_text} | {issue['title']} |"
|
||||
)
|
||||
|
||||
lines.extend([
|
||||
"",
|
||||
"## Drift Signals",
|
||||
"",
|
||||
"forge/main is still catching up to the upstream packet.",
|
||||
])
|
||||
|
||||
if open_with_pr:
|
||||
lines.append("")
|
||||
lines.append("Active PR-backed child lanes:")
|
||||
for issue in open_with_pr:
|
||||
pr_numbers = ", ".join(f"#{pr['number']}" for pr in issue.get("open_prs", []))
|
||||
lines.append(f"- #{issue['number']} -> {pr_numbers} ({issue['title']})")
|
||||
|
||||
if open_without_pr:
|
||||
lines.extend([
|
||||
"",
|
||||
"## Unowned Open QA Issues",
|
||||
"",
|
||||
])
|
||||
for issue in open_without_pr:
|
||||
lines.append(f"- #{issue['number']} {issue['title']}")
|
||||
|
||||
if decomposition:
|
||||
lines.extend([
|
||||
"",
|
||||
"## Decomposition Follow-Ups",
|
||||
"",
|
||||
])
|
||||
for issue in decomposition:
|
||||
lines.append(f"- #{issue['number']} [{issue['state']}] {issue['title']}")
|
||||
|
||||
lines.extend([
|
||||
"",
|
||||
"## Conclusion",
|
||||
"",
|
||||
"Refs #949 only. This epic remains open until every child QA issue has a truthful PASS/FAIL outcome, attached evidence, and any upstream/main versus forge/main drift is resolved or explicitly documented.",
|
||||
"",
|
||||
"## Regeneration",
|
||||
"",
|
||||
"```bash",
|
||||
"python3 scripts/morning_review_packet_status.py --fetch-live --json-out docs/morning-review-packet-2026-04-21.snapshot.json --markdown-out docs/morning-review-packet-2026-04-21-status.md",
|
||||
"```",
|
||||
])
|
||||
|
||||
return "\n".join(lines) + "\n"
|
||||
|
||||
|
||||
def write_json(path: Path, data: dict[str, Any]) -> None:
|
||||
path.parent.mkdir(parents=True, exist_ok=True)
|
||||
path.write_text(json.dumps(data, indent=2) + "\n", encoding="utf-8")
|
||||
|
||||
|
||||
def main() -> None:
|
||||
parser = argparse.ArgumentParser(description="Generate grounded status docs for epic #949")
|
||||
parser.add_argument("--fetch-live", action="store_true", help="Fetch the current packet state from Forge")
|
||||
parser.add_argument("--snapshot", type=Path, help="Read a local JSON snapshot instead of hitting the API")
|
||||
parser.add_argument("--json-out", type=Path, default=DEFAULT_JSON_OUT, help="Path to write JSON snapshot")
|
||||
parser.add_argument("--markdown-out", type=Path, default=DEFAULT_MARKDOWN_OUT, help="Path to write markdown report")
|
||||
args = parser.parse_args()
|
||||
|
||||
if args.fetch_live or not args.snapshot:
|
||||
snapshot = fetch_live_snapshot()
|
||||
else:
|
||||
snapshot = json.loads(args.snapshot.read_text(encoding="utf-8"))
|
||||
|
||||
write_json(args.json_out, snapshot)
|
||||
args.markdown_out.parent.mkdir(parents=True, exist_ok=True)
|
||||
args.markdown_out.write_text(render_markdown(snapshot), encoding="utf-8")
|
||||
print(args.markdown_out)
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
main()
|
||||
318
scripts/tensorzero_eval_packet.py
Normal file
318
scripts/tensorzero_eval_packet.py
Normal file
@@ -0,0 +1,318 @@
|
||||
#!/usr/bin/env python3
|
||||
"""Generate a grounded TensorZero evaluation packet for Hermes.
|
||||
|
||||
This script inventories the current Hermes routing/evaluation surfaces, then
|
||||
builds a markdown packet assessing how much of issue #860 can be satisfied by
|
||||
TensorZero and where the migration risk still lives.
|
||||
"""
|
||||
|
||||
from __future__ import annotations
|
||||
|
||||
import argparse
|
||||
import json
|
||||
import re
|
||||
from dataclasses import asdict, dataclass
|
||||
from pathlib import Path
|
||||
from typing import Iterable
|
||||
|
||||
ISSUE_NUMBER = 860
|
||||
ISSUE_TITLE = "tensorzero LLMOps platform evaluation"
|
||||
ISSUE_URL = "https://forge.alexanderwhitestone.com/Timmy_Foundation/hermes-agent/issues/860"
|
||||
DEFAULT_OUTPUT = Path("docs/evaluations/tensorzero-860-evaluation.md")
|
||||
DEFAULT_JSON_OUTPUT = Path("docs/evaluations/tensorzero-860-evaluation.json")
|
||||
|
||||
|
||||
@dataclass(frozen=True)
|
||||
class TouchpointPattern:
|
||||
label: str
|
||||
file_path: str
|
||||
regex: str
|
||||
description: str
|
||||
|
||||
|
||||
@dataclass(frozen=True)
|
||||
class Touchpoint:
|
||||
label: str
|
||||
file_path: str
|
||||
line_number: int
|
||||
matched_text: str
|
||||
|
||||
|
||||
@dataclass(frozen=True)
|
||||
class RequirementStatus:
|
||||
key: str
|
||||
name: str
|
||||
status: str
|
||||
evidence_labels: tuple[str, ...]
|
||||
summary: str
|
||||
|
||||
|
||||
@dataclass(frozen=True)
|
||||
class EvaluationReport:
|
||||
issue_number: int
|
||||
issue_title: str
|
||||
issue_url: str
|
||||
recommendation: str
|
||||
touchpoints: tuple[Touchpoint, ...]
|
||||
requirements: tuple[RequirementStatus, ...]
|
||||
|
||||
|
||||
PATTERNS: tuple[TouchpointPattern, ...] = (
|
||||
TouchpointPattern(
|
||||
label="fallback_chain",
|
||||
file_path="run_agent.py",
|
||||
regex=r"_fallback_chain|fallback_providers|fallback_model",
|
||||
description="Primary agent fallback-provider chain in the core conversation loop.",
|
||||
),
|
||||
TouchpointPattern(
|
||||
label="provider_routing_config",
|
||||
file_path="cli.py",
|
||||
regex=r"provider_routing|fallback_providers|smart_model_routing",
|
||||
description="CLI-owned provider routing and fallback configuration surfaces.",
|
||||
),
|
||||
TouchpointPattern(
|
||||
label="runtime_provider",
|
||||
file_path="hermes_cli/runtime_provider.py",
|
||||
regex=r"def resolve_runtime_provider|def resolve_requested_provider",
|
||||
description="Central runtime provider resolution for CLI, gateway, cron, and helpers.",
|
||||
),
|
||||
TouchpointPattern(
|
||||
label="smart_model_routing",
|
||||
file_path="agent/smart_model_routing.py",
|
||||
regex=r"def resolve_turn_route|def choose_cheap_model_route",
|
||||
description="Cheap-vs-strong turn routing that TensorZero would need to absorb or replace.",
|
||||
),
|
||||
TouchpointPattern(
|
||||
label="gateway_provider_routing",
|
||||
file_path="gateway/run.py",
|
||||
regex=r"def _load_provider_routing|def _load_fallback_model|def _load_smart_model_routing",
|
||||
description="Gateway-specific loading of routing, fallback, and smart-model policies.",
|
||||
),
|
||||
TouchpointPattern(
|
||||
label="cron_runtime_provider",
|
||||
file_path="cron/scheduler.py",
|
||||
regex=r"resolve_runtime_provider|resolve_turn_route|provider_routing|fallback_model",
|
||||
description="Cron execution path that re-resolves providers and routing on every run.",
|
||||
),
|
||||
TouchpointPattern(
|
||||
label="auxiliary_fallback_chain",
|
||||
file_path="agent/auxiliary_client.py",
|
||||
regex=r"fallback chain|_get_provider_chain|provider chain",
|
||||
description="Auxiliary task routing/fallback chain outside the main inference path.",
|
||||
),
|
||||
TouchpointPattern(
|
||||
label="delegate_runtime_provider",
|
||||
file_path="tools/delegate_tool.py",
|
||||
regex=r"runtime provider system|resolve the full credential bundle|resolve_runtime_provider",
|
||||
description="Subagent/delegation routing path that would also need TensorZero parity.",
|
||||
),
|
||||
TouchpointPattern(
|
||||
label="session_db",
|
||||
file_path="hermes_state.py",
|
||||
regex=r"class SessionDB",
|
||||
description="Session persistence surface that could feed TensorZero optimization/eval data.",
|
||||
),
|
||||
TouchpointPattern(
|
||||
label="trajectory_export",
|
||||
file_path="batch_runner.py",
|
||||
regex=r"trajectory_entry|save_trajectories|_convert_to_trajectory_format",
|
||||
description="Trajectory export surface for offline optimization and replay data.",
|
||||
),
|
||||
TouchpointPattern(
|
||||
label="benchmark_suite",
|
||||
file_path="benchmarks/tool_call_benchmark.py",
|
||||
regex=r"ToolCall\(|class ToolCall|benchmark",
|
||||
description="Existing benchmark/evaluation harness that could map to TensorZero experiments.",
|
||||
),
|
||||
)
|
||||
|
||||
|
||||
def _iter_matches(pattern: TouchpointPattern, text: str) -> Iterable[Touchpoint]:
|
||||
regex = re.compile(pattern.regex, re.IGNORECASE)
|
||||
for line_number, line in enumerate(text.splitlines(), start=1):
|
||||
if regex.search(line):
|
||||
yield Touchpoint(
|
||||
label=pattern.label,
|
||||
file_path=pattern.file_path,
|
||||
line_number=line_number,
|
||||
matched_text=line.strip(),
|
||||
)
|
||||
|
||||
|
||||
def scan_touchpoints(repo_root: Path) -> list[Touchpoint]:
|
||||
touchpoints: list[Touchpoint] = []
|
||||
for pattern in PATTERNS:
|
||||
path = repo_root / pattern.file_path
|
||||
if not path.exists():
|
||||
continue
|
||||
text = path.read_text(encoding="utf-8")
|
||||
touchpoints.extend(_iter_matches(pattern, text))
|
||||
return touchpoints
|
||||
|
||||
|
||||
def build_requirement_matrix(touchpoints: list[Touchpoint]) -> list[RequirementStatus]:
|
||||
labels = {tp.label for tp in touchpoints}
|
||||
|
||||
matrix: list[RequirementStatus] = []
|
||||
gateway_labels = (
|
||||
"fallback_chain",
|
||||
"runtime_provider",
|
||||
"gateway_provider_routing",
|
||||
"cron_runtime_provider",
|
||||
"auxiliary_fallback_chain",
|
||||
"delegate_runtime_provider",
|
||||
)
|
||||
gateway_hits = tuple(label for label in gateway_labels if label in labels)
|
||||
gateway_status = "partial" if len(gateway_hits) >= 4 else "gap"
|
||||
gateway_summary = (
|
||||
"Hermes already spreads provider routing across core agent, runtime provider, gateway, cron, auxiliary, and delegation seams; "
|
||||
"TensorZero would need parity across all of them before it can replace the gateway layer."
|
||||
if gateway_hits else
|
||||
"No grounded routing surfaces were found for a gateway replacement assessment."
|
||||
)
|
||||
matrix.append(RequirementStatus("gateway_replacement", "Gateway replacement scope", gateway_status, gateway_hits, gateway_summary))
|
||||
|
||||
config_labels = (
|
||||
"provider_routing_config",
|
||||
"runtime_provider",
|
||||
"smart_model_routing",
|
||||
"fallback_chain",
|
||||
)
|
||||
config_hits = tuple(label for label in config_labels if label in labels)
|
||||
config_status = "partial" if len(config_hits) >= 3 else "gap"
|
||||
config_summary = (
|
||||
"Hermes has multiple config concepts to migrate (`provider_routing`, `fallback_providers`, `smart_model_routing`, runtime provider resolution), "
|
||||
"so TensorZero is not a drop-in config swap."
|
||||
if config_hits else
|
||||
"No current config migration surface was found."
|
||||
)
|
||||
matrix.append(RequirementStatus("config_migration", "Config migration", config_status, config_hits, config_summary))
|
||||
|
||||
canary_hits: tuple[str, ...] = tuple()
|
||||
canary_summary = (
|
||||
"The repo shows semantic routing and fallback, but no grounded 10% traffic-split canary mechanism. "
|
||||
"A TensorZero cutover would need new percentage-based rollout controls and observability hooks."
|
||||
)
|
||||
matrix.append(RequirementStatus("canary_rollout", "10% traffic canary", "gap", canary_hits, canary_summary))
|
||||
|
||||
session_labels = ("session_db", "trajectory_export")
|
||||
session_hits = tuple(label for label in session_labels if label in labels)
|
||||
session_status = "partial" if len(session_hits) == len(session_labels) else "gap"
|
||||
session_summary = (
|
||||
"Hermes already has SessionDB and trajectory export surfaces that can feed offline optimization data, "
|
||||
"but not a TensorZero-native ingestion path yet."
|
||||
if session_hits else
|
||||
"No session-data surface was found for prompt optimization."
|
||||
)
|
||||
matrix.append(RequirementStatus("session_feedback", "Session data for prompt optimization", session_status, session_hits, session_summary))
|
||||
|
||||
eval_labels = ("benchmark_suite", "trajectory_export")
|
||||
eval_hits = tuple(label for label in eval_labels if label in labels)
|
||||
eval_status = "partial" if "benchmark_suite" in eval_hits else "gap"
|
||||
eval_summary = (
|
||||
"Hermes already has benchmark/trajectory machinery that can seed TensorZero A/B evaluation, "
|
||||
"but no integrated TensorZero experiment runner or live evaluation gateway."
|
||||
if eval_hits else
|
||||
"No evaluation harness was found to support TensorZero A/B testing."
|
||||
)
|
||||
matrix.append(RequirementStatus("evaluation_suite", "Evaluation suite / A/B testing", eval_status, eval_hits, eval_summary))
|
||||
|
||||
return matrix
|
||||
|
||||
|
||||
def build_report(touchpoints: list[Touchpoint], requirement_matrix: list[RequirementStatus]) -> EvaluationReport:
|
||||
recommendation = (
|
||||
"Not ready for direct replacement. Recommend a shadow-evaluation phase first: keep Hermes routing live, "
|
||||
"inventory the migration seams, export SessionDB/trajectory data into an offline TensorZero experiment loop, "
|
||||
"and only design a canary gateway once percentage-based rollout controls exist."
|
||||
)
|
||||
return EvaluationReport(
|
||||
issue_number=ISSUE_NUMBER,
|
||||
issue_title=ISSUE_TITLE,
|
||||
issue_url=ISSUE_URL,
|
||||
recommendation=recommendation,
|
||||
touchpoints=tuple(touchpoints),
|
||||
requirements=tuple(requirement_matrix),
|
||||
)
|
||||
|
||||
|
||||
def build_markdown(report: EvaluationReport) -> str:
|
||||
lines: list[str] = []
|
||||
lines.append("# TensorZero Evaluation Packet")
|
||||
lines.append("")
|
||||
lines.append(f"Issue #{report.issue_number}: [{report.issue_title}]({report.issue_url})")
|
||||
lines.append("")
|
||||
lines.append("## Scope")
|
||||
lines.append("")
|
||||
lines.append("This packet evaluates TensorZero as a possible replacement for Hermes' custom provider-routing stack.")
|
||||
lines.append("It is intentionally grounded in the current repo state rather than a speculative cutover plan.")
|
||||
lines.append("")
|
||||
lines.append("## Issue requirements being evaluated")
|
||||
lines.append("")
|
||||
lines.append("- Deploy tensorzero gateway (Rust binary)")
|
||||
lines.append("- Migrate provider routing config")
|
||||
lines.append("- Test with canary (10% traffic) before full cutover")
|
||||
lines.append("- Feed session data for prompt optimization")
|
||||
lines.append("- Evaluation suite for A/B testing models")
|
||||
lines.append("")
|
||||
lines.append("## Recommendation")
|
||||
lines.append("")
|
||||
lines.append(report.recommendation)
|
||||
lines.append("")
|
||||
lines.append("## Requirement matrix")
|
||||
lines.append("")
|
||||
lines.append("| Requirement | Status | Evidence labels | Summary |")
|
||||
lines.append("| --- | --- | --- | --- |")
|
||||
for row in report.requirements:
|
||||
evidence = ", ".join(row.evidence_labels) if row.evidence_labels else "—"
|
||||
lines.append(f"| {row.name} | {row.status} | {evidence} | {row.summary} |")
|
||||
lines.append("")
|
||||
lines.append("## Grounded Hermes touchpoints")
|
||||
lines.append("")
|
||||
if report.touchpoints:
|
||||
for tp in report.touchpoints:
|
||||
lines.append(f"- `{tp.file_path}:{tp.line_number}` — [{tp.label}] {tp.matched_text}")
|
||||
else:
|
||||
lines.append("- No routing/evaluation touchpoints were found.")
|
||||
lines.append("")
|
||||
lines.append("## Suggested next slice")
|
||||
lines.append("")
|
||||
lines.append("1. Build an exporter that emits SessionDB + trajectory data into a TensorZero-friendly offline dataset.")
|
||||
lines.append("2. Define percentage-based canary controls before attempting any gateway replacement.")
|
||||
lines.append("3. Keep Hermes routing authoritative until TensorZero proves parity across CLI, gateway, cron, auxiliary, and delegation surfaces.")
|
||||
lines.append("")
|
||||
return "\n".join(lines).rstrip() + "\n"
|
||||
|
||||
|
||||
def write_outputs(report: EvaluationReport, markdown_path: Path, json_path: Path | None = None) -> None:
|
||||
markdown_path.parent.mkdir(parents=True, exist_ok=True)
|
||||
markdown_path.write_text(build_markdown(report), encoding="utf-8")
|
||||
if json_path is not None:
|
||||
json_path.parent.mkdir(parents=True, exist_ok=True)
|
||||
json_path.write_text(json.dumps(asdict(report), indent=2), encoding="utf-8")
|
||||
|
||||
|
||||
def parse_args() -> argparse.Namespace:
|
||||
parser = argparse.ArgumentParser(description="Generate a grounded TensorZero evaluation packet for Hermes")
|
||||
parser.add_argument("--repo-root", default=".", help="Hermes repo root to scan")
|
||||
parser.add_argument("--output", default=str(DEFAULT_OUTPUT), help="Markdown output path")
|
||||
parser.add_argument("--json-output", default=str(DEFAULT_JSON_OUTPUT), help="Optional JSON output path")
|
||||
return parser.parse_args()
|
||||
|
||||
|
||||
def main() -> int:
|
||||
args = parse_args()
|
||||
repo_root = Path(args.repo_root).resolve()
|
||||
touchpoints = scan_touchpoints(repo_root)
|
||||
matrix = build_requirement_matrix(touchpoints)
|
||||
report = build_report(touchpoints, matrix)
|
||||
json_output = Path(args.json_output) if args.json_output else None
|
||||
write_outputs(report, Path(args.output), json_output)
|
||||
print(f"Wrote {args.output}")
|
||||
if json_output is not None:
|
||||
print(f"Wrote {json_output}")
|
||||
return 0
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
raise SystemExit(main())
|
||||
@@ -1,94 +0,0 @@
|
||||
"""Tests for the morning review packet status report generator."""
|
||||
|
||||
from __future__ import annotations
|
||||
|
||||
import importlib.util
|
||||
from pathlib import Path
|
||||
|
||||
SCRIPT_PATH = Path(__file__).resolve().parents[1] / "scripts" / "morning_review_packet_status.py"
|
||||
DOC_PATH = Path(__file__).resolve().parents[1] / "docs" / "morning-review-packet-2026-04-21-status.md"
|
||||
|
||||
|
||||
def load_module():
|
||||
assert SCRIPT_PATH.exists(), f"missing status script: {SCRIPT_PATH}"
|
||||
spec = importlib.util.spec_from_file_location("morning_review_packet_status_test", SCRIPT_PATH)
|
||||
module = importlib.util.module_from_spec(spec)
|
||||
assert spec.loader is not None
|
||||
spec.loader.exec_module(module)
|
||||
return module
|
||||
|
||||
|
||||
def sample_snapshot():
|
||||
return {
|
||||
"epic": {"number": 949, "title": "Morning review packet", "state": "open"},
|
||||
"children": [
|
||||
{
|
||||
"number": 950,
|
||||
"title": "Verify AI Gateway provider UX + attribution headers",
|
||||
"state": "open",
|
||||
"open_prs": [],
|
||||
},
|
||||
{
|
||||
"number": 954,
|
||||
"title": "Verify maps skill guest_house / camp_site / bakery expansion",
|
||||
"state": "open",
|
||||
"open_prs": [
|
||||
{"number": 1021, "head": "fix/954", "title": "feat: sync maps skill and verify guest_house/camp_site/bakery (#954)"}
|
||||
],
|
||||
},
|
||||
{
|
||||
"number": 961,
|
||||
"title": "Verify web dashboard update/restart action buttons",
|
||||
"state": "closed",
|
||||
"open_prs": [],
|
||||
},
|
||||
],
|
||||
"decomposition_issues": [
|
||||
{"number": 965, "title": "Phase 1: Landscape Analysis & Scaffolding", "state": "open"},
|
||||
{"number": 967, "title": "Phase 3: Poka-yoke Integration & Fleet Verification", "state": "closed"},
|
||||
],
|
||||
}
|
||||
|
||||
|
||||
def test_extract_child_issue_numbers_from_epic_body():
|
||||
module = load_module()
|
||||
body = """
|
||||
- [ ] #950 one
|
||||
- [ ] #951 two
|
||||
- [ ] #962 three
|
||||
"""
|
||||
assert module.extract_issue_numbers(body) == [950, 951, 962]
|
||||
|
||||
|
||||
def test_summarize_snapshot_counts_open_closed_and_pr_backing():
|
||||
module = load_module()
|
||||
summary = module.summarize_snapshot(sample_snapshot())
|
||||
|
||||
assert summary["total_children"] == 3
|
||||
assert summary["open_children"] == 2
|
||||
assert summary["closed_children"] == 1
|
||||
assert summary["open_with_pr"] == 1
|
||||
assert summary["open_without_pr"] == 1
|
||||
|
||||
|
||||
def test_render_markdown_includes_issue_matrix_and_drift_sections():
|
||||
module = load_module()
|
||||
md = module.render_markdown(sample_snapshot())
|
||||
|
||||
assert "# Morning Review Packet Status — #949" in md
|
||||
assert "## Child QA Matrix" in md
|
||||
assert "#950" in md
|
||||
assert "#954" in md
|
||||
assert "#1021" in md
|
||||
assert "## Unowned Open QA Issues" in md
|
||||
assert "## Drift Signals" in md
|
||||
assert "forge/main is still catching up to the upstream packet" in md
|
||||
|
||||
|
||||
def test_committed_status_doc_exists_and_mentions_live_examples():
|
||||
assert DOC_PATH.exists(), f"missing generated status doc: {DOC_PATH}"
|
||||
text = DOC_PATH.read_text(encoding="utf-8")
|
||||
assert "# Morning Review Packet Status — #949" in text
|
||||
assert "#954" in text
|
||||
assert "#1021" in text
|
||||
assert "#950" in text
|
||||
149
tests/test_tensorzero_eval_packet.py
Normal file
149
tests/test_tensorzero_eval_packet.py
Normal file
@@ -0,0 +1,149 @@
|
||||
from pathlib import Path
|
||||
import sys
|
||||
|
||||
SCRIPT_DIR = Path(__file__).resolve().parents[1] / "scripts"
|
||||
sys.path.insert(0, str(SCRIPT_DIR))
|
||||
|
||||
import tensorzero_eval_packet as tz
|
||||
|
||||
|
||||
def test_scan_touchpoints_finds_expected_matches(tmp_path):
|
||||
(tmp_path / "run_agent.py").write_text(
|
||||
"self._fallback_chain = []\n# Provider fallback chain\n"
|
||||
)
|
||||
(tmp_path / "hermes_cli").mkdir()
|
||||
(tmp_path / "hermes_cli" / "runtime_provider.py").write_text(
|
||||
"def resolve_runtime_provider():\n return {}\n"
|
||||
)
|
||||
(tmp_path / "agent").mkdir()
|
||||
(tmp_path / "agent" / "smart_model_routing.py").write_text(
|
||||
"def resolve_turn_route(user_message, routing_config, primary):\n return primary\n"
|
||||
)
|
||||
(tmp_path / "gateway").mkdir()
|
||||
(tmp_path / "gateway" / "run.py").write_text(
|
||||
"def _load_provider_routing():\n return {}\n"
|
||||
)
|
||||
(tmp_path / "cron").mkdir()
|
||||
(tmp_path / "cron" / "scheduler.py").write_text(
|
||||
"runtime = resolve_runtime_provider()\nturn_route = resolve_turn_route('x', {}, {})\n"
|
||||
)
|
||||
(tmp_path / "hermes_state.py").write_text("class SessionDB:\n pass\n")
|
||||
(tmp_path / "benchmarks").mkdir()
|
||||
(tmp_path / "benchmarks" / "tool_call_benchmark.py").write_text(
|
||||
"class ToolCall: ...\n"
|
||||
)
|
||||
|
||||
touchpoints = tz.scan_touchpoints(tmp_path)
|
||||
|
||||
labels = {tp.label for tp in touchpoints}
|
||||
assert "fallback_chain" in labels
|
||||
assert "runtime_provider" in labels
|
||||
assert "smart_model_routing" in labels
|
||||
assert "gateway_provider_routing" in labels
|
||||
assert "cron_runtime_provider" in labels
|
||||
assert "session_db" in labels
|
||||
assert "benchmark_suite" in labels
|
||||
|
||||
|
||||
def test_build_requirement_matrix_marks_canary_as_gap_without_split_support():
|
||||
touchpoints = [
|
||||
tz.Touchpoint(
|
||||
label="runtime_provider",
|
||||
file_path="hermes_cli/runtime_provider.py",
|
||||
line_number=10,
|
||||
matched_text="def resolve_runtime_provider",
|
||||
),
|
||||
tz.Touchpoint(
|
||||
label="provider_routing_config",
|
||||
file_path="cli.py",
|
||||
line_number=20,
|
||||
matched_text='provider_routing',
|
||||
),
|
||||
tz.Touchpoint(
|
||||
label="fallback_chain",
|
||||
file_path="run_agent.py",
|
||||
line_number=21,
|
||||
matched_text='_fallback_chain = []',
|
||||
),
|
||||
tz.Touchpoint(
|
||||
label="smart_model_routing",
|
||||
file_path="agent/smart_model_routing.py",
|
||||
line_number=30,
|
||||
matched_text='resolve_turn_route',
|
||||
),
|
||||
tz.Touchpoint(
|
||||
label="gateway_provider_routing",
|
||||
file_path="gateway/run.py",
|
||||
line_number=35,
|
||||
matched_text='def _load_provider_routing',
|
||||
),
|
||||
tz.Touchpoint(
|
||||
label="cron_runtime_provider",
|
||||
file_path="cron/scheduler.py",
|
||||
line_number=36,
|
||||
matched_text='runtime = resolve_runtime_provider()',
|
||||
),
|
||||
tz.Touchpoint(
|
||||
label="session_db",
|
||||
file_path="hermes_state.py",
|
||||
line_number=40,
|
||||
matched_text='class SessionDB',
|
||||
),
|
||||
tz.Touchpoint(
|
||||
label="trajectory_export",
|
||||
file_path="batch_runner.py",
|
||||
line_number=50,
|
||||
matched_text='trajectory_entry',
|
||||
),
|
||||
tz.Touchpoint(
|
||||
label="benchmark_suite",
|
||||
file_path="benchmarks/tool_call_benchmark.py",
|
||||
line_number=60,
|
||||
matched_text='ToolCall',
|
||||
),
|
||||
]
|
||||
|
||||
matrix = tz.build_requirement_matrix(touchpoints)
|
||||
by_key = {row.key: row for row in matrix}
|
||||
|
||||
assert by_key["gateway_replacement"].status == "partial"
|
||||
assert by_key["config_migration"].status == "partial"
|
||||
assert by_key["canary_rollout"].status == "gap"
|
||||
assert by_key["session_feedback"].status == "partial"
|
||||
assert by_key["evaluation_suite"].status == "partial"
|
||||
|
||||
|
||||
def test_build_markdown_renders_recommendation_and_touchpoints():
|
||||
touchpoints = [
|
||||
tz.Touchpoint(
|
||||
label="runtime_provider",
|
||||
file_path="hermes_cli/runtime_provider.py",
|
||||
line_number=10,
|
||||
matched_text="def resolve_runtime_provider",
|
||||
),
|
||||
tz.Touchpoint(
|
||||
label="session_db",
|
||||
file_path="hermes_state.py",
|
||||
line_number=40,
|
||||
matched_text='class SessionDB',
|
||||
),
|
||||
]
|
||||
matrix = tz.build_requirement_matrix(touchpoints)
|
||||
report = tz.build_report(touchpoints, matrix)
|
||||
markdown = tz.build_markdown(report)
|
||||
|
||||
assert "# TensorZero Evaluation Packet" in markdown
|
||||
assert "gateway_replacement" not in markdown # human labels, not raw keys
|
||||
assert "Gateway replacement scope" in markdown
|
||||
assert "Not ready for direct replacement" in markdown
|
||||
assert "hermes_cli/runtime_provider.py:10" in markdown
|
||||
assert "hermes_state.py:40" in markdown
|
||||
|
||||
|
||||
def test_issue_context_is_embedded_in_report():
|
||||
report = tz.build_report([], [])
|
||||
markdown = tz.build_markdown(report)
|
||||
|
||||
assert "Issue #860" in markdown
|
||||
assert "tensorzero" in markdown.lower()
|
||||
assert "10% traffic" in markdown
|
||||
Reference in New Issue
Block a user