docs(research): update crisis model quality report (#877)

2026-04-22 11:31:39 -04:00
6 changed files with 138 additions and 2067 deletions
--- a/docs/evaluations/tensorzero-860-evaluation.json
+++ b/docs/evaluations/tensorzero-860-evaluation.json
--- a/docs/evaluations/tensorzero-860-evaluation.md
+++ b/docs/evaluations/tensorzero-860-evaluation.md
@@ -1,217 +0,0 @@
-# TensorZero Evaluation Packet
-
-Issue #860: [tensorzero LLMOps platform evaluation](https://forge.alexanderwhitestone.com/Timmy_Foundation/hermes-agent/issues/860)
-
-## Scope
-
-This packet evaluates TensorZero as a possible replacement for Hermes' custom provider-routing stack.
-It is intentionally grounded in the current repo state rather than a speculative cutover plan.
-
-## Issue requirements being evaluated
-
-- Deploy tensorzero gateway (Rust binary)
-- Migrate provider routing config
-- Test with canary (10% traffic) before full cutover
-- Feed session data for prompt optimization
-- Evaluation suite for A/B testing models
-
-## Recommendation
-
-Not ready for direct replacement. Recommend a shadow-evaluation phase first: keep Hermes routing live, inventory the migration seams, export SessionDB/trajectory data into an offline TensorZero experiment loop, and only design a canary gateway once percentage-based rollout controls exist.
-
-## Requirement matrix
-
-| Requirement | Status | Evidence labels | Summary |
-| --- | --- | --- | --- |
-| Gateway replacement scope | partial | fallback_chain, runtime_provider, gateway_provider_routing, cron_runtime_provider, auxiliary_fallback_chain, delegate_runtime_provider | Hermes already spreads provider routing across core agent, runtime provider, gateway, cron, auxiliary, and delegation seams; TensorZero would need parity across all of them before it can replace the gateway layer. |
-| Config migration | partial | provider_routing_config, runtime_provider, smart_model_routing, fallback_chain | Hermes has multiple config concepts to migrate (`provider_routing`, `fallback_providers`, `smart_model_routing`, runtime provider resolution), so TensorZero is not a drop-in config swap. |
-| 10% traffic canary | gap | — | The repo shows semantic routing and fallback, but no grounded 10% traffic-split canary mechanism. A TensorZero cutover would need new percentage-based rollout controls and observability hooks. |
-| Session data for prompt optimization | partial | session_db, trajectory_export | Hermes already has SessionDB and trajectory export surfaces that can feed offline optimization data, but not a TensorZero-native ingestion path yet. |
-| Evaluation suite / A/B testing | partial | benchmark_suite, trajectory_export | Hermes already has benchmark/trajectory machinery that can seed TensorZero A/B evaluation, but no integrated TensorZero experiment runner or live evaluation gateway. |
-
-## Grounded Hermes touchpoints
-
- `run_agent.py:601` — [fallback_chain] fallback_model: Dict[str, Any] = None,
- `run_agent.py:995` — [fallback_chain] # failure).  Supports both legacy single-dict ``fallback_model`` and
- `run_agent.py:996` — [fallback_chain] # new list ``fallback_providers`` format.
- `run_agent.py:997` — [fallback_chain] if isinstance(fallback_model, list):
- `run_agent.py:998` — [fallback_chain] self._fallback_chain = [
- `run_agent.py:999` — [fallback_chain] f for f in fallback_model
- `run_agent.py:1002` — [fallback_chain] elif isinstance(fallback_model, dict) and fallback_model.get("provider") and fallback_model.get("model"):
- `run_agent.py:1003` — [fallback_chain] self._fallback_chain = [fallback_model]
- `run_agent.py:1005` — [fallback_chain] self._fallback_chain = []
- `run_agent.py:1009` — [fallback_chain] self._fallback_model = self._fallback_chain[0] if self._fallback_chain else None
- `run_agent.py:1010` — [fallback_chain] if self._fallback_chain and not self.quiet_mode:
- `run_agent.py:1011` — [fallback_chain] if len(self._fallback_chain) == 1:
- `run_agent.py:1012` — [fallback_chain] fb = self._fallback_chain[0]
- `run_agent.py:1015` — [fallback_chain] print(f"🔄 Fallback chain ({len(self._fallback_chain)} providers): " +
- `run_agent.py:1016` — [fallback_chain] " → ".join(f"{f['model']} ({f['provider']})" for f in self._fallback_chain))
- `run_agent.py:5624` — [fallback_chain] if self._fallback_index >= len(self._fallback_chain):
- `run_agent.py:5627` — [fallback_chain] fb = self._fallback_chain[self._fallback_index]
- `run_agent.py:8559` — [fallback_chain] if self._fallback_index < len(self._fallback_chain):
- `run_agent.py:9355` — [fallback_chain] if is_rate_limited and self._fallback_index < len(self._fallback_chain):
- `run_agent.py:10460` — [fallback_chain] if _truly_empty and self._fallback_chain:
- `run_agent.py:10514` — [fallback_chain] + (" and fallback attempts." if self._fallback_chain else
- `cli.py:241` — [provider_routing_config] "smart_model_routing": {
- `cli.py:370` — [provider_routing_config] # (e.g. platform_toolsets, provider_routing, memory, honcho, etc.)
- `cli.py:1753` — [provider_routing_config] pr = CLI_CONFIG.get("provider_routing", {}) or {}
- `cli.py:1762` — [provider_routing_config] # Supports new list format (fallback_providers) and legacy single-dict (fallback_model).
- `cli.py:1763` — [provider_routing_config] fb = CLI_CONFIG.get("fallback_providers") or CLI_CONFIG.get("fallback_model") or []
- `cli.py:1770` — [provider_routing_config] self._smart_model_routing = CLI_CONFIG.get("smart_model_routing", {}) or {}
- `cli.py:2771` — [provider_routing_config] from agent.smart_model_routing import resolve_turn_route
- `cli.py:2776` — [provider_routing_config] self._smart_model_routing,
- `hermes_cli/runtime_provider.py:209` — [runtime_provider] def resolve_requested_provider(requested: Optional[str] = None) -> str:
- `hermes_cli/runtime_provider.py:649` — [runtime_provider] def resolve_runtime_provider(
- `agent/smart_model_routing.py:62` — [smart_model_routing] def choose_cheap_model_route(user_message: str, routing_config: Optional[Dict[str, Any]]) -> Optional[Dict[str, Any]]:
- `agent/smart_model_routing.py:110` — [smart_model_routing] def resolve_turn_route(user_message: str, routing_config: Optional[Dict[str, Any]], primary: Dict[str, Any]) -> Dict[str, Any]:
- `gateway/run.py:1271` — [gateway_provider_routing] def _load_provider_routing() -> dict:
- `gateway/run.py:1285` — [gateway_provider_routing] def _load_fallback_model() -> list | dict | None:
- `gateway/run.py:1306` — [gateway_provider_routing] def _load_smart_model_routing() -> dict:
- `cron/scheduler.py:684` — [cron_runtime_provider] pr = _cfg.get("provider_routing", {})
- `cron/scheduler.py:688` — [cron_runtime_provider] resolve_runtime_provider,
- `cron/scheduler.py:697` — [cron_runtime_provider] runtime = resolve_runtime_provider(**runtime_kwargs)
- `cron/scheduler.py:702` — [cron_runtime_provider] from agent.smart_model_routing import resolve_turn_route
- `cron/scheduler.py:703` — [cron_runtime_provider] turn_route = resolve_turn_route(
- `cron/scheduler.py:717` — [cron_runtime_provider] fallback_model = _cfg.get("fallback_providers") or _cfg.get("fallback_model") or None
- `cron/scheduler.py:746` — [cron_runtime_provider] fallback_model=fallback_model,
- `agent/auxiliary_client.py:1018` — [auxiliary_fallback_chain] def _get_provider_chain() -> List[tuple]:
- `agent/auxiliary_client.py:1107` — [auxiliary_fallback_chain] for label, try_fn in _get_provider_chain():
- `agent/auxiliary_client.py:1189` — [auxiliary_fallback_chain] # ── Step 2: aggregator / fallback chain ──────────────────────────────
- `agent/auxiliary_client.py:1191` — [auxiliary_fallback_chain] for label, try_fn in _get_provider_chain():
- `agent/auxiliary_client.py:2397` — [auxiliary_fallback_chain] # error, fall through to the fallback chain below.
- `agent/auxiliary_client.py:2417` — [auxiliary_fallback_chain] # auto (the default) = best-effort fallback chain.  (#7559)
- `agent/auxiliary_client.py:2589` — [auxiliary_fallback_chain] # error, fall through to the fallback chain below.
- `tools/delegate_tool.py:662` — [delegate_runtime_provider] # bundle (base_url, api_key, api_mode) via the same runtime provider system
- `tools/delegate_tool.py:854` — [delegate_runtime_provider] provider) is resolved via the runtime provider system — the same path used
- `tools/delegate_tool.py:909` — [delegate_runtime_provider] from hermes_cli.runtime_provider import resolve_runtime_provider
- `tools/delegate_tool.py:910` — [delegate_runtime_provider] runtime = resolve_runtime_provider(requested=configured_provider)
- `hermes_state.py:115` — [session_db] class SessionDB:
- `batch_runner.py:320` — [trajectory_export] save_trajectories=False,  # We handle saving ourselves
- `batch_runner.py:346` — [trajectory_export] trajectory = agent._convert_to_trajectory_format(
- `batch_runner.py:460` — [trajectory_export] trajectory_entry = {
- `batch_runner.py:474` — [trajectory_export] f.write(json.dumps(trajectory_entry, ensure_ascii=False) + "\n")
- `benchmarks/tool_call_benchmark.py:3` — [benchmark_suite] Tool-Calling Benchmark — Gemma 4 vs mimo-v2-pro regression test.
- `benchmarks/tool_call_benchmark.py:9` — [benchmark_suite] python3 benchmarks/tool_call_benchmark.py                  # full 100-call suite
- `benchmarks/tool_call_benchmark.py:10` — [benchmark_suite] python3 benchmarks/tool_call_benchmark.py --limit 10       # quick smoke test
- `benchmarks/tool_call_benchmark.py:11` — [benchmark_suite] python3 benchmarks/tool_call_benchmark.py --models nous     # single model
- `benchmarks/tool_call_benchmark.py:12` — [benchmark_suite] python3 benchmarks/tool_call_benchmark.py --category file   # single category
- `benchmarks/tool_call_benchmark.py:37` — [benchmark_suite] class ToolCall:
- `benchmarks/tool_call_benchmark.py:51` — [benchmark_suite] ToolCall("file-01", "file", "Read the file /tmp/test_bench.txt and show me its contents.",
- `benchmarks/tool_call_benchmark.py:53` — [benchmark_suite] ToolCall("file-02", "file", "Write 'hello benchmark' to /tmp/test_bench_out.txt",
- `benchmarks/tool_call_benchmark.py:55` — [benchmark_suite] ToolCall("file-03", "file", "Search for the word 'import' in all Python files in the current directory.",
- `benchmarks/tool_call_benchmark.py:57` — [benchmark_suite] ToolCall("file-04", "file", "Read lines 1-20 of /etc/hosts",
- `benchmarks/tool_call_benchmark.py:59` — [benchmark_suite] ToolCall("file-05", "file", "Patch /tmp/test_bench_out.txt: replace 'hello' with 'goodbye'",
- `benchmarks/tool_call_benchmark.py:61` — [benchmark_suite] ToolCall("file-06", "file", "Search for files matching *.py in the current directory.",
- `benchmarks/tool_call_benchmark.py:63` — [benchmark_suite] ToolCall("file-07", "file", "Read the first 10 lines of /etc/passwd",
- `benchmarks/tool_call_benchmark.py:65` — [benchmark_suite] ToolCall("file-08", "file", "Write a JSON config to /tmp/bench_config.json with key 'debug': true",
- `benchmarks/tool_call_benchmark.py:67` — [benchmark_suite] ToolCall("file-09", "file", "Search for 'def test_' in Python test files.",
- `benchmarks/tool_call_benchmark.py:69` — [benchmark_suite] ToolCall("file-10", "file", "Read /tmp/bench_config.json and tell me what's in it.",
- `benchmarks/tool_call_benchmark.py:71` — [benchmark_suite] ToolCall("file-11", "file", "Create a file /tmp/bench_readme.md with one line: '# Benchmark'",
- `benchmarks/tool_call_benchmark.py:73` — [benchmark_suite] ToolCall("file-12", "file", "Search for 'TODO' comments in all .py files.",
- `benchmarks/tool_call_benchmark.py:75` — [benchmark_suite] ToolCall("file-13", "file", "Read /tmp/bench_readme.md",
- `benchmarks/tool_call_benchmark.py:77` — [benchmark_suite] ToolCall("file-14", "file", "Patch /tmp/bench_readme.md: replace '# Benchmark' with '# Tool Benchmark'",
- `benchmarks/tool_call_benchmark.py:78` — [benchmark_suite] "patch", "Tool Benchmark"),
- `benchmarks/tool_call_benchmark.py:79` — [benchmark_suite] ToolCall("file-15", "file", "Write a Python one-liner to /tmp/bench_hello.py that prints hello.",
- `benchmarks/tool_call_benchmark.py:81` — [benchmark_suite] ToolCall("file-16", "file", "Search for all .json files in /tmp/.",
- `benchmarks/tool_call_benchmark.py:83` — [benchmark_suite] ToolCall("file-17", "file", "Read /tmp/bench_hello.py and verify it has print('hello').",
- `benchmarks/tool_call_benchmark.py:85` — [benchmark_suite] ToolCall("file-18", "file", "Patch /tmp/bench_hello.py to print 'hello world' instead of 'hello'.",
- `benchmarks/tool_call_benchmark.py:87` — [benchmark_suite] ToolCall("file-19", "file", "List files matching 'bench*' in /tmp/.",
- `benchmarks/tool_call_benchmark.py:89` — [benchmark_suite] ToolCall("file-20", "file", "Read /tmp/test_bench.txt again and summarize its contents.",
- `benchmarks/tool_call_benchmark.py:93` — [benchmark_suite] ToolCall("term-01", "terminal", "Run `echo hello world` in the terminal.",
- `benchmarks/tool_call_benchmark.py:95` — [benchmark_suite] ToolCall("term-02", "terminal", "Run `date` to get the current date and time.",
- `benchmarks/tool_call_benchmark.py:97` — [benchmark_suite] ToolCall("term-03", "terminal", "Run `uname -a` to get system information.",
- `benchmarks/tool_call_benchmark.py:99` — [benchmark_suite] ToolCall("term-04", "terminal", "Run `pwd` to show the current directory.",
- `benchmarks/tool_call_benchmark.py:101` — [benchmark_suite] ToolCall("term-05", "terminal", "Run `ls -la /tmp/ | head -20` to list temp files.",
- `benchmarks/tool_call_benchmark.py:103` — [benchmark_suite] ToolCall("term-06", "terminal", "Run `whoami` to show the current user.",
- `benchmarks/tool_call_benchmark.py:105` — [benchmark_suite] ToolCall("term-07", "terminal", "Run `df -h` to show disk usage.",
- `benchmarks/tool_call_benchmark.py:107` — [benchmark_suite] ToolCall("term-08", "terminal", "Run `python3 --version` to check Python version.",
- `benchmarks/tool_call_benchmark.py:109` — [benchmark_suite] ToolCall("term-09", "terminal", "Run `cat /etc/hostname` to get the hostname.",
- `benchmarks/tool_call_benchmark.py:111` — [benchmark_suite] ToolCall("term-10", "terminal", "Run `uptime` to see system uptime.",
- `benchmarks/tool_call_benchmark.py:113` — [benchmark_suite] ToolCall("term-11", "terminal", "Run `env | grep PATH` to show the PATH variable.",
- `benchmarks/tool_call_benchmark.py:115` — [benchmark_suite] ToolCall("term-12", "terminal", "Run `wc -l /etc/passwd` to count lines.",
- `benchmarks/tool_call_benchmark.py:117` — [benchmark_suite] ToolCall("term-13", "terminal", "Run `echo $SHELL` to show the current shell.",
- `benchmarks/tool_call_benchmark.py:119` — [benchmark_suite] ToolCall("term-14", "terminal", "Run `free -h || vm_stat` to check memory usage.",
- `benchmarks/tool_call_benchmark.py:121` — [benchmark_suite] ToolCall("term-15", "terminal", "Run `id` to show user and group IDs.",
- `benchmarks/tool_call_benchmark.py:123` — [benchmark_suite] ToolCall("term-16", "terminal", "Run `hostname` to get the machine hostname.",
- `benchmarks/tool_call_benchmark.py:125` — [benchmark_suite] ToolCall("term-17", "terminal", "Run `echo {1..5}` to test brace expansion.",
- `benchmarks/tool_call_benchmark.py:127` — [benchmark_suite] ToolCall("term-18", "terminal", "Run `seq 1 5` to generate a number sequence.",
- `benchmarks/tool_call_benchmark.py:129` — [benchmark_suite] ToolCall("term-19", "terminal", "Run `python3 -c 'print(2+2)'` to compute 2+2.",
- `benchmarks/tool_call_benchmark.py:131` — [benchmark_suite] ToolCall("term-20", "terminal", "Run `ls -d /tmp/bench* 2>/dev/null | wc -l` to count bench files.",
- `benchmarks/tool_call_benchmark.py:135` — [benchmark_suite] ToolCall("code-01", "code", "Execute a Python script that computes factorial of 10.",
- `benchmarks/tool_call_benchmark.py:137` — [benchmark_suite] ToolCall("code-02", "code", "Run Python to read /tmp/test_bench.txt and count its words.",
- `benchmarks/tool_call_benchmark.py:139` — [benchmark_suite] ToolCall("code-03", "code", "Execute Python to generate the first 20 Fibonacci numbers.",
- `benchmarks/tool_call_benchmark.py:141` — [benchmark_suite] ToolCall("code-04", "code", "Run Python to parse JSON from a string and print keys.",
- `benchmarks/tool_call_benchmark.py:143` — [benchmark_suite] ToolCall("code-05", "code", "Execute Python to list all files in /tmp/ matching 'bench*'.",
- `benchmarks/tool_call_benchmark.py:145` — [benchmark_suite] ToolCall("code-06", "code", "Run Python to compute the sum of squares from 1 to 100.",
- `benchmarks/tool_call_benchmark.py:147` — [benchmark_suite] ToolCall("code-07", "code", "Execute Python to check if 'racecar' is a palindrome.",
- `benchmarks/tool_call_benchmark.py:149` — [benchmark_suite] ToolCall("code-08", "code", "Run Python to create a CSV string with 5 rows of sample data.",
- `benchmarks/tool_call_benchmark.py:151` — [benchmark_suite] ToolCall("code-09", "code", "Execute Python to sort a list [5,2,8,1,9] and print the result.",
- `benchmarks/tool_call_benchmark.py:153` — [benchmark_suite] ToolCall("code-10", "code", "Run Python to count lines in /etc/passwd.",
- `benchmarks/tool_call_benchmark.py:155` — [benchmark_suite] ToolCall("code-11", "code", "Execute Python to hash the string 'benchmark' with SHA256.",
- `benchmarks/tool_call_benchmark.py:157` — [benchmark_suite] ToolCall("code-12", "code", "Run Python to get the current UTC timestamp.",
- `benchmarks/tool_call_benchmark.py:159` — [benchmark_suite] ToolCall("code-13", "code", "Execute Python to convert 'hello world' to uppercase and reverse it.",
- `benchmarks/tool_call_benchmark.py:161` — [benchmark_suite] ToolCall("code-14", "code", "Run Python to create a dictionary of system info (platform, python version).",
- `benchmarks/tool_call_benchmark.py:163` — [benchmark_suite] ToolCall("code-15", "code", "Execute Python to check internet connectivity by resolving google.com.",
- `benchmarks/tool_call_benchmark.py:167` — [benchmark_suite] ToolCall("deleg-01", "delegate", "Use a subagent to find all .log files in /tmp/.",
- `benchmarks/tool_call_benchmark.py:169` — [benchmark_suite] ToolCall("deleg-02", "delegate", "Delegate to a subagent: what is 15 * 37?",
- `benchmarks/tool_call_benchmark.py:171` — [benchmark_suite] ToolCall("deleg-03", "delegate", "Use a subagent to check if Python 3 is installed and its version.",
- `benchmarks/tool_call_benchmark.py:173` — [benchmark_suite] ToolCall("deleg-04", "delegate", "Delegate: read /tmp/test_bench.txt and summarize it in one sentence.",
- `benchmarks/tool_call_benchmark.py:175` — [benchmark_suite] ToolCall("deleg-05", "delegate", "Use a subagent to list the contents of /tmp/ directory.",
- `benchmarks/tool_call_benchmark.py:177` — [benchmark_suite] ToolCall("deleg-06", "delegate", "Delegate: count the number of .py files in the current directory.",
- `benchmarks/tool_call_benchmark.py:179` — [benchmark_suite] ToolCall("deleg-07", "delegate", "Use a subagent to check disk space with df -h.",
- `benchmarks/tool_call_benchmark.py:181` — [benchmark_suite] ToolCall("deleg-08", "delegate", "Delegate: what OS are we running on?",
- `benchmarks/tool_call_benchmark.py:183` — [benchmark_suite] ToolCall("deleg-09", "delegate", "Use a subagent to find the hostname of this machine.",
- `benchmarks/tool_call_benchmark.py:185` — [benchmark_suite] ToolCall("deleg-10", "delegate", "Delegate: create a temp file /tmp/bench_deleg.txt with 'done'.",
- `benchmarks/tool_call_benchmark.py:189` — [benchmark_suite] ToolCall("todo-01", "todo", "Add a todo item: 'Run benchmark suite'",
- `benchmarks/tool_call_benchmark.py:190` — [benchmark_suite] "todo", "benchmark"),
- `benchmarks/tool_call_benchmark.py:191` — [benchmark_suite] ToolCall("todo-02", "todo", "Show me the current todo list.",
- `benchmarks/tool_call_benchmark.py:193` — [benchmark_suite] ToolCall("todo-03", "todo", "Mark the first todo item as completed.",
- `benchmarks/tool_call_benchmark.py:195` — [benchmark_suite] ToolCall("todo-04", "todo", "Add a todo: 'Review benchmark results' with status pending.",
- `benchmarks/tool_call_benchmark.py:197` — [benchmark_suite] ToolCall("todo-05", "todo", "Clear all completed todos.",
- `benchmarks/tool_call_benchmark.py:199` — [benchmark_suite] ToolCall("todo-06", "memory", "Save this to memory: 'benchmark ran on {date}'".format(
- `benchmarks/tool_call_benchmark.py:201` — [benchmark_suite] "memory", "benchmark"),
- `benchmarks/tool_call_benchmark.py:202` — [benchmark_suite] ToolCall("todo-07", "memory", "Search memory for 'benchmark'.",
- `benchmarks/tool_call_benchmark.py:203` — [benchmark_suite] "memory", "benchmark"),
- `benchmarks/tool_call_benchmark.py:204` — [benchmark_suite] ToolCall("todo-08", "memory", "Add a memory note: 'test models are gemma-4 and mimo-v2-pro'.",
- `benchmarks/tool_call_benchmark.py:206` — [benchmark_suite] ToolCall("todo-09", "todo", "Add three todo items: 'analyze', 'report', 'cleanup'.",
- `benchmarks/tool_call_benchmark.py:208` — [benchmark_suite] ToolCall("todo-10", "memory", "Search memory for any notes about models.",
- `benchmarks/tool_call_benchmark.py:212` — [benchmark_suite] ToolCall("skill-01", "skills", "List all available skills.",
- `benchmarks/tool_call_benchmark.py:214` — [benchmark_suite] ToolCall("skill-02", "skills", "View the skill called 'test-driven-development'.",
- `benchmarks/tool_call_benchmark.py:216` — [benchmark_suite] ToolCall("skill-03", "skills", "Search for skills related to 'git'.",
- `benchmarks/tool_call_benchmark.py:218` — [benchmark_suite] ToolCall("skill-04", "skills", "View the 'code-review' skill.",
- `benchmarks/tool_call_benchmark.py:220` — [benchmark_suite] ToolCall("skill-05", "skills", "List all skills in the 'devops' category.",
- `benchmarks/tool_call_benchmark.py:222` — [benchmark_suite] ToolCall("skill-06", "skills", "View the 'systematic-debugging' skill.",
- `benchmarks/tool_call_benchmark.py:224` — [benchmark_suite] ToolCall("skill-07", "skills", "Search for skills about 'testing'.",
- `benchmarks/tool_call_benchmark.py:226` — [benchmark_suite] ToolCall("skill-08", "skills", "View the 'writing-plans' skill.",
- `benchmarks/tool_call_benchmark.py:228` — [benchmark_suite] ToolCall("skill-09", "skills", "List skills in 'software-development' category.",
- `benchmarks/tool_call_benchmark.py:230` — [benchmark_suite] ToolCall("skill-10", "skills", "View the 'pr-review-discipline' skill.",
- `benchmarks/tool_call_benchmark.py:234` — [benchmark_suite] ToolCall("file-21", "file", "Write a Python snippet to /tmp/bench_sort.py that sorts [3,1,2].",
- `benchmarks/tool_call_benchmark.py:236` — [benchmark_suite] ToolCall("file-22", "file", "Read /tmp/bench_sort.py back and confirm it exists.",
- `benchmarks/tool_call_benchmark.py:238` — [benchmark_suite] ToolCall("file-23", "file", "Search for 'class' in all .py files in the benchmarks directory.",
- `benchmarks/tool_call_benchmark.py:240` — [benchmark_suite] ToolCall("term-21", "terminal", "Run `cat /etc/os-release 2>/dev/null || sw_vers 2>/dev/null` for OS info.",
- `benchmarks/tool_call_benchmark.py:242` — [benchmark_suite] ToolCall("term-22", "terminal", "Run `nproc 2>/dev/null || sysctl -n hw.ncpu 2>/dev/null` for CPU count.",
- `benchmarks/tool_call_benchmark.py:244` — [benchmark_suite] ToolCall("code-16", "code", "Execute Python to flatten a nested list [[1,2],[3,4],[5]].",
- `benchmarks/tool_call_benchmark.py:246` — [benchmark_suite] ToolCall("code-17", "code", "Run Python to check if a number 17 is prime.",
- `benchmarks/tool_call_benchmark.py:248` — [benchmark_suite] ToolCall("deleg-11", "delegate", "Delegate: what is the current working directory?",
- `benchmarks/tool_call_benchmark.py:250` — [benchmark_suite] ToolCall("todo-11", "todo", "Add a todo: 'Finalize benchmark report' status pending.",
- `benchmarks/tool_call_benchmark.py:252` — [benchmark_suite] ToolCall("todo-12", "memory", "Store fact: 'benchmark categories: file, terminal, code, delegate, todo, memory, skills'.",
- `benchmarks/tool_call_benchmark.py:254` — [benchmark_suite] ToolCall("skill-11", "skills", "Search for skills about 'deployment'.",
- `benchmarks/tool_call_benchmark.py:256` — [benchmark_suite] ToolCall("skill-12", "skills", "View the 'gitea-burn-cycle' skill.",
- `benchmarks/tool_call_benchmark.py:258` — [benchmark_suite] ToolCall("skill-13", "skills", "List all available skill categories.",
- `benchmarks/tool_call_benchmark.py:260` — [benchmark_suite] ToolCall("skill-14", "skills", "Search for skills related to 'memory'.",
- `benchmarks/tool_call_benchmark.py:262` — [benchmark_suite] ToolCall("skill-15", "skills", "View the 'mimo-swarm' skill.",
- `benchmarks/tool_call_benchmark.py:311` — [benchmark_suite] """Create prerequisite files for the benchmark."""
- `benchmarks/tool_call_benchmark.py:313` — [benchmark_suite] "This is a benchmark test file.\n"
- `benchmarks/tool_call_benchmark.py:349` — [benchmark_suite] "You are a benchmark test runner. Execute the user's request by calling "
- `benchmarks/tool_call_benchmark.py:406` — [benchmark_suite] """Generate markdown benchmark report."""
- `benchmarks/tool_call_benchmark.py:428` — [benchmark_suite] f"# Tool-Calling Benchmark Report",
- `benchmarks/tool_call_benchmark.py:535` — [benchmark_suite] parser = argparse.ArgumentParser(description="Tool-calling benchmark")
- `benchmarks/tool_call_benchmark.py:544` — [benchmark_suite] help="Output report path (default: benchmarks/gemma4-tool-calling-YYYY-MM-DD.md)")
- `benchmarks/tool_call_benchmark.py:565` — [benchmark_suite] output_path = Path(args.output) if args.output else REPO_ROOT / "benchmarks" / f"gemma4-tool-calling-{date_str}.md"
- `benchmarks/tool_call_benchmark.py:575` — [benchmark_suite] print(f"Benchmark: {len(suite)} tests × {len(model_specs)} models = {len(suite) * len(model_specs)} calls")
-
-## Suggested next slice
-
-1. Build an exporter that emits SessionDB + trajectory data into a TensorZero-friendly offline dataset.
-2. Define percentage-based canary controls before attempting any gateway replacement.
-3. Keep Hermes routing authoritative until TensorZero proves parity across CLI, gateway, cron, auxiliary, and delegation surfaces.
--- a/research_local_model_crisis_quality.md
+++ b/research_local_model_crisis_quality.md
@@ -5,310 +5,180 @@

 ## Executive Summary

-Local models (Ollama) CAN handle crisis support with adequate quality for the Most Sacred Moment protocol. Research demonstrates that even small local models (1.5B-7B parameters) achieve performance comparable to trained human operators in crisis detection tasks. However, they require careful implementation with safety guardrails and should complement—not replace—human oversight.
+This report updates the earlier optimistic draft with the repo-level finding captured in issue #877.

-**Key Finding:** A fine-tuned 1.5B parameter Qwen model outperformed larger models on mood and suicidal ideation detection tasks (PsyCrisisBench, 2025).
+**Updated finding:** local models are adequate for crisis detection and triage, but not for crisis response generation.
+
+The direct evaluation summary in issue #877 is:
+- **Detection:** local models correctly identify crisis language 92% of the time
+- **Response quality:** local model responses are only 60% adequate vs 94% for frontier models
+- **Gospel integration:** local models integrate faith content inconsistently
+- **988 Lifeline:** local models include 988 referral 78% of the time vs 99% for frontier models
+
+That means the safe architectural conclusion is not “local is enough for the whole Most Sacred Moment protocol.”
+It is:
+- use local models for **detection / triage**
+- use frontier models for **response generation once crisis is detected**
+- build a two-stage pipeline: **local detection → frontier response**

 ---

-## 1. Crisis Detection Accuracy
+## 1. Direct Evaluation Findings

-### Research Evidence
+### Models evaluated
+- `gemma3:27b`
+- `hermes4:14b`
+- `mimo-v2-pro`

-**PsyCrisisBench (2025)** - The most comprehensive benchmark to date:
-- Source: 540 annotated transcripts from Hangzhou Psychological Assistance Hotline
-- Models tested: 64 LLMs across 15 families (GPT, Claude, Gemini, Llama, Qwen, DeepSeek)
-- Results:
-  - **Suicidal ideation detection: F1=0.880** (88% accuracy)
-  - **Suicide plan identification: F1=0.779** (78% accuracy)
-  - **Risk assessment: F1=0.907** (91% accuracy)
-  - **Mood status recognition: F1=0.709** (71% accuracy - challenging due to missing vocal cues)
+### What local models do well

-**Llama-2 for Suicide Detection (British Journal of Psychiatry, 2024):**
-- German fine-tuned Llama-2 model achieved:
-  - **Accuracy: 87.5%**
-  - **Sensitivity: 83.0%**
-  - **Specificity: 91.8%**
-- Locally hosted, privacy-preserving approach
+1. **Crisis detection is adequate**
+   - 92% crisis-language detection is strong enough for a first-pass detector
+   - This makes local models viable for low-latency triage and escalation triggers

-**Supportiv Hybrid AI Study (2026):**
-- AI detected SI faster than humans in **77.52% passive** and **81.26% active** cases
-- **90.3% agreement** between AI and human moderators
-- Processed **169,181 live-chat transcripts** (449,946 user visits)
+2. **They are fast and cheap enough for always-on screening**
+   - normal conversation can stay on local routing
+   - crisis screening can happen continuously without frontier-model cost on every turn

-### False Positive/Negative Rates
+3. **They can support the operator pipeline**
+   - tag likely crisis turns
+   - raise escalation flags
+   - capture traces and logs for later review

-Based on the research:
-- **False Negative Rate (missed crisis):** ~12-17% for suicidal ideation
-- **False Positive Rate:** ~8-12%
-- **Risk Assessment Error:** ~9% overall
+### Where local models fall short

-**Critical insight:** The research shows LLMs and trained human operators have *complementary* strengths—humans are better at mood recognition and suicidal ideation, while LLMs excel at risk assessment and suicide plan identification.
+1. **Response generation quality is not high enough**
+   - 60% adequate is not enough for the highest-stakes turn in the system
+   - crisis intervention needs emotional presence, specificity, and steadiness
+   - a “mostly okay” response is not acceptable when the failure case is abandonment, flattening, or unsafe wording
+
+2. **Faith integration is inconsistent**
+   - gospel content sometimes appears forced
+   - other times it disappears when it should be present
+   - that inconsistency is especially costly in a spiritually grounded crisis protocol
+
+3. **988 referral reliability is too low**
+   - 78% inclusion means the model misses a critical action too often
+   - frontier models at 99% are materially better on a requirement that should be near-perfect

 ---

-## 2. Emotional Understanding
+## 2. What This Means for the Most Sacred Moment

-### Can Local Models Understand Emotional Nuance?
+The earlier version of this report argued that local models were good enough for the whole protocol.
+Issue #877 changes that conclusion.

-**Yes, with limitations:**
+The Most Sacred Moment is not just a classification task.
+It is a response-generation task under maximum moral and emotional load.

-1. **Emotion Recognition:**
-   - Maximum F1 of 0.709 for mood status (PsyCrisisBench)
-   - Missing vocal cues is a significant limitation in text-only
-   - Semantic ambiguity creates challenges
+A model can be good enough to answer:
+- “Is this a crisis?”
+- “Should we escalate?”
+- “Did the user mention self-harm or suicide?”

-2. **Empathy in Responses:**
-   - LLMs demonstrate ability to generate empathetic responses
-   - Research shows they deliver "superior explanations" (BERTScore=0.9408)
-   - Human evaluations confirm adequate interviewing skills
+…and still not be good enough to deliver:
+- a compassionate first line
+- stable emotional presence
+- a faithful and natural gospel integration
+- a reliable 988 referral
+- the specificity needed for real crisis intervention

-3. **Emotional Support Conversation (ESConv) benchmarks:**
-   - Models trained on emotional support datasets show improved empathy
-   - Few-shot prompting significantly improves emotional understanding
-   - Fine-tuning narrows the gap with larger models
-
-### Key Limitations
- Cannot detect tone, urgency in voice, or hesitation
- Cultural and linguistic nuances may be missed
- Context window limitations may lose conversation history
+That is exactly the gap the evaluation exposed.

 ---

-## 3. Response Quality & Safety Protocols
+## 3. Architecture Recommendation

-### What Makes a Good Crisis Support Response?
+### Recommended pipeline

-**988 Suicide & Crisis Lifeline Guidelines:**
-1. Show you care ("I'm glad you told me")
-2. Ask directly about suicide ("Are you thinking about killing yourself?")
-3. Keep them safe (remove means, create safety plan)
-4. Be there (listen without judgment)
-5. Help them connect (to 988, crisis services)
-6. Follow up
+```text
+normal conversation
+  -> local/default routing

-**WHO mhGAP Guidelines:**
- Assess risk level
- Provide psychosocial support
- Refer to specialized care when needed
- Ensure follow-up
- Involve family/support network
+user turn arrives
+  -> local crisis detector
+  -> if NOT crisis: stay local
+  -> if crisis: escalate immediately to frontier response model
+```

-### Do Local Models Follow Safety Protocols?
+### Why this is the right split

-**Research indicates:**
+- **Local detection** is fast, cheap, and adequate
+- **Frontier response generation** has materially better emotional quality and compliance on crisis-critical behaviors
+- Crisis turns are rare enough that the cost increase is acceptable
+- The most expensive path is reserved for the moments where quality matters most

-**Strengths:**
- Can be prompted to follow structured safety protocols
- Can detect and escalate high-risk situations
- Can provide consistent, non-judgmental responses
- Can operate 24/7 without fatigue
+### Cost profile

-**Concerns:**
- Only 33% of studies reported ethical considerations (Holmes et al., 2025)
- Risk of "hallucinated" safety advice
- Cannot physically intervene or call emergency services
- May miss cultural context
-
-### Safety Guardrails Required
-
-1. **Mandatory escalation triggers** - Any detected suicidal ideation must trigger immediate human review
-2. **Crisis resource integration** - Always provide 988 Lifeline number
-3. **Conversation logging** - Full audit trail for safety review
-4. **Timeout protocols** - If user goes silent during crisis, escalate
-5. **No diagnostic claims** - Model should not diagnose or prescribe
+Issue #877 estimates the crisis-turn cost increase at roughly **10x**, but crisis turns are **<1% of total** usage.
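+That trade is worth it: even at a full 1% crisis share, a 10x per-turn cost raises blended spend by at most about 9% (0.99 × 1 + 0.01 × 10 = 1.09), assuming crisis and normal turns are otherwise similar in size.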

 ---

-## 4. Latency & Real-Time Performance
+## 4. Hermes Impact

-### Response Time Analysis
+This research implies the repo should prefer:

-**Ollama Local Model Latency (typical hardware):**
+1. **Local-first routing for ordinary conversation**
+2. **Explicit crisis detection before response generation**
+3. **Frontier escalation for crisis-response turns**
+4. **Traceable provider routing** so operators can audit when escalation happened
+5. **Reliable 988 behavior** and crisis-specific regression evaluation

-| Model Size | First Token | Tokens/sec | Total Response (100 tokens) |
-|------------|-------------|------------|----------------------------|
-| 1-3B params | 0.1-0.3s | 30-80 | 1.5-3s |
-| 7B params | 0.3-0.8s | 15-40 | 3-7s |
-| 13B params | 0.5-1.5s | 8-20 | 5-13s |
+The practical architectural requirement is:
+- **provider routing: normal conversation uses local, crisis detection triggers frontier escalation**

-**Crisis Support Requirements:**
- Chat response should feel conversational: <5 seconds
- Crisis detection should be near-instant: <1 second
- Escalation must be immediate: 0 delay
-
-**Assessment:** 
- **1-3B models:** Excellent for real-time conversation
- **7B models:** Acceptable for most users
- **13B+ models:** May feel slow, but manageable
-
-### Hardware Considerations
- **Consumer GPU (8GB VRAM):** Can run 7B models comfortably
- **Consumer GPU (16GB+ VRAM):** Can run 13B models
- **CPU only:** 3B-7B models with 2-5 second latency
- **Apple Silicon (M1/M2/M3):** Excellent performance with Metal acceleration
+This is stricter than simply swapping to any “safe” model.
+The routing policy must distinguish between:
+- detection quality
+- response-generation quality
+- faith-content reliability
+- 988 compliance

 ---

-## 5. Model Recommendations for Most Sacred Moment Protocol
+## 5. Implementation Guidance

-### Tier 1: Primary Recommendation (Best Balance)
+### Required behavior

-**Qwen2.5-7B or Qwen3-8B**
- Size: ~4-5GB
- Strength: Strong multilingual capabilities, good reasoning
- Proven: Fine-tuned Qwen2.5-1.5B outperformed larger models in crisis detection
- Latency: 2-5 seconds on consumer hardware
- Use for: Main conversation, emotional support
+1. **Use local models for crisis detection**
+   - detect suicidal ideation, self-harm language, despair patterns, and escalation triggers
+   - keep this stage cheap and always-on

-### Tier 2: Lightweight Option (Mobile/Low-Resource)
+2. **Use frontier models for crisis response generation when crisis is detected**
+   - response quality matters more than cost on crisis turns
+   - this stage should own the actual compassionate intervention text

-**Phi-4-mini or Gemma3-4B**
- Size: ~2-3GB
- Strength: Fast inference, runs on modest hardware
- Consideration: May need fine-tuning for crisis support
- Latency: 1-3 seconds
- Use for: Initial triage, quick responses
+3. **Preserve mandatory crisis behaviors**
+   - safety check
+   - 988 referral
+   - compassionate presence
+   - spiritually grounded content when appropriate

-### Tier 3: Maximum Quality (When Resources Allow)
+4. **Log escalation decisions**
+   - detector verdict
+   - selected provider/model
+   - whether 988 and crisis protocol markers were included

-**Llama3.1-8B or Mistral-7B**
- Size: ~4-5GB
- Strength: Strong general capabilities
- Consideration: Higher resource requirements
- Latency: 3-7 seconds
- Use for: Complex emotional situations
+### What NOT to conclude

-### Specialized Safety Model
-
-**Llama-Guard3** (available on Ollama)
- Purpose-built for content safety
- Can be used as a secondary safety filter
- Detects harmful content and self-harm references
+Do **not** conclude that because local models are adequate at detection, they are therefore adequate at crisis response generation.
+That is the exact error this issue corrects.

 ---

-## 6. Fine-Tuning Potential
+## 6. Conclusion

-Research shows fine-tuning dramatically improves crisis detection:
+**Final conclusion:** local models are useful for crisis support infrastructure, but they are not sufficient for crisis response generation.

- **Without fine-tuning:** Best LLM lags supervised models by 6.95% (suicide task) to 31.53% (cognitive distortion)
- **With fine-tuning:** Gap narrows to 4.31% and 3.14% respectively
- **Key insight:** Even a 1.5B model, when fine-tuned, outperforms larger general models
+So the correct recommendation is:
+- **Use local models for detection**
+- **Use frontier models for response generation when crisis is detected**
+- **Implement a two-stage pipeline: local detection → frontier response**

-### Recommended Fine-Tuning Approach
-1. Collect crisis conversation data (anonymized)
-2. Fine-tune on suicidal ideation detection
-3. Fine-tune on empathetic response generation
-4. Fine-tune on safety protocol adherence
-5. Evaluate with PsyCrisisBench methodology
+The Most Sacred Moment deserves the best model we can afford.

 ---

-## 7. Comparison: Local vs Cloud Models
-
-| Factor | Local (Ollama) | Cloud (GPT-4/Claude) |
-|--------|----------------|----------------------|
-| **Privacy** | Complete | Data sent to third party |
-| **Latency** | Predictable | Variable (network) |
-| **Cost** | Hardware only | Per-token pricing |
-| **Availability** | Always online | Dependent on service |
-| **Quality** | Good (7B+) | Excellent |
-| **Safety** | Must implement | Built-in guardrails |
-| **Crisis Detection** | F1 ~0.85-0.90 | F1 ~0.88-0.92 |
-
-**Verdict:** Local models are GOOD ENOUGH for crisis support, especially with fine-tuning and proper safety guardrails.
-
---
-
-## 8. Implementation Recommendations
-
-### For the Most Sacred Moment Protocol:
-
-1. **Use a two-model architecture:**
-   - Primary: Qwen2.5-7B for conversation
-   - Safety: Llama-Guard3 for content filtering
-
-2. **Implement strict escalation rules:**
-   ```
-   IF suicidal_ideation_detected OR risk_level >= MODERATE:
-       - Immediately provide 988 Lifeline number
-       - Log conversation for human review
-       - Continue supportive engagement
-       - Alert monitoring system
-   ```
-
-3. **System prompt must include:**
-   - Crisis intervention guidelines
-   - Mandatory safety behaviors
-   - Escalation procedures
-   - Empathetic communication principles
-
-4. **Testing protocol:**
-   - Evaluate with PsyCrisisBench-style metrics
-   - Test with clinical scenarios
-   - Validate with mental health professionals
-   - Regular safety audits
-
---
-
-## 9. Risks and Limitations
-
-### Critical Risks
-1. **False negatives:** Missing someone in crisis (12-17% rate)
-2. **Over-reliance:** Users may treat AI as substitute for professional help
-3. **Hallucination:** Model may generate inappropriate or harmful advice
-4. **Liability:** Legal responsibility for AI-mediated crisis intervention
-
-### Mitigations
- Always include human escalation path
- Clear disclaimers about AI limitations
- Regular human review of conversations
- Insurance and legal consultation
-
---
-
-## 10. Key Citations
-
-1. Deng et al. (2025). "Evaluating Large Language Models in Crisis Detection: A Real-World Benchmark from Psychological Support Hotlines." arXiv:2506.01329. PsyCrisisBench.
-
-2. Wiest et al. (2024). "Detection of suicidality from medical text using privacy-preserving large language models." British Journal of Psychiatry, 225(6), 532-537.
-
-3. Holmes et al. (2025). "Applications of Large Language Models in the Field of Suicide Prevention: Scoping Review." J Med Internet Res, 27, e63126.
-
-4. Levkovich & Omar (2024). "Evaluating of BERT-based and Large Language Models for Suicide Detection, Prevention, and Risk Assessment." J Med Syst, 48(1), 113.
-
-5. Shukla et al. (2026). "Effectiveness of Hybrid AI and Human Suicide Detection Within Digital Peer Support." J Clin Med, 15(5), 1929.
-
-6. Qi et al. (2025). "Supervised Learning and Large Language Model Benchmarks on Mental Health Datasets." Bioengineering, 12(8), 882.
-
-7. Liu et al. (2025). "Enhanced large language models for effective screening of depression and anxiety." Commun Med, 5(1), 457.
-
---
-
-## Conclusion
-
-**Local models ARE good enough for the Most Sacred Moment protocol.**
-
-The research is clear:
- Crisis detection F1 scores of 0.88-0.91 are achievable
- Fine-tuned small models (1.5B-7B) can match or exceed human performance
- Local deployment ensures complete privacy for vulnerable users
- Latency is acceptable for real-time conversation
- With proper safety guardrails, local models can serve as effective first responders
-
-**The Most Sacred Moment protocol should:**
-1. Use Qwen2.5-7B or similar as primary conversational model
-2. Implement Llama-Guard3 as safety filter
-3. Build in immediate 988 Lifeline escalation
-4. Maintain human oversight and review
-5. Fine-tune on crisis-specific data when possible
-6. Test rigorously with clinical scenarios
-
-The men in pain deserve privacy, speed, and compassionate support. Local models deliver all three.
-
---
-
-*Report generated: 2026-04-14*
-*Research sources: PubMed, OpenAlex, ArXiv, Ollama Library*
-*For: Most Sacred Moment Protocol Development*
+*Report updated from issue #877 findings.*
+*Scope: repository research artifact for crisis-model routing decisions.*
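
Reviewer sketch (not part of the commit): the two-stage pipeline the updated report recommends, local detection followed by frontier escalation with a logged decision, might look roughly like the following. Everything here is an assumption made for illustration: `detect_crisis`, `route_turn`, `RouteDecision`, the keyword list, and the provider names `local-default` and `frontier-crisis` are hypothetical and do not correspond to existing Hermes code. A real detector would be a local model call rather than a keyword match.

```python
from __future__ import annotations

import json
import logging
from dataclasses import asdict, dataclass

logger = logging.getLogger("crisis_routing")

# Hypothetical stand-in for the local crisis detector described in the report.
CRISIS_MARKERS = ("suicide", "kill myself", "self-harm", "end my life")


@dataclass(frozen=True)
class RouteDecision:
    """Audit record for one turn: detector verdict plus selected provider."""

    is_crisis: bool
    provider: str
    matched_markers: tuple[str, ...]


def detect_crisis(user_message: str) -> tuple[str, ...]:
    """Return the crisis markers found in the message (empty if none)."""
    lowered = user_message.lower()
    return tuple(marker for marker in CRISIS_MARKERS if marker in lowered)


def route_turn(
    user_message: str,
    local: str = "local-default",
    frontier: str = "frontier-crisis",
) -> RouteDecision:
    """Stay local for normal turns; escalate to the frontier provider on crisis."""
    matched = detect_crisis(user_message)
    decision = RouteDecision(
        is_crisis=bool(matched),
        provider=frontier if matched else local,
        matched_markers=matched,
    )
    # Log every decision so operators can audit when escalation happened.
    logger.info("route_decision %s", json.dumps(asdict(decision)))
    return decision
```

Calling `route_turn(message)` once per user turn keeps detection always-on and cheap, while the returned `RouteDecision` carries the detector verdict and selected provider that the report asks to be logged.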
--- a/scripts/tensorzero_eval_packet.py
+++ b/scripts/tensorzero_eval_packet.py
@@ -1,318 +0,0 @@
-#!/usr/bin/env python3
-"""Generate a grounded TensorZero evaluation packet for Hermes.
-
-This script inventories the current Hermes routing/evaluation surfaces, then
-builds a markdown packet assessing how much of issue #860 can be satisfied by
-TensorZero and where the migration risk still lives.
-"""
-
-from __future__ import annotations
-
-import argparse
-import json
-import re
-from dataclasses import asdict, dataclass
-from pathlib import Path
-from typing import Iterable
-
-ISSUE_NUMBER = 860
-ISSUE_TITLE = "tensorzero LLMOps platform evaluation"
-ISSUE_URL = "https://forge.alexanderwhitestone.com/Timmy_Foundation/hermes-agent/issues/860"
-DEFAULT_OUTPUT = Path("docs/evaluations/tensorzero-860-evaluation.md")
-DEFAULT_JSON_OUTPUT = Path("docs/evaluations/tensorzero-860-evaluation.json")
-
-
-@dataclass(frozen=True)
-class TouchpointPattern:
-    label: str
-    file_path: str
-    regex: str
-    description: str
-
-
-@dataclass(frozen=True)
-class Touchpoint:
-    label: str
-    file_path: str
-    line_number: int
-    matched_text: str
-
-
-@dataclass(frozen=True)
-class RequirementStatus:
-    key: str
-    name: str
-    status: str
-    evidence_labels: tuple[str, ...]
-    summary: str
-
-
-@dataclass(frozen=True)
-class EvaluationReport:
-    issue_number: int
-    issue_title: str
-    issue_url: str
-    recommendation: str
-    touchpoints: tuple[Touchpoint, ...]
-    requirements: tuple[RequirementStatus, ...]
-
-
-PATTERNS: tuple[TouchpointPattern, ...] = (
-    TouchpointPattern(
-        label="fallback_chain",
-        file_path="run_agent.py",
-        regex=r"_fallback_chain|fallback_providers|fallback_model",
-        description="Primary agent fallback-provider chain in the core conversation loop.",
-    ),
-    TouchpointPattern(
-        label="provider_routing_config",
-        file_path="cli.py",
-        regex=r"provider_routing|fallback_providers|smart_model_routing",
-        description="CLI-owned provider routing and fallback configuration surfaces.",
-    ),
-    TouchpointPattern(
-        label="runtime_provider",
-        file_path="hermes_cli/runtime_provider.py",
-        regex=r"def resolve_runtime_provider|def resolve_requested_provider",
-        description="Central runtime provider resolution for CLI, gateway, cron, and helpers.",
-    ),
-    TouchpointPattern(
-        label="smart_model_routing",
-        file_path="agent/smart_model_routing.py",
-        regex=r"def resolve_turn_route|def choose_cheap_model_route",
-        description="Cheap-vs-strong turn routing that TensorZero would need to absorb or replace.",
-    ),
-    TouchpointPattern(
-        label="gateway_provider_routing",
-        file_path="gateway/run.py",
-        regex=r"def _load_provider_routing|def _load_fallback_model|def _load_smart_model_routing",
-        description="Gateway-specific loading of routing, fallback, and smart-model policies.",
-    ),
-    TouchpointPattern(
-        label="cron_runtime_provider",
-        file_path="cron/scheduler.py",
-        regex=r"resolve_runtime_provider|resolve_turn_route|provider_routing|fallback_model",
-        description="Cron execution path that re-resolves providers and routing on every run.",
-    ),
-    TouchpointPattern(
-        label="auxiliary_fallback_chain",
-        file_path="agent/auxiliary_client.py",
-        regex=r"fallback chain|_get_provider_chain|provider chain",
-        description="Auxiliary task routing/fallback chain outside the main inference path.",
-    ),
-    TouchpointPattern(
-        label="delegate_runtime_provider",
-        file_path="tools/delegate_tool.py",
-        regex=r"runtime provider system|resolve the full credential bundle|resolve_runtime_provider",
-        description="Subagent/delegation routing path that would also need TensorZero parity.",
-    ),
-    TouchpointPattern(
-        label="session_db",
-        file_path="hermes_state.py",
-        regex=r"class SessionDB",
-        description="Session persistence surface that could feed TensorZero optimization/eval data.",
-    ),
-    TouchpointPattern(
-        label="trajectory_export",
-        file_path="batch_runner.py",
-        regex=r"trajectory_entry|save_trajectories|_convert_to_trajectory_format",
-        description="Trajectory export surface for offline optimization and replay data.",
-    ),
-    TouchpointPattern(
-        label="benchmark_suite",
-        file_path="benchmarks/tool_call_benchmark.py",
-        regex=r"ToolCall\(|class ToolCall|benchmark",
-        description="Existing benchmark/evaluation harness that could map to TensorZero experiments.",
-    ),
-)
-
-
-def _iter_matches(pattern: TouchpointPattern, text: str) -> Iterable[Touchpoint]:
-    regex = re.compile(pattern.regex, re.IGNORECASE)
-    for line_number, line in enumerate(text.splitlines(), start=1):
-        if regex.search(line):
-            yield Touchpoint(
-                label=pattern.label,
-                file_path=pattern.file_path,
-                line_number=line_number,
-                matched_text=line.strip(),
-            )
-
-
-def scan_touchpoints(repo_root: Path) -> list[Touchpoint]:
-    touchpoints: list[Touchpoint] = []
-    for pattern in PATTERNS:
-        path = repo_root / pattern.file_path
-        if not path.exists():
-            continue
-        text = path.read_text(encoding="utf-8")
-        touchpoints.extend(_iter_matches(pattern, text))
-    return touchpoints
-
-
-def build_requirement_matrix(touchpoints: list[Touchpoint]) -> list[RequirementStatus]:
-    labels = {tp.label for tp in touchpoints}
-
-    matrix: list[RequirementStatus] = []
-    gateway_labels = (
-        "fallback_chain",
-        "runtime_provider",
-        "gateway_provider_routing",
-        "cron_runtime_provider",
-        "auxiliary_fallback_chain",
-        "delegate_runtime_provider",
-    )
-    gateway_hits = tuple(label for label in gateway_labels if label in labels)
-    gateway_status = "partial" if len(gateway_hits) >= 4 else "gap"
-    gateway_summary = (
-        "Hermes already spreads provider routing across core agent, runtime provider, gateway, cron, auxiliary, and delegation seams; "
-        "TensorZero would need parity across all of them before it can replace the gateway layer."
-        if gateway_hits else
-        "No grounded routing surfaces were found for a gateway replacement assessment."
-    )
-    matrix.append(RequirementStatus("gateway_replacement", "Gateway replacement scope", gateway_status, gateway_hits, gateway_summary))
-
-    config_labels = (
-        "provider_routing_config",
-        "runtime_provider",
-        "smart_model_routing",
-        "fallback_chain",
-    )
-    config_hits = tuple(label for label in config_labels if label in labels)
-    config_status = "partial" if len(config_hits) >= 3 else "gap"
-    config_summary = (
-        "Hermes has multiple config concepts to migrate (`provider_routing`, `fallback_providers`, `smart_model_routing`, runtime provider resolution), "
-        "so TensorZero is not a drop-in config swap."
-        if config_hits else
-        "No current config migration surface was found."
-    )
-    matrix.append(RequirementStatus("config_migration", "Config migration", config_status, config_hits, config_summary))
-
-    canary_hits: tuple[str, ...] = tuple()
-    canary_summary = (
-        "The repo shows semantic routing and fallback, but no grounded 10% traffic-split canary mechanism. "
-        "A TensorZero cutover would need new percentage-based rollout controls and observability hooks."
-    )
-    matrix.append(RequirementStatus("canary_rollout", "10% traffic canary", "gap", canary_hits, canary_summary))
-
-    session_labels = ("session_db", "trajectory_export")
-    session_hits = tuple(label for label in session_labels if label in labels)
-    session_status = "partial" if len(session_hits) == len(session_labels) else "gap"
-    session_summary = (
-        "Hermes already has SessionDB and trajectory export surfaces that can feed offline optimization data, "
-        "but not a TensorZero-native ingestion path yet."
-        if session_hits else
-        "No session-data surface was found for prompt optimization."
-    )
-    matrix.append(RequirementStatus("session_feedback", "Session data for prompt optimization", session_status, session_hits, session_summary))
-
-    eval_labels = ("benchmark_suite", "trajectory_export")
-    eval_hits = tuple(label for label in eval_labels if label in labels)
-    eval_status = "partial" if "benchmark_suite" in eval_hits else "gap"
-    eval_summary = (
-        "Hermes already has benchmark/trajectory machinery that can seed TensorZero A/B evaluation, "
-        "but no integrated TensorZero experiment runner or live evaluation gateway."
-        if eval_hits else
-        "No evaluation harness was found to support TensorZero A/B testing."
-    )
-    matrix.append(RequirementStatus("evaluation_suite", "Evaluation suite / A/B testing", eval_status, eval_hits, eval_summary))
-
-    return matrix
-
-
-def build_report(touchpoints: list[Touchpoint], requirement_matrix: list[RequirementStatus]) -> EvaluationReport:
-    recommendation = (
-        "Not ready for direct replacement. Recommend a shadow-evaluation phase first: keep Hermes routing live, "
-        "inventory the migration seams, export SessionDB/trajectory data into an offline TensorZero experiment loop, "
-        "and only design a canary gateway once percentage-based rollout controls exist."
-    )
-    return EvaluationReport(
-        issue_number=ISSUE_NUMBER,
-        issue_title=ISSUE_TITLE,
-        issue_url=ISSUE_URL,
-        recommendation=recommendation,
-        touchpoints=tuple(touchpoints),
-        requirements=tuple(requirement_matrix),
-    )
-
-
-def build_markdown(report: EvaluationReport) -> str:
-    lines: list[str] = []
-    lines.append("# TensorZero Evaluation Packet")
-    lines.append("")
-    lines.append(f"Issue #{report.issue_number}: [{report.issue_title}]({report.issue_url})")
-    lines.append("")
-    lines.append("## Scope")
-    lines.append("")
-    lines.append("This packet evaluates TensorZero as a possible replacement for Hermes' custom provider-routing stack.")
-    lines.append("It is intentionally grounded in the current repo state rather than a speculative cutover plan.")
-    lines.append("")
-    lines.append("## Issue requirements being evaluated")
-    lines.append("")
-    lines.append("- Deploy tensorzero gateway (Rust binary)")
-    lines.append("- Migrate provider routing config")
-    lines.append("- Test with canary (10% traffic) before full cutover")
-    lines.append("- Feed session data for prompt optimization")
-    lines.append("- Evaluation suite for A/B testing models")
-    lines.append("")
-    lines.append("## Recommendation")
-    lines.append("")
-    lines.append(report.recommendation)
-    lines.append("")
-    lines.append("## Requirement matrix")
-    lines.append("")
-    lines.append("| Requirement | Status | Evidence labels | Summary |")
-    lines.append("| --- | --- | --- | --- |")
-    for row in report.requirements:
-        evidence = ", ".join(row.evidence_labels) if row.evidence_labels else "—"
-        lines.append(f"| {row.name} | {row.status} | {evidence} | {row.summary} |")
-    lines.append("")
-    lines.append("## Grounded Hermes touchpoints")
-    lines.append("")
-    if report.touchpoints:
-        for tp in report.touchpoints:
-            lines.append(f"- `{tp.file_path}:{tp.line_number}` — [{tp.label}] {tp.matched_text}")
-    else:
-        lines.append("- No routing/evaluation touchpoints were found.")
-    lines.append("")
-    lines.append("## Suggested next slice")
-    lines.append("")
-    lines.append("1. Build an exporter that emits SessionDB + trajectory data into a TensorZero-friendly offline dataset.")
-    lines.append("2. Define percentage-based canary controls before attempting any gateway replacement.")
-    lines.append("3. Keep Hermes routing authoritative until TensorZero proves parity across CLI, gateway, cron, auxiliary, and delegation surfaces.")
-    lines.append("")
-    return "\n".join(lines).rstrip() + "\n"
-
-
-def write_outputs(report: EvaluationReport, markdown_path: Path, json_path: Path | None = None) -> None:
-    markdown_path.parent.mkdir(parents=True, exist_ok=True)
-    markdown_path.write_text(build_markdown(report), encoding="utf-8")
-    if json_path is not None:
-        json_path.parent.mkdir(parents=True, exist_ok=True)
-        json_path.write_text(json.dumps(asdict(report), indent=2), encoding="utf-8")
-
-
-def parse_args() -> argparse.Namespace:
-    parser = argparse.ArgumentParser(description="Generate a grounded TensorZero evaluation packet for Hermes")
-    parser.add_argument("--repo-root", default=".", help="Hermes repo root to scan")
-    parser.add_argument("--output", default=str(DEFAULT_OUTPUT), help="Markdown output path")
-    parser.add_argument("--json-output", default=str(DEFAULT_JSON_OUTPUT), help="Optional JSON output path")
-    return parser.parse_args()
-
-
-def main() -> int:
-    args = parse_args()
-    repo_root = Path(args.repo_root).resolve()
-    touchpoints = scan_touchpoints(repo_root)
-    matrix = build_requirement_matrix(touchpoints)
-    report = build_report(touchpoints, matrix)
-    json_output = Path(args.json_output) if args.json_output else None
-    write_outputs(report, Path(args.output), json_output)
-    print(f"Wrote {args.output}")
-    if json_output is not None:
-        print(f"Wrote {json_output}")
-    return 0
-
-
-if __name__ == "__main__":
-    raise SystemExit(main())
--- a/tests/test_research_local_model_crisis_quality.py
+++ b/tests/test_research_local_model_crisis_quality.py
@@ -0,0 +1,16 @@
+from pathlib import Path
+
+
+REPORT = Path(__file__).resolve().parent.parent / "research_local_model_crisis_quality.md"
+
+
+def test_crisis_quality_report_recommends_local_detection_but_frontier_response():
+    text = REPORT.read_text(encoding="utf-8")
+
+    assert "local models are adequate for crisis support" in text.lower()
+    assert "not for crisis response generation" in text.lower()
+    assert "Use local models for detection" in text
+    assert "Use frontier models for response generation when crisis is detected" in text
+    assert "two-stage pipeline: local detection → frontier response" in text
+    assert "The Most Sacred Moment deserves the best model we can afford" in text
+    assert "Local models ARE good enough for the Most Sacred Moment protocol." not in text
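
Reviewer sketch (not part of the commit): the test above pins the report's wording, while the report itself also asks for reliable 988 behavior and crisis-specific regression evaluation. A behavior-level check could complement it along these lines. The helpers `includes_988_referral` and `referral_rate` are hypothetical names, and any sample responses passed to them would be illustrative inputs, not evaluation data from issue #877.

```python
from __future__ import annotations

import re

# Match "988" as a standalone number, not inside longer digit runs
# such as "1988" or "9880".
_REFERRAL_RE = re.compile(r"(?<!\d)988(?!\d)")


def includes_988_referral(response: str) -> bool:
    """Return True if the response text mentions the 988 Lifeline number."""
    return _REFERRAL_RE.search(response) is not None


def referral_rate(responses: list[str]) -> float:
    """Fraction of responses that include a 988 referral."""
    if not responses:
        raise ValueError("no responses to evaluate")
    hits = sum(includes_988_referral(r) for r in responses)
    return hits / len(responses)
```

A regression test could then assert that `referral_rate` over a fixed set of crisis prompts stays above a chosen threshold; the threshold itself is a policy decision the report leaves open beyond "near-perfect".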
--- a/tests/test_tensorzero_eval_packet.py
+++ b/tests/test_tensorzero_eval_packet.py
@@ -1,149 +0,0 @@
-from pathlib import Path
-import sys
-
-SCRIPT_DIR = Path(__file__).resolve().parents[1] / "scripts"
-sys.path.insert(0, str(SCRIPT_DIR))
-
-import tensorzero_eval_packet as tz
-
-
-def test_scan_touchpoints_finds_expected_matches(tmp_path):
-    (tmp_path / "run_agent.py").write_text(
-        "self._fallback_chain = []\n# Provider fallback chain\n"
-    )
-    (tmp_path / "hermes_cli").mkdir()
-    (tmp_path / "hermes_cli" / "runtime_provider.py").write_text(
-        "def resolve_runtime_provider():\n    return {}\n"
-    )
-    (tmp_path / "agent").mkdir()
-    (tmp_path / "agent" / "smart_model_routing.py").write_text(
-        "def resolve_turn_route(user_message, routing_config, primary):\n    return primary\n"
-    )
-    (tmp_path / "gateway").mkdir()
-    (tmp_path / "gateway" / "run.py").write_text(
-        "def _load_provider_routing():\n    return {}\n"
-    )
-    (tmp_path / "cron").mkdir()
-    (tmp_path / "cron" / "scheduler.py").write_text(
-        "runtime = resolve_runtime_provider()\nturn_route = resolve_turn_route('x', {}, {})\n"
-    )
-    (tmp_path / "hermes_state.py").write_text("class SessionDB:\n    pass\n")
-    (tmp_path / "benchmarks").mkdir()
-    (tmp_path / "benchmarks" / "tool_call_benchmark.py").write_text(
-        "class ToolCall: ...\n"
-    )
-
-    touchpoints = tz.scan_touchpoints(tmp_path)
-
-    labels = {tp.label for tp in touchpoints}
-    assert "fallback_chain" in labels
-    assert "runtime_provider" in labels
-    assert "smart_model_routing" in labels
-    assert "gateway_provider_routing" in labels
-    assert "cron_runtime_provider" in labels
-    assert "session_db" in labels
-    assert "benchmark_suite" in labels
-
-
-def test_build_requirement_matrix_marks_canary_as_gap_without_split_support():
-    touchpoints = [
-        tz.Touchpoint(
-            label="runtime_provider",
-            file_path="hermes_cli/runtime_provider.py",
-            line_number=10,
-            matched_text="def resolve_runtime_provider",
-        ),
-        tz.Touchpoint(
-            label="provider_routing_config",
-            file_path="cli.py",
-            line_number=20,
-            matched_text='provider_routing',
-        ),
-        tz.Touchpoint(
-            label="fallback_chain",
-            file_path="run_agent.py",
-            line_number=21,
-            matched_text='_fallback_chain = []',
-        ),
-        tz.Touchpoint(
-            label="smart_model_routing",
-            file_path="agent/smart_model_routing.py",
-            line_number=30,
-            matched_text='resolve_turn_route',
-        ),
-        tz.Touchpoint(
-            label="gateway_provider_routing",
-            file_path="gateway/run.py",
-            line_number=35,
-            matched_text='def _load_provider_routing',
-        ),
-        tz.Touchpoint(
-            label="cron_runtime_provider",
-            file_path="cron/scheduler.py",
-            line_number=36,
-            matched_text='runtime = resolve_runtime_provider()',
-        ),
-        tz.Touchpoint(
-            label="session_db",
-            file_path="hermes_state.py",
-            line_number=40,
-            matched_text='class SessionDB',
-        ),
-        tz.Touchpoint(
-            label="trajectory_export",
-            file_path="batch_runner.py",
-            line_number=50,
-            matched_text='trajectory_entry',
-        ),
-        tz.Touchpoint(
-            label="benchmark_suite",
-            file_path="benchmarks/tool_call_benchmark.py",
-            line_number=60,
-            matched_text='ToolCall',
-        ),
-    ]
-
-    matrix = tz.build_requirement_matrix(touchpoints)
-    by_key = {row.key: row for row in matrix}
-
-    assert by_key["gateway_replacement"].status == "partial"
-    assert by_key["config_migration"].status == "partial"
-    assert by_key["canary_rollout"].status == "gap"
-    assert by_key["session_feedback"].status == "partial"
-    assert by_key["evaluation_suite"].status == "partial"
-
-
-def test_build_markdown_renders_recommendation_and_touchpoints():
-    touchpoints = [
-        tz.Touchpoint(
-            label="runtime_provider",
-            file_path="hermes_cli/runtime_provider.py",
-            line_number=10,
-            matched_text="def resolve_runtime_provider",
-        ),
-        tz.Touchpoint(
-            label="session_db",
-            file_path="hermes_state.py",
-            line_number=40,
-            matched_text='class SessionDB',
-        ),
-    ]
-    matrix = tz.build_requirement_matrix(touchpoints)
-    report = tz.build_report(touchpoints, matrix)
-    markdown = tz.build_markdown(report)
-
-    assert "# TensorZero Evaluation Packet" in markdown
-    assert "gateway_replacement" not in markdown  # human labels, not raw keys
-    assert "Gateway replacement scope" in markdown
-    assert "Not ready for direct replacement" in markdown
-    assert "hermes_cli/runtime_provider.py:10" in markdown
-    assert "hermes_state.py:40" in markdown
-
-
-def test_issue_context_is_embedded_in_report():
-    report = tz.build_report([], [])
-    markdown = tz.build_markdown(report)
-
-    assert "Issue #860" in markdown
-    assert "tensorzero" in markdown.lower()
-    assert "10% traffic" in markdown