Compare commits

..

1 Commits

Author SHA1 Message Date
Alexander Whitestone
755e7513a1 docs: add grounded tensorzero evaluation packet (#860)
All checks were successful
Lint / lint (pull_request) Successful in 37s
- add a script that inventories Hermes routing/evaluation surfaces relevant to a TensorZero cutover
- generate a markdown and JSON evaluation packet for issue #860
- score gateway replacement, config migration, canary rollout, session feedback, and eval-suite readiness
- add focused regression tests for touchpoint scanning, requirement scoring, and report rendering

Refs #860
2026-04-22 11:33:31 -04:00
10 changed files with 1825 additions and 436 deletions

File diff suppressed because it is too large Load Diff

View File

@@ -0,0 +1,217 @@
# TensorZero Evaluation Packet
Issue #860: [tensorzero LLMOps platform evaluation](https://forge.alexanderwhitestone.com/Timmy_Foundation/hermes-agent/issues/860)
## Scope
This packet evaluates TensorZero as a possible replacement for Hermes' custom provider-routing stack.
It is intentionally grounded in the current repo state rather than a speculative cutover plan.
## Issue requirements being evaluated
- Deploy tensorzero gateway (Rust binary)
- Migrate provider routing config
- Test with canary (10% traffic) before full cutover
- Feed session data for prompt optimization
- Evaluation suite for A/B testing models
## Recommendation
Not ready for direct replacement. Recommend a shadow-evaluation phase first: keep Hermes routing live, inventory the migration seams, export SessionDB/trajectory data into an offline TensorZero experiment loop, and only design a canary gateway once percentage-based rollout controls exist.
## Requirement matrix
| Requirement | Status | Evidence labels | Summary |
| --- | --- | --- | --- |
| Gateway replacement scope | partial | fallback_chain, runtime_provider, gateway_provider_routing, cron_runtime_provider, auxiliary_fallback_chain, delegate_runtime_provider | Hermes already spreads provider routing across core agent, runtime provider, gateway, cron, auxiliary, and delegation seams; TensorZero would need parity across all of them before it can replace the gateway layer. |
| Config migration | partial | provider_routing_config, runtime_provider, smart_model_routing, fallback_chain | Hermes has multiple config concepts to migrate (`provider_routing`, `fallback_providers`, `smart_model_routing`, runtime provider resolution), so TensorZero is not a drop-in config swap. |
| 10% traffic canary | gap | — | The repo shows semantic routing and fallback, but no grounded 10% traffic-split canary mechanism. A TensorZero cutover would need new percentage-based rollout controls and observability hooks. |
| Session data for prompt optimization | partial | session_db, trajectory_export | Hermes already has SessionDB and trajectory export surfaces that can feed offline optimization data, but not a TensorZero-native ingestion path yet. |
| Evaluation suite / A/B testing | partial | benchmark_suite, trajectory_export | Hermes already has benchmark/trajectory machinery that can seed TensorZero A/B evaluation, but no integrated TensorZero experiment runner or live evaluation gateway. |
## Grounded Hermes touchpoints
- `run_agent.py:601` — [fallback_chain] fallback_model: Dict[str, Any] = None,
- `run_agent.py:995` — [fallback_chain] # failure). Supports both legacy single-dict ``fallback_model`` and
- `run_agent.py:996` — [fallback_chain] # new list ``fallback_providers`` format.
- `run_agent.py:997` — [fallback_chain] if isinstance(fallback_model, list):
- `run_agent.py:998` — [fallback_chain] self._fallback_chain = [
- `run_agent.py:999` — [fallback_chain] f for f in fallback_model
- `run_agent.py:1002` — [fallback_chain] elif isinstance(fallback_model, dict) and fallback_model.get("provider") and fallback_model.get("model"):
- `run_agent.py:1003` — [fallback_chain] self._fallback_chain = [fallback_model]
- `run_agent.py:1005` — [fallback_chain] self._fallback_chain = []
- `run_agent.py:1009` — [fallback_chain] self._fallback_model = self._fallback_chain[0] if self._fallback_chain else None
- `run_agent.py:1010` — [fallback_chain] if self._fallback_chain and not self.quiet_mode:
- `run_agent.py:1011` — [fallback_chain] if len(self._fallback_chain) == 1:
- `run_agent.py:1012` — [fallback_chain] fb = self._fallback_chain[0]
- `run_agent.py:1015` — [fallback_chain] print(f"🔄 Fallback chain ({len(self._fallback_chain)} providers): " +
- `run_agent.py:1016` — [fallback_chain] " → ".join(f"{f['model']} ({f['provider']})" for f in self._fallback_chain))
- `run_agent.py:5624` — [fallback_chain] if self._fallback_index >= len(self._fallback_chain):
- `run_agent.py:5627` — [fallback_chain] fb = self._fallback_chain[self._fallback_index]
- `run_agent.py:8559` — [fallback_chain] if self._fallback_index < len(self._fallback_chain):
- `run_agent.py:9355` — [fallback_chain] if is_rate_limited and self._fallback_index < len(self._fallback_chain):
- `run_agent.py:10460` — [fallback_chain] if _truly_empty and self._fallback_chain:
- `run_agent.py:10514` — [fallback_chain] + (" and fallback attempts." if self._fallback_chain else
- `cli.py:241` — [provider_routing_config] "smart_model_routing": {
- `cli.py:370` — [provider_routing_config] # (e.g. platform_toolsets, provider_routing, memory, honcho, etc.)
- `cli.py:1753` — [provider_routing_config] pr = CLI_CONFIG.get("provider_routing", {}) or {}
- `cli.py:1762` — [provider_routing_config] # Supports new list format (fallback_providers) and legacy single-dict (fallback_model).
- `cli.py:1763` — [provider_routing_config] fb = CLI_CONFIG.get("fallback_providers") or CLI_CONFIG.get("fallback_model") or []
- `cli.py:1770` — [provider_routing_config] self._smart_model_routing = CLI_CONFIG.get("smart_model_routing", {}) or {}
- `cli.py:2771` — [provider_routing_config] from agent.smart_model_routing import resolve_turn_route
- `cli.py:2776` — [provider_routing_config] self._smart_model_routing,
- `hermes_cli/runtime_provider.py:209` — [runtime_provider] def resolve_requested_provider(requested: Optional[str] = None) -> str:
- `hermes_cli/runtime_provider.py:649` — [runtime_provider] def resolve_runtime_provider(
- `agent/smart_model_routing.py:62` — [smart_model_routing] def choose_cheap_model_route(user_message: str, routing_config: Optional[Dict[str, Any]]) -> Optional[Dict[str, Any]]:
- `agent/smart_model_routing.py:110` — [smart_model_routing] def resolve_turn_route(user_message: str, routing_config: Optional[Dict[str, Any]], primary: Dict[str, Any]) -> Dict[str, Any]:
- `gateway/run.py:1271` — [gateway_provider_routing] def _load_provider_routing() -> dict:
- `gateway/run.py:1285` — [gateway_provider_routing] def _load_fallback_model() -> list | dict | None:
- `gateway/run.py:1306` — [gateway_provider_routing] def _load_smart_model_routing() -> dict:
- `cron/scheduler.py:684` — [cron_runtime_provider] pr = _cfg.get("provider_routing", {})
- `cron/scheduler.py:688` — [cron_runtime_provider] resolve_runtime_provider,
- `cron/scheduler.py:697` — [cron_runtime_provider] runtime = resolve_runtime_provider(**runtime_kwargs)
- `cron/scheduler.py:702` — [cron_runtime_provider] from agent.smart_model_routing import resolve_turn_route
- `cron/scheduler.py:703` — [cron_runtime_provider] turn_route = resolve_turn_route(
- `cron/scheduler.py:717` — [cron_runtime_provider] fallback_model = _cfg.get("fallback_providers") or _cfg.get("fallback_model") or None
- `cron/scheduler.py:746` — [cron_runtime_provider] fallback_model=fallback_model,
- `agent/auxiliary_client.py:1018` — [auxiliary_fallback_chain] def _get_provider_chain() -> List[tuple]:
- `agent/auxiliary_client.py:1107` — [auxiliary_fallback_chain] for label, try_fn in _get_provider_chain():
- `agent/auxiliary_client.py:1189` — [auxiliary_fallback_chain] # ── Step 2: aggregator / fallback chain ──────────────────────────────
- `agent/auxiliary_client.py:1191` — [auxiliary_fallback_chain] for label, try_fn in _get_provider_chain():
- `agent/auxiliary_client.py:2397` — [auxiliary_fallback_chain] # error, fall through to the fallback chain below.
- `agent/auxiliary_client.py:2417` — [auxiliary_fallback_chain] # auto (the default) = best-effort fallback chain. (#7559)
- `agent/auxiliary_client.py:2589` — [auxiliary_fallback_chain] # error, fall through to the fallback chain below.
- `tools/delegate_tool.py:662` — [delegate_runtime_provider] # bundle (base_url, api_key, api_mode) via the same runtime provider system
- `tools/delegate_tool.py:854` — [delegate_runtime_provider] provider) is resolved via the runtime provider system — the same path used
- `tools/delegate_tool.py:909` — [delegate_runtime_provider] from hermes_cli.runtime_provider import resolve_runtime_provider
- `tools/delegate_tool.py:910` — [delegate_runtime_provider] runtime = resolve_runtime_provider(requested=configured_provider)
- `hermes_state.py:115` — [session_db] class SessionDB:
- `batch_runner.py:320` — [trajectory_export] save_trajectories=False, # We handle saving ourselves
- `batch_runner.py:346` — [trajectory_export] trajectory = agent._convert_to_trajectory_format(
- `batch_runner.py:460` — [trajectory_export] trajectory_entry = {
- `batch_runner.py:474` — [trajectory_export] f.write(json.dumps(trajectory_entry, ensure_ascii=False) + "\n")
- `benchmarks/tool_call_benchmark.py:3` — [benchmark_suite] Tool-Calling Benchmark — Gemma 4 vs mimo-v2-pro regression test.
- `benchmarks/tool_call_benchmark.py:9` — [benchmark_suite] python3 benchmarks/tool_call_benchmark.py # full 100-call suite
- `benchmarks/tool_call_benchmark.py:10` — [benchmark_suite] python3 benchmarks/tool_call_benchmark.py --limit 10 # quick smoke test
- `benchmarks/tool_call_benchmark.py:11` — [benchmark_suite] python3 benchmarks/tool_call_benchmark.py --models nous # single model
- `benchmarks/tool_call_benchmark.py:12` — [benchmark_suite] python3 benchmarks/tool_call_benchmark.py --category file # single category
- `benchmarks/tool_call_benchmark.py:37` — [benchmark_suite] class ToolCall:
- `benchmarks/tool_call_benchmark.py:51` — [benchmark_suite] ToolCall("file-01", "file", "Read the file /tmp/test_bench.txt and show me its contents.",
- `benchmarks/tool_call_benchmark.py:53` — [benchmark_suite] ToolCall("file-02", "file", "Write 'hello benchmark' to /tmp/test_bench_out.txt",
- `benchmarks/tool_call_benchmark.py:55` — [benchmark_suite] ToolCall("file-03", "file", "Search for the word 'import' in all Python files in the current directory.",
- `benchmarks/tool_call_benchmark.py:57` — [benchmark_suite] ToolCall("file-04", "file", "Read lines 1-20 of /etc/hosts",
- `benchmarks/tool_call_benchmark.py:59` — [benchmark_suite] ToolCall("file-05", "file", "Patch /tmp/test_bench_out.txt: replace 'hello' with 'goodbye'",
- `benchmarks/tool_call_benchmark.py:61` — [benchmark_suite] ToolCall("file-06", "file", "Search for files matching *.py in the current directory.",
- `benchmarks/tool_call_benchmark.py:63` — [benchmark_suite] ToolCall("file-07", "file", "Read the first 10 lines of /etc/passwd",
- `benchmarks/tool_call_benchmark.py:65` — [benchmark_suite] ToolCall("file-08", "file", "Write a JSON config to /tmp/bench_config.json with key 'debug': true",
- `benchmarks/tool_call_benchmark.py:67` — [benchmark_suite] ToolCall("file-09", "file", "Search for 'def test_' in Python test files.",
- `benchmarks/tool_call_benchmark.py:69` — [benchmark_suite] ToolCall("file-10", "file", "Read /tmp/bench_config.json and tell me what's in it.",
- `benchmarks/tool_call_benchmark.py:71` — [benchmark_suite] ToolCall("file-11", "file", "Create a file /tmp/bench_readme.md with one line: '# Benchmark'",
- `benchmarks/tool_call_benchmark.py:73` — [benchmark_suite] ToolCall("file-12", "file", "Search for 'TODO' comments in all .py files.",
- `benchmarks/tool_call_benchmark.py:75` — [benchmark_suite] ToolCall("file-13", "file", "Read /tmp/bench_readme.md",
- `benchmarks/tool_call_benchmark.py:77` — [benchmark_suite] ToolCall("file-14", "file", "Patch /tmp/bench_readme.md: replace '# Benchmark' with '# Tool Benchmark'",
- `benchmarks/tool_call_benchmark.py:78` — [benchmark_suite] "patch", "Tool Benchmark"),
- `benchmarks/tool_call_benchmark.py:79` — [benchmark_suite] ToolCall("file-15", "file", "Write a Python one-liner to /tmp/bench_hello.py that prints hello.",
- `benchmarks/tool_call_benchmark.py:81` — [benchmark_suite] ToolCall("file-16", "file", "Search for all .json files in /tmp/.",
- `benchmarks/tool_call_benchmark.py:83` — [benchmark_suite] ToolCall("file-17", "file", "Read /tmp/bench_hello.py and verify it has print('hello').",
- `benchmarks/tool_call_benchmark.py:85` — [benchmark_suite] ToolCall("file-18", "file", "Patch /tmp/bench_hello.py to print 'hello world' instead of 'hello'.",
- `benchmarks/tool_call_benchmark.py:87` — [benchmark_suite] ToolCall("file-19", "file", "List files matching 'bench*' in /tmp/.",
- `benchmarks/tool_call_benchmark.py:89` — [benchmark_suite] ToolCall("file-20", "file", "Read /tmp/test_bench.txt again and summarize its contents.",
- `benchmarks/tool_call_benchmark.py:93` — [benchmark_suite] ToolCall("term-01", "terminal", "Run `echo hello world` in the terminal.",
- `benchmarks/tool_call_benchmark.py:95` — [benchmark_suite] ToolCall("term-02", "terminal", "Run `date` to get the current date and time.",
- `benchmarks/tool_call_benchmark.py:97` — [benchmark_suite] ToolCall("term-03", "terminal", "Run `uname -a` to get system information.",
- `benchmarks/tool_call_benchmark.py:99` — [benchmark_suite] ToolCall("term-04", "terminal", "Run `pwd` to show the current directory.",
- `benchmarks/tool_call_benchmark.py:101` — [benchmark_suite] ToolCall("term-05", "terminal", "Run `ls -la /tmp/ | head -20` to list temp files.",
- `benchmarks/tool_call_benchmark.py:103` — [benchmark_suite] ToolCall("term-06", "terminal", "Run `whoami` to show the current user.",
- `benchmarks/tool_call_benchmark.py:105` — [benchmark_suite] ToolCall("term-07", "terminal", "Run `df -h` to show disk usage.",
- `benchmarks/tool_call_benchmark.py:107` — [benchmark_suite] ToolCall("term-08", "terminal", "Run `python3 --version` to check Python version.",
- `benchmarks/tool_call_benchmark.py:109` — [benchmark_suite] ToolCall("term-09", "terminal", "Run `cat /etc/hostname` to get the hostname.",
- `benchmarks/tool_call_benchmark.py:111` — [benchmark_suite] ToolCall("term-10", "terminal", "Run `uptime` to see system uptime.",
- `benchmarks/tool_call_benchmark.py:113` — [benchmark_suite] ToolCall("term-11", "terminal", "Run `env | grep PATH` to show the PATH variable.",
- `benchmarks/tool_call_benchmark.py:115` — [benchmark_suite] ToolCall("term-12", "terminal", "Run `wc -l /etc/passwd` to count lines.",
- `benchmarks/tool_call_benchmark.py:117` — [benchmark_suite] ToolCall("term-13", "terminal", "Run `echo $SHELL` to show the current shell.",
- `benchmarks/tool_call_benchmark.py:119` — [benchmark_suite] ToolCall("term-14", "terminal", "Run `free -h || vm_stat` to check memory usage.",
- `benchmarks/tool_call_benchmark.py:121` — [benchmark_suite] ToolCall("term-15", "terminal", "Run `id` to show user and group IDs.",
- `benchmarks/tool_call_benchmark.py:123` — [benchmark_suite] ToolCall("term-16", "terminal", "Run `hostname` to get the machine hostname.",
- `benchmarks/tool_call_benchmark.py:125` — [benchmark_suite] ToolCall("term-17", "terminal", "Run `echo {1..5}` to test brace expansion.",
- `benchmarks/tool_call_benchmark.py:127` — [benchmark_suite] ToolCall("term-18", "terminal", "Run `seq 1 5` to generate a number sequence.",
- `benchmarks/tool_call_benchmark.py:129` — [benchmark_suite] ToolCall("term-19", "terminal", "Run `python3 -c 'print(2+2)'` to compute 2+2.",
- `benchmarks/tool_call_benchmark.py:131` — [benchmark_suite] ToolCall("term-20", "terminal", "Run `ls -d /tmp/bench* 2>/dev/null | wc -l` to count bench files.",
- `benchmarks/tool_call_benchmark.py:135` — [benchmark_suite] ToolCall("code-01", "code", "Execute a Python script that computes factorial of 10.",
- `benchmarks/tool_call_benchmark.py:137` — [benchmark_suite] ToolCall("code-02", "code", "Run Python to read /tmp/test_bench.txt and count its words.",
- `benchmarks/tool_call_benchmark.py:139` — [benchmark_suite] ToolCall("code-03", "code", "Execute Python to generate the first 20 Fibonacci numbers.",
- `benchmarks/tool_call_benchmark.py:141` — [benchmark_suite] ToolCall("code-04", "code", "Run Python to parse JSON from a string and print keys.",
- `benchmarks/tool_call_benchmark.py:143` — [benchmark_suite] ToolCall("code-05", "code", "Execute Python to list all files in /tmp/ matching 'bench*'.",
- `benchmarks/tool_call_benchmark.py:145` — [benchmark_suite] ToolCall("code-06", "code", "Run Python to compute the sum of squares from 1 to 100.",
- `benchmarks/tool_call_benchmark.py:147` — [benchmark_suite] ToolCall("code-07", "code", "Execute Python to check if 'racecar' is a palindrome.",
- `benchmarks/tool_call_benchmark.py:149` — [benchmark_suite] ToolCall("code-08", "code", "Run Python to create a CSV string with 5 rows of sample data.",
- `benchmarks/tool_call_benchmark.py:151` — [benchmark_suite] ToolCall("code-09", "code", "Execute Python to sort a list [5,2,8,1,9] and print the result.",
- `benchmarks/tool_call_benchmark.py:153` — [benchmark_suite] ToolCall("code-10", "code", "Run Python to count lines in /etc/passwd.",
- `benchmarks/tool_call_benchmark.py:155` — [benchmark_suite] ToolCall("code-11", "code", "Execute Python to hash the string 'benchmark' with SHA256.",
- `benchmarks/tool_call_benchmark.py:157` — [benchmark_suite] ToolCall("code-12", "code", "Run Python to get the current UTC timestamp.",
- `benchmarks/tool_call_benchmark.py:159` — [benchmark_suite] ToolCall("code-13", "code", "Execute Python to convert 'hello world' to uppercase and reverse it.",
- `benchmarks/tool_call_benchmark.py:161` — [benchmark_suite] ToolCall("code-14", "code", "Run Python to create a dictionary of system info (platform, python version).",
- `benchmarks/tool_call_benchmark.py:163` — [benchmark_suite] ToolCall("code-15", "code", "Execute Python to check internet connectivity by resolving google.com.",
- `benchmarks/tool_call_benchmark.py:167` — [benchmark_suite] ToolCall("deleg-01", "delegate", "Use a subagent to find all .log files in /tmp/.",
- `benchmarks/tool_call_benchmark.py:169` — [benchmark_suite] ToolCall("deleg-02", "delegate", "Delegate to a subagent: what is 15 * 37?",
- `benchmarks/tool_call_benchmark.py:171` — [benchmark_suite] ToolCall("deleg-03", "delegate", "Use a subagent to check if Python 3 is installed and its version.",
- `benchmarks/tool_call_benchmark.py:173` — [benchmark_suite] ToolCall("deleg-04", "delegate", "Delegate: read /tmp/test_bench.txt and summarize it in one sentence.",
- `benchmarks/tool_call_benchmark.py:175` — [benchmark_suite] ToolCall("deleg-05", "delegate", "Use a subagent to list the contents of /tmp/ directory.",
- `benchmarks/tool_call_benchmark.py:177` — [benchmark_suite] ToolCall("deleg-06", "delegate", "Delegate: count the number of .py files in the current directory.",
- `benchmarks/tool_call_benchmark.py:179` — [benchmark_suite] ToolCall("deleg-07", "delegate", "Use a subagent to check disk space with df -h.",
- `benchmarks/tool_call_benchmark.py:181` — [benchmark_suite] ToolCall("deleg-08", "delegate", "Delegate: what OS are we running on?",
- `benchmarks/tool_call_benchmark.py:183` — [benchmark_suite] ToolCall("deleg-09", "delegate", "Use a subagent to find the hostname of this machine.",
- `benchmarks/tool_call_benchmark.py:185` — [benchmark_suite] ToolCall("deleg-10", "delegate", "Delegate: create a temp file /tmp/bench_deleg.txt with 'done'.",
- `benchmarks/tool_call_benchmark.py:189` — [benchmark_suite] ToolCall("todo-01", "todo", "Add a todo item: 'Run benchmark suite'",
- `benchmarks/tool_call_benchmark.py:190` — [benchmark_suite] "todo", "benchmark"),
- `benchmarks/tool_call_benchmark.py:191` — [benchmark_suite] ToolCall("todo-02", "todo", "Show me the current todo list.",
- `benchmarks/tool_call_benchmark.py:193` — [benchmark_suite] ToolCall("todo-03", "todo", "Mark the first todo item as completed.",
- `benchmarks/tool_call_benchmark.py:195` — [benchmark_suite] ToolCall("todo-04", "todo", "Add a todo: 'Review benchmark results' with status pending.",
- `benchmarks/tool_call_benchmark.py:197` — [benchmark_suite] ToolCall("todo-05", "todo", "Clear all completed todos.",
- `benchmarks/tool_call_benchmark.py:199` — [benchmark_suite] ToolCall("todo-06", "memory", "Save this to memory: 'benchmark ran on {date}'".format(
- `benchmarks/tool_call_benchmark.py:201` — [benchmark_suite] "memory", "benchmark"),
- `benchmarks/tool_call_benchmark.py:202` — [benchmark_suite] ToolCall("todo-07", "memory", "Search memory for 'benchmark'.",
- `benchmarks/tool_call_benchmark.py:203` — [benchmark_suite] "memory", "benchmark"),
- `benchmarks/tool_call_benchmark.py:204` — [benchmark_suite] ToolCall("todo-08", "memory", "Add a memory note: 'test models are gemma-4 and mimo-v2-pro'.",
- `benchmarks/tool_call_benchmark.py:206` — [benchmark_suite] ToolCall("todo-09", "todo", "Add three todo items: 'analyze', 'report', 'cleanup'.",
- `benchmarks/tool_call_benchmark.py:208` — [benchmark_suite] ToolCall("todo-10", "memory", "Search memory for any notes about models.",
- `benchmarks/tool_call_benchmark.py:212` — [benchmark_suite] ToolCall("skill-01", "skills", "List all available skills.",
- `benchmarks/tool_call_benchmark.py:214` — [benchmark_suite] ToolCall("skill-02", "skills", "View the skill called 'test-driven-development'.",
- `benchmarks/tool_call_benchmark.py:216` — [benchmark_suite] ToolCall("skill-03", "skills", "Search for skills related to 'git'.",
- `benchmarks/tool_call_benchmark.py:218` — [benchmark_suite] ToolCall("skill-04", "skills", "View the 'code-review' skill.",
- `benchmarks/tool_call_benchmark.py:220` — [benchmark_suite] ToolCall("skill-05", "skills", "List all skills in the 'devops' category.",
- `benchmarks/tool_call_benchmark.py:222` — [benchmark_suite] ToolCall("skill-06", "skills", "View the 'systematic-debugging' skill.",
- `benchmarks/tool_call_benchmark.py:224` — [benchmark_suite] ToolCall("skill-07", "skills", "Search for skills about 'testing'.",
- `benchmarks/tool_call_benchmark.py:226` — [benchmark_suite] ToolCall("skill-08", "skills", "View the 'writing-plans' skill.",
- `benchmarks/tool_call_benchmark.py:228` — [benchmark_suite] ToolCall("skill-09", "skills", "List skills in 'software-development' category.",
- `benchmarks/tool_call_benchmark.py:230` — [benchmark_suite] ToolCall("skill-10", "skills", "View the 'pr-review-discipline' skill.",
- `benchmarks/tool_call_benchmark.py:234` — [benchmark_suite] ToolCall("file-21", "file", "Write a Python snippet to /tmp/bench_sort.py that sorts [3,1,2].",
- `benchmarks/tool_call_benchmark.py:236` — [benchmark_suite] ToolCall("file-22", "file", "Read /tmp/bench_sort.py back and confirm it exists.",
- `benchmarks/tool_call_benchmark.py:238` — [benchmark_suite] ToolCall("file-23", "file", "Search for 'class' in all .py files in the benchmarks directory.",
- `benchmarks/tool_call_benchmark.py:240` — [benchmark_suite] ToolCall("term-21", "terminal", "Run `cat /etc/os-release 2>/dev/null || sw_vers 2>/dev/null` for OS info.",
- `benchmarks/tool_call_benchmark.py:242` — [benchmark_suite] ToolCall("term-22", "terminal", "Run `nproc 2>/dev/null || sysctl -n hw.ncpu 2>/dev/null` for CPU count.",
- `benchmarks/tool_call_benchmark.py:244` — [benchmark_suite] ToolCall("code-16", "code", "Execute Python to flatten a nested list [[1,2],[3,4],[5]].",
- `benchmarks/tool_call_benchmark.py:246` — [benchmark_suite] ToolCall("code-17", "code", "Run Python to check if a number 17 is prime.",
- `benchmarks/tool_call_benchmark.py:248` — [benchmark_suite] ToolCall("deleg-11", "delegate", "Delegate: what is the current working directory?",
- `benchmarks/tool_call_benchmark.py:250` — [benchmark_suite] ToolCall("todo-11", "todo", "Add a todo: 'Finalize benchmark report' status pending.",
- `benchmarks/tool_call_benchmark.py:252` — [benchmark_suite] ToolCall("todo-12", "memory", "Store fact: 'benchmark categories: file, terminal, code, delegate, todo, memory, skills'.",
- `benchmarks/tool_call_benchmark.py:254` — [benchmark_suite] ToolCall("skill-11", "skills", "Search for skills about 'deployment'.",
- `benchmarks/tool_call_benchmark.py:256` — [benchmark_suite] ToolCall("skill-12", "skills", "View the 'gitea-burn-cycle' skill.",
- `benchmarks/tool_call_benchmark.py:258` — [benchmark_suite] ToolCall("skill-13", "skills", "List all available skill categories.",
- `benchmarks/tool_call_benchmark.py:260` — [benchmark_suite] ToolCall("skill-14", "skills", "Search for skills related to 'memory'.",
- `benchmarks/tool_call_benchmark.py:262` — [benchmark_suite] ToolCall("skill-15", "skills", "View the 'mimo-swarm' skill.",
- `benchmarks/tool_call_benchmark.py:311` — [benchmark_suite] """Create prerequisite files for the benchmark."""
- `benchmarks/tool_call_benchmark.py:313` — [benchmark_suite] "This is a benchmark test file.\n"
- `benchmarks/tool_call_benchmark.py:349` — [benchmark_suite] "You are a benchmark test runner. Execute the user's request by calling "
- `benchmarks/tool_call_benchmark.py:406` — [benchmark_suite] """Generate markdown benchmark report."""
- `benchmarks/tool_call_benchmark.py:428` — [benchmark_suite] f"# Tool-Calling Benchmark Report",
- `benchmarks/tool_call_benchmark.py:535` — [benchmark_suite] parser = argparse.ArgumentParser(description="Tool-calling benchmark")
- `benchmarks/tool_call_benchmark.py:544` — [benchmark_suite] help="Output report path (default: benchmarks/gemma4-tool-calling-YYYY-MM-DD.md)")
- `benchmarks/tool_call_benchmark.py:565` — [benchmark_suite] output_path = Path(args.output) if args.output else REPO_ROOT / "benchmarks" / f"gemma4-tool-calling-{date_str}.md"
- `benchmarks/tool_call_benchmark.py:575` — [benchmark_suite] print(f"Benchmark: {len(suite)} tests × {len(model_specs)} models = {len(suite) * len(model_specs)} calls")
## Suggested next slice
1. Build an exporter that emits SessionDB + trajectory data into a TensorZero-friendly offline dataset.
2. Define percentage-based canary controls before attempting any gateway replacement.
3. Keep Hermes routing authoritative until TensorZero proves parity across CLI, gateway, cron, auxiliary, and delegation surfaces.

View File

@@ -523,7 +523,7 @@ DEFAULT_CONFIG = {
# Text-to-speech configuration
"tts": {
"provider": "edge", # "edge" (free) | "elevenlabs" (premium) | "openai" | "minimax" | "mistral" | "neutts" (local) | "kittentts" (local)
"provider": "edge", # "edge" (free) | "elevenlabs" (premium) | "openai" | "minimax" | "mistral" | "neutts" (local)
"edge": {
"voice": "en-US-AriaNeural",
# Popular: AriaNeural, JennyNeural, AndrewNeural, BrianNeural, SoniaNeural
@@ -547,12 +547,6 @@ DEFAULT_CONFIG = {
"model": "neuphonic/neutts-air-q4-gguf", # HuggingFace model repo
"device": "cpu", # cpu, cuda, or mps
},
"kittentts": {
"model": "KittenML/kitten-tts-nano-0.8-int8", # 25MB int8 default
"voice": "Jasper", # Jasper, Bella, Luna, Bruno, Rosie, Hugo, Kiki, Leo
"speed": 1.0,
"clean_text": True,
},
},
"stt": {

View File

@@ -443,16 +443,6 @@ def _print_setup_summary(config: dict, hermes_home):
tool_status.append(("Text-to-Speech (NeuTTS local)", True, None))
else:
tool_status.append(("Text-to-Speech (NeuTTS — not installed)", False, "run 'hermes setup tts'"))
elif tts_provider == "kittentts":
try:
import importlib.util
kittentts_ok = importlib.util.find_spec("kittentts") is not None
except Exception:
kittentts_ok = False
if kittentts_ok:
tool_status.append(("Text-to-Speech (KittenTTS local)", True, None))
else:
tool_status.append(("Text-to-Speech (KittenTTS — not installed)", False, "run 'hermes setup tts'"))
else:
tool_status.append(("Text-to-Speech (Edge TTS)", True, None))
@@ -901,7 +891,6 @@ def _install_neutts_deps() -> bool:
return False
else:
print_warning("espeak-ng is required for NeuTTS. Install it manually before using NeuTTS.")
return False
# Install neutts Python package
print()
@@ -921,34 +910,8 @@ def _install_neutts_deps() -> bool:
return False
def _install_kittentts_deps() -> bool:
"""Install KittenTTS dependencies with user approval. Returns True on success."""
import subprocess
import sys
wheel_url = (
"https://github.com/KittenML/KittenTTS/releases/download/"
"0.8.1/kittentts-0.8.1-py3-none-any.whl"
)
print()
print_info("Installing kittentts Python package (~25-80MB model downloaded on first use)...")
print()
try:
subprocess.run(
[sys.executable, "-m", "pip", "install", "-U", wheel_url, "soundfile", "--quiet"],
check=True, timeout=300,
)
print_success("kittentts installed successfully")
return True
except (subprocess.CalledProcessError, subprocess.TimeoutExpired) as e:
print_error(f"Failed to install kittentts: {e}")
print_info(f"Try manually: python -m pip install -U '{wheel_url}' soundfile")
return False
def _setup_tts_provider(config: dict):
"""Interactive TTS provider selection with install flow for local providers."""
"""Interactive TTS provider selection with install flow for NeuTTS."""
tts_config = config.get("tts", {})
current_provider = tts_config.get("provider", "edge")
subscription_features = get_nous_subscription_features(config)
@@ -960,7 +923,6 @@ def _setup_tts_provider(config: dict):
"minimax": "MiniMax TTS",
"mistral": "Mistral Voxtral TTS",
"neutts": "NeuTTS",
"kittentts": "KittenTTS",
}
current_label = provider_labels.get(current_provider, current_provider)
@@ -982,10 +944,9 @@ def _setup_tts_provider(config: dict):
"MiniMax TTS (high quality with voice cloning, needs API key)",
"Mistral Voxtral TTS (multilingual, native Opus, needs API key)",
"NeuTTS (local on-device, free, ~300MB model download)",
"KittenTTS (local on-device, free, lightweight ~25-80MB ONNX)",
]
)
providers.extend(["edge", "elevenlabs", "openai", "minimax", "mistral", "neutts", "kittentts"])
providers.extend(["edge", "elevenlabs", "openai", "minimax", "mistral", "neutts"])
choices.append(f"Keep current ({current_label})")
keep_current_idx = len(choices) - 1
idx = prompt_choice("Select TTS provider:", choices, keep_current_idx)
@@ -1027,28 +988,6 @@ def _setup_tts_provider(config: dict):
print_info("Skipping install. Set tts.provider to 'neutts' after installing manually.")
selected = "edge"
elif selected == "kittentts":
try:
import importlib.util
already_installed = importlib.util.find_spec("kittentts") is not None
except Exception:
already_installed = False
if already_installed:
print_success("KittenTTS is already installed")
else:
print()
print_info("KittenTTS is lightweight (~25-80MB, CPU-only, no API key required).")
print_info("Voices: Jasper, Bella, Luna, Bruno, Rosie, Hugo, Kiki, Leo")
print()
if prompt_yes_no("Install KittenTTS now?", True):
if not _install_kittentts_deps():
print_warning("KittenTTS installation incomplete. Falling back to Edge TTS.")
selected = "edge"
else:
print_info("Skipping install. Set tts.provider to 'kittentts' after installing manually.")
selected = "edge"
elif selected == "elevenlabs":
existing = get_env_value("ELEVENLABS_API_KEY")
if not existing:

View File

@@ -164,14 +164,6 @@ TOOL_CATEGORIES = {
],
"tts_provider": "mistral",
},
{
"name": "KittenTTS",
"badge": "local · free",
"tag": "Lightweight local ONNX TTS (~25MB), no API key",
"env_vars": [],
"tts_provider": "kittentts",
"post_setup": "kittentts",
},
],
},
"web": {
@@ -411,36 +403,6 @@ def _run_post_setup(post_setup_key: str):
_print_warning(" Node.js not found. Install Camofox via Docker:")
_print_info(" docker run -p 9377:9377 -e CAMOFOX_PORT=9377 jo-inc/camofox-browser")
elif post_setup_key == "kittentts":
try:
__import__("kittentts")
_print_success(" kittentts is already installed")
return
except ImportError:
pass
import subprocess
_print_info(" Installing kittentts (~25-80MB model, CPU-only)...")
wheel_url = (
"https://github.com/KittenML/KittenTTS/releases/download/"
"0.8.1/kittentts-0.8.1-py3-none-any.whl"
)
try:
result = subprocess.run(
[sys.executable, "-m", "pip", "install", "-U", wheel_url, "soundfile", "--quiet"],
capture_output=True, text=True, timeout=300,
)
if result.returncode == 0:
_print_success(" kittentts installed")
_print_info(" Voices: Jasper, Bella, Luna, Bruno, Rosie, Hugo, Kiki, Leo")
_print_info(" Models: KittenML/kitten-tts-nano-0.8-int8 (25MB), micro (41MB), mini (80MB)")
else:
_print_warning(" kittentts install failed:")
_print_info(f" {result.stderr.strip()[:300]}")
_print_info(f" Run manually: python -m pip install -U '{wheel_url}' soundfile")
except subprocess.TimeoutExpired:
_print_warning(" kittentts install timed out (>5min)")
_print_info(f" Run manually: python -m pip install -U '{wheel_url}' soundfile")
elif post_setup_key == "rl_training":
try:
__import__("tinker_atropos")

View File

@@ -0,0 +1,318 @@
#!/usr/bin/env python3
"""Generate a grounded TensorZero evaluation packet for Hermes.
This script inventories the current Hermes routing/evaluation surfaces, then
builds a markdown packet assessing how much of issue #860 can be satisfied by
TensorZero and where the migration risk still lives.
"""
from __future__ import annotations
import argparse
import json
import re
from dataclasses import asdict, dataclass
from pathlib import Path
from typing import Iterable
ISSUE_NUMBER = 860
ISSUE_TITLE = "tensorzero LLMOps platform evaluation"
ISSUE_URL = "https://forge.alexanderwhitestone.com/Timmy_Foundation/hermes-agent/issues/860"
DEFAULT_OUTPUT = Path("docs/evaluations/tensorzero-860-evaluation.md")
DEFAULT_JSON_OUTPUT = Path("docs/evaluations/tensorzero-860-evaluation.json")
@dataclass(frozen=True)
class TouchpointPattern:
label: str
file_path: str
regex: str
description: str
@dataclass(frozen=True)
class Touchpoint:
label: str
file_path: str
line_number: int
matched_text: str
@dataclass(frozen=True)
class RequirementStatus:
key: str
name: str
status: str
evidence_labels: tuple[str, ...]
summary: str
@dataclass(frozen=True)
class EvaluationReport:
issue_number: int
issue_title: str
issue_url: str
recommendation: str
touchpoints: tuple[Touchpoint, ...]
requirements: tuple[RequirementStatus, ...]
PATTERNS: tuple[TouchpointPattern, ...] = (
TouchpointPattern(
label="fallback_chain",
file_path="run_agent.py",
regex=r"_fallback_chain|fallback_providers|fallback_model",
description="Primary agent fallback-provider chain in the core conversation loop.",
),
TouchpointPattern(
label="provider_routing_config",
file_path="cli.py",
regex=r"provider_routing|fallback_providers|smart_model_routing",
description="CLI-owned provider routing and fallback configuration surfaces.",
),
TouchpointPattern(
label="runtime_provider",
file_path="hermes_cli/runtime_provider.py",
regex=r"def resolve_runtime_provider|def resolve_requested_provider",
description="Central runtime provider resolution for CLI, gateway, cron, and helpers.",
),
TouchpointPattern(
label="smart_model_routing",
file_path="agent/smart_model_routing.py",
regex=r"def resolve_turn_route|def choose_cheap_model_route",
description="Cheap-vs-strong turn routing that TensorZero would need to absorb or replace.",
),
TouchpointPattern(
label="gateway_provider_routing",
file_path="gateway/run.py",
regex=r"def _load_provider_routing|def _load_fallback_model|def _load_smart_model_routing",
description="Gateway-specific loading of routing, fallback, and smart-model policies.",
),
TouchpointPattern(
label="cron_runtime_provider",
file_path="cron/scheduler.py",
regex=r"resolve_runtime_provider|resolve_turn_route|provider_routing|fallback_model",
description="Cron execution path that re-resolves providers and routing on every run.",
),
TouchpointPattern(
label="auxiliary_fallback_chain",
file_path="agent/auxiliary_client.py",
regex=r"fallback chain|_get_provider_chain|provider chain",
description="Auxiliary task routing/fallback chain outside the main inference path.",
),
TouchpointPattern(
label="delegate_runtime_provider",
file_path="tools/delegate_tool.py",
regex=r"runtime provider system|resolve the full credential bundle|resolve_runtime_provider",
description="Subagent/delegation routing path that would also need TensorZero parity.",
),
TouchpointPattern(
label="session_db",
file_path="hermes_state.py",
regex=r"class SessionDB",
description="Session persistence surface that could feed TensorZero optimization/eval data.",
),
TouchpointPattern(
label="trajectory_export",
file_path="batch_runner.py",
regex=r"trajectory_entry|save_trajectories|_convert_to_trajectory_format",
description="Trajectory export surface for offline optimization and replay data.",
),
TouchpointPattern(
label="benchmark_suite",
file_path="benchmarks/tool_call_benchmark.py",
regex=r"ToolCall\(|class ToolCall|benchmark",
description="Existing benchmark/evaluation harness that could map to TensorZero experiments.",
),
)
def _iter_matches(pattern: TouchpointPattern, text: str) -> Iterable[Touchpoint]:
regex = re.compile(pattern.regex, re.IGNORECASE)
for line_number, line in enumerate(text.splitlines(), start=1):
if regex.search(line):
yield Touchpoint(
label=pattern.label,
file_path=pattern.file_path,
line_number=line_number,
matched_text=line.strip(),
)
def scan_touchpoints(repo_root: Path) -> list[Touchpoint]:
touchpoints: list[Touchpoint] = []
for pattern in PATTERNS:
path = repo_root / pattern.file_path
if not path.exists():
continue
text = path.read_text(encoding="utf-8")
touchpoints.extend(_iter_matches(pattern, text))
return touchpoints
def build_requirement_matrix(touchpoints: list[Touchpoint]) -> list[RequirementStatus]:
labels = {tp.label for tp in touchpoints}
matrix: list[RequirementStatus] = []
gateway_labels = (
"fallback_chain",
"runtime_provider",
"gateway_provider_routing",
"cron_runtime_provider",
"auxiliary_fallback_chain",
"delegate_runtime_provider",
)
gateway_hits = tuple(label for label in gateway_labels if label in labels)
gateway_status = "partial" if len(gateway_hits) >= 4 else "gap"
gateway_summary = (
"Hermes already spreads provider routing across core agent, runtime provider, gateway, cron, auxiliary, and delegation seams; "
"TensorZero would need parity across all of them before it can replace the gateway layer."
if gateway_hits else
"No grounded routing surfaces were found for a gateway replacement assessment."
)
matrix.append(RequirementStatus("gateway_replacement", "Gateway replacement scope", gateway_status, gateway_hits, gateway_summary))
config_labels = (
"provider_routing_config",
"runtime_provider",
"smart_model_routing",
"fallback_chain",
)
config_hits = tuple(label for label in config_labels if label in labels)
config_status = "partial" if len(config_hits) >= 3 else "gap"
config_summary = (
"Hermes has multiple config concepts to migrate (`provider_routing`, `fallback_providers`, `smart_model_routing`, runtime provider resolution), "
"so TensorZero is not a drop-in config swap."
if config_hits else
"No current config migration surface was found."
)
matrix.append(RequirementStatus("config_migration", "Config migration", config_status, config_hits, config_summary))
canary_hits: tuple[str, ...] = tuple()
canary_summary = (
"The repo shows semantic routing and fallback, but no grounded 10% traffic-split canary mechanism. "
"A TensorZero cutover would need new percentage-based rollout controls and observability hooks."
)
matrix.append(RequirementStatus("canary_rollout", "10% traffic canary", "gap", canary_hits, canary_summary))
session_labels = ("session_db", "trajectory_export")
session_hits = tuple(label for label in session_labels if label in labels)
session_status = "partial" if len(session_hits) == len(session_labels) else "gap"
session_summary = (
"Hermes already has SessionDB and trajectory export surfaces that can feed offline optimization data, "
"but not a TensorZero-native ingestion path yet."
if session_hits else
"No session-data surface was found for prompt optimization."
)
matrix.append(RequirementStatus("session_feedback", "Session data for prompt optimization", session_status, session_hits, session_summary))
eval_labels = ("benchmark_suite", "trajectory_export")
eval_hits = tuple(label for label in eval_labels if label in labels)
eval_status = "partial" if "benchmark_suite" in eval_hits else "gap"
eval_summary = (
"Hermes already has benchmark/trajectory machinery that can seed TensorZero A/B evaluation, "
"but no integrated TensorZero experiment runner or live evaluation gateway."
if eval_hits else
"No evaluation harness was found to support TensorZero A/B testing."
)
matrix.append(RequirementStatus("evaluation_suite", "Evaluation suite / A/B testing", eval_status, eval_hits, eval_summary))
return matrix
def build_report(touchpoints: list[Touchpoint], requirement_matrix: list[RequirementStatus]) -> EvaluationReport:
recommendation = (
"Not ready for direct replacement. Recommend a shadow-evaluation phase first: keep Hermes routing live, "
"inventory the migration seams, export SessionDB/trajectory data into an offline TensorZero experiment loop, "
"and only design a canary gateway once percentage-based rollout controls exist."
)
return EvaluationReport(
issue_number=ISSUE_NUMBER,
issue_title=ISSUE_TITLE,
issue_url=ISSUE_URL,
recommendation=recommendation,
touchpoints=tuple(touchpoints),
requirements=tuple(requirement_matrix),
)
def build_markdown(report: EvaluationReport) -> str:
lines: list[str] = []
lines.append("# TensorZero Evaluation Packet")
lines.append("")
lines.append(f"Issue #{report.issue_number}: [{report.issue_title}]({report.issue_url})")
lines.append("")
lines.append("## Scope")
lines.append("")
lines.append("This packet evaluates TensorZero as a possible replacement for Hermes' custom provider-routing stack.")
lines.append("It is intentionally grounded in the current repo state rather than a speculative cutover plan.")
lines.append("")
lines.append("## Issue requirements being evaluated")
lines.append("")
lines.append("- Deploy tensorzero gateway (Rust binary)")
lines.append("- Migrate provider routing config")
lines.append("- Test with canary (10% traffic) before full cutover")
lines.append("- Feed session data for prompt optimization")
lines.append("- Evaluation suite for A/B testing models")
lines.append("")
lines.append("## Recommendation")
lines.append("")
lines.append(report.recommendation)
lines.append("")
lines.append("## Requirement matrix")
lines.append("")
lines.append("| Requirement | Status | Evidence labels | Summary |")
lines.append("| --- | --- | --- | --- |")
for row in report.requirements:
evidence = ", ".join(row.evidence_labels) if row.evidence_labels else ""
lines.append(f"| {row.name} | {row.status} | {evidence} | {row.summary} |")
lines.append("")
lines.append("## Grounded Hermes touchpoints")
lines.append("")
if report.touchpoints:
for tp in report.touchpoints:
lines.append(f"- `{tp.file_path}:{tp.line_number}` — [{tp.label}] {tp.matched_text}")
else:
lines.append("- No routing/evaluation touchpoints were found.")
lines.append("")
lines.append("## Suggested next slice")
lines.append("")
lines.append("1. Build an exporter that emits SessionDB + trajectory data into a TensorZero-friendly offline dataset.")
lines.append("2. Define percentage-based canary controls before attempting any gateway replacement.")
lines.append("3. Keep Hermes routing authoritative until TensorZero proves parity across CLI, gateway, cron, auxiliary, and delegation surfaces.")
lines.append("")
return "\n".join(lines).rstrip() + "\n"
def write_outputs(report: EvaluationReport, markdown_path: Path, json_path: Path | None = None) -> None:
markdown_path.parent.mkdir(parents=True, exist_ok=True)
markdown_path.write_text(build_markdown(report), encoding="utf-8")
if json_path is not None:
json_path.parent.mkdir(parents=True, exist_ok=True)
json_path.write_text(json.dumps(asdict(report), indent=2), encoding="utf-8")
def parse_args() -> argparse.Namespace:
parser = argparse.ArgumentParser(description="Generate a grounded TensorZero evaluation packet for Hermes")
parser.add_argument("--repo-root", default=".", help="Hermes repo root to scan")
parser.add_argument("--output", default=str(DEFAULT_OUTPUT), help="Markdown output path")
parser.add_argument("--json-output", default=str(DEFAULT_JSON_OUTPUT), help="Optional JSON output path")
return parser.parse_args()
def main() -> int:
args = parse_args()
repo_root = Path(args.repo_root).resolve()
touchpoints = scan_touchpoints(repo_root)
matrix = build_requirement_matrix(touchpoints)
report = build_report(touchpoints, matrix)
json_output = Path(args.json_output) if args.json_output else None
write_outputs(report, Path(args.output), json_output)
print(f"Wrote {args.output}")
if json_output is not None:
print(f"Wrote {json_output}")
return 0
if __name__ == "__main__":
raise SystemExit(main())

View File

@@ -0,0 +1,149 @@
from pathlib import Path
import sys
SCRIPT_DIR = Path(__file__).resolve().parents[1] / "scripts"
sys.path.insert(0, str(SCRIPT_DIR))
import tensorzero_eval_packet as tz
def test_scan_touchpoints_finds_expected_matches(tmp_path):
(tmp_path / "run_agent.py").write_text(
"self._fallback_chain = []\n# Provider fallback chain\n"
)
(tmp_path / "hermes_cli").mkdir()
(tmp_path / "hermes_cli" / "runtime_provider.py").write_text(
"def resolve_runtime_provider():\n return {}\n"
)
(tmp_path / "agent").mkdir()
(tmp_path / "agent" / "smart_model_routing.py").write_text(
"def resolve_turn_route(user_message, routing_config, primary):\n return primary\n"
)
(tmp_path / "gateway").mkdir()
(tmp_path / "gateway" / "run.py").write_text(
"def _load_provider_routing():\n return {}\n"
)
(tmp_path / "cron").mkdir()
(tmp_path / "cron" / "scheduler.py").write_text(
"runtime = resolve_runtime_provider()\nturn_route = resolve_turn_route('x', {}, {})\n"
)
(tmp_path / "hermes_state.py").write_text("class SessionDB:\n pass\n")
(tmp_path / "benchmarks").mkdir()
(tmp_path / "benchmarks" / "tool_call_benchmark.py").write_text(
"class ToolCall: ...\n"
)
touchpoints = tz.scan_touchpoints(tmp_path)
labels = {tp.label for tp in touchpoints}
assert "fallback_chain" in labels
assert "runtime_provider" in labels
assert "smart_model_routing" in labels
assert "gateway_provider_routing" in labels
assert "cron_runtime_provider" in labels
assert "session_db" in labels
assert "benchmark_suite" in labels
def test_build_requirement_matrix_marks_canary_as_gap_without_split_support():
touchpoints = [
tz.Touchpoint(
label="runtime_provider",
file_path="hermes_cli/runtime_provider.py",
line_number=10,
matched_text="def resolve_runtime_provider",
),
tz.Touchpoint(
label="provider_routing_config",
file_path="cli.py",
line_number=20,
matched_text='provider_routing',
),
tz.Touchpoint(
label="fallback_chain",
file_path="run_agent.py",
line_number=21,
matched_text='_fallback_chain = []',
),
tz.Touchpoint(
label="smart_model_routing",
file_path="agent/smart_model_routing.py",
line_number=30,
matched_text='resolve_turn_route',
),
tz.Touchpoint(
label="gateway_provider_routing",
file_path="gateway/run.py",
line_number=35,
matched_text='def _load_provider_routing',
),
tz.Touchpoint(
label="cron_runtime_provider",
file_path="cron/scheduler.py",
line_number=36,
matched_text='runtime = resolve_runtime_provider()',
),
tz.Touchpoint(
label="session_db",
file_path="hermes_state.py",
line_number=40,
matched_text='class SessionDB',
),
tz.Touchpoint(
label="trajectory_export",
file_path="batch_runner.py",
line_number=50,
matched_text='trajectory_entry',
),
tz.Touchpoint(
label="benchmark_suite",
file_path="benchmarks/tool_call_benchmark.py",
line_number=60,
matched_text='ToolCall',
),
]
matrix = tz.build_requirement_matrix(touchpoints)
by_key = {row.key: row for row in matrix}
assert by_key["gateway_replacement"].status == "partial"
assert by_key["config_migration"].status == "partial"
assert by_key["canary_rollout"].status == "gap"
assert by_key["session_feedback"].status == "partial"
assert by_key["evaluation_suite"].status == "partial"
def test_build_markdown_renders_recommendation_and_touchpoints():
touchpoints = [
tz.Touchpoint(
label="runtime_provider",
file_path="hermes_cli/runtime_provider.py",
line_number=10,
matched_text="def resolve_runtime_provider",
),
tz.Touchpoint(
label="session_db",
file_path="hermes_state.py",
line_number=40,
matched_text='class SessionDB',
),
]
matrix = tz.build_requirement_matrix(touchpoints)
report = tz.build_report(touchpoints, matrix)
markdown = tz.build_markdown(report)
assert "# TensorZero Evaluation Packet" in markdown
assert "gateway_replacement" not in markdown # human labels, not raw keys
assert "Gateway replacement scope" in markdown
assert "Not ready for direct replacement" in markdown
assert "hermes_cli/runtime_provider.py:10" in markdown
assert "hermes_state.py:40" in markdown
def test_issue_context_is_embedded_in_report():
report = tz.build_report([], [])
markdown = tz.build_markdown(report)
assert "Issue #860" in markdown
assert "tensorzero" in markdown.lower()
assert "10% traffic" in markdown

View File

@@ -1,236 +0,0 @@
"""Tests for the KittenTTS local provider in tools/tts_tool.py."""
import json
from unittest.mock import MagicMock, patch
import numpy as np
import pytest
@pytest.fixture(autouse=True)
def clean_env(monkeypatch):
for key in ("HERMES_SESSION_PLATFORM",):
monkeypatch.delenv(key, raising=False)
@pytest.fixture(autouse=True)
def clear_kittentts_cache():
"""Reset the module-level model cache between tests."""
from tools import tts_tool as _tt
_tt._kittentts_model_cache.clear()
yield
_tt._kittentts_model_cache.clear()
@pytest.fixture
def mock_kittentts_module():
"""Inject a fake kittentts + soundfile module that return stub objects."""
fake_model = MagicMock()
# 24kHz float32 PCM at ~2s of silence
fake_model.generate.return_value = np.zeros(48000, dtype=np.float32)
fake_cls = MagicMock(return_value=fake_model)
fake_kittentts = MagicMock()
fake_kittentts.KittenTTS = fake_cls
# Stub soundfile — the real package isn't installed in CI venv, and
# _generate_kittentts does `import soundfile as sf` at runtime.
fake_sf = MagicMock()
def _fake_write(path, audio, samplerate):
# Emulate writing a real file so downstream path checks succeed.
import pathlib
pathlib.Path(path).write_bytes(b"RIFF\x00\x00\x00\x00WAVEfmt fake")
fake_sf.write = _fake_write
with patch.dict(
"sys.modules",
{"kittentts": fake_kittentts, "soundfile": fake_sf},
):
yield fake_model, fake_cls
class TestGenerateKittenTts:
def test_successful_wav_generation(self, tmp_path, mock_kittentts_module):
from tools.tts_tool import _generate_kittentts
fake_model, fake_cls = mock_kittentts_module
output_path = str(tmp_path / "test.wav")
result = _generate_kittentts("Hello world", output_path, {})
assert result == output_path
assert (tmp_path / "test.wav").exists()
fake_cls.assert_called_once()
fake_model.generate.assert_called_once()
def test_config_passes_voice_speed_cleantext(self, tmp_path, mock_kittentts_module):
from tools.tts_tool import _generate_kittentts
fake_model, _ = mock_kittentts_module
config = {
"kittentts": {
"model": "KittenML/kitten-tts-mini-0.8",
"voice": "Luna",
"speed": 1.25,
"clean_text": False,
}
}
_generate_kittentts("Hi there", str(tmp_path / "out.wav"), config)
call_kwargs = fake_model.generate.call_args.kwargs
assert call_kwargs["voice"] == "Luna"
assert call_kwargs["speed"] == 1.25
assert call_kwargs["clean_text"] is False
def test_default_model_and_voice(self, tmp_path, mock_kittentts_module):
from tools.tts_tool import (
DEFAULT_KITTENTTS_MODEL,
DEFAULT_KITTENTTS_VOICE,
_generate_kittentts,
)
fake_model, fake_cls = mock_kittentts_module
_generate_kittentts("Hi", str(tmp_path / "out.wav"), {})
fake_cls.assert_called_once_with(DEFAULT_KITTENTTS_MODEL)
assert fake_model.generate.call_args.kwargs["voice"] == DEFAULT_KITTENTTS_VOICE
def test_model_is_cached_across_calls(self, tmp_path, mock_kittentts_module):
from tools.tts_tool import _generate_kittentts
_, fake_cls = mock_kittentts_module
_generate_kittentts("One", str(tmp_path / "a.wav"), {})
_generate_kittentts("Two", str(tmp_path / "b.wav"), {})
# Same model name → class instantiated exactly once
assert fake_cls.call_count == 1
def test_different_models_are_cached_separately(self, tmp_path, mock_kittentts_module):
from tools.tts_tool import _generate_kittentts
_, fake_cls = mock_kittentts_module
_generate_kittentts(
"A",
str(tmp_path / "a.wav"),
{"kittentts": {"model": "KittenML/kitten-tts-nano-0.8-int8"}},
)
_generate_kittentts(
"B",
str(tmp_path / "b.wav"),
{"kittentts": {"model": "KittenML/kitten-tts-mini-0.8"}},
)
assert fake_cls.call_count == 2
def test_non_wav_extension_triggers_ffmpeg_conversion(
self, tmp_path, mock_kittentts_module, monkeypatch
):
"""Non-.wav output path causes WAV → target ffmpeg conversion."""
from tools import tts_tool as _tt
calls = []
def fake_shutil_which(cmd):
return "/usr/bin/ffmpeg" if cmd == "ffmpeg" else None
def fake_run(cmd, check=False, timeout=None, **kw):
calls.append(cmd)
# Emulate ffmpeg writing the output file
import pathlib
out_path = cmd[-1]
pathlib.Path(out_path).write_bytes(b"fake-mp3-data")
return MagicMock(returncode=0)
monkeypatch.setattr(_tt.shutil, "which", fake_shutil_which)
monkeypatch.setattr(_tt.subprocess, "run", fake_run)
output_path = str(tmp_path / "test.mp3")
result = _tt._generate_kittentts("Hi", output_path, {})
assert result == output_path
assert len(calls) == 1
assert calls[0][0] == "/usr/bin/ffmpeg"
def test_missing_kittentts_raises_import_error(self, tmp_path, monkeypatch):
"""When kittentts package is not installed, _import_kittentts raises."""
import sys
monkeypatch.setitem(sys.modules, "kittentts", None)
from tools.tts_tool import _generate_kittentts
with pytest.raises((ImportError, TypeError)):
_generate_kittentts("Hi", str(tmp_path / "out.wav"), {})
class TestCheckKittenttsAvailable:
def test_reports_available_when_package_present(self, monkeypatch):
import importlib.util
from tools.tts_tool import _check_kittentts_available
fake_spec = MagicMock()
monkeypatch.setattr(
importlib.util,
"find_spec",
lambda name: fake_spec if name == "kittentts" else None,
)
assert _check_kittentts_available() is True
def test_reports_unavailable_when_package_missing(self, monkeypatch):
import importlib.util
from tools.tts_tool import _check_kittentts_available
monkeypatch.setattr(importlib.util, "find_spec", lambda name: None)
assert _check_kittentts_available() is False
class TestDispatcherBranch:
def test_kittentts_not_installed_returns_helpful_error(self, monkeypatch, tmp_path):
"""When provider=kittentts but package missing, return JSON error with setup hint."""
import sys
monkeypatch.setitem(sys.modules, "kittentts", None)
monkeypatch.setenv("HERMES_HOME", str(tmp_path))
from tools.tts_tool import text_to_speech_tool
# Write a config telling it to use kittentts
import yaml
(tmp_path / "config.yaml").write_text(
yaml.safe_dump({"tts": {"provider": "kittentts"}})
)
result = json.loads(text_to_speech_tool(text="Hello"))
assert result["success"] is False
assert "kittentts" in result["error"].lower()
assert "hermes setup tts" in result["error"].lower()
def test_non_telegram_explicit_wav_path_is_preserved(
self, monkeypatch, tmp_path, mock_kittentts_module
):
"""Explicit WAV outputs should stay WAV outside Telegram sessions."""
import yaml
from tools import tts_tool as _tt
monkeypatch.setenv("HERMES_HOME", str(tmp_path))
(tmp_path / "config.yaml").write_text(
yaml.safe_dump({"tts": {"provider": "kittentts"}})
)
def fail_convert(_path):
raise AssertionError("_convert_to_opus should not run outside Telegram")
monkeypatch.setattr(_tt, "_convert_to_opus", fail_convert)
result = json.loads(
_tt.text_to_speech_tool(
text="Hello from KittenTTS",
output_path=str(tmp_path / "out.wav"),
)
)
assert result["success"] is True
assert result["file_path"] == str(tmp_path / "out.wav")
assert (tmp_path / "out.wav").exists()

View File

@@ -2,14 +2,13 @@
"""
Text-to-Speech Tool Module
Supports seven TTS providers:
Supports six TTS providers:
- Edge TTS (default, free, no API key): Microsoft Edge neural voices
- ElevenLabs (premium): High-quality voices, needs ELEVENLABS_API_KEY
- OpenAI TTS: Good quality, needs OPENAI_API_KEY
- MiniMax TTS: High-quality with voice cloning, needs MINIMAX_API_KEY
- Mistral (Voxtral TTS): Multilingual, native Opus, needs MISTRAL_API_KEY
- NeuTTS (local, free, no API key): On-device TTS via neutts_cli, needs neutts installed
- KittenTTS (local, free, no API key): Lightweight on-device ONNX TTS via kittentts
Output formats:
- Opus (.ogg) for Telegram voice bubbles (requires ffmpeg for Edge TTS)
@@ -78,12 +77,6 @@ def _import_sounddevice():
return sd
def _import_kittentts():
"""Lazy import KittenTTS. Returns the class or raises ImportError."""
from kittentts import KittenTTS
return KittenTTS
# ===========================================================================
# Defaults
# ===========================================================================
@@ -93,8 +86,6 @@ DEFAULT_ELEVENLABS_VOICE_ID = "pNInz6obpgDQGcFmaJgB" # Adam
DEFAULT_ELEVENLABS_MODEL_ID = "eleven_multilingual_v2"
DEFAULT_ELEVENLABS_STREAMING_MODEL_ID = "eleven_flash_v2_5"
DEFAULT_OPENAI_MODEL = "gpt-4o-mini-tts"
DEFAULT_KITTENTTS_MODEL = "KittenML/kitten-tts-nano-0.8-int8" # 25MB
DEFAULT_KITTENTTS_VOICE = "Jasper"
DEFAULT_OPENAI_VOICE = "alloy"
DEFAULT_OPENAI_BASE_URL = "https://api.openai.com/v1"
DEFAULT_MINIMAX_MODEL = "speech-2.8-hd"
@@ -457,15 +448,6 @@ def _check_neutts_available() -> bool:
return False
def _check_kittentts_available() -> bool:
"""Check if the kittentts engine is importable (installed locally)."""
try:
import importlib.util
return importlib.util.find_spec("kittentts") is not None
except Exception:
return False
def _default_neutts_ref_audio() -> str:
"""Return path to the bundled default voice reference audio."""
return str(Path(__file__).parent / "neutts_samples" / "jo.wav")
@@ -529,51 +511,6 @@ def _generate_neutts(text: str, output_path: str, tts_config: Dict[str, Any]) ->
return output_path
# ===========================================================================
# Provider: KittenTTS (local, lightweight)
# ===========================================================================
# Module-level cache for KittenTTS model instances
_kittentts_model_cache: Dict[str, Any] = {}
def _generate_kittentts(text: str, output_path: str, tts_config: Dict[str, Any]) -> str:
"""Generate speech using the local KittenTTS ONNX model."""
KittenTTS = _import_kittentts()
kt_config = tts_config.get("kittentts", {})
model_name = kt_config.get("model", DEFAULT_KITTENTTS_MODEL)
voice = kt_config.get("voice", DEFAULT_KITTENTTS_VOICE)
speed = kt_config.get("speed", 1.0)
clean_text = kt_config.get("clean_text", True)
global _kittentts_model_cache
if model_name not in _kittentts_model_cache:
logger.info("[KittenTTS] Loading model: %s", model_name)
_kittentts_model_cache[model_name] = KittenTTS(model_name)
model = _kittentts_model_cache[model_name]
audio = model.generate(text, voice=voice, speed=speed, clean_text=clean_text)
import soundfile as sf
wav_path = output_path
if not output_path.endswith(".wav"):
wav_path = output_path.rsplit(".", 1)[0] + ".wav"
sf.write(wav_path, audio, 24000)
if wav_path != output_path:
ffmpeg = shutil.which("ffmpeg")
if ffmpeg:
conv_cmd = [ffmpeg, "-i", wav_path, "-y", "-loglevel", "error", output_path]
subprocess.run(conv_cmd, check=True, timeout=30)
os.remove(wav_path)
else:
os.rename(wav_path, output_path)
return output_path
# ===========================================================================
# Main tool function
# ===========================================================================
@@ -685,19 +622,6 @@ def text_to_speech_tool(
logger.info("Generating speech with NeuTTS (local)...")
_generate_neutts(text, file_str, tts_config)
elif provider == "kittentts":
try:
_import_kittentts()
except ImportError:
return json.dumps({
"success": False,
"error": "KittenTTS provider selected but 'kittentts' package not installed. "
"Run 'hermes setup tts' and choose KittenTTS, or install manually: "
"pip install https://github.com/KittenML/KittenTTS/releases/download/0.8.1/kittentts-0.8.1-py3-none-any.whl"
}, ensure_ascii=False)
logger.info("Generating speech with KittenTTS (local, lightweight)...")
_generate_kittentts(text, file_str, tts_config)
else:
# Default: Edge TTS (free), with NeuTTS as local fallback
edge_available = True
@@ -734,10 +658,10 @@ def text_to_speech_tool(
"error": f"TTS generation produced no output (provider: {provider})"
}, ensure_ascii=False)
# Try Opus conversion for Telegram compatibility only.
# Outside Telegram, preserve the caller's explicit output format.
# Try Opus conversion for Telegram compatibility
# Edge TTS outputs MP3, NeuTTS outputs WAV — both need ffmpeg conversion
voice_compatible = False
if want_opus and provider in ("edge", "neutts", "minimax", "kittentts") and not file_str.endswith(".ogg"):
if provider in ("edge", "neutts", "minimax") and not file_str.endswith(".ogg"):
opus_path = _convert_to_opus(file_str)
if opus_path:
file_str = opus_path
@@ -818,8 +742,6 @@ def check_tts_requirements() -> bool:
pass
if _check_neutts_available():
return True
if _check_kittentts_available():
return True
return False

View File

@@ -10,7 +10,7 @@ Hermes Agent supports both text-to-speech output and voice message transcription
## Text-to-Speech
Convert text to speech with seven providers:
Convert text to speech with six providers:
| Provider | Quality | Cost | API Key |
|----------|---------|------|---------|
@@ -20,7 +20,6 @@ Convert text to speech with seven providers:
| **MiniMax TTS** | Excellent | Paid | `MINIMAX_API_KEY` |
| **Mistral (Voxtral TTS)** | Excellent | Paid | `MISTRAL_API_KEY` |
| **NeuTTS** | Good | Free | None needed |
| **KittenTTS** | Good | Free (local) | None needed |
### Platform Delivery
@@ -36,7 +35,7 @@ Convert text to speech with seven providers:
```yaml
# In ~/.hermes/config.yaml
tts:
provider: "edge" # "edge" | "elevenlabs" | "openai" | "minimax" | "mistral" | "neutts" | "kittentts"
provider: "edge" # "edge" | "elevenlabs" | "openai" | "minimax" | "mistral" | "neutts"
speed: 1.0 # Global speed multiplier (provider-specific settings override this)
edge:
voice: "en-US-AriaNeural" # 322 voices, 74 languages
@@ -63,11 +62,6 @@ tts:
ref_text: ''
model: neuphonic/neutts-air-q4-gguf
device: cpu
kittentts:
model: KittenML/kitten-tts-nano-0.8-int8 # 25MB int8 default; also micro and mini variants
voice: Jasper # Jasper, Bella, Luna, Bruno, Rosie, Hugo, Kiki, Leo
speed: 1.0
clean_text: true
```
**Speed control**: The global `tts.speed` value applies to all providers by default. Each provider can override it with its own `speed` setting (e.g., `tts.openai.speed: 1.5`). Provider-specific speed takes precedence over the global value. Default is `1.0` (normal speed).
@@ -80,7 +74,6 @@ Telegram voice bubbles require Opus/OGG audio format:
- **Edge TTS** (default) outputs MP3 and needs **ffmpeg** to convert:
- **MiniMax TTS** outputs MP3 and needs **ffmpeg** to convert for Telegram voice bubbles
- **NeuTTS** outputs WAV and also needs **ffmpeg** to convert for Telegram voice bubbles
- **KittenTTS** outputs WAV and also needs **ffmpeg** to convert for Telegram voice bubbles
```bash
# Ubuntu/Debian
@@ -93,7 +86,7 @@ brew install ffmpeg
sudo dnf install ffmpeg
```
Without ffmpeg, Edge TTS, MiniMax TTS, NeuTTS, and KittenTTS audio are sent as regular audio files (playable, but shown as a rectangular player instead of a voice bubble).
Without ffmpeg, Edge TTS, MiniMax TTS, and NeuTTS audio are sent as regular audio files (playable, but shown as a rectangular player instead of a voice bubble).
:::tip
If you want voice bubbles without installing ffmpeg, switch to the OpenAI, ElevenLabs, or Mistral provider.