feat: add YC-Bench long-horizon agent benchmark environment

Adds eval-only benchmark for YC-Bench (collinear-ai/yc-bench), a deterministic long-horizon benchmark where the agent acts as CEO of an AI startup over a simulated 1-3 year run. Key design decisions verified against the official yc-bench repo: - Uses 'sim init' (NOT 'yc-bench run') to avoid starting a competing built-in agent loop - Correct DB table names: 'companies' and 'sim_events' - Correct 4 domains: research, inference, data_environment, training - Penalty values are preset-dependent (not hardcoded in system prompt) - Sequential evaluation (each run is 100-500 turns) - Follows TerminalBench2 patterns: KeyboardInterrupt handling, cleanup_all_environments(), tqdm logging handler, streaming JSONL yc-bench added as optional dependency: pip install hermes-agent[yc-bench] Closes #340
2026-03-06 19:25:56 -08:00
parent 82d7e9429e
commit b4fbb6fe10
6 changed files with 1040 additions and 0 deletions
--- a/pyproject.toml
+++ b/pyproject.toml
@@ -50,6 +50,7 @@ pty = ["ptyprocess>=0.7.0"]
 honcho = ["honcho-ai>=2.0.1"]
 mcp = ["mcp>=1.2.0"]
 homeassistant = ["aiohttp>=3.9.0"]
+yc-bench = ["yc-bench @ git+https://github.com/collinear-ai/yc-bench.git"]
 all = [
  "hermes-agent[modal]",
  "hermes-agent[daytona]",