
teknium1 b4fbb6fe10 feat: add YC-Bench long-horizon agent benchmark environment

Adds eval-only benchmark for YC-Bench (collinear-ai/yc-bench), a
deterministic long-horizon benchmark where the agent acts as CEO of an
AI startup over a simulated 1-3 year run.

Key design decisions verified against the official yc-bench repo:
- Uses 'sim init' (NOT 'yc-bench run') to avoid starting a competing
  built-in agent loop
- Correct DB table names: 'companies' and 'sim_events'
- Correct 4 domains: research, inference, data_environment, training
- Penalty values are preset-dependent (not hardcoded in system prompt)
- Sequential evaluation (each run is 100-500 turns)
- Follows TerminalBench2 patterns: KeyboardInterrupt handling,
  cleanup_all_environments(), tqdm logging handler, streaming JSONL

yc-bench added as optional dependency: pip install hermes-agent[yc-bench]

Closes #340

2026-03-06 19:25:56 -08:00


YC-Bench: Long-Horizon Agent Benchmark

YC-Bench by Collinear AI is a deterministic, long-horizon benchmark that tests LLM agents' ability to act as a tech startup CEO. The agent manages a simulated company over 1-3 years, making compounding decisions about resource allocation, cash flow, task management, and prestige specialisation across 4 skill domains.

Unlike TerminalBench2 (which evaluates per-task coding ability with binary pass/fail), YC-Bench measures long-term strategic coherence — whether an agent can maintain consistent strategy, manage compounding consequences, and adapt plans over hundreds of turns.

Setup

# Install yc-bench (optional dependency)
pip install "hermes-agent[yc-bench]"

# Or install from source
git clone https://github.com/collinear-ai/yc-bench
cd yc-bench && pip install -e .

# Verify
yc-bench --help

Running

# From the repo root:
bash environments/benchmarks/yc_bench/run_eval.sh

# Or directly:
python environments/benchmarks/yc_bench/yc_bench_env.py evaluate \
    --config environments/benchmarks/yc_bench/default.yaml

# Override model:
bash environments/benchmarks/yc_bench/run_eval.sh \
    --openai.model_name anthropic/claude-opus-4-20250514

# Quick single-preset test:
bash environments/benchmarks/yc_bench/run_eval.sh \
    --env.presets '["fast_test"]' --env.seeds '[1]'

How It Works

Architecture

HermesAgentLoop (our agent)
  -> terminal tool -> subprocess("yc-bench company status") -> JSON output
  -> terminal tool -> subprocess("yc-bench task accept --task-id X") -> JSON
  -> terminal tool -> subprocess("yc-bench sim resume") -> JSON (advance time)
  -> ... (100-500 turns per run)

The environment initialises the simulation via yc-bench sim init (NOT yc-bench run, which would start yc-bench's own built-in agent loop). Our HermesAgentLoop then drives all interaction through CLI commands.
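The interaction pattern in the diagram, where each agent action is a CLI subprocess whose stdout is JSON, can be sketched as follows. `run_cli` is a hypothetical helper written for illustration, not a function from hermes-agent or yc-bench. It assumes the command prints a single JSON document on stdout and exits non-zero on failure.

```python
import json
import subprocess
import sys


def run_cli(argv, timeout=60):
    """Run one CLI command and parse its stdout as JSON.

    Hypothetical helper: assumes exactly one JSON document on stdout
    and a non-zero exit code on failure.
    """
    proc = subprocess.run(argv, capture_output=True, text=True, timeout=timeout)
    if proc.returncode != 0:
        raise RuntimeError(
            f"{argv[0]} exited with {proc.returncode}: {proc.stderr.strip()}"
        )
    return json.loads(proc.stdout)


# Stand-in command so the sketch runs without yc-bench installed.
demo = run_cli([sys.executable, "-c", "import json; print(json.dumps({'ok': True}))"])

# With yc-bench installed and a simulation initialised, the calls from the
# diagram would look like:
#   run_cli(["yc-bench", "company", "status"])
#   run_cli(["yc-bench", "sim", "resume"])
```

Passing the command as an argument list rather than a shell string avoids quoting problems when a task ID or other argument comes from model output.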

Simulation Mechanics

4 skill domains: research, inference, data_environment, training
Prestige system (1.0-10.0): Gates access to higher-paying tasks
Employee management: Junior/Mid/Senior with domain-specific skill rates
Throughput splitting: effective_rate = base_rate / N active tasks per employee
Financial pressure: Monthly payroll, bankruptcy = game over
Deterministic: SHA256-based RNG — same seed + preset = same world
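The throughput-splitting rule above has a direct consequence for ETA reasoning: every extra concurrent task stretches the finish time of all of that employee's tasks. A minimal sketch, assuming work is measured in abstract units and rates in units per time step (the actual units used by yc-bench are not stated here):

```python
def effective_rate(base_rate, active_tasks):
    """Per-task work rate when one employee splits time across N active tasks."""
    if active_tasks < 1:
        raise ValueError("active_tasks must be at least 1")
    return base_rate / active_tasks


def eta(remaining_work, base_rate, active_tasks):
    """Time steps to finish one task, assuming the task count stays constant."""
    return remaining_work / effective_rate(base_rate, active_tasks)


# Illustrative numbers only: 120 work units at a base rate of 4 units per step.
solo = eta(120, base_rate=4.0, active_tasks=1)   # 30.0 steps
split = eta(120, base_rate=4.0, active_tasks=2)  # 60.0 steps
```

Doubling the number of active tasks doubles each task's ETA, so accepting more work in parallel does not raise total throughput. It only delays every deadline.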

Difficulty Presets

Preset	Employees	Tasks	Focus
tutorial	3	50	Basic loop mechanics
easy	5	100	Throughput awareness
medium	5	150	Prestige climbing + domain specialisation
hard	7	200	Precise ETA reasoning
nightmare	8	300	Sustained perfection under payroll pressure
fast_test	(varies)	(varies)	Quick validation (~50 turns)

The default eval runs three presets (fast_test, medium, hard) × 3 seeds each = 9 runs.
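The run matrix is the Cartesian product of presets and seeds. A short sketch using the default values documented in this README:

```python
from itertools import product

presets = ["fast_test", "medium", "hard"]
seeds = [1, 2, 3]

# Every (preset, seed) pair is one independent run.
runs = list(product(presets, seeds))
for preset, seed in runs:
    print(f"preset={preset} seed={seed}")

print(f"total runs: {len(runs)}")
```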

Scoring

composite = 0.5 × survival + 0.5 × normalised_funds

Survival (binary): Did the company avoid bankruptcy?
Normalised funds (0.0-1.0): Log-scale relative to initial $250K capital
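The composite formula can be sketched as below. The weighted sum follows the formula above, but `normalise_funds` is an assumption: this README only says the value is log-scale relative to the initial $250K, so the anchors chosen here (0.0 at or below initial capital, 1.0 at a hypothetical `max_multiple` of 100×) are illustrative and may differ from the actual implementation.

```python
import math

INITIAL_FUNDS = 250_000  # initial capital stated above


def normalise_funds(final_funds, max_multiple=100.0):
    """Map final funds to [0.0, 1.0] on a log scale.

    ASSUMPTION: 0.0 at or below the initial capital, 1.0 at max_multiple
    times the initial capital. Both anchors are illustrative.
    """
    if final_funds <= INITIAL_FUNDS:
        return 0.0
    ratio = final_funds / INITIAL_FUNDS
    return min(1.0, math.log10(ratio) / math.log10(max_multiple))


def composite(survived, normalised_funds, survival_weight=0.5, funds_weight=0.5):
    """Weighted sum of binary survival and normalised funds."""
    return survival_weight * float(survived) + funds_weight * normalised_funds


# A company that survives and ends with 10x its starting capital
# scores 0.75 under the assumed normalisation.
score = composite(True, normalise_funds(2_500_000))
```

With equal weights, survival alone guarantees a score of at least 0.5, so the funds term mainly separates agents that all avoided bankruptcy.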

Configuration

Key fields in default.yaml:

Field	Default	Description
`presets`	`["fast_test", "medium", "hard"]`	Which presets to evaluate
`seeds`	`[1, 2, 3]`	RNG seeds per preset
`max_agent_turns`	200	Max LLM calls per run
`run_timeout`	3600	Wall-clock timeout per run (seconds)
`survival_weight`	0.5	Weight of survival in composite score
`funds_weight`	0.5	Weight of normalised funds in composite
`horizon_years`	null	Override horizon (null = auto from preset)
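For reference, the fields in the table can be mirrored as a Python dataclass. This is a sketch only: the class name `YCBenchEvalConfig`, the helper methods, and the type of `horizon_years` are assumptions, and the real layout of default.yaml may nest these fields differently.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class YCBenchEvalConfig:  # hypothetical name; defaults copied from the table
    presets: List[str] = field(default_factory=lambda: ["fast_test", "medium", "hard"])
    seeds: List[int] = field(default_factory=lambda: [1, 2, 3])
    max_agent_turns: int = 200
    run_timeout: int = 3600  # seconds
    survival_weight: float = 0.5
    funds_weight: float = 0.5
    horizon_years: Optional[int] = None  # None = take the horizon from the preset

    def total_runs(self) -> int:
        return len(self.presets) * len(self.seeds)

    def worst_case_hours(self) -> float:
        # Runs are evaluated one after another, so per-run timeouts add up.
        return self.total_runs() * self.run_timeout / 3600


cfg = YCBenchEvalConfig()
```

With the defaults, `cfg.total_runs()` is 9 and `cfg.worst_case_hours()` is 9.0, which is the upper bound on wall-clock time if every run hits its timeout.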

Cost & Time Estimates

Full-preset runs take 100-500 LLM turns; fast_test takes about 50. Approximate time and cost per run at typical API rates:

Preset	Turns	Time	Est. Cost
fast_test	~50	5-10 min	$1-5
medium	~200	20-40 min	$5-15
hard	~300	30-60 min	$10-25

Full default eval (9 runs): ~3-6 hours, $50-200 depending on model.
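The totals can be cross-checked by summing the per-run rows of the table over three seeds per preset. This is plain arithmetic on the figures above, not a measurement:

```python
# (minutes_low, minutes_high, usd_low, usd_high) per run, from the table above
per_run = {
    "fast_test": (5, 10, 1, 5),
    "medium": (20, 40, 5, 15),
    "hard": (30, 60, 10, 25),
}
seeds_per_preset = 3

# Sum each column across presets, then multiply by the number of seeds.
totals = [seeds_per_preset * sum(row[i] for row in per_run.values()) for i in range(4)]
min_lo, min_hi, usd_lo, usd_hi = totals

print(f"time: {min_lo / 60:.2f}-{min_hi / 60:.2f} hours")  # time: 2.75-5.50 hours
print(f"cost: ${usd_lo}-${usd_hi}")                        # cost: $48-$135
```

The per-run rows add up to roughly 3-5.5 hours and $48-135, so the quoted upper bound of $200 is wider than the table alone implies.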

References

collinear-ai/yc-bench — Official repository
Collinear AI — Company behind yc-bench
TerminalBench2 — Per-task coding benchmark (complementary)
