environments/benchmarks/terminalbench_2/default.yaml

# Terminal-Bench 2.0 Evaluation -- Default Configuration
#
# Eval-only environment for the TB2 benchmark (89 terminal tasks).
# Uses Modal terminal backend for per-task cloud-isolated sandboxes
# and OpenRouter for inference.
#
# Usage:
#   python environments/benchmarks/terminalbench_2/terminalbench2_env.py evaluate \
#       --config environments/benchmarks/terminalbench_2/default.yaml
#
#   # Override model:
#   python environments/benchmarks/terminalbench_2/terminalbench2_env.py evaluate \
#       --config environments/benchmarks/terminalbench_2/default.yaml \
#       --openai.model_name anthropic/claude-sonnet-4
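#
#   # Lower sandbox concurrency for a run. ASSUMPTION: env fields accept the
#   # same dotted-flag override style as --openai.model_name above; the
#   # --env.max_concurrent_tasks flag is illustrative and not confirmed by
#   # this file, so check the CLI help before relying on it.
#   python environments/benchmarks/terminalbench_2/terminalbench2_env.py evaluate \
#       --config environments/benchmarks/terminalbench_2/default.yaml \
#       --env.max_concurrent_tasks 4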

env:
  enabled_toolsets: ["terminal", "file"]
  max_agent_turns: 60
  max_token_length: 32000
  agent_temperature: 0.8
  terminal_backend: "modal"
  terminal_timeout: 300        # 5 min per command (builds, pip install)
  tool_pool_size: 128          # thread pool for 89 parallel tasks
  dataset_name: "NousResearch/terminal-bench-2"
  test_timeout: 600            # 10 min for the test step
  task_timeout: 1800           # 30 min wall-clock per task, auto-FAIL if exceeded
  tokenizer_name: "NousResearch/Hermes-3-Llama-3.1-8B"
  use_wandb: true
  wandb_name: "terminal-bench-2"
  ensure_scores_are_not_same: false
  data_dir_to_save_evals: "environments/benchmarks/evals/terminal-bench-2"
  # CRITICAL: Limit concurrent Modal sandbox creations to avoid deadlocks.
  # Modal's blocking calls (App.lookup, etc.) deadlock when too many sandboxes
  # are created simultaneously inside thread pool workers via asyncio.run().
  max_concurrent_tasks: 8

openai:
  base_url: "https://openrouter.ai/api/v1"
  model_name: "anthropic/claude-opus-4.6"
  server_type: "openai"
  health_check: false
  # api_key loaded from OPENROUTER_API_KEY in .env