Files
hermes-agent/environments/benchmarks/terminalbench_2/run_eval.sh
teknium 35ad3146a8 Add new environments and enhance tool context functionality
- Introduced new environments: Terminal Test Environment and SWE Environment, each with default configurations for testing and software engineering tasks.
- Added TerminalBench 2.0 evaluation environment with comprehensive setup for agentic LLMs, including task execution and verification.
- Enhanced ToolContext with methods for uploading and downloading files, ensuring binary-safe operations.
- Updated documentation across environments to reflect new features and usage instructions.
- Refactored existing environment configurations for consistency and clarity.
2026-02-10 19:39:05 +00:00

33 lines
838 B
Bash
Executable File

#!/bin/bash
# Terminal-Bench 2.0 Evaluation
#
# Run from repo root:
# bash environments/benchmarks/terminalbench_2/run_eval.sh
#
# Override model:
# bash environments/benchmarks/terminalbench_2/run_eval.sh \
# --openai.model_name anthropic/claude-sonnet-4
#
# Run a subset:
# bash environments/benchmarks/terminalbench_2/run_eval.sh \
# --env.task_filter fix-git,git-multibranch
mkdir -p logs evals/terminal-bench-2
LOG_FILE="logs/terminalbench2_$(date +%Y%m%d_%H%M%S).log"
echo "Terminal-Bench 2.0 Evaluation"
echo "Log: $LOG_FILE"
echo ""
export TERMINAL_ENV=modal
export TERMINAL_TIMEOUT=300
python environments/benchmarks/terminalbench_2/terminalbench2_env.py evaluate \
--config environments/benchmarks/terminalbench_2/default.yaml \
"$@" \
2>&1 | tee "$LOG_FILE"
echo ""
echo "Log saved to: $LOG_FILE"