hermes-agent

Author	SHA1	Message	Date
Blake Johnson	c6df39955c	fix: limit concurrent Modal sandbox creations to avoid deadlocks - Add max_concurrent_tasks config (default 8) with semaphore in TB2 eval - Pass cwd: /app via register_task_env_overrides for TB2 tasks - Add /home/ to host path prefixes as safety net for container backends When all 86 TerminalBench2 tasks fire simultaneously, each creates a Modal sandbox via asyncio.run() inside a thread pool worker. Modal's blocking calls deadlock when too many are created at once. The semaphore ensures max 8 concurrent creations. Co-Authored-By: hermes-agent[bot] <hermes-agent[bot]@users.noreply.github.com>	2026-03-07 14:02:34 -08:00
teknium1	ee7fde6531	feat: add OpenThoughts-TBLite evaluation script Introduced a new evaluation script for the OpenThoughts-TBLite environment, enabling users to run evaluations with customizable options. The script includes logging capabilities and real-time output, enhancing the evaluation process for terminal agents. This addition complements the existing benchmarking tools and improves usability for users.	2026-03-04 12:55:56 +00:00
teknium	1b7bc299f3	Enhance TerminalBench2 environment with task filtering for Modal incompatibility, plus logging improvements - Updated task filter descriptions for clarity and added a new skip task feature to exclude incompatible tasks. - Introduced a set of Modal-incompatible tasks to prevent execution errors in cloud environments. - Implemented streaming JSONL logging for task results, preserving data even on interruptions. - Refactored task evaluation logic to include skipped task reporting and improved error handling.	2026-02-12 05:36:45 +00:00
teknium	85e629e915	Add cleanup functionality for orphaned sandboxes in TerminalBench2EvalEnv - Implemented a cleanup process to terminate any remaining sandboxes after evaluation, addressing issues with orphaned thread pool workers. - Enhanced logging to inform users about the cleanup process, ensuring better resource management and user awareness.	2026-02-10 23:48:49 +00:00
teknium	ba3fea24f1	Enhance TerminalBench 2 configuration and evaluation handling - Added task_timeout parameter to enforce a maximum wall-clock time for each task, automatically scoring as FAIL if exceeded. - Introduced terminal_timeout and tool_pool_size parameters to improve command execution and concurrency management. - Updated logging to provide detailed task execution times and timeout handling, enhancing overall monitoring. - Removed outdated evaluate_config.yaml file to streamline configuration management.	2026-02-10 22:53:24 +00:00
teknium	ad042fdd68	Update terminalbench_2 configuration for enhanced performance and evaluation - Increased max_token_length from 16000 to 32000 to allow for longer inputs. - Adjusted agent_temperature from 0.6 to 0.8 for more varied responses. - Extended test_timeout from 180 to 600 seconds to accommodate longer evaluations. - Updated data directory path for saving evaluations to ensure proper organization.	2026-02-10 19:48:41 +00:00
teknium	35ad3146a8	Add new environments and enhance tool context functionality - Introduced new environments: Terminal Test Environment and SWE Environment, each with default configurations for testing and software engineering tasks. - Added TerminalBench 2.0 evaluation environment with comprehensive setup for agentic LLMs, including task execution and verification. - Enhanced ToolContext with methods for uploading and downloading files, ensuring binary-safe operations. - Updated documentation across environments to reflect new features and usage instructions. - Refactored existing environment configurations for consistency and clarity.	2026-02-10 19:39:05 +00:00

7 Commits
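
Commit c6df39955c describes capping concurrent sandbox creations with a semaphore (max_concurrent_tasks, default 8) so that the 86 TerminalBench2 tasks do not all create sandboxes at once. The sketch below illustrates that general pattern only; it is not the repository's code. The names create_sandbox, run_task, and run_all are hypothetical, and the sandbox creation is simulated with a short sleep instead of a real Modal call.

```python
import asyncio

# Mirrors the max_concurrent_tasks default named in the commit message.
MAX_CONCURRENT_TASKS = 8


async def create_sandbox(task_id, state):
    """Hypothetical stand-in for sandbox creation (no real Modal call)."""
    state["active"] += 1
    state["peak"] = max(state["peak"], state["active"])
    await asyncio.sleep(0.01)  # simulate the slow creation step
    state["active"] -= 1
    return task_id


async def run_task(task_id, semaphore, state):
    # Only `limit` coroutines can be inside this block at any moment;
    # the rest wait here until a slot is released.
    async with semaphore:
        return await create_sandbox(task_id, state)


async def run_all(num_tasks=86, limit=MAX_CONCURRENT_TASKS):
    semaphore = asyncio.Semaphore(limit)
    state = {"active": 0, "peak": 0}
    results = await asyncio.gather(
        *(run_task(i, semaphore, state) for i in range(num_tasks))
    )
    return results, state["peak"]


if __name__ == "__main__":
    results, peak = asyncio.run(run_all())
    print(f"completed {len(results)} tasks, peak concurrency {peak}")
```

All tasks are still scheduled up front, but the semaphore bounds how many are in the creation step simultaneously, which is the property the commit relies on to avoid the deadlock.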