- Updated task filter descriptions for clarity and added a new skip task feature to exclude incompatible tasks.
- Introduced a set of modal incompatible tasks to prevent execution errors in cloud environments.
- Implemented streaming JSONL logging for task results, preserving data even on interruptions.
- Refactored task evaluation logic to include skipped task reporting and improved error handling.
- Implemented a cleanup process to terminate any remaining sandboxes after evaluation, addressing issues with orphaned thread pool workers.
- Enhanced logging to inform users about the cleanup process, ensuring better resource management and user awareness.
- Added task_timeout parameter to enforce a maximum wall-clock time for each task, automatically scoring as FAIL if exceeded.
- Introduced terminal_timeout and tool_pool_size parameters to improve command execution and concurrency management.
- Updated logging to provide detailed task execution times and timeout handling, enhancing overall monitoring.
- Removed outdated evaluate_config.yaml file to streamline configuration management.
- Introduced terminal_timeout and terminal_lifetime parameters to control command execution and sandbox inactivity.
- Updated environment variable handling to allow configuration overrides for terminal settings.
- Enhanced logging to provide detailed information about terminal settings during initialization.
- Added tool_pool_size parameter to dynamically resize the thread pool for tool execution, improving concurrency management.
- Increased thread pool size for tool execution from 8 to 128 to improve concurrency and prevent starvation.
- Added a function to resize the tool executor dynamically based on configuration.
- Enhanced logging to track API call durations and tool execution times, including warnings for slow tools.
- Improved overall performance monitoring by logging detailed information for each turn in the agent loop.
- Increased max_token_length from 16000 to 32000 to allow for longer inputs.
- Adjusted agent_temperature from 0.6 to 0.8 for more varied responses.
- Extended test_timeout from 180 to 600 seconds to accommodate longer evaluations.
- Updated data directory path for saving evaluations to ensure proper organization.
- Introduced new environments: Terminal Test Environment and SWE Environment, each with default configurations for testing and software engineering tasks.
- Added TerminalBench 2.0 evaluation environment with comprehensive setup for agentic LLMs, including task execution and verification.
- Enhanced ToolContext with methods for uploading and downloading files, ensuring binary-safe operations.
- Updated documentation across environments to reflect new features and usage instructions.
- Refactored existing environment configurations for consistency and clarity.
- Updated `.gitignore` to exclude `testlogs` directory.
- Refactored `handle_web_function_call` in `model_tools.py` to support running async functions in existing event loops, improving compatibility with Atropos.
- Introduced a thread pool executor in `agent_loop.py` for running synchronous tool calls that internally use `asyncio.run()`, preventing deadlocks.
- Added `ToolError` class to track tool execution errors, enhancing error reporting during agent loops.
- Updated `wandb_log` method in `hermes_base_env.py` to log tool error statistics for better monitoring.
- Implemented patches in `patches.py` to ensure async-safe operation of tools within Atropos's event loop.
- Enhanced `ToolContext` and `terminal_tool.py` to utilize the new async handling, improving overall tool execution reliability.
- Added new environments for reinforcement learning, including `HermesSweEnv` for software engineering tasks and `TerminalTestEnv` for inline testing.
- Introduced `ToolContext` for unrestricted access to tools during reward computation.
- Updated `.gitignore` to exclude `wandb/` directory.
- Enhanced `README.md` with detailed architecture and usage instructions for Atropos environments.
- Added configuration files for SWE and terminal test environments to streamline setup.
- Removed unnecessary compiled Python files from `__pycache__`.