hermes-agent

Author	SHA1	Message	Date
Teknium	475205e30b	fix: restore terminalbench2_env.py from patch-tool redaction corruption (#3801) Commit `ed27b826` introduced patch-tool redaction corruption that: - Replaced max_token_length=16000 with max_token_length=*** - Truncated api_key=os.getenv(...) to api_key=os.get...EY - Truncated tokenizer_name to NousRe...1-8B - Deleted 409 lines including _run_tests(), _eval_with_timeout(), evaluate(), wandb_log(), and the __main__ entry point Restores the file from pre-corruption state (ed27b826^) and re-applies the legitimate changes from subsequent commits: - eval_concurrency config field (from `ed27b826`) - docker_image registration in register_task_env_overrides (from `ed27b826`) - ManagedServer branching for vLLM/SGLang backends (from `13f54596`) Closes #1737, #1740.	2026-03-29 15:33:52 -07:00
dmahan93	366de72a38	add a local vllm instance	2026-03-11 06:52:55 -07:00
dmahan93	13f5459670	fix: use ManagedServer for vLLM in TBLite eval + local_vllm config TBLite eval was bypassing ManagedServer and calling ServerManager directly, which uses /v1/chat/completions — not available on the atropos vllm_api_server (/generate only). Now uses _use_managed_server() to detect vLLM/SGLang backends and route through ManagedServer (Phase 2) with proper tool_parser and /generate endpoint. Falls back to Phase 1 for OpenAI endpoints. Also adds local_vllm.yaml config for running against a local vLLM server with Docker sandboxes.	2026-03-11 06:52:55 -07:00
dmahan93	ed27b826c5	feat: add eval_concurrency limit + Docker local config for TBLite - Add eval_concurrency config field with asyncio.Semaphore - Add local.yaml config using Docker backend (sandboxed, no cloud costs) - Register docker_image alongside modal_image for backend flexibility - Default: 8 parallel tasks for local runs	2026-03-11 06:52:26 -07:00
Blake Johnson	c6df39955c	fix: limit concurrent Modal sandbox creations to avoid deadlocks - Add max_concurrent_tasks config (default 8) with semaphore in TB2 eval - Pass cwd: /app via register_task_env_overrides for TB2 tasks - Add /home/ to host path prefixes as safety net for container backends When all 86 TerminalBench2 tasks fire simultaneously, each creates a Modal sandbox via asyncio.run() inside a thread pool worker. Modal's blocking calls deadlock when too many are created at once. The semaphore ensures max 8 concurrent creations. Co-Authored-By: hermes-agent[bot] <hermes-agent[bot]@users.noreply.github.com>	2026-03-07 14:02:34 -08:00
teknium1	ce28f847ce	fix: update OpenRouter model names for yc-bench config Use anthropic/claude-sonnet-4.6 (OpenRouter format) instead of anthropic/claude-sonnet-4-20250514 (direct API format).	2026-03-06 19:58:56 -08:00
teknium1	b4fbb6fe10	feat: add YC-Bench long-horizon agent benchmark environment Adds eval-only benchmark for YC-Bench (collinear-ai/yc-bench), a deterministic long-horizon benchmark where the agent acts as CEO of an AI startup over a simulated 1-3 year run. Key design decisions verified against the official yc-bench repo: - Uses 'sim init' (NOT 'yc-bench run') to avoid starting a competing built-in agent loop - Correct DB table names: 'companies' and 'sim_events' - Correct 4 domains: research, inference, data_environment, training - Penalty values are preset-dependent (not hardcoded in system prompt) - Sequential evaluation (each run is 100-500 turns) - Follows TerminalBench2 patterns: KeyboardInterrupt handling, cleanup_all_environments(), tqdm logging handler, streaming JSONL yc-bench added as optional dependency: pip install hermes-agent[yc-bench] Closes #340	2026-03-06 19:25:56 -08:00
teknium1	ee7fde6531	feat: add OpenThoughts-TBLite evaluation script Introduced a new evaluation script for the OpenThoughts-TBLite environment, enabling users to run evaluations with customizable options. The script includes logging capabilities and real-time output, enhancing the evaluation process for terminal agents. This addition complements the existing benchmarking tools and improves usability for users.	2026-03-04 12:55:56 +00:00
teknium1	0ea6c34325	feat: add OpenThoughts-TBLite evaluation environment and configuration files Introduced a new evaluation environment for OpenThoughts-TBLite, including the main evaluation script, configuration YAML, and README documentation. This environment provides a faster alternative to Terminal-Bench 2.0, featuring 100 difficulty-calibrated tasks for terminal agents. The setup allows for easy evaluation and configuration, enhancing the benchmarking capabilities for terminal agents.	2026-03-04 11:42:41 +00:00
teknium	1b7bc299f3	Enhance TerminalBench2 environment with task filtering for Modal-incompatible tasks and logging improvements - Updated task filter descriptions for clarity and added a new skip task feature to exclude incompatible tasks. - Introduced a set of Modal-incompatible tasks to prevent execution errors in cloud environments. - Implemented streaming JSONL logging for task results, preserving data even on interruptions. - Refactored task evaluation logic to include skipped task reporting and improved error handling.	2026-02-12 05:36:45 +00:00
teknium	85e629e915	Add cleanup functionality for orphaned sandboxes in TerminalBench2EvalEnv - Implemented a cleanup process to terminate any remaining sandboxes after evaluation, addressing issues with orphaned thread pool workers. - Enhanced logging to inform users about the cleanup process, ensuring better resource management and user awareness.	2026-02-10 23:48:49 +00:00
teknium	ba3fea24f1	Enhance TerminalBench 2 configuration and evaluation handling - Added task_timeout parameter to enforce a maximum wall-clock time for each task, automatically scoring as FAIL if exceeded. - Introduced terminal_timeout and tool_pool_size parameters to improve command execution and concurrency management. - Updated logging to provide detailed task execution times and timeout handling, enhancing overall monitoring. - Removed outdated evaluate_config.yaml file to streamline configuration management.	2026-02-10 22:53:24 +00:00
teknium	ad042fdd68	Update terminalbench_2 configuration for enhanced performance and evaluation - Increased max_token_length from 16000 to 32000 to allow for longer inputs. - Adjusted agent_temperature from 0.6 to 0.8 for more varied responses. - Extended test_timeout from 180 to 600 seconds to accommodate longer evaluations. - Updated data directory path for saving evaluations to ensure proper organization.	2026-02-10 19:48:41 +00:00
teknium	35ad3146a8	Add new environments and enhance tool context functionality - Introduced new environments: Terminal Test Environment and SWE Environment, each with default configurations for testing and software engineering tasks. - Added TerminalBench 2.0 evaluation environment with comprehensive setup for agentic LLMs, including task execution and verification. - Enhanced ToolContext with methods for uploading and downloading files, ensuring binary-safe operations. - Updated documentation across environments to reflect new features and usage instructions. - Refactored existing environment configurations for consistency and clarity.	2026-02-10 19:39:05 +00:00

14 Commits
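
Several commits above describe bounding evaluation concurrency with a semaphore (`c6df39955c`, `ed27b826c5`) and scoring a task as FAIL when it exceeds a wall-clock `task_timeout` (`ba3fea24f1`). The pattern can be sketched roughly as below. This is a minimal illustration, not the code from `terminalbench2_env.py`: `run_task`, `run_limited`, and the result strings are hypothetical names, and the real environment creates sandboxes rather than sleeping.

```python
import asyncio


async def run_task(name, duration):
    # Hypothetical stand-in for one benchmark task; the real eval would
    # create a sandbox and run an agent loop here.
    await asyncio.sleep(duration)
    return f"{name}: PASS"


async def run_limited(tasks, max_concurrent=8, task_timeout=1.0):
    # The semaphore caps how many tasks are in flight at once.
    sem = asyncio.Semaphore(max_concurrent)
    active = 0
    peak = 0

    async def guarded(name, duration):
        nonlocal active, peak
        async with sem:
            active += 1
            peak = max(peak, active)
            try:
                # Enforce a wall-clock limit; a timeout is scored as FAIL.
                return await asyncio.wait_for(
                    run_task(name, duration), timeout=task_timeout
                )
            except asyncio.TimeoutError:
                return f"{name}: FAIL (timeout)"
            finally:
                active -= 1

    # gather() returns results in the same order as the input tasks.
    results = await asyncio.gather(*(guarded(n, d) for n, d in tasks))
    return results, peak


if __name__ == "__main__":
    demo = [(f"t{i}", 0.01) for i in range(20)] + [("slow", 0.5)]
    results, peak = asyncio.run(
        run_limited(demo, max_concurrent=8, task_timeout=0.1)
    )
    print(peak, results[-1])
```

In this sketch all tasks are scheduled at once, but only `max_concurrent` of them can hold the semaphore at any moment, which is the property the commit messages rely on to avoid creating too many sandboxes simultaneously.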