This directory contains the integration layer between **hermes-agent's** tool-calling capabilities and the **Atropos** RL training framework. It provides everything needed to run agentic LLMs through multi-turn tool-calling loops, score their output with arbitrary reward functions, and feed results into Atropos for training or evaluation.
3. Append tool results to the conversation and go back to step 1
4. If the response has no tool_calls, the agent is done
Tool calls are executed in a thread pool (`run_in_executor`) so backends that use `asyncio.run()` internally (Modal, Docker) don't deadlock inside Atropos's event loop.
Returns an `AgentResult` containing the full conversation history, turn count, reasoning content per turn, tool errors, and optional ManagedServer state (for Phase 2).
### Tool Context (`tool_context.py`)
`ToolContext` is a per-rollout handle that gives reward/verification functions direct access to **all** hermes-agent tools, scoped to the rollout's `task_id`. The same `task_id` means the terminal/browser session is the SAME one the model used during its rollout -- all state (files, processes, browser tabs) is preserved.
**Problem**: Some hermes-agent tools use `asyncio.run()` internally (e.g., the Modal backend). This crashes when called from inside Atropos's event loop because `asyncio.run()` cannot be nested.
**Solution**: `ModalEnvironment` uses a dedicated `_AsyncWorker` background thread with its own event loop. The calling code sees a sync interface, but internally all async Modal SDK calls happen on the worker thread so they don't conflict with Atropos's loop. This is built directly into `tools/environments/modal.py` — no monkey-patching required.
Client-side parsers that extract structured `tool_calls` from raw model output text. Used in **Phase 2** (VLLM server type) where ManagedServer's `/generate` endpoint returns raw text without tool call parsing.
Each parser is a standalone reimplementation of the corresponding VLLM parser's `extract_tool_calls()` logic. No VLLM dependency -- only standard library (`re`, `json`, `uuid`) and `openai` types.
Available parsers:
-`hermes` -- Hermes/ChatML `<tool_call>` XML format
-`mistral` -- Mistral `[TOOL_CALLS]` format
-`llama3_json` -- Llama 3 JSON tool calling
-`qwen` -- Qwen tool calling format
-`qwen3_coder` -- Qwen3 Coder format
-`deepseek_v3` -- DeepSeek V3 format
-`deepseek_v3_1` -- DeepSeek V3.1 format
-`kimi_k2` -- Kimi K2 format
-`longcat` -- Longcat format
-`glm45` / `glm47` -- GLM model formats
Usage:
```python
from environments.tool_call_parsers import get_parser
A self-contained environment with inline tasks (no external dataset needed) for validating the full stack end-to-end. Each task asks the model to create a file at a known path, and the verifier checks the content matches.
SWE-bench style training environment. The model gets a coding task, uses terminal + file + web tools to solve it, and the reward function runs tests in the same Modal sandbox.
**Eval-only** environment for the Terminal-Bench 2.0 benchmark (89 tasks). Each task gets a pre-built Docker Hub image, a natural language instruction, and a test suite. The agent uses terminal + file tools to solve the task, then the test suite verifies correctness.
Follows the standard Atropos eval pattern (like GPQA, MMLU, etc.):
- Run via `evaluate` subcommand (no `run-api` needed)
-`setup()` loads the dataset, `evaluate()` runs all tasks
-`rollout_and_score_eval()` handles per-task agent loop + test verification