diff --git a/skills/mlops/training/hermes-atropos-environments/SKILL.md b/skills/mlops/training/hermes-atropos-environments/SKILL.md new file mode 100644 index 000000000..9dff46687 --- /dev/null +++ b/skills/mlops/training/hermes-atropos-environments/SKILL.md @@ -0,0 +1,302 @@ +--- +name: hermes-atropos-environments +description: Build, test, and debug Hermes Agent RL environments for Atropos training. Covers the HermesAgentBaseEnv interface, reward functions, agent loop integration, evaluation with tools, wandb logging, and the three CLI modes (serve/process/evaluate). Use when creating, reviewing, or fixing RL environments in the hermes-agent repo. +version: 1.1.0 +author: Hermes Agent +license: MIT +metadata: + hermes: + tags: [atropos, rl, environments, training, reinforcement-learning, reward-functions] + related_skills: [axolotl, grpo-rl-training, trl-fine-tuning, lm-evaluation-harness] +--- + +# Hermes Agent Atropos Environments + +Guide for building RL environments in the hermes-agent repo that integrate with the Atropos training framework. + +## Architecture Overview + +``` +Atropos BaseEnv (atroposlib/envs/base.py) + └── HermesAgentBaseEnv (environments/hermes_base_env.py) + ├── Handles agent loop orchestration + ├── Handles tool resolution per group + ├── Handles ToolContext for reward verification + └── YOUR ENVIRONMENT (environments/your_env.py) + Only implements: setup, get_next_item, format_prompt, + compute_reward, evaluate, wandb_log +``` + +Hermes environments are special because they run a **multi-turn agent loop with tool calling** — not just single-turn completions. The base env handles the loop; you implement the task and scoring. 
+ +## File Locations + +| File | Purpose | +|------|---------| +| `environments/hermes_base_env.py` | Base class with agent loop + tool resolution | +| `environments/agent_loop.py` | `HermesAgentLoop` + `AgentResult` dataclass | +| `environments/tool_context.py` | `ToolContext` for reward verification | +| `environments/tool_call_parsers.py` | Phase 2 tool call parsers (hermes, mistral, etc.) | +| `environments/your_env.py` | Your environment implementation | + +## Inference Setup — Ask the User First + +**IMPORTANT:** Before running any test, evaluation, or data generation command, always ask the user how they want to handle inference. Do NOT assume OpenRouter or any specific endpoint. Present these options: + +1. **OpenRouter** — Ask which model they want to use (e.g., `anthropic/claude-sonnet-4.5`, `google/gemini-2.5-pro`, `meta-llama/llama-3.3-70b-instruct`, etc.). Requires `OPENROUTER_API_KEY` in environment. +2. **Self-hosted VLLM endpoint** — Ask for their base URL (e.g., `http://localhost:8000/v1`) and model name. Set `--openai.server_type vllm`. +3. **Other OpenAI-compatible API** — Ask for the base URL, model name, and any required API key. Set `--openai.server_type openai` and `--openai.health_check false`. +4. **Local Atropos training server** — For `serve` mode with a live training loop. Default `http://localhost:8000/v1`. + +Once the user tells you their setup, use those values in all CLI commands for that session. Example prompts: + +> "Before I run this, how would you like to handle inference? +> 1. OpenRouter (I'll need your preferred model, e.g. claude-sonnet-4.5) +> 2. A self-hosted VLLM endpoint (give me the URL and model name) +> 3. Another OpenAI-compatible API (give me the URL, model, and any auth details) +> 4. 
Local Atropos training server (serve mode)" + +### Key flags by provider: + +| Provider | `--openai.server_type` | `--openai.health_check` | `--openai.api_key` | +|----------|----------------------|------------------------|-------------------| +| OpenRouter | `openai` | `false` | `$OPENROUTER_API_KEY` | +| VLLM (self-hosted) | `vllm` | (default) | (not needed) | +| Other OpenAI-compatible | `openai` | `false` | As needed | +| Local Atropos | (default) | (default) | (not needed) | + +## Required Methods + +### 1. `setup()` — Load dataset and initialize state + +```python +async def setup(self) -> None: + """Called once at startup. Load datasets, initialize state.""" + # Try HuggingFace first, fallback to built-in samples + try: + from datasets import load_dataset + ds = load_dataset("your/dataset", split="test") + self._items = [...] + except Exception: + self._items = BUILTIN_SAMPLES + + # Always split into train/eval + random.shuffle(self._items) + eval_size = max(20, int(len(self._items) * 0.1)) + self._eval_items = self._items[:eval_size] + self._items = self._items[eval_size:] +``` + +### 2. `get_next_item()` — Return next training item + +```python +async def get_next_item(self) -> dict: + """Return next item, cycling through dataset.""" + item = self._items[self._index % len(self._items)] + self._index += 1 + return item +``` + +### 3. `format_prompt(item)` — Convert item to user message + +```python +def format_prompt(self, item: dict) -> str: + """Convert a dataset item into the user-facing prompt.""" + return f"Research this question: {item['question']}" +``` + +### 4. `compute_reward(item, result, ctx)` — Score the rollout + +**CRITICAL**: `result` is an `AgentResult`, NOT a dict. 
It has these attributes: +- `result.messages` — List of message dicts (OpenAI format) +- `result.turns_used` — Number of LLM calls made +- `result.finished_naturally` — True if model stopped voluntarily +- `result.tool_errors` — List of ToolError objects + +**AgentResult does NOT have**: `final_response`, `tool_calls`, `tools_used`. +You must extract these from `result.messages`: + +```python +async def compute_reward(self, item, result: AgentResult, ctx: ToolContext) -> float: + # Extract final response (last assistant message with content) + final_response = "" + tools_used = [] + for msg in reversed(result.messages): + if msg.get("role") == "assistant" and msg.get("content") and not final_response: + final_response = msg["content"] + if msg.get("role") == "assistant" and msg.get("tool_calls"): + for tc in msg["tool_calls"]: + fn = tc.get("function", {}) if isinstance(tc, dict) else {} + name = fn.get("name", "") + if name: + tools_used.append(name) + + # Score using LLM judge, heuristic, or ToolContext verification + correctness = await self._llm_judge(item, final_response) + return correctness +``` + +`ctx` (ToolContext) gives you terminal/file access to the agent's sandbox for verification: +```python +# Run tests in the agent's sandbox +result = ctx.terminal("pytest /workspace/test.py") +return 1.0 if result["exit_code"] == 0 else 0.0 +``` + +### 5. `evaluate()` — Periodic evaluation with full agent loop + +**MUST use the full agent loop with tools**, not single-turn chat_completion. 
+The whole point of hermes-agent environments is agentic evaluation: + +```python +async def evaluate(self, *args, **kwargs) -> None: + import time, uuid + from environments.agent_loop import HermesAgentLoop + from environments.tool_context import ToolContext + + start_time = time.time() + tools, valid_names = self._resolve_tools_for_group() + samples = [] + + for item in self._eval_items[:self.config.eval_size]: + task_id = str(uuid.uuid4()) + messages = [] + if self.config.system_prompt: + messages.append({"role": "system", "content": self.config.system_prompt}) + messages.append({"role": "user", "content": self.format_prompt(item)}) + + agent = HermesAgentLoop( + server=self.server, + tool_schemas=tools, + valid_tool_names=valid_names, + max_turns=self.config.max_agent_turns, + task_id=task_id, + temperature=0.0, # Deterministic for eval + max_tokens=self.config.max_token_length, + extra_body=self.config.extra_body, + ) + result = await agent.run(messages) + + ctx = ToolContext(task_id) + try: + reward = await self.compute_reward(item, result, ctx) + finally: + ctx.cleanup() + + samples.append({"prompt": ..., "response": ..., "reward": reward}) + + eval_metrics = {"eval/mean_reward": ...} + await self.evaluate_log(metrics=eval_metrics, samples=samples, + start_time=start_time, end_time=time.time()) +``` + +### 6. `wandb_log()` — Custom metrics logging + +Always call `super().wandb_log()` at the end: + +```python +async def wandb_log(self, wandb_metrics=None): + if wandb_metrics is None: + wandb_metrics = {} + if self._reward_buffer: + n = len(self._reward_buffer) + wandb_metrics["train/mean_reward"] = sum(self._reward_buffer) / n + self._reward_buffer.clear() + await super().wandb_log(wandb_metrics) # MUST call super +``` + +**Pitfall**: `compute_reward` appends to metric buffers. During eval, this pollutes training metrics. Roll back buffer entries added during eval. + +## Config Class + +Always create a custom config subclass with Pydantic Field descriptors. 
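For illustration, a minimal sketch of such a config subclass. This assumes the inherited config is a Pydantic model; it subclasses `BaseModel` here so the snippet is self-contained, and every field name beyond the inherited ones is made up:

```python
from pydantic import BaseModel, Field


# Hypothetical sketch: in the real environment you would subclass the
# config class inherited from HermesAgentBaseEnv, not BaseModel.
class MyEnvConfig(BaseModel):
    dataset_name: str = Field(
        default="your/dataset",
        description="HuggingFace dataset loaded in setup() (illustrative).",
    )
    judge_model: str = Field(
        default="some-judge-model",
        description="Model used by an LLM-judge reward (illustrative).",
    )
```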
Key inherited fields you can tune: `enabled_toolsets`, `max_agent_turns`, `agent_temperature`, `system_prompt`, `terminal_backend`, `group_size`, `steps_per_eval`, `total_steps`.
+
+## config_init() — Default Configuration
+
+A classmethod returning `(YourEnvConfig, [APIServerConfig(...)])`. Set server_type to "openai" for OpenRouter/external APIs. Load the API key from an environment variable.
+
+## Three CLI Modes
+
+Replace `<BASE_URL>`, `<MODEL>`, and `<SERVER_TYPE>` with the values from the user's chosen inference setup (see "Inference Setup" above):
+
+```bash
+# SERVE — Full training loop (connects to Atropos API server)
+python environments/my_env.py serve --openai.base_url http://localhost:8000/v1
+
+# PROCESS — Offline data generation (saves JSONL)
+python environments/my_env.py process --env.total_steps 10 --env.group_size 1 \
+    --env.use_wandb false --env.data_path_to_save_groups output.jsonl \
+    --openai.base_url "<BASE_URL>" \
+    --openai.model_name "<MODEL>" \
+    --openai.server_type <SERVER_TYPE> --openai.health_check false
+
+# EVALUATE — Standalone eval (runs setup + evaluate only)
+python environments/my_env.py evaluate --env.eval_size 20 \
+    --env.data_dir_to_save_evals /tmp/eval_results \
+    --openai.base_url "<BASE_URL>" \
+    --openai.model_name "<MODEL>" \
+    --openai.server_type <SERVER_TYPE> --openai.health_check false
+```
+
+Config priority: CLI args > YAML file > config_init() defaults.
+
+## Common Pitfalls
+
+1. **AgentResult has .messages, not .final_response** — Extract the final response by iterating reversed(result.messages) looking for the last assistant message with content.
+
+2. **evaluate() must use HermesAgentLoop, not chat_completion** — Single-turn chat_completion has no tools. The whole point of hermes-agent benchmarks is agentic evaluation with tool use.
+
+3. **Don't call _llm_judge twice** — If compute_reward already calls it, extract the score from the buffer instead of calling the judge separately in evaluate().
+
+4. **Eval pollutes training buffers** — compute_reward appends to metric buffers. During eval, roll back buffer entries to keep training metrics clean.
+
+5. 
**Always set health_check=false for OpenRouter** — OpenRouter has no /health endpoint. + +6. **Set data_dir_to_save_evals in evaluate mode** — Without it, results aren't saved. + +7. **default_toolsets class variable vs enabled_toolsets config** — The class variable is a hint; the config field is what actually controls tool resolution. + +8. **Tool call parsing in messages** — Tool calls are dicts with `{"function": {"name": ..., "arguments": ...}}`. Always check `isinstance(tc, dict)`. + +9. **ToolContext.cleanup()** — Always call in a finally block to release sandbox resources. + +10. **server_type must be "openai" for external APIs** — Without it, Atropos assumes a local VLLM server. + +11. **Always ask the user for their inference setup** — Never hardcode or assume a specific provider/model. See the "Inference Setup" section above. + +## Reward Function Patterns + +### LLM Judge (for open-ended tasks) +Use `self.server.chat_completion()` with a scoring prompt. Parse JSON response for score float. Always include a heuristic fallback (keyword overlap) for when the judge call fails. + +### Binary Verification (for code/terminal tasks) +Use `ctx.terminal("pytest test.py -q")` to run tests in the agent's sandbox. Return 1.0 for pass, 0.0 for fail. + +### Multi-Signal (combine multiple indicators) +Weight correctness (0.6) + tool usage (0.2) + efficiency (0.2) + optional bonuses. Clamp to [0, 1]. + +## Testing Your Environment + +1. **Import test**: `python -c "from environments.my_env import MyEnv; print('OK')"` +2. **Ask the user for inference setup** (see "Inference Setup" section above) +3. **Process mode** (1 item): Verify JSONL output has valid tokens, masks, scores +4. **Evaluate mode**: Verify full agent loop runs with tools, metrics logged correctly +5. 
**Check reward range**: Scores should be in [0, 1], not all identical + +## Minimum Implementation Checklist + +```python +class MyEnv(HermesAgentBaseEnv): + name = "my-env" + env_config_cls = MyEnvConfig + + @classmethod + def config_init(cls): ... # Default server + env config + async def setup(self): ... # Load dataset + train/eval split + async def get_next_item(self): ... # Cycle through training items + def format_prompt(self, item): ... # Item → user message string + async def compute_reward(self, item, result, ctx): ... # Score rollout + async def evaluate(self, *args, **kwargs): ... # Full agent loop eval + async def wandb_log(self, metrics=None): ... # Custom metrics + super() + +if __name__ == "__main__": + MyEnv.cli() +``` diff --git a/skills/mlops/training/hermes-atropos-environments/references/agentresult-fields.md b/skills/mlops/training/hermes-atropos-environments/references/agentresult-fields.md new file mode 100644 index 000000000..bc6d60505 --- /dev/null +++ b/skills/mlops/training/hermes-atropos-environments/references/agentresult-fields.md @@ -0,0 +1,59 @@ +# AgentResult Fields Reference + +`AgentResult` is defined in `environments/agent_loop.py` as a dataclass. 
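A self-contained sketch of the shape this dataclass implies. Types are simplified and the defaults shown are assumptions; the authoritative definition lives in `environments/agent_loop.py`:

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class ToolError:
    turn: int          # turn in which the error occurred
    tool_name: str     # tool that failed
    arguments: str     # arguments passed to the tool
    error: str         # error message
    tool_result: str   # result returned to the model


@dataclass
class AgentResult:
    messages: List[Dict[str, Any]] = field(default_factory=list)
    managed_state: Optional[Dict] = None
    turns_used: int = 0
    finished_naturally: bool = False
    reasoning_per_turn: List[Optional[str]] = field(default_factory=list)
    tool_errors: List[ToolError] = field(default_factory=list)
```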
+## Fields
+
+| Field | Type | Description |
+|-------|------|-------------|
+| `messages` | `List[Dict[str, Any]]` | Full conversation history in OpenAI message format |
+| `managed_state` | `Optional[Dict]` | ManagedServer.get_state() if Phase 2, else None |
+| `turns_used` | `int` | Number of LLM calls made during the loop |
+| `finished_naturally` | `bool` | True if the model stopped calling tools on its own |
+| `reasoning_per_turn` | `List[Optional[str]]` | Extracted reasoning content per turn |
+| `tool_errors` | `List[ToolError]` | Tool errors encountered during the loop |
+
+## ToolError Fields
+
+| Field | Type | Description |
+|-------|------|-------------|
+| `turn` | `int` | The turn in which the error occurred |
+| `tool_name` | `str` | Name of the tool that failed |
+| `arguments` | `str` | Arguments passed to the tool |
+| `error` | `str` | Error message |
+| `tool_result` | `str` | The result returned to the model |
+
+## Extracting Data from Messages
+
+Messages follow OpenAI format. 
Common patterns: + +```python +# Get final assistant response +for msg in reversed(result.messages): + if msg.get("role") == "assistant" and msg.get("content"): + final_response = msg["content"] + break + +# Get all tool names used +tools = [] +for msg in result.messages: + if msg.get("role") == "assistant" and msg.get("tool_calls"): + for tc in msg["tool_calls"]: + fn = tc.get("function", {}) if isinstance(tc, dict) else {} + tools.append(fn.get("name", "")) + +# Get tool results +for msg in result.messages: + if msg.get("role") == "tool": + tool_output = msg.get("content", "") + call_id = msg.get("tool_call_id", "") +``` + +## Fields that DO NOT EXIST + +These are common mistakes — AgentResult does NOT have: +- `final_response` — extract from messages +- `tool_calls` — extract from messages +- `tools_used` — extract from messages +- `output` — extract from messages +- `response` — extract from messages diff --git a/skills/mlops/training/hermes-atropos-environments/references/atropos-base-env.md b/skills/mlops/training/hermes-atropos-environments/references/atropos-base-env.md new file mode 100644 index 000000000..e76895905 --- /dev/null +++ b/skills/mlops/training/hermes-atropos-environments/references/atropos-base-env.md @@ -0,0 +1,65 @@ +# Atropos BaseEnv Reference + +Source: `atroposlib/envs/base.py` (~2124 lines) + +## Abstract Methods (MUST implement) + +| Method | Signature | Description | +|--------|-----------|-------------| +| `get_next_item()` | `async def get_next_item(self) -> Item` | Return next item for trajectory. Return None to pause. | +| `evaluate()` | `async def evaluate(self, *args, **kwargs)` | Called every steps_per_eval steps. | +| `setup()` | `async def setup(self)` | Called once at start. Load datasets, init models. | +| `collect_trajectory()` | `async def collect_trajectory(self, item) -> Tuple[Optional[ScoredDataItem], List[Item]]` | Single rollout. Or override collect_trajectories instead. 
| + +## Overridable Methods + +| Method | Default Behavior | Override When | +|--------|-----------------|---------------| +| `collect_trajectories()` | Runs collect_trajectory group_size times in parallel | Batch generation, MCTS, coupled rollouts | +| `wandb_log()` | Logs completion lengths, rollout table, perf stats | Add custom metrics (always call super) | +| `config_init()` | Returns (env_config_cls(), ServerBaseline()) | Custom defaults + server configs | +| `postprocess_histories()` | Passthrough | Final processing before sending to trainer | +| `save_checkpoint()` | Saves JSON to checkpoint_dir | Custom serialization | +| `cleanup()` | No-op | Release resources after each rollout | + +## ScoredDataGroup Structure + +```python +ScoredDataGroup = TypedDict with: + tokens: List[List[int]] # Token IDs per rollout + masks: List[List[int]] # -100=prompt, token_id=completion + scores: List[float] # Score per rollout + advantages: Optional[...] # Per-token advantages + ref_logprobs: Optional[...] # Reference model logprobs + messages: Optional[...] # OpenAI-format messages + inference_logprobs: Optional[...] 
# Inference logprobs
+```
+
+## BaseEnvConfig Key Fields
+
+| Field | Default | Description |
+|-------|---------|-------------|
+| `group_size` | 4 | Responses grouped for scoring |
+| `steps_per_eval` | 100 | Steps between evaluations |
+| `max_token_length` | 2048 | Max token length for generations |
+| `total_steps` | 1000 | Total training steps |
+| `use_wandb` | True | Enable wandb logging |
+| `tokenizer_name` | DeepHermes-3 | Tokenizer for token encoding |
+| `ensure_scores_are_not_same` | True | Skip groups with identical scores |
+| `worker_timeout` | 600 | Task timeout in seconds |
+
+## Data Flow
+
+```
+env_manager() → add_train_workers() → handle_env()
+    → collect_trajectories() → postprocess_histories()
+    → handle_send_to_api() → training server
+```
+
+## Atropos Environment Statistics (82 environments analyzed)
+
+- 95% implement setup, collect_trajectories, evaluate, get_next_item
+- 76% override wandb_log
+- 54% have a custom config class
+- Most use collect_trajectories (plural), not collect_trajectory (singular)
+- Common reward patterns: LLM-judge (~40), regex-extract (~35), code-exec (~12)
diff --git a/skills/mlops/training/hermes-atropos-environments/references/usage-patterns.md b/skills/mlops/training/hermes-atropos-environments/references/usage-patterns.md
new file mode 100644
index 000000000..57e4b912e
--- /dev/null
+++ b/skills/mlops/training/hermes-atropos-environments/references/usage-patterns.md
@@ -0,0 +1,199 @@
+# Usage Patterns — Testing Environments and Evaluating Models
+
+## Pattern 1: Test Your Environment Works (process mode)
+
+Use `process` mode to verify your environment runs end-to-end before
+committing. This generates trajectories without needing an Atropos
+training server.
+
+**Before running:** Ask the user for their inference setup (see SKILL.md "Inference Setup" section). Replace `<BASE_URL>`, `<MODEL>`, and `<SERVER_TYPE>` below with their chosen values.
+### Step 1: Run 1 trajectory
+
+```bash
+cd ~/.hermes/hermes-agent
+source .venv/bin/activate
+
+python environments/your_env.py process \
+    --env.total_steps 1 \
+    --env.group_size 1 \
+    --env.use_wandb false \
+    --env.data_path_to_save_groups /tmp/test_output.jsonl \
+    --openai.base_url "<BASE_URL>" \
+    --openai.model_name "<MODEL>" \
+    --openai.server_type <SERVER_TYPE> \
+    --openai.health_check false
+```
+
+### Step 2: Verify the output
+
+```python
+import json
+
+for line in open("/tmp/test_output.jsonl"):
+    data = json.loads(line)
+    print(f"Scores: {data.get('scores', [])}")
+    print(f"Token sequences: {len(data.get('tokens', []))}")
+    # Check messages include tool calls
+    for msg_list in data.get("messages", []):
+        roles = [m.get("role") for m in msg_list]
+        print(f"Roles: {roles}")
+        for m in reversed(msg_list):
+            if m.get("role") == "assistant" and m.get("content"):
+                print(f"Response: {m['content'][:200]}...")
+                break
+```
+
+### What to check:
+- **Scores are not all 0.0** — if they are, compute_reward is broken
+- **Scores are in [0, 1]** — not negative, not >1
+- **Messages include "tool" role entries** — the agent used tools
+- **Token sequences are non-empty**
+- **An HTML visualization is generated** next to the .jsonl
+
+### Common failures:
+- `'AgentResult' object has no attribute 'X'` — accessing a field that doesn't exist. See agentresult-fields.md.
+- Score always 0.0 — reward function erroring silently
+- Score always 1.0 — verification too lenient or not running
+
+
+## Pattern 2: Evaluate a Model (evaluate mode)
+
+Use `evaluate` mode to benchmark a model on your environment's eval
+split. This runs the full agent loop with tools for each eval item. 
+### Step 1: Run evaluation
+
+```bash
+python environments/your_env.py evaluate \
+    --env.eval_size 20 \
+    --env.use_wandb false \
+    --env.data_dir_to_save_evals /tmp/eval_results \
+    --openai.base_url "<BASE_URL>" \
+    --openai.model_name "<MODEL>" \
+    --openai.server_type <SERVER_TYPE> \
+    --openai.health_check false
+```
+
+### Step 2: Read results
+
+Stdout shows a lighteval-compatible table:
+
+```
+Evaluation Results: your-env_eval
+|Metric          | Value|
+|mean correctness| 0.850|
+|mean reward     | 0.920|
+|mean tool calls | 4.300|
+|n items         |    20|
+Evaluation completed in 367 seconds
+```
+
+JSON results are saved to the eval directory:
+
+```python
+import json
+
+data = json.load(open("/tmp/eval_results/metrics.json"))
+for metric, value in data["results"]["all"].items():
+    print(f"{metric}: {value}")
+```
+
+### Step 3: Compare models
+
+Run evaluate with different models and compare the metrics.json files.
+
+### What to check:
+- **"data_dir_to_save_evals is not set"** — you forgot the flag; results won't be saved
+- **Tool usage rate = 0** — evaluate() is using chat_completion instead of HermesAgentLoop
+- **All scores identical** — judge failing, falling back to heuristic
+- **Very slow** — each item runs a full agent loop (~30-90s). Use `--env.eval_size 5` for quick checks. 
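The model comparison in Step 3 can be scripted. A small helper, assuming each file follows the `{"results": {"all": {...}}}` layout shown in Step 2:

```python
import json


def compare_runs(paths):
    """Print eval metrics side by side for several evaluate-mode runs.

    Assumes each metrics.json has the layout {"results": {"all": {metric: value}}}.
    Returns {metric: [value_per_path]} for further analysis.
    """
    runs = {p: json.load(open(p))["results"]["all"] for p in paths}
    metrics = sorted({m for r in runs.values() for m in r})
    table = {m: [runs[p].get(m) for p in paths] for m in metrics}
    for m, vals in table.items():
        cells = "  ".join(f"{v:.3f}" if v is not None else "  n/a" for v in vals)
        print(f"{m:<24} {cells}")
    return table


# Example (paths are illustrative):
# compare_runs(["/tmp/eval_modelA/metrics.json", "/tmp/eval_modelB/metrics.json"])
```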
+
+
+## Pattern 3: Generate Training Data (process mode, larger scale)
+
+Generate trajectory data for offline training or analysis:
+
+```bash
+python environments/your_env.py process \
+    --env.total_steps 50 \
+    --env.group_size 4 \
+    --env.use_wandb false \
+    --env.data_path_to_save_groups data/trajectories.jsonl \
+    --openai.base_url "<BASE_URL>" \
+    --openai.model_name "<MODEL>" \
+    --openai.server_type <SERVER_TYPE> \
+    --openai.health_check false
+```
+
+### Analyze the distribution:
+
+```python
+import json
+
+scores = []
+for line in open("data/trajectories.jsonl"):
+    data = json.loads(line)
+    scores.extend(data.get("scores", []))
+
+print(f"Total: {len(scores)}, Mean: {sum(scores)/len(scores):.3f}")
+for bucket in [0.0, 0.2, 0.4, 0.6, 0.8, 1.0]:
+    count = sum(1 for s in scores if abs(s - bucket) < 0.1)
+    print(f"  {bucket:.1f}: {'█' * count} ({count})")
+```
+
+### What to check:
+- **Score distribution has variance** — RL needs score variance. All-same scores are useless.
+
+
+## Pattern 4: Full RL Training (serve mode)
+
+For actual RL training with Atropos:
+
+```bash
+# Terminal 1: Start Atropos API server
+run-api
+
+# Terminal 2: Start your environment
+python environments/your_env.py serve \
+    --config environments/your_env/default.yaml
+```
+
+For Phase 2 with VLLM:
+
+```bash
+# Terminal 1: VLLM server
+python -m vllm.entrypoints.openai.api_server --model your-model --port 8000
+
+# Terminal 2: Atropos API
+run-api
+
+# Terminal 3: Environment
+python environments/your_env.py serve \
+    --openai.base_url http://localhost:8000/v1 \
+    --openai.model_name your-model \
+    --openai.server_type vllm
+```
+
+
+## Pattern 5: Quick Smoke Test
+
+Verify imports and config before spending money on API calls:
+
+```python
+from environments.your_env import YourEnv
+print(f"Name: {YourEnv.name}")
+cfg, servers = YourEnv.config_init()
+print(f"Toolsets: {cfg.enabled_toolsets}")
+print(f"Server: {servers[0].model_name}")
+print("All imports OK")
+```
+
+
+## Timing Expectations
+
+| Mode | Items
| Time per item | Total | +|------|-------|--------------|-------| +| process (1 item) | 1 | 30-90s | ~1 min | +| evaluate (5 items) | 5 | 30-90s | ~5 min | +| evaluate (20 items) | 20 | 30-90s | ~15-30 min | +| process (50 items) | 50 | 30-90s | ~30-75 min | + +Times are for cloud APIs with Claude Sonnet-class models. Local models may be faster or slower depending on hardware.
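The per-item range above can be turned into a rough wall-clock estimate before kicking off a run. A sketch, assuming sequential execution at 30-90 seconds per item:

```python
def estimate_minutes(n_items, sec_per_item=(30, 90)):
    """Rough (low, high) wall-clock range in minutes for a sequential run."""
    lo, hi = sec_per_item
    return n_items * lo / 60, n_items * hi / 60


lo, hi = estimate_minutes(50)
print(f"process (50 items): ~{lo:.0f}-{hi:.0f} min")  # → ~25-75 min
```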