Merge pull request #815 from NousResearch/hermes/hermes-5ab2a29e

Add hermes-atropos-environments bundled skill
This commit is contained in:
Teknium
2026-03-09 23:06:19 -07:00
committed by GitHub
4 changed files with 625 additions and 0 deletions


@@ -0,0 +1,302 @@
---
name: hermes-atropos-environments
description: Build, test, and debug Hermes Agent RL environments for Atropos training. Covers the HermesAgentBaseEnv interface, reward functions, agent loop integration, evaluation with tools, wandb logging, and the three CLI modes (serve/process/evaluate). Use when creating, reviewing, or fixing RL environments in the hermes-agent repo.
version: 1.1.0
author: Hermes Agent
license: MIT
metadata:
  hermes:
    tags: [atropos, rl, environments, training, reinforcement-learning, reward-functions]
    related_skills: [axolotl, grpo-rl-training, trl-fine-tuning, lm-evaluation-harness]
---
# Hermes Agent Atropos Environments
Guide for building RL environments in the hermes-agent repo that integrate with the Atropos training framework.
## Architecture Overview
```
Atropos BaseEnv (atroposlib/envs/base.py)
└── HermesAgentBaseEnv (environments/hermes_base_env.py)
    ├── Handles agent loop orchestration
    ├── Handles tool resolution per group
    ├── Handles ToolContext for reward verification
    └── YOUR ENVIRONMENT (environments/your_env.py)
            Only implements: setup, get_next_item, format_prompt,
            compute_reward, evaluate, wandb_log
```
Hermes environments are special because they run a **multi-turn agent loop with tool calling** — not just single-turn completions. The base env handles the loop; you implement the task and scoring.
## File Locations
| File | Purpose |
|------|---------|
| `environments/hermes_base_env.py` | Base class with agent loop + tool resolution |
| `environments/agent_loop.py` | `HermesAgentLoop` + `AgentResult` dataclass |
| `environments/tool_context.py` | `ToolContext` for reward verification |
| `environments/tool_call_parsers.py` | Phase 2 tool call parsers (hermes, mistral, etc.) |
| `environments/your_env.py` | Your environment implementation |
## Inference Setup — Ask the User First
**IMPORTANT:** Before running any test, evaluation, or data generation command, always ask the user how they want to handle inference. Do NOT assume OpenRouter or any specific endpoint. Present these options:
1. **OpenRouter** — Ask which model they want to use (e.g., `anthropic/claude-sonnet-4.5`, `google/gemini-2.5-pro`, `meta-llama/llama-3.3-70b-instruct`, etc.). Requires `OPENROUTER_API_KEY` in environment.
2. **Self-hosted VLLM endpoint** — Ask for their base URL (e.g., `http://localhost:8000/v1`) and model name. Set `--openai.server_type vllm`.
3. **Other OpenAI-compatible API** — Ask for the base URL, model name, and any required API key. Set `--openai.server_type openai` and `--openai.health_check false`.
4. **Local Atropos training server** — For `serve` mode with a live training loop. Default `http://localhost:8000/v1`.
Once the user tells you their setup, use those values in all CLI commands for that session. Example prompts:
> "Before I run this, how would you like to handle inference?
> 1. OpenRouter (I'll need your preferred model, e.g. claude-sonnet-4.5)
> 2. A self-hosted VLLM endpoint (give me the URL and model name)
> 3. Another OpenAI-compatible API (give me the URL, model, and any auth details)
> 4. Local Atropos training server (serve mode)"
### Key flags by provider:
| Provider | `--openai.server_type` | `--openai.health_check` | `--openai.api_key` |
|----------|----------------------|------------------------|-------------------|
| OpenRouter | `openai` | `false` | `$OPENROUTER_API_KEY` |
| VLLM (self-hosted) | `vllm` | (default) | (not needed) |
| Other OpenAI-compatible | `openai` | `false` | As needed |
| Local Atropos | (default) | (default) | (not needed) |
## Required Methods
### 1. `setup()` — Load dataset and initialize state
```python
async def setup(self) -> None:
    """Called once at startup. Load datasets, initialize state."""
    # Try HuggingFace first, fall back to built-in samples
    try:
        from datasets import load_dataset
        ds = load_dataset("your/dataset", split="test")
        self._items = [...]
    except Exception:
        self._items = BUILTIN_SAMPLES  # hardcoded fallback defined at module level
    # Always split into train/eval (assumes `import random` at module level)
    random.shuffle(self._items)
    eval_size = max(20, int(len(self._items) * 0.1))  # at least 20 eval items, or 10%
    self._eval_items = self._items[:eval_size]
    self._items = self._items[eval_size:]
```
### 2. `get_next_item()` — Return next training item
```python
async def get_next_item(self) -> dict:
    """Return next item, cycling through dataset."""
    item = self._items[self._index % len(self._items)]
    self._index += 1
    return item
```
### 3. `format_prompt(item)` — Convert item to user message
```python
def format_prompt(self, item: dict) -> str:
    """Convert a dataset item into the user-facing prompt."""
    return f"Research this question: {item['question']}"
```
### 4. `compute_reward(item, result, ctx)` — Score the rollout
**CRITICAL**: `result` is an `AgentResult`, NOT a dict. It has these attributes:
- `result.messages` — List of message dicts (OpenAI format)
- `result.turns_used` — Number of LLM calls made
- `result.finished_naturally` — True if model stopped voluntarily
- `result.tool_errors` — List of ToolError objects
**AgentResult does NOT have**: `final_response`, `tool_calls`, `tools_used`.
You must extract these from `result.messages`:
```python
async def compute_reward(self, item, result: AgentResult, ctx: ToolContext) -> float:
    # Extract final response (last assistant message with content)
    final_response = ""
    tools_used = []
    for msg in reversed(result.messages):
        if msg.get("role") == "assistant" and msg.get("content") and not final_response:
            final_response = msg["content"]
        if msg.get("role") == "assistant" and msg.get("tool_calls"):
            for tc in msg["tool_calls"]:
                fn = tc.get("function", {}) if isinstance(tc, dict) else {}
                name = fn.get("name", "")
                if name:
                    tools_used.append(name)
    # Score using LLM judge, heuristic, or ToolContext verification
    correctness = await self._llm_judge(item, final_response)
    return correctness
```
`ctx` (ToolContext) gives you terminal/file access to the agent's sandbox for verification:
```python
# Run tests in the agent's sandbox
result = ctx.terminal("pytest /workspace/test.py")
return 1.0 if result["exit_code"] == 0 else 0.0
```
### 5. `evaluate()` — Periodic evaluation with full agent loop
**MUST use the full agent loop with tools**, not single-turn chat_completion.
The whole point of hermes-agent environments is agentic evaluation:
```python
async def evaluate(self, *args, **kwargs) -> None:
    import time, uuid
    from environments.agent_loop import HermesAgentLoop
    from environments.tool_context import ToolContext

    start_time = time.time()
    tools, valid_names = self._resolve_tools_for_group()
    samples = []
    for item in self._eval_items[:self.config.eval_size]:
        task_id = str(uuid.uuid4())
        messages = []
        if self.config.system_prompt:
            messages.append({"role": "system", "content": self.config.system_prompt})
        messages.append({"role": "user", "content": self.format_prompt(item)})
        agent = HermesAgentLoop(
            server=self.server,
            tool_schemas=tools,
            valid_tool_names=valid_names,
            max_turns=self.config.max_agent_turns,
            task_id=task_id,
            temperature=0.0,  # Deterministic for eval
            max_tokens=self.config.max_token_length,
            extra_body=self.config.extra_body,
        )
        result = await agent.run(messages)
        ctx = ToolContext(task_id)
        try:
            reward = await self.compute_reward(item, result, ctx)
        finally:
            ctx.cleanup()
        samples.append({"prompt": ..., "response": ..., "reward": reward})
    eval_metrics = {"eval/mean_reward": ...}
    await self.evaluate_log(metrics=eval_metrics, samples=samples,
                            start_time=start_time, end_time=time.time())
```
### 6. `wandb_log()` — Custom metrics logging
Always call `super().wandb_log()` at the end:
```python
async def wandb_log(self, wandb_metrics=None):
    if wandb_metrics is None:
        wandb_metrics = {}
    if self._reward_buffer:
        n = len(self._reward_buffer)
        wandb_metrics["train/mean_reward"] = sum(self._reward_buffer) / n
        self._reward_buffer.clear()
    await super().wandb_log(wandb_metrics)  # MUST call super
```
**Pitfall**: `compute_reward` appends to metric buffers. During eval, this pollutes training metrics. Roll back buffer entries added during eval.
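A minimal rollback sketch, assuming the `_reward_buffer` list from the `wandb_log` example above is the only buffer `compute_reward` touches:
```python
# Inside evaluate(): snapshot the buffer length, then discard anything
# compute_reward appended so eval rollouts don't skew training metrics.
n_before = len(self._reward_buffer)
try:
    reward = await self.compute_reward(item, result, ctx)
finally:
    del self._reward_buffer[n_before:]  # roll back eval-time appends
```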
## Config Class
Always create a custom config subclass with Pydantic Field descriptors. Key inherited fields you can tune: `enabled_toolsets`, `max_agent_turns`, `agent_temperature`, `system_prompt`, `terminal_backend`, `group_size`, `steps_per_eval`, `total_steps`.
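A minimal config sketch; the base config class name/export is an assumption about `hermes_base_env.py`, and `dataset_name`/`judge_model` are hypothetical task-specific fields:
```python
from pydantic import Field

from environments.hermes_base_env import HermesAgentBaseEnvConfig  # assumed export


class MyEnvConfig(HermesAgentBaseEnvConfig):
    # Hypothetical task-specific fields; inherited knobs like
    # enabled_toolsets and max_agent_turns are overridden the same way.
    dataset_name: str = Field(
        default="your/dataset",
        description="HuggingFace dataset loaded in setup()",
    )
    judge_model: str = Field(
        default="",
        description="Model used by the LLM judge; empty disables judging",
    )
```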
## config_init() — Default Configuration
Classmethod returning `(YourEnvConfig, [APIServerConfig(...)])`. Set server_type to "openai" for OpenRouter/external APIs. Load API key from environment variable.
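A sketch of `config_init()`, assuming `APIServerConfig` accepts fields mirroring the `--openai.*` CLI flags shown below (the import path is an assumption):
```python
import os

from atroposlib.envs.base import APIServerConfig  # assumed import path


class MyEnv(HermesAgentBaseEnv):
    @classmethod
    def config_init(cls):
        env_config = MyEnvConfig(group_size=4, steps_per_eval=100)
        servers = [
            APIServerConfig(
                base_url="https://openrouter.ai/api/v1",  # example: OpenRouter
                model_name="anthropic/claude-sonnet-4.5",
                server_type="openai",   # external API, not local VLLM
                api_key=os.environ.get("OPENROUTER_API_KEY", ""),
                health_check=False,     # OpenRouter has no /health endpoint
            )
        ]
        return env_config, servers
```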
## Three CLI Modes
```bash
# SERVE — Full training loop (connects to Atropos API server)
python environments/my_env.py serve --openai.base_url http://localhost:8000/v1

# PROCESS — Offline data generation (saves JSONL)
python environments/my_env.py process --env.total_steps 10 --env.group_size 1 \
  --env.use_wandb false --env.data_path_to_save_groups output.jsonl \
  --openai.base_url "<USER_BASE_URL>" \
  --openai.model_name "<USER_MODEL>" \
  --openai.server_type <USER_SERVER_TYPE> --openai.health_check false

# EVALUATE — Standalone eval (runs setup + evaluate only)
python environments/my_env.py evaluate --env.eval_size 20 \
  --env.data_dir_to_save_evals /tmp/eval_results \
  --openai.base_url "<USER_BASE_URL>" \
  --openai.model_name "<USER_MODEL>" \
  --openai.server_type <USER_SERVER_TYPE> --openai.health_check false
```
Config priority: CLI args > YAML file > config_init() defaults.
## Common Pitfalls
1. **AgentResult has .messages, not .final_response** — Extract the final response by iterating reversed(result.messages) looking for the last assistant message with content.
2. **evaluate() must use HermesAgentLoop, not chat_completion** — Single-turn chat_completion has no tools. The whole point of hermes-agent benchmarks is agentic evaluation with tool use.
3. **Don't call _llm_judge twice** — If compute_reward already calls it, extract the score from the buffer instead of calling judge separately in evaluate().
4. **Eval pollutes training buffers** — compute_reward appends to metric buffers. During eval, roll back buffer entries to keep training metrics clean.
5. **Always set health_check=false for OpenRouter** — OpenRouter has no /health endpoint.
6. **Set data_dir_to_save_evals in evaluate mode** — Without it, results aren't saved.
7. **default_toolsets class variable vs enabled_toolsets config** — The class variable is a hint; the config field is what actually controls tool resolution.
8. **Tool call parsing in messages** — Tool calls are dicts with `{"function": {"name": ..., "arguments": ...}}`. Always check `isinstance(tc, dict)`.
9. **ToolContext.cleanup()** — Always call in a finally block to release sandbox resources.
10. **server_type must be "openai" for external APIs** — Without it, Atropos assumes a local VLLM server.
11. **Always ask the user for their inference setup** — Never hardcode or assume a specific provider/model. See the "Inference Setup" section above.
## Reward Function Patterns
### LLM Judge (for open-ended tasks)
Use `self.server.chat_completion()` with a scoring prompt. Parse JSON response for score float. Always include a heuristic fallback (keyword overlap) for when the judge call fails.
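A judge sketch, assuming an OpenAI-style response object from `self.server.chat_completion()` and hypothetical `question`/`answer` item keys:
```python
import json


async def _llm_judge(self, item: dict, final_response: str) -> float:
    """Method of your env class: score a response with an LLM, 0..1."""
    prompt = (
        f"Question: {item['question']}\n"
        f"Reference answer: {item.get('answer', '')}\n"
        f"Candidate response: {final_response}\n\n"
        'Grade the response. Reply with JSON only: {"score": <float in [0, 1]>}'
    )
    try:
        completion = await self.server.chat_completion(
            messages=[{"role": "user", "content": prompt}],
        )
        text = completion.choices[0].message.content  # OpenAI-style shape assumed
        score = float(json.loads(text)["score"])
        return max(0.0, min(1.0, score))
    except Exception:
        # Heuristic fallback: keyword overlap with the reference answer
        ref = set(item.get("answer", "").lower().split())
        got = set(final_response.lower().split())
        return len(ref & got) / max(len(ref), 1)
```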
### Binary Verification (for code/terminal tasks)
Use `ctx.terminal("pytest test.py -q")` to run tests in the agent's sandbox. Return 1.0 for pass, 0.0 for fail.
### Multi-Signal (combine multiple indicators)
Weight correctness (0.6) + tool usage (0.2) + efficiency (0.2) + optional bonuses. Clamp to [0, 1].
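A sketch of the weighting inside `compute_reward`, reusing `final_response`/`tools_used` from the extraction example above; the efficiency term is one plausible choice, not a fixed recipe:
```python
# Combine signals; weights follow the 0.6/0.2/0.2 split above.
correctness = await self._llm_judge(item, final_response)
tool_score = 1.0 if tools_used else 0.0  # did the agent use any tool?
efficiency = max(0.0, 1.0 - result.turns_used / self.config.max_agent_turns)
reward = 0.6 * correctness + 0.2 * tool_score + 0.2 * efficiency
return max(0.0, min(1.0, reward))  # clamp to [0, 1]
```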
## Testing Your Environment
1. **Import test**: `python -c "from environments.my_env import MyEnv; print('OK')"`
2. **Ask the user for inference setup** (see "Inference Setup" section above)
3. **Process mode** (1 item): Verify JSONL output has valid tokens, masks, scores
4. **Evaluate mode**: Verify full agent loop runs with tools, metrics logged correctly
5. **Check reward range**: Scores should be in [0, 1], not all identical
## Minimum Implementation Checklist
```python
class MyEnv(HermesAgentBaseEnv):
    name = "my-env"
    env_config_cls = MyEnvConfig

    @classmethod
    def config_init(cls): ...              # Default server + env config
    async def setup(self): ...             # Load dataset + train/eval split
    async def get_next_item(self): ...     # Cycle through training items
    def format_prompt(self, item): ...     # Item → user message string
    async def compute_reward(self, item, result, ctx): ...  # Score rollout
    async def evaluate(self, *args, **kwargs): ...          # Full agent loop eval
    async def wandb_log(self, metrics=None): ...            # Custom metrics + super()

if __name__ == "__main__":
    MyEnv.cli()
```


@@ -0,0 +1,59 @@
# AgentResult Fields Reference
`AgentResult` is defined in `environments/agent_loop.py` as a dataclass.
## Fields
| Field | Type | Description |
|-------|------|-------------|
| `messages` | `List[Dict[str, Any]]` | Full conversation history in OpenAI message format |
| `managed_state` | `Optional[Dict]` | ManagedServer.get_state() if Phase 2, else None |
| `turns_used` | `int` | Number of LLM calls made during the loop |
| `finished_naturally` | `bool` | True if model stopped calling tools on its own |
| `reasoning_per_turn` | `List[Optional[str]]` | Extracted reasoning content per turn |
| `tool_errors` | `List[ToolError]` | Tool errors encountered during the loop |
## ToolError Fields
| Field | Type | Description |
|-------|------|-------------|
| `turn` | `int` | Turn on which the error occurred |
| `tool_name` | `str` | Name of the tool that failed |
| `arguments` | `str` | Arguments passed to the tool |
| `error` | `str` | Error message |
| `tool_result` | `str` | The result returned to the model |
## Extracting Data from Messages
Messages follow OpenAI format. Common patterns:
```python
# Get final assistant response
for msg in reversed(result.messages):
    if msg.get("role") == "assistant" and msg.get("content"):
        final_response = msg["content"]
        break

# Get all tool names used
tools = []
for msg in result.messages:
    if msg.get("role") == "assistant" and msg.get("tool_calls"):
        for tc in msg["tool_calls"]:
            fn = tc.get("function", {}) if isinstance(tc, dict) else {}
            name = fn.get("name", "")
            if name:  # skip malformed calls with no name
                tools.append(name)

# Get tool results
for msg in result.messages:
    if msg.get("role") == "tool":
        tool_output = msg.get("content", "")
        call_id = msg.get("tool_call_id", "")
```
## Fields that DO NOT EXIST
These are common mistakes — AgentResult does NOT have:
- `final_response` — extract from messages
- `tool_calls` — extract from messages
- `tools_used` — extract from messages
- `output` — extract from messages
- `response` — extract from messages


@@ -0,0 +1,65 @@
# Atropos BaseEnv Reference
Source: `atroposlib/envs/base.py` (~2124 lines)
## Abstract Methods (MUST implement)
| Method | Signature | Description |
|--------|-----------|-------------|
| `get_next_item()` | `async def get_next_item(self) -> Item` | Return next item for trajectory. Return None to pause. |
| `evaluate()` | `async def evaluate(self, *args, **kwargs)` | Called every steps_per_eval steps. |
| `setup()` | `async def setup(self)` | Called once at start. Load datasets, init models. |
| `collect_trajectory()` | `async def collect_trajectory(self, item) -> Tuple[Optional[ScoredDataItem], List[Item]]` | Single rollout. Or override collect_trajectories instead. |
## Overridable Methods
| Method | Default Behavior | Override When |
|--------|-----------------|---------------|
| `collect_trajectories()` | Runs collect_trajectory group_size times in parallel | Batch generation, MCTS, coupled rollouts |
| `wandb_log()` | Logs completion lengths, rollout table, perf stats | Add custom metrics (always call super) |
| `config_init()` | Returns (env_config_cls(), ServerBaseline()) | Custom defaults + server configs |
| `postprocess_histories()` | Passthrough | Final processing before sending to trainer |
| `save_checkpoint()` | Saves JSON to checkpoint_dir | Custom serialization |
| `cleanup()` | No-op | Release resources after each rollout |
## ScoredDataGroup Structure
```python
ScoredDataGroup = TypedDict with:
    tokens: List[List[int]]            # Token IDs per rollout
    masks: List[List[int]]             # -100 = prompt, token_id = completion
    scores: List[float]                # Score per rollout
    advantages: Optional[...]          # Per-token advantages
    ref_logprobs: Optional[...]        # Reference model logprobs
    messages: Optional[...]            # OpenAI-format messages
    inference_logprobs: Optional[...]  # Inference logprobs
```
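A toy group illustrating the mask convention (token ids are made up): prompt positions hold -100, completion positions repeat the token id:
```python
# One rollout: a 3-token prompt followed by a 2-token completion.
group = {
    "tokens": [[101, 2009, 2003, 7592, 102]],
    "masks":  [[-100, -100, -100, 7592, 102]],  # -100 = prompt position
    "scores": [0.85],
}
```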
## BaseEnvConfig Key Fields
| Field | Default | Description |
|-------|---------|-------------|
| `group_size` | 4 | Responses grouped for scoring |
| `steps_per_eval` | 100 | Steps between evaluations |
| `max_token_length` | 2048 | Max token length for generations |
| `total_steps` | 1000 | Total training steps |
| `use_wandb` | True | Enable wandb logging |
| `tokenizer_name` | DeepHermes-3 | Tokenizer for token encoding |
| `ensure_scores_are_not_same` | True | Skip groups with identical scores |
| `worker_timeout` | 600 | Per-task timeout in seconds |
## Data Flow
```
env_manager() → add_train_workers() → handle_env()
  → collect_trajectories() → postprocess_histories()
  → handle_send_to_api() → training server
```
## Atropos Environment Statistics (82 environments analyzed)
- 95% implement setup, collect_trajectories, evaluate, get_next_item
- 76% override wandb_log
- 54% have custom config class
- Most use collect_trajectories (plural), not collect_trajectory (singular)
- Common reward patterns: LLM-judge (~40), regex-extract (~35), code-exec (~12)


@@ -0,0 +1,199 @@
# Usage Patterns — Testing Environments and Evaluating Models
## Pattern 1: Test Your Environment Works (process mode)
Use `process` mode to verify your environment runs end-to-end before
committing. This generates trajectories without needing an Atropos
training server.
**Before running:** Ask the user for their inference setup (see SKILL.md "Inference Setup" section). Replace `<BASE_URL>`, `<MODEL>`, and `<SERVER_TYPE>` below with their chosen values.
### Step 1: Run 1 trajectory
```bash
cd ~/.hermes/hermes-agent
source .venv/bin/activate
python environments/your_env.py process \
  --env.total_steps 1 \
  --env.group_size 1 \
  --env.use_wandb false \
  --env.data_path_to_save_groups /tmp/test_output.jsonl \
  --openai.base_url "<BASE_URL>" \
  --openai.model_name "<MODEL>" \
  --openai.server_type <SERVER_TYPE> \
  --openai.health_check false
```
### Step 2: Verify the output
```python
import json
for line in open("/tmp/test_output.jsonl"):
    data = json.loads(line)
    print(f"Scores: {data.get('scores', [])}")
    print(f"Token sequences: {len(data.get('tokens', []))}")
    # Check messages include tool calls
    for msg_list in data.get("messages", []):
        roles = [m.get("role") for m in msg_list]
        print(f"Roles: {roles}")
        for m in reversed(msg_list):
            if m.get("role") == "assistant" and m.get("content"):
                print(f"Response: {m['content'][:200]}...")
                break
```
### What to check:
- **Scores are not all 0.0** — if so, compute_reward is broken
- **Scores are in [0, 1]** — not negative, not >1
- **Messages include "tool" role entries** — agent used tools
- **Token sequences are non-empty**
- **An HTML visualization is generated** next to the .jsonl
### Common failures:
- `'AgentResult' object has no attribute 'X'` — accessing a field that doesn't exist. See agentresult-fields.md.
- Score always 0.0 — reward function erroring silently
- Score always 1.0 — verification too lenient or not running
## Pattern 2: Evaluate a Model (evaluate mode)
Use `evaluate` mode to benchmark a model on your environment's eval
split. This runs the full agent loop with tools for each eval item.
### Step 1: Run evaluation
```bash
python environments/your_env.py evaluate \
  --env.eval_size 20 \
  --env.use_wandb false \
  --env.data_dir_to_save_evals /tmp/eval_results \
  --openai.base_url "<BASE_URL>" \
  --openai.model_name "<MODEL>" \
  --openai.server_type <SERVER_TYPE> \
  --openai.health_check false
```
### Step 2: Read results
Stdout shows a lighteval-compatible table:
```
Evaluation Results: your-env_eval
|Metric | Value|
|mean correctness| 0.850 |
|mean reward | 0.920 |
|mean tool calls | 4.300 |
|n items | 20 |
Evaluation completed in 367 seconds
```
JSON results saved to the eval directory:
```python
import json
data = json.load(open("/tmp/eval_results/metrics.json"))
for metric, value in data["results"]["all"].items():
    print(f"{metric}: {value}")
```
### Step 3: Compare models
Run evaluate with different models and compare the metrics.json files.
### What to check:
- **"data_dir_to_save_evals is not set"** — you forgot the flag, results won't be saved
- **Tool usage rate = 0** — evaluate() is using chat_completion instead of HermesAgentLoop
- **All scores identical** — judge failing, falling back to heuristic
- **Very slow** — each item runs a full agent loop (~30-90s). Use `--env.eval_size 5` for quick checks.
## Pattern 3: Generate Training Data (process mode, larger scale)
Generate trajectory data for offline training or analysis:
```bash
python environments/your_env.py process \
  --env.total_steps 50 \
  --env.group_size 4 \
  --env.use_wandb false \
  --env.data_path_to_save_groups data/trajectories.jsonl \
  --openai.base_url "<BASE_URL>" \
  --openai.model_name "<MODEL>" \
  --openai.server_type <SERVER_TYPE> \
  --openai.health_check false
```
### Analyze the distribution:
```python
import json

scores = []
for line in open("data/trajectories.jsonl"):
    data = json.loads(line)
    scores.extend(data.get("scores", []))
print(f"Total: {len(scores)}, Mean: {sum(scores)/len(scores):.3f}")
for bucket in [0.0, 0.2, 0.4, 0.6, 0.8, 1.0]:
    count = sum(1 for s in scores if abs(s - bucket) < 0.1)
    print(f"  {bucket:.1f}: {'█' * count} ({count})")
```
### What to check:
- **Score distribution has variance** — RL needs score variance. All-same scores are useless.
## Pattern 4: Full RL Training (serve mode)
For actual RL training with Atropos:
```bash
# Terminal 1: Start Atropos API server
run-api
# Terminal 2: Start your environment
python environments/your_env.py serve \
  --config environments/your_env/default.yaml
```
For Phase 2 with VLLM:
```bash
# Terminal 1: VLLM server
python -m vllm.entrypoints.openai.api_server --model your-model --port 8000
# Terminal 2: Atropos API
run-api
# Terminal 3: Environment
python environments/your_env.py serve \
  --openai.base_url http://localhost:8000/v1 \
  --openai.model_name your-model \
  --openai.server_type vllm
```
## Pattern 5: Quick Smoke Test
Verify imports and config before spending money on API calls:
```python
from environments.your_env import YourEnv
print(f"Name: {YourEnv.name}")
cfg, servers = YourEnv.config_init()
print(f"Toolsets: {cfg.enabled_toolsets}")
print(f"Server: {servers[0].model_name}")
print("All imports OK")
```
## Timing Expectations
| Mode | Items | Time per item | Total |
|------|-------|--------------|-------|
| process (1 item) | 1 | 30-90s | ~1 min |
| evaluate (5 items) | 5 | 30-90s | ~5 min |
| evaluate (20 items) | 20 | 30-90s | ~15-30 min |
| process (50 items) | 50 | 30-90s | ~30-75 min |
Times are for cloud APIs with Claude Sonnet-class models. Local models may be faster or slower depending on hardware.