---
sidebar_position: 13
title: RL Training
description: Reinforcement learning on agent behaviors with Tinker-Atropos — environment discovery, training, and evaluation
---

# RL Training

Hermes Agent includes an integrated RL (Reinforcement Learning) training pipeline built on Tinker-Atropos. This enables training language models on environment-specific tasks using GRPO (Group Relative Policy Optimization) with LoRA adapters, orchestrated entirely through the agent's tool interface.

## Overview

The RL training system consists of three components:

1. **Atropos** — A trajectory API server that coordinates environment interactions, manages rollout groups, and computes advantages
2. **Tinker** — A training service that handles model weights, LoRA training, sampling/inference, and optimizer steps
3. **Environments** — Python classes that define tasks, scoring, and reward functions (e.g., GSM8K math problems)

The agent can discover environments, configure training parameters, launch training runs, and monitor metrics — all through a set of `rl_*` tools.
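GRPO's group-relative advantages are simple to compute: each completion's reward is normalized against the other completions in its rollout group, so no separate value network is needed. A minimal sketch of that normalization (the actual computation lives in Atropos; names here are illustrative):

```python
from statistics import mean, stdev

def group_relative_advantages(rewards: list[float]) -> list[float]:
    """Normalize each completion's reward against its rollout group (GRPO-style)."""
    mu = mean(rewards)
    sigma = stdev(rewards) if len(rewards) > 1 else 0.0
    if sigma == 0.0:
        # Every completion scored the same: no learning signal for this group
        return [0.0] * len(rewards)
    return [(r - mu) / sigma for r in rewards]

# Example: one GSM8K item with group_size=4, two correct completions
print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))  # [0.866, -0.866, -0.866, 0.866]
```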

## Requirements

RL training requires:

- Python >= 3.11 (Tinker package requirement)
- `TINKER_API_KEY` — API key for the Tinker training service
- `WANDB_API_KEY` — API key for Weights & Biases metrics tracking
- The tinker-atropos submodule (at `tinker-atropos/` relative to the Hermes root)

```bash
# Set up API keys
hermes config set TINKER_API_KEY your-tinker-key
hermes config set WANDB_API_KEY your-wandb-key
```

When both keys are present and Python >= 3.11 is available, the `rl` toolset is automatically enabled.
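A sketch of what that gating check amounts to, assuming the keys surface as environment variables (the real check lives in the toolset loader):

```python
import os
import sys

def rl_toolset_available() -> bool:
    """Illustrative version of the rl toolset auto-enable check."""
    return (
        sys.version_info >= (3, 11)                  # Tinker package requirement
        and bool(os.environ.get("TINKER_API_KEY"))   # training service
        and bool(os.environ.get("WANDB_API_KEY"))    # metrics tracking
    )
```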

## Available Tools

| Tool | Description |
| --- | --- |
| `rl_list_environments` | Discover available RL environments |
| `rl_select_environment` | Select an environment and load its config |
| `rl_get_current_config` | View configurable and locked fields |
| `rl_edit_config` | Modify configurable training parameters |
| `rl_start_training` | Launch a training run (spawns 3 processes) |
| `rl_check_status` | Monitor training progress and WandB metrics |
| `rl_stop_training` | Stop a running training job |
| `rl_get_results` | Get final metrics and model weights path |
| `rl_list_runs` | List all active and completed runs |
| `rl_test_inference` | Quick inference test using OpenRouter |

## Workflow

### 1. Discover Environments

```
List the available RL environments
```

The agent calls `rl_list_environments()`, which scans `tinker-atropos/tinker_atropos/environments/` using AST parsing to find Python classes inheriting from `BaseEnv`. Each environment defines:

- **Dataset loading** — where training data comes from (e.g., HuggingFace datasets)
- **Prompt construction** — how to format items for the model
- **Scoring/verification** — how to evaluate model outputs and assign rewards
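AST parsing means discovery works without importing the environment modules. A minimal sketch of scanning for `BaseEnv` subclasses (the actual scanner may differ; names here are illustrative):

```python
import ast
from pathlib import Path

def find_environments(env_dir: Path) -> dict[str, list[str]]:
    """Map each environment file to the BaseEnv subclasses it defines."""
    found: dict[str, list[str]] = {}
    for py_file in env_dir.glob("*.py"):
        tree = ast.parse(py_file.read_text())
        classes = [
            node.name
            for node in ast.walk(tree)
            if isinstance(node, ast.ClassDef)
            and any(isinstance(b, ast.Name) and b.id == "BaseEnv" for b in node.bases)
        ]
        if classes:
            found[py_file.stem] = classes
    return found

print(find_environments(Path("tinker-atropos/tinker_atropos/environments")))
```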

### 2. Select and Configure

```
Select the GSM8K environment and show me the configuration
```

The agent calls `rl_select_environment("gsm8k_tinker")`, then `rl_get_current_config()` to see all parameters.

Configuration fields are divided into two categories:

**Configurable fields** (can be modified):

- `group_size` — Number of completions per item (default: 16)
- `batch_size` — Training batch size (default: 128)
- `wandb_name` — WandB run name (auto-set to `{env}-{timestamp}`)
- Other environment-specific parameters

**Locked fields** (infrastructure settings, cannot be changed):

- `tokenizer_name` — Model tokenizer (e.g., `Qwen/Qwen3-8B`)
- `rollout_server_url` — Atropos API URL (`http://localhost:8000`)
- `max_token_length` — Maximum token length (8192)
- `max_num_workers` — Maximum parallel workers (2048)
- `total_steps` — Total training steps (2500)
- `lora_rank` — LoRA adapter rank (32)
- `learning_rate` — Learning rate (4e-5)
- `max_token_trainer_length` — Max tokens for trainer (9000)
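The configurable/locked split is essentially a guard on top of a merged config. An illustrative sketch (field values taken from the lists above; the function name and structure are assumptions):

```python
LOCKED = {
    "tokenizer_name": "Qwen/Qwen3-8B",
    "rollout_server_url": "http://localhost:8000",
    "total_steps": 2500,
    "lora_rank": 32,
    "learning_rate": 4e-5,
}
CONFIGURABLE = {"group_size": 16, "batch_size": 128, "wandb_name": None}

def edit_config(overrides: dict) -> dict:
    """Apply user overrides, rejecting any attempt to touch a locked field."""
    for key in overrides:
        if key in LOCKED:
            raise ValueError(f"{key} is a locked infrastructure setting")
    return {**CONFIGURABLE, **overrides}

print(edit_config({"group_size": 8, "wandb_name": "gsm8k-experiment"}))
```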

### 3. Start Training

```
Start the training run
```

The agent calls `rl_start_training()`, which:

1. Generates a YAML config file merging locked settings with configurable overrides
2. Creates a unique run ID
3. Spawns three processes:
   - **Atropos API server** (`run-api`) — trajectory coordination
   - **Tinker trainer** (`launch_training.py`) — LoRA training plus a FastAPI inference server on port 8001
   - **Environment** (`environment.py serve`) — the selected environment connecting to Atropos

The processes start with staggered delays (5s after the API server, 30s after the trainer, then a further 90s before the environment) so each dependency is up before the next process connects.
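A simplified sketch of that launch sequence, using the commands and delays described above (the helper names, exact command-line arguments, and delay placement are assumptions):

```python
import subprocess
import time
import uuid
from pathlib import Path

LOG_DIR = Path.home() / ".hermes" / "logs" / "rl_training"

def start_training(config_path: Path) -> tuple[str, list[subprocess.Popen]]:
    run_id = uuid.uuid4().hex[:8]
    procs: list[subprocess.Popen] = []

    def spawn(name: str, cmd: list[str]) -> None:
        log = open(LOG_DIR / f"{name}_{run_id}.log", "w")
        procs.append(subprocess.Popen(cmd, stdout=log, stderr=subprocess.STDOUT))

    spawn("api", ["run-api"])  # Atropos trajectory server on port 8000
    time.sleep(5)
    spawn("trainer", ["python", "launch_training.py", str(config_path)])
    time.sleep(30 + 90)  # 30s for the trainer, then 90s more before the env connects
    spawn("env", ["python", "environment.py", "serve", str(config_path)])
    return run_id, procs
```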

### 4. Monitor Progress

```
Check the status of training run abc12345
```

The agent calls `rl_check_status(run_id)`, which reports:

- Process status (running/exited for each of the 3 processes)
- Running time
- WandB metrics (step, reward mean, percent correct, eval accuracy)
- Log file locations for debugging

:::note Rate Limiting
Status checks are rate-limited to once every 30 minutes per run ID. This prevents excessive polling during long-running training jobs that take hours.
:::
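A per-run rate limit like this can be as simple as a timestamp table. An illustrative sketch, not the actual implementation:

```python
import time

_last_check: dict[str, float] = {}
RATE_LIMIT_SECONDS = 30 * 60  # once every 30 minutes per run ID

def may_check_status(run_id: str) -> bool:
    """Return True if this run's status may be polled again yet."""
    now = time.monotonic()
    last = _last_check.get(run_id)
    if last is not None and now - last < RATE_LIMIT_SECONDS:
        return False
    _last_check[run_id] = now
    return True
```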

### 5. Stop or Get Results

```
Stop the training run
# or
Get the final results for run abc12345
```

`rl_stop_training()` terminates all three processes in reverse order (environment → trainer → API). `rl_get_results()` retrieves final WandB metrics and training history.
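Teardown mirrors startup in reverse, so nothing keeps writing to a server that has already gone away. A minimal sketch (illustrative):

```python
import subprocess

def stop_training(procs: list[subprocess.Popen]) -> None:
    """Stop env -> trainer -> API, the reverse of the launch order."""
    for proc in reversed(procs):  # procs = [api, trainer, env] from start_training()
        proc.terminate()
        proc.wait(timeout=30)
```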

## Inference Testing

Before committing to a full training run, you can test whether an environment works correctly using `rl_test_inference`. This runs a few steps of inference and scoring through OpenRouter — no Tinker API key needed, just an `OPENROUTER_API_KEY`.

```
Test the selected environment with inference
```

Default configuration:

- 3 steps × 16 completions = 48 rollouts per model
- Tests 3 models at different scales for robustness:
  - `qwen/qwen3-8b` (small)
  - `z-ai/glm-4.7-flash` (medium)
  - `minimax/minimax-m2.7` (large)
- Total: ~144 rollouts

This validates:

- Environment loads correctly
- Prompt construction works
- Inference response parsing is robust across model scales
- Verifier/scoring logic produces valid rewards
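In outline, the test loop is just chat completions against OpenRouter's OpenAI-compatible endpoint plus the environment's own scorer. A sketch under those assumptions (the `env` interface and `score_answer` signature are illustrative stand-ins for the real environment API):

```python
import os
from openai import OpenAI

# OpenRouter exposes an OpenAI-compatible API at this base URL
client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key=os.environ["OPENROUTER_API_KEY"],
)

MODELS = ["qwen/qwen3-8b", "z-ai/glm-4.7-flash", "minimax/minimax-m2.7"]

def test_inference(env, steps: int = 3, completions: int = 16) -> dict[str, float]:
    """Run steps x completions rollouts per model and report mean reward."""
    scores: dict[str, float] = {}
    for model in MODELS:
        rewards = []
        for _ in range(steps):
            item = env.get_next_item()  # assumed BaseEnv-style interface
            for _ in range(completions):
                resp = client.chat.completions.create(
                    model=model,
                    messages=[{"role": "user", "content": item["prompt"]}],
                )
                rewards.append(env.score_answer(item, resp.choices[0].message.content))
        scores[model] = sum(rewards) / len(rewards)
    return scores
```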

## Tinker API Integration

The trainer uses the Tinker API for model training operations:

- **ServiceClient** — Creates training and sampling clients
- **Training client** — Handles forward-backward passes with importance sampling loss, optimizer steps (Adam), and weight checkpointing
- **Sampling client** — Provides inference using the latest trained weights

The training loop:

1. Fetches a batch of rollouts from Atropos (prompt + completions + scores)
2. Converts to Tinker `Datum` objects with padded logprobs and advantages
3. Runs forward-backward pass with importance sampling loss
4. Takes an optimizer step (Adam: lr=4e-5, β1=0.9, β2=0.95)
5. Saves weights and creates a new sampling client for next-step inference
6. Logs metrics to WandB
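One iteration of that loop, in outline. This is a pseudocode-level sketch: the client method names (`forward_backward`, `optim_step`, `save_weights_and_get_sampling_client`) and the helpers `atropos.fetch_batch()` / `to_tinker_datum()` are assumptions for illustration, not Tinker's documented API.

```python
import wandb

def training_step(atropos, training_client, step: int):
    # 1-2. Fetch rollouts from Atropos and convert to Tinker Datum objects
    batch = atropos.fetch_batch()                            # tokens, scores, logprobs
    data = [to_tinker_datum(rollout) for rollout in batch]   # pad logprobs, attach advantages

    # 3-4. Forward-backward with importance sampling loss, then an Adam step
    metrics = training_client.forward_backward(data, loss_fn="importance_sampling")
    training_client.optim_step(lr=4e-5, beta1=0.9, beta2=0.95)

    # 5. Checkpoint and refresh the sampling client so the next rollouts
    #    are generated with the just-updated weights
    sampling_client = training_client.save_weights_and_get_sampling_client()

    # 6. Log metrics to WandB
    wandb.log({"train/loss": metrics["loss"]}, step=step)
    return sampling_client
```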

## Architecture Diagram

```mermaid
flowchart LR
    api["Atropos API<br/>run-api<br/>port 8000"]
    env["Environment<br/>BaseEnv implementation"]
    infer["OpenAI / sglang<br/>inference API<br/>port 8001"]
    trainer["Tinker Trainer<br/>LoRA training + FastAPI"]

    env <--> api
    env --> infer
    api -->|"batches: tokens, scores, logprobs"| trainer
    trainer -->|"serves inference"| infer
```

## Creating Custom Environments

To create a new RL environment:

1. Create a Python file in `tinker-atropos/tinker_atropos/environments/`
2. Define a class that inherits from `BaseEnv`
3. Implement the required methods:
   - `load_dataset()` — Load your training data
   - `get_next_item()` — Provide the next item to the model
   - `score_answer()` — Score model outputs and assign rewards
   - `collect_trajectories()` — Collect and return trajectories
4. Optionally define a custom config class inheriting from `BaseEnvConfig`

Study the existing `gsm8k_tinker.py` as a template. The agent can help you create new environments — it can read existing environment files, inspect HuggingFace datasets, and write new environment code.
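A bare-bones skeleton following those steps (the import path and method signatures are simplified guesses; mirror `gsm8k_tinker.py` for the real ones):

```python
# tinker-atropos/tinker_atropos/environments/my_env.py
# Import path and signatures are illustrative, not the actual module layout.
from tinker_atropos.environments.base import BaseEnv


class MyEnv(BaseEnv):
    def load_dataset(self):
        # Load training data, e.g. from a HuggingFace dataset
        self.items = [{"question": "2 + 2 = ?", "answer": "4"}]
        self.index = 0

    def get_next_item(self):
        # Provide the next item to the model
        item = self.items[self.index % len(self.items)]
        self.index += 1
        return item

    def score_answer(self, item, completion):
        # Assign a reward: 1.0 for a correct final answer, else 0.0
        return 1.0 if item["answer"] in completion else 0.0

    def collect_trajectories(self, item):
        # Sample group_size completions, score them, and return trajectory groups
        ...
```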

## WandB Metrics

Training runs log to Weights & Biases with these key metrics:

| Metric | Description |
| --- | --- |
| `train/loss` | Training loss (importance sampling) |
| `train/learning_rate` | Current learning rate |
| `reward/mean` | Mean reward across groups |
| `logprobs/mean` | Mean reference logprobs |
| `logprobs/mean_training` | Mean training logprobs |
| `logprobs/diff` | Logprob drift (reference - training) |
| `advantages/mean` | Mean advantage values |
| `advantages/std` | Advantage standard deviation |
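These metrics can also be pulled programmatically with the standard `wandb` public API; the entity/project/run path below is a placeholder:

```python
import wandb

api = wandb.Api()
run = api.run("my-entity/my-project/run-id")  # placeholder run path

# History of selected metrics as a pandas DataFrame
history = run.history(keys=["reward/mean", "train/loss"])
print(history.tail())
```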

## Log Files

Each training run generates log files in `~/.hermes/logs/rl_training/`:

```text
logs/
├── api_{run_id}.log        # Atropos API server logs
├── trainer_{run_id}.log    # Tinker trainer logs
├── env_{run_id}.log        # Environment process logs
└── inference_tests/        # Inference test results
    ├── test_{env}_{model}.jsonl
    └── test_{env}_{model}.log
```

These are invaluable for debugging when training fails or produces unexpected results.