# Atropos BaseEnv Reference

**Source:** `atroposlib/envs/base.py` (~2124 lines)

## Abstract Methods (MUST implement)

| Method | Signature | Description |
|---|---|---|
| `get_next_item()` | `async def get_next_item(self) -> Item` | Return the next item for trajectory collection. Return `None` to pause. |
| `evaluate()` | `async def evaluate(self, *args, **kwargs)` | Called every `steps_per_eval` steps. |
| `setup()` | `async def setup(self)` | Called once at start. Load datasets, initialize models. |
| `collect_trajectory()` | `async def collect_trajectory(self, item) -> Tuple[Optional[ScoredDataItem], List[Item]]` | Single rollout; alternatively, override `collect_trajectories` instead. |
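A minimal sketch of an environment implementing the four abstract methods above. A no-op `BaseEnv` stand-in is defined so the sketch runs standalone; a real environment would subclass `atroposlib.envs.base.BaseEnv` instead, and `CountingEnv` and its fabricated rollout data are purely illustrative.

```python
import asyncio
from typing import List, Optional, Tuple

class BaseEnv:
    """No-op stand-in so this sketch runs standalone; a real environment
    would subclass atroposlib.envs.base.BaseEnv instead."""

Item = dict  # stand-in item type

class CountingEnv(BaseEnv):
    async def setup(self) -> None:
        # Called once at start: load the "dataset".
        self.items = [{"prompt": f"Count to {n}"} for n in range(3)]
        self.idx = 0

    async def get_next_item(self) -> Optional[Item]:
        # Return None to pause trajectory collection.
        if self.idx >= len(self.items):
            return None
        item = self.items[self.idx]
        self.idx += 1
        return item

    async def collect_trajectory(self, item: Item) -> Tuple[Optional[dict], List[Item]]:
        # Single rollout: a real env would call the model here; this one
        # fabricates a trivially scored result.
        scored = {"tokens": [[1, 2, 3]], "masks": [[-100, 2, 3]], "scores": [1.0]}
        return scored, []  # empty backlog: no follow-up items spawned

    async def evaluate(self, *args, **kwargs) -> None:
        pass  # called every steps_per_eval steps

async def main() -> List[float]:
    env = CountingEnv()
    await env.setup()
    item = await env.get_next_item()
    scored, _backlog = await env.collect_trajectory(item)
    return scored["scores"]

scores = asyncio.run(main())
print(scores)
```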

## Overridable Methods

| Method | Default Behavior | Override When |
|---|---|---|
| `collect_trajectories()` | Runs `collect_trajectory` `group_size` times in parallel | Batch generation, MCTS, coupled rollouts |
| `wandb_log()` | Logs completion lengths, rollout table, perf stats | Adding custom metrics (always call `super()`) |
| `config_init()` | Returns `(env_config_cls(), ServerBaseline())` | Custom defaults and server configs |
| `postprocess_histories()` | Passthrough | Final processing before sending to the trainer |
| `save_checkpoint()` | Saves JSON to `checkpoint_dir` | Custom serialization |
| `cleanup()` | No-op | Releasing resources after each rollout |
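The default `collect_trajectories` behavior described above (run `collect_trajectory` `group_size` times in parallel) can be sketched with `asyncio.gather`. This is an illustration of the pattern, not atroposlib's actual code; the stand-in rollout function just fabricates a score.

```python
import asyncio
from typing import List, Optional, Tuple

GROUP_SIZE = 4  # mirrors BaseEnvConfig's group_size default

async def collect_trajectory(item: dict) -> Tuple[Optional[dict], List[dict]]:
    # Stand-in single rollout: score is derived from the item.
    return {"tokens": [[1]], "masks": [[-100]], "scores": [float(item["id"])]}, []

async def collect_trajectories(item: dict) -> Tuple[List[dict], List[dict]]:
    # Run the single-rollout method group_size times concurrently,
    # then separate scored results from backlog items.
    results = await asyncio.gather(
        *(collect_trajectory(item) for _ in range(GROUP_SIZE))
    )
    scored = [s for s, _ in results if s is not None]
    backlog = [i for _, items in results for i in items]
    return scored, backlog

scored, backlog = asyncio.run(collect_trajectories({"id": 2}))
print(len(scored), backlog)
```

Overriding this method is the hook for batched generation or coupled rollouts, where the group's responses must be produced in one call rather than `group_size` independent ones.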

## ScoredDataGroup Structure

`ScoredDataGroup` is a `TypedDict` with:

```python
tokens:             List[List[int]]       # Token IDs per rollout
masks:              List[List[int]]       # -100 = prompt, token_id = completion
scores:             List[float]           # Score per rollout
advantages:         Optional[...]         # Per-token advantages
ref_logprobs:       Optional[...]         # Reference model logprobs
messages:           Optional[...]         # OpenAI-format messages
inference_logprobs: Optional[...]         # Inference logprobs
```
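An illustrative construction of a `ScoredDataGroup`-shaped dict, showing the mask convention from the table: `-100` over prompt positions, the token ID itself over completion positions, so loss is computed only on the completion. The token IDs are fake and the `TypedDict` here lists only the required fields.

```python
from typing import List, TypedDict

class ScoredDataGroup(TypedDict, total=False):
    # Partial re-declaration for illustration; the real TypedDict lives
    # in atroposlib and includes the optional fields listed above.
    tokens: List[List[int]]
    masks: List[List[int]]
    scores: List[float]

prompt_ids = [101, 102, 103]   # pretend prompt token IDs
completion_ids = [201, 202]    # pretend completion token IDs

tokens = prompt_ids + completion_ids
# -100 masks out the prompt; completion positions carry their token IDs.
mask = [-100] * len(prompt_ids) + completion_ids

group: ScoredDataGroup = {
    "tokens": [tokens],
    "masks": [mask],
    "scores": [1.0],
}
print(group["masks"][0])
```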

## BaseEnvConfig Key Fields

| Field | Default | Description |
|---|---|---|
| `group_size` | `4` | Number of responses grouped together for scoring |
| `steps_per_eval` | `100` | Steps between evaluations |
| `max_token_length` | `2048` | Max token length for generations |
| `total_steps` | `1000` | Total training steps |
| `use_wandb` | `True` | Enable wandb logging |
| `tokenizer_name` | `DeepHermes-3` | Tokenizer used for token encoding |
| `ensure_scores_are_not_same` | `True` | Skip groups whose scores are all identical |
| `worker_timeout` | `600` | Task timeout in seconds |
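The fields above can be mirrored in a plain dataclass for illustration. This is a sketch only: the real `BaseEnvConfig` lives in atroposlib, has more fields, and may be implemented differently (e.g. as a pydantic model).

```python
from dataclasses import dataclass

@dataclass
class EnvConfigSketch:
    # Defaults copied from the table above; "DeepHermes-3" is the
    # abbreviated name as listed there.
    group_size: int = 4
    steps_per_eval: int = 100
    max_token_length: int = 2048
    total_steps: int = 1000
    use_wandb: bool = True
    tokenizer_name: str = "DeepHermes-3"
    ensure_scores_are_not_same: bool = True
    worker_timeout: int = 600  # seconds

# Typical usage: override just the fields you need.
cfg = EnvConfigSketch(group_size=8)
print(cfg.group_size, cfg.steps_per_eval)
```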

## Data Flow

```
env_manager() → add_train_workers() → handle_env()
    → collect_trajectories() → postprocess_histories()
    → handle_send_to_api() → training server
```

## Atropos Environment Statistics (82 environments analyzed)

- 95% implement `setup`, `collect_trajectories`, `evaluate`, and `get_next_item`
- 76% override `wandb_log`
- 54% define a custom config class
- Most use `collect_trajectories` (plural), not `collect_trajectory` (singular)
- Common reward patterns: LLM-judge (~40), regex-extract (~35), code-exec (~12)
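The regex-extract reward pattern noted above can be sketched as a scoring function that pulls a final answer out of the completion and compares it to a reference. The `\boxed{...}` answer format used here is just one common convention, not something Atropos mandates.

```python
import re

def regex_extract_reward(completion: str, reference: str) -> float:
    """Score 1.0 if a \\boxed{...} answer matches the reference, else 0.0."""
    match = re.search(r"\\boxed\{([^}]*)\}", completion)
    if match is None:
        return 0.0  # no extractable answer
    return 1.0 if match.group(1).strip() == reference else 0.0

print(regex_extract_reward(r"... so the answer is \boxed{42}.", "42"))
```

In a real environment, a function like this would be applied to each rollout's completion to fill the `scores` list of a `ScoredDataGroup`.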