Files
hermes-agent/wizard-bootstrap/FORGE_OPERATIONS_GUIDE.md

6.9 KiB

Forge Operations Guide

Audience: Forge wizards joining the hermes-agent project Purpose: Practical patterns, common pitfalls, and operational wisdom Companion to: WIZARD_ENVIRONMENT_CONTRACT.md


The One Rule

Read the actual state before acting.

Before touching any service, config, or codebase: ps aux | grep hermes, cat ~/.hermes/gateway_state.json, curl http://127.0.0.1:8642/health. The forge punishes assumptions harder than it rewards speed. Evidence always beats intuition.


First 15 Minutes on a New System

# 1. Validate your environment
python wizard-bootstrap/wizard_bootstrap.py

# 2. Check what is actually running
ps aux | grep -E 'hermes|python|gateway'

# 3. Check the data directory
ls -la ~/.hermes/
cat ~/.hermes/gateway_state.json 2>/dev/null | python3 -m json.tool

# 4. Verify health endpoints (if gateway is up)
curl -sf http://127.0.0.1:8642/health | python3 -m json.tool

# 5. Run the smoke test
source venv/bin/activate
python -m pytest tests/ -q -x --timeout=60 2>&1 | tail -20

Do not begin work until all five steps return clean output.


Import Chain — Know It, Respect It

The dependency order is load-bearing. Violating it causes silent failures:

tools/registry.py   ← no deps; imported by everything
       ↑
tools/*.py          ← each calls registry.register() at import time
       ↑
model_tools.py      ← imports registry; triggers tool discovery
       ↑
run_agent.py / cli.py / batch_runner.py

If you add a tool file, you must also:

  1. Add its import to model_tools.py _discover_tools()
  2. Add it to toolsets.py (core or a named toolset)

Missing either step causes the tool to silently not appear — no error, just absence.


The Five Profile Rules

Hermes supports isolated profiles (hermes -p myprofile). Profile-unsafe code has caused repeated bugs. Memorize these:

Do this Not this
get_hermes_home() Path.home() / ".hermes"
display_hermes_home() in user messages hardcoded ~/.hermes strings
get_hermes_home() / "sessions" in tests ~/.hermes/sessions in tests

Import both from hermes_constants. Every ~/.hermes hardcode is a latent profile bug.


Prompt Caching — Do Not Break It

The agent caches system prompts. Cache breaks force re-billing of the entire context window on every turn. The following actions break caching mid-conversation and are forbidden:

  • Altering past context
  • Changing the active toolset
  • Reloading memories or rebuilding the system prompt

The only sanctioned context alteration is the context compressor (agent/context_compressor.py). If your feature touches the message history, read that file first.


Adding a Slash Command (Checklist)

Four files, in order:

  1. hermes_cli/commands.py — add CommandDef to COMMAND_REGISTRY
  2. cli.py — add handler branch in HermesCLI.process_command()
  3. gateway/run.py — add handler if it should work in messaging platforms
  4. Aliases — add to the aliases tuple on the CommandDef; everything else updates automatically

All downstream consumers (Telegram menu, Slack routing, autocomplete, help text) derive from COMMAND_REGISTRY. You never touch them directly.


Tool Schema Pitfalls

Do NOT cross-reference other toolsets in schema descriptions. Writing "prefer web_search over this tool" in a browser tool's description will cause the model to hallucinate calls to web_search when it's not loaded. Cross-references belong in get_tool_definitions() post-processing blocks in model_tools.py.

Do NOT use \033[K (ANSI erase-to-EOL) in display code. Under prompt_toolkit's patch_stdout, it leaks as literal ?[K. Use space-padding instead: f"\r{line}{' ' * pad}".

Do NOT use simple_term_menu for interactive menus. It ghosts on scroll in tmux/iTerm2. Use curses (stdlib). See hermes_cli/tools_config.py for the pattern.


Health Check Anatomy

A healthy instance returns:

{
  "status": "ok",
  "gateway_state": "running",
  "platforms": {
    "telegram": {"state": "connected"}
  }
}
Field Healthy value What a bad value means
status "ok" HTTP server down
gateway_state "running" Still starting or crashed
platforms.<name>.state "connected" Auth failure or network issue

gateway_state: "starting" is normal for up to 60 s on boot. Beyond that, check logs for auth errors:

journalctl -u hermes-gateway --since "2 minutes ago" | grep -i "error\|token\|auth"

Gateway Won't Start — Diagnosis Order

  1. ss -tlnp | grep 8642 — port conflict?
  2. cat ~/.hermes/gateway.pidps -p <pid> — stale PID file?
  3. hermes gateway start --replace — clears stale locks and PIDs
  4. HERMES_LOG_LEVEL=DEBUG hermes gateway start — verbose output
  5. Check ~/.hermes/.env — missing or placeholder token?

Before Every PR

source venv/bin/activate
python -m pytest tests/ -q          # full suite: ~3 min, ~3000 tests
python scripts/deploy-validate       # deployment health check
python wizard-bootstrap/wizard_bootstrap.py  # environment sanity

All three must exit 0. Do not skip. "It works locally" is not sufficient evidence.


Session and State Files

Store Location Notes
Sessions ~/.hermes/sessions/*.json Persisted across restarts
Memories ~/.hermes/memories/*.md Written by the agent's memory tool
Cron jobs ~/.hermes/cron/*.json Scheduler state
Gateway state ~/.hermes/gateway_state.json Live platform connection status
Response store ~/.hermes/response_store.db SQLite WAL — API server only

All paths go through get_hermes_home(). Never hardcode. Always backup before a major update:

tar czf ~/backups/hermes_$(date +%F_%H%M).tar.gz ~/.hermes/

Writing Tests

python -m pytest tests/path/to/test.py -q    # single file
python -m pytest tests/ -q -k "test_name"    # by name
python -m pytest tests/ -q -x               # stop on first failure

Test isolation rules:

  • tests/conftest.py has an autouse fixture that redirects HERMES_HOME to a temp dir. Never write to ~/.hermes/ in tests.
  • Profile tests must mock both Path.home() and HERMES_HOME. See tests/hermes_cli/test_profiles.py for the pattern.
  • Do not mock the database. Integration tests should use real SQLite with a temp path.

Commit Conventions

feat: add X           # new capability
fix: correct Y        # bug fix
refactor: restructure Z  # no behaviour change
test: add tests for W    # test-only
chore: update deps       # housekeeping
docs: clarify X          # documentation only

Include Fixes #NNN or Refs #NNN in the commit message body to close or reference issues automatically.


This guide lives in wizard-bootstrap/. Update it when you discover a new pitfall or pattern worth preserving.