docs: add Forge Operations Guide for wizard onboarding

Captures practical patterns, pitfalls, and operational wisdom for forge wizards joining the hermes-agent project. Covers: - First-15-minutes system inspection checklist - Import chain order and tool registration requirements - Profile safety rules (get_hermes_home vs hardcoded paths) - Prompt caching constraints - Slash command addition checklist - Tool schema pitfalls (ANSI codes, cross-toolset references) - Health check anatomy and gateway diagnosis order - Pre-PR test gate (pytest + deploy-validate + bootstrap) - Test isolation and commit conventions Companion document to WIZARD_ENVIRONMENT_CONTRACT.md. Refs #142 Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-06 22:05:12 -04:00
1 changed files with 215 additions and 0 deletions
--- a/wizard-bootstrap/FORGE_OPERATIONS_GUIDE.md
+++ b/wizard-bootstrap/FORGE_OPERATIONS_GUIDE.md
@@ -0,0 +1,215 @@
+# Forge Operations Guide
+
+> **Audience:** Forge wizards joining the hermes-agent project
+> **Purpose:** Practical patterns, common pitfalls, and operational wisdom
+> **Companion to:** `WIZARD_ENVIRONMENT_CONTRACT.md`
+
+---
+
+## The One Rule
+
+**Read the actual state before acting.**
+
+Before touching any service, config, or codebase: `ps aux | grep hermes`, `cat ~/.hermes/gateway_state.json`, `curl http://127.0.0.1:8642/health`. The forge punishes assumptions harder than it rewards speed. Evidence always beats intuition.
+
+---
+
+## First 15 Minutes on a New System
+
+```bash
+# 1. Validate your environment
+python wizard-bootstrap/wizard_bootstrap.py
+
+# 2. Check what is actually running
+ps aux | grep -E 'hermes|python|gateway'
+
+# 3. Check the data directory
+ls -la ~/.hermes/
+cat ~/.hermes/gateway_state.json 2>/dev/null | python3 -m json.tool
+
+# 4. Verify health endpoints (if gateway is up)
+curl -sf http://127.0.0.1:8642/health | python3 -m json.tool
+
+# 5. Run the smoke test
+source venv/bin/activate
+python -m pytest tests/ -q -x --timeout=60 2>&1 | tail -20
+```
+
+Do not begin work until all five steps return clean output.
+
+---
+
+## Import Chain — Know It, Respect It
+
+The dependency order is load-bearing. Violating it causes silent failures:
+
+```
+tools/registry.py   ← no deps; imported by everything
+       ↑
+tools/*.py          ← each calls registry.register() at import time
+       ↑
+model_tools.py      ← imports registry; triggers tool discovery
+       ↑
+run_agent.py / cli.py / batch_runner.py
+```
+
+**If you add a tool file**, you must also:
+1. Add its import to `model_tools.py` `_discover_tools()`
+2. Add it to `toolsets.py` (core or a named toolset)
+
+Missing either step causes the tool to silently not appear — no error, just absence.
+
+---
+
+## The Five Profile Rules
+
+Hermes supports isolated profiles (`hermes -p myprofile`). Profile-unsafe code has caused repeated bugs. Memorize these:
+
+| Do this | Not this |
+|---------|----------|
+| `get_hermes_home()` | `Path.home() / ".hermes"` |
+| `display_hermes_home()` in user messages | hardcoded `~/.hermes` strings |
+| `get_hermes_home() / "sessions"` in tests | `~/.hermes/sessions` in tests |
+
+Import both from `hermes_constants`. Every `~/.hermes` hardcode is a latent profile bug.
+
+---
+
+## Prompt Caching — Do Not Break It
+
+The agent caches system prompts. Cache breaks force re-billing of the entire context window on every turn. The following actions break caching mid-conversation and are forbidden:
+
+- Altering past context
+- Changing the active toolset
+- Reloading memories or rebuilding the system prompt
+
+The only sanctioned context alteration is the context compressor (`agent/context_compressor.py`). If your feature touches the message history, read that file first.
+
+---
+
+## Adding a Slash Command (Checklist)
+
+Four files, in order:
+
+1. **`hermes_cli/commands.py`** — add `CommandDef` to `COMMAND_REGISTRY`
+2. **`cli.py`** — add handler branch in `HermesCLI.process_command()`
+3. **`gateway/run.py`** — add handler if it should work in messaging platforms
+4. **Aliases** — add to the `aliases` tuple on the `CommandDef`; everything else updates automatically
+
+All downstream consumers (Telegram menu, Slack routing, autocomplete, help text) derive from `COMMAND_REGISTRY`. You never touch them directly.
+
+---
+
+## Tool Schema Pitfalls
+
+**Do NOT cross-reference other toolsets in schema descriptions.**
+Writing "prefer `web_search` over this tool" in a browser tool's description will cause the model to hallucinate calls to `web_search` when it's not loaded. Cross-references belong in `get_tool_definitions()` post-processing blocks in `model_tools.py`.
+
+**Do NOT use `\033[K` (ANSI erase-to-EOL) in display code.**
+Under `prompt_toolkit`'s `patch_stdout`, it leaks as literal `?[K`. Use space-padding instead: `f"\r{line}{' ' * pad}"`.
+
+**Do NOT use `simple_term_menu` for interactive menus.**
+It ghosts on scroll in tmux/iTerm2. Use `curses` (stdlib). See `hermes_cli/tools_config.py` for the pattern.
+
+---
+
+## Health Check Anatomy
+
+A healthy instance returns:
+
+```json
+{
+  "status": "ok",
+  "gateway_state": "running",
+  "platforms": {
+    "telegram": {"state": "connected"}
+  }
+}
+```
+
+| Field | Healthy value | What a bad value means |
+|-------|--------------|----------------------|
+| `status` | `"ok"` | HTTP server down |
+| `gateway_state` | `"running"` | Still starting or crashed |
+| `platforms.<name>.state` | `"connected"` | Auth failure or network issue |
+
+`gateway_state: "starting"` is normal for up to 60 s on boot. Beyond that, check logs for auth errors:
+
+```bash
+journalctl -u hermes-gateway --since "2 minutes ago" | grep -i "error\|token\|auth"
+```
+
+---
+
+## Gateway Won't Start — Diagnosis Order
+
+1. `ss -tlnp | grep 8642` — port conflict?
+2. `cat ~/.hermes/gateway.pid` → `ps -p <pid>` — stale PID file?
+3. `hermes gateway start --replace` — clears stale locks and PIDs
+4. `HERMES_LOG_LEVEL=DEBUG hermes gateway start` — verbose output
+5. Check `~/.hermes/.env` — missing or placeholder token?
+
+---
+
+## Before Every PR
+
+```bash
+source venv/bin/activate
+python -m pytest tests/ -q          # full suite: ~3 min, ~3000 tests
+python scripts/deploy-validate       # deployment health check
+python wizard-bootstrap/wizard_bootstrap.py  # environment sanity
+```
+
+All three must exit 0. Do not skip. "It works locally" is not sufficient evidence.
+
+---
+
+## Session and State Files
+
+| Store | Location | Notes |
+|-------|----------|-------|
+| Sessions | `~/.hermes/sessions/*.json` | Persisted across restarts |
+| Memories | `~/.hermes/memories/*.md` | Written by the agent's memory tool |
+| Cron jobs | `~/.hermes/cron/*.json` | Scheduler state |
+| Gateway state | `~/.hermes/gateway_state.json` | Live platform connection status |
+| Response store | `~/.hermes/response_store.db` | SQLite WAL — API server only |
+
+All paths go through `get_hermes_home()`. Never hardcode. Always backup before a major update:
+
+```bash
+tar czf ~/backups/hermes_$(date +%F_%H%M).tar.gz ~/.hermes/
+```
+
+---
+
+## Writing Tests
+
+```bash
+python -m pytest tests/path/to/test.py -q    # single file
+python -m pytest tests/ -q -k "test_name"    # by name
+python -m pytest tests/ -q -x               # stop on first failure
+```
+
+**Test isolation rules:**
+- `tests/conftest.py` has an autouse fixture that redirects `HERMES_HOME` to a temp dir. Never write to `~/.hermes/` in tests.
+- Profile tests must mock both `Path.home()` and `HERMES_HOME`. See `tests/hermes_cli/test_profiles.py` for the pattern.
+- Do not mock the database. Integration tests should use real SQLite with a temp path.
+
+---
+
+## Commit Conventions
+
+```
+feat: add X           # new capability
+fix: correct Y        # bug fix
+refactor: restructure Z  # no behaviour change
+test: add tests for W    # test-only
+chore: update deps       # housekeeping
+docs: clarify X          # documentation only
+```
+
+Include `Fixes #NNN` or `Refs #NNN` in the commit message body to close or reference issues automatically.
+
+---
+
+*This guide lives in `wizard-bootstrap/`. Update it when you discover a new pitfall or pattern worth preserving.*