Compare commits

...

1 Commits

Author SHA1 Message Date
Alexander Whitestone
17048c7dff docs: add Forge Operations Guide for wizard onboarding
Some checks failed
Docker Build and Publish / build-and-push (pull_request) Failing after 18s
Secret Scan / Scan for secrets (pull_request) Failing after 2s
Supply Chain Audit / Scan PR for supply chain risks (pull_request) Failing after 1s
Tests / test (pull_request) Failing after 3s
Captures practical patterns, pitfalls, and operational wisdom for
forge wizards joining the hermes-agent project. Covers:

- First-15-minutes system inspection checklist
- Import chain order and tool registration requirements
- Profile safety rules (get_hermes_home vs hardcoded paths)
- Prompt caching constraints
- Slash command addition checklist
- Tool schema pitfalls (ANSI codes, cross-toolset references)
- Health check anatomy and gateway diagnosis order
- Pre-PR test gate (pytest + deploy-validate + bootstrap)
- Test isolation and commit conventions

Companion document to WIZARD_ENVIRONMENT_CONTRACT.md.

Refs #142

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-06 22:05:12 -04:00

View File

@@ -0,0 +1,215 @@
# Forge Operations Guide
> **Audience:** Forge wizards joining the hermes-agent project
> **Purpose:** Practical patterns, common pitfalls, and operational wisdom
> **Companion to:** `WIZARD_ENVIRONMENT_CONTRACT.md`
---
## The One Rule
**Read the actual state before acting.**
Before touching any service, config, or codebase: `ps aux | grep hermes`, `cat ~/.hermes/gateway_state.json`, `curl http://127.0.0.1:8642/health`. The forge punishes assumptions harder than it rewards speed. Evidence always beats intuition.
---
## First 15 Minutes on a New System
```bash
# 1. Validate your environment
python wizard-bootstrap/wizard_bootstrap.py
# 2. Check what is actually running
ps aux | grep -E 'hermes|python|gateway'
# 3. Check the data directory
ls -la ~/.hermes/
cat ~/.hermes/gateway_state.json 2>/dev/null | python3 -m json.tool
# 4. Verify health endpoints (if gateway is up)
curl -sf http://127.0.0.1:8642/health | python3 -m json.tool
# 5. Run the smoke test
source venv/bin/activate
python -m pytest tests/ -q -x --timeout=60 2>&1 | tail -20
```
Do not begin work until all five steps return clean output.
---
## Import Chain — Know It, Respect It
The dependency order is load-bearing. Violating it causes silent failures:
```
tools/registry.py ← no deps; imported by everything
tools/*.py ← each calls registry.register() at import time
model_tools.py ← imports registry; triggers tool discovery
run_agent.py / cli.py / batch_runner.py
```
**If you add a tool file**, you must also:
1. Add its import to `model_tools.py` `_discover_tools()`
2. Add it to `toolsets.py` (core or a named toolset)
Missing either step causes the tool to silently not appear — no error, just absence.
---
## The Five Profile Rules
Hermes supports isolated profiles (`hermes -p myprofile`). Profile-unsafe code has caused repeated bugs. Memorize these:
| Do this | Not this |
|---------|----------|
| `get_hermes_home()` | `Path.home() / ".hermes"` |
| `display_hermes_home()` in user messages | hardcoded `~/.hermes` strings |
| `get_hermes_home() / "sessions"` in tests | `~/.hermes/sessions` in tests |
Import both from `hermes_constants`. Every `~/.hermes` hardcode is a latent profile bug.
---
## Prompt Caching — Do Not Break It
The agent caches system prompts. Cache breaks force re-billing of the entire context window on every turn. The following actions break caching mid-conversation and are forbidden:
- Altering past context
- Changing the active toolset
- Reloading memories or rebuilding the system prompt
The only sanctioned context alteration is the context compressor (`agent/context_compressor.py`). If your feature touches the message history, read that file first.
---
## Adding a Slash Command (Checklist)
Four files, in order:
1. **`hermes_cli/commands.py`** — add `CommandDef` to `COMMAND_REGISTRY`
2. **`cli.py`** — add handler branch in `HermesCLI.process_command()`
3. **`gateway/run.py`** — add handler if it should work in messaging platforms
4. **Aliases** — add to the `aliases` tuple on the `CommandDef`; everything else updates automatically
All downstream consumers (Telegram menu, Slack routing, autocomplete, help text) derive from `COMMAND_REGISTRY`. You never touch them directly.
---
## Tool Schema Pitfalls
**Do NOT cross-reference other toolsets in schema descriptions.**
Writing "prefer `web_search` over this tool" in a browser tool's description will cause the model to hallucinate calls to `web_search` when it's not loaded. Cross-references belong in `get_tool_definitions()` post-processing blocks in `model_tools.py`.
**Do NOT use `\033[K` (ANSI erase-to-EOL) in display code.**
Under `prompt_toolkit`'s `patch_stdout`, it leaks as literal `?[K`. Use space-padding instead: `f"\r{line}{' ' * pad}"`.
**Do NOT use `simple_term_menu` for interactive menus.**
It ghosts on scroll in tmux/iTerm2. Use `curses` (stdlib). See `hermes_cli/tools_config.py` for the pattern.
---
## Health Check Anatomy
A healthy instance returns:
```json
{
"status": "ok",
"gateway_state": "running",
"platforms": {
"telegram": {"state": "connected"}
}
}
```
| Field | Healthy value | What a bad value means |
|-------|--------------|----------------------|
| `status` | `"ok"` | HTTP server down |
| `gateway_state` | `"running"` | Still starting or crashed |
| `platforms.<name>.state` | `"connected"` | Auth failure or network issue |
`gateway_state: "starting"` is normal for up to 60 s on boot. Beyond that, check logs for auth errors:
```bash
journalctl -u hermes-gateway --since "2 minutes ago" | grep -i "error\|token\|auth"
```
---
## Gateway Won't Start — Diagnosis Order
1. `ss -tlnp | grep 8642` — port conflict?
2. `cat ~/.hermes/gateway.pid``ps -p <pid>` — stale PID file?
3. `hermes gateway start --replace` — clears stale locks and PIDs
4. `HERMES_LOG_LEVEL=DEBUG hermes gateway start` — verbose output
5. Check `~/.hermes/.env` — missing or placeholder token?
---
## Before Every PR
```bash
source venv/bin/activate
python -m pytest tests/ -q # full suite: ~3 min, ~3000 tests
python scripts/deploy-validate # deployment health check
python wizard-bootstrap/wizard_bootstrap.py # environment sanity
```
All three must exit 0. Do not skip. "It works locally" is not sufficient evidence.
---
## Session and State Files
| Store | Location | Notes |
|-------|----------|-------|
| Sessions | `~/.hermes/sessions/*.json` | Persisted across restarts |
| Memories | `~/.hermes/memories/*.md` | Written by the agent's memory tool |
| Cron jobs | `~/.hermes/cron/*.json` | Scheduler state |
| Gateway state | `~/.hermes/gateway_state.json` | Live platform connection status |
| Response store | `~/.hermes/response_store.db` | SQLite WAL — API server only |
All paths go through `get_hermes_home()`. Never hardcode. Always backup before a major update:
```bash
tar czf ~/backups/hermes_$(date +%F_%H%M).tar.gz ~/.hermes/
```
---
## Writing Tests
```bash
python -m pytest tests/path/to/test.py -q # single file
python -m pytest tests/ -q -k "test_name" # by name
python -m pytest tests/ -q -x # stop on first failure
```
**Test isolation rules:**
- `tests/conftest.py` has an autouse fixture that redirects `HERMES_HOME` to a temp dir. Never write to `~/.hermes/` in tests.
- Profile tests must mock both `Path.home()` and `HERMES_HOME`. See `tests/hermes_cli/test_profiles.py` for the pattern.
- Do not mock the database. Integration tests should use real SQLite with a temp path.
---
## Commit Conventions
```
feat: add X # new capability
fix: correct Y # bug fix
refactor: restructure Z # no behaviour change
test: add tests for W # test-only
chore: update deps # housekeeping
docs: clarify X # documentation only
```
Include `Fixes #NNN` or `Refs #NNN` in the commit message body to close or reference issues automatically.
---
*This guide lives in `wizard-bootstrap/`. Update it when you discover a new pitfall or pattern worth preserving.*