216 lines
6.9 KiB
Markdown
216 lines
6.9 KiB
Markdown
|
|
# Forge Operations Guide
|
||
|
|
|
||
|
|
> **Audience:** Forge wizards joining the hermes-agent project
|
||
|
|
> **Purpose:** Practical patterns, common pitfalls, and operational wisdom
|
||
|
|
> **Companion to:** `WIZARD_ENVIRONMENT_CONTRACT.md`
|
||
|
|
|
||
|
|
---
|
||
|
|
|
||
|
|
## The One Rule
|
||
|
|
|
||
|
|
**Read the actual state before acting.**
|
||
|
|
|
||
|
|
Before touching any service, config, or codebase: `ps aux | grep hermes`, `cat ~/.hermes/gateway_state.json`, `curl http://127.0.0.1:8642/health`. The forge punishes assumptions harder than it rewards speed. Evidence always beats intuition.
|
||
|
|
|
||
|
|
---
|
||
|
|
|
||
|
|
## First 15 Minutes on a New System
|
||
|
|
|
||
|
|
```bash
|
||
|
|
# 1. Validate your environment
|
||
|
|
python wizard-bootstrap/wizard_bootstrap.py
|
||
|
|
|
||
|
|
# 2. Check what is actually running
|
||
|
|
ps aux | grep -E 'hermes|python|gateway'
|
||
|
|
|
||
|
|
# 3. Check the data directory
|
||
|
|
ls -la ~/.hermes/
|
||
|
|
cat ~/.hermes/gateway_state.json 2>/dev/null | python3 -m json.tool
|
||
|
|
|
||
|
|
# 4. Verify health endpoints (if gateway is up)
|
||
|
|
curl -sf http://127.0.0.1:8642/health | python3 -m json.tool
|
||
|
|
|
||
|
|
# 5. Run the smoke test
|
||
|
|
source venv/bin/activate
|
||
|
|
python -m pytest tests/ -q -x --timeout=60 2>&1 | tail -20
|
||
|
|
```
|
||
|
|
|
||
|
|
Do not begin work until all five steps return clean output.
|
||
|
|
|
||
|
|
---
|
||
|
|
|
||
|
|
## Import Chain — Know It, Respect It
|
||
|
|
|
||
|
|
The dependency order is load-bearing. Violating it causes silent failures:
|
||
|
|
|
||
|
|
```
|
||
|
|
tools/registry.py ← no deps; imported by everything
|
||
|
|
↑
|
||
|
|
tools/*.py ← each calls registry.register() at import time
|
||
|
|
↑
|
||
|
|
model_tools.py ← imports registry; triggers tool discovery
|
||
|
|
↑
|
||
|
|
run_agent.py / cli.py / batch_runner.py
|
||
|
|
```
|
||
|
|
|
||
|
|
**If you add a tool file**, you must also:
|
||
|
|
1. Add its import to `model_tools.py` `_discover_tools()`
|
||
|
|
2. Add it to `toolsets.py` (core or a named toolset)
|
||
|
|
|
||
|
|
Missing either step causes the tool to silently not appear — no error, just absence.
|
||
|
|
|
||
|
|
---
|
||
|
|
|
||
|
|
## The Five Profile Rules
|
||
|
|
|
||
|
|
Hermes supports isolated profiles (`hermes -p myprofile`). Profile-unsafe code has caused repeated bugs. Memorize these:
|
||
|
|
|
||
|
|
| Do this | Not this |
|
||
|
|
|---------|----------|
|
||
|
|
| `get_hermes_home()` | `Path.home() / ".hermes"` |
|
||
|
|
| `display_hermes_home()` in user messages | hardcoded `~/.hermes` strings |
|
||
|
|
| `get_hermes_home() / "sessions"` in tests | `~/.hermes/sessions` in tests |
|
||
|
|
|
||
|
|
Import both from `hermes_constants`. Every `~/.hermes` hardcode is a latent profile bug.
|
||
|
|
|
||
|
|
---
|
||
|
|
|
||
|
|
## Prompt Caching — Do Not Break It
|
||
|
|
|
||
|
|
The agent caches system prompts. Cache breaks force re-billing of the entire context window on every turn. The following actions break caching mid-conversation and are forbidden:
|
||
|
|
|
||
|
|
- Altering past context
|
||
|
|
- Changing the active toolset
|
||
|
|
- Reloading memories or rebuilding the system prompt
|
||
|
|
|
||
|
|
The only sanctioned context alteration is the context compressor (`agent/context_compressor.py`). If your feature touches the message history, read that file first.
|
||
|
|
|
||
|
|
---
|
||
|
|
|
||
|
|
## Adding a Slash Command (Checklist)
|
||
|
|
|
||
|
|
Four files, in order:
|
||
|
|
|
||
|
|
1. **`hermes_cli/commands.py`** — add `CommandDef` to `COMMAND_REGISTRY`
|
||
|
|
2. **`cli.py`** — add handler branch in `HermesCLI.process_command()`
|
||
|
|
3. **`gateway/run.py`** — add handler if it should work in messaging platforms
|
||
|
|
4. **Aliases** — add to the `aliases` tuple on the `CommandDef`; everything else updates automatically
|
||
|
|
|
||
|
|
All downstream consumers (Telegram menu, Slack routing, autocomplete, help text) derive from `COMMAND_REGISTRY`. You never touch them directly.
|
||
|
|
|
||
|
|
---
|
||
|
|
|
||
|
|
## Tool Schema Pitfalls
|
||
|
|
|
||
|
|
**Do NOT cross-reference other toolsets in schema descriptions.**
|
||
|
|
Writing "prefer `web_search` over this tool" in a browser tool's description will cause the model to hallucinate calls to `web_search` when it's not loaded. Cross-references belong in `get_tool_definitions()` post-processing blocks in `model_tools.py`.
|
||
|
|
|
||
|
|
**Do NOT use `\033[K` (ANSI erase-to-EOL) in display code.**
|
||
|
|
Under `prompt_toolkit`'s `patch_stdout`, it leaks as literal `?[K`. Use space-padding instead: `f"\r{line}{' ' * pad}"`.
|
||
|
|
|
||
|
|
**Do NOT use `simple_term_menu` for interactive menus.**
|
||
|
|
It ghosts on scroll in tmux/iTerm2. Use `curses` (stdlib). See `hermes_cli/tools_config.py` for the pattern.
|
||
|
|
|
||
|
|
---
|
||
|
|
|
||
|
|
## Health Check Anatomy
|
||
|
|
|
||
|
|
A healthy instance returns:
|
||
|
|
|
||
|
|
```json
|
||
|
|
{
|
||
|
|
"status": "ok",
|
||
|
|
"gateway_state": "running",
|
||
|
|
"platforms": {
|
||
|
|
"telegram": {"state": "connected"}
|
||
|
|
}
|
||
|
|
}
|
||
|
|
```
|
||
|
|
|
||
|
|
| Field | Healthy value | What a bad value means |
|
||
|
|
|-------|--------------|----------------------|
|
||
|
|
| `status` | `"ok"` | HTTP server down |
|
||
|
|
| `gateway_state` | `"running"` | Still starting or crashed |
|
||
|
|
| `platforms.<name>.state` | `"connected"` | Auth failure or network issue |
|
||
|
|
|
||
|
|
`gateway_state: "starting"` is normal for up to 60 s on boot. Beyond that, check logs for auth errors:
|
||
|
|
|
||
|
|
```bash
|
||
|
|
journalctl -u hermes-gateway --since "2 minutes ago" | grep -i "error\|token\|auth"
|
||
|
|
```
|
||
|
|
|
||
|
|
---
|
||
|
|
|
||
|
|
## Gateway Won't Start — Diagnosis Order
|
||
|
|
|
||
|
|
1. `ss -tlnp | grep 8642` — port conflict?
|
||
|
|
2. `cat ~/.hermes/gateway.pid` → `ps -p <pid>` — stale PID file?
|
||
|
|
3. `hermes gateway start --replace` — clears stale locks and PIDs
|
||
|
|
4. `HERMES_LOG_LEVEL=DEBUG hermes gateway start` — verbose output
|
||
|
|
5. Check `~/.hermes/.env` — missing or placeholder token?
|
||
|
|
|
||
|
|
---
|
||
|
|
|
||
|
|
## Before Every PR
|
||
|
|
|
||
|
|
```bash
|
||
|
|
source venv/bin/activate
|
||
|
|
python -m pytest tests/ -q # full suite: ~3 min, ~3000 tests
|
||
|
|
python scripts/deploy-validate # deployment health check
|
||
|
|
python wizard-bootstrap/wizard_bootstrap.py # environment sanity
|
||
|
|
```
|
||
|
|
|
||
|
|
All three must exit 0. Do not skip. "It works locally" is not sufficient evidence.
|
||
|
|
|
||
|
|
---
|
||
|
|
|
||
|
|
## Session and State Files
|
||
|
|
|
||
|
|
| Store | Location | Notes |
|
||
|
|
|-------|----------|-------|
|
||
|
|
| Sessions | `~/.hermes/sessions/*.json` | Persisted across restarts |
|
||
|
|
| Memories | `~/.hermes/memories/*.md` | Written by the agent's memory tool |
|
||
|
|
| Cron jobs | `~/.hermes/cron/*.json` | Scheduler state |
|
||
|
|
| Gateway state | `~/.hermes/gateway_state.json` | Live platform connection status |
|
||
|
|
| Response store | `~/.hermes/response_store.db` | SQLite WAL — API server only |
|
||
|
|
|
||
|
|
All paths go through `get_hermes_home()`. Never hardcode. Always backup before a major update:
|
||
|
|
|
||
|
|
```bash
|
||
|
|
tar czf ~/backups/hermes_$(date +%F_%H%M).tar.gz ~/.hermes/
|
||
|
|
```
|
||
|
|
|
||
|
|
---
|
||
|
|
|
||
|
|
## Writing Tests
|
||
|
|
|
||
|
|
```bash
|
||
|
|
python -m pytest tests/path/to/test.py -q # single file
|
||
|
|
python -m pytest tests/ -q -k "test_name" # by name
|
||
|
|
python -m pytest tests/ -q -x # stop on first failure
|
||
|
|
```
|
||
|
|
|
||
|
|
**Test isolation rules:**
|
||
|
|
- `tests/conftest.py` has an autouse fixture that redirects `HERMES_HOME` to a temp dir. Never write to `~/.hermes/` in tests.
|
||
|
|
- Profile tests must mock both `Path.home()` and `HERMES_HOME`. See `tests/hermes_cli/test_profiles.py` for the pattern.
|
||
|
|
- Do not mock the database. Integration tests should use real SQLite with a temp path.
|
||
|
|
|
||
|
|
---
|
||
|
|
|
||
|
|
## Commit Conventions
|
||
|
|
|
||
|
|
```
|
||
|
|
feat: add X # new capability
|
||
|
|
fix: correct Y # bug fix
|
||
|
|
refactor: restructure Z # no behaviour change
|
||
|
|
test: add tests for W # test-only
|
||
|
|
chore: update deps # housekeeping
|
||
|
|
docs: clarify X # documentation only
|
||
|
|
```
|
||
|
|
|
||
|
|
Include `Fixes #NNN` or `Refs #NNN` in the commit message body to close or reference issues automatically.
|
||
|
|
|
||
|
|
---
|
||
|
|
|
||
|
|
*This guide lives in `wizard-bootstrap/`. Update it when you discover a new pitfall or pattern worth preserving.*
|