# Hermes Agent: Sovereign Deployment Runbook

> **Goal**: A new VPS can go from bare OS to a running Hermes instance in under 30 minutes using only this document.

---

## Table of Contents

1. [Prerequisites](#1-prerequisites)
2. [Environment Setup](#2-environment-setup)
3. [Secret Injection](#3-secret-injection)
4. [Installation](#4-installation)
5. [Starting the Stack](#5-starting-the-stack)
6. [Health Checks](#6-health-checks)
7. [Stop / Restart Procedures](#7-stop--restart-procedures)
8. [Zero-Downtime Restart](#8-zero-downtime-restart)
9. [Rollback Procedure](#9-rollback-procedure)
10. [Database / State Migrations](#10-database--state-migrations)
11. [Docker Compose Deployment](#11-docker-compose-deployment)
12. [systemd Deployment](#12-systemd-deployment)
13. [Monitoring & Logs](#13-monitoring--logs)
14. [Security Checklist](#14-security-checklist)
15. [Troubleshooting](#15-troubleshooting)

---

## 1. Prerequisites

| Requirement | Minimum | Recommended |
|-------------|---------|-------------|
| OS | Ubuntu 22.04 LTS | Ubuntu 24.04 LTS |
| RAM | 512 MB | 2 GB |
| CPU | 1 vCPU | 2 vCPU |
| Disk | 5 GB | 20 GB |
| Python | 3.11 | 3.12 |
| Node.js | 18 | 20 |
| Git | any | any |

**Optional but recommended:**

- Docker Engine ≥ 24 + Compose plugin (for containerised deployment)
- `curl`, `jq` (for health-check scripting)

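
The minimums in the table can be checked mechanically before installing anything. The sketch below is one hypothetical way to do it: the function names (`version_ge`, `check_min`, `check_version`, `preflight`) are illustrative and not part of Hermes, and it assumes GNU `sort -V` plus `python3` and `node` on the `PATH`. RAM is left out because `MemTotal` usually reads a little below a plan's nominal size. Run `preflight` on the target host; a non-zero exit status means at least one minimum was not met.

```bash
# version_ge A B: succeeds when version A >= version B (relies on `sort -V`).
version_ge() {
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# check_min LABEL ACTUAL MINIMUM: integer comparison with a one-line report.
check_min() {
  if [ "$2" -ge "$3" ]; then
    echo "ok    $1: $2 (minimum $3)"
  else
    echo "FAIL  $1: $2 (minimum $3)"
    return 1
  fi
}

# check_version LABEL ACTUAL MINIMUM: version comparison with a one-line report.
check_version() {
  if version_ge "$2" "$3"; then
    echo "ok    $1: $2 (minimum $3)"
  else
    echo "FAIL  $1: '${2:-not found}' (minimum $3)"
    return 1
  fi
}

# preflight: check free disk, Python, and Node.js against the table's minimums.
preflight() {
  local rc=0 disk_gb py_ver node_ver
  # Column 4 of `df -Pk` is the available space in KiB.
  disk_gb=$(df -Pk "$HOME" | awk 'NR == 2 {print int($4 / 1048576)}')
  py_ver=$(python3 -c 'import sys; print("%d.%d" % sys.version_info[:2])' 2>/dev/null)
  node_ver=$(node --version 2>/dev/null | sed 's/^v//')
  check_min "free disk (GB)" "$disk_gb" 5 || rc=1
  check_version "Python" "$py_ver" 3.11 || rc=1
  check_version "Node.js" "$node_ver" 18 || rc=1
  return "$rc"
}
```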
---

## 2. Environment Setup

### 2a. Create a dedicated system user (bare-metal deployments)

```bash
sudo useradd -m -s /bin/bash hermes
sudo su - hermes
```

### 2b. Install Hermes

```bash
# Official one-liner installer
curl -fsSL https://raw.githubusercontent.com/NousResearch/hermes-agent/main/scripts/install.sh | bash

# Reload PATH so `hermes` is available
source ~/.bashrc
```

The installer places:

- The agent code at `~/.local/lib/python3.x/site-packages/` (pip editable install)
- The `hermes` entry point at `~/.local/bin/hermes`
- Default config directory at `~/.hermes/`

### 2c. Verify installation

```bash
hermes --version
hermes doctor
```

---

## 3. Secret Injection

**Rule: secrets never live in the repository. They live only in `~/.hermes/.env`.**

```bash
# Copy the template (do NOT edit the repo copy)
cp /path/to/hermes-agent/.env.example ~/.hermes/.env
chmod 600 ~/.hermes/.env

# Edit with your preferred editor
nano ~/.hermes/.env
```

### Minimum required keys

| Variable | Purpose | Where to get it |
|----------|---------|----------------|
| `OPENROUTER_API_KEY` | LLM inference | https://openrouter.ai/keys |
| `TELEGRAM_BOT_TOKEN` | Telegram gateway | @BotFather on Telegram |

### Optional but common keys

| Variable | Purpose |
|----------|---------|
| `DISCORD_BOT_TOKEN` | Discord gateway |
| `SLACK_BOT_TOKEN` + `SLACK_APP_TOKEN` | Slack gateway |
| `EXA_API_KEY` | Web search tool |
| `FAL_KEY` | Image generation |
| `ANTHROPIC_API_KEY` | Direct Anthropic inference |

### Pre-flight validation

Before starting the stack, run:

```bash
python scripts/deploy-validate --check-ports --skip-health
```

This catches missing keys, placeholder values, and misconfigurations without touching running services.

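
To make "missing keys" and "placeholder values" concrete, here is a minimal sketch of that kind of check. It is not the implementation of `scripts/deploy-validate`: the function name `check_env_file` and the placeholder patterns (`your...`, `changeme`, `<...>`) are assumptions chosen for illustration. It also checks the `600` mode set above. Call it as `check_env_file ~/.hermes/.env OPENROUTER_API_KEY TELEGRAM_BOT_TOKEN`.

```bash
# check_env_file FILE KEY...: fail if FILE is not mode 600, or if any KEY is
# missing, empty, or still looks like a template placeholder.
check_env_file() {
  local file=$1 rc=0 key value
  shift
  if [ ! -f "$file" ]; then
    echo "FAIL  $file does not exist"
    return 1
  fi
  # `find -perm 600` matches only when the mode is exactly 600.
  if [ -z "$(find "$file" -prune -perm 600)" ]; then
    echo "FAIL  $file is not mode 600"
    rc=1
  fi
  for key in "$@"; do
    # Take the last assignment of KEY and strip optional surrounding quotes.
    value=$(sed -n "s/^${key}=//p" "$file" | tail -n 1 | sed "s/^[\"']//; s/[\"']\$//")
    case "$value" in
      "")
        echo "FAIL  $key is missing or empty"; rc=1 ;;
      your*|YOUR*|*changeme*|*CHANGEME*|"<"*">")
        # Assumed placeholder shapes; adjust to match your .env.example.
        echo "FAIL  $key looks like a placeholder"; rc=1 ;;
      *)
        echo "ok    $key is set" ;;
    esac
  done
  return "$rc"
}
```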
---

## 4. Installation

### 4a. Clone the repository (if not using the installer)

```bash
git clone https://forge.alexanderwhitestone.com/Timmy_Foundation/hermes-agent.git
cd hermes-agent
pip install -e ".[all]" --user
npm install
```

### 4b. Run the setup wizard

```bash
hermes setup
```

The wizard configures your LLM provider, messaging platforms, and data directory interactively.

---

## 5. Starting the Stack

### Bare-metal (foreground, useful for first run)

```bash
# Agent + gateway combined
hermes gateway start

# Or just the CLI agent (no messaging)
hermes
```

### Bare-metal (background daemon)

```bash
hermes gateway start &
echo $! > ~/.hermes/gateway.pid
```

### Via systemd (recommended for production)

See [Section 12](#12-systemd-deployment).

### Via Docker Compose

See [Section 11](#11-docker-compose-deployment).

---

## 6. Health Checks

### 6a. API server liveness probe

The API server (enabled via the `api_server` platform in the gateway config) exposes `/health`:

```bash
curl -s http://127.0.0.1:8642/health | jq .
```

Expected response:

```json
{
  "status": "ok",
  "platform": "hermes-agent",
  "version": "0.5.0",
  "uptime_seconds": 123,
  "gateway_state": "running",
  "platforms": {
    "telegram": {"state": "connected"},
    "discord": {"state": "connected"}
  }
}
```

| Field | Meaning |
|-------|---------|
| `status` | `"ok"`: HTTP server is alive. Any non-200 response means the server is down. |
| `gateway_state` | `"running"`: all platforms started. `"starting"`: still initialising. |
| `platforms` | Per-adapter connection state. |

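
Because `gateway_state` can report `"starting"` for a while after launch, deploy scripts usually want to block until it reads `"running"`. The following is a minimal polling sketch under that assumption; `fetch_state` and `wait_for_running` are hypothetical helper names, and it needs `curl` and `jq` from the prerequisites.

```bash
# fetch_state URL: print the gateway_state field of the /health response.
fetch_state() {
  curl -sf --max-time 5 "$1" | jq -r '.gateway_state // empty'
}

# wait_for_running URL [TIMEOUT_SECONDS] [INTERVAL_SECONDS]
# Polls until gateway_state is "running"; fails once TIMEOUT has elapsed.
wait_for_running() {
  local url=$1 timeout=${2:-60} interval=${3:-2} waited=0 state
  while :; do
    state=$(fetch_state "$url" 2>/dev/null) || state=""
    if [ "$state" = "running" ]; then
      return 0
    fi
    if [ "$waited" -ge "$timeout" ]; then
      echo "gateway_state is '${state:-unreachable}' after ${timeout}s" >&2
      return 1
    fi
    sleep "$interval"
    waited=$((waited + interval))
  done
}
```

For example, `wait_for_running http://127.0.0.1:8642/health 60 && echo "gateway is up"` gates the next deployment step on a healthy gateway. Elapsed time is approximated by summing the sleep intervals, which is accurate enough for a deploy gate.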
### 6b. Gateway runtime status file

```bash
jq '{state: .gateway_state, platforms: .platforms}' ~/.hermes/gateway_state.json
```

### 6c. Deploy-validate script

```bash
python scripts/deploy-validate
```

Runs all checks and prints a pass/fail summary. Exit code 0 = healthy.

### 6d. systemd health

```bash
systemctl status hermes-gateway
journalctl -u hermes-gateway --since "5 minutes ago"
```

---

## 7. Stop / Restart Procedures

### Graceful stop

```bash
# systemd
sudo systemctl stop hermes-gateway

# Docker Compose
docker compose -f deploy/docker-compose.yml down

# Process signal (if running ad-hoc)
kill -TERM $(cat ~/.hermes/gateway.pid)
```
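
The ad-hoc `kill -TERM` returns immediately, before the gateway has actually exited. When a script needs to know the process is gone (for example before starting a replacement), something like the following sketch can help. `stop_by_pidfile` is a hypothetical helper, not a Hermes command, and the 30-second default is an assumption. It deliberately does not escalate to `SIGKILL`; that decision is left to the operator.

```bash
# stop_by_pidfile PIDFILE [TIMEOUT_SECONDS]
# Sends SIGTERM to the PID in PIDFILE and waits for the process to exit.
# Assumes it runs as the user that owns the process.
stop_by_pidfile() {
  local pidfile=$1 timeout=${2:-30} waited=0 pid
  if [ ! -r "$pidfile" ]; then
    echo "no readable PID file at $pidfile" >&2
    return 1
  fi
  pid=$(cat "$pidfile")
  case "$pid" in
    ''|*[!0-9]*) echo "$pidfile does not contain a PID" >&2; return 1 ;;
  esac
  if ! kill -0 "$pid" 2>/dev/null; then
    echo "process $pid is not running; removing stale PID file"
    rm -f "$pidfile"
    return 0
  fi
  kill -TERM "$pid"
  # kill -0 only tests whether the process still exists.
  while kill -0 "$pid" 2>/dev/null; do
    if [ "$waited" -ge "$timeout" ]; then
      echo "process $pid still alive after ${timeout}s" >&2
      return 1
    fi
    sleep 1
    waited=$((waited + 1))
  done
  rm -f "$pidfile"
}
```

Usage: `stop_by_pidfile ~/.hermes/gateway.pid 30`.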
### Restart

```bash
# systemd
sudo systemctl restart hermes-gateway

# Docker Compose
docker compose -f deploy/docker-compose.yml restart hermes

# Ad-hoc
hermes gateway start --replace
```

The `--replace` flag removes stale PID/lock files from an unclean shutdown before starting.

---

## 8. Zero-Downtime Restart

Hermes is a stateful long-running process (persistent sessions, active cron jobs). True zero-downtime requires careful sequencing.

### Strategy A: systemd rolling restart (recommended)

systemd's `Restart=on-failure` with a 5-second back-off ensures automatic recovery from crashes. For intentional restarts, use `reload-or-restart`, which reloads the unit if it defines a reload action and otherwise restarts it:

```bash
sudo systemctl reload-or-restart hermes-gateway
```

`hermes-gateway.service` uses `TimeoutStopSec=30`, so in-flight agent turns have up to 30 seconds to finish before systemd kills the old process.

> **Note:** Active messaging conversations will see a brief pause (< 30 s) while the gateway reconnects to platforms. The session store is file-based and persists across restarts, so conversations resume where they left off.

### Strategy B: Blue/green with two HERMES_HOME directories

For deployments where even a brief pause is unacceptable:

```bash
# 1. Prepare the new environment (different HERMES_HOME)
export HERMES_HOME=/home/hermes/.hermes-green
hermes setup   # configure green env with same .env

# 2. Start green on a different port (e.g. 8643) and record its PID
API_SERVER_PORT=8643 hermes gateway start &
echo $! > "$HERMES_HOME/gateway.pid"

# 3. Verify green is healthy
curl -s http://127.0.0.1:8643/health | jq .gateway_state

# 4. Switch load balancer (nginx/caddy) to port 8643

# 5. Gracefully stop blue (blue uses the default ~/.hermes directory)
kill -TERM "$(cat ~/.hermes/gateway.pid)"
```

### Strategy C: Docker Compose rolling update

```bash
# Pull the new image
docker compose -f deploy/docker-compose.yml pull hermes

# Recreate only the hermes service, leaving its dependencies untouched
docker compose -f deploy/docker-compose.yml up -d --no-deps hermes
```

With a single container, `docker compose up -d` stops the old container and then starts the new one, so expect a short interruption. Avoiding it requires a replicated setup behind a proxy that routes traffic only to containers that pass their healthcheck.

---

## 9. Rollback Procedure

### 9a. Code rollback (pip install)

```bash
# List the most recent version tags, newest first
git tag --sort=-creatordate | head -10

# Roll back to a specific tag
git checkout v0.4.0
pip install -e ".[all]" --user --quiet

# Restart the gateway
sudo systemctl restart hermes-gateway
```

### 9b. Docker image rollback

```bash
# Pull a specific version
docker pull ghcr.io/nousresearch/hermes-agent:v0.4.0

# Update docker-compose.yml image tag, then:
docker compose -f deploy/docker-compose.yml up -d
```

### 9c. State / data rollback

The data directory (`~/.hermes/` or the Docker volume `hermes_data`) contains sessions, memories, cron jobs, and the response store. Back it up before every update:

```bash
# Backup (run BEFORE updating); -C stores paths relative to the home directory
mkdir -p ~/backups
tar czf ~/backups/hermes_data_$(date +%F_%H%M).tar.gz -C ~ .hermes

# Restore from backup
sudo systemctl stop hermes-gateway
rm -rf ~/.hermes/
tar xzf ~/backups/hermes_data_2026-04-06_1200.tar.gz -C ~
sudo systemctl start hermes-gateway
```

> **Tested rollback**: The rollback procedure above was validated in staging on 2026-04-06. Data integrity was confirmed by checking session count before/after: `ls ~/.hermes/sessions/ | wc -l`.

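
If backups run from a script rather than by hand, it helps to confirm that each archive is readable before relying on it. The sketch below is an illustration only and is not part of the validated procedure; `backup_hermes` is a hypothetical helper. It archives the directory under a relative member name (so it can be restored with `tar xzf ARCHIVE -C PARENT_DIR`) and then lists the archive to prove `tar` can read it end to end. Example call: `backup_hermes ~/.hermes ~/backups`.

```bash
# backup_hermes SRC_DIR BACKUP_DIR
# Archives SRC_DIR relative to its parent, verifies the archive, prints its path.
backup_hermes() {
  local src=$1 dest=$2 parent name archive
  parent=$(dirname "$src")
  name=$(basename "$src")
  archive="$dest/hermes_data_$(date +%F_%H%M%S).tar.gz"
  mkdir -p "$dest" || return 1
  # -C switches to the parent first, so members are stored as "$name/...".
  tar czf "$archive" -C "$parent" "$name" || return 1
  # Listing forces tar to read and decompress the whole archive.
  tar tzf "$archive" > /dev/null || return 1
  echo "$archive"
}
```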
---

## 10. Database / State Migrations

Hermes keeps persistent state in five places:

| Store | Location | Format |
|-------|----------|--------|
| Session store | `~/.hermes/sessions/*.json` | JSON files |
| Response store (API server) | `~/.hermes/response_store.db` | SQLite WAL |
| Gateway state | `~/.hermes/gateway_state.json` | JSON |
| Memories | `~/.hermes/memories/*.md` | Markdown files |
| Cron jobs | `~/.hermes/cron/*.json` | JSON files |

### Migration steps (between versions)

1. **Stop** the gateway before migrating.
2. **Backup** the data directory (see Section 9c).
3. **Check release notes** for migration instructions (see `RELEASE_*.md`).
4. **Run** `hermes doctor` after starting the new version; it validates state compatibility.
5. **Verify** health via `python scripts/deploy-validate`.

There are currently no SQL migrations to run manually. The SQLite schema is created automatically on first use with `CREATE TABLE IF NOT EXISTS`.

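
The migration steps can be strung together in a script with a dry-run switch, so the sequence can be reviewed before anything is stopped. This is a sketch under assumptions: `run` and `migrate` are hypothetical names, `DRY_RUN` is an illustrative variable, the systemd unit from Section 12 is in use, and the script runs from the repo root. Reading the release notes and installing the new version stay manual. Preview with `DRY_RUN=1 migrate`.

```bash
# run CMD...: print the command, then execute it unless DRY_RUN=1.
run() {
  echo "+ $*"
  [ "${DRY_RUN:-0}" = "1" ] || "$@"
}

# migrate: stop, back up, restart, and validate, aborting on the first failure.
migrate() {
  local stamp
  stamp=$(date +%F_%H%M)
  run sudo systemctl stop hermes-gateway || return 1       # step 1: stop
  run mkdir -p "$HOME/backups" || return 1
  run tar czf "$HOME/backups/hermes_data_$stamp.tar.gz" \
      -C "$HOME" .hermes || return 1                        # step 2: backup
  # Step 3 is manual: read RELEASE_*.md and install the new version here.
  run sudo systemctl start hermes-gateway || return 1
  run hermes doctor || return 1                             # step 4: state check
  run python scripts/deploy-validate                        # step 5: verify
}
```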
---

## 11. Docker Compose Deployment

### First-time setup

```bash
# 1. Copy .env.example to .env in the repo root
cp .env.example .env
nano .env   # fill in your API keys

# 2. Validate config before starting
python scripts/deploy-validate --skip-health

# 3. Start the stack
docker compose -f deploy/docker-compose.yml up -d

# 4. Watch startup logs
docker compose -f deploy/docker-compose.yml logs -f

# 5. Verify health
curl -s http://127.0.0.1:8642/health | jq .
```

### Updating to a new version

```bash
# Pull latest image
docker compose -f deploy/docker-compose.yml pull

# Recreate the container (expect a brief interruption while it is replaced)
docker compose -f deploy/docker-compose.yml up -d

# Watch logs
docker compose -f deploy/docker-compose.yml logs -f --since 2m
```

### Data backup (Docker)

```bash
docker run --rm \
  -v hermes_data:/data \
  -v "$(pwd)/backups:/backup" \
  alpine tar czf /backup/hermes_data_$(date +%F).tar.gz /data
```

---

## 12. systemd Deployment

### Install unit files

```bash
# From the repo root
sudo cp deploy/hermes-agent.service /etc/systemd/system/
sudo cp deploy/hermes-gateway.service /etc/systemd/system/

sudo systemctl daemon-reload

# Enable on boot + start now
sudo systemctl enable --now hermes-gateway

# (Optional) also run the CLI agent as a background service
# sudo systemctl enable --now hermes-agent
```

### Adjust the unit file for your user/paths

Edit `/etc/systemd/system/hermes-gateway.service`. systemd does not support trailing comments on a setting line, so keep comments on their own lines:

```ini
[Service]
# Change the user and paths below from the default 'hermes'
User=youruser
WorkingDirectory=/home/youruser
EnvironmentFile=/home/youruser/.hermes/.env
ExecStart=/home/youruser/.local/bin/hermes gateway start --replace
```

Then:

```bash
sudo systemctl daemon-reload
sudo systemctl restart hermes-gateway
```

### Verify

```bash
systemctl status hermes-gateway
journalctl -u hermes-gateway -f
```

---

## 13. Monitoring & Logs

### Log locations

| Log | Location |
|-----|----------|
| Gateway (systemd) | `journalctl -u hermes-gateway` |
| Gateway (Docker) | `docker compose logs hermes` |
| Session trajectories | `~/.hermes/logs/session_*.json` |
| Deploy events | `~/.hermes/logs/deploy.log` |
| Runtime state | `~/.hermes/gateway_state.json` |

### Useful log commands

```bash
# Last 100 lines, follow
journalctl -u hermes-gateway -n 100 -f

# Errors only
journalctl -u hermes-gateway -p err --since today

# Docker: structured logs with timestamps
docker compose -f deploy/docker-compose.yml logs --timestamps hermes
```

### Alerting

Add a cron job on the host to page you if the health check fails:

```bash
# /etc/cron.d/hermes-healthcheck
# cron has no backslash line continuation, so the whole job stays on one line
* * * * * root curl -sf http://127.0.0.1:8642/health > /dev/null || echo "Hermes unhealthy at $(date)" | mail -s "ALERT: Hermes down" ops@example.com
```
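
A probe that mails on every failed minute can flood an inbox during a long outage and fire on a single transient blip. One common refinement, sketched here as an assumption rather than as part of Hermes, is to alert only when the failure streak reaches a threshold. `record_probe`, `probe_and_alert`, and the state file path are hypothetical; the cron job would call a script containing `probe_and_alert` instead of running `curl` directly.

```bash
# record_probe STATE_FILE RESULT [THRESHOLD]
# RESULT is "ok" or "fail". Exits 0 only when the consecutive-failure count
# reaches THRESHOLD, so one outage produces one alert.
record_probe() {
  local state_file=$1 result=$2 threshold=${3:-3} streak=0
  if [ -f "$state_file" ]; then
    streak=$(cat "$state_file")
  fi
  case "$streak" in
    ''|*[!0-9]*) streak=0 ;;   # treat an unreadable counter as a fresh start
  esac
  if [ "$result" = "ok" ]; then
    streak=0
  else
    streak=$((streak + 1))
  fi
  echo "$streak" > "$state_file"
  [ "$streak" -eq "$threshold" ]
}

# probe_and_alert: one health probe, plus mail after 3 consecutive failures.
probe_and_alert() {
  local result=fail
  curl -sf --max-time 10 http://127.0.0.1:8642/health > /dev/null && result=ok
  if record_probe /var/tmp/hermes-health.streak "$result" 3; then
    echo "Hermes unhealthy at $(date)" | mail -s "ALERT: Hermes down" ops@example.com
  fi
}
```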
---

## 14. Security Checklist

- [ ] `.env` has permissions `600` and is **not** tracked by git (`git ls-files .env` returns nothing).
- [ ] `API_SERVER_KEY` is set if the API server is exposed beyond `127.0.0.1`.
- [ ] API server is bound to `127.0.0.1` (not `0.0.0.0`) unless behind a TLS-terminating reverse proxy.
- [ ] Firewall allows only the ports your platforms require (no unnecessary open ports).
- [ ] systemd unit uses `NoNewPrivileges=true`, `PrivateTmp=true`, `ProtectSystem=strict`.
- [ ] Docker container has resource limits set (`deploy.resources.limits`).
- [ ] Backups of `~/.hermes/` are stored outside the server (e.g. S3, remote NAS).
- [ ] `hermes doctor` returns no errors on the running instance.
- [ ] `python scripts/deploy-validate` exits 0 after every configuration change.

---

## 15. Troubleshooting

### Gateway won't start

```bash
hermes gateway start --replace   # clears stale PID files

# Check for port conflicts
ss -tlnp | grep 8642

# Verbose logs
HERMES_LOG_LEVEL=DEBUG hermes gateway start
```

### Health check returns `gateway_state: "starting"` for more than 60 s

Platform adapters take time to authenticate (especially Telegram + Discord). Check logs for auth errors:

```bash
journalctl -u hermes-gateway --since "2 minutes ago" | grep -i "error\|token\|auth"
```

### `/health` returns connection refused

The API server platform may not be enabled. Verify your gateway config (`~/.hermes/config.yaml`) includes:

```yaml
gateway:
  platforms:
    - api_server
```

### Rollback needed after failed update

See [Section 9](#9-rollback-procedure). If you backed up before updating, rollback takes < 5 minutes.

### Sessions lost after restart

Sessions are file-based in `~/.hermes/sessions/`. They persist across restarts. If they are gone, check:

```bash
ls -la ~/.hermes/sessions/
# Verify the volume is mounted (Docker):
docker exec hermes-agent ls /opt/data/sessions/
```

---

*This runbook is owned by the Bezalel epic backlog. Update it whenever deployment procedures change.*