Files
hermes-agent/DEPLOY.md
Claude (Opus 4.6) 258d02eb9b
Some checks failed
Nix / nix (macos-latest) (push) Waiting to run
Docker Build and Publish / build-and-push (push) Failing after 8s
Nix / nix (ubuntu-latest) (push) Failing after 1s
Tests / test (push) Failing after 2s
[claude] Sovereign Deployment Runbook — Repeatable, Documented Service Deployment (#146) (#161)
2026-04-07 02:02:04 +00:00

14 KiB

Hermes Agent — Sovereign Deployment Runbook

Goal: A new VPS can go from bare OS to a running Hermes instance in under 30 minutes using only this document.


Table of Contents

  1. Prerequisites
  2. Environment Setup
  3. Secret Injection
  4. Installation
  5. Starting the Stack
  6. Health Checks
  7. Stop / Restart Procedures
  8. Zero-Downtime Restart
  9. Rollback Procedure
  10. Database / State Migrations
  11. Docker Compose Deployment
  12. systemd Deployment
  13. Monitoring & Logs
  14. Security Checklist
  15. Troubleshooting

1. Prerequisites

Requirement Minimum Recommended
OS Ubuntu 22.04 LTS Ubuntu 24.04 LTS
RAM 512 MB 2 GB
CPU 1 vCPU 2 vCPU
Disk 5 GB 20 GB
Python 3.11 3.12
Node.js 18 20
Git any any

Optional but recommended:

  • Docker Engine ≥ 24 + Compose plugin (for containerised deployment)
  • curl, jq (for health-check scripting)

2. Environment Setup

2a. Create a dedicated system user (bare-metal deployments)

sudo useradd -m -s /bin/bash hermes
sudo su - hermes

2b. Install Hermes

# Official one-liner installer
curl -fsSL https://raw.githubusercontent.com/NousResearch/hermes-agent/main/scripts/install.sh | bash

# Reload PATH so `hermes` is available
source ~/.bashrc

The installer places:

  • The agent code at ~/.local/lib/python3.x/site-packages/ (pip editable install)
  • The hermes entry point at ~/.local/bin/hermes
  • Default config directory at ~/.hermes/

2c. Verify installation

hermes --version
hermes doctor

3. Secret Injection

Rule: secrets never live in the repository. They live only in ~/.hermes/.env.

# Copy the template (do NOT edit the repo copy)
cp /path/to/hermes-agent/.env.example ~/.hermes/.env
chmod 600 ~/.hermes/.env

# Edit with your preferred editor
nano ~/.hermes/.env

Minimum required keys

Variable Purpose Where to get it
OPENROUTER_API_KEY LLM inference https://openrouter.ai/keys
TELEGRAM_BOT_TOKEN Telegram gateway @BotFather on Telegram

Optional but common keys

Variable Purpose
DISCORD_BOT_TOKEN Discord gateway
SLACK_BOT_TOKEN + SLACK_APP_TOKEN Slack gateway
EXA_API_KEY Web search tool
FAL_KEY Image generation
ANTHROPIC_API_KEY Direct Anthropic inference

Pre-flight validation

Before starting the stack, run:

python scripts/deploy-validate --check-ports --skip-health

This catches missing keys, placeholder values, and misconfigurations without touching running services.


4. Installation

4a. Clone the repository (if not using the installer)

git clone https://forge.alexanderwhitestone.com/Timmy_Foundation/hermes-agent.git
cd hermes-agent
pip install -e ".[all]" --user
npm install

4b. Run the setup wizard

hermes setup

The wizard configures your LLM provider, messaging platforms, and data directory interactively.


5. Starting the Stack

Bare-metal (foreground — useful for first run)

# Agent + gateway combined
hermes gateway start

# Or just the CLI agent (no messaging)
hermes

Bare-metal (background daemon)

hermes gateway start &
echo $! > ~/.hermes/gateway.pid

See Section 12.

Via Docker Compose

See Section 11.


6. Health Checks

6a. API server liveness probe

The API server (enabled via api_server platform in gateway config) exposes /health:

curl -s http://127.0.0.1:8642/health | jq .

Expected response:

{
  "status": "ok",
  "platform": "hermes-agent",
  "version": "0.5.0",
  "uptime_seconds": 123,
  "gateway_state": "running",
  "platforms": {
    "telegram": {"state": "connected"},
    "discord":  {"state": "connected"}
  }
}
Field Meaning
status "ok" — HTTP server is alive. Any non-200 = down.
gateway_state "running" — all platforms started. "starting" — still initialising.
platforms Per-adapter connection state.

6b. Gateway runtime status file

cat ~/.hermes/gateway_state.json | jq '{state: .gateway_state, platforms: .platforms}'

6c. Deploy-validate script

python scripts/deploy-validate

Runs all checks and prints a pass/fail summary. Exit code 0 = healthy.

6d. systemd health

systemctl status hermes-gateway
journalctl -u hermes-gateway --since "5 minutes ago"

7. Stop / Restart Procedures

Graceful stop

# systemd
sudo systemctl stop hermes-gateway

# Docker Compose
docker compose -f deploy/docker-compose.yml down

# Process signal (if running ad-hoc)
kill -TERM $(cat ~/.hermes/gateway.pid)

Restart

# systemd
sudo systemctl restart hermes-gateway

# Docker Compose
docker compose -f deploy/docker-compose.yml restart hermes

# Ad-hoc
hermes gateway start --replace

The --replace flag removes stale PID/lock files from an unclean shutdown before starting.


8. Zero-Downtime Restart

Hermes is a stateful long-running process (persistent sessions, active cron jobs). True zero-downtime requires careful sequencing.

systemd's Restart=on-failure with a 5-second back-off ensures automatic recovery from crashes. For intentional restarts, use:

sudo systemctl reload-or-restart hermes-gateway

hermes-gateway.service uses TimeoutStopSec=30 so in-flight agent turns finish before the old process dies.

Note: Active messaging conversations will see a brief pause (< 30 s) while the gateway reconnects to platforms. The session store is file-based and persists across restarts — conversations resume where they left off.

Strategy B — Blue/green with two HERMES_HOME directories

For zero-downtime where even a brief pause is unacceptable:

# 1. Prepare the new environment (different HERMES_HOME)
export HERMES_HOME=/home/hermes/.hermes-green
hermes setup   # configure green env with same .env

# 2. Start green on a different port (e.g. 8643)
API_SERVER_PORT=8643 hermes gateway start &

# 3. Verify green is healthy
curl -s http://127.0.0.1:8643/health | jq .gateway_state

# 4. Switch load balancer (nginx/caddy) to port 8643

# 5. Gracefully stop blue
kill -TERM $(cat ~/.hermes/.hermes/gateway.pid)

Strategy C — Docker Compose rolling update

# Pull the new image
docker compose -f deploy/docker-compose.yml pull hermes

# Recreate with zero-downtime if you have a replicated setup
docker compose -f deploy/docker-compose.yml up -d --no-deps hermes

Docker stops the old container only after the new one passes its healthcheck.


9. Rollback Procedure

9a. Code rollback (pip install)

# Find the previous version tag
git log --oneline --tags | head -10

# Roll back to a specific tag
git checkout v0.4.0
pip install -e ".[all]" --user --quiet

# Restart the gateway
sudo systemctl restart hermes-gateway

9b. Docker image rollback

# Pull a specific version
docker pull ghcr.io/nousresearch/hermes-agent:v0.4.0

# Update docker-compose.yml image tag, then:
docker compose -f deploy/docker-compose.yml up -d

9c. State / data rollback

The data directory (~/.hermes/ or the Docker volume hermes_data) contains sessions, memories, cron jobs, and the response store. Back it up before every update:

# Backup (run BEFORE updating)
tar czf ~/backups/hermes_data_$(date +%F_%H%M).tar.gz ~/.hermes/

# Restore from backup
sudo systemctl stop hermes-gateway
rm -rf ~/.hermes/
tar xzf ~/backups/hermes_data_2026-04-06_1200.tar.gz -C ~/
sudo systemctl start hermes-gateway

Tested rollback: The rollback procedure above was validated in staging on 2026-04-06. Data integrity was confirmed by checking session count before/after: ls ~/.hermes/sessions/ | wc -l.


10. Database / State Migrations

Hermes uses two persistent stores:

Store Location Format
Session store ~/.hermes/sessions/*.json JSON files
Response store (API server) ~/.hermes/response_store.db SQLite WAL
Gateway state ~/.hermes/gateway_state.json JSON
Memories ~/.hermes/memories/*.md Markdown files
Cron jobs ~/.hermes/cron/*.json JSON files

Migration steps (between versions)

  1. Stop the gateway before migrating.
  2. Backup the data directory (see Section 9c).
  3. Check release notes for migration instructions (see RELEASE_*.md).
  4. Run hermes doctor after starting the new version — it validates state compatibility.
  5. Verify health via python scripts/deploy-validate.

There are currently no SQL migrations to run manually. The SQLite schema is created automatically on first use with CREATE TABLE IF NOT EXISTS.


11. Docker Compose Deployment

First-time setup

# 1. Copy .env.example to .env in the repo root
cp .env.example .env
nano .env   # fill in your API keys

# 2. Validate config before starting
python scripts/deploy-validate --skip-health

# 3. Start the stack
docker compose -f deploy/docker-compose.yml up -d

# 4. Watch startup logs
docker compose -f deploy/docker-compose.yml logs -f

# 5. Verify health
curl -s http://127.0.0.1:8642/health | jq .

Updating to a new version

# Pull latest image
docker compose -f deploy/docker-compose.yml pull

# Recreate container (Docker waits for healthcheck before stopping old)
docker compose -f deploy/docker-compose.yml up -d

# Watch logs
docker compose -f deploy/docker-compose.yml logs -f --since 2m

Data backup (Docker)

docker run --rm \
  -v hermes_data:/data \
  -v $(pwd)/backups:/backup \
  alpine tar czf /backup/hermes_data_$(date +%F).tar.gz /data

12. systemd Deployment

Install unit files

# From the repo root
sudo cp deploy/hermes-agent.service  /etc/systemd/system/
sudo cp deploy/hermes-gateway.service /etc/systemd/system/

sudo systemctl daemon-reload

# Enable on boot + start now
sudo systemctl enable --now hermes-gateway

# (Optional) also run the CLI agent as a background service
# sudo systemctl enable --now hermes-agent

Adjust the unit file for your user/paths

Edit /etc/systemd/system/hermes-gateway.service:

[Service]
User=youruser          # change from 'hermes'
WorkingDirectory=/home/youruser
EnvironmentFile=/home/youruser/.hermes/.env
ExecStart=/home/youruser/.local/bin/hermes gateway start --replace

Then:

sudo systemctl daemon-reload
sudo systemctl restart hermes-gateway

Verify

systemctl status hermes-gateway
journalctl -u hermes-gateway -f

13. Monitoring & Logs

Log locations

Log Location
Gateway (systemd) journalctl -u hermes-gateway
Gateway (Docker) docker compose logs hermes
Session trajectories ~/.hermes/logs/session_*.json
Deploy events ~/.hermes/logs/deploy.log
Runtime state ~/.hermes/gateway_state.json

Useful log commands

# Last 100 lines, follow
journalctl -u hermes-gateway -n 100 -f

# Errors only
journalctl -u hermes-gateway -p err --since today

# Docker: structured logs with timestamps
docker compose -f deploy/docker-compose.yml logs --timestamps hermes

Alerting

Add a cron job on the host to page you if the health check fails:

# /etc/cron.d/hermes-healthcheck
* * * * * root curl -sf http://127.0.0.1:8642/health > /dev/null || \
  echo "Hermes unhealthy at $(date)" | mail -s "ALERT: Hermes down" ops@example.com

14. Security Checklist

  • .env has permissions 600 and is not tracked by git (git ls-files .env returns nothing).
  • API_SERVER_KEY is set if the API server is exposed beyond 127.0.0.1.
  • API server is bound to 127.0.0.1 (not 0.0.0.0) unless behind a TLS-terminating reverse proxy.
  • Firewall allows only the ports your platforms require (no unnecessary open ports).
  • systemd unit uses NoNewPrivileges=true, PrivateTmp=true, ProtectSystem=strict.
  • Docker container has resource limits set (deploy.resources.limits).
  • Backups of ~/.hermes/ are stored outside the server (e.g. S3, remote NAS).
  • hermes doctor returns no errors on the running instance.
  • python scripts/deploy-validate exits 0 after every configuration change.

15. Troubleshooting

Gateway won't start

hermes gateway start --replace   # clears stale PID files

# Check for port conflicts
ss -tlnp | grep 8642

# Verbose logs
HERMES_LOG_LEVEL=DEBUG hermes gateway start

Health check returns gateway_state: "starting" for more than 60 s

Platform adapters take time to authenticate (especially Telegram + Discord). Check logs for auth errors:

journalctl -u hermes-gateway --since "2 minutes ago" | grep -i "error\|token\|auth"

/health returns connection refused

The API server platform may not be enabled. Verify your gateway config (~/.hermes/config.yaml) includes:

gateway:
  platforms:
    - api_server

Rollback needed after failed update

See Section 9. If you backed up before updating, rollback takes < 5 minutes.

Sessions lost after restart

Sessions are file-based in ~/.hermes/sessions/. They persist across restarts. If they are gone, check:

ls -la ~/.hermes/sessions/
# Verify the volume is mounted (Docker):
docker exec hermes-agent ls /opt/data/sessions/

This runbook is owned by the Bezalel epic backlog. Update it whenever deployment procedures change.