Timmy_Foundation/hermes-agent


Claude (Opus 4.6) 258d02eb9b


[claude] Sovereign Deployment Runbook — Repeatable, Documented Service Deployment (#146 ) (#161 )

2026-04-07 02:02:04 +00:00


Hermes Agent — Sovereign Deployment Runbook

Goal: A new VPS can go from bare OS to a running Hermes instance in under 30 minutes using only this document.

Prerequisites
Environment Setup
Secret Injection
Installation
Starting the Stack
Health Checks
Stop / Restart Procedures
Zero-Downtime Restart
Rollback Procedure
Database / State Migrations
Docker Compose Deployment
systemd Deployment
Monitoring & Logs
Security Checklist
Troubleshooting

1. Prerequisites

Requirement	Minimum	Recommended
OS	Ubuntu 22.04 LTS	Ubuntu 24.04 LTS
RAM	512 MB	2 GB
CPU	1 vCPU	2 vCPU
Disk	5 GB	20 GB
Python	3.11	3.12
Node.js	18	20
Git	any	any

Optional but recommended:

Docker Engine ≥ 24 + Compose plugin (for containerised deployment)
curl, jq (for health-check scripting)
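
The minimums in the table above can be checked before installing anything. The following is a minimal sketch, not part of Hermes: the function names are hypothetical, and the thresholds are copied from the table. To use it on a host, pass `sys.version_info[:2]`, the output of `node --version`, and `shutil.which("git") is not None`.

```python
import re
from typing import List, Optional, Tuple

MIN_PYTHON = (3, 11)   # minimum Python from the prerequisites table
MIN_NODE_MAJOR = 18    # minimum Node.js major version from the same table


def parse_node_major(version_output: str) -> Optional[int]:
    """Extract the major version from `node --version` output such as 'v20.11.1'."""
    match = re.match(r"v?(\d+)\.", version_output.strip())
    return int(match.group(1)) if match else None


def find_problems(python_version: Tuple[int, int],
                  node_output: Optional[str],
                  has_git: bool) -> List[str]:
    """Return human-readable problems; an empty list means the minimums are met."""
    problems = []
    if tuple(python_version) < MIN_PYTHON:
        problems.append("Python %d.%d is below 3.11" % tuple(python_version))
    node_major = parse_node_major(node_output) if node_output else None
    if node_major is None:
        problems.append("Node.js not found or version unreadable")
    elif node_major < MIN_NODE_MAJOR:
        problems.append("Node.js %d is below 18" % node_major)
    if not has_git:
        problems.append("git not found on PATH")
    return problems
```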

2. Environment Setup

2a. Create a dedicated system user (bare-metal deployments)

sudo useradd -m -s /bin/bash hermes
sudo su - hermes

2b. Install Hermes

# Official one-liner installer
curl -fsSL https://raw.githubusercontent.com/NousResearch/hermes-agent/main/scripts/install.sh | bash

# Reload PATH so `hermes` is available
source ~/.bashrc

The installer places:

The agent code at ~/.local/lib/python3.x/site-packages/ (pip editable install)
The hermes entry point at ~/.local/bin/hermes
Default config directory at ~/.hermes/

2c. Verify installation

hermes --version
hermes doctor

3. Secret Injection

Rule: secrets never live in the repository. They live only in ~/.hermes/.env.

# Copy the template (do NOT edit the repo copy)
cp /path/to/hermes-agent/.env.example ~/.hermes/.env
chmod 600 ~/.hermes/.env

# Edit with your preferred editor
nano ~/.hermes/.env

Minimum required keys

Variable	Purpose	Where to get it
`OPENROUTER_API_KEY`	LLM inference	https://openrouter.ai/keys
`TELEGRAM_BOT_TOKEN`	Telegram gateway	@BotFather on Telegram

Optional but common keys

Variable	Purpose
`DISCORD_BOT_TOKEN`	Discord gateway
`SLACK_BOT_TOKEN` + `SLACK_APP_TOKEN`	Slack gateway
`EXA_API_KEY`	Web search tool
`FAL_KEY`	Image generation
`ANTHROPIC_API_KEY`	Direct Anthropic inference

Pre-flight validation

Before starting the stack, run:

python scripts/deploy-validate --check-ports --skip-health

This catches missing keys, placeholder values, and misconfigurations without touching running services.
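
To illustrate what "missing keys" and "placeholder values" mean in practice, here is a rough sketch of that kind of check. It is not the implementation of scripts/deploy-validate; the function names and the placeholder markers are assumptions made for the example.

```python
from typing import Dict, Iterable, List

# Hypothetical placeholder markers; adjust to whatever your .env.example uses.
PLACEHOLDER_MARKERS = ("your_", "changeme", "xxx", "<")


def parse_env(text: str) -> Dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blanks and comments."""
    values = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip("'\"")
    return values


def find_env_problems(text: str, required: Iterable[str]) -> List[str]:
    """Report required keys that are absent, empty, or still look like template values."""
    values = parse_env(text)
    problems = []
    for key in required:
        value = values.get(key, "")
        if not value:
            problems.append(f"{key}: missing or empty")
        elif any(marker in value.lower() for marker in PLACEHOLDER_MARKERS):
            problems.append(f"{key}: looks like a placeholder")
    return problems
```

Called with the contents of ~/.hermes/.env and the two minimum required keys listed above, an empty result means both keys are present and do not match any of the assumed markers.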

4. Installation

4a. Clone the repository (if not using the installer)

git clone https://forge.alexanderwhitestone.com/Timmy_Foundation/hermes-agent.git
cd hermes-agent
pip install -e ".[all]" --user
npm install

4b. Run the setup wizard

hermes setup

The wizard configures your LLM provider, messaging platforms, and data directory interactively.

5. Starting the Stack

Bare-metal (foreground — useful for first run)

# Agent + gateway combined
hermes gateway start

# Or just the CLI agent (no messaging)
hermes

Bare-metal (background daemon)

hermes gateway start &
echo $! > ~/.hermes/gateway.pid

Via systemd (recommended for production)

See Section 12.

Via Docker Compose

See Section 11.

6. Health Checks

6a. API server liveness probe

The API server (enabled via api_server platform in gateway config) exposes /health:

curl -s http://127.0.0.1:8642/health | jq .

Expected response:

{
  "status": "ok",
  "platform": "hermes-agent",
  "version": "0.5.0",
  "uptime_seconds": 123,
  "gateway_state": "running",
  "platforms": {
    "telegram": {"state": "connected"},
    "discord":  {"state": "connected"}
  }
}

Field	Meaning
`status`	`"ok"` — HTTP server is alive. Any non-200 = down.
`gateway_state`	`"running"` — all platforms started. `"starting"` — still initialising.
`platforms`	Per-adapter connection state.
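
The fields above can be folded into a single verdict for scripting. The sketch below assumes the response shape shown in this section; the verdict labels ("down", "not_ready", "degraded", "healthy") and the function names are illustrative and are not defined by Hermes.

```python
import json
from typing import Any, Dict
from urllib.request import urlopen


def fetch_health(url: str = "http://127.0.0.1:8642/health", timeout: float = 5.0) -> Dict[str, Any]:
    """GET the health endpoint; urlopen raises on connection errors and non-2xx statuses."""
    with urlopen(url, timeout=timeout) as response:
        return json.load(response)


def classify_health(payload: Dict[str, Any]) -> str:
    """Reduce a /health payload to one of four illustrative labels."""
    if payload.get("status") != "ok":
        return "down"
    if payload.get("gateway_state") != "running":
        return "not_ready"   # e.g. "starting" while adapters initialise
    platforms = payload.get("platforms", {})
    if any(p.get("state") != "connected" for p in platforms.values()):
        return "degraded"    # server is up but at least one adapter is not connected
    return "healthy"
```

A typical call would be `classify_health(fetch_health())`, wrapped in a `try`/`except OSError` to treat a refused connection as down.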

6b. Gateway runtime status file

jq '{state: .gateway_state, platforms: .platforms}' ~/.hermes/gateway_state.json

6c. Deploy-validate script

python scripts/deploy-validate

Runs all checks and prints a pass/fail summary. Exit code 0 = healthy.

6d. systemd health

systemctl status hermes-gateway
journalctl -u hermes-gateway --since "5 minutes ago"

7. Stop / Restart Procedures

Graceful stop

# systemd
sudo systemctl stop hermes-gateway

# Docker Compose
docker compose -f deploy/docker-compose.yml down

# Process signal (if running ad-hoc)
kill -TERM $(cat ~/.hermes/gateway.pid)

Restart

# systemd
sudo systemctl restart hermes-gateway

# Docker Compose
docker compose -f deploy/docker-compose.yml restart hermes

# Ad-hoc
hermes gateway start --replace

The --replace flag removes stale PID/lock files from an unclean shutdown before starting.
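
For context, a PID file is "stale" when the process it names no longer exists. The sketch below shows one common way to detect that on Linux. It is not how `--replace` is implemented, and the function names are hypothetical. Note that PIDs can be reused, so a "running" result only means that some process currently holds that PID.

```python
import os
from pathlib import Path


def pid_is_running(pid: int) -> bool:
    """Probe a PID with signal 0, which checks for existence without delivering a signal."""
    if pid <= 0:
        return False   # 0 and negative values address process groups, not a single process
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        return True    # the process exists but belongs to another user
    return True


def pidfile_status(path) -> str:
    """Return 'absent', 'stale', or 'running' for a PID file such as ~/.hermes/gateway.pid."""
    pidfile = Path(path)
    if not pidfile.exists():
        return "absent"
    try:
        pid = int(pidfile.read_text().strip())
    except ValueError:
        return "stale"   # unreadable contents cannot refer to a live process
    return "running" if pid_is_running(pid) else "stale"
```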

8. Zero-Downtime Restart

Hermes is a stateful long-running process (persistent sessions, active cron jobs). True zero-downtime requires careful sequencing.

Strategy A — systemd rolling restart (recommended)

systemd's Restart=on-failure with a 5-second back-off ensures automatic recovery from crashes. For intentional restarts, use:

sudo systemctl reload-or-restart hermes-gateway

hermes-gateway.service sets TimeoutStopSec=30, which gives in-flight agent turns up to 30 seconds to finish after the stop signal. If the process has not exited by then, systemd kills it.

Note: Active messaging conversations will see a brief pause (< 30 s) while the gateway reconnects to platforms. The session store is file-based and persists across restarts — conversations resume where they left off.

Strategy B — Blue/green with two HERMES_HOME directories

For zero-downtime where even a brief pause is unacceptable:

# 1. Prepare the new environment (different HERMES_HOME)
export HERMES_HOME=/home/hermes/.hermes-green
hermes setup   # configure green env with same .env

# 2. Start green on a different port (e.g. 8643)
API_SERVER_PORT=8643 hermes gateway start &

# 3. Verify green is healthy
curl -s http://127.0.0.1:8643/health | jq .gateway_state

# 4. Switch load balancer (nginx/caddy) to port 8643

# 5. Gracefully stop blue
kill -TERM "$(cat ~/.hermes/gateway.pid)"
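
Step 3 is usually a wait rather than a single probe, because green needs time to connect its platforms. A minimal polling sketch follows, assuming the /health response shape from Section 6a and the example port 8643; the function names and default timings are illustrative.

```python
import json
import time
from typing import Callable
from urllib.request import urlopen


def gateway_running(url: str = "http://127.0.0.1:8643/health") -> bool:
    """True once the green instance reports gateway_state == 'running'."""
    with urlopen(url, timeout=5) as response:
        return json.load(response).get("gateway_state") == "running"


def wait_until(check: Callable[[], bool], timeout: float = 60.0, interval: float = 2.0,
               sleep=time.sleep, clock=time.monotonic) -> bool:
    """Poll `check` until it returns True or `timeout` seconds elapse."""
    deadline = clock() + timeout
    while True:
        try:
            if check():
                return True
        except OSError:
            pass   # connection refused or HTTP error while the server is still coming up
        if clock() >= deadline:
            return False
        sleep(interval)
```

Only switch the load balancer (step 4) if `wait_until(gateway_running)` returns True; on False, leave blue in place and inspect the green logs.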

Strategy C — Docker Compose rolling update

# Pull the new image
docker compose -f deploy/docker-compose.yml pull hermes

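# Recreate only the hermes service, leaving its dependencies untouched
docker compose -f deploy/docker-compose.yml up -d --no-deps hermes

On a single-container setup, Compose stops the old container before the replacement starts, so expect a brief interruption. A handover with no gap requires a replicated setup behind a load balancer, as in Strategy B.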

9. Rollback Procedure

9a. Code rollback (pip install)

# Find the previous version tag
git tag --sort=-creatordate | head -10

# Roll back to a specific tag
git checkout v0.4.0
pip install -e ".[all]" --user --quiet

# Restart the gateway
sudo systemctl restart hermes-gateway

9b. Docker image rollback

# Pull a specific version
docker pull ghcr.io/nousresearch/hermes-agent:v0.4.0

# Update docker-compose.yml image tag, then:
docker compose -f deploy/docker-compose.yml up -d

9c. State / data rollback

The data directory (~/.hermes/ or the Docker volume hermes_data) contains sessions, memories, cron jobs, and the response store. Back it up before every update:

# Backup (run BEFORE updating)
# -C ~ stores paths relative to the home directory, so the restore below lands in ~/.hermes/
mkdir -p ~/backups
tar czf ~/backups/hermes_data_$(date +%F_%H%M).tar.gz -C ~ .hermes

# Restore from backup
sudo systemctl stop hermes-gateway
rm -rf ~/.hermes/
tar xzf ~/backups/hermes_data_2026-04-06_1200.tar.gz -C ~/
sudo systemctl start hermes-gateway

Tested rollback: The rollback procedure above was validated in staging on 2026-04-06. Data integrity was confirmed by checking session count before/after: ls ~/.hermes/sessions/ | wc -l.
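
The before/after session count can be automated around a backup and restore. The following is a sketch under stated assumptions: sessions are the `*.json` files under `sessions/`, as listed in Section 10, and the helper names are hypothetical. It archives the data directory under its own name so that restoring into the parent directory recreates it in place.

```python
import tarfile
from pathlib import Path


def count_sessions(home) -> int:
    """Count session files under <home>/sessions; 0 if the directory is missing."""
    sessions = Path(home) / "sessions"
    return sum(1 for _ in sessions.glob("*.json")) if sessions.is_dir() else 0


def backup(home, archive) -> int:
    """Archive `home` under its directory name and return the session count at backup time."""
    home, archive = Path(home), Path(archive)
    archive.parent.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive, "w:gz") as tar:
        tar.add(home, arcname=home.name)   # relative member names, e.g. ".hermes/sessions/..."
    return count_sessions(home)


def restore(archive, parent) -> None:
    """Extract the archive into `parent`, recreating the data directory there."""
    with tarfile.open(archive, "r:gz") as tar:
        if hasattr(tarfile, "data_filter"):
            tar.extractall(parent, filter="data")   # safer extraction where supported
        else:
            tar.extractall(parent)
```

Record the value returned by `backup`, and compare it with `count_sessions` after `restore`; a mismatch means the restore is incomplete. Stop the gateway before restoring, as in the shell procedure above.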

10. Database / State Migrations

Hermes keeps persistent state in five stores:

Store	Location	Format
Session store	`~/.hermes/sessions/*.json`	JSON files
Response store (API server)	`~/.hermes/response_store.db`	SQLite WAL
Gateway state	`~/.hermes/gateway_state.json`	JSON
Memories	`~/.hermes/memories/*.md`	Markdown files
Cron jobs	`~/.hermes/cron/*.json`	JSON files

Migration steps (between versions)

Stop the gateway before migrating.
Backup the data directory (see Section 9c).
Check release notes for migration instructions (see RELEASE_*.md).
Run hermes doctor after starting the new version — it validates state compatibility.
Verify health via python scripts/deploy-validate.

There are currently no SQL migrations to run manually. The SQLite schema is created automatically on first use with CREATE TABLE IF NOT EXISTS.
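
The reason no manual migration is needed can be seen in a small example: `CREATE TABLE IF NOT EXISTS` is a no-op when the table is already there, so running it on every start is safe. The table and column names below are hypothetical, since this runbook does not show the real response-store schema.

```python
import sqlite3

# Hypothetical schema for illustration only.
SCHEMA = """
CREATE TABLE IF NOT EXISTS responses (
    id         TEXT PRIMARY KEY,
    body       TEXT NOT NULL,
    created_at REAL NOT NULL
)
"""


def open_store(path: str) -> sqlite3.Connection:
    """Open (or create) the database, enable WAL journaling, and ensure the schema exists."""
    conn = sqlite3.connect(path)
    conn.execute("PRAGMA journal_mode=WAL")
    conn.execute(SCHEMA)   # does nothing if the table already exists
    conn.commit()
    return conn
```

Note that this pattern only creates missing tables. It does not alter an existing table whose columns changed between versions, which is why checking the release notes remains part of the steps above.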

11. Docker Compose Deployment

First-time setup

# 1. Copy .env.example to .env in the repo root
cp .env.example .env
nano .env   # fill in your API keys

# 2. Validate config before starting
python scripts/deploy-validate --skip-health

# 3. Start the stack
docker compose -f deploy/docker-compose.yml up -d

# 4. Watch startup logs
docker compose -f deploy/docker-compose.yml logs -f

# 5. Verify health
curl -s http://127.0.0.1:8642/health | jq .

Updating to a new version

# Pull latest image
docker compose -f deploy/docker-compose.yml pull

# Recreate the container from the new image (expect a brief interruption while it is replaced)
docker compose -f deploy/docker-compose.yml up -d

# Watch logs
docker compose -f deploy/docker-compose.yml logs -f --since 2m

Data backup (Docker)

docker run --rm \
  -v hermes_data:/data \
  -v "$(pwd)/backups:/backup" \
  alpine tar czf "/backup/hermes_data_$(date +%F).tar.gz" /data

12. systemd Deployment

Install unit files

# From the repo root
sudo cp deploy/hermes-agent.service  /etc/systemd/system/
sudo cp deploy/hermes-gateway.service /etc/systemd/system/

sudo systemctl daemon-reload

# Enable on boot + start now
sudo systemctl enable --now hermes-gateway

# (Optional) also run the CLI agent as a background service
# sudo systemctl enable --now hermes-agent

Adjust the unit file for your user/paths

Edit /etc/systemd/system/hermes-gateway.service:

[Service]
# Change User from 'hermes' to your account (systemd does not support trailing comments)
User=youruser
WorkingDirectory=/home/youruser
EnvironmentFile=/home/youruser/.hermes/.env
ExecStart=/home/youruser/.local/bin/hermes gateway start --replace

Then:

sudo systemctl daemon-reload
sudo systemctl restart hermes-gateway

Verify

systemctl status hermes-gateway
journalctl -u hermes-gateway -f

13. Monitoring & Logs

Log locations

Log	Location
Gateway (systemd)	`journalctl -u hermes-gateway`
Gateway (Docker)	`docker compose logs hermes`
Session trajectories	`~/.hermes/logs/session_*.json`
Deploy events	`~/.hermes/logs/deploy.log`
Runtime state	`~/.hermes/gateway_state.json`

Useful log commands

# Last 100 lines, follow
journalctl -u hermes-gateway -n 100 -f

# Errors only
journalctl -u hermes-gateway -p err --since today

# Docker: structured logs with timestamps
docker compose -f deploy/docker-compose.yml logs --timestamps hermes

Alerting

Add a cron job on the host to page you if the health check fails:

# /etc/cron.d/hermes-healthcheck (cron entries must stay on a single line)
* * * * * root curl -sf http://127.0.0.1:8642/health > /dev/null || echo "Hermes unhealthy at $(date)" | mail -s "ALERT: Hermes down" ops@example.com

14. Security Checklist

.env has permissions 600 and is not tracked by git (git ls-files .env returns nothing).
API_SERVER_KEY is set if the API server is exposed beyond 127.0.0.1.
API server is bound to 127.0.0.1 (not 0.0.0.0) unless behind a TLS-terminating reverse proxy.
Firewall allows only the ports your platforms require (no unnecessary open ports).
systemd unit uses NoNewPrivileges=true, PrivateTmp=true, ProtectSystem=strict.
Docker container has resource limits set (deploy.resources.limits).
Backups of ~/.hermes/ are stored outside the server (e.g. S3, remote NAS).
hermes doctor returns no errors on the running instance.
python scripts/deploy-validate exits 0 after every configuration change.
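
The first checklist item is easy to verify mechanically. A minimal sketch follows, assuming a POSIX filesystem; the function name is hypothetical, and it checks only the permission bits, not git tracking.

```python
import os
import stat
from typing import List


def env_file_problems(path: str) -> List[str]:
    """Flag a secrets file that is missing or readable by group or other users."""
    try:
        mode = stat.S_IMODE(os.stat(path).st_mode)
    except FileNotFoundError:
        return [f"{path}: not found"]
    if mode & 0o077:   # any group/other permission bit is set
        return [f"{path}: mode {mode:03o} is too permissive, expected 600"]
    return []
```

For example, `env_file_problems(os.path.expanduser("~/.hermes/.env"))` returns an empty list when the file exists and grants no access to group or other users.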

15. Troubleshooting

Gateway won't start

hermes gateway start --replace   # clears stale PID files

# Check for port conflicts
ss -tlnp | grep 8642

# Verbose logs
HERMES_LOG_LEVEL=DEBUG hermes gateway start
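
Where `ss` is unavailable, the port-conflict check can be approximated by attempting a TCP connection. This is a sketch with a hypothetical function name; it only detects listeners reachable at the given address and says nothing about which process owns the port.

```python
import socket


def port_in_use(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if something accepts TCP connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout)
        return sock.connect_ex((host, port)) == 0   # 0 means the connection succeeded
```

If `port_in_use("127.0.0.1", 8642)` is True while the gateway is stopped, another process is already bound to the API server port.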

Health check returns `gateway_state: "starting"` for more than 60 s

Platform adapters take time to authenticate (especially Telegram + Discord). Check logs for auth errors:

journalctl -u hermes-gateway --since "2 minutes ago" | grep -i "error\|token\|auth"

`/health` returns connection refused

The API server platform may not be enabled. Verify your gateway config (~/.hermes/config.yaml) includes:

gateway:
  platforms:
    - api_server

Rollback needed after failed update

See Section 9. If you backed up before updating, rollback takes < 5 minutes.

Sessions lost after restart

Sessions are file-based in ~/.hermes/sessions/. They persist across restarts. If they are gone, check:

ls -la ~/.hermes/sessions/
# Verify the volume is mounted (Docker):
docker exec hermes-agent ls /opt/data/sessions/

This runbook is owned by the Bezalel epic backlog. Update it whenever deployment procedures change.
