# Hermes Agent — Sovereign Deployment Runbook

> **Goal**: A new VPS can go from bare OS to a running Hermes instance in under 30 minutes using only this document.

---

## Table of Contents

1. [Prerequisites](#1-prerequisites)
2. [Environment Setup](#2-environment-setup)
3. [Secret Injection](#3-secret-injection)
4. [Installation](#4-installation)
5. [Starting the Stack](#5-starting-the-stack)
6. [Health Checks](#6-health-checks)
7. [Stop / Restart Procedures](#7-stop--restart-procedures)
8. [Zero-Downtime Restart](#8-zero-downtime-restart)
9. [Rollback Procedure](#9-rollback-procedure)
10. [Database / State Migrations](#10-database--state-migrations)
11. [Docker Compose Deployment](#11-docker-compose-deployment)
12. [systemd Deployment](#12-systemd-deployment)
13. [Monitoring & Logs](#13-monitoring--logs)
14. [Security Checklist](#14-security-checklist)
15. [Troubleshooting](#15-troubleshooting)

---

## 1. Prerequisites

| Requirement | Minimum | Recommended |
|-------------|---------|-------------|
| OS | Ubuntu 22.04 LTS | Ubuntu 24.04 LTS |
| RAM | 512 MB | 2 GB |
| CPU | 1 vCPU | 2 vCPU |
| Disk | 5 GB | 20 GB |
| Python | 3.11 | 3.12 |
| Node.js | 18 | 20 |
| Git | any | any |

**Optional but recommended:**
- Docker Engine ≥ 24 + Compose plugin (for containerised deployment)
- `curl`, `jq` (for health-check scripting)

---

## 2. Environment Setup

### 2a. Create a dedicated system user (bare-metal deployments)

```bash
sudo useradd -m -s /bin/bash hermes
sudo su - hermes
```

### 2b. Install Hermes

```bash
# Official one-liner installer
curl -fsSL https://raw.githubusercontent.com/NousResearch/hermes-agent/main/scripts/install.sh | bash

# Reload PATH so `hermes` is available
source ~/.bashrc
```

The installer places:
- The agent code at `~/.local/lib/python3.x/site-packages/` (pip editable install)
- The `hermes` entry point at `~/.local/bin/hermes`
- Default config directory at `~/.hermes/`

### 2c. Verify installation

```bash
hermes --version
hermes doctor
```

---

## 3. Secret Injection

**Rule: secrets never live in the repository. They live only in `~/.hermes/.env`.**

```bash
# Copy the template (do NOT edit the repo copy)
cp /path/to/hermes-agent/.env.example ~/.hermes/.env
chmod 600 ~/.hermes/.env

# Edit with your preferred editor
nano ~/.hermes/.env
```

### Minimum required keys

| Variable | Purpose | Where to get it |
|----------|---------|----------------|
| `OPENROUTER_API_KEY` | LLM inference | https://openrouter.ai/keys |
| `TELEGRAM_BOT_TOKEN` | Telegram gateway | @BotFather on Telegram |

### Optional but common keys

| Variable | Purpose |
|----------|---------|
| `DISCORD_BOT_TOKEN` | Discord gateway |
| `SLACK_BOT_TOKEN` + `SLACK_APP_TOKEN` | Slack gateway |
| `EXA_API_KEY` | Web search tool |
| `FAL_KEY` | Image generation |
| `ANTHROPIC_API_KEY` | Direct Anthropic inference |

### Pre-flight validation

Before starting the stack, run:

```bash
python scripts/deploy-validate --check-ports --skip-health
```

This catches missing keys, placeholder values, and misconfigurations without touching running services.
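
The kind of check described above can be approximated in a few lines of shell. This is a minimal sketch, not the `deploy-validate` script itself: the helper name `check_env_keys` and the placeholder patterns (`your_`, `changeme`) are assumptions chosen for illustration.

```bash
# Hypothetical helper, not part of Hermes: report keys that are missing,
# empty, or still look like template placeholders in an env file.
# Usage: check_env_keys ENV_FILE KEY [KEY...]
check_env_keys() {
    local env_file="$1" key value status=0
    shift
    for key in "$@"; do
        # Take the last KEY=value assignment; commented-out lines do not match.
        value=$(grep -E "^${key}=" "$env_file" | tail -n 1 | cut -d= -f2-)
        case "$value" in
            "")
                echo "MISSING: $key"; status=1 ;;
            *your_*|*changeme*)
                # Assumed placeholder patterns; adjust to match your template.
                echo "PLACEHOLDER: $key"; status=1 ;;
        esac
    done
    return "$status"
}
```

For example, `check_env_keys ~/.hermes/.env OPENROUTER_API_KEY TELEGRAM_BOT_TOKEN` prints one line per problem and returns non-zero if any key needs attention.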

---

## 4. Installation

### 4a. Clone the repository (if not using the installer)

```bash
git clone https://forge.alexanderwhitestone.com/Timmy_Foundation/hermes-agent.git
cd hermes-agent
pip install -e ".[all]" --user
npm install
```

### 4b. Run the setup wizard

```bash
hermes setup
```

The wizard configures your LLM provider, messaging platforms, and data directory interactively.

---

## 5. Starting the Stack

### Bare-metal (foreground — useful for first run)

```bash
# Agent + gateway combined
hermes gateway start

# Or just the CLI agent (no messaging)
hermes
```

### Bare-metal (background daemon)

```bash
hermes gateway start &
echo $! > ~/.hermes/gateway.pid
```

### Via systemd (recommended for production)

See [Section 12](#12-systemd-deployment).

### Via Docker Compose

See [Section 11](#11-docker-compose-deployment).

---

## 6. Health Checks

### 6a. API server liveness probe

The API server (enabled via `api_server` platform in gateway config) exposes `/health`:

```bash
curl -s http://127.0.0.1:8642/health | jq .
```

Expected response:

```json
{
  "status": "ok",
  "platform": "hermes-agent",
  "version": "0.5.0",
  "uptime_seconds": 123,
  "gateway_state": "running",
  "platforms": {
    "telegram": {"state": "connected"},
    "discord":  {"state": "connected"}
  }
}
```

| Field | Meaning |
|-------|---------|
| `status` | `"ok"` — HTTP server is alive. Any non-200 = down. |
| `gateway_state` | `"running"` — all platforms started. `"starting"` — still initialising. |
| `platforms` | Per-adapter connection state. |
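
Because `gateway_state` moves from `"starting"` to `"running"`, a deploy script usually needs to poll rather than probe once. The sketch below shows one way to do that; `probe_gateway_state` and `wait_for_running` are hypothetical helper names, and the 60-second default timeout is an assumption, not a documented Hermes value.

```bash
# Hypothetical helpers, not part of Hermes.

# Print the gateway_state field from /health, or nothing if unreachable.
probe_gateway_state() {
    curl -sf --max-time 5 "http://127.0.0.1:${1:-8642}/health" | jq -r '.gateway_state'
}

# Poll a probe command until it prints "running" or the deadline passes.
# Usage: wait_for_running PROBE_CMD [TIMEOUT_SECONDS] [INTERVAL_SECONDS]
wait_for_running() {
    local probe="$1" timeout="${2:-60}" interval="${3:-2}" waited=0 state
    while :; do
        state=$("$probe" 2>/dev/null)
        if [ "$state" = "running" ]; then
            return 0
        fi
        if [ "$waited" -ge "$timeout" ]; then
            echo "gateway not running after ${timeout}s (last state: ${state:-unreachable})" >&2
            return 1
        fi
        sleep "$interval"
        waited=$((waited + interval))
    done
}
```

Calling `wait_for_running probe_gateway_state 60 2` after a start or restart blocks until the gateway reports `running` and fails with a message otherwise. Passing the probe as an argument keeps the loop reusable for other ports or checks.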

### 6b. Gateway runtime status file

```bash
jq '{state: .gateway_state, platforms: .platforms}' ~/.hermes/gateway_state.json
```

### 6c. Deploy-validate script

```bash
python scripts/deploy-validate
```

Runs all checks and prints a pass/fail summary. Exit code 0 = healthy.

### 6d. systemd health

```bash
systemctl status hermes-gateway
journalctl -u hermes-gateway --since "5 minutes ago"
```

---

## 7. Stop / Restart Procedures

### Graceful stop

```bash
# systemd
sudo systemctl stop hermes-gateway

# Docker Compose
docker compose -f deploy/docker-compose.yml down

# Process signal (if running ad-hoc)
kill -TERM $(cat ~/.hermes/gateway.pid)
```

### Restart

```bash
# systemd
sudo systemctl restart hermes-gateway

# Docker Compose
docker compose -f deploy/docker-compose.yml restart hermes

# Ad-hoc
hermes gateway start --replace
```

The `--replace` flag removes stale PID/lock files from an unclean shutdown before starting.
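
To see what "stale" means for the ad-hoc PID file written in Section 5, the sketch below tests whether a PID file still points at a live process. It illustrates the general idea only: `pid_file_is_stale` is a hypothetical helper, and how `--replace` is implemented inside Hermes is not described in this runbook.

```bash
# Hypothetical helper illustrating the stale-PID-file idea.
# Returns 0 if the PID file exists but names no running process.
pid_file_is_stale() {
    local pid_file="$1" pid
    [ -f "$pid_file" ] || return 1          # no file, nothing to clean up
    pid=$(tr -cd '0-9' < "$pid_file")       # keep digits only
    [ -n "$pid" ] || return 0               # empty or garbage content is stale
    # Signal 0 delivers nothing; it only tests whether the process can be signalled.
    if kill -0 "$pid" 2>/dev/null; then
        return 1
    fi
    return 0
}
```

Run it as the user that owns the gateway, for example `pid_file_is_stale ~/.hermes/gateway.pid && rm ~/.hermes/gateway.pid`. `kill -0` also fails for processes owned by another user, which this sketch would misreport as stale.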

---

## 8. Zero-Downtime Restart

Hermes is a stateful long-running process (persistent sessions, active cron jobs). True zero-downtime requires careful sequencing.

### Strategy A — systemd rolling restart (recommended)

systemd's `Restart=on-failure` with a 5-second back-off ensures automatic recovery from crashes. For intentional restarts, use:

```bash
sudo systemctl reload-or-restart hermes-gateway
```

`hermes-gateway.service` sets `TimeoutStopSec=30`, so after the stop signal in-flight agent turns have up to 30 seconds to finish before systemd force-kills the old process.

> **Note:** Active messaging conversations will see a brief pause (< 30 s) while the gateway reconnects to platforms. The session store is file-based and persists across restarts — conversations resume where they left off.

### Strategy B — Blue/green with two HERMES_HOME directories

For zero-downtime where even a brief pause is unacceptable:

```bash
# 1. Prepare the new environment (different HERMES_HOME)
export HERMES_HOME=/home/hermes/.hermes-green
hermes setup   # configure green env with same .env

# 2. Start green on a different port (e.g. 8643)
API_SERVER_PORT=8643 hermes gateway start &

# 3. Verify green is healthy
curl -s http://127.0.0.1:8643/health | jq .gateway_state

# 4. Switch load balancer (nginx/caddy) to port 8643

# 5. Gracefully stop blue
kill -TERM "$(cat ~/.hermes/gateway.pid)"
```

### Strategy C — Docker Compose rolling update

```bash
# Pull the new image
docker compose -f deploy/docker-compose.yml pull hermes

# Recreate only the hermes service; a gap-free swap needs a replicated setup
docker compose -f deploy/docker-compose.yml up -d --no-deps hermes
```

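With a single container, Compose stops the old container before starting its replacement, so expect a brief gap. The container healthcheck reports readiness but does not delay the stop. Avoiding the gap requires a second instance behind a load balancer, as in Strategy B.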

---

## 9. Rollback Procedure

### 9a. Code rollback (pip install)

```bash
# Find the previous version tag
git tag --sort=-creatordate | head -10

# Roll back to a specific tag
git checkout v0.4.0
pip install -e ".[all]" --user --quiet

# Restart the gateway
sudo systemctl restart hermes-gateway
```

### 9b. Docker image rollback

```bash
# Pull a specific version
docker pull ghcr.io/nousresearch/hermes-agent:v0.4.0

# Update docker-compose.yml image tag, then:
docker compose -f deploy/docker-compose.yml up -d
```

### 9c. State / data rollback

The data directory (`~/.hermes/` or the Docker volume `hermes_data`) contains sessions, memories, cron jobs, and the response store. Back it up before every update:

```bash
# Backup (run BEFORE updating); -C stores paths relative to $HOME
mkdir -p ~/backups
tar czf ~/backups/hermes_data_$(date +%F_%H%M).tar.gz -C ~ .hermes

# Restore from backup (the archive unpacks to ~/.hermes)
sudo systemctl stop hermes-gateway
rm -rf ~/.hermes/
tar xzf ~/backups/hermes_data_2026-04-06_1200.tar.gz -C ~/
sudo systemctl start hermes-gateway
```

> **Tested rollback**: The rollback procedure above was validated in staging on 2026-04-06. Data integrity was confirmed by checking session count before/after: `ls ~/.hermes/sessions/ | wc -l`.
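
The backup and the session-count check can be scripted so the comparison is not done by eye. The following is a minimal sketch under stated assumptions: `backup_data_dir` and `count_sessions` are hypothetical helpers, and the archive is written with paths relative to the parent of the data directory so it can be unpacked anywhere for verification.

```bash
# Hypothetical helpers, not part of Hermes.

# Archive DATA_DIR into BACKUP_DIR with paths relative to its parent, so the
# archive can be restored under any directory. Prints the archive path.
backup_data_dir() {
    local data_dir="${1%/}" backup_dir="$2" archive
    mkdir -p "$backup_dir" || return 1
    archive="$backup_dir/hermes_data_$(date +%F_%H%M%S).tar.gz"
    tar czf "$archive" -C "$(dirname "$data_dir")" "$(basename "$data_dir")" || return 1
    echo "$archive"
}

# Count session files the same way as the manual check above.
count_sessions() {
    ls "$1/sessions/" | wc -l
}
```

A typical use is to record `count_sessions ~/.hermes` before the update, run `backup_data_dir ~/.hermes ~/backups`, and compare the count again after any restore.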

---

## 10. Database / State Migrations

Hermes uses the following persistent stores:

| Store | Location | Format |
|-------|----------|--------|
| Session store | `~/.hermes/sessions/*.json` | JSON files |
| Response store (API server) | `~/.hermes/response_store.db` | SQLite WAL |
| Gateway state | `~/.hermes/gateway_state.json` | JSON |
| Memories | `~/.hermes/memories/*.md` | Markdown files |
| Cron jobs | `~/.hermes/cron/*.json` | JSON files |

### Migration steps (between versions)

1. **Stop** the gateway before migrating.
2. **Backup** the data directory (see Section 9c).
3. **Check release notes** for migration instructions (see `RELEASE_*.md`).
4. **Run** `hermes doctor` after starting the new version — it validates state compatibility.
5. **Verify** health via `python scripts/deploy-validate`.

There are currently no SQL migrations to run manually. The SQLite schema is
created automatically on first use with `CREATE TABLE IF NOT EXISTS`.
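
The migration steps above can be strung together in a small wrapper so none is skipped under pressure. This is a sketch, assuming a systemd deployment and the default `~/.hermes` data directory; `run` and `migrate_hermes` are hypothetical names, and the `DRY_RUN` switch is an illustration device rather than a Hermes feature.

```bash
# Hypothetical wrapper around the migration steps; not shipped with Hermes.
# With DRY_RUN=1 each command is printed instead of executed.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

migrate_hermes() {
    local backup="$HOME/backups/hermes_data_$(date +%F_%H%M).tar.gz"
    run sudo systemctl stop hermes-gateway      || return 1   # step 1
    run mkdir -p "$HOME/backups"                || return 1
    run tar czf "$backup" -C "$HOME" .hermes    || return 1   # step 2
    # Step 3 (reading the release notes) stays manual; install the new version here.
    run sudo systemctl start hermes-gateway     || return 1
    run hermes doctor                           || return 1   # step 4
    run python scripts/deploy-validate          || return 1   # step 5
}
```

Running `DRY_RUN=1 migrate_hermes` first prints the exact commands, which makes it easy to review the sequence before touching a production host.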

---

## 11. Docker Compose Deployment

### First-time setup

```bash
# 1. Copy .env.example to .env in the repo root
cp .env.example .env
nano .env   # fill in your API keys

# 2. Validate config before starting
python scripts/deploy-validate --skip-health

# 3. Start the stack
docker compose -f deploy/docker-compose.yml up -d

# 4. Watch startup logs
docker compose -f deploy/docker-compose.yml logs -f

# 5. Verify health
curl -s http://127.0.0.1:8642/health | jq .
```

### Updating to a new version

```bash
# Pull latest image
docker compose -f deploy/docker-compose.yml pull

# Recreate the container from the new image (expect a brief gap while it restarts)
docker compose -f deploy/docker-compose.yml up -d

# Watch logs
docker compose -f deploy/docker-compose.yml logs -f --since 2m
```

### Data backup (Docker)

```bash
docker run --rm \
  -v hermes_data:/data \
  -v $(pwd)/backups:/backup \
  alpine tar czf /backup/hermes_data_$(date +%F).tar.gz /data
```

---

## 12. systemd Deployment

### Install unit files

```bash
# From the repo root
sudo cp deploy/hermes-agent.service  /etc/systemd/system/
sudo cp deploy/hermes-gateway.service /etc/systemd/system/

sudo systemctl daemon-reload

# Enable on boot + start now
sudo systemctl enable --now hermes-gateway

# (Optional) also run the CLI agent as a background service
# sudo systemctl enable --now hermes-agent
```

### Adjust the unit file for your user/paths

Edit `/etc/systemd/system/hermes-gateway.service`:

```ini
[Service]
# Change User from 'hermes' to your own account (systemd has no inline comments)
User=youruser
WorkingDirectory=/home/youruser
EnvironmentFile=/home/youruser/.hermes/.env
ExecStart=/home/youruser/.local/bin/hermes gateway start --replace
```

Then:

```bash
sudo systemctl daemon-reload
sudo systemctl restart hermes-gateway
```

### Verify

```bash
systemctl status hermes-gateway
journalctl -u hermes-gateway -f
```

---

## 13. Monitoring & Logs

### Log locations

| Log | Location |
|-----|----------|
| Gateway (systemd) | `journalctl -u hermes-gateway` |
| Gateway (Docker) | `docker compose logs hermes` |
| Session trajectories | `~/.hermes/logs/session_*.json` |
| Deploy events | `~/.hermes/logs/deploy.log` |
| Runtime state | `~/.hermes/gateway_state.json` |

### Useful log commands

```bash
# Last 100 lines, follow
journalctl -u hermes-gateway -n 100 -f

# Errors only
journalctl -u hermes-gateway -p err --since today

# Docker: structured logs with timestamps
docker compose -f deploy/docker-compose.yml logs --timestamps hermes
```

### Alerting

Add a cron job on the host to page you if the health check fails:

```bash
# /etc/cron.d/hermes-healthcheck
# cron has no backslash line continuation, so the entry must stay on one line
* * * * * root curl -sf http://127.0.0.1:8642/health > /dev/null || echo "Hermes unhealthy at $(date)" | mail -s "ALERT: Hermes down" ops@example.com
```
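
A probe that runs every minute will page on a single transient failure, for instance during a planned restart. One common mitigation is to alert only after several consecutive failures. The sketch below shows the counting logic; `record_probe`, the state-file location, and the default threshold of 3 are all assumptions for illustration.

```bash
# Hypothetical helper, not part of Hermes: track consecutive probe failures in
# a state file and print "alert" once, when the threshold is first reached.
# Usage: record_probe RESULT STATE_FILE [THRESHOLD]   (RESULT is "ok" or "fail")
record_probe() {
    local result="$1" state_file="$2" threshold="${3:-3}" failures
    failures=$(cat "$state_file" 2>/dev/null)
    failures=${failures:-0}                 # missing or empty file counts as 0
    if [ "$result" = "ok" ]; then
        failures=0
    else
        failures=$((failures + 1))
    fi
    echo "$failures" > "$state_file"
    if [ "$failures" -eq "$threshold" ]; then
        echo "alert"
    fi
}
```

A wrapper script called from cron would run the `curl` probe, pass `ok` or `fail` to `record_probe`, and send mail only when it prints `alert`. Because the comparison is for equality, an outage produces one page instead of one per minute.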

---

## 14. Security Checklist

- [ ] `.env` has permissions `600` and is **not** tracked by git (`git ls-files .env` returns nothing).
- [ ] `API_SERVER_KEY` is set if the API server is exposed beyond `127.0.0.1`.
- [ ] API server is bound to `127.0.0.1` (not `0.0.0.0`) unless behind a TLS-terminating reverse proxy.
- [ ] Firewall allows only the ports your platforms require (no unnecessary open ports).
- [ ] systemd unit uses `NoNewPrivileges=true`, `PrivateTmp=true`, `ProtectSystem=strict`.
- [ ] Docker container has resource limits set (`deploy.resources.limits`).
- [ ] Backups of `~/.hermes/` are stored outside the server (e.g. S3, remote NAS).
- [ ] `hermes doctor` returns no errors on the running instance.
- [ ] `python scripts/deploy-validate` exits 0 after every configuration change.
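
The first checklist item is easy to automate. The sketch below checks that a file has exactly the expected permission bits using POSIX `find`; `has_exact_mode` is a hypothetical helper name, not a Hermes command.

```bash
# Hypothetical helper, not part of Hermes: succeed only if FILE exists and its
# permission bits are exactly MODE (default 600).
has_exact_mode() {
    local file="$1" mode="${2:-600}"
    # -prune stops find from descending; -perm MODE matches the bits exactly.
    [ -n "$(find "$file" -prune -perm "$mode" 2>/dev/null)" ]
}
```

For example, `has_exact_mode ~/.hermes/.env || echo "fix with: chmod 600 ~/.hermes/.env"` flags a world-readable secrets file before the gateway starts.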

---

## 15. Troubleshooting

### Gateway won't start

```bash
hermes gateway start --replace   # clears stale PID files

# Check for port conflicts
ss -tlnp | grep 8642

# Verbose logs
HERMES_LOG_LEVEL=DEBUG hermes gateway start
```

### Health check returns `gateway_state: "starting"` for more than 60 s

Platform adapters take time to authenticate (especially Telegram + Discord). Check logs for auth errors:

```bash
journalctl -u hermes-gateway --since "2 minutes ago" | grep -iE "error|token|auth"
```

### `/health` returns connection refused

The API server platform may not be enabled. Verify your gateway config (`~/.hermes/config.yaml`) includes:

```yaml
gateway:
  platforms:
    - api_server
```

### Rollback needed after failed update

See [Section 9](#9-rollback-procedure). If you backed up before updating, rollback takes < 5 minutes.

### Sessions lost after restart

Sessions are file-based in `~/.hermes/sessions/`. They persist across restarts. If they are gone, check:

```bash
ls -la ~/.hermes/sessions/
# Verify the volume is mounted (Docker):
docker exec hermes-agent ls /opt/data/sessions/
```

---

*This runbook is owned by the Bezalel epic backlog. Update it whenever deployment procedures change.*