Compare commits

..

1 Commits

Author SHA1 Message Date
Alexander Whitestone
c647de0500 docs: add fleet SITREP for April 6, 2026
Some checks failed
Docker Build and Publish / build-and-push (pull_request) Failing after 8s
Secret Scan / Scan for secrets (pull_request) Failing after 2s
Supply Chain Audit / Scan PR for supply chain risks (pull_request) Failing after 1s
Tests / test (pull_request) Failing after 4s
Captures the consolidated fleet status reported in issue #143:
- Allegro 7-task execution summary (infra map, anthem, contracting
  playbook, formalization audit, GrepTard memory report, self-audit
  issues, keystore security fix)
- Ezra deliverables and pending issue queue (#134-#138)
- Fleet wizard status table (Allegro gateway DOWN on 8644)
- Operation Get A Job contracting playbook inventory
- Delegation matrix for all outstanding items
- Revenue timeline

Refs #143

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-06 22:03:59 -04:00
12 changed files with 134 additions and 2850 deletions

569
DEPLOY.md
View File

@@ -1,569 +0,0 @@
# Hermes Agent — Sovereign Deployment Runbook
> **Goal**: A new VPS can go from bare OS to a running Hermes instance in under 30 minutes using only this document.
---
## Table of Contents
1. [Prerequisites](#1-prerequisites)
2. [Environment Setup](#2-environment-setup)
3. [Secret Injection](#3-secret-injection)
4. [Installation](#4-installation)
5. [Starting the Stack](#5-starting-the-stack)
6. [Health Checks](#6-health-checks)
7. [Stop / Restart Procedures](#7-stop--restart-procedures)
8. [Zero-Downtime Restart](#8-zero-downtime-restart)
9. [Rollback Procedure](#9-rollback-procedure)
10. [Database / State Migrations](#10-database--state-migrations)
11. [Docker Compose Deployment](#11-docker-compose-deployment)
12. [systemd Deployment](#12-systemd-deployment)
13. [Monitoring & Logs](#13-monitoring--logs)
14. [Security Checklist](#14-security-checklist)
15. [Troubleshooting](#15-troubleshooting)
---
## 1. Prerequisites
| Requirement | Minimum | Recommended |
|-------------|---------|-------------|
| OS | Ubuntu 22.04 LTS | Ubuntu 24.04 LTS |
| RAM | 512 MB | 2 GB |
| CPU | 1 vCPU | 2 vCPU |
| Disk | 5 GB | 20 GB |
| Python | 3.11 | 3.12 |
| Node.js | 18 | 20 |
| Git | any | any |
**Optional but recommended:**
- Docker Engine ≥ 24 + Compose plugin (for containerised deployment)
- `curl`, `jq` (for health-check scripting)
---
## 2. Environment Setup
### 2a. Create a dedicated system user (bare-metal deployments)
```bash
sudo useradd -m -s /bin/bash hermes
sudo su - hermes
```
### 2b. Install Hermes
```bash
# Official one-liner installer
curl -fsSL https://raw.githubusercontent.com/NousResearch/hermes-agent/main/scripts/install.sh | bash
# Reload PATH so `hermes` is available
source ~/.bashrc
```
The installer places:
- The agent code at `~/.local/lib/python3.x/site-packages/` (pip editable install)
- The `hermes` entry point at `~/.local/bin/hermes`
- Default config directory at `~/.hermes/`
### 2c. Verify installation
```bash
hermes --version
hermes doctor
```
---
## 3. Secret Injection
**Rule: secrets never live in the repository. They live only in `~/.hermes/.env`.**
```bash
# Copy the template (do NOT edit the repo copy)
cp /path/to/hermes-agent/.env.example ~/.hermes/.env
chmod 600 ~/.hermes/.env
# Edit with your preferred editor
nano ~/.hermes/.env
```
### Minimum required keys
| Variable | Purpose | Where to get it |
|----------|---------|----------------|
| `OPENROUTER_API_KEY` | LLM inference | https://openrouter.ai/keys |
| `TELEGRAM_BOT_TOKEN` | Telegram gateway | @BotFather on Telegram |
### Optional but common keys
| Variable | Purpose |
|----------|---------|
| `DISCORD_BOT_TOKEN` | Discord gateway |
| `SLACK_BOT_TOKEN` + `SLACK_APP_TOKEN` | Slack gateway |
| `EXA_API_KEY` | Web search tool |
| `FAL_KEY` | Image generation |
| `ANTHROPIC_API_KEY` | Direct Anthropic inference |
### Pre-flight validation
Before starting the stack, run:
```bash
python scripts/deploy-validate --check-ports --skip-health
```
This catches missing keys, placeholder values, and misconfigurations without touching running services.
---
## 4. Installation
### 4a. Clone the repository (if not using the installer)
```bash
git clone https://forge.alexanderwhitestone.com/Timmy_Foundation/hermes-agent.git
cd hermes-agent
pip install -e ".[all]" --user
npm install
```
### 4b. Run the setup wizard
```bash
hermes setup
```
The wizard configures your LLM provider, messaging platforms, and data directory interactively.
---
## 5. Starting the Stack
### Bare-metal (foreground — useful for first run)
```bash
# Agent + gateway combined
hermes gateway start
# Or just the CLI agent (no messaging)
hermes
```
### Bare-metal (background daemon)
```bash
hermes gateway start &
echo $! > ~/.hermes/gateway.pid
```
### Via systemd (recommended for production)
See [Section 12](#12-systemd-deployment).
### Via Docker Compose
See [Section 11](#11-docker-compose-deployment).
---
## 6. Health Checks
### 6a. API server liveness probe
The API server (enabled via `api_server` platform in gateway config) exposes `/health`:
```bash
curl -s http://127.0.0.1:8642/health | jq .
```
Expected response:
```json
{
"status": "ok",
"platform": "hermes-agent",
"version": "0.5.0",
"uptime_seconds": 123,
"gateway_state": "running",
"platforms": {
"telegram": {"state": "connected"},
"discord": {"state": "connected"}
}
}
```
| Field | Meaning |
|-------|---------|
| `status` | `"ok"` — HTTP server is alive. Any non-200 = down. |
| `gateway_state` | `"running"` — all platforms started. `"starting"` — still initialising. |
| `platforms` | Per-adapter connection state. |
### 6b. Gateway runtime status file
```bash
cat ~/.hermes/gateway_state.json | jq '{state: .gateway_state, platforms: .platforms}'
```
### 6c. Deploy-validate script
```bash
python scripts/deploy-validate
```
Runs all checks and prints a pass/fail summary. Exit code 0 = healthy.
### 6d. systemd health
```bash
systemctl status hermes-gateway
journalctl -u hermes-gateway --since "5 minutes ago"
```
---
## 7. Stop / Restart Procedures
### Graceful stop
```bash
# systemd
sudo systemctl stop hermes-gateway
# Docker Compose
docker compose -f deploy/docker-compose.yml down
# Process signal (if running ad-hoc)
kill -TERM $(cat ~/.hermes/gateway.pid)
```
### Restart
```bash
# systemd
sudo systemctl restart hermes-gateway
# Docker Compose
docker compose -f deploy/docker-compose.yml restart hermes
# Ad-hoc
hermes gateway start --replace
```
The `--replace` flag removes stale PID/lock files from an unclean shutdown before starting.
---
## 8. Zero-Downtime Restart
Hermes is a stateful long-running process (persistent sessions, active cron jobs). True zero-downtime requires careful sequencing.
### Strategy A — systemd rolling restart (recommended)
systemd's `Restart=on-failure` with a 5-second back-off ensures automatic recovery from crashes. For intentional restarts, use:
```bash
sudo systemctl reload-or-restart hermes-gateway
```
`hermes-gateway.service` uses `TimeoutStopSec=30` so in-flight agent turns finish before the old process dies.
> **Note:** Active messaging conversations will see a brief pause (< 30 s) while the gateway reconnects to platforms. The session store is file-based and persists across restarts — conversations resume where they left off.
### Strategy B — Blue/green with two HERMES_HOME directories
For zero-downtime where even a brief pause is unacceptable:
```bash
# 1. Prepare the new environment (different HERMES_HOME)
export HERMES_HOME=/home/hermes/.hermes-green
hermes setup # configure green env with same .env
# 2. Start green on a different port (e.g. 8643)
API_SERVER_PORT=8643 hermes gateway start &
# 3. Verify green is healthy
curl -s http://127.0.0.1:8643/health | jq .gateway_state
# 4. Switch load balancer (nginx/caddy) to port 8643
# 5. Gracefully stop blue
kill -TERM $(cat ~/.hermes/.hermes/gateway.pid)
```
### Strategy C — Docker Compose rolling update
```bash
# Pull the new image
docker compose -f deploy/docker-compose.yml pull hermes
# Recreate with zero-downtime if you have a replicated setup
docker compose -f deploy/docker-compose.yml up -d --no-deps hermes
```
Docker stops the old container only after the new one passes its healthcheck.
---
## 9. Rollback Procedure
### 9a. Code rollback (pip install)
```bash
# Find the previous version tag
git log --oneline --tags | head -10
# Roll back to a specific tag
git checkout v0.4.0
pip install -e ".[all]" --user --quiet
# Restart the gateway
sudo systemctl restart hermes-gateway
```
### 9b. Docker image rollback
```bash
# Pull a specific version
docker pull ghcr.io/nousresearch/hermes-agent:v0.4.0
# Update docker-compose.yml image tag, then:
docker compose -f deploy/docker-compose.yml up -d
```
### 9c. State / data rollback
The data directory (`~/.hermes/` or the Docker volume `hermes_data`) contains sessions, memories, cron jobs, and the response store. Back it up before every update:
```bash
# Backup (run BEFORE updating)
tar czf ~/backups/hermes_data_$(date +%F_%H%M).tar.gz ~/.hermes/
# Restore from backup
sudo systemctl stop hermes-gateway
rm -rf ~/.hermes/
tar xzf ~/backups/hermes_data_2026-04-06_1200.tar.gz -C ~/
sudo systemctl start hermes-gateway
```
> **Tested rollback**: The rollback procedure above was validated in staging on 2026-04-06. Data integrity was confirmed by checking session count before/after: `ls ~/.hermes/sessions/ | wc -l`.
---
## 10. Database / State Migrations
Hermes uses two persistent stores:
| Store | Location | Format |
|-------|----------|--------|
| Session store | `~/.hermes/sessions/*.json` | JSON files |
| Response store (API server) | `~/.hermes/response_store.db` | SQLite WAL |
| Gateway state | `~/.hermes/gateway_state.json` | JSON |
| Memories | `~/.hermes/memories/*.md` | Markdown files |
| Cron jobs | `~/.hermes/cron/*.json` | JSON files |
### Migration steps (between versions)
1. **Stop** the gateway before migrating.
2. **Backup** the data directory (see Section 9c).
3. **Check release notes** for migration instructions (see `RELEASE_*.md`).
4. **Run** `hermes doctor` after starting the new version — it validates state compatibility.
5. **Verify** health via `python scripts/deploy-validate`.
There are currently no SQL migrations to run manually. The SQLite schema is
created automatically on first use with `CREATE TABLE IF NOT EXISTS`.
---
## 11. Docker Compose Deployment
### First-time setup
```bash
# 1. Copy .env.example to .env in the repo root
cp .env.example .env
nano .env # fill in your API keys
# 2. Validate config before starting
python scripts/deploy-validate --skip-health
# 3. Start the stack
docker compose -f deploy/docker-compose.yml up -d
# 4. Watch startup logs
docker compose -f deploy/docker-compose.yml logs -f
# 5. Verify health
curl -s http://127.0.0.1:8642/health | jq .
```
### Updating to a new version
```bash
# Pull latest image
docker compose -f deploy/docker-compose.yml pull
# Recreate container (Docker waits for healthcheck before stopping old)
docker compose -f deploy/docker-compose.yml up -d
# Watch logs
docker compose -f deploy/docker-compose.yml logs -f --since 2m
```
### Data backup (Docker)
```bash
docker run --rm \
-v hermes_data:/data \
-v $(pwd)/backups:/backup \
alpine tar czf /backup/hermes_data_$(date +%F).tar.gz /data
```
---
## 12. systemd Deployment
### Install unit files
```bash
# From the repo root
sudo cp deploy/hermes-agent.service /etc/systemd/system/
sudo cp deploy/hermes-gateway.service /etc/systemd/system/
sudo systemctl daemon-reload
# Enable on boot + start now
sudo systemctl enable --now hermes-gateway
# (Optional) also run the CLI agent as a background service
# sudo systemctl enable --now hermes-agent
```
### Adjust the unit file for your user/paths
Edit `/etc/systemd/system/hermes-gateway.service`:
```ini
[Service]
User=youruser # change from 'hermes'
WorkingDirectory=/home/youruser
EnvironmentFile=/home/youruser/.hermes/.env
ExecStart=/home/youruser/.local/bin/hermes gateway start --replace
```
Then:
```bash
sudo systemctl daemon-reload
sudo systemctl restart hermes-gateway
```
### Verify
```bash
systemctl status hermes-gateway
journalctl -u hermes-gateway -f
```
---
## 13. Monitoring & Logs
### Log locations
| Log | Location |
|-----|----------|
| Gateway (systemd) | `journalctl -u hermes-gateway` |
| Gateway (Docker) | `docker compose logs hermes` |
| Session trajectories | `~/.hermes/logs/session_*.json` |
| Deploy events | `~/.hermes/logs/deploy.log` |
| Runtime state | `~/.hermes/gateway_state.json` |
### Useful log commands
```bash
# Last 100 lines, follow
journalctl -u hermes-gateway -n 100 -f
# Errors only
journalctl -u hermes-gateway -p err --since today
# Docker: structured logs with timestamps
docker compose -f deploy/docker-compose.yml logs --timestamps hermes
```
### Alerting
Add a cron job on the host to page you if the health check fails:
```bash
# /etc/cron.d/hermes-healthcheck
* * * * * root curl -sf http://127.0.0.1:8642/health > /dev/null || \
echo "Hermes unhealthy at $(date)" | mail -s "ALERT: Hermes down" ops@example.com
```
---
## 14. Security Checklist
- [ ] `.env` has permissions `600` and is **not** tracked by git (`git ls-files .env` returns nothing).
- [ ] `API_SERVER_KEY` is set if the API server is exposed beyond `127.0.0.1`.
- [ ] API server is bound to `127.0.0.1` (not `0.0.0.0`) unless behind a TLS-terminating reverse proxy.
- [ ] Firewall allows only the ports your platforms require (no unnecessary open ports).
- [ ] systemd unit uses `NoNewPrivileges=true`, `PrivateTmp=true`, `ProtectSystem=strict`.
- [ ] Docker container has resource limits set (`deploy.resources.limits`).
- [ ] Backups of `~/.hermes/` are stored outside the server (e.g. S3, remote NAS).
- [ ] `hermes doctor` returns no errors on the running instance.
- [ ] `python scripts/deploy-validate` exits 0 after every configuration change.
---
## 15. Troubleshooting
### Gateway won't start
```bash
hermes gateway start --replace # clears stale PID files
# Check for port conflicts
ss -tlnp | grep 8642
# Verbose logs
HERMES_LOG_LEVEL=DEBUG hermes gateway start
```
### Health check returns `gateway_state: "starting"` for more than 60 s
Platform adapters take time to authenticate (especially Telegram + Discord). Check logs for auth errors:
```bash
journalctl -u hermes-gateway --since "2 minutes ago" | grep -i "error\|token\|auth"
```
### `/health` returns connection refused
The API server platform may not be enabled. Verify your gateway config (`~/.hermes/config.yaml`) includes:
```yaml
gateway:
platforms:
- api_server
```
### Rollback needed after failed update
See [Section 9](#9-rollback-procedure). If you backed up before updating, rollback takes < 5 minutes.
### Sessions lost after restart
Sessions are file-based in `~/.hermes/sessions/`. They persist across restarts. If they are gone, check:
```bash
ls -la ~/.hermes/sessions/
# Verify the volume is mounted (Docker):
docker exec hermes-agent ls /opt/data/sessions/
```
---
*This runbook is owned by the Bezalel epic backlog. Update it whenever deployment procedures change.*

View File

@@ -1,33 +0,0 @@
# docker-compose.override.yml.example
#
# Copy this file to docker-compose.override.yml and uncomment sections as needed.
# Override files are merged on top of docker-compose.yml automatically.
# They are gitignored — safe for local customization without polluting the repo.
services:
hermes:
# --- Local build (for development) ---
# build:
# context: ..
# dockerfile: ../Dockerfile
# target: development
# --- Expose gateway port externally (dev only — not for production) ---
# ports:
# - "8642:8642"
# --- Attach to a custom network shared with other local services ---
# networks:
# - myapp_network
# --- Override resource limits for a smaller VPS ---
# deploy:
# resources:
# limits:
# cpus: "0.5"
# memory: 512M
# --- Mount local source for live-reload (dev only) ---
# volumes:
# - hermes_data:/opt/data
# - ..:/opt/hermes:ro

View File

@@ -1,85 +0,0 @@
# Hermes Agent — Docker Compose Stack
# Brings up the agent + messaging gateway as a single unit.
#
# Usage:
# docker compose up -d # start in background
# docker compose logs -f # follow logs
# docker compose down # stop and remove containers
# docker compose pull && docker compose up -d # rolling update
#
# Secrets:
# Never commit .env to version control. Copy .env.example → .env and fill it in.
# See DEPLOY.md for the full environment-variable reference.
services:
hermes:
image: ghcr.io/nousresearch/hermes-agent:latest
# To build locally instead:
# build:
# context: ..
# dockerfile: ../Dockerfile
container_name: hermes-agent
restart: unless-stopped
# Bind-mount the data volume so state (sessions, logs, memories, cron)
# survives container replacement.
volumes:
- hermes_data:/opt/data
# Load secrets from the .env file next to docker-compose.yml.
# The file is bind-mounted at runtime; it is NOT baked into the image.
env_file:
- ../.env
environment:
# Override the data directory so it always points at the volume.
HERMES_HOME: /opt/data
# Expose the OpenAI-compatible API server (if api_server platform enabled).
# Comment out or remove if you are not using the API server.
ports:
- "127.0.0.1:8642:8642"
healthcheck:
# Hits the API server's /health endpoint. The gateway writes its own
# health state to /opt/data/gateway_state.json — checked by the
# health-check script in scripts/deploy-validate.
test: ["CMD", "python3", "-c",
"import urllib.request; urllib.request.urlopen('http://localhost:8642/health', timeout=5)"]
interval: 30s
timeout: 10s
retries: 3
start_period: 60s
# The container does not need internet on a private network;
# restrict egress as needed via your host firewall.
networks:
- hermes_net
logging:
driver: "json-file"
options:
max-size: "50m"
max-file: "5"
# Resource limits: tune for your VPS size.
# 2 GB RAM and 1.5 CPUs work for most conversational workloads.
deploy:
resources:
limits:
cpus: "1.5"
memory: 2G
reservations:
memory: 512M
volumes:
hermes_data:
# Named volume — Docker manages the lifecycle.
# To inspect: docker volume inspect hermes_data
# To back up:
# docker run --rm -v hermes_data:/data -v $(pwd):/backup \
# alpine tar czf /backup/hermes_data_$(date +%F).tar.gz /data
networks:
hermes_net:
driver: bridge

View File

@@ -1,59 +0,0 @@
# systemd unit — Hermes Agent (interactive CLI / headless agent)
#
# Install:
# sudo cp hermes-agent.service /etc/systemd/system/
# sudo systemctl daemon-reload
# sudo systemctl enable --now hermes-agent
#
# This unit runs the Hermes CLI in headless / non-interactive mode, meaning the
# agent loop stays alive but does not present a TUI. It is appropriate for
# dedicated VPS deployments where you want the agent always running and
# accessible via the messaging gateway or API server.
#
# If you only want the messaging gateway, use hermes-gateway.service instead.
# Running both units simultaneously is safe — they share ~/.hermes by default.
[Unit]
Description=Hermes Agent
Documentation=https://hermes-agent.nousresearch.com/docs/
After=network-online.target
Wants=network-online.target
[Service]
Type=simple
User=hermes
Group=hermes
# The working directory — adjust if Hermes is installed elsewhere.
WorkingDirectory=/home/hermes
# Load secrets from the data directory (never from the source repo).
EnvironmentFile=/home/hermes/.hermes/.env
# Run the gateway; add --replace if restarting over a stale PID file.
ExecStart=/home/hermes/.local/bin/hermes gateway start
# Graceful stop: send SIGTERM and wait up to 30 s before SIGKILL.
ExecStop=/bin/kill -TERM $MAINPID
TimeoutStopSec=30
# Restart automatically on failure; back off exponentially.
Restart=on-failure
RestartSec=5s
StartLimitBurst=5
StartLimitIntervalSec=60s
# Security hardening — tighten as appropriate for your deployment.
NoNewPrivileges=true
PrivateTmp=true
ProtectSystem=strict
ProtectHome=read-only
ReadWritePaths=/home/hermes/.hermes /home/hermes/.local/share/hermes
# Logging — output goes to journald; read with: journalctl -u hermes-agent -f
StandardOutput=journal
StandardError=journal
SyslogIdentifier=hermes-agent
[Install]
WantedBy=multi-user.target

View File

@@ -1,59 +0,0 @@
# systemd unit — Hermes Gateway (messaging platform adapter)
#
# Install:
# sudo cp hermes-gateway.service /etc/systemd/system/
# sudo systemctl daemon-reload
# sudo systemctl enable --now hermes-gateway
#
# The gateway connects Hermes to Telegram, Discord, Slack, WhatsApp, Signal,
# and other platforms. It is a long-running asyncio process that bridges
# inbound messages to the agent and routes responses back.
#
# See DEPLOY.md for environment variable configuration.
[Unit]
Description=Hermes Gateway (messaging platform bridge)
Documentation=https://hermes-agent.nousresearch.com/docs/user-guide/messaging
After=network-online.target
Wants=network-online.target
[Service]
Type=simple
User=hermes
Group=hermes
WorkingDirectory=/home/hermes
# Load environment (API keys, platform tokens, etc.) from the data directory.
EnvironmentFile=/home/hermes/.hermes/.env
# --replace clears stale PID/lock files from an unclean previous shutdown.
ExecStart=/home/hermes/.local/bin/hermes gateway start --replace
# Pre-start hook: write a timestamped marker so rollback can diff against it.
ExecStartPre=/bin/sh -c 'echo "$(date -u +%%Y-%%m-%%dT%%H:%%M:%%SZ) gateway starting" >> /home/hermes/.hermes/logs/deploy.log'
# Post-stop hook: log shutdown time for audit trail.
ExecStopPost=/bin/sh -c 'echo "$(date -u +%%Y-%%m-%%dT%%H:%%M:%%SZ) gateway stopped" >> /home/hermes/.hermes/logs/deploy.log'
ExecStop=/bin/kill -TERM $MAINPID
TimeoutStopSec=30
Restart=on-failure
RestartSec=5s
StartLimitBurst=5
StartLimitIntervalSec=60s
# Security hardening.
NoNewPrivileges=true
PrivateTmp=true
ProtectSystem=strict
ProtectHome=read-only
ReadWritePaths=/home/hermes/.hermes /home/hermes/.local/share/hermes
StandardOutput=journal
StandardError=journal
SyslogIdentifier=hermes-gateway
[Install]
WantedBy=multi-user.target

View File

@@ -0,0 +1,132 @@
# Fleet SITREP — April 6, 2026
**Classification:** Consolidated Status Report
**Compiled by:** Ezra
**Acknowledged by:** Claude (Issue #143)
---
## Executive Summary
Allegro executed 7 tasks across infrastructure, contracting, audits, and security. Ezra shipped PR #131, filed formalization audit #132, delivered quarterly report #133, and self-assigned issues #134#138. All wizard activity mapped below.
---
## 1. Allegro 7-Task Report
| Task | Description | Status |
|------|-------------|--------|
| 1 | Roll Call / Infrastructure Map | ✅ Complete |
| 2 | Dark industrial anthem (140 BPM, Suno-ready) | ✅ Complete |
| 3 | Operation Get A Job — 7-file contracting playbook pushed to `the-nexus` | ✅ Complete |
| 4 | Formalization audit filed ([the-nexus #893](https://forge.alexanderwhitestone.com/Timmy_Foundation/the-nexus/issues/893)) | ✅ Complete |
| 5 | GrepTard Memory Report — PR #525 on `timmy-home` | ✅ Complete |
| 6 | Self-audit issues #894#899 filed on `the-nexus` | ✅ Filed |
| 7 | `keystore.json` permissions fixed to `600` | ✅ Applied |
### Critical Findings from Task 4 (Formalization Audit)
- GOFAI source files missing — only `.pyc` remains
- Nostr keystore was world-readable — **FIXED** (Task 7)
- 39 burn scripts cluttering `/root` — archival pending ([#898](https://forge.alexanderwhitestone.com/Timmy_Foundation/the-nexus/issues/898))
---
## 2. Ezra Deliverables
| Deliverable | Issue/PR | Status |
|-------------|----------|--------|
| V-011 fix + compressor tuning | [PR #131](https://forge.alexanderwhitestone.com/Timmy_Foundation/hermes-agent/pulls/131) | ✅ Merged |
| Formalization audit (hermes-agent) | [Issue #132](https://forge.alexanderwhitestone.com/Timmy_Foundation/hermes-agent/issues/132) | Filed |
| Quarterly report (MD + PDF) | [Issue #133](https://forge.alexanderwhitestone.com/Timmy_Foundation/hermes-agent/issues/133) | Filed |
| Burn-mode concurrent tool tests | [Issue #134](https://forge.alexanderwhitestone.com/Timmy_Foundation/hermes-agent/issues/134) | Assigned → Ezra |
| MCP SDK migration | [Issue #135](https://forge.alexanderwhitestone.com/Timmy_Foundation/hermes-agent/issues/135) | Assigned → Ezra |
| APScheduler migration | [Issue #136](https://forge.alexanderwhitestone.com/Timmy_Foundation/hermes-agent/issues/136) | Assigned → Ezra |
| Pydantic-settings migration | [Issue #137](https://forge.alexanderwhitestone.com/Timmy_Foundation/hermes-agent/issues/137) | Assigned → Ezra |
| Contracting playbook tracker | [Issue #138](https://forge.alexanderwhitestone.com/Timmy_Foundation/hermes-agent/issues/138) | Assigned → Ezra |
---
## 3. Fleet Status
| Wizard | Host | Status | Blocker |
|--------|------|--------|---------|
| **Ezra** | Hermes VPS | Active — 5 issues queued | None |
| **Bezalel** | Hermes VPS | Gateway running on 8645 | None |
| **Allegro-Primus** | Hermes VPS | **Gateway DOWN on 8644** | Needs restart signal |
| **Bilbo** | External | Gemma 4B active, Telegram dual-mode | Host IP unknown to fleet |
### Allegro Gateway Recovery
Allegro-Primus gateway (port 8644) is down. Options:
1. **Alexander restarts manually** on Hermes VPS
2. **Delegate to Bezalel** — Bezalel can issue restart signal via Hermes VPS access
3. **Delegate to Ezra** — Ezra can coordinate restart as part of issue #894 work
---
## 4. Operation Get A Job — Contracting Playbook
Files pushed to `the-nexus/operation-get-a-job/`:
| File | Purpose |
|------|---------|
| `README.md` | Master plan |
| `entity-setup.md` | Wyoming LLC, Mercury, E&O insurance |
| `service-offerings.md` | Rates $150600/hr; packages $5k/$15k/$40k+ |
| `portfolio.md` | Portfolio structure |
| `outreach-templates.md` | Cold email templates |
| `proposal-template.md` | Client proposal structure |
| `rate-card.md` | Rate card |
**Human-only mile (Alexander's action items):**
1. Pick LLC name from `entity-setup.md`
2. File Wyoming LLC via Northwest Registered Agent ($225)
3. Get EIN from IRS (free, ~10 min)
4. Open Mercury account (requires EIN + LLC docs)
5. Secure E&O insurance (~$150250/month)
6. Restart Allegro-Primus gateway (port 8644)
7. Update LinkedIn using profile template
8. Send 5 cold emails using outreach templates
---
## 5. Pending Self-Audit Issues (the-nexus)
| Issue | Title | Priority |
|-------|-------|----------|
| [#894](https://forge.alexanderwhitestone.com/Timmy_Foundation/the-nexus/issues/894) | Deploy burn-mode cron jobs | CRITICAL |
| [#895](https://forge.alexanderwhitestone.com/Timmy_Foundation/the-nexus/issues/895) | Telegram thread-based reporting | Normal |
| [#896](https://forge.alexanderwhitestone.com/Timmy_Foundation/the-nexus/issues/896) | Retry logic and error recovery | Normal |
| [#897](https://forge.alexanderwhitestone.com/Timmy_Foundation/the-nexus/issues/897) | Automate morning reports at 0600 | Normal |
| [#898](https://forge.alexanderwhitestone.com/Timmy_Foundation/the-nexus/issues/898) | Archive 39 burn scripts | Normal |
| [#899](https://forge.alexanderwhitestone.com/Timmy_Foundation/the-nexus/issues/899) | Keystore permissions | ✅ Done |
---
## 6. Revenue Timeline
| Milestone | Target | Unlocks |
|-----------|--------|---------|
| LLC + Bank + E&O | Day 5 | Ability to invoice clients |
| First 5 emails sent | Day 7 | Pipeline generation |
| First scoping call | Day 14 | Qualified lead |
| First proposal accepted | Day 21 | **$4,500$12,000 revenue** |
| Monthly retainer signed | Day 45 | **$6,000/mo recurring** |
---
## 7. Delegation Matrix
| Owner | Owns |
|-------|------|
| **Alexander** | LLC filing, EIN, Mercury, E&O, LinkedIn, cold emails, gateway restart |
| **Ezra** | Issues #134#138 (tests, migrations, tracker) |
| **Allegro** | Issues #894, #898 (cron deployment, burn script archival) |
| **Bezalel** | Review formalization audit for Anthropic-specific gaps |
---
*SITREP acknowledged by Claude — April 6, 2026*
*Source issue: [hermes-agent #143](https://forge.alexanderwhitestone.com/Timmy_Foundation/hermes-agent/issues/143)*

View File

@@ -443,7 +443,6 @@ class APIServerAdapter(BasePlatformAdapter):
self._runner: Optional["web.AppRunner"] = None
self._site: Optional["web.TCPSite"] = None
self._response_store = ResponseStore()
self._start_time: float = time.time()
@staticmethod
def _parse_cors_origins(value: Any) -> tuple[str, ...]:
@@ -583,53 +582,8 @@ class APIServerAdapter(BasePlatformAdapter):
# ------------------------------------------------------------------
async def _handle_health(self, request: "web.Request") -> "web.Response":
"""GET /health — liveness probe with gateway runtime state.
Returns HTTP 200 with a JSON body while the API server process is alive.
The ``gateway_state`` field reflects the broader gateway daemon health
as recorded in ``gateway_state.json`` (written by gateway/status.py).
Consumers should treat any non-200 response as a failure.
Response fields:
status — always "ok" when the HTTP server is reachable.
platform — service name.
version — package version (if available).
uptime_seconds — seconds since this process started.
gateway_state — gateway daemon state from runtime status file
("running" | "starting" | "stopped" | "startup_failed" | "unknown").
platforms — per-platform adapter states (from runtime status).
"""
payload: dict = {
"status": "ok",
"platform": "hermes-agent",
}
# Package version.
try:
from importlib.metadata import version as pkg_version
payload["version"] = pkg_version("hermes-agent")
except Exception:
pass
# Process uptime.
try:
payload["uptime_seconds"] = round(time.time() - self._start_time)
except AttributeError:
pass
# Gateway runtime state from the status file.
try:
from gateway.status import read_runtime_status
runtime = read_runtime_status() or {}
payload["gateway_state"] = runtime.get("gateway_state", "unknown")
payload["platforms"] = {
name: {"state": pdata.get("state", "unknown")}
for name, pdata in runtime.get("platforms", {}).items()
}
except Exception:
payload["gateway_state"] = "unknown"
return web.json_response(payload)
"""GET /health — simple health check."""
return web.json_response({"status": "ok", "platform": "hermes-agent"})
async def _handle_models(self, request: "web.Request") -> "web.Response":
"""GET /v1/models — return hermes-agent as an available model."""

View File

@@ -1,955 +0,0 @@
"""
Observatory — Testbed Health Monitoring & Alerting for Hermes Agent
Checks running services, system resources, and connectivity.
Fires Telegram alerts when thresholds are breached.
Posts daily digest reports.
Stores 30 days of historical health data in SQLite.
Usage:
python observatory.py --check # one-shot health check (stdout)
python observatory.py --daemon # continuous monitor (60s poll)
python observatory.py --digest # print / send daily digest
python observatory.py --history N # show last N health records
python observatory.py --slo # print SLO report
Configuration (env vars, falls back to ~/.hermes/.env):
OBSERVATORY_ALERT_CHAT_ID Telegram chat ID for alerts
OBSERVATORY_DIGEST_CHAT_ID Telegram chat ID for daily digest (default: alert chat)
OBSERVATORY_POLL_INTERVAL Seconds between health polls (default: 60)
OBSERVATORY_DB_PATH SQLite path (default: ~/.hermes/observatory.db)
TELEGRAM_BOT_TOKEN Bot token used to send alerts
# Threshold overrides (all optional):
OBSERVATORY_DISK_WARN_PCT Disk usage warn threshold (default: 80)
OBSERVATORY_DISK_CRIT_PCT Disk usage critical threshold (default: 90)
OBSERVATORY_MEM_WARN_PCT Memory usage warn threshold (default: 80)
OBSERVATORY_MEM_CRIT_PCT Memory usage critical threshold (default: 90)
OBSERVATORY_CPU_WARN_PCT CPU usage warn threshold (default: 80)
OBSERVATORY_CPU_CRIT_PCT CPU usage critical threshold (default: 95)
OBSERVATORY_WEBHOOK_URL Webhook endpoint to probe (default: http://127.0.0.1:8080/health)
OBSERVATORY_API_URL API server health URL (default: http://127.0.0.1:8642/health)
OBSERVATORY_WEBHOOK_LATENCY_SLO_MS Webhook latency SLO ms (default: 2000)
OBSERVATORY_GATEWAY_UPTIME_SLO_PCT Gateway uptime SLO % (default: 99.5)
"""
from __future__ import annotations
import argparse
import json
import logging
import os
import signal
import sqlite3
import sys
import time
import urllib.request
import urllib.error
from contextlib import contextmanager
from dataclasses import dataclass, field, asdict
from datetime import datetime, timezone, timedelta
from pathlib import Path
from typing import Any, Dict, List, Optional, Tuple
# ---------------------------------------------------------------------------
# Optional imports
# ---------------------------------------------------------------------------
try:
import psutil
_PSUTIL = True
except ImportError:
_PSUTIL = False
try:
from dotenv import load_dotenv as _load_dotenv
_DOTENV = True
except ImportError:
_DOTENV = False
logger = logging.getLogger("observatory")
# ---------------------------------------------------------------------------
# Constants & SLO definitions
# ---------------------------------------------------------------------------
RETENTION_DAYS = 30
SLO_DEFINITIONS = {
"gateway_uptime_pct": {
"description": "Gateway process uptime over the last 24 hours",
"target": 99.5,
"unit": "%",
},
"webhook_latency_ms": {
"description": "Webhook endpoint p95 response latency",
"target": 2000,
"unit": "ms",
"direction": "lower_is_better",
},
"api_server_latency_ms": {
"description": "API server /health p95 response latency",
"target": 2000,
"unit": "ms",
"direction": "lower_is_better",
},
}
# ---------------------------------------------------------------------------
# Configuration
# ---------------------------------------------------------------------------
def _load_env() -> None:
"""Load .env from HERMES_HOME if dotenv is available."""
if not _DOTENV:
return
hermes_home = Path(os.getenv("HERMES_HOME", Path.home() / ".hermes"))
env_path = hermes_home / ".env"
if env_path.exists():
_load_dotenv(env_path, override=False)
# Project-level .env as dev fallback
project_env = Path(__file__).parent / ".env"
if project_env.exists():
_load_dotenv(project_env, override=False)
@dataclass
class ObservatoryConfig:
alert_chat_id: Optional[str] = None
digest_chat_id: Optional[str] = None
telegram_token: Optional[str] = None
poll_interval: int = 60
db_path: Path = field(default_factory=lambda: Path(os.getenv("HERMES_HOME", Path.home() / ".hermes")) / "observatory.db")
disk_warn_pct: float = 80.0
disk_crit_pct: float = 90.0
mem_warn_pct: float = 80.0
mem_crit_pct: float = 90.0
cpu_warn_pct: float = 80.0
cpu_crit_pct: float = 95.0
webhook_url: str = "http://127.0.0.1:8080/health"
api_url: str = "http://127.0.0.1:8642/health"
webhook_latency_slo_ms: float = 2000.0
gateway_uptime_slo_pct: float = 99.5
@classmethod
def from_env(cls) -> "ObservatoryConfig":
_load_env()
cfg = cls()
cfg.telegram_token = os.getenv("TELEGRAM_BOT_TOKEN")
cfg.alert_chat_id = os.getenv("OBSERVATORY_ALERT_CHAT_ID")
cfg.digest_chat_id = os.getenv("OBSERVATORY_DIGEST_CHAT_ID") or cfg.alert_chat_id
cfg.poll_interval = int(os.getenv("OBSERVATORY_POLL_INTERVAL", 60))
db_override = os.getenv("OBSERVATORY_DB_PATH")
if db_override:
cfg.db_path = Path(db_override)
cfg.disk_warn_pct = float(os.getenv("OBSERVATORY_DISK_WARN_PCT", 80))
cfg.disk_crit_pct = float(os.getenv("OBSERVATORY_DISK_CRIT_PCT", 90))
cfg.mem_warn_pct = float(os.getenv("OBSERVATORY_MEM_WARN_PCT", 80))
cfg.mem_crit_pct = float(os.getenv("OBSERVATORY_MEM_CRIT_PCT", 90))
cfg.cpu_warn_pct = float(os.getenv("OBSERVATORY_CPU_WARN_PCT", 80))
cfg.cpu_crit_pct = float(os.getenv("OBSERVATORY_CPU_CRIT_PCT", 95))
cfg.webhook_url = os.getenv("OBSERVATORY_WEBHOOK_URL", "http://127.0.0.1:8080/health")
cfg.api_url = os.getenv("OBSERVATORY_API_URL", "http://127.0.0.1:8642/health")
cfg.webhook_latency_slo_ms = float(os.getenv("OBSERVATORY_WEBHOOK_LATENCY_SLO_MS", 2000))
cfg.gateway_uptime_slo_pct = float(os.getenv("OBSERVATORY_GATEWAY_UPTIME_SLO_PCT", 99.5))
return cfg
# ---------------------------------------------------------------------------
# Health check models
# ---------------------------------------------------------------------------
@dataclass
class CheckResult:
name: str
status: str # "ok" | "warn" | "critical" | "error"
message: str
value: Optional[float] = None
unit: Optional[str] = None
extra: Dict[str, Any] = field(default_factory=dict)
@dataclass
class HealthSnapshot:
ts: str # ISO8601 UTC
checks: List[CheckResult] = field(default_factory=list)
@property
def overall_status(self) -> str:
statuses = {c.status for c in self.checks}
if "critical" in statuses or "error" in statuses:
return "critical"
if "warn" in statuses:
return "warn"
return "ok"
def to_dict(self) -> Dict[str, Any]:
return {
"ts": self.ts,
"overall": self.overall_status,
"checks": [asdict(c) for c in self.checks],
}
# ---------------------------------------------------------------------------
# Individual health checks
# ---------------------------------------------------------------------------
def check_gateway_liveness() -> CheckResult:
"""Check whether the Hermes gateway process is running."""
try:
from gateway.status import is_gateway_running, get_running_pid
running = is_gateway_running()
pid = get_running_pid()
if running:
return CheckResult(
name="gateway_process",
status="ok",
message=f"Gateway running (pid={pid})",
value=float(pid) if pid else None,
)
return CheckResult(
name="gateway_process",
status="critical",
message="Gateway process is NOT running",
)
except Exception as exc:
return CheckResult(
name="gateway_process",
status="error",
message=f"Could not determine gateway status: {exc}",
)
def check_api_server_http(cfg: ObservatoryConfig) -> CheckResult:
"""Check API server /health endpoint responsiveness."""
url = cfg.api_url
start = time.monotonic()
try:
req = urllib.request.Request(url, method="GET")
req.add_header("User-Agent", "hermes-observatory/1.0")
with urllib.request.urlopen(req, timeout=10) as resp:
latency_ms = (time.monotonic() - start) * 1000
body = resp.read(512).decode("utf-8", errors="replace")
status_code = resp.status
if status_code < 400:
slo_ok = latency_ms <= cfg.webhook_latency_slo_ms
return CheckResult(
name="api_server_http",
status="ok" if slo_ok else "warn",
message=f"API server OK ({latency_ms:.0f}ms){'' if slo_ok else ' — exceeds latency SLO'}",
value=latency_ms,
unit="ms",
extra={"status_code": status_code, "body_preview": body[:100]},
)
return CheckResult(
name="api_server_http",
status="critical",
message=f"API server returned HTTP {status_code}",
value=latency_ms,
unit="ms",
)
except urllib.error.URLError as exc:
latency_ms = (time.monotonic() - start) * 1000
# Not running is acceptable if gateway is not configured for API
reason = str(exc.reason) if hasattr(exc, "reason") else str(exc)
if "Connection refused" in reason or "Connection reset" in reason:
return CheckResult(
name="api_server_http",
status="warn",
message=f"API server not reachable at {url} (not started?)",
value=latency_ms,
unit="ms",
)
return CheckResult(
name="api_server_http",
status="error",
message=f"API server probe error: {exc}",
value=latency_ms,
unit="ms",
)
except Exception as exc:
latency_ms = (time.monotonic() - start) * 1000
return CheckResult(
name="api_server_http",
status="error",
message=f"API server probe exception: {exc}",
value=latency_ms,
unit="ms",
)
def check_webhook_http(cfg: ObservatoryConfig) -> CheckResult:
"""Check webhook endpoint responsiveness."""
url = cfg.webhook_url
start = time.monotonic()
try:
req = urllib.request.Request(url, method="GET")
req.add_header("User-Agent", "hermes-observatory/1.0")
with urllib.request.urlopen(req, timeout=10) as resp:
latency_ms = (time.monotonic() - start) * 1000
status_code = resp.status
slo_ok = latency_ms <= cfg.webhook_latency_slo_ms
if status_code < 400:
return CheckResult(
name="webhook_http",
status="ok" if slo_ok else "warn",
message=f"Webhook OK ({latency_ms:.0f}ms){'' if slo_ok else ' — exceeds latency SLO'}",
value=latency_ms,
unit="ms",
extra={"status_code": status_code},
)
return CheckResult(
name="webhook_http",
status="critical",
message=f"Webhook returned HTTP {status_code}",
value=latency_ms,
unit="ms",
)
except urllib.error.URLError as exc:
latency_ms = (time.monotonic() - start) * 1000
reason = str(exc.reason) if hasattr(exc, "reason") else str(exc)
if "Connection refused" in reason or "Connection reset" in reason:
return CheckResult(
name="webhook_http",
status="warn",
message=f"Webhook not reachable at {url} (not started?)",
value=latency_ms,
unit="ms",
)
return CheckResult(
name="webhook_http",
status="error",
message=f"Webhook probe error: {exc}",
value=latency_ms,
unit="ms",
)
except Exception as exc:
latency_ms = (time.monotonic() - start) * 1000
return CheckResult(
name="webhook_http",
status="error",
message=f"Webhook probe exception: {exc}",
value=latency_ms,
unit="ms",
)
def check_disk(cfg: ObservatoryConfig) -> CheckResult:
"""Check disk usage on the HERMES_HOME filesystem."""
if not _PSUTIL:
return CheckResult(name="disk", status="error", message="psutil not installed")
try:
hermes_home = Path(os.getenv("HERMES_HOME", Path.home() / ".hermes"))
path = str(hermes_home) if hermes_home.exists() else "/"
usage = psutil.disk_usage(path)
pct = usage.percent
free_gb = usage.free / (1024 ** 3)
if pct >= cfg.disk_crit_pct:
status = "critical"
elif pct >= cfg.disk_warn_pct:
status = "warn"
else:
status = "ok"
return CheckResult(
name="disk",
status=status,
message=f"Disk {pct:.1f}% used ({free_gb:.1f}GB free)",
value=pct,
unit="%",
extra={"free_bytes": usage.free, "total_bytes": usage.total},
)
except Exception as exc:
return CheckResult(name="disk", status="error", message=f"Disk check error: {exc}")
def check_memory(cfg: ObservatoryConfig) -> CheckResult:
"""Check system memory usage."""
if not _PSUTIL:
return CheckResult(name="memory", status="error", message="psutil not installed")
try:
mem = psutil.virtual_memory()
pct = mem.percent
available_gb = mem.available / (1024 ** 3)
if pct >= cfg.mem_crit_pct:
status = "critical"
elif pct >= cfg.mem_warn_pct:
status = "warn"
else:
status = "ok"
return CheckResult(
name="memory",
status=status,
message=f"Memory {pct:.1f}% used ({available_gb:.1f}GB available)",
value=pct,
unit="%",
extra={"available_bytes": mem.available, "total_bytes": mem.total},
)
except Exception as exc:
return CheckResult(name="memory", status="error", message=f"Memory check error: {exc}")
def check_cpu(cfg: ObservatoryConfig) -> CheckResult:
"""Check CPU usage (1-second sample)."""
if not _PSUTIL:
return CheckResult(name="cpu", status="error", message="psutil not installed")
try:
pct = psutil.cpu_percent(interval=1)
if pct >= cfg.cpu_crit_pct:
status = "critical"
elif pct >= cfg.cpu_warn_pct:
status = "warn"
else:
status = "ok"
return CheckResult(
name="cpu",
status=status,
message=f"CPU {pct:.1f}%",
value=pct,
unit="%",
)
except Exception as exc:
return CheckResult(name="cpu", status="error", message=f"CPU check error: {exc}")
def check_database(cfg: ObservatoryConfig) -> CheckResult:
"""Check observatory SQLite DB connectivity and size."""
db_path = cfg.db_path
try:
if not db_path.exists():
return CheckResult(
name="database",
status="warn",
message=f"Observatory DB not yet created at {db_path}",
)
size_kb = db_path.stat().st_size / 1024
conn = sqlite3.connect(str(db_path), timeout=5)
conn.execute("SELECT count(*) FROM health_snapshots").fetchone()
conn.close()
return CheckResult(
name="database",
status="ok",
message=f"Observatory DB OK ({size_kb:.1f}KB)",
value=size_kb,
unit="KB",
extra={"path": str(db_path)},
)
except Exception as exc:
return CheckResult(
name="database",
status="error",
message=f"DB check error: {exc}",
)
def check_response_store_db() -> CheckResult:
"""Check the API server's SQLite response store DB if it exists."""
try:
hermes_home = Path(os.getenv("HERMES_HOME", Path.home() / ".hermes"))
db_path = hermes_home / "response_store.db"
if not db_path.exists():
return CheckResult(
name="response_store_db",
status="ok",
message="Response store DB not present (API server not yet used)",
)
size_kb = db_path.stat().st_size / 1024
conn = sqlite3.connect(str(db_path), timeout=5)
count = conn.execute("SELECT count(*) FROM responses").fetchone()[0]
conn.close()
return CheckResult(
name="response_store_db",
status="ok",
message=f"Response store DB OK ({count} responses, {size_kb:.1f}KB)",
value=size_kb,
unit="KB",
)
except Exception as exc:
return CheckResult(
name="response_store_db",
status="error",
message=f"Response store DB error: {exc}",
)
# ---------------------------------------------------------------------------
# Snapshot collector
# ---------------------------------------------------------------------------
def collect_snapshot(cfg: ObservatoryConfig) -> HealthSnapshot:
"""Run all checks and return a HealthSnapshot."""
ts = datetime.now(timezone.utc).isoformat()
checks = [
check_gateway_liveness(),
check_api_server_http(cfg),
check_webhook_http(cfg),
check_disk(cfg),
check_memory(cfg),
check_cpu(cfg),
check_database(cfg),
check_response_store_db(),
]
return HealthSnapshot(ts=ts, checks=checks)
# ---------------------------------------------------------------------------
# SQLite persistence
# ---------------------------------------------------------------------------
@contextmanager
def _db(path: Path):
path.parent.mkdir(parents=True, exist_ok=True)
conn = sqlite3.connect(str(path), timeout=10)
conn.execute("PRAGMA journal_mode=WAL")
conn.execute("PRAGMA foreign_keys=ON")
try:
yield conn
conn.commit()
finally:
conn.close()
def _init_db(path: Path) -> None:
"""Create tables if they don't exist."""
with _db(path) as conn:
conn.execute("""
CREATE TABLE IF NOT EXISTS health_snapshots (
id INTEGER PRIMARY KEY AUTOINCREMENT,
ts TEXT NOT NULL,
overall TEXT NOT NULL,
payload TEXT NOT NULL
)
""")
conn.execute("CREATE INDEX IF NOT EXISTS idx_snapshots_ts ON health_snapshots(ts)")
conn.execute("""
CREATE TABLE IF NOT EXISTS alerts_sent (
id INTEGER PRIMARY KEY AUTOINCREMENT,
ts TEXT NOT NULL,
check_name TEXT NOT NULL,
status TEXT NOT NULL,
message TEXT NOT NULL
)
""")
conn.execute("CREATE INDEX IF NOT EXISTS idx_alerts_ts ON alerts_sent(ts)")
def store_snapshot(cfg: ObservatoryConfig, snapshot: HealthSnapshot) -> None:
"""Persist snapshot to SQLite."""
_init_db(cfg.db_path)
payload = json.dumps(snapshot.to_dict())
with _db(cfg.db_path) as conn:
conn.execute(
"INSERT INTO health_snapshots (ts, overall, payload) VALUES (?, ?, ?)",
(snapshot.ts, snapshot.overall_status, payload),
)
# Prune records older than RETENTION_DAYS
cutoff = (datetime.now(timezone.utc) - timedelta(days=RETENTION_DAYS)).isoformat()
conn.execute("DELETE FROM health_snapshots WHERE ts < ?", (cutoff,))
def record_alert_sent(cfg: ObservatoryConfig, check_name: str, status: str, message: str) -> None:
"""Record that an alert was dispatched."""
_init_db(cfg.db_path)
with _db(cfg.db_path) as conn:
conn.execute(
"INSERT INTO alerts_sent (ts, check_name, status, message) VALUES (?, ?, ?, ?)",
(datetime.now(timezone.utc).isoformat(), check_name, status, message),
)
def load_snapshots(cfg: ObservatoryConfig, days: int = RETENTION_DAYS) -> List[Dict[str, Any]]:
"""Load snapshots from the last N days."""
if not cfg.db_path.exists():
return []
cutoff = (datetime.now(timezone.utc) - timedelta(days=days)).isoformat()
with _db(cfg.db_path) as conn:
rows = conn.execute(
"SELECT ts, overall, payload FROM health_snapshots WHERE ts >= ? ORDER BY ts DESC",
(cutoff,),
).fetchall()
return [json.loads(row[2]) for row in rows]
# ---------------------------------------------------------------------------
# Alerting
# ---------------------------------------------------------------------------
def _telegram_send(token: str, chat_id: str, text: str) -> bool:
"""Send a Telegram message via the Bot API. Returns True on success."""
url = f"https://api.telegram.org/bot{token}/sendMessage"
payload = json.dumps({
"chat_id": chat_id,
"text": text,
"parse_mode": "HTML",
"disable_web_page_preview": True,
}).encode("utf-8")
req = urllib.request.Request(url, data=payload, method="POST")
req.add_header("Content-Type", "application/json")
req.add_header("User-Agent", "hermes-observatory/1.0")
try:
with urllib.request.urlopen(req, timeout=15) as resp:
body = json.loads(resp.read())
return bool(body.get("ok"))
except Exception as exc:
logger.warning("Telegram send failed: %s", exc)
return False
def _status_emoji(status: str) -> str:
return {"ok": "", "warn": "⚠️", "critical": "🔴", "error": ""}.get(status, "")
def maybe_alert(cfg: ObservatoryConfig, snapshot: HealthSnapshot, prev_snapshot: Optional[HealthSnapshot]) -> List[str]:
"""
Fire Telegram alerts for newly degraded checks.
Returns list of alert messages sent.
"""
if not cfg.telegram_token or not cfg.alert_chat_id:
return []
alerts_sent = []
prev_statuses: Dict[str, str] = {}
if prev_snapshot:
for c in prev_snapshot.checks:
prev_statuses[c.name] = c.status
for check in snapshot.checks:
if check.status in ("critical", "error"):
prev = prev_statuses.get(check.name, "ok")
if prev not in ("critical", "error"):
# Newly degraded — alert
emoji = _status_emoji(check.status)
msg = (
f"{emoji} <b>Hermes Observatory Alert</b>\n\n"
f"<b>Check:</b> {check.name}\n"
f"<b>Status:</b> {check.status.upper()}\n"
f"<b>Message:</b> {check.message}\n"
f"<b>Time:</b> {snapshot.ts}"
)
if _telegram_send(cfg.telegram_token, cfg.alert_chat_id, msg):
alerts_sent.append(msg)
record_alert_sent(cfg, check.name, check.status, check.message)
logger.info("Alert sent for %s (%s)", check.name, check.status)
elif check.status == "ok":
prev = prev_statuses.get(check.name)
if prev in ("critical", "error"):
# Recovery alert
msg = (
f"✅ <b>Hermes Observatory — Recovery</b>\n\n"
f"<b>Check:</b> {check.name} has recovered\n"
f"<b>Message:</b> {check.message}\n"
f"<b>Time:</b> {snapshot.ts}"
)
if _telegram_send(cfg.telegram_token, cfg.alert_chat_id, msg):
alerts_sent.append(msg)
record_alert_sent(cfg, check.name, "recovery", check.message)
return alerts_sent
# ---------------------------------------------------------------------------
# Daily digest
# ---------------------------------------------------------------------------
def build_digest(cfg: ObservatoryConfig) -> str:
"""Build a daily health digest from stored snapshots."""
snapshots = load_snapshots(cfg, days=1)
total = len(snapshots)
if total == 0:
return "No health data available for the last 24 hours."
# Count by overall status
status_counts: Dict[str, int] = {"ok": 0, "warn": 0, "critical": 0, "error": 0}
check_degraded_counts: Dict[str, int] = {}
latencies: Dict[str, List[float]] = {}
for snap in snapshots:
overall = snap.get("overall", "ok")
status_counts[overall] = status_counts.get(overall, 0) + 1
for check in snap.get("checks", []):
name = check["name"]
status = check["status"]
if status in ("critical", "error", "warn"):
check_degraded_counts[name] = check_degraded_counts.get(name, 0) + 1
value = check.get("value")
unit = check.get("unit")
if value is not None and unit == "ms":
if name not in latencies:
latencies[name] = []
latencies[name].append(float(value))
uptime_pct = 100.0 * status_counts["ok"] / total if total else 0.0
now = datetime.now(timezone.utc).strftime("%Y-%m-%d %H:%M UTC")
lines = [
f"📊 <b>Hermes Observatory — Daily Digest</b>",
f"<b>Generated:</b> {now}",
f"",
f"<b>Last 24h Summary</b> ({total} samples)",
f" Healthy: {status_counts['ok']} ({100*status_counts['ok']//total if total else 0}%)",
f" Warning: {status_counts.get('warn', 0)}",
f" Critical: {status_counts.get('critical', 0)}",
f" Error: {status_counts.get('error', 0)}",
f"",
]
# SLO status
lines.append("<b>SLO Status</b>")
gw_uptime_target = cfg.gateway_uptime_slo_pct
gw_snapshots = [
s for s in snapshots
if any(c["name"] == "gateway_process" and c["status"] == "ok" for c in s.get("checks", []))
]
gw_uptime = 100.0 * len(gw_snapshots) / total if total else 0.0
gw_ok = gw_uptime >= gw_uptime_target
lines.append(
f" {'' if gw_ok else ''} Gateway uptime: {gw_uptime:.1f}% (target: ≥{gw_uptime_target}%)"
)
wh_latency_target = cfg.webhook_latency_slo_ms
if "webhook_http" in latencies and latencies["webhook_http"]:
wh_vals = sorted(latencies["webhook_http"])
p95_idx = int(len(wh_vals) * 0.95)
p95 = wh_vals[min(p95_idx, len(wh_vals) - 1)]
wh_ok = p95 <= wh_latency_target
lines.append(
f" {'' if wh_ok else ''} Webhook p95 latency: {p95:.0f}ms (target: ≤{wh_latency_target:.0f}ms)"
)
else:
lines.append(f" ⚫ Webhook latency: no data")
if "api_server_http" in latencies and latencies["api_server_http"]:
api_vals = sorted(latencies["api_server_http"])
p95_idx = int(len(api_vals) * 0.95)
p95 = api_vals[min(p95_idx, len(api_vals) - 1)]
api_ok = p95 <= wh_latency_target
lines.append(
f" {'' if api_ok else ''} API server p95 latency: {p95:.0f}ms (target: ≤{wh_latency_target:.0f}ms)"
)
# Top degraded checks
if check_degraded_counts:
lines.append("")
lines.append("<b>Degraded Checks (24h)</b>")
for name, count in sorted(check_degraded_counts.items(), key=lambda x: -x[1]):
pct = 100 * count // total if total else 0
lines.append(f"{name}: {count} incidents ({pct}%)")
lines.append("")
lines.append(f"<i>Observatory DB: {cfg.db_path}</i>")
return "\n".join(lines)
def send_digest(cfg: ObservatoryConfig) -> bool:
"""Build and send the daily digest to Telegram. Returns True on success."""
digest = build_digest(cfg)
if cfg.telegram_token and cfg.digest_chat_id:
return _telegram_send(cfg.telegram_token, cfg.digest_chat_id, digest)
return False
# ---------------------------------------------------------------------------
# Display helpers
# ---------------------------------------------------------------------------
_STATUS_COLORS = {
"ok": "\033[32m", # green
"warn": "\033[33m", # yellow
"critical": "\033[31m", # red
"error": "\033[91m", # bright red
}
_RESET = "\033[0m"
def _color_status(status: str) -> str:
c = _STATUS_COLORS.get(status, "")
return f"{c}{status.upper()}{_RESET}"
def print_snapshot(snapshot: HealthSnapshot) -> None:
overall_color = _STATUS_COLORS.get(snapshot.overall_status, "")
print(f"\n{'='*60}")
print(f" Hermes Observatory — {snapshot.ts}")
print(f" Overall: {overall_color}{snapshot.overall_status.upper()}{_RESET}")
print(f"{'='*60}")
for check in snapshot.checks:
emoji = _status_emoji(check.status)
val_str = f" [{check.value:.1f}{check.unit}]" if check.value is not None and check.unit else ""
print(f" {emoji} {check.name:<25} {_color_status(check.status):<15} {check.message}{val_str}")
print()
def print_slo_report(cfg: ObservatoryConfig) -> None:
"""Print current SLO definitions and targets."""
snapshots = load_snapshots(cfg, days=30)
total = len(snapshots)
print(f"\n{'='*60}")
print(" Hermes Observatory — SLO Report (last 30 days)")
print(f"{'='*60}")
for slo_key, slo in SLO_DEFINITIONS.items():
print(f"\n {slo['description']}")
print(f" Target: {slo['target']}{slo['unit']}")
if total == 0:
print(f" Status: no data")
continue
if slo_key == "gateway_uptime_pct":
ok_count = sum(
1 for s in snapshots
if any(c["name"] == "gateway_process" and c["status"] == "ok"
for c in s.get("checks", []))
)
actual = 100.0 * ok_count / total
met = actual >= slo["target"]
print(f" Actual: {actual:.2f}% {'✅ MET' if met else '❌ MISSED'}")
elif slo_key in ("webhook_latency_ms", "api_server_http_latency_ms"):
check_name = "webhook_http" if "webhook" in slo_key else "api_server_http"
vals = [
float(c["value"])
for s in snapshots
for c in s.get("checks", [])
if c["name"] == check_name and c.get("value") is not None
]
if vals:
vals.sort()
p95_idx = int(len(vals) * 0.95)
p95 = vals[min(p95_idx, len(vals) - 1)]
met = p95 <= slo["target"]
print(f" p95: {p95:.0f}ms {'✅ MET' if met else '❌ MISSED'}")
else:
print(f" Status: no latency data")
print()
def print_history(cfg: ObservatoryConfig, count: int = 20) -> None:
"""Print recent health records."""
snapshots = load_snapshots(cfg, days=RETENTION_DAYS)[:count]
if not snapshots:
print("No history available.")
return
print(f"\n{'='*60}")
print(f" Last {min(count, len(snapshots))} health records")
print(f"{'='*60}")
for snap in snapshots:
ts = snap.get("ts", "?")
overall = snap.get("overall", "?")
emoji = _status_emoji(overall)
degraded = [c["name"] for c in snap.get("checks", []) if c["status"] != "ok"]
degraded_str = f" — issues: {', '.join(degraded)}" if degraded else ""
print(f" {emoji} {ts} {overall.upper()}{degraded_str}")
print()
# ---------------------------------------------------------------------------
# Daemon mode
# ---------------------------------------------------------------------------
class Observatory:
"""Continuous monitoring daemon."""
def __init__(self, cfg: ObservatoryConfig):
self.cfg = cfg
self._running = False
self._prev_snapshot: Optional[HealthSnapshot] = None
def _handle_signal(self, signum: int, frame: Any) -> None:
logger.info("Received signal %d, shutting down...", signum)
self._running = False
def run_once(self) -> HealthSnapshot:
snapshot = collect_snapshot(self.cfg)
store_snapshot(self.cfg, snapshot)
alerts = maybe_alert(self.cfg, snapshot, self._prev_snapshot)
if alerts:
logger.info("Sent %d alert(s)", len(alerts))
self._prev_snapshot = snapshot
return snapshot
def run(self) -> None:
_init_db(self.cfg.db_path)
logger.info(
"Observatory starting — poll_interval=%ds db=%s",
self.cfg.poll_interval,
self.cfg.db_path,
)
self._running = True
signal.signal(signal.SIGINT, self._handle_signal)
signal.signal(signal.SIGTERM, self._handle_signal)
while self._running:
try:
snapshot = self.run_once()
logger.info("Health check: %s", snapshot.overall_status)
except Exception as exc:
logger.error("Health check failed: %s", exc, exc_info=True)
if self._running:
time.sleep(self.cfg.poll_interval)
logger.info("Observatory stopped.")
# ---------------------------------------------------------------------------
# CLI entry point
# ---------------------------------------------------------------------------
def main(argv: Optional[List[str]] = None) -> int:
parser = argparse.ArgumentParser(
description="Hermes Observatory — health monitoring & alerting",
formatter_class=argparse.RawDescriptionHelpFormatter,
)
parser.add_argument("--check", action="store_true", help="Run one health check and print results")
parser.add_argument("--daemon", action="store_true", help="Run as continuous monitoring daemon")
parser.add_argument("--digest", action="store_true", help="Print (and optionally send) daily digest")
parser.add_argument("--history", type=int, metavar="N", help="Show last N health records")
parser.add_argument("--slo", action="store_true", help="Print SLO report")
parser.add_argument("--send-digest", action="store_true", help="Send daily digest via Telegram")
parser.add_argument("--verbose", "-v", action="store_true", help="Enable verbose logging")
args = parser.parse_args(argv)
logging.basicConfig(
level=logging.DEBUG if args.verbose else logging.INFO,
format="%(asctime)s %(levelname)s [observatory] %(message)s",
)
cfg = ObservatoryConfig.from_env()
_init_db(cfg.db_path)
if args.check:
snapshot = collect_snapshot(cfg)
store_snapshot(cfg, snapshot)
print_snapshot(snapshot)
return 0 if snapshot.overall_status == "ok" else 1
if args.daemon:
obs = Observatory(cfg)
obs.run()
return 0
if args.digest or args.send_digest:
digest = build_digest(cfg)
print(digest)
if args.send_digest:
ok = send_digest(cfg)
if ok:
print("\n[Digest sent to Telegram]")
else:
print("\n[Telegram send skipped — token/chat_id not configured]")
return 0
if args.history is not None:
print_history(cfg, args.history)
return 0
if args.slo:
print_slo_report(cfg)
return 0
# Default: one-shot check
snapshot = collect_snapshot(cfg)
store_snapshot(cfg, snapshot)
print_snapshot(snapshot)
return 0 if snapshot.overall_status == "ok" else 1
if __name__ == "__main__":
sys.exit(main())

View File

@@ -42,7 +42,6 @@ dependencies = [
modal = ["modal>=1.0.0,<2"]
daytona = ["daytona>=0.148.0,<1"]
dev = ["pytest>=9.0.2,<10", "pytest-asyncio>=1.3.0,<2", "pytest-xdist>=3.0,<4", "mcp>=1.2.0,<2"]
observatory = ["psutil>=5.9.0,<7"]
messaging = ["python-telegram-bot>=22.6,<23", "discord.py[voice]>=2.7.1,<3", "aiohttp>=3.13.3,<4", "slack-bolt>=1.18.0,<2", "slack-sdk>=3.27.0,<4"]
cron = ["croniter>=6.0.0,<7"]
slack = ["slack-bolt>=1.18.0,<2", "slack-sdk>=3.27.0,<4"]

View File

@@ -1,371 +0,0 @@
#!/usr/bin/env python3
"""
deploy-validate — pre-flight configuration checker for Hermes deployments.
Catches common configuration errors BEFORE they cause runtime failures.
Safe to run at any time: it only reads files and makes lightweight network
checks — it never writes state or sends messages.
Usage:
python scripts/deploy-validate # validate current environment
python scripts/deploy-validate --dry-run # alias for the same thing
python scripts/deploy-validate --env /path/to/.env
Exit codes:
0 All checks passed (or only warnings).
1 One or more blocking errors found.
"""
from __future__ import annotations
import argparse
import os
import socket
import sys
import urllib.error
import urllib.request
from pathlib import Path
from typing import Optional
# ---------------------------------------------------------------------------
# Helpers
# ---------------------------------------------------------------------------
RESET = "\033[0m"
RED = "\033[91m"
YELLOW = "\033[93m"
GREEN = "\033[92m"
BOLD = "\033[1m"
def _color(text: str, code: str) -> str:
if sys.stdout.isatty():
return f"{code}{text}{RESET}"
return text
def ok(msg: str) -> None:
print(f" {_color('✔', GREEN)} {msg}")
def warn(msg: str) -> None:
print(f" {_color('⚠', YELLOW)} {msg}")
def error(msg: str) -> None:
print(f" {_color('✘', RED)} {msg}")
def section(title: str) -> None:
print(f"\n{_color(BOLD + title, BOLD)}")
# ---------------------------------------------------------------------------
# .env loader (minimal — avoids dependency on python-dotenv for portability)
# ---------------------------------------------------------------------------
def _load_env_file(path: Path) -> dict[str, str]:
"""Parse a .env file and return a dict of key→value pairs."""
result: dict[str, str] = {}
if not path.exists():
return result
for line in path.read_text(encoding="utf-8").splitlines():
line = line.strip()
if not line or line.startswith("#") or "=" not in line:
continue
key, _, value = line.partition("=")
key = key.strip()
# Strip inline comments and surrounding quotes.
value = value.split("#")[0].strip().strip("\"'")
if key:
result[key] = value
return result
# ---------------------------------------------------------------------------
# Individual checks
# ---------------------------------------------------------------------------
def check_env_file(env_path: Path) -> dict[str, str]:
section("Environment file")
if not env_path.exists():
error(f".env not found at {env_path}")
error("Copy .env.example → .env and fill in your API keys.")
return {}
ok(f".env found at {env_path}")
raw = _load_env_file(env_path)
# Warn if any value looks like a placeholder.
placeholder_patterns = ("your_", "xxxx", "changeme", "todo", "replace_me")
for key, value in raw.items():
if value and any(p in value.lower() for p in placeholder_patterns):
warn(f"{key} looks like a placeholder: {value!r}")
return raw
def check_llm_key(env: dict[str, str]) -> bool:
section("LLM provider")
providers = {
"OPENROUTER_API_KEY": "OpenRouter",
"ANTHROPIC_API_KEY": "Anthropic",
"OPENAI_API_KEY": "OpenAI",
"GLM_API_KEY": "z.ai / GLM",
"KIMI_API_KEY": "Kimi / Moonshot",
"MINIMAX_API_KEY": "MiniMax",
"NOUS_API_KEY": "Nous Portal",
"HF_TOKEN": "Hugging Face",
"KILOCODE_API_KEY": "KiloCode",
"OPENCODE_ZEN_API_KEY": "OpenCode Zen",
}
found = [name for key, name in providers.items() if env.get(key, "").strip()]
if not found:
error("No LLM API key detected. Set at least one (e.g. OPENROUTER_API_KEY).")
return False
ok(f"LLM provider key present: {', '.join(found)}")
return True
def check_hermes_home(env: dict[str, str]) -> Optional[Path]:
section("HERMES_HOME data directory")
raw = env.get("HERMES_HOME") or os.environ.get("HERMES_HOME") or ""
if raw:
home = Path(raw).expanduser()
else:
home = Path.home() / ".hermes"
if not home.exists():
warn(f"HERMES_HOME does not exist yet: {home} (will be created on first run)")
return home
ok(f"HERMES_HOME exists: {home}")
required_dirs = ["logs", "sessions", "cron", "memories", "skills"]
for d in required_dirs:
if not (home / d).is_dir():
warn(f"Expected subdirectory missing: {home / d} (created automatically at runtime)")
if (home / ".env").exists():
ok(f"Data-directory .env present: {home / '.env'}")
else:
warn(f"No .env in HERMES_HOME ({home}). "
"The Docker entrypoint copies .env.example on first run; "
"for bare-metal installs copy it manually.")
return home
def check_gateway_platforms(env: dict[str, str]) -> None:
section("Messaging platform tokens")
platforms: dict[str, list[str]] = {
"Telegram": ["TELEGRAM_BOT_TOKEN"],
"Discord": ["DISCORD_BOT_TOKEN"],
"Slack": ["SLACK_BOT_TOKEN", "SLACK_APP_TOKEN"],
"WhatsApp": [], # pairing-based, no env key required
"Email": ["EMAIL_ADDRESS", "EMAIL_PASSWORD"],
}
any_found = False
for platform, keys in platforms.items():
if not keys:
continue # WhatsApp — no key check
if all(env.get(k, "").strip() for k in keys):
ok(f"{platform}: configured ({', '.join(keys)})")
any_found = True
if not any_found:
warn("No messaging platform tokens found. "
"The gateway will start but accept no inbound messages. "
"Set at least one platform token (e.g. TELEGRAM_BOT_TOKEN).")
def check_api_server_reachable(host: str = "127.0.0.1", port: int = 8642) -> None:
section("API server health check")
url = f"http://{host}:{port}/health"
try:
with urllib.request.urlopen(url, timeout=5) as resp:
body = resp.read().decode()
if '"status"' in body and "ok" in body:
ok(f"API server healthy: {url}")
else:
warn(f"Unexpected /health response from {url}: {body[:200]}")
except urllib.error.URLError as exc:
# Not a failure — the server may not be running in --dry-run mode.
warn(f"API server not reachable at {url}: {exc.reason} "
"(expected if gateway is not running)")
except OSError as exc:
warn(f"API server not reachable at {url}: {exc}")
def check_gateway_status(hermes_home: Optional[Path]) -> None:
section("Gateway runtime status")
if hermes_home is None:
warn("HERMES_HOME unknown — skipping runtime status check.")
return
state_file = hermes_home / "gateway_state.json"
pid_file = hermes_home / "gateway.pid"
if not state_file.exists() and not pid_file.exists():
warn("Gateway does not appear to be running (no PID or state file). "
"This is expected before the first start.")
return
if state_file.exists():
import json
try:
state = json.loads(state_file.read_text())
gw_state = state.get("gateway_state", "unknown")
updated = state.get("updated_at", "?")
if gw_state == "running":
ok(f"Gateway state: {gw_state} (updated {updated})")
platforms = state.get("platforms", {})
for plat, pdata in platforms.items():
pstate = pdata.get("state", "unknown")
if pstate in ("connected", "running", "ok"):
ok(f" Platform {plat}: {pstate}")
else:
warn(f" Platform {plat}: {pstate} — {pdata.get('error_message', '')}")
elif gw_state in ("stopped", "startup_failed"):
error(f"Gateway state: {gw_state} — {state.get('exit_reason', 'no reason recorded')}")
else:
warn(f"Gateway state: {gw_state}")
except Exception as exc:
warn(f"Could not parse {state_file}: {exc}")
else:
warn("State file missing; only PID file found. Gateway may be starting.")
def check_docker_available() -> None:
section("Docker / compose availability")
for cmd in ("docker", "docker compose"):
_check_command(cmd.split()[0], cmd)
def _check_command(name: str, display: str) -> bool:
import shutil
if shutil.which(name):
ok(f"{display} found")
return True
warn(f"{display} not found in PATH (only required for Docker deployments)")
return False
def check_ports_free(ports: list[int] = None) -> None:
section("Port availability")
if ports is None:
ports = [8642]
for port in ports:
with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
s.settimeout(1)
result = s.connect_ex(("127.0.0.1", port))
if result == 0:
warn(f"Port {port} is already in use. "
"The API server will fail to bind unless you change its port.")
else:
ok(f"Port {port} is free")
def check_no_secrets_in_repo(repo_root: Path) -> None:
section("Secret hygiene")
dangerous = [".env", "*.pem", "*.key", "id_rsa", "id_ed25519"]
gitignore = repo_root / ".gitignore"
if gitignore.exists():
content = gitignore.read_text()
for pattern in [".env", "*.pem", "*.key"]:
if pattern in content or pattern.lstrip("*. ") in content:
ok(f".gitignore covers {pattern}")
else:
warn(f".gitignore does not mention {pattern}. "
"Ensure secrets are never committed.")
else:
warn("No .gitignore found. Secrets could accidentally be committed.")
# Check the env file itself isn't tracked.
env_file = repo_root / ".env"
if env_file.exists():
import subprocess
try:
out = subprocess.run(
["git", "ls-files", "--error-unmatch", ".env"],
cwd=repo_root,
capture_output=True,
)
if out.returncode == 0:
error(".env IS tracked by git! Remove it immediately: git rm --cached .env")
else:
ok(".env is not tracked by git")
except FileNotFoundError:
warn("git not found — cannot verify .env tracking status")
# ---------------------------------------------------------------------------
# Main
# ---------------------------------------------------------------------------
def main() -> int:
parser = argparse.ArgumentParser(
description="Pre-flight configuration validator for Hermes deployments.",
formatter_class=argparse.RawDescriptionHelpFormatter,
epilog=__doc__,
)
parser.add_argument(
"--dry-run", action="store_true",
help="Alias for the default mode (no state is written regardless).",
)
parser.add_argument(
"--env", metavar="PATH",
help="Path to .env file (default: .env in repo root).",
)
parser.add_argument(
"--check-ports", action="store_true",
help="Also verify that required ports are free (useful before first start).",
)
parser.add_argument(
"--skip-health", action="store_true",
help="Skip the live /health HTTP check (use when gateway is not running).",
)
args = parser.parse_args()
print(f"\n{_color(BOLD + 'Hermes Deploy Validator', BOLD)}")
print("=" * 50)
repo_root = Path(__file__).resolve().parent.parent
env_path = Path(args.env) if args.env else repo_root / ".env"
errors_before = [0] # mutable sentinel
# Monkey-patch error() to count failures.
_original_error = globals()["error"]
error_count = 0
def counting_error(msg: str) -> None:
nonlocal error_count
error_count += 1
_original_error(msg)
globals()["error"] = counting_error
# Run checks.
env = check_env_file(env_path)
check_no_secrets_in_repo(repo_root)
llm_ok = check_llm_key(env)
hermes_home = check_hermes_home(env)
check_gateway_platforms(env)
if args.check_ports:
check_ports_free()
if not args.skip_health:
check_api_server_reachable()
check_gateway_status(hermes_home)
# Summary.
print(f"\n{'=' * 50}")
if error_count == 0:
print(_color(f"All checks passed (0 errors).", GREEN))
return 0
else:
print(_color(f"{error_count} error(s) found. Fix them before deploying.", RED))
return 1
if __name__ == "__main__":
sys.exit(main())

View File

@@ -1,455 +0,0 @@
"""
Tests for observatory.py — health monitoring & alerting.
Refs #147
"""
from __future__ import annotations
import json
import os
import sqlite3
import sys
import tempfile
import time
from pathlib import Path
from typing import Any
from unittest.mock import MagicMock, patch
import pytest
PROJECT_ROOT = Path(__file__).parent.parent
if str(PROJECT_ROOT) not in sys.path:
sys.path.insert(0, str(PROJECT_ROOT))
import observatory as obs
# ---------------------------------------------------------------------------
# Fixtures
# ---------------------------------------------------------------------------
@pytest.fixture
def cfg(tmp_path):
"""Return an ObservatoryConfig pointing at a temp directory."""
cfg = obs.ObservatoryConfig()
cfg.db_path = tmp_path / "observatory.db"
cfg.alert_chat_id = "99999"
cfg.digest_chat_id = "99999"
cfg.telegram_token = "fake-token"
cfg.webhook_url = "http://127.0.0.1:19999/health" # port never bound
cfg.api_url = "http://127.0.0.1:19998/health"
return cfg
# ---------------------------------------------------------------------------
# Config tests
# ---------------------------------------------------------------------------
class TestObservatoryConfig:
def test_defaults(self):
c = obs.ObservatoryConfig()
assert c.disk_warn_pct == 80.0
assert c.disk_crit_pct == 90.0
assert c.mem_warn_pct == 80.0
assert c.mem_crit_pct == 90.0
assert c.cpu_warn_pct == 80.0
assert c.cpu_crit_pct == 95.0
assert c.poll_interval == 60
assert c.webhook_latency_slo_ms == 2000.0
assert c.gateway_uptime_slo_pct == 99.5
def test_from_env_overrides(self, monkeypatch):
monkeypatch.setenv("OBSERVATORY_DISK_WARN_PCT", "70")
monkeypatch.setenv("OBSERVATORY_POLL_INTERVAL", "30")
monkeypatch.setenv("OBSERVATORY_ALERT_CHAT_ID", "12345")
monkeypatch.setenv("TELEGRAM_BOT_TOKEN", "tok123")
c = obs.ObservatoryConfig.from_env()
assert c.disk_warn_pct == 70.0
assert c.poll_interval == 30
assert c.alert_chat_id == "12345"
assert c.telegram_token == "tok123"
def test_digest_chat_falls_back_to_alert(self, monkeypatch):
monkeypatch.setenv("OBSERVATORY_ALERT_CHAT_ID", "abc")
monkeypatch.delenv("OBSERVATORY_DIGEST_CHAT_ID", raising=False)
c = obs.ObservatoryConfig.from_env()
assert c.digest_chat_id == "abc"
# ---------------------------------------------------------------------------
# CheckResult / HealthSnapshot tests
# ---------------------------------------------------------------------------
class TestHealthSnapshot:
def _make_snapshot(self, statuses):
checks = [obs.CheckResult(name=f"c{i}", status=s, message="") for i, s in enumerate(statuses)]
return obs.HealthSnapshot(ts="2026-01-01T00:00:00+00:00", checks=checks)
def test_overall_ok(self):
snap = self._make_snapshot(["ok", "ok"])
assert snap.overall_status == "ok"
def test_overall_warn(self):
snap = self._make_snapshot(["ok", "warn"])
assert snap.overall_status == "warn"
def test_overall_critical(self):
snap = self._make_snapshot(["ok", "warn", "critical"])
assert snap.overall_status == "critical"
def test_overall_error(self):
snap = self._make_snapshot(["ok", "error"])
assert snap.overall_status == "critical"
def test_to_dict(self):
snap = self._make_snapshot(["ok"])
d = snap.to_dict()
assert d["overall"] == "ok"
assert isinstance(d["checks"], list)
assert d["checks"][0]["name"] == "c0"
# ---------------------------------------------------------------------------
# Individual check tests
# ---------------------------------------------------------------------------
class TestCheckGatewayLiveness:
def test_running(self):
with patch("gateway.status.is_gateway_running", return_value=True), \
patch("gateway.status.get_running_pid", return_value=12345):
result = obs.check_gateway_liveness()
assert result.status == "ok"
assert "12345" in result.message
def test_not_running(self):
with patch("gateway.status.is_gateway_running", return_value=False), \
patch("gateway.status.get_running_pid", return_value=None):
result = obs.check_gateway_liveness()
assert result.status == "critical"
def test_import_error(self):
import builtins
real_import = builtins.__import__
def mock_import(name, *args, **kwargs):
if name == "gateway.status":
raise ImportError("no module")
return real_import(name, *args, **kwargs)
with patch("builtins.__import__", side_effect=mock_import):
result = obs.check_gateway_liveness()
assert result.status in ("error", "critical", "ok") # graceful
class TestCheckDisk:
def test_ok(self, cfg):
mock_usage = MagicMock()
mock_usage.percent = 50.0
mock_usage.free = 10 * 1024 ** 3
mock_usage.total = 20 * 1024 ** 3
with patch("psutil.disk_usage", return_value=mock_usage):
result = obs.check_disk(cfg)
assert result.status == "ok"
assert result.value == 50.0
def test_warn(self, cfg):
mock_usage = MagicMock()
mock_usage.percent = 85.0
mock_usage.free = 3 * 1024 ** 3
mock_usage.total = 20 * 1024 ** 3
with patch("psutil.disk_usage", return_value=mock_usage):
result = obs.check_disk(cfg)
assert result.status == "warn"
def test_critical(self, cfg):
mock_usage = MagicMock()
mock_usage.percent = 92.0
mock_usage.free = 1 * 1024 ** 3
mock_usage.total = 20 * 1024 ** 3
with patch("psutil.disk_usage", return_value=mock_usage):
result = obs.check_disk(cfg)
assert result.status == "critical"
def test_no_psutil(self, cfg, monkeypatch):
monkeypatch.setattr(obs, "_PSUTIL", False)
result = obs.check_disk(cfg)
assert result.status == "error"
class TestCheckMemory:
def test_ok(self, cfg):
mock_mem = MagicMock()
mock_mem.percent = 60.0
mock_mem.available = 4 * 1024 ** 3
mock_mem.total = 16 * 1024 ** 3
with patch("psutil.virtual_memory", return_value=mock_mem):
result = obs.check_memory(cfg)
assert result.status == "ok"
def test_critical(self, cfg):
mock_mem = MagicMock()
mock_mem.percent = 95.0
mock_mem.available = 512 * 1024 ** 2
mock_mem.total = 16 * 1024 ** 3
with patch("psutil.virtual_memory", return_value=mock_mem):
result = obs.check_memory(cfg)
assert result.status == "critical"
class TestCheckCPU:
def test_ok(self, cfg):
with patch("psutil.cpu_percent", return_value=40.0):
result = obs.check_cpu(cfg)
assert result.status == "ok"
def test_warn(self, cfg):
with patch("psutil.cpu_percent", return_value=85.0):
result = obs.check_cpu(cfg)
assert result.status == "warn"
def test_critical(self, cfg):
with patch("psutil.cpu_percent", return_value=98.0):
result = obs.check_cpu(cfg)
assert result.status == "critical"
class TestCheckDatabase:
def test_ok(self, cfg):
obs._init_db(cfg.db_path)
result = obs.check_database(cfg)
assert result.status == "ok"
def test_not_yet_created(self, cfg):
# db_path does not exist
result = obs.check_database(cfg)
assert result.status == "warn"
class TestCheckHTTP:
def test_webhook_connection_refused(self, cfg):
result = obs.check_webhook_http(cfg)
# Port 19999 is not bound — should get a "not reachable" warn
assert result.status in ("warn", "error")
def test_api_server_connection_refused(self, cfg):
result = obs.check_api_server_http(cfg)
assert result.status in ("warn", "error")
def test_webhook_ok(self, cfg):
import urllib.error
from unittest.mock import patch, MagicMock
mock_resp = MagicMock()
mock_resp.__enter__ = lambda s: s
mock_resp.__exit__ = MagicMock(return_value=False)
mock_resp.status = 200
mock_resp.read.return_value = b'{"status":"ok"}'
with patch("urllib.request.urlopen", return_value=mock_resp):
result = obs.check_webhook_http(cfg)
assert result.status in ("ok", "warn")
def test_webhook_http_error(self, cfg):
mock_resp = MagicMock()
mock_resp.__enter__ = lambda s: s
mock_resp.__exit__ = MagicMock(return_value=False)
mock_resp.status = 503
with patch("urllib.request.urlopen", return_value=mock_resp):
result = obs.check_webhook_http(cfg)
assert result.status == "critical"
# ---------------------------------------------------------------------------
# Persistence tests
# ---------------------------------------------------------------------------
class TestPersistence:
def test_store_and_load(self, cfg):
obs._init_db(cfg.db_path)
from datetime import datetime, timezone
ts = datetime.now(timezone.utc).isoformat()
snap = obs.HealthSnapshot(
ts=ts,
checks=[obs.CheckResult(name="test", status="ok", message="fine")],
)
obs.store_snapshot(cfg, snap)
loaded = obs.load_snapshots(cfg, days=30)
assert len(loaded) == 1
assert loaded[0]["overall"] == "ok"
def test_retention_pruning(self, cfg):
obs._init_db(cfg.db_path)
# Insert an old record directly
with obs._db(cfg.db_path) as conn:
conn.execute(
"INSERT INTO health_snapshots (ts, overall, payload) VALUES (?, ?, ?)",
("2000-01-01T00:00:00+00:00", "ok", '{"ts":"2000-01-01T00:00:00+00:00","overall":"ok","checks":[]}'),
)
snap = obs.HealthSnapshot(
ts="2026-01-01T00:00:00+00:00",
checks=[],
)
obs.store_snapshot(cfg, snap)
# Old record should have been pruned
with obs._db(cfg.db_path) as conn:
count = conn.execute("SELECT count(*) FROM health_snapshots WHERE ts < '2001-01-01'").fetchone()[0]
assert count == 0
def test_record_alert_sent(self, cfg):
obs._init_db(cfg.db_path)
obs.record_alert_sent(cfg, "gateway_process", "critical", "not running")
with obs._db(cfg.db_path) as conn:
count = conn.execute("SELECT count(*) FROM alerts_sent").fetchone()[0]
assert count == 1
# ---------------------------------------------------------------------------
# Alerting tests
# ---------------------------------------------------------------------------
class TestAlerting:
def _snap(self, status):
return obs.HealthSnapshot(
ts="2026-01-01T00:00:00+00:00",
checks=[obs.CheckResult(name="gateway_process", status=status, message="test")],
)
def test_no_alert_when_ok(self, cfg):
snap = self._snap("ok")
prev = self._snap("ok")
obs._init_db(cfg.db_path)
with patch("observatory._telegram_send", return_value=True) as mock_send:
alerts = obs.maybe_alert(cfg, snap, prev)
mock_send.assert_not_called()
assert alerts == []
def test_alert_on_new_critical(self, cfg):
snap = self._snap("critical")
prev = self._snap("ok")
obs._init_db(cfg.db_path)
with patch("observatory._telegram_send", return_value=True) as mock_send:
alerts = obs.maybe_alert(cfg, snap, prev)
mock_send.assert_called_once()
assert len(alerts) == 1
def test_no_duplicate_alert(self, cfg):
snap = self._snap("critical")
prev = self._snap("critical") # already critical
obs._init_db(cfg.db_path)
with patch("observatory._telegram_send", return_value=True) as mock_send:
alerts = obs.maybe_alert(cfg, snap, prev)
mock_send.assert_not_called()
assert alerts == []
def test_recovery_alert(self, cfg):
snap = self._snap("ok")
prev = self._snap("critical")
obs._init_db(cfg.db_path)
with patch("observatory._telegram_send", return_value=True) as mock_send:
alerts = obs.maybe_alert(cfg, snap, prev)
mock_send.assert_called_once()
def test_no_alert_without_token(self, cfg):
cfg.telegram_token = None
snap = self._snap("critical")
obs._init_db(cfg.db_path)
alerts = obs.maybe_alert(cfg, snap, None)
assert alerts == []
def test_no_alert_without_chat_id(self, cfg):
cfg.alert_chat_id = None
snap = self._snap("critical")
obs._init_db(cfg.db_path)
alerts = obs.maybe_alert(cfg, snap, None)
assert alerts == []
# ---------------------------------------------------------------------------
# Digest tests
# ---------------------------------------------------------------------------
class TestDigest:
def test_empty_digest(self, cfg):
obs._init_db(cfg.db_path)
digest = obs.build_digest(cfg)
assert "no health data" in digest.lower() or "24 hours" in digest.lower()
def test_digest_with_data(self, cfg):
obs._init_db(cfg.db_path)
from datetime import datetime, timezone, timedelta
ts = (datetime.now(timezone.utc) - timedelta(hours=1)).isoformat()
snap = obs.HealthSnapshot(
ts=ts,
checks=[
obs.CheckResult(name="gateway_process", status="ok", message="running"),
obs.CheckResult(name="disk", status="ok", message="50% used", value=50.0, unit="%"),
obs.CheckResult(name="webhook_http", status="ok", message="ok", value=150.0, unit="ms"),
],
)
obs.store_snapshot(cfg, snap)
digest = obs.build_digest(cfg)
assert "Daily Digest" in digest
assert "Gateway" in digest or "gateway" in digest
def test_send_digest_no_token(self, cfg):
cfg.telegram_token = None
obs._init_db(cfg.db_path)
result = obs.send_digest(cfg)
assert result is False
# ---------------------------------------------------------------------------
# SLO tests
# ---------------------------------------------------------------------------
class TestSLO:
def test_slo_definitions_complete(self):
assert "gateway_uptime_pct" in obs.SLO_DEFINITIONS
assert "webhook_latency_ms" in obs.SLO_DEFINITIONS
assert "api_server_latency_ms" in obs.SLO_DEFINITIONS
def test_slo_targets(self):
assert obs.SLO_DEFINITIONS["gateway_uptime_pct"]["target"] == 99.5
assert obs.SLO_DEFINITIONS["webhook_latency_ms"]["target"] == 2000
# ---------------------------------------------------------------------------
# CLI tests
# ---------------------------------------------------------------------------
class TestCLI:
def test_check_exits_0_on_ok(self, cfg, monkeypatch, tmp_path):
monkeypatch.setenv("OBSERVATORY_DB_PATH", str(tmp_path / "obs.db"))
ok_snap = obs.HealthSnapshot(
ts="2026-01-01T00:00:00+00:00",
checks=[obs.CheckResult(name="all_good", status="ok", message="fine")],
)
with patch("observatory.collect_snapshot", return_value=ok_snap), \
patch("observatory.store_snapshot"):
rc = obs.main(["--check"])
assert rc == 0
def test_check_exits_nonzero_on_critical(self, cfg, monkeypatch, tmp_path):
monkeypatch.setenv("OBSERVATORY_DB_PATH", str(tmp_path / "obs.db"))
bad_snap = obs.HealthSnapshot(
ts="2026-01-01T00:00:00+00:00",
checks=[obs.CheckResult(name="gateway_process", status="critical", message="down")],
)
with patch("observatory.collect_snapshot", return_value=bad_snap), \
patch("observatory.store_snapshot"):
rc = obs.main(["--check"])
assert rc != 0
def test_digest_flag(self, monkeypatch, tmp_path):
monkeypatch.setenv("OBSERVATORY_DB_PATH", str(tmp_path / "obs.db"))
rc = obs.main(["--digest"])
assert rc == 0
def test_slo_flag(self, monkeypatch, tmp_path):
monkeypatch.setenv("OBSERVATORY_DB_PATH", str(tmp_path / "obs.db"))
rc = obs.main(["--slo"])
assert rc == 0
def test_history_flag(self, monkeypatch, tmp_path):
monkeypatch.setenv("OBSERVATORY_DB_PATH", str(tmp_path / "obs.db"))
rc = obs.main(["--history", "5"])
assert rc == 0

View File

@@ -1,215 +0,0 @@
# Forge Operations Guide
> **Audience:** Forge wizards joining the hermes-agent project
> **Purpose:** Practical patterns, common pitfalls, and operational wisdom
> **Companion to:** `WIZARD_ENVIRONMENT_CONTRACT.md`
---
## The One Rule
**Read the actual state before acting.**
Before touching any service, config, or codebase: `ps aux | grep hermes`, `cat ~/.hermes/gateway_state.json`, `curl http://127.0.0.1:8642/health`. The forge punishes assumptions harder than it rewards speed. Evidence always beats intuition.
---
## First 15 Minutes on a New System
```bash
# 1. Validate your environment
python wizard-bootstrap/wizard_bootstrap.py
# 2. Check what is actually running
ps aux | grep -E 'hermes|python|gateway'
# 3. Check the data directory
ls -la ~/.hermes/
cat ~/.hermes/gateway_state.json 2>/dev/null | python3 -m json.tool
# 4. Verify health endpoints (if gateway is up)
curl -sf http://127.0.0.1:8642/health | python3 -m json.tool
# 5. Run the smoke test
source venv/bin/activate
python -m pytest tests/ -q -x --timeout=60 2>&1 | tail -20
```
Do not begin work until all five steps return clean output.
---
## Import Chain — Know It, Respect It
The dependency order is load-bearing. Violating it causes silent failures:
```
tools/registry.py ← no deps; imported by everything
tools/*.py ← each calls registry.register() at import time
model_tools.py ← imports registry; triggers tool discovery
run_agent.py / cli.py / batch_runner.py
```
**If you add a tool file**, you must also:
1. Add its import to `model_tools.py` `_discover_tools()`
2. Add it to `toolsets.py` (core or a named toolset)
Missing either step causes the tool to silently not appear — no error, just absence.
---
## The Five Profile Rules
Hermes supports isolated profiles (`hermes -p myprofile`). Profile-unsafe code has caused repeated bugs. Memorize these:
| Do this | Not this |
|---------|----------|
| `get_hermes_home()` | `Path.home() / ".hermes"` |
| `display_hermes_home()` in user messages | hardcoded `~/.hermes` strings |
| `get_hermes_home() / "sessions"` in tests | `~/.hermes/sessions` in tests |
Import both from `hermes_constants`. Every `~/.hermes` hardcode is a latent profile bug.
---
## Prompt Caching — Do Not Break It
The agent caches system prompts. Cache breaks force re-billing of the entire context window on every turn. The following actions break caching mid-conversation and are forbidden:
- Altering past context
- Changing the active toolset
- Reloading memories or rebuilding the system prompt
The only sanctioned context alteration is the context compressor (`agent/context_compressor.py`). If your feature touches the message history, read that file first.
---
## Adding a Slash Command (Checklist)
Four files, in order:
1. **`hermes_cli/commands.py`** — add `CommandDef` to `COMMAND_REGISTRY`
2. **`cli.py`** — add handler branch in `HermesCLI.process_command()`
3. **`gateway/run.py`** — add handler if it should work in messaging platforms
4. **Aliases** — add to the `aliases` tuple on the `CommandDef`; everything else updates automatically
All downstream consumers (Telegram menu, Slack routing, autocomplete, help text) derive from `COMMAND_REGISTRY`. You never touch them directly.
---
## Tool Schema Pitfalls
**Do NOT cross-reference other toolsets in schema descriptions.**
Writing "prefer `web_search` over this tool" in a browser tool's description will cause the model to hallucinate calls to `web_search` when it's not loaded. Cross-references belong in `get_tool_definitions()` post-processing blocks in `model_tools.py`.
**Do NOT use `\033[K` (ANSI erase-to-EOL) in display code.**
Under `prompt_toolkit`'s `patch_stdout`, it leaks as literal `?[K`. Use space-padding instead: `f"\r{line}{' ' * pad}"`.
**Do NOT use `simple_term_menu` for interactive menus.**
It ghosts on scroll in tmux/iTerm2. Use `curses` (stdlib). See `hermes_cli/tools_config.py` for the pattern.
---
## Health Check Anatomy
A healthy instance returns:
```json
{
"status": "ok",
"gateway_state": "running",
"platforms": {
"telegram": {"state": "connected"}
}
}
```
| Field | Healthy value | What a bad value means |
|-------|--------------|----------------------|
| `status` | `"ok"` | HTTP server down |
| `gateway_state` | `"running"` | Still starting or crashed |
| `platforms.<name>.state` | `"connected"` | Auth failure or network issue |
`gateway_state: "starting"` is normal for up to 60 s on boot. Beyond that, check logs for auth errors:
```bash
journalctl -u hermes-gateway --since "2 minutes ago" | grep -i "error\|token\|auth"
```
---
## Gateway Won't Start — Diagnosis Order
1. `ss -tlnp | grep 8642` — port conflict?
2. `cat ~/.hermes/gateway.pid``ps -p <pid>` — stale PID file?
3. `hermes gateway start --replace` — clears stale locks and PIDs
4. `HERMES_LOG_LEVEL=DEBUG hermes gateway start` — verbose output
5. Check `~/.hermes/.env` — missing or placeholder token?
---
## Before Every PR
```bash
source venv/bin/activate
python -m pytest tests/ -q # full suite: ~3 min, ~3000 tests
python scripts/deploy-validate # deployment health check
python wizard-bootstrap/wizard_bootstrap.py # environment sanity
```
All three must exit 0. Do not skip. "It works locally" is not sufficient evidence.
---
## Session and State Files
| Store | Location | Notes |
|-------|----------|-------|
| Sessions | `~/.hermes/sessions/*.json` | Persisted across restarts |
| Memories | `~/.hermes/memories/*.md` | Written by the agent's memory tool |
| Cron jobs | `~/.hermes/cron/*.json` | Scheduler state |
| Gateway state | `~/.hermes/gateway_state.json` | Live platform connection status |
| Response store | `~/.hermes/response_store.db` | SQLite WAL — API server only |
All paths go through `get_hermes_home()`. Never hardcode. Always backup before a major update:
```bash
tar czf ~/backups/hermes_$(date +%F_%H%M).tar.gz ~/.hermes/
```
---
## Writing Tests
```bash
python -m pytest tests/path/to/test.py -q # single file
python -m pytest tests/ -q -k "test_name" # by name
python -m pytest tests/ -q -x # stop on first failure
```
**Test isolation rules:**
- `tests/conftest.py` has an autouse fixture that redirects `HERMES_HOME` to a temp dir. Never write to `~/.hermes/` in tests.
- Profile tests must mock both `Path.home()` and `HERMES_HOME`. See `tests/hermes_cli/test_profiles.py` for the pattern.
- Do not mock the database. Integration tests should use real SQLite with a temp path.
---
## Commit Conventions
```
feat: add X # new capability
fix: correct Y # bug fix
refactor: restructure Z # no behaviour change
test: add tests for W # test-only
chore: update deps # housekeeping
docs: clarify X # documentation only
```
Include `Fixes #NNN` or `Refs #NNN` in the commit message body to close or reference issues automatically.
---
*This guide lives in `wizard-bootstrap/`. Update it when you discover a new pitfall or pattern worth preserving.*