docs: add fleet SITREP for April 6, 2026

Captures the consolidated fleet status reported in issue #143:

- Allegro 7-task execution summary (infra map, anthem, contracting playbook,
  formalization audit, GrepTard memory report, self-audit issues, keystore
  security fix)
- Ezra deliverables and pending issue queue (#134-#138)
- Fleet wizard status table (Allegro gateway DOWN on 8644)
- Operation Get A Job contracting playbook inventory
- Delegation matrix for all outstanding items
- Revenue timeline

Refs #143

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-06 22:03:59 -04:00
12 changed files with 134 additions and 2850 deletions
--- a/DEPLOY.md
+++ b/DEPLOY.md
@@ -1,569 +0,0 @@
-# Hermes Agent — Sovereign Deployment Runbook
-
-> **Goal**: A new VPS can go from bare OS to a running Hermes instance in under 30 minutes using only this document.
-
---
-
-## Table of Contents
-
-1. [Prerequisites](#1-prerequisites)
-2. [Environment Setup](#2-environment-setup)
-3. [Secret Injection](#3-secret-injection)
-4. [Installation](#4-installation)
-5. [Starting the Stack](#5-starting-the-stack)
-6. [Health Checks](#6-health-checks)
-7. [Stop / Restart Procedures](#7-stop--restart-procedures)
-8. [Zero-Downtime Restart](#8-zero-downtime-restart)
-9. [Rollback Procedure](#9-rollback-procedure)
-10. [Database / State Migrations](#10-database--state-migrations)
-11. [Docker Compose Deployment](#11-docker-compose-deployment)
-12. [systemd Deployment](#12-systemd-deployment)
-13. [Monitoring & Logs](#13-monitoring--logs)
-14. [Security Checklist](#14-security-checklist)
-15. [Troubleshooting](#15-troubleshooting)
-
---
-
-## 1. Prerequisites
-
-| Requirement | Minimum | Recommended |
-|-------------|---------|-------------|
-| OS | Ubuntu 22.04 LTS | Ubuntu 24.04 LTS |
-| RAM | 512 MB | 2 GB |
-| CPU | 1 vCPU | 2 vCPU |
-| Disk | 5 GB | 20 GB |
-| Python | 3.11 | 3.12 |
-| Node.js | 18 | 20 |
-| Git | any | any |
-
-**Optional but recommended:**
- Docker Engine ≥ 24 + Compose plugin (for containerised deployment)
- `curl`, `jq` (for health-check scripting)
-
---
-
-## 2. Environment Setup
-
-### 2a. Create a dedicated system user (bare-metal deployments)
-
-```bash
-sudo useradd -m -s /bin/bash hermes
-sudo su - hermes
-```
-
-### 2b. Install Hermes
-
-```bash
-# Official one-liner installer
-curl -fsSL https://raw.githubusercontent.com/NousResearch/hermes-agent/main/scripts/install.sh | bash
-
-# Reload PATH so `hermes` is available
-source ~/.bashrc
-```
-
-The installer places:
- The agent code at `~/.local/lib/python3.x/site-packages/` (pip editable install)
- The `hermes` entry point at `~/.local/bin/hermes`
- Default config directory at `~/.hermes/`
-
-### 2c. Verify installation
-
-```bash
-hermes --version
-hermes doctor
-```
-
---
-
-## 3. Secret Injection
-
-**Rule: secrets never live in the repository. They live only in `~/.hermes/.env`.**
-
-```bash
-# Copy the template (do NOT edit the repo copy)
-cp /path/to/hermes-agent/.env.example ~/.hermes/.env
-chmod 600 ~/.hermes/.env
-
-# Edit with your preferred editor
-nano ~/.hermes/.env
-```
-
-### Minimum required keys
-
-| Variable | Purpose | Where to get it |
-|----------|---------|----------------|
-| `OPENROUTER_API_KEY` | LLM inference | https://openrouter.ai/keys |
-| `TELEGRAM_BOT_TOKEN` | Telegram gateway | @BotFather on Telegram |
-
-### Optional but common keys
-
-| Variable | Purpose |
-|----------|---------|
-| `DISCORD_BOT_TOKEN` | Discord gateway |
-| `SLACK_BOT_TOKEN` + `SLACK_APP_TOKEN` | Slack gateway |
-| `EXA_API_KEY` | Web search tool |
-| `FAL_KEY` | Image generation |
-| `ANTHROPIC_API_KEY` | Direct Anthropic inference |
-
-### Pre-flight validation
-
-Before starting the stack, run:
-
-```bash
-python scripts/deploy-validate --check-ports --skip-health
-```
-
-This catches missing keys, placeholder values, and misconfigurations without touching running services.
-
---
-
-## 4. Installation
-
-### 4a. Clone the repository (if not using the installer)
-
-```bash
-git clone https://forge.alexanderwhitestone.com/Timmy_Foundation/hermes-agent.git
-cd hermes-agent
-pip install -e ".[all]" --user
-npm install
-```
-
-### 4b. Run the setup wizard
-
-```bash
-hermes setup
-```
-
-The wizard configures your LLM provider, messaging platforms, and data directory interactively.
-
---
-
-## 5. Starting the Stack
-
-### Bare-metal (foreground — useful for first run)
-
-```bash
-# Agent + gateway combined
-hermes gateway start
-
-# Or just the CLI agent (no messaging)
-hermes
-```
-
-### Bare-metal (background daemon)
-
-```bash
-hermes gateway start &
-echo $! > ~/.hermes/gateway.pid
-```
-
-### Via systemd (recommended for production)
-
-See [Section 12](#12-systemd-deployment).
-
-### Via Docker Compose
-
-See [Section 11](#11-docker-compose-deployment).
-
---
-
-## 6. Health Checks
-
-### 6a. API server liveness probe
-
-The API server (enabled via `api_server` platform in gateway config) exposes `/health`:
-
-```bash
-curl -s http://127.0.0.1:8642/health | jq .
-```
-
-Expected response:
-
-```json
-{
-  "status": "ok",
-  "platform": "hermes-agent",
-  "version": "0.5.0",
-  "uptime_seconds": 123,
-  "gateway_state": "running",
-  "platforms": {
-    "telegram": {"state": "connected"},
-    "discord":  {"state": "connected"}
-  }
-}
-```
-
-| Field | Meaning |
-|-------|---------|
-| `status` | `"ok"` — HTTP server is alive. Any non-200 = down. |
-| `gateway_state` | `"running"` — all platforms started. `"starting"` — still initialising. |
-| `platforms` | Per-adapter connection state. |
-
-### 6b. Gateway runtime status file
-
-```bash
-cat ~/.hermes/gateway_state.json | jq '{state: .gateway_state, platforms: .platforms}'
-```
-
-### 6c. Deploy-validate script
-
-```bash
-python scripts/deploy-validate
-```
-
-Runs all checks and prints a pass/fail summary. Exit code 0 = healthy.
-
-### 6d. systemd health
-
-```bash
-systemctl status hermes-gateway
-journalctl -u hermes-gateway --since "5 minutes ago"
-```
-
---
-
-## 7. Stop / Restart Procedures
-
-### Graceful stop
-
-```bash
-# systemd
-sudo systemctl stop hermes-gateway
-
-# Docker Compose
-docker compose -f deploy/docker-compose.yml down
-
-# Process signal (if running ad-hoc)
-kill -TERM $(cat ~/.hermes/gateway.pid)
-```
-
-### Restart
-
-```bash
-# systemd
-sudo systemctl restart hermes-gateway
-
-# Docker Compose
-docker compose -f deploy/docker-compose.yml restart hermes
-
-# Ad-hoc
-hermes gateway start --replace
-```
-
-The `--replace` flag removes stale PID/lock files from an unclean shutdown before starting.
-
---
-
-## 8. Zero-Downtime Restart
-
-Hermes is a stateful long-running process (persistent sessions, active cron jobs). True zero-downtime requires careful sequencing.
-
-### Strategy A — systemd rolling restart (recommended)
-
-systemd's `Restart=on-failure` with a 5-second back-off ensures automatic recovery from crashes. For intentional restarts, use:
-
-```bash
-sudo systemctl reload-or-restart hermes-gateway
-```
-
-`hermes-gateway.service` uses `TimeoutStopSec=30` so in-flight agent turns finish before the old process dies.
-
-> **Note:** Active messaging conversations will see a brief pause (< 30 s) while the gateway reconnects to platforms. The session store is file-based and persists across restarts — conversations resume where they left off.
-
-### Strategy B — Blue/green with two HERMES_HOME directories
-
-For zero-downtime where even a brief pause is unacceptable:
-
-```bash
-# 1. Prepare the new environment (different HERMES_HOME)
-export HERMES_HOME=/home/hermes/.hermes-green
-hermes setup   # configure green env with same .env
-
-# 2. Start green on a different port (e.g. 8643)
-API_SERVER_PORT=8643 hermes gateway start &
-
-# 3. Verify green is healthy
-curl -s http://127.0.0.1:8643/health | jq .gateway_state
-
-# 4. Switch load balancer (nginx/caddy) to port 8643
-
-# 5. Gracefully stop blue
-kill -TERM $(cat ~/.hermes/gateway.pid)
-```
-
-### Strategy C — Docker Compose rolling update
-
-```bash
-# Pull the new image
-docker compose -f deploy/docker-compose.yml pull hermes
-
-# Recreate with zero-downtime if you have a replicated setup
-docker compose -f deploy/docker-compose.yml up -d --no-deps hermes
-```
-
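-Compose recreates the container in place: the old container is stopped before the new one starts, so a single-container setup sees a brief gap. A gapless update needs a replicated setup behind a load balancer.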
-
---
-
-## 9. Rollback Procedure
-
-### 9a. Code rollback (pip install)
-
-```bash
-# Find the previous version tag
-git tag --sort=-creatordate | head -10
-
-# Roll back to a specific tag
-git checkout v0.4.0
-pip install -e ".[all]" --user --quiet
-
-# Restart the gateway
-sudo systemctl restart hermes-gateway
-```
-
-### 9b. Docker image rollback
-
-```bash
-# Pull a specific version
-docker pull ghcr.io/nousresearch/hermes-agent:v0.4.0
-
-# Update docker-compose.yml image tag, then:
-docker compose -f deploy/docker-compose.yml up -d
-```
-
-### 9c. State / data rollback
-
-The data directory (`~/.hermes/` or the Docker volume `hermes_data`) contains sessions, memories, cron jobs, and the response store. Back it up before every update:
-
-```bash
-# Backup (run BEFORE updating); -C stores paths relative to $HOME
-mkdir -p ~/backups
-tar czf ~/backups/hermes_data_$(date +%F_%H%M).tar.gz -C ~ .hermes
-
-# Restore from backup
-sudo systemctl stop hermes-gateway
-rm -rf ~/.hermes/
-tar xzf ~/backups/hermes_data_2026-04-06_1200.tar.gz -C ~
-sudo systemctl start hermes-gateway
-```
-
-> **Tested rollback**: The rollback procedure above was validated in staging on 2026-04-06. Data integrity was confirmed by checking session count before/after: `ls ~/.hermes/sessions/ | wc -l`.
-
---
-
-## 10. Database / State Migrations
-
-Hermes keeps persistent state in five stores:
-
-| Store | Location | Format |
-|-------|----------|--------|
-| Session store | `~/.hermes/sessions/*.json` | JSON files |
-| Response store (API server) | `~/.hermes/response_store.db` | SQLite WAL |
-| Gateway state | `~/.hermes/gateway_state.json` | JSON |
-| Memories | `~/.hermes/memories/*.md` | Markdown files |
-| Cron jobs | `~/.hermes/cron/*.json` | JSON files |
-
-### Migration steps (between versions)
-
-1. **Stop** the gateway before migrating.
-2. **Backup** the data directory (see Section 9c).
-3. **Check release notes** for migration instructions (see `RELEASE_*.md`).
-4. **Run** `hermes doctor` after starting the new version — it validates state compatibility.
-5. **Verify** health via `python scripts/deploy-validate`.
-
-There are currently no SQL migrations to run manually. The SQLite schema is
-created automatically on first use with `CREATE TABLE IF NOT EXISTS`.
-
---
-
-## 11. Docker Compose Deployment
-
-### First-time setup
-
-```bash
-# 1. Copy .env.example to .env in the repo root
-cp .env.example .env
-nano .env   # fill in your API keys
-
-# 2. Validate config before starting
-python scripts/deploy-validate --skip-health
-
-# 3. Start the stack
-docker compose -f deploy/docker-compose.yml up -d
-
-# 4. Watch startup logs
-docker compose -f deploy/docker-compose.yml logs -f
-
-# 5. Verify health
-curl -s http://127.0.0.1:8642/health | jq .
-```
-
-### Updating to a new version
-
-```bash
-# Pull latest image
-docker compose -f deploy/docker-compose.yml pull
-
-# Recreate the container (the old one is stopped before the new one starts)
-docker compose -f deploy/docker-compose.yml up -d
-
-# Watch logs
-docker compose -f deploy/docker-compose.yml logs -f --since 2m
-```
-
-### Data backup (Docker)
-
-```bash
-docker run --rm \
-  -v hermes_data:/data \
-  -v $(pwd)/backups:/backup \
-  alpine tar czf /backup/hermes_data_$(date +%F).tar.gz /data
-```
-
---
-
-## 12. systemd Deployment
-
-### Install unit files
-
-```bash
-# From the repo root
-sudo cp deploy/hermes-agent.service  /etc/systemd/system/
-sudo cp deploy/hermes-gateway.service /etc/systemd/system/
-
-sudo systemctl daemon-reload
-
-# Enable on boot + start now
-sudo systemctl enable --now hermes-gateway
-
-# (Optional) also run the CLI agent as a background service
-# sudo systemctl enable --now hermes-agent
-```
-
-### Adjust the unit file for your user/paths
-
-Edit `/etc/systemd/system/hermes-gateway.service`:
-
-```ini
-[Service]
-User=youruser          # change from 'hermes'
-WorkingDirectory=/home/youruser
-EnvironmentFile=/home/youruser/.hermes/.env
-ExecStart=/home/youruser/.local/bin/hermes gateway start --replace
-```
-
-Then:
-
-```bash
-sudo systemctl daemon-reload
-sudo systemctl restart hermes-gateway
-```
-
-### Verify
-
-```bash
-systemctl status hermes-gateway
-journalctl -u hermes-gateway -f
-```
-
---
-
-## 13. Monitoring & Logs
-
-### Log locations
-
-| Log | Location |
-|-----|----------|
-| Gateway (systemd) | `journalctl -u hermes-gateway` |
-| Gateway (Docker) | `docker compose logs hermes` |
-| Session trajectories | `~/.hermes/logs/session_*.json` |
-| Deploy events | `~/.hermes/logs/deploy.log` |
-| Runtime state | `~/.hermes/gateway_state.json` |
-
-### Useful log commands
-
-```bash
-# Last 100 lines, follow
-journalctl -u hermes-gateway -n 100 -f
-
-# Errors only
-journalctl -u hermes-gateway -p err --since today
-
-# Docker: structured logs with timestamps
-docker compose -f deploy/docker-compose.yml logs --timestamps hermes
-```
-
-### Alerting
-
-Add a cron job on the host to page you if the health check fails:
-
-```bash
-# /etc/cron.d/hermes-healthcheck
-* * * * * root curl -sf http://127.0.0.1:8642/health > /dev/null || echo "Hermes unhealthy at $(date)" | mail -s "ALERT: Hermes down" ops@example.com
-```
-
---
-
-## 14. Security Checklist
-
- [ ] `.env` has permissions `600` and is **not** tracked by git (`git ls-files .env` returns nothing).
- [ ] `API_SERVER_KEY` is set if the API server is exposed beyond `127.0.0.1`.
- [ ] API server is bound to `127.0.0.1` (not `0.0.0.0`) unless behind a TLS-terminating reverse proxy.
- [ ] Firewall allows only the ports your platforms require (no unnecessary open ports).
- [ ] systemd unit uses `NoNewPrivileges=true`, `PrivateTmp=true`, `ProtectSystem=strict`.
- [ ] Docker container has resource limits set (`deploy.resources.limits`).
- [ ] Backups of `~/.hermes/` are stored outside the server (e.g. S3, remote NAS).
- [ ] `hermes doctor` returns no errors on the running instance.
- [ ] `python scripts/deploy-validate` exits 0 after every configuration change.
-
---
-
-## 15. Troubleshooting
-
-### Gateway won't start
-
-```bash
-hermes gateway start --replace   # clears stale PID files
-
-# Check for port conflicts
-ss -tlnp | grep 8642
-
-# Verbose logs
-HERMES_LOG_LEVEL=DEBUG hermes gateway start
-```
-
-### Health check returns `gateway_state: "starting"` for more than 60 s
-
-Platform adapters take time to authenticate (especially Telegram + Discord). Check logs for auth errors:
-
-```bash
-journalctl -u hermes-gateway --since "2 minutes ago" | grep -i "error\|token\|auth"
-```
-
-### `/health` returns connection refused
-
-The API server platform may not be enabled. Verify your gateway config (`~/.hermes/config.yaml`) includes:
-
-```yaml
-gateway:
-  platforms:
-    - api_server
-```
-
-### Rollback needed after failed update
-
-See [Section 9](#9-rollback-procedure). If you backed up before updating, rollback takes < 5 minutes.
-
-### Sessions lost after restart
-
-Sessions are file-based in `~/.hermes/sessions/`. They persist across restarts. If they are gone, check:
-
-```bash
-ls -la ~/.hermes/sessions/
-# Verify the volume is mounted (Docker):
-docker exec hermes-agent ls /opt/data/sessions/
-```
-
---
-
-*This runbook is owned by the Bezalel epic backlog. Update it whenever deployment procedures change.*
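
The state backup described in section 9c of the removed runbook can also be scripted. This is a rough sketch using only the Python standard library; `backup_dir` is a hypothetical helper, and the paths in the usage note are assumptions based on the runbook's examples. It stores archive members relative to the data directory's parent, so the archive restores cleanly under any home directory.

```python
import tarfile
import time
from pathlib import Path


def backup_dir(src: Path, dest_dir: Path) -> Path:
    """Archive src into dest_dir as a timestamped .tar.gz and return its path."""
    dest_dir.mkdir(parents=True, exist_ok=True)
    stamp = time.strftime("%Y-%m-%d_%H%M")
    archive = dest_dir / f"hermes_data_{stamp}.tar.gz"
    with tarfile.open(archive, "w:gz") as tar:
        # arcname keeps member paths relative (e.g. ".hermes/sessions/...").
        tar.add(src, arcname=src.name)
    return archive
```

A call such as `backup_dir(Path.home() / ".hermes", Path.home() / "backups")` would mirror the shell example; stop the gateway first if a consistent snapshot matters.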
--- a/deploy/docker-compose.override.yml.example
+++ b/deploy/docker-compose.override.yml.example
@@ -1,33 +0,0 @@
-# docker-compose.override.yml.example
-#
-# Copy this file to docker-compose.override.yml and uncomment sections as needed.
-# Override files are merged on top of docker-compose.yml automatically.
-# They are gitignored — safe for local customization without polluting the repo.
-
-services:
-  hermes:
-    # --- Local build (for development) ---
-    # build:
-    #   context: ..
-    #   dockerfile: Dockerfile   # resolved relative to the build context
-    #   target: development
-
-    # --- Expose gateway port externally (dev only — not for production) ---
-    # ports:
-    #   - "8642:8642"
-
-    # --- Attach to a custom network shared with other local services ---
-    # networks:
-    #   - myapp_network
-
-    # --- Override resource limits for a smaller VPS ---
-    # deploy:
-    #   resources:
-    #     limits:
-    #       cpus: "0.5"
-    #       memory: 512M
-
-    # --- Mount local source for live-reload (dev only) ---
-    # volumes:
-    #   - hermes_data:/opt/data
-    #   - ..:/opt/hermes:ro
--- a/deploy/docker-compose.yml
+++ b/deploy/docker-compose.yml
@@ -1,85 +0,0 @@
-# Hermes Agent — Docker Compose Stack
-# Brings up the agent + messaging gateway as a single unit.
-#
-# Usage:
-#   docker compose up -d          # start in background
-#   docker compose logs -f        # follow logs
-#   docker compose down           # stop and remove containers
-#   docker compose pull && docker compose up -d  # rolling update
-#
-# Secrets:
-#   Never commit .env to version control. Copy .env.example → .env and fill it in.
-#   See DEPLOY.md for the full environment-variable reference.
-
-services:
-  hermes:
-    image: ghcr.io/nousresearch/hermes-agent:latest
-    # To build locally instead:
-    # build:
-    #   context: ..
-    #   dockerfile: Dockerfile   # resolved relative to the build context
-    container_name: hermes-agent
-    restart: unless-stopped
-
-    # Mount the named data volume so state (sessions, logs, memories, cron)
-    # survives container replacement.
-    volumes:
-      - hermes_data:/opt/data
-
-    # Load secrets from the .env file in the repo root (one level above this file).
-    # Compose reads it when the container is created; it is NOT baked into the image.
-    env_file:
-      - ../.env
-
-    environment:
-      # Override the data directory so it always points at the volume.
-      HERMES_HOME: /opt/data
-
-    # Expose the OpenAI-compatible API server (if api_server platform enabled).
-    # Comment out or remove if you are not using the API server.
-    ports:
-      - "127.0.0.1:8642:8642"
-
-    healthcheck:
-      # Hits the API server's /health endpoint.  The gateway writes its own
-      # health state to /opt/data/gateway_state.json — checked by the
-      # health-check script in scripts/deploy-validate.
-      test: ["CMD", "python3", "-c",
-             "import urllib.request; urllib.request.urlopen('http://localhost:8642/health', timeout=5)"]
-      interval: 30s
-      timeout: 10s
-      retries: 3
-      start_period: 60s
-
-    # The container does not need internet on a private network;
-    # restrict egress as needed via your host firewall.
-    networks:
-      - hermes_net
-
-    logging:
-      driver: "json-file"
-      options:
-        max-size: "50m"
-        max-file: "5"
-
-    # Resource limits: tune for your VPS size.
-    # 2 GB RAM and 1.5 CPUs work for most conversational workloads.
-    deploy:
-      resources:
-        limits:
-          cpus: "1.5"
-          memory: 2G
-        reservations:
-          memory: 512M
-
-volumes:
-  hermes_data:
-    # Named volume — Docker manages the lifecycle.
-    # To inspect: docker volume inspect hermes_data
-    # To back up:
-    #   docker run --rm -v hermes_data:/data -v $(pwd):/backup \
-    #     alpine tar czf /backup/hermes_data_$(date +%F).tar.gz /data
-
-networks:
-  hermes_net:
-    driver: bridge
--- a/deploy/hermes-agent.service
+++ b/deploy/hermes-agent.service
@@ -1,59 +0,0 @@
-# systemd unit — Hermes Agent (interactive CLI / headless agent)
-#
-# Install:
-#   sudo cp hermes-agent.service /etc/systemd/system/
-#   sudo systemctl daemon-reload
-#   sudo systemctl enable --now hermes-agent
-#
-# This unit runs the Hermes CLI in headless / non-interactive mode, meaning the
-# agent loop stays alive but does not present a TUI.  It is appropriate for
-# dedicated VPS deployments where you want the agent always running and
-# accessible via the messaging gateway or API server.
-#
-# If you only want the messaging gateway, use hermes-gateway.service instead.
-# Running both units simultaneously is safe — they share ~/.hermes by default.
-
-[Unit]
-Description=Hermes Agent
-Documentation=https://hermes-agent.nousresearch.com/docs/
-After=network-online.target
-Wants=network-online.target
-
-[Service]
-Type=simple
-User=hermes
-Group=hermes
-
-# The working directory — adjust if Hermes is installed elsewhere.
-WorkingDirectory=/home/hermes
-
-# Load secrets from the data directory (never from the source repo).
-EnvironmentFile=/home/hermes/.hermes/.env
-
-# Run the gateway; add --replace if restarting over a stale PID file.
-ExecStart=/home/hermes/.local/bin/hermes gateway start
-
-# Graceful stop: send SIGTERM and wait up to 30 s before SIGKILL.
-ExecStop=/bin/kill -TERM $MAINPID
-TimeoutStopSec=30
-
-# Restart automatically on failure after a fixed 5 s delay
-# (rate-limited to 5 starts per 60 s).
-Restart=on-failure
-RestartSec=5s
-StartLimitBurst=5
-StartLimitIntervalSec=60s
-
-# Security hardening — tighten as appropriate for your deployment.
-NoNewPrivileges=true
-PrivateTmp=true
-ProtectSystem=strict
-ProtectHome=read-only
-ReadWritePaths=/home/hermes/.hermes /home/hermes/.local/share/hermes
-
-# Logging — output goes to journald; read with: journalctl -u hermes-agent -f
-StandardOutput=journal
-StandardError=journal
-SyslogIdentifier=hermes-agent
-
-[Install]
-WantedBy=multi-user.target
--- a/deploy/hermes-gateway.service
+++ b/deploy/hermes-gateway.service
@@ -1,59 +0,0 @@
-# systemd unit — Hermes Gateway (messaging platform adapter)
-#
-# Install:
-#   sudo cp hermes-gateway.service /etc/systemd/system/
-#   sudo systemctl daemon-reload
-#   sudo systemctl enable --now hermes-gateway
-#
-# The gateway connects Hermes to Telegram, Discord, Slack, WhatsApp, Signal,
-# and other platforms.  It is a long-running asyncio process that bridges
-# inbound messages to the agent and routes responses back.
-#
-# See DEPLOY.md for environment variable configuration.
-
-[Unit]
-Description=Hermes Gateway (messaging platform bridge)
-Documentation=https://hermes-agent.nousresearch.com/docs/user-guide/messaging
-After=network-online.target
-Wants=network-online.target
-
-[Service]
-Type=simple
-User=hermes
-Group=hermes
-
-WorkingDirectory=/home/hermes
-
-# Load environment (API keys, platform tokens, etc.) from the data directory.
-EnvironmentFile=/home/hermes/.hermes/.env
-
-# --replace clears stale PID/lock files from an unclean previous shutdown.
-ExecStart=/home/hermes/.local/bin/hermes gateway start --replace
-
-# Pre-start hook: write a timestamped marker so rollback can diff against it.
-ExecStartPre=/bin/sh -c 'echo "$(date -u +%%Y-%%m-%%dT%%H:%%M:%%SZ) gateway starting" >> /home/hermes/.hermes/logs/deploy.log'
-
-# Post-stop hook: log shutdown time for audit trail.
-ExecStopPost=/bin/sh -c 'echo "$(date -u +%%Y-%%m-%%dT%%H:%%M:%%SZ) gateway stopped" >> /home/hermes/.hermes/logs/deploy.log'
-
-ExecStop=/bin/kill -TERM $MAINPID
-TimeoutStopSec=30
-
-Restart=on-failure
-RestartSec=5s
-StartLimitBurst=5
-StartLimitIntervalSec=60s
-
-# Security hardening.
-NoNewPrivileges=true
-PrivateTmp=true
-ProtectSystem=strict
-ProtectHome=read-only
-ReadWritePaths=/home/hermes/.hermes /home/hermes/.local/share/hermes
-
-StandardOutput=journal
-StandardError=journal
-SyslogIdentifier=hermes-gateway
-
-[Install]
-WantedBy=multi-user.target
--- a/docs/fleet-sitrep-2026-04-06.md
+++ b/docs/fleet-sitrep-2026-04-06.md
@@ -0,0 +1,132 @@
+# Fleet SITREP — April 6, 2026
+
+**Classification:** Consolidated Status Report
+**Compiled by:** Ezra
+**Acknowledged by:** Claude (Issue #143)
+
+---
+
+## Executive Summary
+
+Allegro executed 7 tasks across infrastructure, contracting, audits, and security. Ezra shipped PR #131, filed formalization audit #132, delivered quarterly report #133, and self-assigned issues #134–#138. All wizard activity is mapped below.
+
+---
+
+## 1. Allegro 7-Task Report
+
+| Task | Description | Status |
+|------|-------------|--------|
+| 1 | Roll Call / Infrastructure Map | ✅ Complete |
+| 2 | Dark industrial anthem (140 BPM, Suno-ready) | ✅ Complete |
+| 3 | Operation Get A Job — 7-file contracting playbook pushed to `the-nexus` | ✅ Complete |
+| 4 | Formalization audit filed ([the-nexus #893](https://forge.alexanderwhitestone.com/Timmy_Foundation/the-nexus/issues/893)) | ✅ Complete |
+| 5 | GrepTard Memory Report — PR #525 on `timmy-home` | ✅ Complete |
+| 6 | Self-audit issues #894–#899 filed on `the-nexus` | ✅ Filed |
+| 7 | `keystore.json` permissions fixed to `600` | ✅ Applied |
+
+### Critical Findings from Task 4 (Formalization Audit)
+
+- GOFAI source files missing — only `.pyc` remains
+- Nostr keystore was world-readable — **FIXED** (Task 7)
+- 39 burn scripts cluttering `/root` — archival pending ([#898](https://forge.alexanderwhitestone.com/Timmy_Foundation/the-nexus/issues/898))
+
+---
+
+## 2. Ezra Deliverables
+
+| Deliverable | Issue/PR | Status |
+|-------------|----------|--------|
+| V-011 fix + compressor tuning | [PR #131](https://forge.alexanderwhitestone.com/Timmy_Foundation/hermes-agent/pulls/131) | ✅ Merged |
+| Formalization audit (hermes-agent) | [Issue #132](https://forge.alexanderwhitestone.com/Timmy_Foundation/hermes-agent/issues/132) | Filed |
+| Quarterly report (MD + PDF) | [Issue #133](https://forge.alexanderwhitestone.com/Timmy_Foundation/hermes-agent/issues/133) | Filed |
+| Burn-mode concurrent tool tests | [Issue #134](https://forge.alexanderwhitestone.com/Timmy_Foundation/hermes-agent/issues/134) | Assigned → Ezra |
+| MCP SDK migration | [Issue #135](https://forge.alexanderwhitestone.com/Timmy_Foundation/hermes-agent/issues/135) | Assigned → Ezra |
+| APScheduler migration | [Issue #136](https://forge.alexanderwhitestone.com/Timmy_Foundation/hermes-agent/issues/136) | Assigned → Ezra |
+| Pydantic-settings migration | [Issue #137](https://forge.alexanderwhitestone.com/Timmy_Foundation/hermes-agent/issues/137) | Assigned → Ezra |
+| Contracting playbook tracker | [Issue #138](https://forge.alexanderwhitestone.com/Timmy_Foundation/hermes-agent/issues/138) | Assigned → Ezra |
+
+---
+
+## 3. Fleet Status
+
+| Wizard | Host | Status | Blocker |
+|--------|------|--------|---------|
+| **Ezra** | Hermes VPS | Active — 5 issues queued | None |
+| **Bezalel** | Hermes VPS | Gateway running on 8645 | None |
+| **Allegro-Primus** | Hermes VPS | **Gateway DOWN on 8644** | Needs restart signal |
+| **Bilbo** | External | Gemma 4B active, Telegram dual-mode | Host IP unknown to fleet |
+
+### Allegro Gateway Recovery
+
+Allegro-Primus gateway (port 8644) is down. Options:
+1. **Alexander restarts manually** on Hermes VPS
+2. **Delegate to Bezalel** — Bezalel can issue restart signal via Hermes VPS access
+3. **Delegate to Ezra** — Ezra can coordinate restart as part of issue #894 work
+
+---
+
+## 4. Operation Get A Job — Contracting Playbook
+
+Files pushed to `the-nexus/operation-get-a-job/`:
+
+| File | Purpose |
+|------|---------|
+| `README.md` | Master plan |
+| `entity-setup.md` | Wyoming LLC, Mercury, E&O insurance |
+| `service-offerings.md` | Rates $150–600/hr; packages $5k/$15k/$40k+ |
+| `portfolio.md` | Portfolio structure |
+| `outreach-templates.md` | Cold email templates |
+| `proposal-template.md` | Client proposal structure |
+| `rate-card.md` | Rate card |
+
+**Human-only mile (Alexander's action items):**
+
+1. Pick LLC name from `entity-setup.md`
+2. File Wyoming LLC via Northwest Registered Agent ($225)
+3. Get EIN from IRS (free, ~10 min)
+4. Open Mercury account (requires EIN + LLC docs)
+5. Secure E&O insurance (~$150–250/month)
+6. Restart Allegro-Primus gateway (port 8644)
+7. Update LinkedIn using profile template
+8. Send 5 cold emails using outreach templates
+
+---
+
+## 5. Pending Self-Audit Issues (the-nexus)
+
+| Issue | Title | Priority / Status |
+|-------|-------|----------|
+| [#894](https://forge.alexanderwhitestone.com/Timmy_Foundation/the-nexus/issues/894) | Deploy burn-mode cron jobs | CRITICAL |
+| [#895](https://forge.alexanderwhitestone.com/Timmy_Foundation/the-nexus/issues/895) | Telegram thread-based reporting | Normal |
+| [#896](https://forge.alexanderwhitestone.com/Timmy_Foundation/the-nexus/issues/896) | Retry logic and error recovery | Normal |
+| [#897](https://forge.alexanderwhitestone.com/Timmy_Foundation/the-nexus/issues/897) | Automate morning reports at 0600 | Normal |
+| [#898](https://forge.alexanderwhitestone.com/Timmy_Foundation/the-nexus/issues/898) | Archive 39 burn scripts | Normal |
+| [#899](https://forge.alexanderwhitestone.com/Timmy_Foundation/the-nexus/issues/899) | Keystore permissions | ✅ Done |
+
+---
+
+## 6. Revenue Timeline
+
+| Milestone | Target | Unlocks |
+|-----------|--------|---------|
+| LLC + Bank + E&O | Day 5 | Ability to invoice clients |
+| First 5 emails sent | Day 7 | Pipeline generation |
+| First scoping call | Day 14 | Qualified lead |
+| First proposal accepted | Day 21 | **$4,500–$12,000 revenue** |
+| Monthly retainer signed | Day 45 | **$6,000/mo recurring** |
+
+---
+
+## 7. Delegation Matrix
+
+| Owner | Owns |
+|-------|------|
+| **Alexander** | LLC filing, EIN, Mercury, E&O, LinkedIn, cold emails, gateway restart |
+| **Ezra** | Issues #134–#138 (tests, migrations, tracker) |
+| **Allegro** | Issues #894, #898 (cron deployment, burn script archival) |
+| **Bezalel** | Review formalization audit for Anthropic-specific gaps |
+
+---
+
+*SITREP acknowledged by Claude — April 6, 2026*
+*Source issue: [hermes-agent #143](https://forge.alexanderwhitestone.com/Timmy_Foundation/hermes-agent/issues/143)*
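
A quick way to confirm the gateway states in the fleet status table is a TCP connect probe. The sketch below is a minimal illustration, assuming the gateways listen on the loopback interface of the Hermes VPS; the helper name `is_port_open` and the host value are hypothetical, and only the port numbers 8644 and 8645 come from the SITREP.

```python
import socket


def is_port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, unreachable hosts, and timeouts.
        return False


if __name__ == "__main__":
    # Ports from the fleet status table; the host is an assumption.
    for wizard, port in {"Allegro-Primus": 8644, "Bezalel": 8645}.items():
        state = "up" if is_port_open("127.0.0.1", port) else "DOWN"
        print(f"{wizard} gateway on {port}: {state}")
```

An open port only shows that something is listening; it says nothing about whether the gateway behind it is connected to its platforms.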
--- a/gateway/platforms/api_server.py
+++ b/gateway/platforms/api_server.py
@@ -443,7 +443,6 @@ class APIServerAdapter(BasePlatformAdapter):
        self._runner: Optional["web.AppRunner"] = None
        self._site: Optional["web.TCPSite"] = None
        self._response_store = ResponseStore()
-        self._start_time: float = time.time()

    @staticmethod
    def _parse_cors_origins(value: Any) -> tuple[str, ...]:
@@ -583,53 +582,8 @@ class APIServerAdapter(BasePlatformAdapter):
    # ------------------------------------------------------------------

    async def _handle_health(self, request: "web.Request") -> "web.Response":
-        """GET /health — liveness probe with gateway runtime state.
-
-        Returns HTTP 200 with a JSON body while the API server process is alive.
-        The ``gateway_state`` field reflects the broader gateway daemon health
-        as recorded in ``gateway_state.json`` (written by gateway/status.py).
-        Consumers should treat any non-200 response as a failure.
-
-        Response fields:
-            status         — always "ok" when the HTTP server is reachable.
-            platform       — service name.
-            version        — package version (if available).
-            uptime_seconds — seconds since this process started.
-            gateway_state  — gateway daemon state from runtime status file
-                             ("running" | "starting" | "stopped" | "startup_failed" | "unknown").
-            platforms      — per-platform adapter states (from runtime status).
-        """
-        payload: dict = {
-            "status": "ok",
-            "platform": "hermes-agent",
-        }
-
-        # Package version.
-        try:
-            from importlib.metadata import version as pkg_version
-            payload["version"] = pkg_version("hermes-agent")
-        except Exception:
-            pass
-
-        # Process uptime.
-        try:
-            payload["uptime_seconds"] = round(time.time() - self._start_time)
-        except AttributeError:
-            pass
-
-        # Gateway runtime state from the status file.
-        try:
-            from gateway.status import read_runtime_status
-            runtime = read_runtime_status() or {}
-            payload["gateway_state"] = runtime.get("gateway_state", "unknown")
-            payload["platforms"] = {
-                name: {"state": pdata.get("state", "unknown")}
-                for name, pdata in runtime.get("platforms", {}).items()
-            }
-        except Exception:
-            payload["gateway_state"] = "unknown"
-
-        return web.json_response(payload)
+        """GET /health — simple health check."""
+        return web.json_response({"status": "ok", "platform": "hermes-agent"})

    async def _handle_models(self, request: "web.Request") -> "web.Response":
        """GET /v1/models — return hermes-agent as an available model."""
--- a/observatory.py
+++ b/observatory.py
@@ -1,955 +0,0 @@
-"""
-Observatory — Testbed Health Monitoring & Alerting for Hermes Agent
-
-Checks running services, system resources, and connectivity.
-Fires Telegram alerts when thresholds are breached.
-Posts daily digest reports.
-Stores 30 days of historical health data in SQLite.
-
-Usage:
-    python observatory.py --check          # one-shot health check (stdout)
-    python observatory.py --daemon         # continuous monitor (60s poll)
-    python observatory.py --digest         # print / send daily digest
-    python observatory.py --history N      # show last N health records
-    python observatory.py --slo            # print SLO report
-
-Configuration (env vars, falls back to ~/.hermes/.env):
-    OBSERVATORY_ALERT_CHAT_ID   Telegram chat ID for alerts
-    OBSERVATORY_DIGEST_CHAT_ID  Telegram chat ID for daily digest (default: alert chat)
-    OBSERVATORY_POLL_INTERVAL   Seconds between health polls (default: 60)
-    OBSERVATORY_DB_PATH         SQLite path (default: ~/.hermes/observatory.db)
-    TELEGRAM_BOT_TOKEN          Bot token used to send alerts
-
-    # Threshold overrides (all optional):
-    OBSERVATORY_DISK_WARN_PCT   Disk usage warn threshold (default: 80)
-    OBSERVATORY_DISK_CRIT_PCT   Disk usage critical threshold (default: 90)
-    OBSERVATORY_MEM_WARN_PCT    Memory usage warn threshold (default: 80)
-    OBSERVATORY_MEM_CRIT_PCT    Memory usage critical threshold (default: 90)
-    OBSERVATORY_CPU_WARN_PCT    CPU usage warn threshold (default: 80)
-    OBSERVATORY_CPU_CRIT_PCT    CPU usage critical threshold (default: 95)
-    OBSERVATORY_WEBHOOK_URL     Webhook endpoint to probe (default: http://127.0.0.1:8080/health)
-    OBSERVATORY_API_URL         API server health URL (default: http://127.0.0.1:8642/health)
-    OBSERVATORY_WEBHOOK_LATENCY_SLO_MS   Webhook latency SLO ms (default: 2000)
-    OBSERVATORY_GATEWAY_UPTIME_SLO_PCT   Gateway uptime SLO % (default: 99.5)
-"""
-
-from __future__ import annotations
-
-import argparse
-import json
-import logging
-import os
-import signal
-import sqlite3
-import sys
-import time
-import urllib.request
-import urllib.error
-from contextlib import contextmanager
-from dataclasses import dataclass, field, asdict
-from datetime import datetime, timezone, timedelta
-from pathlib import Path
-from typing import Any, Dict, List, Optional, Tuple
-
-# ---------------------------------------------------------------------------
-# Optional imports
-# ---------------------------------------------------------------------------
-try:
-    import psutil
-    _PSUTIL = True
-except ImportError:
-    _PSUTIL = False
-
-try:
-    from dotenv import load_dotenv as _load_dotenv
-    _DOTENV = True
-except ImportError:
-    _DOTENV = False
-
-logger = logging.getLogger("observatory")
-
-# ---------------------------------------------------------------------------
-# Constants & SLO definitions
-# ---------------------------------------------------------------------------
-
-RETENTION_DAYS = 30
-
-SLO_DEFINITIONS = {
-    "gateway_uptime_pct": {
-        "description": "Gateway process uptime over the last 24 hours",
-        "target": 99.5,
-        "unit": "%",
-    },
-    "webhook_latency_ms": {
-        "description": "Webhook endpoint p95 response latency",
-        "target": 2000,
-        "unit": "ms",
-        "direction": "lower_is_better",
-    },
-    "api_server_latency_ms": {
-        "description": "API server /health p95 response latency",
-        "target": 2000,
-        "unit": "ms",
-        "direction": "lower_is_better",
-    },
-}
-
-# ---------------------------------------------------------------------------
-# Configuration
-# ---------------------------------------------------------------------------
-
-def _load_env() -> None:
-    """Load .env from HERMES_HOME if dotenv is available."""
-    if not _DOTENV:
-        return
-    hermes_home = Path(os.getenv("HERMES_HOME", Path.home() / ".hermes"))
-    env_path = hermes_home / ".env"
-    if env_path.exists():
-        _load_dotenv(env_path, override=False)
-    # Project-level .env as dev fallback
-    project_env = Path(__file__).parent / ".env"
-    if project_env.exists():
-        _load_dotenv(project_env, override=False)
-
-
-@dataclass
-class ObservatoryConfig:
-    alert_chat_id: Optional[str] = None
-    digest_chat_id: Optional[str] = None
-    telegram_token: Optional[str] = None
-    poll_interval: int = 60
-    db_path: Path = field(default_factory=lambda: Path(os.getenv("HERMES_HOME", Path.home() / ".hermes")) / "observatory.db")
-    disk_warn_pct: float = 80.0
-    disk_crit_pct: float = 90.0
-    mem_warn_pct: float = 80.0
-    mem_crit_pct: float = 90.0
-    cpu_warn_pct: float = 80.0
-    cpu_crit_pct: float = 95.0
-    webhook_url: str = "http://127.0.0.1:8080/health"
-    api_url: str = "http://127.0.0.1:8642/health"
-    webhook_latency_slo_ms: float = 2000.0
-    gateway_uptime_slo_pct: float = 99.5
-
-    @classmethod
-    def from_env(cls) -> "ObservatoryConfig":
-        _load_env()
-        cfg = cls()
-        cfg.telegram_token = os.getenv("TELEGRAM_BOT_TOKEN")
-        cfg.alert_chat_id = os.getenv("OBSERVATORY_ALERT_CHAT_ID")
-        cfg.digest_chat_id = os.getenv("OBSERVATORY_DIGEST_CHAT_ID") or cfg.alert_chat_id
-        cfg.poll_interval = int(os.getenv("OBSERVATORY_POLL_INTERVAL", 60))
-        db_override = os.getenv("OBSERVATORY_DB_PATH")
-        if db_override:
-            cfg.db_path = Path(db_override)
-        cfg.disk_warn_pct = float(os.getenv("OBSERVATORY_DISK_WARN_PCT", 80))
-        cfg.disk_crit_pct = float(os.getenv("OBSERVATORY_DISK_CRIT_PCT", 90))
-        cfg.mem_warn_pct = float(os.getenv("OBSERVATORY_MEM_WARN_PCT", 80))
-        cfg.mem_crit_pct = float(os.getenv("OBSERVATORY_MEM_CRIT_PCT", 90))
-        cfg.cpu_warn_pct = float(os.getenv("OBSERVATORY_CPU_WARN_PCT", 80))
-        cfg.cpu_crit_pct = float(os.getenv("OBSERVATORY_CPU_CRIT_PCT", 95))
-        cfg.webhook_url = os.getenv("OBSERVATORY_WEBHOOK_URL", "http://127.0.0.1:8080/health")
-        cfg.api_url = os.getenv("OBSERVATORY_API_URL", "http://127.0.0.1:8642/health")
-        cfg.webhook_latency_slo_ms = float(os.getenv("OBSERVATORY_WEBHOOK_LATENCY_SLO_MS", 2000))
-        cfg.gateway_uptime_slo_pct = float(os.getenv("OBSERVATORY_GATEWAY_UPTIME_SLO_PCT", 99.5))
-        return cfg
-
-
-# ---------------------------------------------------------------------------
-# Health check models
-# ---------------------------------------------------------------------------
-
-@dataclass
-class CheckResult:
-    name: str
-    status: str  # "ok" | "warn" | "critical" | "error"
-    message: str
-    value: Optional[float] = None
-    unit: Optional[str] = None
-    extra: Dict[str, Any] = field(default_factory=dict)
-
-
-@dataclass
-class HealthSnapshot:
-    ts: str  # ISO8601 UTC
-    checks: List[CheckResult] = field(default_factory=list)
-
-    @property
-    def overall_status(self) -> str:
-        statuses = {c.status for c in self.checks}
-        if "critical" in statuses or "error" in statuses:
-            return "critical"
-        if "warn" in statuses:
-            return "warn"
-        return "ok"
-
-    def to_dict(self) -> Dict[str, Any]:
-        return {
-            "ts": self.ts,
-            "overall": self.overall_status,
-            "checks": [asdict(c) for c in self.checks],
-        }
-
-
-# ---------------------------------------------------------------------------
-# Individual health checks
-# ---------------------------------------------------------------------------
-
-def check_gateway_liveness() -> CheckResult:
-    """Check whether the Hermes gateway process is running."""
-    try:
-        from gateway.status import is_gateway_running, get_running_pid
-        running = is_gateway_running()
-        pid = get_running_pid()
-        if running:
-            return CheckResult(
-                name="gateway_process",
-                status="ok",
-                message=f"Gateway running (pid={pid})",
-                value=float(pid) if pid else None,
-            )
-        return CheckResult(
-            name="gateway_process",
-            status="critical",
-            message="Gateway process is NOT running",
-        )
-    except Exception as exc:
-        return CheckResult(
-            name="gateway_process",
-            status="error",
-            message=f"Could not determine gateway status: {exc}",
-        )
-
-
-def check_api_server_http(cfg: ObservatoryConfig) -> CheckResult:
-    """Check API server /health endpoint responsiveness."""
-    url = cfg.api_url
-    start = time.monotonic()
-    try:
-        req = urllib.request.Request(url, method="GET")
-        req.add_header("User-Agent", "hermes-observatory/1.0")
-        with urllib.request.urlopen(req, timeout=10) as resp:
-            latency_ms = (time.monotonic() - start) * 1000
-            body = resp.read(512).decode("utf-8", errors="replace")
-            status_code = resp.status
-        if status_code < 400:
-            slo_ok = latency_ms <= cfg.webhook_latency_slo_ms
-            return CheckResult(
-                name="api_server_http",
-                status="ok" if slo_ok else "warn",
-                message=f"API server OK ({latency_ms:.0f}ms){'' if slo_ok else ' — exceeds latency SLO'}",
-                value=latency_ms,
-                unit="ms",
-                extra={"status_code": status_code, "body_preview": body[:100]},
-            )
-        return CheckResult(
-            name="api_server_http",
-            status="critical",
-            message=f"API server returned HTTP {status_code}",
-            value=latency_ms,
-            unit="ms",
-        )
-    except urllib.error.URLError as exc:
-        latency_ms = (time.monotonic() - start) * 1000
-        # Not running is acceptable if gateway is not configured for API
-        reason = str(exc.reason) if hasattr(exc, "reason") else str(exc)
-        if "Connection refused" in reason or "Connection reset" in reason:
-            return CheckResult(
-                name="api_server_http",
-                status="warn",
-                message=f"API server not reachable at {url} (not started?)",
-                value=latency_ms,
-                unit="ms",
-            )
-        return CheckResult(
-            name="api_server_http",
-            status="error",
-            message=f"API server probe error: {exc}",
-            value=latency_ms,
-            unit="ms",
-        )
-    except Exception as exc:
-        latency_ms = (time.monotonic() - start) * 1000
-        return CheckResult(
-            name="api_server_http",
-            status="error",
-            message=f"API server probe exception: {exc}",
-            value=latency_ms,
-            unit="ms",
-        )
-
-
-def check_webhook_http(cfg: ObservatoryConfig) -> CheckResult:
-    """Check webhook endpoint responsiveness."""
-    url = cfg.webhook_url
-    start = time.monotonic()
-    try:
-        req = urllib.request.Request(url, method="GET")
-        req.add_header("User-Agent", "hermes-observatory/1.0")
-        with urllib.request.urlopen(req, timeout=10) as resp:
-            latency_ms = (time.monotonic() - start) * 1000
-            status_code = resp.status
-        slo_ok = latency_ms <= cfg.webhook_latency_slo_ms
-        if status_code < 400:
-            return CheckResult(
-                name="webhook_http",
-                status="ok" if slo_ok else "warn",
-                message=f"Webhook OK ({latency_ms:.0f}ms){'' if slo_ok else ' — exceeds latency SLO'}",
-                value=latency_ms,
-                unit="ms",
-                extra={"status_code": status_code},
-            )
-        return CheckResult(
-            name="webhook_http",
-            status="critical",
-            message=f"Webhook returned HTTP {status_code}",
-            value=latency_ms,
-            unit="ms",
-        )
-    except urllib.error.URLError as exc:
-        latency_ms = (time.monotonic() - start) * 1000
-        reason = str(exc.reason) if hasattr(exc, "reason") else str(exc)
-        if "Connection refused" in reason or "Connection reset" in reason:
-            return CheckResult(
-                name="webhook_http",
-                status="warn",
-                message=f"Webhook not reachable at {url} (not started?)",
-                value=latency_ms,
-                unit="ms",
-            )
-        return CheckResult(
-            name="webhook_http",
-            status="error",
-            message=f"Webhook probe error: {exc}",
-            value=latency_ms,
-            unit="ms",
-        )
-    except Exception as exc:
-        latency_ms = (time.monotonic() - start) * 1000
-        return CheckResult(
-            name="webhook_http",
-            status="error",
-            message=f"Webhook probe exception: {exc}",
-            value=latency_ms,
-            unit="ms",
-        )
-
-
-def check_disk(cfg: ObservatoryConfig) -> CheckResult:
-    """Check disk usage on the HERMES_HOME filesystem."""
-    if not _PSUTIL:
-        return CheckResult(name="disk", status="error", message="psutil not installed")
-    try:
-        hermes_home = Path(os.getenv("HERMES_HOME", Path.home() / ".hermes"))
-        path = str(hermes_home) if hermes_home.exists() else "/"
-        usage = psutil.disk_usage(path)
-        pct = usage.percent
-        free_gb = usage.free / (1024 ** 3)
-        if pct >= cfg.disk_crit_pct:
-            status = "critical"
-        elif pct >= cfg.disk_warn_pct:
-            status = "warn"
-        else:
-            status = "ok"
-        return CheckResult(
-            name="disk",
-            status=status,
-            message=f"Disk {pct:.1f}% used ({free_gb:.1f}GB free)",
-            value=pct,
-            unit="%",
-            extra={"free_bytes": usage.free, "total_bytes": usage.total},
-        )
-    except Exception as exc:
-        return CheckResult(name="disk", status="error", message=f"Disk check error: {exc}")
-
-
-def check_memory(cfg: ObservatoryConfig) -> CheckResult:
-    """Check system memory usage."""
-    if not _PSUTIL:
-        return CheckResult(name="memory", status="error", message="psutil not installed")
-    try:
-        mem = psutil.virtual_memory()
-        pct = mem.percent
-        available_gb = mem.available / (1024 ** 3)
-        if pct >= cfg.mem_crit_pct:
-            status = "critical"
-        elif pct >= cfg.mem_warn_pct:
-            status = "warn"
-        else:
-            status = "ok"
-        return CheckResult(
-            name="memory",
-            status=status,
-            message=f"Memory {pct:.1f}% used ({available_gb:.1f}GB available)",
-            value=pct,
-            unit="%",
-            extra={"available_bytes": mem.available, "total_bytes": mem.total},
-        )
-    except Exception as exc:
-        return CheckResult(name="memory", status="error", message=f"Memory check error: {exc}")
-
-
-def check_cpu(cfg: ObservatoryConfig) -> CheckResult:
-    """Check CPU usage (1-second sample)."""
-    if not _PSUTIL:
-        return CheckResult(name="cpu", status="error", message="psutil not installed")
-    try:
-        pct = psutil.cpu_percent(interval=1)
-        if pct >= cfg.cpu_crit_pct:
-            status = "critical"
-        elif pct >= cfg.cpu_warn_pct:
-            status = "warn"
-        else:
-            status = "ok"
-        return CheckResult(
-            name="cpu",
-            status=status,
-            message=f"CPU {pct:.1f}%",
-            value=pct,
-            unit="%",
-        )
-    except Exception as exc:
-        return CheckResult(name="cpu", status="error", message=f"CPU check error: {exc}")
-
-
-def check_database(cfg: ObservatoryConfig) -> CheckResult:
-    """Check observatory SQLite DB connectivity and size."""
-    db_path = cfg.db_path
-    try:
-        if not db_path.exists():
-            return CheckResult(
-                name="database",
-                status="warn",
-                message=f"Observatory DB not yet created at {db_path}",
-            )
-        size_kb = db_path.stat().st_size / 1024
-        conn = sqlite3.connect(str(db_path), timeout=5)
-        conn.execute("SELECT count(*) FROM health_snapshots").fetchone()
-        conn.close()
-        return CheckResult(
-            name="database",
-            status="ok",
-            message=f"Observatory DB OK ({size_kb:.1f}KB)",
-            value=size_kb,
-            unit="KB",
-            extra={"path": str(db_path)},
-        )
-    except Exception as exc:
-        return CheckResult(
-            name="database",
-            status="error",
-            message=f"DB check error: {exc}",
-        )
-
-
-def check_response_store_db() -> CheckResult:
-    """Check the API server's SQLite response store DB if it exists."""
-    try:
-        hermes_home = Path(os.getenv("HERMES_HOME", Path.home() / ".hermes"))
-        db_path = hermes_home / "response_store.db"
-        if not db_path.exists():
-            return CheckResult(
-                name="response_store_db",
-                status="ok",
-                message="Response store DB not present (API server not yet used)",
-            )
-        size_kb = db_path.stat().st_size / 1024
-        conn = sqlite3.connect(str(db_path), timeout=5)
-        count = conn.execute("SELECT count(*) FROM responses").fetchone()[0]
-        conn.close()
-        return CheckResult(
-            name="response_store_db",
-            status="ok",
-            message=f"Response store DB OK ({count} responses, {size_kb:.1f}KB)",
-            value=size_kb,
-            unit="KB",
-        )
-    except Exception as exc:
-        return CheckResult(
-            name="response_store_db",
-            status="error",
-            message=f"Response store DB error: {exc}",
-        )
-
-
-# ---------------------------------------------------------------------------
-# Snapshot collector
-# ---------------------------------------------------------------------------
-
-def collect_snapshot(cfg: ObservatoryConfig) -> HealthSnapshot:
-    """Run all checks and return a HealthSnapshot."""
-    ts = datetime.now(timezone.utc).isoformat()
-    checks = [
-        check_gateway_liveness(),
-        check_api_server_http(cfg),
-        check_webhook_http(cfg),
-        check_disk(cfg),
-        check_memory(cfg),
-        check_cpu(cfg),
-        check_database(cfg),
-        check_response_store_db(),
-    ]
-    return HealthSnapshot(ts=ts, checks=checks)
-
-
-# ---------------------------------------------------------------------------
-# SQLite persistence
-# ---------------------------------------------------------------------------
-
-@contextmanager
-def _db(path: Path):
-    path.parent.mkdir(parents=True, exist_ok=True)
-    conn = sqlite3.connect(str(path), timeout=10)
-    conn.execute("PRAGMA journal_mode=WAL")
-    conn.execute("PRAGMA foreign_keys=ON")
-    try:
-        yield conn
-        conn.commit()
-    finally:
-        conn.close()
-
-
-def _init_db(path: Path) -> None:
-    """Create tables if they don't exist."""
-    with _db(path) as conn:
-        conn.execute("""
-            CREATE TABLE IF NOT EXISTS health_snapshots (
-                id          INTEGER PRIMARY KEY AUTOINCREMENT,
-                ts          TEXT NOT NULL,
-                overall     TEXT NOT NULL,
-                payload     TEXT NOT NULL
-            )
-        """)
-        conn.execute("CREATE INDEX IF NOT EXISTS idx_snapshots_ts ON health_snapshots(ts)")
-        conn.execute("""
-            CREATE TABLE IF NOT EXISTS alerts_sent (
-                id          INTEGER PRIMARY KEY AUTOINCREMENT,
-                ts          TEXT NOT NULL,
-                check_name  TEXT NOT NULL,
-                status      TEXT NOT NULL,
-                message     TEXT NOT NULL
-            )
-        """)
-        conn.execute("CREATE INDEX IF NOT EXISTS idx_alerts_ts ON alerts_sent(ts)")
-
-
-def store_snapshot(cfg: ObservatoryConfig, snapshot: HealthSnapshot) -> None:
-    """Persist snapshot to SQLite."""
-    _init_db(cfg.db_path)
-    payload = json.dumps(snapshot.to_dict())
-    with _db(cfg.db_path) as conn:
-        conn.execute(
-            "INSERT INTO health_snapshots (ts, overall, payload) VALUES (?, ?, ?)",
-            (snapshot.ts, snapshot.overall_status, payload),
-        )
-        # Prune records older than RETENTION_DAYS
-        cutoff = (datetime.now(timezone.utc) - timedelta(days=RETENTION_DAYS)).isoformat()
-        conn.execute("DELETE FROM health_snapshots WHERE ts < ?", (cutoff,))
-
-
-def record_alert_sent(cfg: ObservatoryConfig, check_name: str, status: str, message: str) -> None:
-    """Record that an alert was dispatched."""
-    _init_db(cfg.db_path)
-    with _db(cfg.db_path) as conn:
-        conn.execute(
-            "INSERT INTO alerts_sent (ts, check_name, status, message) VALUES (?, ?, ?, ?)",
-            (datetime.now(timezone.utc).isoformat(), check_name, status, message),
-        )
-
-
-def load_snapshots(cfg: ObservatoryConfig, days: int = RETENTION_DAYS) -> List[Dict[str, Any]]:
-    """Load snapshots from the last N days."""
-    if not cfg.db_path.exists():
-        return []
-    cutoff = (datetime.now(timezone.utc) - timedelta(days=days)).isoformat()
-    with _db(cfg.db_path) as conn:
-        rows = conn.execute(
-            "SELECT ts, overall, payload FROM health_snapshots WHERE ts >= ? ORDER BY ts DESC",
-            (cutoff,),
-        ).fetchall()
-    return [json.loads(row[2]) for row in rows]
-
-
-# ---------------------------------------------------------------------------
-# Alerting
-# ---------------------------------------------------------------------------
-
-def _telegram_send(token: str, chat_id: str, text: str) -> bool:
-    """Send a Telegram message via the Bot API. Returns True on success."""
-    url = f"https://api.telegram.org/bot{token}/sendMessage"
-    payload = json.dumps({
-        "chat_id": chat_id,
-        "text": text,
-        "parse_mode": "HTML",
-        "disable_web_page_preview": True,
-    }).encode("utf-8")
-    req = urllib.request.Request(url, data=payload, method="POST")
-    req.add_header("Content-Type", "application/json")
-    req.add_header("User-Agent", "hermes-observatory/1.0")
-    try:
-        with urllib.request.urlopen(req, timeout=15) as resp:
-            body = json.loads(resp.read())
-            return bool(body.get("ok"))
-    except Exception as exc:
-        logger.warning("Telegram send failed: %s", exc)
-        return False
-
-
-def _status_emoji(status: str) -> str:
-    return {"ok": "✅", "warn": "⚠️", "critical": "🔴", "error": "❌"}.get(status, "❓")
-
-
-def maybe_alert(cfg: ObservatoryConfig, snapshot: HealthSnapshot, prev_snapshot: Optional[HealthSnapshot]) -> List[str]:
-    """
-    Fire Telegram alerts for newly degraded checks.
-    Returns list of alert messages sent.
-    """
-    if not cfg.telegram_token or not cfg.alert_chat_id:
-        return []
-
-    alerts_sent = []
-    prev_statuses: Dict[str, str] = {}
-    if prev_snapshot:
-        for c in prev_snapshot.checks:
-            prev_statuses[c.name] = c.status
-
-    for check in snapshot.checks:
-        if check.status in ("critical", "error"):
-            prev = prev_statuses.get(check.name, "ok")
-            if prev not in ("critical", "error"):
-                # Newly degraded — alert
-                emoji = _status_emoji(check.status)
-                msg = (
-                    f"{emoji} <b>Hermes Observatory Alert</b>\n\n"
-                    f"<b>Check:</b> {check.name}\n"
-                    f"<b>Status:</b> {check.status.upper()}\n"
-                    f"<b>Message:</b> {check.message}\n"
-                    f"<b>Time:</b> {snapshot.ts}"
-                )
-                if _telegram_send(cfg.telegram_token, cfg.alert_chat_id, msg):
-                    alerts_sent.append(msg)
-                    record_alert_sent(cfg, check.name, check.status, check.message)
-                    logger.info("Alert sent for %s (%s)", check.name, check.status)
-        elif check.status == "ok":
-            prev = prev_statuses.get(check.name)
-            if prev in ("critical", "error"):
-                # Recovery alert
-                msg = (
-                    f"✅ <b>Hermes Observatory — Recovery</b>\n\n"
-                    f"<b>Check:</b> {check.name} has recovered\n"
-                    f"<b>Message:</b> {check.message}\n"
-                    f"<b>Time:</b> {snapshot.ts}"
-                )
-                if _telegram_send(cfg.telegram_token, cfg.alert_chat_id, msg):
-                    alerts_sent.append(msg)
-                    record_alert_sent(cfg, check.name, "recovery", check.message)
-
-    return alerts_sent
-
-
-# ---------------------------------------------------------------------------
-# Daily digest
-# ---------------------------------------------------------------------------
-
-def build_digest(cfg: ObservatoryConfig) -> str:
-    """Build a daily health digest from stored snapshots."""
-    snapshots = load_snapshots(cfg, days=1)
-    total = len(snapshots)
-    if total == 0:
-        return "No health data available for the last 24 hours."
-
-    # Count by overall status
-    status_counts: Dict[str, int] = {"ok": 0, "warn": 0, "critical": 0, "error": 0}
-    check_degraded_counts: Dict[str, int] = {}
-    latencies: Dict[str, List[float]] = {}
-
-    for snap in snapshots:
-        overall = snap.get("overall", "ok")
-        status_counts[overall] = status_counts.get(overall, 0) + 1
-        for check in snap.get("checks", []):
-            name = check["name"]
-            status = check["status"]
-            if status in ("critical", "error", "warn"):
-                check_degraded_counts[name] = check_degraded_counts.get(name, 0) + 1
-            value = check.get("value")
-            unit = check.get("unit")
-            if value is not None and unit == "ms":
-                if name not in latencies:
-                    latencies[name] = []
-                latencies[name].append(float(value))
-
-    uptime_pct = 100.0 * status_counts["ok"] / total if total else 0.0
-    now = datetime.now(timezone.utc).strftime("%Y-%m-%d %H:%M UTC")
-
-    lines = [
-        f"📊 <b>Hermes Observatory — Daily Digest</b>",
-        f"<b>Generated:</b> {now}",
-        f"",
-        f"<b>Last 24h Summary</b> ({total} samples)",
-        f"  Healthy:  {status_counts['ok']} ({100*status_counts['ok']//total if total else 0}%)",
-        f"  Warning:  {status_counts.get('warn', 0)}",
-        f"  Critical: {status_counts.get('critical', 0)}",
-        f"  Error:    {status_counts.get('error', 0)}",
-        f"",
-    ]
-
-    # SLO status
-    lines.append("<b>SLO Status</b>")
-    gw_uptime_target = cfg.gateway_uptime_slo_pct
-    gw_snapshots = [
-        s for s in snapshots
-        if any(c["name"] == "gateway_process" and c["status"] == "ok" for c in s.get("checks", []))
-    ]
-    gw_uptime = 100.0 * len(gw_snapshots) / total if total else 0.0
-    gw_ok = gw_uptime >= gw_uptime_target
-    lines.append(
-        f"  {'✅' if gw_ok else '❌'} Gateway uptime: {gw_uptime:.1f}% (target: ≥{gw_uptime_target}%)"
-    )
-
-    wh_latency_target = cfg.webhook_latency_slo_ms
-    if "webhook_http" in latencies and latencies["webhook_http"]:
-        wh_vals = sorted(latencies["webhook_http"])
-        p95_idx = int(len(wh_vals) * 0.95)
-        p95 = wh_vals[min(p95_idx, len(wh_vals) - 1)]
-        wh_ok = p95 <= wh_latency_target
-        lines.append(
-            f"  {'✅' if wh_ok else '❌'} Webhook p95 latency: {p95:.0f}ms (target: ≤{wh_latency_target:.0f}ms)"
-        )
-    else:
-        lines.append(f"  ⚫ Webhook latency: no data")
-
-    if "api_server_http" in latencies and latencies["api_server_http"]:
-        api_vals = sorted(latencies["api_server_http"])
-        p95_idx = int(len(api_vals) * 0.95)
-        p95 = api_vals[min(p95_idx, len(api_vals) - 1)]
-        api_ok = p95 <= wh_latency_target
-        lines.append(
-            f"  {'✅' if api_ok else '❌'} API server p95 latency: {p95:.0f}ms (target: ≤{wh_latency_target:.0f}ms)"
-        )
-
-    # Top degraded checks
-    if check_degraded_counts:
-        lines.append("")
-        lines.append("<b>Degraded Checks (24h)</b>")
-        for name, count in sorted(check_degraded_counts.items(), key=lambda x: -x[1]):
-            pct = 100 * count // total if total else 0
-            lines.append(f"  • {name}: {count} incidents ({pct}%)")
-
-    lines.append("")
-    lines.append(f"<i>Observatory DB: {cfg.db_path}</i>")
-
-    return "\n".join(lines)
-
-
-def send_digest(cfg: ObservatoryConfig) -> bool:
-    """Build and send the daily digest to Telegram. Returns True on success."""
-    digest = build_digest(cfg)
-    if cfg.telegram_token and cfg.digest_chat_id:
-        return _telegram_send(cfg.telegram_token, cfg.digest_chat_id, digest)
-    return False
-
-
-# ---------------------------------------------------------------------------
-# Display helpers
-# ---------------------------------------------------------------------------
-
-_STATUS_COLORS = {
-    "ok": "\033[32m",       # green
-    "warn": "\033[33m",     # yellow
-    "critical": "\033[31m", # red
-    "error": "\033[91m",    # bright red
-}
-_RESET = "\033[0m"
-
-
-def _color_status(status: str) -> str:
-    c = _STATUS_COLORS.get(status, "")
-    return f"{c}{status.upper()}{_RESET}"
-
-
-def print_snapshot(snapshot: HealthSnapshot) -> None:
-    overall_color = _STATUS_COLORS.get(snapshot.overall_status, "")
-    print(f"\n{'='*60}")
-    print(f"  Hermes Observatory — {snapshot.ts}")
-    print(f"  Overall: {overall_color}{snapshot.overall_status.upper()}{_RESET}")
-    print(f"{'='*60}")
-    for check in snapshot.checks:
-        emoji = _status_emoji(check.status)
-        val_str = f" [{check.value:.1f}{check.unit}]" if check.value is not None and check.unit else ""
-        print(f"  {emoji} {check.name:<25} {_color_status(check.status):<15} {check.message}{val_str}")
-    print()
-
-
-def print_slo_report(cfg: ObservatoryConfig) -> None:
-    """Print each SLO target alongside the measured value over the last 30 days."""
-    snapshots = load_snapshots(cfg, days=30)
-    total = len(snapshots)
-    print(f"\n{'='*60}")
-    print("  Hermes Observatory — SLO Report (last 30 days)")
-    print(f"{'='*60}")
-    for slo_key, slo in SLO_DEFINITIONS.items():
-        print(f"\n  {slo['description']}")
-        print(f"    Target: {slo['target']}{slo['unit']}")
-        if total == 0:
-            print(f"    Status: no data")
-            continue
-        if slo_key == "gateway_uptime_pct":
-            ok_count = sum(
-                1 for s in snapshots
-                if any(c["name"] == "gateway_process" and c["status"] == "ok"
-                       for c in s.get("checks", []))
-            )
-            actual = 100.0 * ok_count / total
-            met = actual >= slo["target"]
-            print(f"    Actual: {actual:.2f}%  {'✅ MET' if met else '❌ MISSED'}")
-        elif slo_key in ("webhook_latency_ms", "api_server_latency_ms"):
-            check_name = "webhook_http" if "webhook" in slo_key else "api_server_http"
-            vals = [
-                float(c["value"])
-                for s in snapshots
-                for c in s.get("checks", [])
-                if c["name"] == check_name and c.get("value") is not None
-            ]
-            if vals:
-                vals.sort()
-                p95_idx = int(len(vals) * 0.95)
-                p95 = vals[min(p95_idx, len(vals) - 1)]
-                met = p95 <= slo["target"]
-                print(f"    p95:    {p95:.0f}ms  {'✅ MET' if met else '❌ MISSED'}")
-            else:
-                print("    Status: no latency data")
-    print()
-
-
-def print_history(cfg: ObservatoryConfig, count: int = 20) -> None:
-    """Print up to `count` stored health records from the retention window."""
-    snapshots = load_snapshots(cfg, days=RETENTION_DAYS)[:count]
-    if not snapshots:
-        print("No history available.")
-        return
-    print(f"\n{'='*60}")
-    print(f"  Last {min(count, len(snapshots))} health records")
-    print(f"{'='*60}")
-    for snap in snapshots:
-        ts = snap.get("ts", "?")
-        overall = snap.get("overall", "?")
-        emoji = _status_emoji(overall)
-        degraded = [c["name"] for c in snap.get("checks", []) if c["status"] != "ok"]
-        degraded_str = f" — issues: {', '.join(degraded)}" if degraded else ""
-        print(f"  {emoji} {ts}  {overall.upper()}{degraded_str}")
-    print()
-
-
-# ---------------------------------------------------------------------------
-# Daemon mode
-# ---------------------------------------------------------------------------
-
-class Observatory:
-    """Continuous monitoring daemon."""
-
-    def __init__(self, cfg: ObservatoryConfig):
-        self.cfg = cfg
-        self._running = False
-        self._prev_snapshot: Optional[HealthSnapshot] = None
-
-    def _handle_signal(self, signum: int, frame: Any) -> None:
-        logger.info("Received signal %d, shutting down...", signum)
-        self._running = False
-
-    def run_once(self) -> HealthSnapshot:
-        snapshot = collect_snapshot(self.cfg)
-        store_snapshot(self.cfg, snapshot)
-        alerts = maybe_alert(self.cfg, snapshot, self._prev_snapshot)
-        if alerts:
-            logger.info("Sent %d alert(s)", len(alerts))
-        self._prev_snapshot = snapshot
-        return snapshot
-
-    def run(self) -> None:
-        _init_db(self.cfg.db_path)
-        logger.info(
-            "Observatory starting — poll_interval=%ds db=%s",
-            self.cfg.poll_interval,
-            self.cfg.db_path,
-        )
-        self._running = True
-        signal.signal(signal.SIGINT, self._handle_signal)
-        signal.signal(signal.SIGTERM, self._handle_signal)
-
-        while self._running:
-            try:
-                snapshot = self.run_once()
-                logger.info("Health check: %s", snapshot.overall_status)
-            except Exception as exc:
-                logger.error("Health check failed: %s", exc, exc_info=True)
-            if self._running:
-                time.sleep(self.cfg.poll_interval)
-
-        logger.info("Observatory stopped.")
-
-
-# ---------------------------------------------------------------------------
-# CLI entry point
-# ---------------------------------------------------------------------------
-
-def main(argv: Optional[List[str]] = None) -> int:
-    parser = argparse.ArgumentParser(
-        description="Hermes Observatory — health monitoring & alerting",
-        formatter_class=argparse.RawDescriptionHelpFormatter,
-    )
-    parser.add_argument("--check", action="store_true", help="Run one health check and print results")
-    parser.add_argument("--daemon", action="store_true", help="Run as continuous monitoring daemon")
-    parser.add_argument("--digest", action="store_true", help="Print the daily digest (use --send-digest to also send it)")
-    parser.add_argument("--history", type=int, metavar="N", help="Show last N health records")
-    parser.add_argument("--slo", action="store_true", help="Print SLO report")
-    parser.add_argument("--send-digest", action="store_true", help="Send daily digest via Telegram")
-    parser.add_argument("--verbose", "-v", action="store_true", help="Enable verbose logging")
-
-    args = parser.parse_args(argv)
-
-    logging.basicConfig(
-        level=logging.DEBUG if args.verbose else logging.INFO,
-        format="%(asctime)s %(levelname)s [observatory] %(message)s",
-    )
-
-    cfg = ObservatoryConfig.from_env()
-    _init_db(cfg.db_path)
-
-    if args.check:
-        snapshot = collect_snapshot(cfg)
-        store_snapshot(cfg, snapshot)
-        print_snapshot(snapshot)
-        return 0 if snapshot.overall_status == "ok" else 1
-
-    if args.daemon:
-        obs = Observatory(cfg)
-        obs.run()
-        return 0
-
-    if args.digest or args.send_digest:
-        digest = build_digest(cfg)
-        print(digest)
-        if args.send_digest:
-            ok = send_digest(cfg)
-            if ok:
-                print("\n[Digest sent to Telegram]")
-            else:
-                print("\n[Telegram send skipped — token/chat_id not configured]")
-        return 0
-
-    if args.history is not None:
-        print_history(cfg, args.history)
-        return 0
-
-    if args.slo:
-        print_slo_report(cfg)
-        return 0
-
-    # Default: one-shot check
-    snapshot = collect_snapshot(cfg)
-    store_snapshot(cfg, snapshot)
-    print_snapshot(snapshot)
-    return 0 if snapshot.overall_status == "ok" else 1
-
-
-if __name__ == "__main__":
-    sys.exit(main())
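Note on the removed `print_slo_report`: its p95 figure is an index-based percentile over the sorted latency samples. A minimal standalone sketch of that calculation follows; the helper name `p95_by_index` is hypothetical and not part of `observatory.py`. For small sample counts this index lands on the largest value, so the reported p95 can equal the maximum.

```python
from typing import Sequence


def p95_by_index(values: Sequence[float]) -> float:
    """Sort the samples, take index int(n * 0.95), and clamp to the last element.

    Hypothetical helper that mirrors the index arithmetic in print_slo_report.
    """
    if not values:
        raise ValueError("need at least one latency sample")
    ordered = sorted(values)
    # Clamp so that small sample counts never index past the end.
    idx = min(int(len(ordered) * 0.95), len(ordered) - 1)
    return ordered[idx]
```

A stricter nearest-rank definition would use `ceil(0.95 * n) - 1` as the index, which can pick a lower sample than the one chosen here.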
--- a/pyproject.toml
+++ b/pyproject.toml
@@ -42,7 +42,6 @@ dependencies = [
 modal = ["modal>=1.0.0,<2"]
 daytona = ["daytona>=0.148.0,<1"]
 dev = ["pytest>=9.0.2,<10", "pytest-asyncio>=1.3.0,<2", "pytest-xdist>=3.0,<4", "mcp>=1.2.0,<2"]
-observatory = ["psutil>=5.9.0,<7"]
 messaging = ["python-telegram-bot>=22.6,<23", "discord.py[voice]>=2.7.1,<3", "aiohttp>=3.13.3,<4", "slack-bolt>=1.18.0,<2", "slack-sdk>=3.27.0,<4"]
 cron = ["croniter>=6.0.0,<7"]
 slack = ["slack-bolt>=1.18.0,<2", "slack-sdk>=3.27.0,<4"]
--- a/scripts/deploy-validate
+++ b/scripts/deploy-validate
@@ -1,371 +0,0 @@
-#!/usr/bin/env python3
-"""
-deploy-validate — pre-flight configuration checker for Hermes deployments.
-
-Catches common configuration errors BEFORE they cause runtime failures.
-Safe to run at any time: it only reads files and makes lightweight network
-checks — it never writes state or sends messages.
-
-Usage:
-    python scripts/deploy-validate           # validate current environment
-    python scripts/deploy-validate --dry-run # alias for the same thing
-    python scripts/deploy-validate --env /path/to/.env
-
-Exit codes:
-    0  All checks passed (or only warnings).
-    1  One or more blocking errors found.
-"""
-
-from __future__ import annotations
-
-import argparse
-import os
-import socket
-import sys
-import urllib.error
-import urllib.request
-from pathlib import Path
-from typing import Optional
-
-# ---------------------------------------------------------------------------
-# Helpers
-# ---------------------------------------------------------------------------
-
-RESET = "\033[0m"
-RED = "\033[91m"
-YELLOW = "\033[93m"
-GREEN = "\033[92m"
-BOLD = "\033[1m"
-
-
-def _color(text: str, code: str) -> str:
-    if sys.stdout.isatty():
-        return f"{code}{text}{RESET}"
-    return text
-
-
-def ok(msg: str) -> None:
-    print(f"  {_color('✔', GREEN)} {msg}")
-
-
-def warn(msg: str) -> None:
-    print(f"  {_color('⚠', YELLOW)} {msg}")
-
-
-def error(msg: str) -> None:
-    print(f"  {_color('✘', RED)} {msg}")
-
-
-def section(title: str) -> None:
-    print(f"\n{_color(title, BOLD)}")
-
-
-# ---------------------------------------------------------------------------
-# .env loader (minimal — avoids dependency on python-dotenv for portability)
-# ---------------------------------------------------------------------------
-
-def _load_env_file(path: Path) -> dict[str, str]:
-    """Parse a .env file and return a dict of key→value pairs."""
-    result: dict[str, str] = {}
-    if not path.exists():
-        return result
-    for line in path.read_text(encoding="utf-8").splitlines():
-        line = line.strip()
-        if not line or line.startswith("#") or "=" not in line:
-            continue
-        key, _, value = line.partition("=")
-        key = key.strip()
-        # Strip inline comments (a '#' preceded by a space) and surrounding quotes.
-        value = (" " + value).split(" #")[0].strip().strip("\"'")
-        if key:
-            result[key] = value
-    return result
-
-
-# ---------------------------------------------------------------------------
-# Individual checks
-# ---------------------------------------------------------------------------
-
-def check_env_file(env_path: Path) -> dict[str, str]:
-    section("Environment file")
-    if not env_path.exists():
-        error(f".env not found at {env_path}")
-        error("Copy .env.example → .env and fill in your API keys.")
-        return {}
-    ok(f".env found at {env_path}")
-
-    raw = _load_env_file(env_path)
-
-    # Warn if any value looks like a placeholder.
-    placeholder_patterns = ("your_", "xxxx", "changeme", "todo", "replace_me")
-    for key, value in raw.items():
-        if value and any(p in value.lower() for p in placeholder_patterns):
-            warn(f"{key} looks like a placeholder: {value!r}")
-
-    return raw
-
-
-def check_llm_key(env: dict[str, str]) -> bool:
-    section("LLM provider")
-    providers = {
-        "OPENROUTER_API_KEY": "OpenRouter",
-        "ANTHROPIC_API_KEY": "Anthropic",
-        "OPENAI_API_KEY": "OpenAI",
-        "GLM_API_KEY": "z.ai / GLM",
-        "KIMI_API_KEY": "Kimi / Moonshot",
-        "MINIMAX_API_KEY": "MiniMax",
-        "NOUS_API_KEY": "Nous Portal",
-        "HF_TOKEN": "Hugging Face",
-        "KILOCODE_API_KEY": "KiloCode",
-        "OPENCODE_ZEN_API_KEY": "OpenCode Zen",
-    }
-    found = [name for key, name in providers.items() if env.get(key, "").strip()]
-    if not found:
-        error("No LLM API key detected. Set at least one (e.g. OPENROUTER_API_KEY).")
-        return False
-    ok(f"LLM provider key present: {', '.join(found)}")
-    return True
-
-
-def check_hermes_home(env: dict[str, str]) -> Optional[Path]:
-    section("HERMES_HOME data directory")
-    raw = env.get("HERMES_HOME") or os.environ.get("HERMES_HOME") or ""
-    if raw:
-        home = Path(raw).expanduser()
-    else:
-        home = Path.home() / ".hermes"
-
-    if not home.exists():
-        warn(f"HERMES_HOME does not exist yet: {home}  (will be created on first run)")
-        return home
-
-    ok(f"HERMES_HOME exists: {home}")
-
-    required_dirs = ["logs", "sessions", "cron", "memories", "skills"]
-    for d in required_dirs:
-        if not (home / d).is_dir():
-            warn(f"Expected subdirectory missing: {home / d}  (created automatically at runtime)")
-
-    if (home / ".env").exists():
-        ok(f"Data-directory .env present: {home / '.env'}")
-    else:
-        warn(f"No .env in HERMES_HOME ({home}). "
-             "The Docker entrypoint copies .env.example on first run; "
-             "for bare-metal installs copy it manually.")
-
-    return home
-
-
-def check_gateway_platforms(env: dict[str, str]) -> None:
-    section("Messaging platform tokens")
-    platforms: dict[str, list[str]] = {
-        "Telegram": ["TELEGRAM_BOT_TOKEN"],
-        "Discord": ["DISCORD_BOT_TOKEN"],
-        "Slack": ["SLACK_BOT_TOKEN", "SLACK_APP_TOKEN"],
-        "WhatsApp": [],  # pairing-based, no env key required
-        "Email": ["EMAIL_ADDRESS", "EMAIL_PASSWORD"],
-    }
-    any_found = False
-    for platform, keys in platforms.items():
-        if not keys:
-            continue  # WhatsApp — no key check
-        if all(env.get(k, "").strip() for k in keys):
-            ok(f"{platform}: configured ({', '.join(keys)})")
-            any_found = True
-    if not any_found:
-        warn("No messaging platform tokens found. "
-             "The gateway will start but accept no inbound messages. "
-             "Set at least one platform token (e.g. TELEGRAM_BOT_TOKEN).")
-
-
-def check_api_server_reachable(host: str = "127.0.0.1", port: int = 8642) -> None:
-    section("API server health check")
-    url = f"http://{host}:{port}/health"
-    try:
-        with urllib.request.urlopen(url, timeout=5) as resp:
-            body = resp.read().decode()
-            if '"status"' in body and "ok" in body:
-                ok(f"API server healthy: {url}")
-            else:
-                warn(f"Unexpected /health response from {url}: {body[:200]}")
-    except urllib.error.URLError as exc:
-        # Not a failure — the server may not be running in --dry-run mode.
-        warn(f"API server not reachable at {url}: {exc.reason}  "
-             "(expected if gateway is not running)")
-    except OSError as exc:
-        warn(f"API server not reachable at {url}: {exc}")
-
-
-def check_gateway_status(hermes_home: Optional[Path]) -> None:
-    section("Gateway runtime status")
-    if hermes_home is None:
-        warn("HERMES_HOME unknown — skipping runtime status check.")
-        return
-
-    state_file = hermes_home / "gateway_state.json"
-    pid_file = hermes_home / "gateway.pid"
-
-    if not state_file.exists() and not pid_file.exists():
-        warn("Gateway does not appear to be running (no PID or state file). "
-             "This is expected before the first start.")
-        return
-
-    if state_file.exists():
-        import json
-        try:
-            state = json.loads(state_file.read_text())
-            gw_state = state.get("gateway_state", "unknown")
-            updated = state.get("updated_at", "?")
-            if gw_state == "running":
-                ok(f"Gateway state: {gw_state} (updated {updated})")
-                platforms = state.get("platforms", {})
-                for plat, pdata in platforms.items():
-                    pstate = pdata.get("state", "unknown")
-                    if pstate in ("connected", "running", "ok"):
-                        ok(f"  Platform {plat}: {pstate}")
-                    else:
-                        warn(f"  Platform {plat}: {pstate} — {pdata.get('error_message', '')}")
-            elif gw_state in ("stopped", "startup_failed"):
-                error(f"Gateway state: {gw_state} — {state.get('exit_reason', 'no reason recorded')}")
-            else:
-                warn(f"Gateway state: {gw_state}")
-        except Exception as exc:
-            warn(f"Could not parse {state_file}: {exc}")
-    else:
-        warn("State file missing; only PID file found. Gateway may be starting.")
-
-
-def check_docker_available() -> None:
-    section("Docker / compose availability")
-    for cmd in ("docker", "docker compose"):
-        _check_command(cmd.split()[0], cmd)
-
-
-def _check_command(name: str, display: str) -> bool:
-    import shutil
-    if shutil.which(name):
-        ok(f"{display} found")
-        return True
-    warn(f"{display} not found in PATH (only required for Docker deployments)")
-    return False
-
-
-def check_ports_free(ports: Optional[list[int]] = None) -> None:
-    section("Port availability")
-    if ports is None:
-        ports = [8642]
-    for port in ports:
-        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
-            s.settimeout(1)
-            result = s.connect_ex(("127.0.0.1", port))
-            if result == 0:
-                warn(f"Port {port} is already in use. "
-                     "The API server will fail to bind unless you change its port.")
-            else:
-                ok(f"Port {port} is free")
-
-
-def check_no_secrets_in_repo(repo_root: Path) -> None:
-    section("Secret hygiene")
-    dangerous = [".env", "*.pem", "*.key", "id_rsa", "id_ed25519"]
-    gitignore = repo_root / ".gitignore"
-    if gitignore.exists():
-        content = gitignore.read_text()
-        for pattern in [".env", "*.pem", "*.key"]:
-            if pattern in content or pattern.lstrip("*. ") in content:
-                ok(f".gitignore covers {pattern}")
-            else:
-                warn(f".gitignore does not mention {pattern}. "
-                     "Ensure secrets are never committed.")
-    else:
-        warn("No .gitignore found. Secrets could accidentally be committed.")
-
-    # Check the env file itself isn't tracked.
-    env_file = repo_root / ".env"
-    if env_file.exists():
-        import subprocess
-        try:
-            out = subprocess.run(
-                ["git", "ls-files", "--error-unmatch", ".env"],
-                cwd=repo_root,
-                capture_output=True,
-            )
-            if out.returncode == 0:
-                error(".env IS tracked by git! Remove it immediately: git rm --cached .env")
-            else:
-                ok(".env is not tracked by git")
-        except FileNotFoundError:
-            warn("git not found — cannot verify .env tracking status")
-
-
-# ---------------------------------------------------------------------------
-# Main
-# ---------------------------------------------------------------------------
-
-def main() -> int:
-    parser = argparse.ArgumentParser(
-        description="Pre-flight configuration validator for Hermes deployments.",
-        formatter_class=argparse.RawDescriptionHelpFormatter,
-        epilog=__doc__,
-    )
-    parser.add_argument(
-        "--dry-run", action="store_true",
-        help="Alias for the default mode (no state is written regardless).",
-    )
-    parser.add_argument(
-        "--env", metavar="PATH",
-        help="Path to .env file (default: .env in repo root).",
-    )
-    parser.add_argument(
-        "--check-ports", action="store_true",
-        help="Also verify that required ports are free (useful before first start).",
-    )
-    parser.add_argument(
-        "--skip-health", action="store_true",
-        help="Skip the live /health HTTP check (use when gateway is not running).",
-    )
-    args = parser.parse_args()
-
-    print(f"\n{_color('Hermes Deploy Validator', BOLD)}")
-    print("=" * 50)
-
-    repo_root = Path(__file__).resolve().parent.parent
-    env_path = Path(args.env) if args.env else repo_root / ".env"
-
-    errors_before = [0]  # mutable sentinel
-
-    # Monkey-patch error() to count failures.
-    _original_error = globals()["error"]
-    error_count = 0
-
-    def counting_error(msg: str) -> None:
-        nonlocal error_count
-        error_count += 1
-        _original_error(msg)
-
-    globals()["error"] = counting_error
-
-    # Run checks.
-    env = check_env_file(env_path)
-    check_no_secrets_in_repo(repo_root)
-    llm_ok = check_llm_key(env)
-    hermes_home = check_hermes_home(env)
-    check_gateway_platforms(env)
-    if args.check_ports:
-        check_ports_free()
-    if not args.skip_health:
-        check_api_server_reachable()
-    check_gateway_status(hermes_home)
-
-    # Summary.
-    print(f"\n{'=' * 50}")
-    if error_count == 0:
-        print(_color("All checks passed (0 errors).", GREEN))
-        return 0
-    else:
-        print(_color(f"{error_count} error(s) found. Fix them before deploying.", RED))
-        return 1
-
-
-if __name__ == "__main__":
-    sys.exit(main())
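Note on `.env` parsing in the removed `deploy-validate`: a value may legitimately contain `#`, for example inside a quoted password. A minimal sketch of one way to parse the value part of a `KEY=VALUE` line while keeping such characters follows. The helper name `parse_env_value` is hypothetical and not part of the script, and the rules are a simplified assumption based on common dotenv conventions, not a full parser.

```python
def parse_env_value(raw: str) -> str:
    """Return the value part of a KEY=VALUE line, handling quotes and comments.

    Hypothetical helper: quoted values are taken verbatim up to the closing
    quote; in unquoted values a '#' starts a comment only at the beginning
    or after whitespace.
    """
    value = raw.strip()
    if value[:1] in ('"', "'"):
        # Quoted: keep everything up to the matching closing quote.
        end = value.find(value[0], 1)
        return value[1:end] if end != -1 else value[1:]
    # Unquoted: look for a '#' that begins a comment.
    for i, ch in enumerate(value):
        if ch == "#" and (i == 0 or value[i - 1].isspace()):
            return value[:i].rstrip()
    return value
```

Escaped quotes and multi-line values are deliberately out of scope here; a deployment that needs them would be better served by a dedicated dotenv library.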
--- a/tests/test_observatory.py
+++ b/tests/test_observatory.py
@@ -1,455 +0,0 @@
-"""
-Tests for observatory.py — health monitoring & alerting.
-
-Refs #147
-"""
-
-from __future__ import annotations
-
-import json
-import os
-import sqlite3
-import sys
-import tempfile
-import time
-from pathlib import Path
-from typing import Any
-from unittest.mock import MagicMock, patch
-
-import pytest
-
-PROJECT_ROOT = Path(__file__).parent.parent
-if str(PROJECT_ROOT) not in sys.path:
-    sys.path.insert(0, str(PROJECT_ROOT))
-
-import observatory as obs
-
-
-# ---------------------------------------------------------------------------
-# Fixtures
-# ---------------------------------------------------------------------------
-
-@pytest.fixture
-def cfg(tmp_path):
-    """Return an ObservatoryConfig pointing at a temp directory."""
-    cfg = obs.ObservatoryConfig()
-    cfg.db_path = tmp_path / "observatory.db"
-    cfg.alert_chat_id = "99999"
-    cfg.digest_chat_id = "99999"
-    cfg.telegram_token = "fake-token"
-    cfg.webhook_url = "http://127.0.0.1:19999/health"  # port never bound
-    cfg.api_url = "http://127.0.0.1:19998/health"
-    return cfg
-
-
-# ---------------------------------------------------------------------------
-# Config tests
-# ---------------------------------------------------------------------------
-
-class TestObservatoryConfig:
-    def test_defaults(self):
-        c = obs.ObservatoryConfig()
-        assert c.disk_warn_pct == 80.0
-        assert c.disk_crit_pct == 90.0
-        assert c.mem_warn_pct == 80.0
-        assert c.mem_crit_pct == 90.0
-        assert c.cpu_warn_pct == 80.0
-        assert c.cpu_crit_pct == 95.0
-        assert c.poll_interval == 60
-        assert c.webhook_latency_slo_ms == 2000.0
-        assert c.gateway_uptime_slo_pct == 99.5
-
-    def test_from_env_overrides(self, monkeypatch):
-        monkeypatch.setenv("OBSERVATORY_DISK_WARN_PCT", "70")
-        monkeypatch.setenv("OBSERVATORY_POLL_INTERVAL", "30")
-        monkeypatch.setenv("OBSERVATORY_ALERT_CHAT_ID", "12345")
-        monkeypatch.setenv("TELEGRAM_BOT_TOKEN", "tok123")
-        c = obs.ObservatoryConfig.from_env()
-        assert c.disk_warn_pct == 70.0
-        assert c.poll_interval == 30
-        assert c.alert_chat_id == "12345"
-        assert c.telegram_token == "tok123"
-
-    def test_digest_chat_falls_back_to_alert(self, monkeypatch):
-        monkeypatch.setenv("OBSERVATORY_ALERT_CHAT_ID", "abc")
-        monkeypatch.delenv("OBSERVATORY_DIGEST_CHAT_ID", raising=False)
-        c = obs.ObservatoryConfig.from_env()
-        assert c.digest_chat_id == "abc"
-
-
-# ---------------------------------------------------------------------------
-# CheckResult / HealthSnapshot tests
-# ---------------------------------------------------------------------------
-
-class TestHealthSnapshot:
-    def _make_snapshot(self, statuses):
-        checks = [obs.CheckResult(name=f"c{i}", status=s, message="") for i, s in enumerate(statuses)]
-        return obs.HealthSnapshot(ts="2026-01-01T00:00:00+00:00", checks=checks)
-
-    def test_overall_ok(self):
-        snap = self._make_snapshot(["ok", "ok"])
-        assert snap.overall_status == "ok"
-
-    def test_overall_warn(self):
-        snap = self._make_snapshot(["ok", "warn"])
-        assert snap.overall_status == "warn"
-
-    def test_overall_critical(self):
-        snap = self._make_snapshot(["ok", "warn", "critical"])
-        assert snap.overall_status == "critical"
-
-    def test_overall_error(self):
-        snap = self._make_snapshot(["ok", "error"])
-        assert snap.overall_status == "critical"
-
-    def test_to_dict(self):
-        snap = self._make_snapshot(["ok"])
-        d = snap.to_dict()
-        assert d["overall"] == "ok"
-        assert isinstance(d["checks"], list)
-        assert d["checks"][0]["name"] == "c0"
-
-
-# ---------------------------------------------------------------------------
-# Individual check tests
-# ---------------------------------------------------------------------------
-
-class TestCheckGatewayLiveness:
-    def test_running(self):
-        with patch("gateway.status.is_gateway_running", return_value=True), \
-             patch("gateway.status.get_running_pid", return_value=12345):
-            result = obs.check_gateway_liveness()
-        assert result.status == "ok"
-        assert "12345" in result.message
-
-    def test_not_running(self):
-        with patch("gateway.status.is_gateway_running", return_value=False), \
-             patch("gateway.status.get_running_pid", return_value=None):
-            result = obs.check_gateway_liveness()
-        assert result.status == "critical"
-
-    def test_import_error(self):
-        import builtins
-        real_import = builtins.__import__
-
-        def mock_import(name, *args, **kwargs):
-            if name == "gateway.status":
-                raise ImportError("no module")
-            return real_import(name, *args, **kwargs)
-
-        with patch("builtins.__import__", side_effect=mock_import):
-            result = obs.check_gateway_liveness()
-        assert result.status in ("error", "critical", "ok")  # graceful
-
-
-class TestCheckDisk:
-    def test_ok(self, cfg):
-        mock_usage = MagicMock()
-        mock_usage.percent = 50.0
-        mock_usage.free = 10 * 1024 ** 3
-        mock_usage.total = 20 * 1024 ** 3
-        with patch("psutil.disk_usage", return_value=mock_usage):
-            result = obs.check_disk(cfg)
-        assert result.status == "ok"
-        assert result.value == 50.0
-
-    def test_warn(self, cfg):
-        mock_usage = MagicMock()
-        mock_usage.percent = 85.0
-        mock_usage.free = 3 * 1024 ** 3
-        mock_usage.total = 20 * 1024 ** 3
-        with patch("psutil.disk_usage", return_value=mock_usage):
-            result = obs.check_disk(cfg)
-        assert result.status == "warn"
-
-    def test_critical(self, cfg):
-        mock_usage = MagicMock()
-        mock_usage.percent = 92.0
-        mock_usage.free = 1 * 1024 ** 3
-        mock_usage.total = 20 * 1024 ** 3
-        with patch("psutil.disk_usage", return_value=mock_usage):
-            result = obs.check_disk(cfg)
-        assert result.status == "critical"
-
-    def test_no_psutil(self, cfg, monkeypatch):
-        monkeypatch.setattr(obs, "_PSUTIL", False)
-        result = obs.check_disk(cfg)
-        assert result.status == "error"
-
-
-class TestCheckMemory:
-    def test_ok(self, cfg):
-        mock_mem = MagicMock()
-        mock_mem.percent = 60.0
-        mock_mem.available = 4 * 1024 ** 3
-        mock_mem.total = 16 * 1024 ** 3
-        with patch("psutil.virtual_memory", return_value=mock_mem):
-            result = obs.check_memory(cfg)
-        assert result.status == "ok"
-
-    def test_critical(self, cfg):
-        mock_mem = MagicMock()
-        mock_mem.percent = 95.0
-        mock_mem.available = 512 * 1024 ** 2
-        mock_mem.total = 16 * 1024 ** 3
-        with patch("psutil.virtual_memory", return_value=mock_mem):
-            result = obs.check_memory(cfg)
-        assert result.status == "critical"
-
-
-class TestCheckCPU:
-    def test_ok(self, cfg):
-        with patch("psutil.cpu_percent", return_value=40.0):
-            result = obs.check_cpu(cfg)
-        assert result.status == "ok"
-
-    def test_warn(self, cfg):
-        with patch("psutil.cpu_percent", return_value=85.0):
-            result = obs.check_cpu(cfg)
-        assert result.status == "warn"
-
-    def test_critical(self, cfg):
-        with patch("psutil.cpu_percent", return_value=98.0):
-            result = obs.check_cpu(cfg)
-        assert result.status == "critical"
-
-
-class TestCheckDatabase:
-    def test_ok(self, cfg):
-        obs._init_db(cfg.db_path)
-        result = obs.check_database(cfg)
-        assert result.status == "ok"
-
-    def test_not_yet_created(self, cfg):
-        # db_path does not exist
-        result = obs.check_database(cfg)
-        assert result.status == "warn"
-
-
-class TestCheckHTTP:
-    def test_webhook_connection_refused(self, cfg):
-        result = obs.check_webhook_http(cfg)
-        # Port 19999 is not bound — should get a "not reachable" warn
-        assert result.status in ("warn", "error")
-
-    def test_api_server_connection_refused(self, cfg):
-        result = obs.check_api_server_http(cfg)
-        assert result.status in ("warn", "error")
-
-    def test_webhook_ok(self, cfg):
-        import urllib.error
-        from unittest.mock import patch, MagicMock
-        mock_resp = MagicMock()
-        mock_resp.__enter__ = lambda s: s
-        mock_resp.__exit__ = MagicMock(return_value=False)
-        mock_resp.status = 200
-        mock_resp.read.return_value = b'{"status":"ok"}'
-        with patch("urllib.request.urlopen", return_value=mock_resp):
-            result = obs.check_webhook_http(cfg)
-        assert result.status in ("ok", "warn")
-
-    def test_webhook_http_error(self, cfg):
-        mock_resp = MagicMock()
-        mock_resp.__enter__ = lambda s: s
-        mock_resp.__exit__ = MagicMock(return_value=False)
-        mock_resp.status = 503
-        with patch("urllib.request.urlopen", return_value=mock_resp):
-            result = obs.check_webhook_http(cfg)
-        assert result.status == "critical"
-
-
-# ---------------------------------------------------------------------------
-# Persistence tests
-# ---------------------------------------------------------------------------
-
-class TestPersistence:
-    def test_store_and_load(self, cfg):
-        obs._init_db(cfg.db_path)
-        from datetime import datetime, timezone
-        ts = datetime.now(timezone.utc).isoformat()
-        snap = obs.HealthSnapshot(
-            ts=ts,
-            checks=[obs.CheckResult(name="test", status="ok", message="fine")],
-        )
-        obs.store_snapshot(cfg, snap)
-        loaded = obs.load_snapshots(cfg, days=30)
-        assert len(loaded) == 1
-        assert loaded[0]["overall"] == "ok"
-
-    def test_retention_pruning(self, cfg):
-        obs._init_db(cfg.db_path)
-        # Insert an old record directly
-        with obs._db(cfg.db_path) as conn:
-            conn.execute(
-                "INSERT INTO health_snapshots (ts, overall, payload) VALUES (?, ?, ?)",
-                ("2000-01-01T00:00:00+00:00", "ok", '{"ts":"2000-01-01T00:00:00+00:00","overall":"ok","checks":[]}'),
-            )
-        snap = obs.HealthSnapshot(
-            ts="2026-01-01T00:00:00+00:00",
-            checks=[],
-        )
-        obs.store_snapshot(cfg, snap)
-        # Old record should have been pruned
-        with obs._db(cfg.db_path) as conn:
-            count = conn.execute("SELECT count(*) FROM health_snapshots WHERE ts < '2001-01-01'").fetchone()[0]
-        assert count == 0
-
-    def test_record_alert_sent(self, cfg):
-        obs._init_db(cfg.db_path)
-        obs.record_alert_sent(cfg, "gateway_process", "critical", "not running")
-        with obs._db(cfg.db_path) as conn:
-            count = conn.execute("SELECT count(*) FROM alerts_sent").fetchone()[0]
-        assert count == 1
-
-
-# ---------------------------------------------------------------------------
-# Alerting tests
-# ---------------------------------------------------------------------------
-
-class TestAlerting:
-    def _snap(self, status):
-        return obs.HealthSnapshot(
-            ts="2026-01-01T00:00:00+00:00",
-            checks=[obs.CheckResult(name="gateway_process", status=status, message="test")],
-        )
-
-    def test_no_alert_when_ok(self, cfg):
-        snap = self._snap("ok")
-        prev = self._snap("ok")
-        obs._init_db(cfg.db_path)
-        with patch("observatory._telegram_send", return_value=True) as mock_send:
-            alerts = obs.maybe_alert(cfg, snap, prev)
-        mock_send.assert_not_called()
-        assert alerts == []
-
-    def test_alert_on_new_critical(self, cfg):
-        snap = self._snap("critical")
-        prev = self._snap("ok")
-        obs._init_db(cfg.db_path)
-        with patch("observatory._telegram_send", return_value=True) as mock_send:
-            alerts = obs.maybe_alert(cfg, snap, prev)
-        mock_send.assert_called_once()
-        assert len(alerts) == 1
-
-    def test_no_duplicate_alert(self, cfg):
-        snap = self._snap("critical")
-        prev = self._snap("critical")  # already critical
-        obs._init_db(cfg.db_path)
-        with patch("observatory._telegram_send", return_value=True) as mock_send:
-            alerts = obs.maybe_alert(cfg, snap, prev)
-        mock_send.assert_not_called()
-        assert alerts == []
-
-    def test_recovery_alert(self, cfg):
-        snap = self._snap("ok")
-        prev = self._snap("critical")
-        obs._init_db(cfg.db_path)
-        with patch("observatory._telegram_send", return_value=True) as mock_send:
-            alerts = obs.maybe_alert(cfg, snap, prev)
-        mock_send.assert_called_once()
-
-    def test_no_alert_without_token(self, cfg):
-        cfg.telegram_token = None
-        snap = self._snap("critical")
-        obs._init_db(cfg.db_path)
-        alerts = obs.maybe_alert(cfg, snap, None)
-        assert alerts == []
-
-    def test_no_alert_without_chat_id(self, cfg):
-        cfg.alert_chat_id = None
-        snap = self._snap("critical")
-        obs._init_db(cfg.db_path)
-        alerts = obs.maybe_alert(cfg, snap, None)
-        assert alerts == []
-
-
-# ---------------------------------------------------------------------------
-# Digest tests
-# ---------------------------------------------------------------------------
-
-class TestDigest:
-    def test_empty_digest(self, cfg):
-        obs._init_db(cfg.db_path)
-        digest = obs.build_digest(cfg)
-        assert "no health data" in digest.lower() or "24 hours" in digest.lower()
-
-    def test_digest_with_data(self, cfg):
-        obs._init_db(cfg.db_path)
-        from datetime import datetime, timezone, timedelta
-        ts = (datetime.now(timezone.utc) - timedelta(hours=1)).isoformat()
-        snap = obs.HealthSnapshot(
-            ts=ts,
-            checks=[
-                obs.CheckResult(name="gateway_process", status="ok", message="running"),
-                obs.CheckResult(name="disk", status="ok", message="50% used", value=50.0, unit="%"),
-                obs.CheckResult(name="webhook_http", status="ok", message="ok", value=150.0, unit="ms"),
-            ],
-        )
-        obs.store_snapshot(cfg, snap)
-        digest = obs.build_digest(cfg)
-        assert "Daily Digest" in digest
-        assert "Gateway" in digest or "gateway" in digest
-
-    def test_send_digest_no_token(self, cfg):
-        cfg.telegram_token = None
-        obs._init_db(cfg.db_path)
-        result = obs.send_digest(cfg)
-        assert result is False
-
-
-# ---------------------------------------------------------------------------
-# SLO tests
-# ---------------------------------------------------------------------------
-
-class TestSLO:
-    def test_slo_definitions_complete(self):
-        assert "gateway_uptime_pct" in obs.SLO_DEFINITIONS
-        assert "webhook_latency_ms" in obs.SLO_DEFINITIONS
-        assert "api_server_latency_ms" in obs.SLO_DEFINITIONS
-
-    def test_slo_targets(self):
-        assert obs.SLO_DEFINITIONS["gateway_uptime_pct"]["target"] == 99.5
-        assert obs.SLO_DEFINITIONS["webhook_latency_ms"]["target"] == 2000
-
-
-# ---------------------------------------------------------------------------
-# CLI tests
-# ---------------------------------------------------------------------------
-
-class TestCLI:
-    def test_check_exits_0_on_ok(self, cfg, monkeypatch, tmp_path):
-        monkeypatch.setenv("OBSERVATORY_DB_PATH", str(tmp_path / "obs.db"))
-        ok_snap = obs.HealthSnapshot(
-            ts="2026-01-01T00:00:00+00:00",
-            checks=[obs.CheckResult(name="all_good", status="ok", message="fine")],
-        )
-        with patch("observatory.collect_snapshot", return_value=ok_snap), \
-             patch("observatory.store_snapshot"):
-            rc = obs.main(["--check"])
-        assert rc == 0
-
-    def test_check_exits_nonzero_on_critical(self, cfg, monkeypatch, tmp_path):
-        monkeypatch.setenv("OBSERVATORY_DB_PATH", str(tmp_path / "obs.db"))
-        bad_snap = obs.HealthSnapshot(
-            ts="2026-01-01T00:00:00+00:00",
-            checks=[obs.CheckResult(name="gateway_process", status="critical", message="down")],
-        )
-        with patch("observatory.collect_snapshot", return_value=bad_snap), \
-             patch("observatory.store_snapshot"):
-            rc = obs.main(["--check"])
-        assert rc != 0
-
-    def test_digest_flag(self, monkeypatch, tmp_path):
-        monkeypatch.setenv("OBSERVATORY_DB_PATH", str(tmp_path / "obs.db"))
-        rc = obs.main(["--digest"])
-        assert rc == 0
-
-    def test_slo_flag(self, monkeypatch, tmp_path):
-        monkeypatch.setenv("OBSERVATORY_DB_PATH", str(tmp_path / "obs.db"))
-        rc = obs.main(["--slo"])
-        assert rc == 0
-
-    def test_history_flag(self, monkeypatch, tmp_path):
-        monkeypatch.setenv("OBSERVATORY_DB_PATH", str(tmp_path / "obs.db"))
-        rc = obs.main(["--history", "5"])
-        assert rc == 0
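The `TestAlerting` cases in the removed test file describe edge-triggered alerting: a message goes out when a check changes state, not on every poll while it stays critical. The sketch below is one way to express that rule. It is not the `observatory` implementation; `Check`, `Snapshot`, and `transitions` are hypothetical names, and the credential gating covered by the token and chat-id tests is left out.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Check:
    name: str
    status: str  # "ok" or "critical" in this sketch


@dataclass
class Snapshot:
    checks: List[Check] = field(default_factory=list)


def transitions(current: Snapshot, previous: Optional[Snapshot]) -> List[str]:
    """Return one message per check whose status changed since the previous snapshot."""
    before = {c.name: c.status for c in previous.checks} if previous else {}
    messages = []
    for check in current.checks:
        # Assumption: a check with no earlier record is treated as previously healthy.
        old = before.get(check.name, "ok")
        if check.status == "critical" and old != "critical":
            messages.append(f"ALERT {check.name}: critical")
        elif check.status == "ok" and old == "critical":
            messages.append(f"RECOVERED {check.name}")
    return messages
```

Comparing against the previous snapshot, rather than only the current one, is what keeps a long outage from producing a message on every poll.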
--- a/wizard-bootstrap/FORGE_OPERATIONS_GUIDE.md
+++ b/wizard-bootstrap/FORGE_OPERATIONS_GUIDE.md
@@ -1,215 +0,0 @@
-# Forge Operations Guide
-
-> **Audience:** Forge wizards joining the hermes-agent project
-> **Purpose:** Practical patterns, common pitfalls, and operational wisdom
-> **Companion to:** `WIZARD_ENVIRONMENT_CONTRACT.md`
-
---
-
-## The One Rule
-
-**Read the actual state before acting.**
-
-Before touching any service, config, or codebase, run `ps aux | grep hermes`, `cat ~/.hermes/gateway_state.json`, and `curl http://127.0.0.1:8642/health`. The forge punishes assumptions harder than it rewards speed. Evidence always beats intuition.
-
---
-
-## First 15 Minutes on a New System
-
-```bash
-# 1. Validate your environment
-python wizard-bootstrap/wizard_bootstrap.py
-
-# 2. Check what is actually running
-ps aux | grep -E 'hermes|python|gateway'
-
-# 3. Check the data directory
-ls -la ~/.hermes/
-cat ~/.hermes/gateway_state.json 2>/dev/null | python3 -m json.tool
-
-# 4. Verify health endpoints (if gateway is up)
-curl -sf http://127.0.0.1:8642/health | python3 -m json.tool
-
-# 5. Run the smoke test
-source venv/bin/activate
-python -m pytest tests/ -q -x --timeout=60 2>&1 | tail -20
-```
-
-Do not begin work until all five steps return clean output.
-
---
-
-## Import Chain — Know It, Respect It
-
-The dependency order is load-bearing. Violating it causes silent failures:
-
-```
-tools/registry.py   ← no deps; imported by everything
-       ↑
-tools/*.py          ← each calls registry.register() at import time
-       ↑
-model_tools.py      ← imports registry; triggers tool discovery
-       ↑
-run_agent.py / cli.py / batch_runner.py
-```
-
-**If you add a tool file**, you must also:
-1. Add its import to `model_tools.py` `_discover_tools()`
-2. Add it to `toolsets.py` (core or a named toolset)
-
-Missing either step causes the tool to silently not appear — no error, just absence.
-
---
-
-## The Five Profile Rules
-
-Hermes supports isolated profiles (`hermes -p myprofile`). Profile-unsafe code has caused repeated bugs. Memorize these:
-
-| Do this | Not this |
-|---------|----------|
-| `get_hermes_home()` | `Path.home() / ".hermes"` |
-| `display_hermes_home()` in user messages | hardcoded `~/.hermes` strings |
-| `get_hermes_home() / "sessions"` in tests | `~/.hermes/sessions` in tests |
-
-Import both from `hermes_constants`. Every `~/.hermes` hardcode is a latent profile bug.
-
---
-
-## Prompt Caching — Do Not Break It
-
-The agent caches system prompts. Cache breaks force re-billing of the entire context window on every turn. The following actions break caching mid-conversation and are forbidden:
-
-- Altering past context
-- Changing the active toolset
-- Reloading memories or rebuilding the system prompt
-
-The only sanctioned context alteration is the context compressor (`agent/context_compressor.py`). If your feature touches the message history, read that file first.
-
---
-
-## Adding a Slash Command (Checklist)
-
-Four steps, in order:
-
-1. **`hermes_cli/commands.py`** — add `CommandDef` to `COMMAND_REGISTRY`
-2. **`cli.py`** — add handler branch in `HermesCLI.process_command()`
-3. **`gateway/run.py`** — add handler if it should work in messaging platforms
-4. **Aliases** — add to the `aliases` tuple on the `CommandDef`; everything else updates automatically
-
-All downstream consumers (Telegram menu, Slack routing, autocomplete, help text) derive from `COMMAND_REGISTRY`. You never touch them directly.
-
---
-
-## Tool Schema Pitfalls
-
-**Do NOT cross-reference other toolsets in schema descriptions.**
-Writing "prefer `web_search` over this tool" in a browser tool's description will cause the model to hallucinate calls to `web_search` when it's not loaded. Cross-references belong in `get_tool_definitions()` post-processing blocks in `model_tools.py`.
-
-**Do NOT use `\033[K` (ANSI erase-to-EOL) in display code.**
-Under `prompt_toolkit`'s `patch_stdout`, it leaks as literal `?[K`. Use space-padding instead: `f"\r{line}{' ' * pad}"`.
-
-**Do NOT use `simple_term_menu` for interactive menus.**
-It ghosts on scroll in tmux/iTerm2. Use `curses` (stdlib). See `hermes_cli/tools_config.py` for the pattern.
-
---
-
-## Health Check Anatomy
-
-A healthy instance returns:
-
-```json
-{
-  "status": "ok",
-  "gateway_state": "running",
-  "platforms": {
-    "telegram": {"state": "connected"}
-  }
-}
-```
-
-| Field | Healthy value | What a bad value means |
-|-------|--------------|----------------------|
-| `status` | `"ok"` | HTTP server down |
-| `gateway_state` | `"running"` | Still starting or crashed |
-| `platforms.<name>.state` | `"connected"` | Auth failure or network issue |
-
-`gateway_state: "starting"` is normal for up to 60 s on boot. Beyond that, check logs for auth errors:
-
-```bash
-journalctl -u hermes-gateway --since "2 minutes ago" | grep -i "error\|token\|auth"
-```
-
---
-
-## Gateway Won't Start — Diagnosis Order
-
-1. `ss -tlnp | grep 8642` — port conflict?
-2. `cat ~/.hermes/gateway.pid` → `ps -p <pid>` — stale PID file?
-3. `hermes gateway start --replace` — clears stale locks and PIDs
-4. `HERMES_LOG_LEVEL=DEBUG hermes gateway start` — verbose output
-5. Check `~/.hermes/.env` — missing or placeholder token?
-
---
-
-## Before Every PR
-
-```bash
-source venv/bin/activate
-python -m pytest tests/ -q          # full suite: ~3 min, ~3000 tests
-python scripts/deploy-validate       # deployment health check
-python wizard-bootstrap/wizard_bootstrap.py  # environment sanity
-```
-
-All three must exit 0. Do not skip. "It works locally" is not sufficient evidence.
-
---
-
-## Session and State Files
-
-| Store | Location | Notes |
-|-------|----------|-------|
-| Sessions | `~/.hermes/sessions/*.json` | Persisted across restarts |
-| Memories | `~/.hermes/memories/*.md` | Written by the agent's memory tool |
-| Cron jobs | `~/.hermes/cron/*.json` | Scheduler state |
-| Gateway state | `~/.hermes/gateway_state.json` | Live platform connection status |
-| Response store | `~/.hermes/response_store.db` | SQLite WAL — API server only |
-
-All paths go through `get_hermes_home()`. Never hardcode them. Always back up before a major update:
-
-```bash
-mkdir -p ~/backups && tar czf ~/backups/hermes_$(date +%F_%H%M).tar.gz ~/.hermes/
-```
-
---
-
-## Writing Tests
-
-```bash
-python -m pytest tests/path/to/test.py -q    # single file
-python -m pytest tests/ -q -k "test_name"    # by name
-python -m pytest tests/ -q -x               # stop on first failure
-```
-
-**Test isolation rules:**
-- `tests/conftest.py` has an autouse fixture that redirects `HERMES_HOME` to a temp dir. Never write to `~/.hermes/` in tests.
-- Profile tests must mock both `Path.home()` and `HERMES_HOME`. See `tests/hermes_cli/test_profiles.py` for the pattern.
-- Do not mock the database. Integration tests should use real SQLite with a temp path.
-
---
-
-## Commit Conventions
-
-```
-feat: add X           # new capability
-fix: correct Y        # bug fix
-refactor: restructure Z  # no behaviour change
-test: add tests for W    # test-only
-chore: update deps       # housekeeping
-docs: clarify X          # documentation only
-```
-
-Include `Fixes #NNN` or `Refs #NNN` in the commit message body to close or reference issues automatically.
-
---
-
-*This guide lives in `wizard-bootstrap/`. Update it when you discover a new pitfall or pattern worth preserving.*
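The Health Check Anatomy table in the removed guide maps each field of the `/health` payload to a likely failure. As a rough illustration, the sketch below applies that table to an already-parsed payload. The function name `diagnose` is hypothetical, and the payload shape is assumed to match the JSON example in the guide; fetching the payload over HTTP is deliberately left out.

```python
from typing import Any, Dict, List


def diagnose(payload: Dict[str, Any]) -> List[str]:
    """Return a list of problems found in a /health payload; empty means healthy."""
    problems = []
    if payload.get("status") != "ok":
        problems.append("status: HTTP server down")
    state = payload.get("gateway_state")
    if state != "running":
        problems.append(f"gateway_state={state!r}: still starting or crashed")
    # Each platform reports its own connection state.
    for name, info in payload.get("platforms", {}).items():
        if info.get("state") != "connected":
            problems.append(
                f"platforms.{name}.state={info.get('state')!r}: "
                "auth failure or network issue"
            )
    return problems


healthy = {
    "status": "ok",
    "gateway_state": "running",
    "platforms": {"telegram": {"state": "connected"}},
}
booting = dict(healthy, gateway_state="starting")

print(diagnose(healthy))
print(diagnose(booting))
```

An empty list corresponds to the healthy example in the guide. A `"starting"` state is reported as a problem here, so a caller would still need to apply the guide's 60-second boot allowance before treating it as a failure.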