diff --git a/docs/VPS_RECOVERY_RUNBOOK.md b/docs/VPS_RECOVERY_RUNBOOK.md
new file mode 100644
index 00000000..bf961dff
--- /dev/null
+++ b/docs/VPS_RECOVERY_RUNBOOK.md
@@ -0,0 +1,314 @@
# VPS Recovery Runbook
# Issue #481 — Single-node VPS Single Point of Failure
# Created: STEP35 free burn | 2026-04-26

## Risk Statement

The Hermes VPS (143.198.27.163) hosts Gitea, the FastAPI backend, and the Ezra/Allegro/Bezalel wizard houses. This is a single point of failure — if the VPS is lost, the entire forge and agent coordination layer is offline.

**Risk Level:** High

---

## Current Mitigations (As-Built)

### 1. Daily Database Backups

A daily backup job runs on the VPS:

```
30 3 * * * /root/wizards/bezalel/backup_databases.sh
```

**What it backs up:**
- Gitea SQLite databases (`/var/lib/gitea/data/gitea.db` and related)
- Wizard configuration databases (if any)
- Retained for 7 days (estimated; verify against the script)

**Where backups are stored:** (TBD — need to inspect `backup_databases.sh` on the live VPS)

**Important:** This script is NOT version-controlled in timmy-config. It exists only on the live VPS.

### 2. Version-Controlled Configuration

All operational configuration is version-controlled in `Timmy_Foundation/timmy-config`:
- `config.yaml` — Hermes harness configuration
- `playbooks/` — Agent playbooks
- `memories/` — Persistent memory YAML
- `cron/` — Cron job definitions (source of truth)
- `bin/` — Operational helper scripts
- `ansible/` — Infrastructure-as-code playbooks

### 3. Ansible Deployment

Wizard houses are deployed via Ansible from any machine with SSH access:

```bash
cd ansible
ansible-playbook -i inventory/hosts.yml playbooks/site.yml --limit ezra
```

The VPS itself is disposable — wizard state is rebuilt from configuration + data backups.

---

## Recovery Procedure

### Pre-Recovery Checklist

- [ ] Identify the failure scope (VPS destroyed vs.
service outage)
- [ ] Obtain a replacement VPS (same region, preferably DigitalOcean or equivalent)
- [ ] Gather the SSH private key for root access
- [ ] Locate the most recent backup from `/backups/` on the live VPS (if accessible)
- [ ] Ensure `~/.config/gitea/token` is available locally for API operations
- [ ] Confirm DNS will be updated to the new VPS IP

**TOTAL ESTIMATED RECOVERY TIME: 4–8 hours** (depending on backup availability and DNS propagation)

### Step 1 — Provision Replacement VPS

```bash
# Using DigitalOcean (current provider)
# --ssh-keys takes key IDs; this grabs the first key registered on the account
doctl compute droplet create \
  --image debian-12-x64 \
  --region nyc1 \
  --size s-2vcpu-4gb \
  --ssh-keys "$(doctl compute ssh-key list --format ID --no-header | head -1)" \
  forge-recovery-$(date +%Y%m%d)
```

**Record the new VPS IP address.**

Alternatively, reuse an existing standby VPS if one is available (see the cold-standby mitigation proposed under Post-Mortem Actions).

### Step 2 — Install Base Dependencies

SSH into the new VPS as root and run:

```bash
# Update system
apt update && apt upgrade -y

# Install required packages
apt install -y python3 python3-pip python3-venv git curl wget jq sqlite3

# Install Docker (for Matrix Conduit if applicable)
curl -fsSL https://get.docker.com | sh
# These steps run as root, so no docker group change is needed; if you later
# create a non-root deploy user, add that user to the docker group instead:
#   usermod -aG docker <deploy-user>

# Create directory structure
mkdir -p /root/wizards/{ezra,allegro,bezalel}
mkdir -p /root/.hermes/{bin,skins,playbooks,memories,cron}
mkdir -p /var/log/ansible
```

**Time:** 15 minutes

### Step 3 — Deploy timmy-config Repository

```bash
# Clone timmy-config
cd /root
git clone https://forge.alexanderwhitestone.com/Timmy_Foundation/timmy-config.git
cd timmy-config

# Run deploy script to overlay configuration
./deploy.sh
```

**Note:** if the forge itself is unreachable (it runs on the failed VPS), clone from a local working copy or a mirror instead.

This creates the canonical `~/.hermes/` configuration tree from version control.
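A quick sanity check that the overlay actually produced the expected tree is cheap insurance before moving on. This is a sketch; the paths are assumptions based on the directory layout created in Step 2 and may need adjusting to match what `deploy.sh` really writes:

```bash
# Report OK/MISSING for each expected path in the ~/.hermes overlay.
check_path() {
  if [ -e "$1" ]; then
    echo "OK $1"
  else
    echo "MISSING $1"
  fi
}

for p in "$HOME/.hermes/config.yaml" \
         "$HOME/.hermes/playbooks" \
         "$HOME/.hermes/memories" \
         "$HOME/.hermes/cron" \
         "$HOME/.hermes/bin"; do
  check_path "$p"
done
```

Any `MISSING` line means `deploy.sh` did not complete; re-run it before continuing.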

**Time:** 5 minutes

### Step 4 — Restore Gitea Data from Backup

**First, determine the backup format from the live VPS (if accessible):**

```bash
# On LIVE VPS (if it's still reachable)
ssh root@143.198.27.163 "ls -la /backups/gitea/ 2>/dev/null || ls -la /root/backups/ 2>/dev/null || true"
```

**Expected locations:**
- `/backups/gitea/` (standard)
- `/var/backups/gitea/`
- `/root/backups/`

**Note:** Step 2 does not install Gitea itself. Make sure the Gitea binary, the `gitea` user, and the systemd unit are in place (via the Ansible roles in Step 6, or a manual install) before restoring data and starting the service.

**If you have a SQLite backup file (`gitea.db` or `gitea-YYYYMMDD.db`):**

```bash
# On NEW VPS
# Stop Gitea if it's running (service will fail until data is restored)
systemctl stop gitea 2>/dev/null || true

# Create data directory if needed
mkdir -p /var/lib/gitea/data

# Restore the database
cp /path/to/backup/gitea.db /var/lib/gitea/data/gitea.db
chown gitea:gitea /var/lib/gitea/data/gitea.db
chmod 600 /var/lib/gitea/data/gitea.db

# Restore custom templates/public if those were backed up
if [ -d "/backups/gitea/custom" ]; then
  cp -r /backups/gitea/custom/* /var/lib/gitea/custom/
  chown -R gitea:gitea /var/lib/gitea/custom
fi
```

**Start Gitea:**

```bash
systemctl start gitea
sleep 5
systemctl status gitea
```

**Verify Gitea is healthy (locally; DNS still points at the old IP until Step 8):**

```bash
# 3000 is Gitea's default HTTP port; adjust if your app.ini differs
curl -s -o /dev/null -w "%{http_code}" http://localhost:3000/api/v1/version
# Expected: 200
```

**Time:** 20 minutes

### Step 5 — Restore FastAPI / Backend Services

The FastAPI backend configuration lives in `timmy-config/config.yaml`. Since it's version-controlled, just verify:

```bash
# Check config
grep -A5 'fastapi\|backend\|port' ~/.hermes/config.yaml

# Start the backend service (if managed via systemd)
systemctl start hermes-backend 2>/dev/null || true

# Verify health
curl -s http://localhost:8645/health || echo "Backend endpoint may differ"
```

If the backend uses a separate systemd service, it should be defined in the Ansible roles. Deploy via Ansible (Step 6).
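A backend that is still starting up can fail a one-shot `curl` and send you down a false debugging path. A small retry loop avoids that; this is a sketch, and the port (8645) and `/health` path are carried over from the check above rather than confirmed:

```bash
# Poll a health URL until it returns HTTP 200 or we run out of attempts.
# Usage: wait_healthy URL [TRIES] [DELAY_SECONDS]
wait_healthy() {
  url="$1"; tries="${2:-12}"; delay="${3:-5}"
  i=0
  while :; do
    # curl prints "000" via -w when the connection itself fails
    code=$(curl -s -o /dev/null -w '%{http_code}' "$url")
    if [ "$code" = "200" ]; then
      echo "healthy: $url"
      return 0
    fi
    i=$((i + 1))
    if [ "$i" -ge "$tries" ]; then
      echo "unhealthy after $tries attempts (last HTTP code: $code)" >&2
      return 1
    fi
    sleep "$delay"
  done
}

wait_healthy http://localhost:8645/health 6 5 || echo "backend not up yet"
```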
**Time:** 10 minutes

### Step 6 — Deploy Wizard Houses via Ansible

```bash
cd /root/timmy-config/ansible
ansible-playbook -i inventory/hosts.yml playbooks/site.yml
```

This will:
- Create wizard directories
- Deploy configuration
- Set up cron jobs
- Start systemd services

**If Ansible fails because SSH keys aren't set up on the new VPS yet:**

```bash
# On LOCAL machine (where you have SSH access to the new VPS)
cat ~/.ssh/id_rsa.pub | ssh root@<NEW_VPS_IP> "mkdir -p ~/.ssh && cat >> ~/.ssh/authorized_keys"

# Update ansible/inventory/hosts.yml with the new IP for the `forge` and `ezra` hosts
# Then re-run ansible
```

**Time:** 15 minutes

### Step 7 — Verify Fleet Health

```bash
# Check all wizards
systemctl status hermes-{ezra,allegro,bezalel} 2>/dev/null

# Check hermes state
ps aux | grep hermes

# Check cron jobs
crontab -l

# Check logs for errors
tail -50 /var/log/ansible/timmy-fleet.log
tail -50 /root/.hermes/logs/sprint/*.log 2>/dev/null || true
```

### Step 8 — Update DNS

If the new VPS has a different IP than the old one, update DNS A records:

| Service | Hostname | Current IP |
|---------------------------|-----------------------------------|------------------|
| Gitea / Forge | forge.alexanderwhitestone.com | 143.198.27.163 |
| (future) Nexus | nexus.timmytime.net | (TBD) |

**Action:** Update the A record for `forge.alexanderwhitestone.com` to point to `<NEW_VPS_IP>`.

**TTL:** 300 seconds (5 min) — propagation complete in ~15 min

**Time:** 5 minutes + DNS propagation

### Step 9 — Post-Recovery Validation

Once DNS has propagated (wait 15 min, then):

```bash
# 1. Gitea accessibility
curl -s -I https://forge.alexanderwhitestone.com/api/v1/version | head -1
# Expected: HTTP/2 200

# 2. Issue creation test
# Use gitea-api.sh to file a test issue
gitea-api.sh issue create timmy-config "Recovery Test" "Automated post-recovery validation — can be closed."
# Expected: a new issue number is returned (note it for the close step below)

# 3.
Wizard heartbeat check
# Check latest fleet health logs
tail -30 ~/.local/timmy/fleet-health/*.json 2>/dev/null | head -1

# 4. Herald dispatch test
# File a simple issue and watch dispatch
```

**Close the test issue:**
```bash
# <ISSUE_NUMBER> = the number returned by the creation test in Step 9
gitea-api.sh issue close timmy-config <ISSUE_NUMBER>
```

---

## Rollback Plan

If recovery fails or the original VPS comes back online:

1. **Repoint DNS** — point to a static "maintenance" page or a 502
2. **Shut down the new VPS** — `shutdown -h now` (preserve disks for forensics)
3. **Revert to the original VPS** once it's confirmed healthy
4. **Document the failure** — add an ADR to `docs/adr/`

---

## Post-Mortem Actions

After successful recovery:
1. Document the root cause of the VPS loss
2. Verify backup integrity — ensure `backup_databases.sh` actually works
3. Consider **Mitigation #3 (proposed)** — Cold standby VPS with automated sync
4. Consider **Mitigation #4 (proposed)** — Mirror all repos to GitHub as secondary
5. Update this runbook with any corrections discovered during recovery

---

## Related Issues

- #481 — Single-node VPS SPOF audit (this document)
- Future: Automated backup verification
- Future: Offsite backup sync (S3, remote)
- Future: Hot standby VPS with keepalived/HAProxy

---

**Last updated:** 2026-04-26
**Maintained by:** Timmy Foundation Infrastructure Team
**Review cadence:** After each recovery drill or actual recovery
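
---

## Appendix: Backup Integrity Check (sketch)

Post-Mortem action 2 asks for proof that `backup_databases.sh` actually produces usable backups. A minimal sketch, assuming the backups are plain SQLite files under `/backups/gitea/` (one of the candidate locations listed in Step 4); the path and filename pattern are assumptions, not confirmed from the live script:

```bash
# Verify a SQLite backup: non-empty file plus a clean PRAGMA integrity_check.
verify_backup() {
  db="$1"
  if [ ! -s "$db" ]; then
    echo "FAIL: $db missing or empty"
    return 1
  fi
  result=$(sqlite3 "$db" 'PRAGMA integrity_check;' 2>/dev/null)
  if [ "$result" = "ok" ]; then
    echo "PASS: $db"
  else
    echo "FAIL: $db integrity_check returned: ${result:-error}"
    return 1
  fi
}

# Hypothetical invocation; adjust to the real backup path and naming scheme.
verify_backup "/backups/gitea/gitea-$(date +%Y%m%d).db" || true
```

Wiring this into cron (after the 03:30 backup job) and filing a forge issue on `FAIL` would close the "Automated backup verification" item under Related Issues.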