Alexander Payne
d352f4d05b docs(#481): add VPS recovery runbook for single-point-of-failure mitigation
Create docs/VPS_RECOVERY_RUNBOOK.md documenting:
- Current backup infrastructure (daily backup_databases.sh)
- Step-by-step recovery from VPS loss
- Time estimates (4-8 hours)
- DNS update procedures
- Post-recovery validation checklist

This is the smallest concrete fix addressing the SPOF audit.
It provides actionable recovery instructions while remaining lightweight.

Refs #481
2026-04-26 17:24:22 -04:00

# VPS Recovery Runbook
# Issue #481 — Single-node VPS Single Point of Failure
# Created: STEP35 free burn | 2026-04-26
## Risk Statement
The Hermes VPS (143.198.27.163) hosts Gitea, FastAPI backend, and the Ezra/Allegro/Bezalel wizard houses. This is a single point of failure — if the VPS is lost, the entire forge and agent coordination layer is offline.
**Risk Level:** High
---
## Current Mitigations (As-Built)
### 1. Daily Database Backups
There is a daily backup job running on the VPS:
```
30 3 * * * /root/wizards/bezalel/backup_databases.sh
```
**What it backs up:**
- Gitea SQLite databases (`/var/lib/gitea/data/gitea.db` and related)
- Wizard configuration databases (if any)
- Retained for 7 days (estimated — verify script)
**Where backups are stored:** (TBD — need to inspect `backup_databases.sh` on live VPS)
**Important:** This script is NOT version-controlled in timmy-config. It exists only on the live VPS.
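Until that gap is closed, one option is to vendor the script into timmy-config so recovery does not depend on the live box. A sketch (the `bin/` destination is an assumption; verify the remote path before committing):

```bash
# Hypothetical helper: copy the live backup script into a timmy-config checkout.
# Run from the repository root; paths are assumptions to verify first.
vendor_backup_script() {
  scp root@143.198.27.163:/root/wizards/bezalel/backup_databases.sh bin/backup_databases.sh
  git add bin/backup_databases.sh
  git commit -m "chore: vendor backup_databases.sh from live VPS"
}
# Usage: vendor_backup_script
```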
### 2. Version-Controlled Configuration
All operational configuration is version-controlled in `Timmy_Foundation/timmy-config`:
- `config.yaml` — Hermes harness configuration
- `playbooks/` — Agent playbooks
- `memories/` — Persistent memory YAML
- `cron/` — Cron job definitions (source of truth)
- `bin/` — Operational helper scripts
- `ansible/` — Infrastructure-as-code playbooks
### 3. Ansible Deployment
Wizard houses are deployed via Ansible from any machine with SSH access:
```bash
cd ansible
ansible-playbook -i inventory/hosts.yml playbooks/site.yml --limit ezra
```
The VPS itself is disposable — wizard state is rebuilt from configuration + data backups.
---
## Recovery Procedure
### Pre-Recovery Checklist
- [ ] Identify the failure scope (VPS destroyed vs. service outage)
- [ ] Obtain a replacement VPS (same region, preferably DigitalOcean or equivalent)
- [ ] Gather SSH private key for root access
- [ ] Locate the most recent backup from `/backups/` on the live VPS (if accessible)
- [ ] Ensure `~/.config/gitea/token` is available locally for API operations
- [ ] Confirm DNS will be updated to new VPS IP
**TOTAL ESTIMATED RECOVERY TIME: 4-8 hours** (depending on backup availability and DNS propagation)
### Step 1 — Provision Replacement VPS
```bash
# Using DigitalOcean (current provider)
doctl compute droplet create \
  --image debian-12-x64 \
  --region nyc1 \
  --size s-2vcpu-4gb \
  --ssh-keys "$(doctl compute ssh-key list --format ID --no-header | head -1)" \
  forge-recovery-"$(date +%Y%m%d)"
```
**Record the new VPS IP address.**
Alternatively, reuse an existing cold-standby VPS if one is available (see Post-Mortem Actions, Mitigation #3).
### Step 2 — Install Base Dependencies
SSH into new VPS as root and run:
```bash
# Update system
apt update && apt upgrade -y
# Install required packages
apt install -y python3 python3-pip python3-venv git curl wget jq sqlite3
# Install Docker (for Matrix Conduit if applicable)
curl -fsSL https://get.docker.com | sh
# (running as root, so no docker group change is needed; for a non-root operator: usermod -aG docker <user>)
# Create directory structure
mkdir -p /root/wizards/{ezra,allegro,bezalel}
mkdir -p /root/.hermes/{bin,skins,playbooks,memories,cron}
mkdir -p /var/log/ansible
```
**Time:** 15 minutes
### Step 3 — Deploy timmy-config Repository
```bash
# Clone timmy-config
cd /root
git clone https://forge.alexanderwhitestone.com/Timmy_Foundation/timmy-config.git
cd timmy-config
# Run deploy script to overlay configuration
./deploy.sh
```
This creates the canonical `~/.hermes/` configuration tree from version control.
**Time:** 5 minutes
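A quick spot-check that the overlay landed catches a failed `deploy.sh` early. A hypothetical helper (adjust the path list to the actual tree):

```bash
# Return non-zero if any expected path is missing after deploy.sh runs
overlay_ok() {
  local p missing=0
  for p in "$@"; do
    [ -e "$p" ] || { echo "MISSING: $p" >&2; missing=1; }
  done
  return "$missing"
}
# Example: overlay_ok ~/.hermes/config.yaml ~/.hermes/playbooks ~/.hermes/cron
```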
### Step 4 — Restore Gitea Data from Backup
**First, determine the backup format from the live VPS (if accessible):**
```bash
# On LIVE VPS (if it's still reachable)
ssh root@143.198.27.163 "ls -la /backups/gitea/ 2>/dev/null || ls -la /root/backups/ 2>/dev/null || true"
```
**Expected locations:**
- `/backups/gitea/` (standard)
- `/var/backups/gitea/`
- `/root/backups/`
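To pick the most recent file across those candidate locations, something like this helper (hypothetical, not part of the backup script) works:

```bash
# Print the newest file matching any of the given patterns, or nothing if none match
newest_backup() {
  ls -t "$@" 2>/dev/null | head -1
}
# Example:
# newest_backup /backups/gitea/gitea*.db* /var/backups/gitea/gitea*.db* /root/backups/gitea*.db*
```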
**If you have a SQLite backup file (`gitea.db` or `gitea-YYYYMMDD.db`):**
```bash
# On NEW VPS
# Stop Gitea if it's running (service will fail until data is restored)
systemctl stop gitea 2>/dev/null || true
# Create data directory if needed
mkdir -p /var/lib/gitea/data
# Restore the database
cp /path/to/backup/gitea.db /var/lib/gitea/data/gitea.db
chown gitea:gitea /var/lib/gitea/data/gitea.db
chmod 600 /var/lib/gitea/data/gitea.db
# Restore custom templates/public if those were backed up
if [ -d "/backups/gitea/custom" ]; then
  mkdir -p /var/lib/gitea/custom
  cp -r /backups/gitea/custom/* /var/lib/gitea/custom/
  chown -R gitea:gitea /var/lib/gitea/custom
fi
```
**Start Gitea:**
```bash
systemctl start gitea
sleep 5
systemctl status gitea
```
**Verify Gitea is healthy:**
```bash
curl -s -o /dev/null -w "%{http_code}" https://forge.alexanderwhitestone.com/api/v1/version
# Expected: 200
```
**Time:** 20 minutes
### Step 5 — Restore FastAPI / Backend Services
The FastAPI backend configuration lives in `timmy-config/config.yaml`. Since it's version-controlled, just verify:
```bash
# Check config
grep -E -A5 'fastapi|backend|port' ~/.hermes/config.yaml
# Start the backend service (if managed via systemd)
systemctl start hermes-backend 2>/dev/null || true
# Verify health
curl -s http://localhost:8645/health || echo "Backend endpoint may differ"
```
If the backend uses a separate systemd service, it should be defined in the ansible roles. Deploy it via ansible (Step 6).
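If no role exists yet, a backend unit might look roughly like this. This is a sketch only: the unit name `hermes-backend`, the working directory, the `app:app` module path, and the port are all assumptions to check against the ansible roles:

```ini
# /etc/systemd/system/hermes-backend.service (hypothetical)
[Unit]
Description=Hermes FastAPI backend
After=network-online.target

[Service]
User=root
WorkingDirectory=/root/timmy-config
ExecStart=/usr/bin/python3 -m uvicorn app:app --host 127.0.0.1 --port 8645
Restart=on-failure

[Install]
WantedBy=multi-user.target
```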
**Time:** 10 minutes
### Step 6 — Deploy Wizard Houses via Ansible
```bash
cd /root/timmy-config/ansible
ansible-playbook -i inventory/hosts.yml playbooks/site.yml
```
This will:
- Create wizard directories
- Deploy configuration
- Set up cron jobs
- Start systemd services
**If Ansible fails because SSH keys aren't set up on the new VPS yet:**
```bash
# On LOCAL machine (where you have SSH access to the new VPS)
cat ~/.ssh/id_rsa.pub | ssh root@<NEW_VPS_IP> "mkdir -p ~/.ssh && cat >> ~/.ssh/authorized_keys"
# Update ansible/inventory/hosts.yml with new IP for the `forge` and `ezra` hosts
# Then re-run ansible
```
**Time:** 15 minutes
### Step 7 — Verify Fleet Health
```bash
# Check all wizards
systemctl status hermes-{ezra,allegro,bezalel} 2>/dev/null
# Check hermes state
ps aux | grep hermes
# Check cron jobs
crontab -l
# Check logs for errors
tail -50 /var/log/ansible/timmy-fleet.log
tail -50 /root/.hermes/logs/sprint/*.log 2>/dev/null || true
```
### Step 8 — Update DNS
If the new VPS has a different IP than the old one, update DNS A records:
| Service | Hostname | Current IP |
|---------------------------|-----------------------------------|------------------|
| Gitea / Forge | forge.alexanderwhitestone.com | 143.198.27.163 |
| (future) Nexus | nexus.timmytime.net | (TBD) |
**Action:** Update the A record for `forge.alexanderwhitestone.com` to point to `<NEW_VPS_IP>`.
**TTL:** 300 seconds (5 min) — propagation complete in ~15 min
**Time:** 5 minutes + DNS propagation
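Propagation can be polled from any machine. A small helper along these lines (hypothetical, resolving via `getent`; substitute `dig +short` if preferred):

```bash
# Poll until <host> resolves to <expected IP>, checking every 10s
wait_for_dns() {
  local host=$1 expected=$2 tries=${3:-90} i
  for ((i = 1; i <= tries; i++)); do
    if getent hosts "$host" | awk '{print $1}' | grep -qx "$expected"; then
      echo "DNS updated: $host -> $expected"
      return 0
    fi
    if ((i < tries)); then sleep 10; fi
  done
  echo "DNS still stale for $host after $((tries * 10))s" >&2
  return 1
}
# Example: wait_for_dns forge.alexanderwhitestone.com <NEW_VPS_IP>
```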
### Step 9 — Post-Recovery Validation
Once DNS has propagated (wait ~15 minutes), run:
```bash
# 1. Gitea accessibility
curl -s -I https://forge.alexanderwhitestone.com/api/v1/version | head -1
# Expected: HTTP/2 200
# 2. Issue creation test
# Use gitea-api.sh to file a test issue
gitea-api.sh issue create timmy-config "Recovery Test" "Automated post-recovery validation — can be closed."
# Expected: Issue #<N> created
# 3. Wizard heartbeat check
# Check latest fleet health logs
tail -30 ~/.local/timmy/fleet-health/*.json 2>/dev/null | head -1
# 4. Herald dispatch test
# File a simple issue and watch dispatch
```
**Close the test issue:**
```bash
gitea-api.sh issue close timmy-config <TEST_ISSUE_NUM>
```
---
## Rollback Plan
If recovery fails or the original VPS comes back online:
1. **Pause DNS** — point to a static "maintenance" page or 502
2. **Shut down the new VPS** with `shutdown -h now` (preserve disks for forensics)
3. **Revert to original VPS** once it's confirmed healthy
4. **Document the failure** — add an ADR to `docs/adr/`
---
## Post-Mortem Actions
After successful recovery:
1. Document the root cause of the VPS loss
2. Verify backup integrity — ensure `backup_databases.sh` actually works
3. Consider **Mitigation #3** — Cold standby VPS with automated sync
4. Consider **Mitigation #4** — Mirror all repos to GitHub as secondary
5. Update this runbook with any corrections discovered during recovery
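For item 2, a restore-test is the only real proof a backup works. A minimal sketch, assuming SQLite-format backups and an installed `sqlite3`:

```bash
# Copy the backup to scratch space and run SQLite's integrity check on the copy
verify_sqlite_backup() {
  local backup=$1 scratch result
  scratch=$(mktemp -d)
  cp "$backup" "$scratch/restore-test.db"
  # integrity_check prints a single "ok" line for a healthy database
  result=$(sqlite3 "$scratch/restore-test.db" 'PRAGMA integrity_check;')
  rm -rf "$scratch"
  if [ "$result" = "ok" ]; then
    echo "backup OK: $backup"
  else
    echo "backup FAILED integrity check: $backup ($result)" >&2
    return 1
  fi
}
# Example: verify_sqlite_backup "$(ls -t /backups/gitea/gitea*.db* | head -1)"
```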
---
## Related Issues
- #481 — Single-node VPS SPOF audit (this document)
- Future: Automated backup verification
- Future: Offsite backup sync (S3, remote)
- Future: Hot standby VPS with keepalived/HAProxy
---
**Last updated:** 2026-04-26
**Maintained by:** Timmy Foundation Infrastructure Team
**Review cadence:** After each recovery drill or actual recovery