
VPS Recovery Runbook

Issue #481 — Single-node VPS Single Point of Failure

Created: STEP35 free burn | 2026-04-26

Risk Statement

The Hermes VPS (143.198.27.163) hosts Gitea, FastAPI backend, and the Ezra/Allegro/Bezalel wizard houses. This is a single point of failure — if the VPS is lost, the entire forge and agent coordination layer is offline.

Risk Level: High


Current Mitigations (As-Built)

1. Daily Database Backups

There is a daily backup job running on the VPS:

30 3 * * * /root/wizards/bezalel/backup_databases.sh

What it backs up:

  • Gitea SQLite databases (/var/lib/gitea/data/gitea.db and related)
  • Wizard configuration databases (if any)
  • Retained for 7 days (estimated — verify script)

Where backups are stored: (TBD — need to inspect backup_databases.sh on live VPS)

Important: This script is NOT version-controlled in timmy-config. It exists only on the live VPS.
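
Because the live script is not in version control, the sketch below shows what an equivalent, version-controllable backup job could look like. The destination directory, database path, and 7-day retention are assumptions that should be verified against the real backup_databases.sh before relying on this.

#!/usr/bin/env bash
# Sketch of a daily Gitea backup job -- paths and retention are assumptions
set -euo pipefail
BACKUP_DIR=/backups/gitea
mkdir -p "$BACKUP_DIR"

# Use sqlite3's online backup so the copy is consistent while Gitea is running
sqlite3 /var/lib/gitea/data/gitea.db ".backup '$BACKUP_DIR/gitea-$(date +%Y%m%d).db'"

# Prune backups older than 7 days (estimated retention -- verify the live script)
find "$BACKUP_DIR" -name 'gitea-*.db' -mtime +7 -delete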

2. Version-Controlled Configuration

All operational configuration is version-controlled in Timmy_Foundation/timmy-config:

  • config.yaml — Hermes harness configuration
  • playbooks/ — Agent playbooks
  • memories/ — Persistent memory YAML
  • cron/ — Cron job definitions (source of truth)
  • bin/ — Operational helper scripts
  • ansible/ — Infrastructure-as-code playbooks

3. Ansible Deployment

Wizard houses are deployed via Ansible from any machine with SSH access:

cd ansible
ansible-playbook -i inventory/hosts.yml playbooks/site.yml --limit ezra

The VPS itself is disposable — wizard state is rebuilt from configuration + data backups.


Recovery Procedure

Pre-Recovery Checklist

  • Identify the failure scope (VPS destroyed vs. service outage)
  • Obtain a replacement VPS (same region, preferably DigitalOcean or equivalent)
  • Gather SSH private key for root access
  • Locate the most recent backup from /backups/ on the live VPS (if accessible)
  • Ensure ~/.config/gitea/token is available locally for API operations
  • Confirm DNS will be updated to new VPS IP

TOTAL ESTIMATED RECOVERY TIME: 4-8 hours (depending on backup availability and DNS propagation)

Step 1 — Provision Replacement VPS

# Using DigitalOcean (current provider)
doctl compute droplet create \
  --image debian-12-x64 \
  --region nyc1 \
  --size s-2vcpu-4gb \
  --ssh-keys $(doctl compute ssh-key list --format ID --no-header | head -1) \
  forge-recovery-$(date +%Y%m%d)

Record the new VPS IP address.
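
One way to capture the new IP, assuming doctl remains the provider CLI in use:

# List droplets with their public IPv4 addresses and note the recovery droplet's IP
doctl compute droplet list --format Name,PublicIPv4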

Alternatively, reuse an existing standby VPS if available (the cold-standby option listed as Mitigation #3 under Post-Mortem Actions).

Step 2 — Install Base Dependencies

SSH into new VPS as root and run:

# Update system
apt update && apt upgrade -y

# Install required packages
apt install -y python3 python3-pip python3-venv git curl wget jq sqlite3

# Install Docker (for Matrix Conduit if applicable)
curl -fsSL https://get.docker.com | sh
usermod -aG docker $USER

# Create directory structure
mkdir -p /root/wizards/{ezra,allegro,bezalel}
mkdir -p /root/.hermes/{bin,skins,playbooks,memories,cron}
mkdir -p /var/log/ansible
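
Before moving on, a quick sanity check that the required tooling landed (a sketch; adjust the command list as needed):

# Report any missing binaries from the packages installed above
for cmd in python3 git curl wget jq sqlite3 docker; do
  command -v "$cmd" >/dev/null || echo "MISSING: $cmd"
done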

Time: 15 minutes

Step 3 — Deploy timmy-config Repository

# Clone timmy-config
cd /root
git clone https://forge.alexanderwhitestone.com/Timmy_Foundation/timmy-config.git
cd timmy-config

# Run deploy script to overlay configuration
./deploy.sh

This creates the canonical ~/.hermes/ configuration tree from version control.
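
A quick spot-check that the overlay landed where expected (directory names are taken from the layout created in Step 2):

ls -la /root/.hermes/bin /root/.hermes/playbooks /root/.hermes/cron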

Time: 5 minutes

Step 4 — Restore Gitea Data from Backup

First, determine the backup format from the live VPS (if accessible):

# On LIVE VPS (if it's still reachable)
ssh root@143.198.27.163 "ls -la /backups/gitea/ 2>/dev/null || ls -la /root/backups/ 2>/dev/null || true"

Expected locations:

  • /backups/gitea/ (standard)
  • /var/backups/gitea/
  • /root/backups/
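
Given the candidate locations above, the newest backup can be located in one pass (a sketch; the filename pattern is an assumption):

# Print the most recently modified Gitea backup found in any of the candidate paths
find /backups /var/backups /root/backups -name 'gitea*.db' -printf '%T@ %p\n' 2>/dev/null | sort -n | tail -1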

If you have a SQLite backup file (gitea.db or gitea-YYYYMMDD.db):

# On NEW VPS
# Stop Gitea if it's running (service will fail until data is restored)
systemctl stop gitea 2>/dev/null || true

# Create data directory if needed
mkdir -p /var/lib/gitea/data

# Restore the database
cp /path/to/backup/gitea.db /var/lib/gitea/data/gitea.db
chown gitea:gitea /var/lib/gitea/data/gitea.db
chmod 600 /var/lib/gitea/data/gitea.db

# Restore custom templates/public if those were backed up
if [ -d "/backups/gitea/custom" ]; then
  cp -r /backups/gitea/custom/* /var/lib/gitea/custom/
  chown -R gitea:gitea /var/lib/gitea/custom
fi
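
Before starting the service, it is worth confirming the restored database is not corrupt:

# Integrity check of the restored SQLite database; expected output: ok
sqlite3 /var/lib/gitea/data/gitea.db 'PRAGMA integrity_check;'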

Start Gitea:

systemctl start gitea
sleep 5
systemctl status gitea

Verify Gitea is healthy:

curl -s -o /dev/null -w "%{http_code}" https://forge.alexanderwhitestone.com/api/v1/version
# Expected: 200

Time: 20 minutes

Step 5 — Restore FastAPI / Backend Services

The FastAPI backend configuration lives in timmy-config/config.yaml. Since it's version-controlled, just verify:

# Check config
cat ~/.hermes/config.yaml | grep -A5 'fastapi\|backend\|port'

# Start the backend service (if managed via systemd)
systemctl start hermes-backend 2>/dev/null || true

# Verify health
curl -s http://localhost:8645/health || echo "Backend endpoint may differ"

If the backend uses a separate systemd service, it should be defined in the Ansible roles. Deploy via Ansible (Step 6).
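
To discover which Hermes-related units are actually defined on the box (the hermes-backend name above is this runbook's working assumption):

systemctl list-unit-files | grep -i hermes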

Time: 10 minutes

Step 6 — Deploy Wizard Houses via Ansible

cd /root/timmy-config/ansible
ansible-playbook -i inventory/hosts.yml playbooks/site.yml

This will:

  • Create wizard directories
  • Deploy configuration
  • Set up cron jobs
  • Start systemd services
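
To preview what Ansible would change before (re)applying, check mode can be used, assuming the roles tolerate it; some tasks may be skipped or reported imprecisely in a dry run:

ansible-playbook -i inventory/hosts.yml playbooks/site.yml --check --diff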

If Ansible fails because SSH keys aren't set up on the new VPS yet:

# On LOCAL machine (where you have SSH access to the new VPS)
cat ~/.ssh/id_rsa.pub | ssh root@<NEW_VPS_IP> "mkdir -p ~/.ssh && cat >> ~/.ssh/authorized_keys"

# Update ansible/inventory/hosts.yml with new IP for the `forge` and `ezra` hosts
# Then re-run ansible

Time: 15 minutes

Step 7 — Verify Fleet Health

# Check all wizards
systemctl status hermes-{ezra,allegro,bezalel} 2>/dev/null

# Check hermes state
ps aux | grep hermes

# Check cron jobs
crontab -l

# Check logs for errors
tail -50 /var/log/ansible/timmy-fleet.log
tail -50 /root/.hermes/logs/sprint/*.log 2>/dev/null || true

Step 8 — Update DNS

If the new VPS has a different IP than the old one, update DNS A records:

Service          Hostname                        Current IP
Gitea / Forge    forge.alexanderwhitestone.com   143.198.27.163
(future) Nexus   nexus.timmytime.net             (TBD)

Action: Update the A record for forge.alexanderwhitestone.com to point to <NEW_VPS_IP>.

TTL: 300 seconds (5 min) — propagation complete in ~15 min
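
To watch propagation from the recovery machine (dig ships in the dnsutils package on Debian):

# Poll the A record every 30 seconds until it returns <NEW_VPS_IP>
watch -n 30 "dig +short forge.alexanderwhitestone.com A"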

Time: 5 minutes + DNS propagation

Step 9 — Post-Recovery Validation

Once DNS has propagated (allow ~15 minutes), run:

# 1. Gitea accessibility
curl -s -I https://forge.alexanderwhitestone.com/api/v1/version | head -1
# Expected: HTTP/2 200

# 2. Issue creation test
# Use gitea-api.sh to file a test issue
gitea-api.sh issue create timmy-config "Recovery Test" "Automated post-recovery validation — can be closed."
# Expected: Issue #<N> created

# 3. Wizard heartbeat check
# Check latest fleet health logs
tail -30 ~/.local/timmy/fleet-health/*.json 2>/dev/null | head -1

# 4. Herald dispatch test
# File a simple issue and watch dispatch

Close the test issue:

gitea-api.sh issue close timmy-config <TEST_ISSUE_NUM>
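
For recovery drills, the non-interactive checks above can be bundled into one script. This is a sketch: the backend port and wizard unit names are taken from earlier in this runbook and may differ on the rebuilt host.

#!/usr/bin/env bash
# Post-recovery smoke check: forge API, backend health endpoint, wizard services
set -u
fail=0

# Gitea API must answer 200 through the public hostname
code=$(curl -s -o /dev/null -w "%{http_code}" https://forge.alexanderwhitestone.com/api/v1/version)
[ "$code" = "200" ] || { echo "Gitea API returned $code"; fail=1; }

# Backend health endpoint (port assumed from Step 5)
curl -sf http://localhost:8645/health >/dev/null || { echo "backend health endpoint unreachable"; fail=1; }

# Wizard services should be active
for unit in hermes-ezra hermes-allegro hermes-bezalel; do
  systemctl is-active --quiet "$unit" || { echo "$unit not active"; fail=1; }
done

exit $fail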

Rollback Plan

If recovery fails or the original VPS comes back online:

  1. Pause DNS — point to a static "maintenance" page or 502
  2. Shut down the new VPS with shutdown -h now (preserve disks for forensics)
  3. Revert to original VPS once it's confirmed healthy
  4. Document the failure — add an ADR to docs/adr/

Post-Mortem Actions

After successful recovery:

  1. Document the root cause of the VPS loss
  2. Verify backup integrity — ensure backup_databases.sh actually works (see the verification sketch after this list)
  3. Consider Mitigation #3 — Cold standby VPS with automated sync
  4. Consider Mitigation #4 — Mirror all repos to GitHub as secondary
  5. Update this runbook with any corrections discovered during recovery
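
For item 2, a minimal verification drill might look like the following; the backup path mirrors the assumptions earlier in this runbook:

# Pick the newest Gitea backup and confirm it opens cleanly as SQLite
latest=$(ls -t /backups/gitea/gitea*.db 2>/dev/null | head -1)
[ -n "$latest" ] || { echo "no backup found under /backups/gitea/"; exit 1; }
sqlite3 "$latest" 'PRAGMA integrity_check;'   # expected output: ok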

Related Issues & Future Work

  • #481 — Single-node VPS SPOF audit (this document)
  • Future: Automated backup verification
  • Future: Offsite backup sync (S3, remote)
  • Future: Hot standby VPS with keepalived/HAProxy

Last updated: 2026-04-26
Maintained by: Timmy Foundation Infrastructure Team
Review cadence: After each recovery drill or actual recovery