timmy-config/docs/VPS_RECOVERY_RUNBOOK.md
Alexander Payne d352f4d05b
docs(#481): add VPS recovery runbook for single-point-of-failure mitigation
Create docs/VPS_RECOVERY_RUNBOOK.md documenting:
- Current backup infrastructure (daily backup_databases.sh)
- Step-by-step recovery from VPS loss
- Time estimates (4-8 hours)
- DNS update procedures
- Post-recovery validation checklist

This is the smallest concrete fix addressing the SPOF audit.
It provides actionable recovery instructions while remaining lightweight.

Refs #481
2026-04-26 17:24:22 -04:00

# VPS Recovery Runbook
# Issue #481 — Single-node VPS Single Point of Failure
# Created: STEP35 free burn | 2026-04-26
## Risk Statement
The Hermes VPS (143.198.27.163) hosts Gitea, FastAPI backend, and the Ezra/Allegro/Bezalel wizard houses. This is a single point of failure — if the VPS is lost, the entire forge and agent coordination layer is offline.
**Risk Level:** High
---
## Current Mitigations (As-Built)
### 1. Daily Database Backups
There is a daily backup job running on the VPS:
```
30 3 * * * /root/wizards/bezalel/backup_databases.sh
```
**What it backs up:**
- Gitea SQLite databases (`/var/lib/gitea/data/gitea.db` and related)
- Wizard configuration databases (if any)
- Retained for 7 days (estimated — verify script)
**Where backups are stored:** (TBD — need to inspect `backup_databases.sh` on live VPS)
**Important:** This script is NOT version-controlled in timmy-config. It exists only on the live VPS.
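Until the real script is recovered and committed, the following is a plausible reconstruction based only on the description above — it is an assumption, not the actual `backup_databases.sh`, and the paths, retention window, and use of SQLite's online backup API must all be verified against the live VPS:

```bash
#!/usr/bin/env bash
# HYPOTHETICAL reconstruction of backup_databases.sh — NOT the real script.
# Paths, retention, and backup method are assumptions; verify on the live VPS.
backup_gitea_db() {
  local src="$1" dir="$2" keep_days="${3:-7}"
  mkdir -p "$dir"
  local out="$dir/gitea-$(date +%Y%m%d).db"
  if command -v sqlite3 >/dev/null 2>&1; then
    # Online-safe copy via SQLite's backup API
    sqlite3 "$src" ".backup '$out'"
  else
    cp "$src" "$out"
  fi
  # Drop backups older than the retention window
  find "$dir" -name 'gitea-*.db' -mtime +"$keep_days" -delete
  echo "backup written: $out"
}
# backup_gitea_db /var/lib/gitea/data/gitea.db /backups/gitea 7
```

Capturing the real script into `timmy-config` (and diffing it against a sketch like this) would close the version-control gap flagged above.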
### 2. Version-Controlled Configuration
All operational configuration is version-controlled in `Timmy_Foundation/timmy-config`:
- `config.yaml` — Hermes harness configuration
- `playbooks/` — Agent playbooks
- `memories/` — Persistent memory YAML
- `cron/` — Cron job definitions (source of truth)
- `bin/` — Operational helper scripts
- `ansible/` — Infrastructure-as-code playbooks
### 3. Ansible Deployment
Wizard houses are deployed via Ansible from any machine with SSH access:
```bash
cd ansible
ansible-playbook -i inventory/hosts.yml playbooks/site.yml --limit ezra
```
The VPS itself is disposable — wizard state is rebuilt from configuration + data backups.
---
## Recovery Procedure
### Pre-Recovery Checklist
- [ ] Identify the failure scope (VPS destroyed vs. service outage)
- [ ] Obtain a replacement VPS (same region, preferably DigitalOcean or equivalent)
- [ ] Gather SSH private key for root access
- [ ] Locate the most recent backup from `/backups/` on the live VPS (if accessible)
- [ ] Ensure `~/.config/gitea/token` is available locally for API operations
- [ ] Confirm DNS will be updated to new VPS IP
**TOTAL ESTIMATED RECOVERY TIME: 4-8 hours** (depending on backup availability and DNS propagation)
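The file-based items on the checklist can be partially automated. This is a hypothetical pre-flight helper (not an existing timmy-config script); the two paths are the ones referenced in this runbook:

```bash
#!/usr/bin/env bash
# Hypothetical pre-flight check: report which recovery prerequisites exist.
# Takes an optional base directory (defaults to $HOME) so it can be tested.
preflight() {
  local home="${1:-$HOME}" missing=0 f
  for f in "$home/.ssh/id_rsa" "$home/.config/gitea/token"; do
    if [ -f "$f" ]; then
      echo "OK: $f"
    else
      echo "MISSING: $f"
      missing=$((missing + 1))
    fi
  done
  echo "$missing prerequisite(s) missing"
  [ "$missing" -eq 0 ]
}
# preflight
```

Failure scope, backup location, and DNS access still have to be confirmed by hand.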
### Step 1 — Provision Replacement VPS
```bash
# Using DigitalOcean (current provider)
doctl compute droplet create \
  --image debian-12-x64 \
  --region nyc1 \
  --size s-2vcpu-4gb \
  --ssh-keys "$(doctl compute ssh-key list --format ID --no-header | head -1)" \
  forge-recovery-$(date +%Y%m%d)
# Note: this grabs the first SSH key registered in the DO account;
# pass a specific key ID if the account has several.
```
**Record the new VPS IP address.**
Alternatively, reuse an existing standby VPS if available (Mitigation #3).
### Step 2 — Install Base Dependencies
SSH into new VPS as root and run:
```bash
# Update system
apt update && apt upgrade -y
# Install required packages
apt install -y python3 python3-pip python3-venv git curl wget jq sqlite3
# Install Docker (for Matrix Conduit if applicable)
curl -fsSL https://get.docker.com | sh
# (root already has Docker access; if services run as a non-root user,
# grant it access with: usermod -aG docker <service_user>)
# Create directory structure
mkdir -p /root/wizards/{ezra,allegro,bezalel}
mkdir -p /root/.hermes/{bin,skins,playbooks,memories,cron}
mkdir -p /var/log/ansible
```
**Time:** 15 minutes
### Step 3 — Deploy timmy-config Repository
```bash
# Clone timmy-config
cd /root
git clone https://forge.alexanderwhitestone.com/Timmy_Foundation/timmy-config.git
cd timmy-config
# Run deploy script to overlay configuration
./deploy.sh
```
This creates the canonical `~/.hermes/` configuration tree from version control.
**Time:** 5 minutes
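The overlay can be spot-checked by diffing the repo directories against the deployed tree. This sketch assumes `deploy.sh` copies these directories verbatim — confirm against the script itself, and adjust if it templates any files:

```bash
#!/usr/bin/env bash
# Hypothetical check: compare repo source dirs to the deployed ~/.hermes tree.
# Assumes deploy.sh copies each directory as-is; adjust if it templates files.
check_overlay() {
  local src="$1" dst="$2" d rc=0
  for d in bin playbooks memories cron; do
    if ! diff -rq "$src/$d" "$dst/$d" >/dev/null 2>&1; then
      echo "MISMATCH: $src/$d vs $dst/$d"
      rc=1
    fi
  done
  return $rc
}
# check_overlay /root/timmy-config "$HOME/.hermes"
```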
### Step 4 — Restore Gitea Data from Backup
**First, determine the backup format from the live VPS (if accessible):**
```bash
# On LIVE VPS (if it's still reachable)
ssh root@143.198.27.163 "ls -la /backups/gitea/ 2>/dev/null || ls -la /root/backups/ 2>/dev/null || true"
```
**Expected locations:**
- `/backups/gitea/` (standard)
- `/var/backups/gitea/`
- `/root/backups/`
**If you have a SQLite backup file (`gitea.db` or `gitea-YYYYMMDD.db`):**
```bash
# On NEW VPS
# Stop Gitea if it's running (service will fail until data is restored)
systemctl stop gitea 2>/dev/null || true
# Create data directory if needed
mkdir -p /var/lib/gitea/data
# Restore the database
cp /path/to/backup/gitea.db /var/lib/gitea/data/gitea.db
chown gitea:gitea /var/lib/gitea/data/gitea.db
chmod 600 /var/lib/gitea/data/gitea.db
# Restore custom templates/public if those were backed up
if [ -d "/backups/gitea/custom" ]; then
cp -r /backups/gitea/custom/* /var/lib/gitea/custom/
chown -R gitea:gitea /var/lib/gitea/custom
fi
```
**Start Gitea:**
```bash
systemctl start gitea
sleep 5
systemctl status gitea
```
**Verify Gitea is healthy:**
```bash
# Until DNS is updated (Step 8), the public hostname still resolves to the
# OLD VPS, so test the local instance directly (Gitea's default HTTP port
# is 3000; the service may be configured differently):
curl -s -o /dev/null -w "%{http_code}" http://localhost:3000/api/v1/version
# Expected: 200
```
**Time:** 20 minutes
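If Gitea needs a moment to come up after the restore, the health check above can be wrapped in a retry loop. This is a hypothetical helper, not an existing timmy-config script; attempts and delay are adjustable:

```bash
#!/usr/bin/env bash
# Hypothetical helper: retry a command until it succeeds or attempts run out.
retry() {
  local tries="$1" delay="$2"; shift 2
  local i
  for ((i = 1; i <= tries; i++)); do
    if "$@"; then
      return 0
    fi
    if (( i < tries )); then sleep "$delay"; fi
  done
  echo "gave up after $tries attempts: $*" >&2
  return 1
}
# Wait up to ~2 minutes for the health endpoint to answer:
# retry 12 10 curl -sf -o /dev/null https://forge.alexanderwhitestone.com/api/v1/version
```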
### Step 5 — Restore FastAPI / Backend Services
The FastAPI backend configuration lives in `timmy-config/config.yaml`. Since it's version-controlled, just verify:
```bash
# Check config
grep -A5 -E 'fastapi|backend|port' ~/.hermes/config.yaml
# Start the backend service (if managed via systemd)
systemctl start hermes-backend 2>/dev/null || true
# Verify health
curl -s http://localhost:8645/health || echo "Backend endpoint may differ"
```
If the backend uses a separate systemd service, it should be defined in ansible roles. Deploy via ansible (Step 7).
**Time:** 10 minutes
### Step 6 — Deploy Wizard Houses via Ansible
```bash
cd /root/timmy-config/ansible
ansible-playbook -i inventory/hosts.yml playbooks/site.yml
```
This will:
- Create wizard directories
- Deploy configuration
- Set up cron jobs
- Start systemd services
**If Ansible fails because SSH keys aren't set up on the new VPS yet:**
```bash
# On LOCAL machine (where you have SSH access to the new VPS)
cat ~/.ssh/id_rsa.pub | ssh root@<NEW_VPS_IP> "mkdir -p ~/.ssh && cat >> ~/.ssh/authorized_keys"
# Update ansible/inventory/hosts.yml with new IP for the `forge` and `ezra` hosts
# Then re-run ansible
```
**Time:** 15 minutes
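The inventory update mentioned above might look like the fragment below. This layout is illustrative only — the real group names, host aliases, and keys in `ansible/inventory/hosts.yml` may differ and should be checked in the repo:

```yaml
# ansible/inventory/hosts.yml (illustrative layout; real keys may differ)
all:
  children:
    forge:
      hosts:
        hermes:
          ansible_host: <NEW_VPS_IP>   # was 143.198.27.163
          ansible_user: root
```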
### Step 7 — Verify Fleet Health
```bash
# Check all wizards
systemctl status hermes-{ezra,allegro,bezalel} 2>/dev/null
# Check hermes state
ps aux | grep hermes
# Check cron jobs
crontab -l
# Check logs for errors
tail -50 /var/log/ansible/timmy-fleet.log
tail -50 /root/.hermes/logs/sprint/*.log 2>/dev/null || true
```
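The log checks above can be condensed into a small helper that flags files containing errors. This is a hypothetical convenience function (the log paths are the ones listed above; the `ERROR` marker is an assumption about the log format):

```bash
#!/usr/bin/env bash
# Hypothetical helper: count ERROR lines across a set of log files.
# Prints one line per file and returns nonzero if any errors were found.
scan_logs() {
  local rc=0 f n
  for f in "$@"; do
    [ -f "$f" ] || continue
    n="$(grep -c 'ERROR' "$f" || true)"
    echo "$f: $n error line(s)"
    [ "$n" -gt 0 ] && rc=1
  done
  return $rc
}
# scan_logs /var/log/ansible/timmy-fleet.log /root/.hermes/logs/sprint/*.log
```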
### Step 8 — Update DNS
If the new VPS has a different IP than the old one, update DNS A records:
| Service | Hostname | Current IP |
|---------------------------|-----------------------------------|------------------|
| Gitea / Forge | forge.alexanderwhitestone.com | 143.198.27.163 |
| (future) Nexus | nexus.timmytime.net | (TBD) |
**Action:** Update the A record for `forge.alexanderwhitestone.com` to point to `<NEW_VPS_IP>`.
**TTL:** 300 seconds (5 min) — propagation complete in ~15 min
**Time:** 5 minutes + DNS propagation
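Propagation can be confirmed with a small polling helper rather than guessing at the 15-minute window. A sketch, assuming `dig` is installed (dnsutils/bind-utils package) and `NEW_VPS_IP` is set to the replacement address:

```bash
#!/usr/bin/env bash
# Hypothetical helper: poll until the A record matches the new VPS IP.
wait_for_dns() {
  local host="$1" want="$2" tries="${3:-30}"
  local i got
  for ((i = 1; i <= tries; i++)); do
    # tail -n 1 keeps the final A record if a CNAME chain is returned
    got="$(dig +short "$host" | tail -n 1)"
    if [ "$got" = "$want" ]; then
      echo "DNS updated: $host -> $got"
      return 0
    fi
    if (( i < tries )); then sleep 30; fi
  done
  echo "DNS still stale for $host (last saw: ${got:-nothing})" >&2
  return 1
}
# wait_for_dns forge.alexanderwhitestone.com "$NEW_VPS_IP"
```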
### Step 9 — Post-Recovery Validation
Once DNS has propagated (wait 15 min, then):
```bash
# 1. Gitea accessibility
curl -s -I https://forge.alexanderwhitestone.com/api/v1/version | head -1
# Expected: HTTP/2 200
# 2. Issue creation test
# Use gitea-api.sh to file a test issue
gitea-api.sh issue create timmy-config "Recovery Test" "Automated post-recovery validation — can be closed."
# Expected: Issue #<N> created
# 3. Wizard heartbeat check
# Check latest fleet health logs
tail -30 ~/.local/timmy/fleet-health/*.json 2>/dev/null | head -1
# 4. Herald dispatch test
# File a simple issue and watch dispatch
```
**Close the test issue:**
```bash
gitea-api.sh issue close timmy-config <TEST_ISSUE_NUM>
```
---
## Rollback Plan
If recovery fails or the original VPS comes back online:
1. **Pause DNS** — point to a static "maintenance" page or 502
2. **Shut down the new VPS** with `shutdown -h now` (preserve disks for forensics)
3. **Revert to original VPS** once it's confirmed healthy
4. **Document the failure** — add an ADR to `docs/adr/`
---
## Post-Mortem Actions
After successful recovery:
1. Document the root cause of the VPS loss
2. Verify backup integrity — ensure `backup_databases.sh` actually works
3. Consider **Mitigation #3** — Cold standby VPS with automated sync
4. Consider **Mitigation #4** — Mirror all repos to GitHub as secondary
5. Update this runbook with any corrections discovered during recovery
---
## Related Issues
- #481 — Single-node VPS SPOF audit (this document)
- Future: Automated backup verification
- Future: Offsite backup sync (S3, remote)
- Future: Hot standby VPS with keepalived/HAProxy
---
**Last updated:** 2026-04-26
**Maintained by:** Timmy Foundation Infrastructure Team
**Review cadence:** After each recovery drill or actual recovery