# VPS Recovery Runbook

**Issue:** #481 — Single-node VPS Single Point of Failure
**Created:** STEP35 free burn, 2026-04-26
## Risk Statement

The Hermes VPS (143.198.27.163) hosts Gitea, the FastAPI backend, and the Ezra/Allegro/Bezalel wizard houses. This is a single point of failure: if the VPS is lost, the entire forge and agent coordination layer goes offline.

**Risk Level:** High

---
## Current Mitigations (As-Built)

### 1. Daily Database Backups

A daily backup job runs on the VPS:

```
30 3 * * * /root/wizards/bezalel/backup_databases.sh
```

**What it backs up:**

- Gitea SQLite databases (`/var/lib/gitea/data/gitea.db` and related)
- Wizard configuration databases (if any)
- Retention: 7 days (estimated — verify against the script)

**Where backups are stored:** TBD — inspect `backup_databases.sh` on the live VPS.

**Important:** This script is NOT version-controlled in timmy-config. It exists only on the live VPS.
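Until the live script is recovered into version control, a minimal sketch of what such a job might look like (the paths, filename pattern, and 7-day retention are all assumptions until `backup_databases.sh` is inspected):

```shell
#!/usr/bin/env bash
# Hypothetical sketch of backup_databases.sh -- the live script is authoritative.
set -euo pipefail

# backup_gitea SRC_DB BACKUP_DIR RETENTION_DAYS
backup_gitea() {
  local src_db="$1" backup_dir="$2" retention_days="$3"
  mkdir -p "$backup_dir"
  # Plain cp is only safe if Gitea is stopped or quiescent; prefer
  # `sqlite3 "$src_db" ".backup ..."` for a consistent online snapshot.
  cp "$src_db" "$backup_dir/gitea-$(date +%Y%m%d).db"
  # Prune snapshots older than the retention window
  find "$backup_dir" -name 'gitea-*.db' -mtime +"$retention_days" -delete
}
```

From cron this might be invoked as `backup_gitea /var/lib/gitea/data/gitea.db /backups/gitea 7`; verify both paths against the real script before trusting them.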
### 2. Version-Controlled Configuration

All operational configuration is version-controlled in `Timmy_Foundation/timmy-config`:

- `config.yaml` — Hermes harness configuration
- `playbooks/` — Agent playbooks
- `memories/` — Persistent memory YAML
- `cron/` — Cron job definitions (source of truth)
- `bin/` — Operational helper scripts
- `ansible/` — Infrastructure-as-code playbooks

### 3. Ansible Deployment

Wizard houses are deployed via Ansible from any machine with SSH access:

```bash
cd ansible
ansible-playbook -i inventory/hosts.yml playbooks/site.yml --limit ezra
```

The VPS itself is disposable — wizard state is rebuilt from configuration plus data backups.

---
## Recovery Procedure

### Pre-Recovery Checklist

- [ ] Identify the failure scope (VPS destroyed vs. service outage)
- [ ] Obtain a replacement VPS (same region, preferably DigitalOcean or equivalent)
- [ ] Gather the SSH private key for root access
- [ ] Locate the most recent backup from `/backups/` on the live VPS (if accessible)
- [ ] Ensure `~/.config/gitea/token` is available locally for API operations
- [ ] Confirm DNS will be updated to the new VPS IP

**Total estimated recovery time: 4–8 hours** (depending on backup availability and DNS propagation)
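The local-file items in the checklist can be automated with a small gate; a hedged sketch (the key and token paths are the ones named above, the `preflight` helper name is ours):

```shell
# preflight FILE... -- report which required local files are missing; returns
# nonzero if any are absent, so it can gate the rest of the recovery run.
preflight() {
  local missing=0 f
  for f in "$@"; do
    if [ ! -e "$f" ]; then
      echo "MISSING: $f"
      missing=1
    fi
  done
  return "$missing"
}
```

Example: `preflight ~/.ssh/id_rsa ~/.config/gitea/token || exit 1` stops the recovery run before any remote work if a prerequisite is absent.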
### Step 1 — Provision Replacement VPS

```bash
# Using DigitalOcean (current provider)
doctl compute droplet create \
  --image debian-12-x64 \
  --region nyc1 \
  --size s-2vcpu-4gb \
  --ssh-keys "$(doctl compute ssh-key list --format ID --no-header | head -1)" \
  forge-recovery-$(date +%Y%m%d)
```

**Record the new VPS IP address.**

Alternatively, reuse an existing standby VPS if available (see the proposed cold standby under Post-Mortem Actions).
### Step 2 — Install Base Dependencies

SSH into the new VPS as root and run:

```bash
# Update system
apt update && apt upgrade -y

# Install required packages
apt install -y python3 python3-pip python3-venv git curl wget jq sqlite3

# Install Docker (for Matrix Conduit, if applicable)
curl -fsSL https://get.docker.com | sh
usermod -aG docker $USER

# Create directory structure
mkdir -p /root/wizards/{ezra,allegro,bezalel}
mkdir -p /root/.hermes/{bin,skins,playbooks,memories,cron}
mkdir -p /var/log/ansible
```

**Time:** 15 minutes
### Step 3 — Deploy timmy-config Repository

```bash
# Clone timmy-config
cd /root
git clone https://forge.alexanderwhitestone.com/Timmy_Foundation/timmy-config.git
cd timmy-config

# Run the deploy script to overlay configuration
./deploy.sh
```

This creates the canonical `~/.hermes/` configuration tree from version control.
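`deploy.sh` itself is the source of truth; as an illustration, its overlay step may amount to something like the following (the directory list and destination are inferred from the repo layout described under Mitigation #2, and the `overlay_config` helper is hypothetical):

```shell
# overlay_config REPO_DIR HERMES_DIR -- copy version-controlled config into the
# canonical runtime tree. Hypothetical reconstruction of deploy.sh's core step.
overlay_config() {
  local repo_dir="$1" hermes_dir="$2" d
  mkdir -p "$hermes_dir"
  for d in playbooks memories cron bin; do
    if [ -d "$repo_dir/$d" ]; then
      cp -r "$repo_dir/$d" "$hermes_dir/"
    fi
  done
  cp "$repo_dir/config.yaml" "$hermes_dir/config.yaml"
}
```

Read the real script before relying on this shape; it may also template files or set permissions.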
**Time:** 5 minutes
### Step 4 — Restore Gitea Data from Backup

**First, determine the backup format from the live VPS (if accessible):**

```bash
# On the LIVE VPS (if it's still reachable)
ssh root@143.198.27.163 "ls -la /backups/gitea/ 2>/dev/null || ls -la /root/backups/ 2>/dev/null || true"
```

**Expected locations:**

- `/backups/gitea/` (standard)
- `/var/backups/gitea/`
- `/root/backups/`

**If you have a SQLite backup file (`gitea.db` or `gitea-YYYYMMDD.db`):**

```bash
# On the NEW VPS
# Stop Gitea if it's running (the service will fail until data is restored)
systemctl stop gitea 2>/dev/null || true

# Create the data directory if needed
mkdir -p /var/lib/gitea/data

# Restore the database
cp /path/to/backup/gitea.db /var/lib/gitea/data/gitea.db
chown gitea:gitea /var/lib/gitea/data/gitea.db
chmod 600 /var/lib/gitea/data/gitea.db

# Restore custom templates/public assets if they were backed up
if [ -d "/backups/gitea/custom" ]; then
  mkdir -p /var/lib/gitea/custom
  cp -r /backups/gitea/custom/* /var/lib/gitea/custom/
  chown -R gitea:gitea /var/lib/gitea/custom
fi
```
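Before starting Gitea, it is worth confirming the restored file is a healthy SQLite database. `PRAGMA integrity_check` is standard SQLite; the `check_db` wrapper name is ours:

```shell
# check_db DB_PATH -- prints "ok" iff the SQLite file passes its own integrity check
check_db() {
  sqlite3 "$1" "PRAGMA integrity_check;"
}
```

`check_db /var/lib/gitea/data/gitea.db` should print `ok`; anything else means the snapshot is corrupt and an older one should be tried.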
**Start Gitea:**

```bash
systemctl start gitea
sleep 5
systemctl status gitea
```

**Verify Gitea is healthy:**

```bash
curl -s -o /dev/null -w "%{http_code}" https://forge.alexanderwhitestone.com/api/v1/version
# Expected: 200
```

**Time:** 20 minutes
### Step 5 — Restore FastAPI / Backend Services

The FastAPI backend configuration lives in `timmy-config/config.yaml`. Since it is version-controlled, just verify it:

```bash
# Check config
grep -A5 'fastapi\|backend\|port' ~/.hermes/config.yaml

# Start the backend service (if managed via systemd)
systemctl start hermes-backend 2>/dev/null || true

# Verify health
curl -s http://localhost:8645/health || echo "Backend endpoint may differ"
```

If the backend uses a separate systemd service, it should be defined in the Ansible roles; deploy it via Ansible (Step 6).

**Time:** 10 minutes
### Step 6 — Deploy Wizard Houses via Ansible

```bash
cd /root/timmy-config/ansible
ansible-playbook -i inventory/hosts.yml playbooks/site.yml
```

This will:

- Create wizard directories
- Deploy configuration
- Set up cron jobs
- Start systemd services
**If Ansible fails because SSH keys aren't set up on the new VPS yet:**

```bash
# On the LOCAL machine (where you have SSH access to the new VPS)
cat ~/.ssh/id_rsa.pub | ssh root@<NEW_VPS_IP> "mkdir -p ~/.ssh && cat >> ~/.ssh/authorized_keys"

# Update ansible/inventory/hosts.yml with the new IP for the `forge` and `ezra` hosts,
# then re-run the playbook
```
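The inventory edit above might look like this; the host names and structure are assumptions — the real `inventory/hosts.yml` in the repo is authoritative:

```yaml
# Hypothetical shape of ansible/inventory/hosts.yml -- verify against the real file
all:
  hosts:
    forge:
      ansible_host: <NEW_VPS_IP>   # was 143.198.27.163
      ansible_user: root
    ezra:
      ansible_host: <NEW_VPS_IP>
      ansible_user: root
```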
**Time:** 15 minutes
### Step 7 — Verify Fleet Health

```bash
# Check all wizards
systemctl status hermes-{ezra,allegro,bezalel} 2>/dev/null

# Check hermes state
ps aux | grep hermes

# Check cron jobs
crontab -l

# Check logs for errors
tail -50 /var/log/ansible/timmy-fleet.log
tail -50 /root/.hermes/logs/sprint/*.log 2>/dev/null || true
```
### Step 8 — Update DNS

If the new VPS has a different IP than the old one, update the DNS A records:

| Service | Hostname | Current IP |
|---|---|---|
| Gitea / Forge | forge.alexanderwhitestone.com | 143.198.27.163 |
| (future) Nexus | nexus.timmytime.net | (TBD) |

**Action:** Update the A record for `forge.alexanderwhitestone.com` to point to `<NEW_VPS_IP>`.

**TTL:** 300 seconds (5 min) — propagation typically completes in ~15 min

**Time:** 5 minutes + DNS propagation
### Step 9 — Post-Recovery Validation

Once DNS has propagated (wait ~15 minutes), run:

```bash
# 1. Gitea accessibility
curl -s -I https://forge.alexanderwhitestone.com/api/v1/version | head -1
# Expected: HTTP/2 200

# 2. Issue creation test -- use gitea-api.sh to file a test issue
gitea-api.sh issue create timmy-config "Recovery Test" "Automated post-recovery validation — can be closed."
# Expected: Issue #<N> created

# 3. Wizard heartbeat check -- inspect the latest fleet health logs
tail -30 ~/.local/timmy/fleet-health/*.json 2>/dev/null | head -1

# 4. Herald dispatch test -- file a simple issue and watch dispatch
```

**Close the test issue:**

```bash
gitea-api.sh issue close timmy-config <TEST_ISSUE_NUM>
```

---
## Rollback Plan

If recovery fails or the original VPS comes back online:

1. **Pause DNS** — point to a static "maintenance" page or a 502
2. **Shut down the new VPS** — `shutdown -h now` (preserve disks for forensics)
3. **Revert to the original VPS** once it's confirmed healthy
4. **Document the failure** — add an ADR to `docs/adr/`

---
## Post-Mortem Actions

After successful recovery:

1. Document the root cause of the VPS loss
2. Verify backup integrity — ensure `backup_databases.sh` actually works
3. Consider **Mitigation #3** — a cold standby VPS with automated sync
4. Consider **Mitigation #4** — mirror all repos to GitHub as a secondary
5. Update this runbook with any corrections discovered during recovery
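Item 2 can be made routine with a restore-test: pull the newest snapshot into a scratch directory and check it, rather than trusting that the cron job works. A sketch, where the backup directory and `gitea-*.db` filename pattern are the same assumptions used earlier in this runbook:

```shell
# verify_latest_backup BACKUP_DIR -- restore-test the newest gitea-*.db snapshot
verify_latest_backup() {
  local backup_dir="$1" latest tmp
  latest=$(ls -1t "$backup_dir"/gitea-*.db 2>/dev/null | head -1)
  if [ -z "$latest" ]; then
    echo "no backups found in $backup_dir" >&2
    return 1
  fi
  tmp=$(mktemp -d)
  cp "$latest" "$tmp/restore-test.db"
  # A snapshot that fails integrity_check would not survive a real restore
  [ "$(sqlite3 "$tmp/restore-test.db" 'PRAGMA integrity_check;')" = "ok" ]
}
```

Running `verify_latest_backup /backups/gitea` weekly (e.g. from cron) would catch silent backup rot long before a real recovery.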
---

## Related Issues

- #481 — Single-node VPS SPOF audit (this document)
- Future: Automated backup verification
- Future: Offsite backup sync (S3, remote)
- Future: Hot standby VPS with keepalived/HAProxy

---

**Last updated:** 2026-04-26
**Maintained by:** Timmy Foundation Infrastructure Team
**Review cadence:** After each recovery drill or actual recovery