Files
timmy-config/docs/backup-recovery-runbook.md
Rockachopa fb36197e8c feat(backup): add automated Gitea daily backup and recovery runbook
- Add bin/gitea-backup.sh: daily backup script using gitea dump
- Add cron/vps/gitea-daily-backup.yml: Hermes cron job (2 AM daily)
- Add docs/backup-recovery-runbook.md: complete recovery procedures

Addresses [AUDIT][RISK] Single-node VPS is a single point of failure.
Closes #481
2026-04-30 20:04:17 -04:00

156 lines
5.5 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
# Gitea Backup & Recovery Runbook
**Last updated:** 2026-04-30
**Scope:** Single-node VPS (Ezra, 143.198.27.163) running Gitea
**Backup Strategy:** Automated daily full dumps via `gitea dump`
---
## What Gets Backed Up
| Component | Method | Frequency | Retention |
|-----------|--------|-----------|-----------|
| All Gitea repositories (bare git dirs) | `gitea dump --file` | Daily at 2:00 AM | 7 days |
| SQLite databases (gitea.db, indexer.db, etc.) | Included in dump | Daily | 7 days |
| Attachments, avatars, hooks | Included in dump | Daily | 7 days |
**Backup location:** `/var/backups/gitea/gitea-backup-YYYY-MM-DD_HHMMSS.tar.gz`
**Log file:** `/var/log/gitea-backup.log`
---
## Backup Architecture
The backup script `bin/gitea-backup.sh` runs daily via Hermes cron (`cron/vps/gitea-daily-backup.yml`). It:
1. Locates the Gitea `WORK_PATH` by reading `/etc/gitea/app.ini` or falling back to common locations (`/var/lib/gitea`, `/home/git/gitea`)
2. Invokes `gitea dump --work-path <path> --file <backup-tar.gz>` — Gitea's native, consistent snapshot mechanism
3. Prunes archives older than 7 days
4. Logs all operations to `/var/log/gitea-backup.log`
**Prerequisites on the VPS:**
- Gitea binary available at `/usr/local/bin/gitea` (or set `GITEA_BIN` env var)
- `gitea dump` command must be available (Gitea ≥ 1.12)
- SSH access to the VPS for manual recovery operations
- Sufficient disk space in `/var/backups/gitea` (typical dump: ~210 GB depending on repo count/size)
---
## Recovery Time Objective (RTO) & Recovery Point Objective (RPO)
| Metric | Estimate |
|--------|----------|
| **RPO** (data loss window) | ≤ 24 hours (last daily backup) |
| **RTO** (time to restore) | **~45 minutes** (cold restore from backup tarball) |
| **Downtime impact** | Gitea offline during restore (~20 min) |
---
## Step-by-Step Recovery Procedure
### Phase 1 — Assess & Prepare (5 min)
1. SSH into Ezra VPS: `ssh root@143.198.27.163`
2. Stop Gitea so files are quiescent:
```bash
systemctl stop gitea
```
3. Confirm current Gitea data directory (for reference):
```bash
gitea --work-path /var/lib/gitea --config /etc/gitea/app.ini dump --help 2>&1
# Or check app.ini for WORK_PATH
cat /etc/gitea/app.ini | grep '^WORK_PATH'
```
### Phase 2 — Restore from Backup (20 min)
4. Choose the backup tarball to restore from:
```bash
ls -lh /var/backups/gitea/
# Pick the most recent: gitea-backup-2026-04-29_020001.tar.gz
```
5. **Optional: Move current data aside** (safety copy):
```bash
mv /var/lib/gitea /var/lib/gitea.bak-$(date +%s)
```
6. Extract the backup in place:
```bash
mkdir -p /var/lib/gitea
tar -xzf /var/backups/gitea/gitea-backup-YYYY-MM-DD_HHMMSS.tar.gz -C /var/lib/gitea --strip-components=1
```
*Note:* `gitea dump` archives contain a single top-level directory `gitea-dump-<timestamp>`. The `--strip-components=1` puts its contents directly into `/var/lib/gitea`.
7. Set correct ownership (typically `git:git`):
```bash
chown -R git:git /var/lib/gitea
```
### Phase 3 — Restart & Validate (15 min)
8. Start Gitea:
```bash
systemctl start gitea
```
9. Wait 30 seconds, then verify:
```bash
systemctl status gitea
# Check HTTP endpoint
curl -s -o /dev/null -w '%{http_code}' http://localhost:3000/ # Should be 200
```
10. Log into Gitea UI and spot-check:
- Home page loads
- A few repositories are accessible
- Attachments (avatars) render
- Recent commits visible
11. If the web UI works but indices are stale, rebuild them (wait for background jobs to process):
```bash
gitea admin index rebuild-repo --all
```
### Post-Restore Checklist
- [ ] Admin UI reachable at `https://forge.alexanderwhitestone.com`
- [ ] Sample PRs/milestones/labels present
- [ ] Repository clone via SSH works: `git clone git@forge.alexanderwhitestone.com:Timmy_Foundation/timmy-config.git`
- [ ] Check backup script health: `cat /var/log/gitea-backup.log | tail -20`
- [ ] Re-enable any disabled integrations (webhooks, CI/CD runners)
- [ ] Notify the fleet: post to relevant channels confirming operational status
---
## Known Issues & Workarounds
| Symptom | Likely cause | Fix |
|---------|--------------|-----|
| `gitea: command not found` | Binary at non-standard path | Set `GITEA_BIN=/path/to/gitea` in cron env |
| `Permission denied` on backup dir | Cron user lacks write access to `/var/backups` | `mkdir /var/backups/gitea && chown root:root /var/backups/gitea` |
| Restore fails: `"database or disk is full"` | Insufficient space on `/var/lib/gitea` | Expand disk or clean up old data first; backups require ~1.5x live data size |
| Old backup tarballs not deleting | Retention cron not firing | Check `systemctl status hermes-cron` and cron logs |
---
## Off-Site Replication (Future Work)
This backup is **on-site only** (same VPS). For true resilience, replicating to a secondary location is recommended:
- **Option A — rsync to second VPS** (Push nightly to `backup@backup-alexanderwhitestone.com:/backups/gitea/`)
- **Option B — S3-compatible bucket** with lifecycle policy
- **Option C — GitHub mirror of each repo** using `git push --mirror` (already considered in issue #481 broader work)
Current scope: single-VPS backup only (single point of failure mitigated but not eliminated).
---
## Related Documentation
- `bin/gitea-backup.sh` — backup script source
- `cron/vps/gitea-daily-backup.yml` — Hermes cron definition
- Gitea official docs: <https://docs.gitea.com/administration/backup-and-restore>
- Hermes cron: <https://hermes-agent.nousresearch.com/docs>