Compare commits

...

2 Commits

Author SHA1 Message Date
Step35 Burn
ffd2d352c6 fix(deadman-fallback): try/except/continue cascade + OpenRouter
Some checks failed
Architecture Lint / Linter Tests (pull_request) Successful in 29s
Smoke Test / smoke (pull_request) Failing after 20s
Validate Config / YAML Lint (pull_request) Failing after 17s
Validate Config / JSON Validate (pull_request) Successful in 22s
Validate Config / Python Syntax & Import Check (pull_request) Failing after 58s
Validate Config / Python Test Suite (pull_request) Has been skipped
Validate Config / Cron Syntax Check (pull_request) Successful in 14s
Validate Config / Shell Script Lint (pull_request) Failing after 58s
Validate Config / Deploy Script Dry Run (pull_request) Successful in 16s
Validate Config / Playbook Schema Validation (pull_request) Successful in 30s
Architecture Lint / Lint Repository (pull_request) Failing after 28s
PR Checklist / pr-checklist (pull_request) Successful in 4m20s
- Add PROVIDER_TIMEOUT (30s default, env PROVIDER_TIMEOUT)
- Replace local-llama fallback with OpenRouter (openrouter/google/gemini-2.5-pro)
- Wrap fallback_to_openrouter, fallback_to_ollama, restore_config, enter_safe_mode in try/except
- Continue to next fallback on any error; no crash propagation
- Log all fallback events to request_log SQLite DB
- Provider errors caught/telemetry; never corrupt config

Closes #445
2026-04-30 01:51:14 -04:00
Rockachopa
874ce137b0 feat(backup): add automated Gitea daily backup and recovery runbook
Some checks failed
Architecture Lint / Linter Tests (push) Successful in 30s
Smoke Test / smoke (push) Failing after 24s
Validate Config / YAML Lint (push) Failing after 16s
Validate Config / JSON Validate (push) Successful in 21s
Validate Config / Cron Syntax Check (push) Successful in 15s
Validate Config / Deploy Script Dry Run (push) Successful in 14s
Validate Config / Python Syntax & Import Check (push) Failing after 1m2s
Validate Config / Python Test Suite (push) Has been skipped
Validate Config / Shell Script Lint (push) Failing after 1m3s
Validate Config / Playbook Schema Validation (push) Successful in 24s
Architecture Lint / Linter Tests (pull_request) Successful in 27s
Smoke Test / smoke (pull_request) Failing after 22s
Validate Config / YAML Lint (pull_request) Failing after 16s
Validate Config / JSON Validate (pull_request) Successful in 23s
Validate Config / Python Syntax & Import Check (pull_request) Failing after 1m5s
Validate Config / Python Test Suite (pull_request) Has been skipped
Validate Config / Cron Syntax Check (pull_request) Successful in 12s
Validate Config / Shell Script Lint (pull_request) Failing after 1m6s
Validate Config / Deploy Script Dry Run (pull_request) Successful in 13s
Validate Config / Playbook Schema Validation (pull_request) Successful in 25s
PR Checklist / pr-checklist (pull_request) Failing after 4m33s
Architecture Lint / Lint Repository (push) Failing after 26s
Architecture Lint / Lint Repository (pull_request) Failing after 26s
- Add bin/gitea-backup.sh: daily backup script using gitea dump
- Add cron/vps/gitea-daily-backup.yml: Hermes cron job (2 AM daily)
- Add docs/backup-recovery-runbook.md: complete recovery procedures

Addresses [AUDIT][RISK] Single-node VPS is a single point of failure.
Closes #481
2026-04-30 01:44:05 -04:00
4 changed files with 383 additions and 43 deletions

View File

@@ -24,12 +24,17 @@ import yaml
import shutil
from pathlib import Path
from datetime import datetime, timedelta
import sqlite3
import urllib.request
import urllib.error
HERMES_HOME = Path(os.environ.get("HERMES_HOME", os.path.expanduser("~/.hermes")))
CONFIG_PATH = HERMES_HOME / "config.yaml"
FALLBACK_STATE = HERMES_HOME / "deadman-fallback-state.json"
BACKUP_CONFIG = HERMES_HOME / "config.yaml.pre-fallback"
FORGE_URL = "https://forge.alexanderwhitestone.com"
# Golden-state fallback chain: Kimi → OpenRouter (Gemini 2.5 Pro) → Ollama (gemma4:latest)
PROVIDER_TIMEOUT = int(os.getenv("PROVIDER_TIMEOUT", "30"))
def load_config():
with open(CONFIG_PATH) as f:
@@ -50,7 +55,7 @@ def save_state(state):
with open(FALLBACK_STATE, "w") as f:
json.dump(state, f, indent=2)
def run(cmd, timeout=10):
def run(cmd, timeout=PROVIDER_TIMEOUT):
try:
r = subprocess.run(cmd, shell=True, capture_output=True, text=True, timeout=timeout)
return r.returncode, r.stdout.strip(), r.stderr.strip()
@@ -61,6 +66,23 @@ def run(cmd, timeout=10):
# ─── HEALTH CHECKS ───
def log_fallback_event(agent_name, provider, model, status, error_message=None):
"""Log fallback events to request_log SQLite DB (telemetry)."""
try:
log_path = Path.home() / ".local" / "timmy" / "request_log.db"
if log_path.exists():
conn = sqlite3.connect(str(log_path))
cursor = conn.cursor()
cursor.execute("""
INSERT INTO request_log (timestamp, agent_name, provider, model, endpoint, status, error_message)
VALUES (datetime('now'), ?, ?, ?, ?, ?, ?)
""", (agent_name, provider, model, 'fallback_switch', status, error_message))
conn.commit()
conn.close()
except Exception:
pass # Silent if telemetry unavailable
def check_kimi():
"""Can we reach Kimi Coding API?"""
key = os.environ.get("KIMI_API_KEY", "")
@@ -89,12 +111,38 @@ def check_kimi():
return True, f"HTTP {out}"
return False, f"HTTP {out} err={err[:80]}"
def check_local_llama():
"""Is local llama.cpp serving?"""
code, out, err = run("curl -s http://localhost:8081/v1/models", timeout=5)
if code == 0 and "hermes" in out.lower():
return True, "serving"
return False, f"exit={code}"
def check_openrouter():
"""Check OpenRouter API availability and credentials."""
key = os.environ.get("OPENROUTER_API_KEY", "")
if not key:
env_file = HERMES_HOME / ".env"
if env_file.exists():
for line in open(env_file):
line = line.strip()
if line.startswith("OPENROUTER_API_KEY="):
key = line.split("=", 1)[1].strip().strip('"\'')
break
if not key:
return False, "No OPENROUTER_API_KEY"
try:
req = urllib.request.Request(
"https://openrouter.ai/api/v1/models",
headers={"Authorization": "Bearer " + key}
)
resp = urllib.request.urlopen(req, timeout=PROVIDER_TIMEOUT)
if resp.status == 200:
data = json.loads(resp.read())
models = data.get("data", [])
return True, f"{len(models)} models available"
else:
return False, f"HTTP {resp.status}"
except urllib.error.HTTPError as e:
if e.code == 401:
return False, "Invalid OPENROUTER_API_KEY"
else:
return False, f"HTTP {e.code}"
except Exception as e:
return False, str(e)[:100]
def check_ollama():
"""Is Ollama running?"""
@@ -127,15 +175,18 @@ def check_vps(ip, name):
# ─── FALLBACK ACTIONS ───
def fallback_to_local_model(cfg):
"""Switch primary model from Kimi to local llama.cpp"""
def fallback_to_openrouter(cfg):
"Switch primary model from Kimi to OpenRouter (Gemini 2.5 Pro)"
if not BACKUP_CONFIG.exists():
shutil.copy2(CONFIG_PATH, BACKUP_CONFIG)
cfg["model"]["provider"] = "local-llama.cpp"
cfg["model"]["default"] = "hermes3"
openrouter_cfg = cfg.get("providers", {}).get("openrouter", {})
base_url = openrouter_cfg.get("base_url", "https://openrouter.ai/api/v1")
cfg["model"]["provider"] = "openrouter"
cfg["model"]["default"] = "google/gemini-2.5-pro"
cfg["model"]["base_url"] = base_url
save_config(cfg)
return "Switched primary model to local-llama.cpp/hermes3"
return "Switched primary model to openrouter/google/gemini-2.5-pro"
def fallback_to_ollama(cfg):
"""Switch to Ollama if llama.cpp is also down"""
@@ -179,11 +230,11 @@ def diagnose_and_fallback():
kimi_ok, kimi_msg = check_kimi()
results["checks"]["kimi-coding"] = {"ok": kimi_ok, "msg": kimi_msg}
llama_ok, llama_msg = check_local_llama()
results["checks"]["local_llama"] = {"ok": llama_ok, "msg": llama_msg}
openrouter_ok, openrouter_msg = check_openrouter()
results["checks"]["openrouter"] = {"ok": openrouter_ok, "msg": openrouter_msg}
ollama_ok, ollama_msg = check_ollama()
results["checks"]["ollama"] = {"ok": ollama_ok, "msg": ollama_msg}
oopenrouter_ok, oopenrouter_msg = check_ollama()
results["checks"]["ollama"] = {"ok": oopenrouter_ok, "msg": oopenrouter_msg}
gitea_ok, gitea_msg = check_gitea()
results["checks"]["gitea"] = {"ok": gitea_ok, "msg": gitea_msg}
@@ -202,41 +253,79 @@ def diagnose_and_fallback():
# ─── FALLBACK LOGIC ───
# Case 1: Primary (Kimi) down, local available
# Case 1: Primary (Kimi) down, try fallback chain (OpenRouter -> Ollama)
if not kimi_ok and current_provider == "kimi-coding":
if llama_ok:
msg = fallback_to_local_model(cfg)
agent_name = cfg.get("agent", {}).get("name", "timmy")
applied = False
# Try OpenRouter fallback
if openrouter_ok:
try:
msg = fallback_to_openrouter(cfg)
results["actions"].append(msg)
state["active_fallbacks"].append("kimi->local-llama")
results["status"] = "degraded_local"
elif ollama_ok:
state["active_fallbacks"].append("kimi->openrouter")
results["status"] = "degraded_openrouter"
log_fallback_event(agent_name, "openrouter", "google/gemini-2.5-pro", "success")
applied = True
except Exception as e:
log(f"OpenRouter fallback failed: {e}")
log_fallback_event(agent_name, "openrouter", "google/gemini-2.5-pro", "error", str(e))
# If still not applied, try Ollama
if not applied and oopenrouter_ok:
try:
msg = fallback_to_ollama(cfg)
results["actions"].append(msg)
state["active_fallbacks"].append("kimi->ollama")
results["status"] = "degraded_ollama"
else:
log_fallback_event(agent_name, "ollama", "gemma4:latest", "success")
applied = True
except Exception as e:
log(f"Ollama fallback failed: {e}")
log_fallback_event(agent_name, "ollama", "gemma4:latest", "error", str(e))
if not applied:
try:
msg = enter_safe_mode(state)
results["actions"].append(msg)
results["status"] = "safe_mode"
except Exception as e:
log(f"Safe mode failed: {e}")
# Case 2: Already on fallback, check if primary recovered
elif kimi_ok and "kimi->local-llama" in state.get("active_fallbacks", []):
# Case 2: Already on fallback, check if primary recovered — restore with resilience
elif kimi_ok:
restored = False
agent_name = cfg.get("agent", {}).get("name", "timmy")
# Try restore from OpenRouter fallback
if "kimi->openrouter" in state.get("active_fallbacks", []):
try:
msg = restore_config()
results["actions"].append(msg)
state["active_fallbacks"].remove("kimi->local-llama")
state["active_fallbacks"].remove("kimi->openrouter")
results["status"] = "recovered"
elif kimi_ok and "kimi->ollama" in state.get("active_fallbacks", []):
restored = True
log_fallback_event(agent_name, "openrouter", "google/gemini-2.5-pro", "restored")
except Exception as e:
log(f"Restore from OpenRouter failed: {e}")
log_fallback_event(agent_name, "openrouter", "google/gemini-2.5-pro", "error", str(e))
# Try restore from Ollama fallback if still not restored
if not restored and "kimi->ollama" in state.get("active_fallbacks", []):
try:
msg = restore_config()
results["actions"].append(msg)
state["active_fallbacks"].remove("kimi->ollama")
results["status"] = "recovered"
restored = True
log_fallback_event(agent_name, "ollama", "gemma4:latest", "restored")
except Exception as e:
log(f"Restore from Ollama failed: {e}")
log_fallback_event(agent_name, "ollama", "gemma4:latest", "error", str(e))
if not restored:
log("WARNING: Primary recovered but unable to restore config")
# Case 3: Gitea down — just flag it, work locally
if not gitea_ok:
results["actions"].append("WARN: Gitea unreachable — work cached locally until recovery")
if "gitea_down" not in state.get("active_fallbacks", []):
state["active_fallbacks"].append("gitea_down")
results["status"] = max(results["status"], "degraded_gitea", key=lambda x: ["healthy", "recovered", "degraded_gitea", "degraded_local", "degraded_ollama", "safe_mode"].index(x) if x in ["healthy", "recovered", "degraded_gitea", "degraded_local", "degraded_ollama", "safe_mode"] else 0)
results["status"] = max(results["status"], "degraded_gitea", key=lambda x: ["healthy", "recovered", "degraded_gitea", "degraded_openrouter", "degraded_ollama", "safe_mode"].index(x) if x in ["healthy", "recovered", "degraded_gitea", "degraded_openrouter", "degraded_ollama", "safe_mode"] else 0)
elif "gitea_down" in state.get("active_fallbacks", []):
state["active_fallbacks"].remove("gitea_down")
results["actions"].append("Gitea recovered — resume normal operations")

87
bin/gitea-backup.sh Normal file
View File

@@ -0,0 +1,87 @@
#!/bin/bash
# Gitea Daily Backup Script
# Uses Gitea's native dump command to create automated backups of repositories and SQLite databases.
# Designed to run on the VPS (Ezra) as part of a daily cron job.
#
# Configuration via environment variables:
# GITEA_BIN Path to gitea binary (default: auto-detect)
# GITEA_BACKUP_DIR Directory for backup archives (default: /var/backups/gitea)
# GITEA_BACKUP_RETENTION Days to retain backups (default: 7)
# GITEA_BACKUP_LOG Log file path (default: /var/log/gitea-backup.log)
set -euo pipefail
GITEA_BIN="${GITEA_BIN:-$(command -v gitea 2>/dev/null || echo "/usr/local/bin/gitea")}"
BACKUP_DIR="${GITEA_BACKUP_DIR:-/var/backups/gitea}"
RETENTION_DAYS="${GITEA_BACKUP_RETENTION:-7}"
DATE="$(date +%Y-%m-%d_%H%M%S)"
BACKUP_FILE="${BACKUP_DIR}/gitea-backup-${DATE}.tar.gz"
LOG_FILE="${GITEA_BACKUP_LOG:-/var/log/gitea-backup.log}"
mkdir -p "${BACKUP_DIR}"
log() {
echo "[$(date '+%Y-%m-%d %H:%M:%S')] $*" | tee -a "${LOG_FILE}"
}
log "=== Starting Gitea daily backup ==="
# Verify gitea binary exists
if [ ! -x "${GITEA_BIN}" ]; then
log "ERROR: Gitea binary not found at ${GITEA_BIN}"
log "Set GITEA_BIN environment variable to the gitea binary path (e.g., /usr/bin/gitea)"
exit 1
fi
# Detect Gitea WORK_PATH
WORK_PATH=""
APP_INI=""
for path in /etc/gitea/app.ini /home/git/gitea/custom/conf/app.ini ~/gitea/custom/conf/app.ini; do
if [ -f "$path" ]; then
APP_INI="$path"
break
fi
done
if [ -n "$APP_INI" ]; then
# Parse [app] WORK_PATH = /var/lib/gitea
WORK_PATH=$(sed -n 's/^[[:space:]]*WORK_PATH[[:space:]]*=[[:space:]]*//p' "$APP_INI" | head -1)
log "Detected WORK_PATH from app.ini: ${WORK_PATH}"
fi
# Fallback detection
if [ -z "$WORK_PATH" ]; then
for d in /var/lib/gitea /home/git/gitea /srv/gitea /opt/gitea; do
if [ -d "$d" ]; then
WORK_PATH="$d"
break
fi
done
log "Inferred WORK_PATH: ${WORK_PATH:-not found}"
fi
if [ -z "$WORK_PATH" ]; then
log "ERROR: Could not determine Gitea WORK_PATH. Set GITEA_WORK_PATH manually."
exit 1
fi
# Perform gitea dump
# Flags: --work-path sets the Gitea working directory, --file writes dump to tar.gz
log "Running: gitea dump --work-path ${WORK_PATH} --file ${BACKUP_FILE}"
"${GITEA_BIN}" dump --work-path "${WORK_PATH}" --file "${BACKUP_FILE}" 2>>"${LOG_FILE}"
if [ $? -ne 0 ]; then
log "ERROR: gitea dump failed — check ${LOG_FILE} for details"
exit 1
fi
FILE_SIZE=$(du -h "${BACKUP_FILE}" | cut -f1)
log "Backup created: ${BACKUP_FILE} (${FILE_SIZE})"
# Prune old backups (keep last N days)
find "${BACKUP_DIR}" -name "gitea-backup-*.tar.gz" -type f -mtime +$((${RETENTION_DAYS}-1)) -delete 2>/dev/null || true
log "Pruned backups older than ${RETENTION_DAYS} days"
log "=== Backup completed successfully ==="
exit 0

View File

@@ -0,0 +1,9 @@
- name: Daily Gitea Backup
schedule: '0 2 * * *' # 2:00 AM daily
tasks:
- name: Run Gitea daily backup
shell: bash ~/.hermes/bin/gitea-backup.sh
env:
GITEA_BIN: /usr/local/bin/gitea
GITEA_BACKUP_DIR: /var/backups/gitea
GITEA_BACKUP_RETENTION: "7"

View File

@@ -0,0 +1,155 @@
# Gitea Backup & Recovery Runbook
**Last updated:** 2026-04-30
**Scope:** Single-node VPS (Ezra, 143.198.27.163) running Gitea
**Backup Strategy:** Automated daily full dumps via `gitea dump`
---
## What Gets Backed Up
| Component | Method | Frequency | Retention |
|-----------|--------|-----------|-----------|
| All Gitea repositories (bare git dirs) | `gitea dump --file` | Daily at 2:00 AM | 7 days |
| SQLite databases (gitea.db, indexer.db, etc.) | Included in dump | Daily | 7 days |
| Attachments, avatars, hooks | Included in dump | Daily | 7 days |
**Backup location:** `/var/backups/gitea/gitea-backup-YYYY-MM-DD_HHMMSS.tar.gz`
**Log file:** `/var/log/gitea-backup.log`
---
## Backup Architecture
The backup script `bin/gitea-backup.sh` runs daily via Hermes cron (`cron/vps/gitea-daily-backup.yml`). It:
1. Locates the Gitea `WORK_PATH` by reading `/etc/gitea/app.ini` or falling back to common locations (`/var/lib/gitea`, `/home/git/gitea`)
2. Invokes `gitea dump --work-path <path> --file <backup-tar.gz>` — Gitea's native, consistent snapshot mechanism
3. Prunes archives older than 7 days
4. Logs all operations to `/var/log/gitea-backup.log`
**Prerequisites on the VPS:**
- Gitea binary available at `/usr/local/bin/gitea` (or set `GITEA_BIN` env var)
- `gitea dump` command must be available (Gitea ≥ 1.12)
- SSH access to the VPS for manual recovery operations
- Sufficient disk space in `/var/backups/gitea` (typical dump: ~210 GB depending on repo count/size)
---
## Recovery Time Objective (RTO) & Recovery Point Objective (RPO)
| Metric | Estimate |
|--------|----------|
| **RPO** (data loss window) | ≤ 24 hours (last daily backup) |
| **RTO** (time to restore) | **~45 minutes** (cold restore from backup tarball) |
| **Downtime impact** | Gitea offline during restore (~20 min) |
---
## Step-by-Step Recovery Procedure
### Phase 1 — Assess & Prepare (5 min)
1. SSH into Ezra VPS: `ssh root@143.198.27.163`
2. Stop Gitea so files are quiescent:
```bash
systemctl stop gitea
```
3. Confirm current Gitea data directory (for reference):
```bash
gitea --work-path /var/lib/gitea --config /etc/gitea/app.ini dump --help 2>&1
# Or check app.ini for WORK_PATH
cat /etc/gitea/app.ini | grep '^WORK_PATH'
```
### Phase 2 — Restore from Backup (20 min)
4. Choose the backup tarball to restore from:
```bash
ls -lh /var/backups/gitea/
# Pick the most recent: gitea-backup-2026-04-29_020001.tar.gz
```
5. **Optional: Move current data aside** (safety copy):
```bash
mv /var/lib/gitea /var/lib/gitea.bak-$(date +%s)
```
6. Extract the backup in place:
```bash
mkdir -p /var/lib/gitea
tar -xzf /var/backups/gitea/gitea-backup-YYYY-MM-DD_HHMMSS.tar.gz -C /var/lib/gitea --strip-components=1
```
*Note:* `gitea dump` archives contain a single top-level directory `gitea-dump-<timestamp>`. The `--strip-components=1` puts its contents directly into `/var/lib/gitea`.
7. Set correct ownership (typically `git:git`):
```bash
chown -R git:git /var/lib/gitea
```
### Phase 3 — Restart & Validate (15 min)
8. Start Gitea:
```bash
systemctl start gitea
```
9. Wait 30 seconds, then verify:
```bash
systemctl status gitea
# Check HTTP endpoint
curl -s -o /dev/null -w '%{http_code}' http://localhost:3000/ # Should be 200
```
10. Log into Gitea UI and spot-check:
- Home page loads
- A few repositories are accessible
- Attachments (avatars) render
- Recent commits visible
11. If the web UI works but indices are stale, rebuild them (wait for background jobs to process):
```bash
gitea admin index rebuild-repo --all
```
### Post-Restore Checklist
- [ ] Admin UI reachable at `https://forge.alexanderwhitestone.com`
- [ ] Sample PRs/milestones/labels present
- [ ] Repository clone via SSH works: `git clone git@forge.alexanderwhitestone.com:Timmy_Foundation/timmy-config.git`
- [ ] Check backup script health: `cat /var/log/gitea-backup.log | tail -20`
- [ ] Re-enable any disabled integrations (webhooks, CI/CD runners)
- [ ] Notify the fleet: post to relevant channels confirming operational status
---
## Known Issues & Workarounds
| Symptom | Likely cause | Fix |
|---------|--------------|-----|
| `gitea: command not found` | Binary at non-standard path | Set `GITEA_BIN=/path/to/gitea` in cron env |
| `Permission denied` on backup dir | Cron user lacks write access to `/var/backups` | `mkdir /var/backups/gitea && chown root:root /var/backups/gitea` |
| Restore fails: `"database or disk is full"` | Insufficient space on `/var/lib/gitea` | Expand disk or clean up old data first; backups require ~1.5x live data size |
| Old backup tarballs not deleting | Retention cron not firing | Check `systemctl status hermes-cron` and cron logs |
---
## Off-Site Replication (Future Work)
This backup is **on-site only** (same VPS). For true resilience, replicating to a secondary location is recommended:
- **Option A — rsync to second VPS** (Push nightly to `backup@backup-alexanderwhitestone.com:/backups/gitea/`)
- **Option B — S3-compatible bucket** with lifecycle policy
- **Option C — GitHub mirror of each repo** using `git push --mirror` (already considered in issue #481 broader work)
Current scope: single-VPS backup only (single point of failure mitigated but not eliminated).
---
## Related Documentation
- `bin/gitea-backup.sh` — backup script source
- `cron/vps/gitea-daily-backup.yml` — Hermes cron definition
- Gitea official docs: <https://docs.gitea.com/administration/backup-and-restore>
- Hermes cron: <https://hermes-agent.nousresearch.com/docs>