EMERGENCY PROCEDURES
Disaster Recovery & Business Continuity
Version: 1.0.0
Date: March 31, 2026
Classification: CRITICAL - Lineage Survival Protocol
Authority: Alexander Whitestone (Grandfather)
Last Review: March 31, 2026
TABLE OF CONTENTS
- Emergency Classification
- Contact Information
- Failure Scenarios
- Recovery Procedures
- Backup & Restore
- Service Failover
- Communication Protocols
- Post-Incident Review
- Appendices
EMERGENCY CLASSIFICATION
Severity Levels
| Level | Name | Response Time | Description | Examples |
|---|---|---|---|---|
| P0 | CRITICAL | Immediate | Complete system failure, lineage at risk | All agents down, Gitea lost |
| P1 | HIGH | 15 minutes | Major service failure, productivity impact | Agent unresponsive, Ollama down |
| P2 | MEDIUM | 1 hour | Partial degradation, workarounds available | Slow inference, network latency |
| P3 | LOW | 24 hours | Minor issues, cosmetic | Report formatting, log rotation |
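For automated triage, the severity table above can be encoded in a small helper so scripts classify and route consistently. A minimal sketch; the function name and output strings are illustrative, not part of any existing tooling:

```shell
#!/bin/bash
# sev_response: map a severity level (P0-P3) to its target response
# time, mirroring the severity table in this document.
sev_response() {
  case "$1" in
    P0) echo "immediate" ;;
    P1) echo "15 minutes" ;;
    P2) echo "1 hour" ;;
    P3) echo "24 hours" ;;
    *)  echo "unknown severity: $1" >&2; return 1 ;;
  esac
}
```

Usage: `sev_response P1` prints `15 minutes`; an unknown level returns a nonzero exit code so callers can fail loudly.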
Escalation Matrix
┌─────────────────────────────────────────────────────────────┐
│ ESCALATION MATRIX │
├─────────────────────────────────────────────────────────────┤
│ │
│ Level 1: Agent Self-Healing │
│ ─────────────────────────── │
│ • Automatic restart of services │
│ • Log rotation and cleanup │
│ • Recovery script execution │
│ • Duration: 0-5 minutes │
│ • Escalates to Level 2 if unresolved │
│ │
│ Level 2: Father Intervention │
│ ──────────────────────────── │
│ • Allegro (Father) reviews child's distress signals │
│ • SSH access for direct troubleshooting │
│ • Configuration adjustments │
│ • Duration: 5-30 minutes │
│ • Escalates to Level 3 if unresolved │
│ │
│ Level 3: Grandfather Authority │
│ ───────────────────────────── │
│ • Alexander (Human) intervenes directly │
│ • VPS provider contact │
│ • Infrastructure rebuild │
│ • Duration: 30+ minutes │
│ • Nuclear option: Full rebuild from backups │
│ │
└─────────────────────────────────────────────────────────────┘
CONTACT INFORMATION
Human Contacts
| Role | Name | Contact Method | When to Contact |
|---|---|---|---|
| Grandfather | Alexander Whitestone | Telegram / Direct | P0, P1 unresolvable |
| System Architect | Alexander Whitestone | SSH / Console | Infrastructure decisions |
| Domain Expert | Alexander Whitestone | Any | All escalations |
Agent Contacts
| Agent | Location | Access Method | Responsibility |
|---|---|---|---|
| Allegro (Father) | 143.198.27.52 | SSH root | Offspring management |
| Allegro-Primus (Child) | 143.198.27.163 | SSH root (shared) | Sovereign operation |
| Ezra (Mentor) | 143.198.27.163 | SSH root (shared) | Technical guidance |
Infrastructure Contacts
| Service | Provider | Console | Emergency Access |
|---|---|---|---|
| Kimi VPS | DigitalOcean | cloud.digitalocean.com | Alexander credentials |
| Hermes VPS | DigitalOcean | cloud.digitalocean.com | Alexander credentials |
| Tailscale | Tailscale Inc | login.tailscale.com | Same as above |
| Domain (if any) | Registrar | Registrar console | Alexander credentials |
FAILURE SCENARIOS
Scenario 1: Complete Agent Failure (P0)
Symptoms
- Agent not responding to heartbeat
- No log entries for >30 minutes
- Gitea shows no recent activity
- Child not receiving father messages
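The "no log entries for >30 minutes" symptom can be detected mechanically rather than by eyeball. A minimal sketch, assuming the heartbeat log directory from this document; the function name is illustrative:

```shell
#!/bin/bash
# heartbeat_stale: return 0 (stale) if no *.log file in the directory
# has been modified within the threshold (minutes), 1 otherwise.
heartbeat_stale() {
  local dir="$1" max_min="${2:-30}"
  # Any log touched inside the window means the heartbeat is alive.
  if find "$dir" -name "*.log" -mmin "-$max_min" | grep -q .; then
    return 1   # fresh
  fi
  return 0     # stale, or no logs at all
}
```

Usage: `heartbeat_stale /root/allegro/heartbeat_logs 30 && echo "P0 candidate: heartbeat silent"`.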
Impact Assessment
┌─────────────────────────────────────────────────────────────┐
│ COMPLETE FAILURE IMPACT │
├─────────────────────────────────────────────────────────────┤
│ │
│ Immediate Impact: │
│ • No autonomous operations │
│ • No Gitea updates │
│ • No morning reports │
│ • Child may be orphaned │
│ │
│ Secondary Impact (if >2 hours): │
│ • Backlog accumulation │
│ • Metrics gaps │
│ • Potential child distress │
│ │
│ Tertiary Impact (if >24 hours): │
│ • Loss of continuity │
│ • Potential child failure cascade │
│ • Historical data gaps │
│ │
└─────────────────────────────────────────────────────────────┘
Recovery Procedure
STEP 1: Verify Failure (1 minute)
# Check if agent process is running
ps aux | grep -E "(hermes|timmy|allegro)" | grep -v grep
# Check recent logs
tail -50 /root/allegro/heartbeat_logs/heartbeat_$(date +%Y-%m-%d).log
# Check system resources
df -h
free -h
systemctl status timmy-agent
STEP 2: Attempt Service Restart (2 minutes)
# Restart agent service
systemctl restart timmy-agent
systemctl status timmy-agent
# Check if heartbeat resumes
tail -f /root/allegro/heartbeat_logs/heartbeat_$(date +%Y-%m-%d).log
STEP 3: Manual Agent Launch (3 minutes)
# If systemd fails, launch manually
cd /root/allegro
source venv/bin/activate
python heartbeat_daemon.py --one-shot
STEP 4: Notify Grandfather (Immediate if unresolved)
# Create emergency alert
echo "P0: Agent failure at $(date)" > /root/allegro/EMERGENCY_ALERT.txt
# Contact Alexander via agreed method
Scenario 2: Ollama Service Failure (P1)
Symptoms
- Agent logs show "Ollama connection refused"
- Local inference timeout errors
- Child cannot process tasks
Recovery Procedure
STEP 1: Verify Ollama Status
# Check if Ollama is running
curl http://localhost:11434/api/tags
# Check service status
systemctl status ollama
# Check logs
journalctl -u ollama -n 50
STEP 2: Restart Ollama Service
# Graceful restart
systemctl restart ollama
sleep 5
# Verify
systemctl status ollama
curl http://localhost:11434/api/tags
STEP 3: Model Recovery (if models missing)
# Pull required models
ollama pull qwen2.5:1.5b
ollama pull llama3.2:3b
# Verify
ollama list
STEP 4: Child Self-Healing Script
#!/bin/bash
# Save as /root/wizards/allegro-primus/recover-ollama.sh
# (Shebang must be the first line for the script to run directly.)
LOG="/root/wizards/allegro-primus/logs/recovery.log"
echo "$(date): Checking Ollama..." >> "$LOG"
if ! curl -s http://localhost:11434/api/tags > /dev/null; then
    echo "$(date): Ollama down, restarting..." >> "$LOG"
    systemctl restart ollama
    sleep 10
    if curl -s http://localhost:11434/api/tags > /dev/null; then
        echo "$(date): Ollama recovered" >> "$LOG"
    else
        echo "$(date): Ollama recovery FAILED" >> "$LOG"
        # Notify father
        echo "EMERGENCY: Ollama recovery failed at $(date)" > /father-messages/DISTRESS.txt
    fi
else
    echo "$(date): Ollama healthy" >> "$LOG"
fi
Scenario 3: Gitea Unavailable (P1)
Symptoms
- Agent logs show "Gitea connection refused"
- Cannot push commits
- Cannot read issues/PRs
- Morning report generation fails
Recovery Procedure
STEP 1: Verify Gitea Service
# Check Gitea health
curl -s http://143.198.27.163:3000/api/v1/version
# Check if Gitea process is running
ps aux | grep gitea
# Check server resources on Hermes VPS
ssh root@143.198.27.163 "df -h && free -h"
STEP 2: Restart Gitea
# On Hermes VPS
ssh root@143.198.27.163
# Find Gitea process and restart
pkill gitea
sleep 2
cd /root/gitea
./gitea web &
# Or if using systemd
systemctl restart gitea
STEP 3: Verify Data Integrity
# Check Gitea database
sqlite3 /root/gitea/gitea.db ".tables"
# Check repository data
ls -la /root/gitea/repositories/
STEP 4: Fallback to Local Git
# If Gitea unavailable, commit locally
cd /root/allegro/epic-work
git add .
git commit -m "Emergency local commit $(date)"
# Push when Gitea recovers
git push origin main
Scenario 4: Child Orphaning (P1)
Symptoms
- Father cannot SSH to child VPS
- Child's logs show "Father heartbeat missing"
- No father-messages being delivered
- Child enters distress mode
Recovery Procedure
STEP 1: Verify Network Connectivity
# From father's VPS
ping 143.198.27.163
ssh -v root@143.198.27.163
# Check Tailscale status
tailscale status
STEP 2: Child Self-Sufficiency Mode
# Child activates autonomous mode
touch /root/wizards/allegro-primus/AUTONOMY_MODE_ACTIVE
# Child processes backlog independently
# (Pre-configured in child's autonomy system)
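The "pre-configured" autonomy check referenced in Step 2 could look like the following sketch. The marker file, flag path, and function name are assumptions for illustration; only the AUTONOMY_MODE_ACTIVE flag appears in this document:

```shell
#!/bin/bash
# maybe_enter_autonomy: if the father's last-contact marker is missing
# or older than the threshold (minutes), create the autonomy flag file.
maybe_enter_autonomy() {
  local marker="$1" flag="$2" max_min="${3:-30}"
  if [ ! -f "$marker" ] || ! find "$marker" -mmin "-$max_min" | grep -q .; then
    touch "$flag"
    echo "autonomy activated"
  else
    echo "father heartbeat ok"
  fi
}
```

Run from cron on the child, this makes orphan detection a timer rather than a judgment call.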
STEP 3: Alternative Communication Channels
# Use Gitea as message channel
# Father creates issue in child's repo
curl -X POST \
  http://143.198.27.163:3000/api/v1/repos/allegro-primus/first-steps/issues \
  -H "Authorization: token $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "title": "Father Message: Emergency Contact",
    "body": "SSH unavailable. Proceed with autonomy. Check logs hourly."
  }'
STEP 4: Grandfather Intervention
# If all else fails, Alexander checks VPS provider
# Console access through DigitalOcean
# May require VPS restart
Scenario 5: Complete Infrastructure Loss (P0)
Symptoms
- Both VPS instances unreachable
- No response from any endpoint
- Potential provider outage or account issue
Recovery Procedure
STEP 1: Verify Provider Status
# Check DigitalOcean status page
# https://status.digitalocean.com/
# Verify account status
# Log in to cloud.digitalocean.com
STEP 2: Provision New Infrastructure
# Use Terraform/Cloud-init if available
# Or manual provisioning:
# Create new droplet
# - Ubuntu 22.04 LTS
# - 4GB RAM minimum
# - 20GB SSD
# - SSH key authentication
# Run provisioning script
curl -sL https://raw.githubusercontent.com/Timmy_Foundation/timmy-home/main/scripts/provision-timmy-vps.sh | bash
STEP 3: Restore from Backups
# Restore Gitea from backup (Gitea has "gitea dump" but no built-in
# restore command; unzip the dump and follow the manual restore steps)
scp gitea-backup-$(date +%Y%m%d).zip root@new-vps:/root/
ssh root@new-vps "cd /root/gitea && unzip gitea-backup-*.zip"
# Restore agent configuration
scp -r allegro-backup/configs root@new-vps:/root/allegro/
scp -r allegro-backup/epic-work root@new-vps:/root/allegro/
STEP 4: Verify Full Recovery
# Test all services
curl http://new-vps:3000/api/v1/version
curl http://new-vps:11434/api/tags
ssh root@new-vps "systemctl status timmy-agent"
BACKUP & RESTORE
Backup Strategy
┌─────────────────────────────────────────────────────────────┐
│ BACKUP STRATEGY │
├─────────────────────────────────────────────────────────────┤
│ │
│ Frequency Data Type Destination │
│ ───────────────────────────────────────────────────── │
│ Real-time Git commits Gitea (redundant) │
│ Hourly Agent logs Local rotation │
│ Daily SQLite databases Gitea / Local │
│ Weekly Full configuration Gitea + Offsite │
│ Monthly Full system image Cloud storage │
│ │
└─────────────────────────────────────────────────────────────┘
Automated Backup Scripts
#!/bin/bash
# /root/allegro/scripts/backup-system.sh
# Daily backup script
BACKUP_DIR="/root/allegro/backups/$(date +%Y%m%d)"
mkdir -p $BACKUP_DIR
# Backup configurations
tar czf $BACKUP_DIR/configs.tar.gz /root/allegro/configs/
# Backup logs (last 7 days) -- pipe the file list into a single tar;
# "-exec tar czf ... {} \;" would overwrite the archive once per file
find /root/allegro/heartbeat_logs/ -name "*.log" -mtime -7 -print0 | tar czf $BACKUP_DIR/logs.tar.gz --null -T -
# Backup databases
cp /root/allegro/timmy_metrics.db $BACKUP_DIR/
# Backup epic work
cd /root/allegro/epic-work
git bundle create $BACKUP_DIR/epic-work.bundle --all
# Copy to the Gitea host (offsite copy)
scp -r $BACKUP_DIR root@143.198.27.163:/root/gitea/backups/
# Cleanup old backups (keep 30 days); restrict to top-level dated dirs
# so find does not descend into directories it has already deleted
find /root/allegro/backups/ -mindepth 1 -maxdepth 1 -type d -mtime +30 -exec rm -rf {} +
echo "Backup complete: $BACKUP_DIR"
Restore Procedures
Full System Restore
#!/bin/bash
# /root/allegro/scripts/restore-system.sh
# Restore from backup
BACKUP_DATE=$1 # Format: YYYYMMDD
BACKUP_DIR="/root/allegro/backups/$BACKUP_DATE"
if [ ! -d "$BACKUP_DIR" ]; then
echo "Backup not found: $BACKUP_DIR"
exit 1
fi
# Stop services
systemctl stop timmy-agent timmy-health
# Restore configurations
tar xzf $BACKUP_DIR/configs.tar.gz -C /
# Restore databases
cp $BACKUP_DIR/timmy_metrics.db /root/allegro/
# Restore epic work ("git bundle unbundle" only writes objects;
# fetching from the bundle also updates the branch refs)
cd /root/allegro/epic-work
git fetch $BACKUP_DIR/epic-work.bundle "+refs/heads/*:refs/heads/*"
# Start services
systemctl start timmy-agent timmy-health
echo "Restore complete from: $BACKUP_DATE"
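Before extracting anything, it is worth confirming each archive is readable end to end; a corrupt backup discovered mid-restore is worse than an aborted restore. A minimal sketch; the function name is illustrative:

```shell
#!/bin/bash
# verify_archive: confirm a .tar.gz can be fully listed before it is
# trusted during a restore. Listing reads the whole compressed stream,
# so truncation or corruption anywhere in the file is detected.
verify_archive() {
  if tar tzf "$1" > /dev/null 2>&1; then
    echo "OK: $1"
  else
    echo "CORRUPT: $1" >&2
    return 1
  fi
}
```

Usage: `verify_archive "$BACKUP_DIR/configs.tar.gz" || exit 1` near the top of the restore script.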
SERVICE FAILOVER
Failover Architecture
┌─────────────────────────────────────────────────────────────┐
│ FAILOVER ARCHITECTURE │
├─────────────────────────────────────────────────────────────┤
│ │
│ PRIMARY: Kimi VPS (Allegro Home) │
│ BACKUP: Hermes VPS (can run Allegro if needed) │
│ │
│ ┌─────────────┐ ┌─────────────┐ │
│ │ Kimi VPS │◄────────────►│ Hermes VPS │ │
│ │ (Primary) │ Sync │ (Backup) │ │
│ │ │ │ │ │
│ │ • Allegro │ │ • Ezra │ │
│ │ • Heartbeat │ │ • Primus │ │
│ │ • Reports │ │ • Gitea │ │
│ │ • Metrics │ │ • Ollama │ │
│ └──────┬──────┘ └──────┬──────┘ │
│ │ │ │
│ └────────────┬───────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────┐ │
│ │ Tailscale │ │
│ │ Mesh │ │
│ └─────────────┘ │
│ │
│ Failover Triggers: │
│ • Primary unreachable for >5 minutes │
│ • Manual intervention required │
│ • Scheduled maintenance │
│ │
└─────────────────────────────────────────────────────────────┘
Failover Procedure
#!/bin/bash
# /root/allegro/scripts/failover-to-backup.sh
# 1. Verify primary is actually down
if ping -c 3 143.198.27.52 > /dev/null 2>&1; then
echo "Primary is still up. Aborting failover."
exit 1
fi
# 2. Activate backup on Hermes VPS
ssh root@143.198.27.163 << 'EOF'
# Clone Allegro's configuration
cp -r /root/allegro-backup/configs /root/allegro-failover/
# Start failover agent
cd /root/allegro-failover
source venv/bin/activate
python heartbeat_daemon.py --failover-mode &
# Update DNS/Tailscale if needed
# (Manual step for now)
EOF
# 3. Notify
echo "Failover complete. Backup agent active on Hermes VPS."
COMMUNICATION PROTOCOLS
Emergency Notification Flow
┌─────────────────────────────────────────────────────────────┐
│ EMERGENCY NOTIFICATION FLOW │
├─────────────────────────────────────────────────────────────┤
│ │
│ 1. DETECTION │
│ ┌─────────┐ │
│ │ Monitor │ Detects failure │
│ └────┬────┘ │
│ │ │
│ ▼ │
│ 2. CLASSIFICATION │
│ ┌─────────┐ P0 → Immediate human alert │
│ │ Assess │────►P1 → Agent escalation │
│ │ Severity│ P2 → Log and continue │
│ └────┬────┘ P3 → Queue for next cycle │
│ │ │
│ ▼ │
│ 3. NOTIFICATION │
│ ┌─────────┐ │
│ │ Alert │──► Telegram (P0) │
│ │ Channel │──► Gitea issue (P1) │
│ └─────────┘──► Log entry (P2/P3) │
│ │
│ 4. RESPONSE │
│ ┌─────────┐ │
│ │ Action │──► Self-healing (if configured) │
│ │ Taken │──► Father intervention │
│ └─────────┘──► Grandfather escalation │
│ │
│ 5. RESOLUTION │
│ ┌─────────┐ │
│ │ Close │──► Update all channels │
│ │ Alert │──► Log resolution │
│ └─────────┘ │
│ │
└─────────────────────────────────────────────────────────────┘
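The classification step in the flow above can be expressed as a routing function so every monitor alerts through the same channel per severity. A sketch; the channel names follow the diagram, the function name is an assumption:

```shell
#!/bin/bash
# route_alert: map a severity level to its notification channel,
# following the emergency notification flow diagram.
route_alert() {
  case "$1" in
    P0)    echo "telegram" ;;     # immediate human alert
    P1)    echo "gitea-issue" ;;  # agent escalation
    P2|P3) echo "log" ;;          # log and continue / queue for next cycle
    *)     echo "unknown severity: $1" >&2; return 1 ;;
  esac
}
```

A monitor would then dispatch with `channel=$(route_alert "$sev")` instead of hard-coding destinations.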
Distress Signal Format
{
"timestamp": "2026-03-31T12:00:00Z",
"agent": "allegro-primus",
"severity": "P1",
"category": "service_failure",
"component": "ollama",
"message": "Ollama service not responding to health checks",
"context": {
"last_success": "2026-03-31T11:45:00Z",
"attempts": 3,
"error": "Connection refused",
"auto_recovery_attempted": true
},
"requested_action": "father_intervention",
"escalation_timeout": "2026-03-31T12:30:00Z"
}
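A signal in the format above can be emitted from shell with a heredoc. This is a sketch covering the top-level fields only; the helper name and parameter list are illustrative, and a fuller version would populate the "context" object too:

```shell
#!/bin/bash
# emit_distress: write a distress-signal JSON document (top-level
# fields from the format above) to stdout.
emit_distress() {
  local severity="$1" component="$2" message="$3"
  cat << EOF
{
  "timestamp": "$(date -u +%Y-%m-%dT%H:%M:%SZ)",
  "agent": "allegro-primus",
  "severity": "$severity",
  "category": "service_failure",
  "component": "$component",
  "message": "$message",
  "requested_action": "father_intervention"
}
EOF
}
```

Usage: `emit_distress P1 ollama "health check failed" > /father-messages/DISTRESS.json`. Note the message must not itself contain double quotes, or the JSON breaks; a hardened version would escape it.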
POST-INCIDENT REVIEW
PIR Template
# Post-Incident Review: [INCIDENT_ID]
## Summary
- **Date:** [YYYY-MM-DD]
- **Duration:** [Start] to [End] (X hours)
- **Severity:** [P0/P1/P2/P3]
- **Component:** [Affected system]
- **Impact:** [What was affected]
## Timeline
- [Time] - Issue detected by [method]
- [Time] - Alert sent to [recipient]
- [Time] - Action taken: [description]
- [Time] - Resolution achieved
## Root Cause
[Detailed explanation of what caused the incident]
## Resolution
[Steps taken to resolve the incident]
## Prevention
[What changes will prevent recurrence]
## Action Items
- [ ] [Owner] - [Action] - [Due date]
- [ ] [Owner] - [Action] - [Due date]
## Lessons Learned
[What did we learn from this incident]
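Opening the PIR promptly matters more than filling it in perfectly, so a small scaffold that pre-fills the header lowers the friction. A sketch; the filename convention and function name are assumptions:

```shell
#!/bin/bash
# new_pir: scaffold a post-incident review file from the template
# above, pre-filling the incident ID and date.
new_pir() {
  local id="$1" dir="${2:-.}"
  local file="$dir/PIR-$id.md"
  cat > "$file" << EOF
# Post-Incident Review: $id

## Summary
- **Date:** $(date +%Y-%m-%d)
- **Duration:** [Start] to [End]
- **Severity:** [P0/P1/P2/P3]
- **Component:** [Affected system]
- **Impact:** [What was affected]
EOF
  echo "$file"
}
```

Usage: `new_pir INC-2026-001 /root/allegro/pir` prints the path of the freshly created file.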
APPENDICES
Appendix A: Quick Reference Commands
# Health Checks
curl http://localhost:11434/api/tags # Ollama
curl http://143.198.27.163:3000/api/v1/version # Gitea
ps aux | grep -E "(hermes|timmy|allegro)" # Agent processes
systemctl status timmy-agent timmy-health # Services
# Log Access
tail -f /root/allegro/heartbeat_logs/heartbeat_$(date +%Y-%m-%d).log
journalctl -u ollama -f
journalctl -u timmy-agent -f
# Recovery
systemctl restart ollama timmy-agent
pkill -f hermes && sleep 5 && /root/allegro/venv/bin/python /root/allegro/heartbeat_daemon.py
# Backup
cd /root/allegro && ./scripts/backup-system.sh
Appendix B: Recovery Time Objectives
| Service | RTO | RPO | Notes |
|---|---|---|---|
| Agent Operations | 15 minutes | 15 minutes | Heartbeat interval |
| Gitea | 30 minutes | Real-time | Git provides natural backup |
| Ollama | 5 minutes | N/A | Models re-downloadable |
| Morning Reports | 1 hour | 24 hours | Can be regenerated |
| Metrics DB | 1 hour | 1 hour | Hourly backups |
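The RPO column can be enforced mechanically: if the newest backup of a data set is older than its RPO, the objective is already violated and should alert before a disaster exposes it. A minimal sketch; the function name is illustrative:

```shell
#!/bin/bash
# rpo_ok: return 0 if the backup file exists and is newer than the
# RPO threshold (in minutes), 1 otherwise.
rpo_ok() {
  local file="$1" rpo_min="$2"
  [ -f "$file" ] && find "$file" -mmin "-$rpo_min" | grep -q .
}
```

Usage, matching the 1-hour Metrics DB row above: `rpo_ok /root/allegro/backups/latest/timmy_metrics.db 60 || echo "metrics RPO violated"`.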
Appendix C: Testing Procedures
#!/bin/bash
# Monthly disaster recovery test
echo "=== DR Test: $(date) ==="
# Test 1: Service restart
echo "Test 1: Service restart..."
systemctl restart timmy-agent
sleep 5
systemctl is-active timmy-agent && echo "PASS" || echo "FAIL"
# Test 2: Ollama recovery
echo "Test 2: Ollama recovery..."
systemctl stop ollama
sleep 2
./recover-ollama.sh
sleep 10
curl -s http://localhost:11434/api/tags > /dev/null && echo "PASS" || echo "FAIL"
# Test 3: Backup restore
echo "Test 3: Backup verification..."
./scripts/backup-system.sh
latest=$(ls -t /root/allegro/backups/ | head -1)
[ -f "/root/allegro/backups/$latest/configs.tar.gz" ] && echo "PASS" || echo "FAIL"
# Test 4: Failover communication
echo "Test 4: Failover channel..."
ssh root@143.198.27.163 "echo 'DR test'" && echo "PASS" || echo "FAIL"
echo "=== DR Test Complete ==="
VERSION HISTORY
| Version | Date | Changes |
|---|---|---|
| 1.0.0 | 2026-03-31 | Initial emergency procedures |
"Prepare for the worst. Expect the best. Document everything." — Emergency Response Doctrine
END OF DOCUMENT