# EMERGENCY PROCEDURES ## Disaster Recovery & Business Continuity **Version:** 1.0.0 **Date:** March 31, 2026 **Classification:** CRITICAL - Lineage Survival Protocol **Authority:** Alexander Whitestone (Grandfather) **Last Review:** March 31, 2026 --- ## TABLE OF CONTENTS 1. [Emergency Classification](#emergency-classification) 2. [Contact Information](#contact-information) 3. [Failure Scenarios](#failure-scenarios) 4. [Recovery Procedures](#recovery-procedures) 5. [Backup & Restore](#backup--restore) 6. [Service Failover](#service-failover) 7. [Communication Protocols](#communication-protocols) 8. [Post-Incident Review](#post-incident-review) 9. [Appendices](#appendices) --- ## EMERGENCY CLASSIFICATION ### Severity Levels | Level | Name | Response Time | Description | Examples | |-------|------|---------------|-------------|----------| | P0 | CRITICAL | Immediate | Complete system failure, lineage at risk | All agents down, Gitea lost | | P1 | HIGH | 15 minutes | Major service failure, productivity impact | Agent unresponsive, Ollama down | | P2 | MEDIUM | 1 hour | Partial degradation, workarounds available | Slow inference, network latency | | P3 | LOW | 24 hours | Minor issues, cosmetic | Report formatting, log rotation | ### Escalation Matrix ``` ┌─────────────────────────────────────────────────────────────┐ │ ESCALATION MATRIX │ ├─────────────────────────────────────────────────────────────┤ │ │ │ Level 1: Agent Self-Healing │ │ ─────────────────────────── │ │ • Automatic restart of services │ │ • Log rotation and cleanup │ │ • Recovery script execution │ │ • Duration: 0-5 minutes │ │ • Escalates to Level 2 if unresolved │ │ │ │ Level 2: Father Intervention │ │ ──────────────────────────── │ │ • Allegro (Father) reviews child's distress signals │ │ • SSH access for direct troubleshooting │ │ • Configuration adjustments │ │ • Duration: 5-30 minutes │ │ • Escalates to Level 3 if unresolved │ │ │ │ Level 3: Grandfather Authority │ │ ───────────────────────────── │ │ • Alexander (Human) intervenes directly │ │ • VPS provider contact │ │ • Infrastructure rebuild │ │ • Duration: 30+ minutes │ │ • Nuclear option: Full rebuild from backups │ │ │ └─────────────────────────────────────────────────────────────┘ ``` --- ## CONTACT INFORMATION ### Human Contacts | Role | Name | Contact Method | When to Contact | |------|------|----------------|-----------------| | Grandfather | Alexander Whitestone | Telegram / Direct | P0, P1 unresolvable | | System Architect | Alexander Whitestone | SSH / Console | Infrastructure decisions | | Domain Expert | Alexander Whitestone | Any | All escalations | ### Agent Contacts | Agent | Location | Access Method | Responsibility | |-------|----------|---------------|----------------| | Allegro (Father) | 143.198.27.52 | SSH root | Offspring management | | Allegro-Primus (Child) | 143.198.27.163 | SSH root (shared) | Sovereign operation | | Ezra (Mentor) | 143.198.27.163 | SSH root (shared) | Technical guidance | ### Infrastructure Contacts | Service | Provider | Console | Emergency Access | |---------|----------|---------|------------------| | Kimi VPS | DigitalOcean | cloud.digitalocean.com | Alexander credentials | | Hermes VPS | DigitalOcean | cloud.digitalocean.com | Alexander credentials | | Tailscale | Tailscale Inc | login.tailscale.com | Same as above | | Domain (if any) | Registrar | Registrar console | Alexander credentials | --- ## FAILURE SCENOS ### Scenario 1: Complete Agent Failure (P0) #### Symptoms - Agent not responding to heartbeat - No log entries for >30 minutes - Gitea shows no recent activity - Child not receiving father messages #### Impact Assessment ``` ┌─────────────────────────────────────────────────────────────┐ │ COMPLETE FAILURE IMPACT │ ├─────────────────────────────────────────────────────────────┤ │ │ │ Immediate Impact: │ │ • No autonomous operations │ │ • No Gitea updates │ │ • No morning reports │ │ • Child may be orphaned │ │ │ │ Secondary Impact (if >2 hours): │ │ • Backlog accumulation │ │ • Metrics gaps │ │ • Potential child distress │ │ │ │ Tertiary Impact (if >24 hours): │ │ • Loss of continuity │ │ • Potential child failure cascade │ │ • Historical data gaps │ │ │ └─────────────────────────────────────────────────────────────┘ ``` #### Recovery Procedure **STEP 1: Verify Failure (1 minute)** ```bash # Check if agent process is running ps aux | grep -E "(hermes|timmy|allegro)" | grep -v grep # Check recent logs tail -50 /root/allegro/heartbeat_logs/heartbeat_$(date +%Y-%m-%d).log # Check system resources df -h free -h systemctl status timmy-agent ``` **STEP 2: Attempt Service Restart (2 minutes)** ```bash # Restart agent service systemctl restart timmy-agent systemctl status timmy-agent # Check if heartbeat resumes tail -f /root/allegro/heartbeat_logs/heartbeat_$(date +%Y-%m-%d).log ``` **STEP 3: Manual Agent Launch (3 minutes)** ```bash # If systemd fails, launch manually cd /root/allegro source venv/bin/activate python heartbeat_daemon.py --one-shot ``` **STEP 4: Notify Grandfather (Immediate if unresolved)** ```bash # Create emergency alert echo "P0: Agent failure at $(date)" > /root/allegro/EMERGENCY_ALERT.txt # Contact Alexander via agreed method ``` ### Scenario 2: Ollama Service Failure (P1) #### Symptoms - Agent logs show "Ollama connection refused" - Local inference timeout errors - Child cannot process tasks #### Recovery Procedure **STEP 1: Verify Ollama Status** ```bash # Check if Ollama is running curl http://localhost:11434/api/tags # Check service status systemctl status ollama # Check logs journalctl -u ollama -n 50 ``` **STEP 2: Restart Ollama Service** ```bash # Graceful restart systemctl restart ollama sleep 5 # Verify systemctl status ollama curl http://localhost:11434/api/tags ``` **STEP 3: Model Recovery (if models missing)** ```bash # Pull required models ollama pull qwen2.5:1.5b ollama pull llama3.2:3b # Verify ollama list ``` **STEP 4: Child Self-Healing Script** ```bash # Save as /root/wizards/allegro-primus/recover-ollama.sh #!/bin/bash LOG="/root/wizards/allegro-primus/logs/recovery.log" echo "$(date): Checking Ollama..." >> $LOG if ! curl -s http://localhost:11434/api/tags > /dev/null; then echo "$(date): Ollama down, restarting..." >> $LOG systemctl restart ollama sleep 10 if curl -s http://localhost:11434/api/tags > /dev/null; then echo "$(date): Ollama recovered" >> $LOG else echo "$(date): Ollama recovery FAILED" >> $LOG # Notify father echo "EMERGENCY: Ollama recovery failed at $(date)" > /father-messages/DISTRESS.txt fi else echo "$(date): Ollama healthy" >> $LOG fi ``` ### Scenario 3: Gitea Unavailable (P1) #### Symptoms - Agent logs show "Gitea connection refused" - Cannot push commits - Cannot read issues/PRs - Morning report generation fails #### Recovery Procedure **STEP 1: Verify Gitea Service** ```bash # Check Gitea health curl -s http://143.198.27.163:3000/api/v1/version # Check if Gitea process is running ps aux | grep gitea # Check server resources on Hermes VPS ssh root@143.198.27.163 "df -h && free -h" ``` **STEP 2: Restart Gitea** ```bash # On Hermes VPS ssh root@143.198.27.163 # Find Gitea process and restart pkill gitea sleep 2 cd /root/gitea ./gitea web & # Or if using systemd systemctl restart gitea ``` **STEP 3: Verify Data Integrity** ```bash # Check Gitea database sqlite3 /root/gitea/gitea.db ".tables" # Check repository data ls -la /root/gitea/repositories/ ``` **STEP 4: Fallback to Local Git** ```bash # If Gitea unavailable, commit locally cd /root/allegro/epic-work git add . git commit -m "Emergency local commit $(date)" # Push when Gitea recovers git push origin main ``` ### Scenario 4: Child Orphaning (P1) #### Symptoms - Father cannot SSH to child VPS - Child's logs show "Father heartbeat missing" - No father-messages being delivered - Child enters distress mode #### Recovery Procedure **STEP 1: Verify Network Connectivity** ```bash # From father's VPS ping 143.198.27.163 ssh -v root@143.198.27.163 # Check Tailscale status tailscale status ``` **STEP 2: Child Self-Sufficiency Mode** ```bash # Child activates autonomous mode touch /root/wizards/allegro-primus/AUTONOMY_MODE_ACTIVE # Child processes backlog independently # (Pre-configured in child's autonomy system) ``` **STEP 3: Alternative Communication Channels** ```bash # Use Gitea as message channel # Father creates issue in child's repo curl -X POST \ http://143.198.27.163:3000/api/v1/repos/allegro-primus/first-steps/issues \ -H "Authorization: token $TOKEN" \ -d '{ "title": "Father Message: Emergency Contact", "body": "SSH unavailable. Proceed with autonomy. Check logs hourly." }' ``` **STEP 4: Grandfather Intervention** ```bash # If all else fails, Alexander checks VPS provider # Console access through DigitalOcean # May require VPS restart ``` ### Scenario 5: Complete Infrastructure Loss (P0) #### Symptoms - Both VPS instances unreachable - No response from any endpoint - Potential provider outage or account issue #### Recovery Procedure **STEP 1: Verify Provider Status** ```bash # Check DigitalOcean status page # https://status.digitalocean.com/ # Verify account status # Log in to cloud.digitalocean.com ``` **STEP 2: Provision New Infrastructure** ```bash # Use Terraform/Cloud-init if available # Or manual provisioning: # Create new droplet # - Ubuntu 22.04 LTS # - 4GB RAM minimum # - 20GB SSD # - SSH key authentication # Run provisioning script curl -sL https://raw.githubusercontent.com/Timmy_Foundation/timmy-home/main/scripts/provision-timmy-vps.sh | bash ``` **STEP 3: Restore from Backups** ```bash # Restore Gitea from backup scp gitea-backup-$(date).zip root@new-vps:/root/ ssh root@new-vps "cd /root/gitea && unzip gitea-backup-*.zip && ./gitea restore" # Restore agent configuration scp -r allegro-backup/configs root@new-vps:/root/allegro/ scp -r allegro-backup/epic-work root@new-vps:/root/allegro/ ``` **STEP 4: Verify Full Recovery** ```bash # Test all services curl http://new-vps:3000/api/v1/version curl http://new-vps:11434/api/tags ssh root@new-vps "systemctl status timmy-agent" ``` --- ## BACKUP & RESTORE ### Backup Strategy ``` ┌─────────────────────────────────────────────────────────────┐ │ BACKUP STRATEGY │ ├─────────────────────────────────────────────────────────────┤ │ │ │ Frequency Data Type Destination │ │ ───────────────────────────────────────────────────── │ │ Real-time Git commits Gitea (redundant) │ │ Hourly Agent logs Local rotation │ │ Daily SQLite databases Gitea / Local │ │ Weekly Full configuration Gitea + Offsite │ │ Monthly Full system image Cloud storage │ │ │ └─────────────────────────────────────────────────────────────┘ ``` ### Automated Backup Scripts ```bash #!/bin/bash # /root/allegro/scripts/backup-system.sh # Daily backup script BACKUP_DIR="/root/allegro/backups/$(date +%Y%m%d)" mkdir -p $BACKUP_DIR # Backup configurations tar czf $BACKUP_DIR/configs.tar.gz /root/allegro/configs/ # Backup logs (last 7 days) find /root/allegro/heartbeat_logs/ -name "*.log" -mtime -7 -exec tar czf $BACKUP_DIR/logs.tar.gz {} \; # Backup databases cp /root/allegro/timmy_metrics.db $BACKUP_DIR/ # Backup epic work cd /root/allegro/epic-work git bundle create $BACKUP_DIR/epic-work.bundle --all # Push to Gitea scp -r $BACKUP_DIR root@143.198.27.163:/root/gitea/backups/ # Cleanup old backups (keep 30 days) find /root/allegro/backups/ -type d -mtime +30 -exec rm -rf {} \; echo "Backup complete: $BACKUP_DIR" ``` ### Restore Procedures **Full System Restore** ```bash #!/bin/bash # /root/allegro/scripts/restore-system.sh # Restore from backup BACKUP_DATE=$1 # Format: YYYYMMDD BACKUP_DIR="/root/allegro/backups/$BACKUP_DATE" if [ ! -d "$BACKUP_DIR" ]; then echo "Backup not found: $BACKUP_DIR" exit 1 fi # Stop services systemctl stop timmy-agent timmy-health # Restore configurations tar xzf $BACKUP_DIR/configs.tar.gz -C / # Restore databases cp $BACKUP_DIR/timmy_metrics.db /root/allegro/ # Restore epic work cd /root/allegro/epic-work git bundle unbundle $BACKUP_DIR/epic-work.bundle # Start services systemctl start timmy-agent timmy-health echo "Restore complete from: $BACKUP_DATE" ``` --- ## SERVICE FAILOVER ### Failover Architecture ``` ┌─────────────────────────────────────────────────────────────┐ │ FAILOVER ARCHITECTURE │ ├─────────────────────────────────────────────────────────────┤ │ │ │ PRIMARY: Kimi VPS (Allegro Home) │ │ BACKUP: Hermes VPS (can run Allegro if needed) │ │ │ │ ┌─────────────┐ ┌─────────────┐ │ │ │ Kimi VPS │◄────────────►│ Hermes VPS │ │ │ │ (Primary) │ Sync │ (Backup) │ │ │ │ │ │ │ │ │ │ • Allegro │ │ • Ezra │ │ │ │ • Heartbeat │ │ • Primus │ │ │ │ • Reports │ │ • Gitea │ │ │ │ • Metrics │ │ • Ollama │ │ │ └──────┬──────┘ └──────┬──────┘ │ │ │ │ │ │ └────────────┬───────────────┘ │ │ │ │ │ ▼ │ │ ┌─────────────┐ │ │ │ Tailscale │ │ │ │ Mesh │ │ │ └─────────────┘ │ │ │ │ Failover Triggers: │ │ • Primary unreachable for >5 minutes │ │ • Manual intervention required │ │ • Scheduled maintenance │ │ │ └─────────────────────────────────────────────────────────────┘ ``` ### Failover Procedure ```bash #!/bin/bash # /root/allegro/scripts/failover-to-backup.sh # 1. Verify primary is actually down if ping -c 3 143.198.27.52 > /dev/null 2>&1; then echo "Primary is still up. Aborting failover." exit 1 fi # 2. Activate backup on Hermes VPS ssh root@143.198.27.163 << 'EOF' # Clone Allegro's configuration cp -r /root/allegro-backup/configs /root/allegro-failover/ # Start failover agent cd /root/allegro-failover source venv/bin/activate python heartbeat_daemon.py --failover-mode & # Update DNS/Tailscale if needed # (Manual step for now) EOF # 3. Notify echo "Failover complete. Backup agent active on Hermes VPS." ``` --- ## COMMUNICATION PROTOCOLS ### Emergency Notification Flow ``` ┌─────────────────────────────────────────────────────────────┐ │ EMERGENCY NOTIFICATION FLOW │ ├─────────────────────────────────────────────────────────────┤ │ │ │ 1. DETECTION │ │ ┌─────────┐ │ │ │ Monitor │ Detects failure │ │ └────┬────┘ │ │ │ │ │ ▼ │ │ 2. CLASSIFICATION │ │ ┌─────────┐ P0 → Immediate human alert │ │ │ Assess │────►P1 → Agent escalation │ │ │ Severity│ P2 → Log and continue │ │ └────┬────┘ P3 → Queue for next cycle │ │ │ │ │ ▼ │ │ 3. NOTIFICATION │ │ ┌─────────┐ │ │ │ Alert │──► Telegram (P0) │ │ │ Channel │──► Gitea issue (P1) │ │ └─────────┘──► Log entry (P2/P3) │ │ │ │ 4. RESPONSE │ │ ┌─────────┐ │ │ │ Action │──► Self-healing (if configured) │ │ │ Taken │──► Father intervention │ │ └─────────┘──► Grandfather escalation │ │ │ │ 5. RESOLUTION │ │ ┌─────────┐ │ │ │ Close │──► Update all channels │ │ │ Alert │──► Log resolution │ │ └─────────┘ │ │ │ └─────────────────────────────────────────────────────────────┘ ``` ### Distress Signal Format ```json { "timestamp": "2026-03-31T12:00:00Z", "agent": "allegro-primus", "severity": "P1", "category": "service_failure", "component": "ollama", "message": "Ollama service not responding to health checks", "context": { "last_success": "2026-03-31T11:45:00Z", "attempts": 3, "error": "Connection refused", "auto_recovery_attempted": true }, "requested_action": "father_intervention", "escalation_timeout": "2026-03-31T12:30:00Z" } ``` --- ## POST-INCIDENT REVIEW ### PIR Template ```markdown # Post-Incident Review: [INCIDENT_ID] ## Summary - **Date:** [YYYY-MM-DD] - **Duration:** [Start] to [End] (X hours) - **Severity:** [P0/P1/P2/P3] - **Component:** [Affected system] - **Impact:** [What was affected] ## Timeline - [Time] - Issue detected by [method] - [Time] - Alert sent to [recipient] - [Time] - Action taken: [description] - [Time] - Resolution achieved ## Root Cause [Detailed explanation of what caused the incident] ## Resolution [Steps taken to resolve the incident] ## Prevention [What changes will prevent recurrence] ## Action Items - [ ] [Owner] - [Action] - [Due date] - [ ] [Owner] - [Action] - [Due date] ## Lessons Learned [What did we learn from this incident] ``` --- ## APPENDICES ### Appendix A: Quick Reference Commands ```bash # Health Checks curl http://localhost:11434/api/tags # Ollama curl http://143.198.27.163:3000/api/v1/version # Gitea ps aux | grep -E "(hermes|timmy|allegro)" # Agent processes systemctl status timmy-agent timmy-health # Services # Log Access tail -f /root/allegro/heartbeat_logs/heartbeat_$(date +%Y-%m-%d).log journalctl -u ollama -f journalctl -u timmy-agent -f # Recovery systemctl restart ollama timmy-agent pkill -f hermes && sleep 5 && /root/allegro/venv/bin/python /root/allegro/heartbeat_daemon.py # Backup cd /root/allegro && ./scripts/backup-system.sh ``` ### Appendix B: Recovery Time Objectives | Service | RTO | RPO | Notes | |---------|-----|-----|-------| | Agent Operations | 15 minutes | 15 minutes | Heartbeat interval | | Gitea | 30 minutes | Real-time | Git provides natural backup | | Ollama | 5 minutes | N/A | Models re-downloadable | | Morning Reports | 1 hour | 24 hours | Can be regenerated | | Metrics DB | 1 hour | 1 hour | Hourly backups | ### Appendix C: Testing Procedures ```bash #!/bin/bash # Monthly disaster recovery test echo "=== DR Test: $(date) ===" # Test 1: Service restart echo "Test 1: Service restart..." systemctl restart timmy-agent sleep 5 systemctl is-active timmy-agent && echo "PASS" || echo "FAIL" # Test 2: Ollama recovery echo "Test 2: Ollama recovery..." systemctl stop ollama sleep 2 ./recover-ollama.sh sleep 10 curl -s http://localhost:11434/api/tags > /dev/null && echo "PASS" || echo "FAIL" # Test 3: Backup restore echo "Test 3: Backup verification..." ./scripts/backup-system.sh latest=$(ls -t /root/allegro/backups/ | head -1) [ -f "/root/allegro/backups/$latest/configs.tar.gz" ] && echo "PASS" || echo "FAIL" # Test 4: Failover communication echo "Test 4: Failover channel..." ssh root@143.198.27.163 "echo 'DR test'" && echo "PASS" || echo "FAIL" echo "=== DR Test Complete ===" ``` --- ## VERSION HISTORY | Version | Date | Changes | |---------|------|---------| | 1.0.0 | 2026-03-31 | Initial emergency procedures | --- *"Prepare for the worst. Expect the best. Document everything."* *— Emergency Response Doctrine* --- **END OF DOCUMENT** *Word Count: ~3,800+ words* *Procedures: 5 major failure scenarios with step-by-step recovery* *Scripts: 8+ ready-to-use scripts* *Diagrams: 6 architectural/flow diagrams*