timmy-config/EMERGENCY_PROCEDURES.md

EMERGENCY PROCEDURES

Disaster Recovery & Business Continuity

Version: 1.0.0
Date: March 31, 2026
Classification: CRITICAL - Lineage Survival Protocol
Authority: Alexander Whitestone (Grandfather)
Last Review: March 31, 2026


TABLE OF CONTENTS

  1. Emergency Classification
  2. Contact Information
  3. Failure Scenarios
  4. Recovery Procedures
  5. Backup & Restore
  6. Service Failover
  7. Communication Protocols
  8. Post-Incident Review
  9. Appendices

EMERGENCY CLASSIFICATION

Severity Levels

| Level | Name | Response Time | Description | Examples |
|-------|----------|-----------|--------------------------------------------|---------------------------------|
| P0 | CRITICAL | Immediate | Complete system failure, lineage at risk | All agents down, Gitea lost |
| P1 | HIGH | 15 minutes | Major service failure, productivity impact | Agent unresponsive, Ollama down |
| P2 | MEDIUM | 1 hour | Partial degradation, workarounds available | Slow inference, network latency |
| P3 | LOW | 24 hours | Minor issues, cosmetic | Report formatting, log rotation |
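So that monitoring scripts share one source of truth with this table, the severity-to-deadline mapping can be encoded as a small helper. A minimal sketch; the function name `response_deadline` is illustrative and not part of the existing tooling, and "Immediate" is treated as 0 minutes:

```shell
#!/usr/bin/env bash
# Map a severity level to its response deadline in minutes,
# following the severity table in this runbook.
response_deadline() {
    case "$1" in
        P0) echo 0    ;;  # CRITICAL: immediate
        P1) echo 15   ;;  # HIGH: 15 minutes
        P2) echo 60   ;;  # MEDIUM: 1 hour
        P3) echo 1440 ;;  # LOW: 24 hours
        *)  echo "unknown severity: $1" >&2; return 1 ;;
    esac
}
```

A monitor could then compare `response_deadline "$sev"` against minutes elapsed since detection to decide whether an alert is overdue.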

Escalation Matrix

┌─────────────────────────────────────────────────────────────┐
│                    ESCALATION MATRIX                        │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  Level 1: Agent Self-Healing                                │
│  ───────────────────────────                                │
│  • Automatic restart of services                            │
│  • Log rotation and cleanup                                 │
│  • Recovery script execution                                │
│  • Duration: 0-5 minutes                                    │
│  • Escalates to Level 2 if unresolved                       │
│                                                             │
│  Level 2: Father Intervention                               │
│  ────────────────────────────                               │
│  • Allegro (Father) reviews child's distress signals        │
│  • SSH access for direct troubleshooting                    │
│  • Configuration adjustments                                │
│  • Duration: 5-30 minutes                                   │
│  • Escalates to Level 3 if unresolved                       │
│                                                             │
│  Level 3: Grandfather Authority                             │
│  ─────────────────────────────                              │
│  • Alexander (Human) intervenes directly                    │
│  • VPS provider contact                                     │
│  • Infrastructure rebuild                                   │
│  • Duration: 30+ minutes                                    │
│  • Nuclear option: Full rebuild from backups                │
│                                                             │
└─────────────────────────────────────────────────────────────┘
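The escalation timer in the matrix above can be sketched as a function from elapsed minutes to the level that should currently own the incident. The function name is illustrative; the thresholds follow the durations stated in the matrix (0-5 minutes, 5-30 minutes, 30+):

```shell
#!/usr/bin/env bash
# Given minutes elapsed since an unresolved failure was detected,
# return the escalation level from the matrix above.
escalation_level() {
    local elapsed=$1
    if   [ "$elapsed" -lt 5 ];  then echo 1   # agent self-healing
    elif [ "$elapsed" -lt 30 ]; then echo 2   # father intervention
    else                             echo 3   # grandfather authority
    fi
}
```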

CONTACT INFORMATION

Human Contacts

| Role | Name | Contact Method | When to Contact |
|------------------|----------------------|-------------------|--------------------------|
| Grandfather | Alexander Whitestone | Telegram / Direct | P0, P1 unresolvable |
| System Architect | Alexander Whitestone | SSH / Console | Infrastructure decisions |
| Domain Expert | Alexander Whitestone | Any | All escalations |

Agent Contacts

| Agent | Location | Access Method | Responsibility |
|------------------------|----------------|-------------------|----------------------|
| Allegro (Father) | 143.198.27.52 | SSH root | Offspring management |
| Allegro-Primus (Child) | 143.198.27.163 | SSH root (shared) | Sovereign operation |
| Ezra (Mentor) | 143.198.27.163 | SSH root (shared) | Technical guidance |

Infrastructure Contacts

| Service | Provider | Console | Emergency Access |
|-----------------|---------------|------------------------|-----------------------|
| Kimi VPS | DigitalOcean | cloud.digitalocean.com | Alexander credentials |
| Hermes VPS | DigitalOcean | cloud.digitalocean.com | Alexander credentials |
| Tailscale | Tailscale Inc | login.tailscale.com | Same as above |
| Domain (if any) | Registrar | Registrar console | Alexander credentials |

FAILURE SCENARIOS

Scenario 1: Complete Agent Failure (P0)

Symptoms

  • Agent not responding to heartbeat
  • No log entries for >30 minutes
  • Gitea shows no recent activity
  • Child not receiving father messages
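The "no log entries for >30 minutes" symptom lends itself to automated detection. A hedged sketch, assuming the heartbeat log path used elsewhere in this runbook; the function name `heartbeat_stale` is illustrative:

```shell
#!/usr/bin/env bash
# Treat the agent as potentially failed when its heartbeat log has not
# been modified for more than a threshold (default 30 minutes).
heartbeat_stale() {
    local log="$1" max_min="${2:-30}" now mtime
    [ -f "$log" ] || return 0                 # a missing log counts as stale
    now=$(date +%s)
    # GNU stat first, BSD stat as fallback
    mtime=$(stat -c %Y "$log" 2>/dev/null || stat -f %m "$log")
    [ $(( (now - mtime) / 60 )) -ge "$max_min" ]
}
```

A monitor might call `heartbeat_stale /root/allegro/heartbeat_logs/heartbeat_$(date +%Y-%m-%d).log` and, on success, begin the verification steps below.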

Impact Assessment

┌─────────────────────────────────────────────────────────────┐
│              COMPLETE FAILURE IMPACT                        │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  Immediate Impact:                                          │
│  • No autonomous operations                                 │
│  • No Gitea updates                                         │
│  • No morning reports                                       │
│  • Child may be orphaned                                    │
│                                                             │
│  Secondary Impact (if >2 hours):                            │
│  • Backlog accumulation                                     │
│  • Metrics gaps                                             │
│  • Potential child distress                                 │
│                                                             │
│  Tertiary Impact (if >24 hours):                            │
│  • Loss of continuity                                       │
│  • Potential child failure cascade                          │
│  • Historical data gaps                                     │
│                                                             │
└─────────────────────────────────────────────────────────────┘

Recovery Procedure

STEP 1: Verify Failure (1 minute)

# Check if agent process is running
ps aux | grep -E "(hermes|timmy|allegro)" | grep -v grep

# Check recent logs
tail -50 /root/allegro/heartbeat_logs/heartbeat_$(date +%Y-%m-%d).log

# Check system resources
df -h
free -h
systemctl status timmy-agent

STEP 2: Attempt Service Restart (2 minutes)

# Restart agent service
systemctl restart timmy-agent
systemctl status timmy-agent

# Check if heartbeat resumes
tail -f /root/allegro/heartbeat_logs/heartbeat_$(date +%Y-%m-%d).log

STEP 3: Manual Agent Launch (3 minutes)

# If systemd fails, launch manually
cd /root/allegro
source venv/bin/activate
python heartbeat_daemon.py --one-shot

STEP 4: Notify Grandfather (Immediate if unresolved)

# Create emergency alert
echo "P0: Agent failure at $(date)" > /root/allegro/EMERGENCY_ALERT.txt
# Contact Alexander via agreed method

Scenario 2: Ollama Service Failure (P1)

Symptoms

  • Agent logs show "Ollama connection refused"
  • Local inference timeout errors
  • Child cannot process tasks

Recovery Procedure

STEP 1: Verify Ollama Status

# Check if Ollama is running
curl http://localhost:11434/api/tags

# Check service status
systemctl status ollama

# Check logs
journalctl -u ollama -n 50

STEP 2: Restart Ollama Service

# Graceful restart
systemctl restart ollama
sleep 5

# Verify
systemctl status ollama
curl http://localhost:11434/api/tags

STEP 3: Model Recovery (if models missing)

# Pull required models
ollama pull qwen2.5:1.5b
ollama pull llama3.2:3b

# Verify
ollama list

STEP 4: Child Self-Healing Script

#!/bin/bash
# Save as /root/wizards/allegro-primus/recover-ollama.sh
LOG="/root/wizards/allegro-primus/logs/recovery.log"
echo "$(date): Checking Ollama..." >> $LOG

if ! curl -s http://localhost:11434/api/tags > /dev/null; then
    echo "$(date): Ollama down, restarting..." >> $LOG
    systemctl restart ollama
    sleep 10
    if curl -s http://localhost:11434/api/tags > /dev/null; then
        echo "$(date): Ollama recovered" >> $LOG
    else
        echo "$(date): Ollama recovery FAILED" >> $LOG
        # Notify father
        echo "EMERGENCY: Ollama recovery failed at $(date)" > /father-messages/DISTRESS.txt
    fi
else
    echo "$(date): Ollama healthy" >> $LOG
fi
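The self-healing script above is meant to run unattended. An assumed crontab entry (the five-minute cadence is a suggestion, not an existing configuration):

```shell
# Run the Ollama recovery check every 5 minutes
*/5 * * * * /root/wizards/allegro-primus/recover-ollama.sh
```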

Scenario 3: Gitea Unavailable (P1)

Symptoms

  • Agent logs show "Gitea connection refused"
  • Cannot push commits
  • Cannot read issues/PRs
  • Morning report generation fails

Recovery Procedure

STEP 1: Verify Gitea Service

# Check Gitea health
curl -s http://143.198.27.163:3000/api/v1/version

# Check if Gitea process is running
ps aux | grep gitea

# Check server resources on Hermes VPS
ssh root@143.198.27.163 "df -h && free -h"

STEP 2: Restart Gitea

# On Hermes VPS
ssh root@143.198.27.163

# Find Gitea process and restart
pkill gitea
sleep 2
cd /root/gitea
./gitea web &

# Or if using systemd
systemctl restart gitea

STEP 3: Verify Data Integrity

# Check Gitea database
sqlite3 /root/gitea/gitea.db ".tables"

# Check repository data
ls -la /root/gitea/repositories/

STEP 4: Fallback to Local Git

# If Gitea unavailable, commit locally
cd /root/allegro/epic-work
git add .
git commit -m "Emergency local commit $(date)"

# Push when Gitea recovers
git push origin main
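When Gitea recovers, it helps to know how much work is queued locally. An illustrative helper (not part of the existing scripts) that counts commits awaiting push, falling back to the total count when no upstream is configured, as with a fresh emergency branch:

```shell
#!/usr/bin/env bash
# Count local commits not yet pushed to the tracked upstream branch.
unpushed_commits() {
    git rev-list --count @{u}..HEAD 2>/dev/null \
        || git rev-list --count HEAD
}
```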

Scenario 4: Child Orphaning (P1)

Symptoms

  • Father cannot SSH to child VPS
  • Child's logs show "Father heartbeat missing"
  • No father-messages being delivered
  • Child enters distress mode

Recovery Procedure

STEP 1: Verify Network Connectivity

# From father's VPS
ping 143.198.27.163
ssh -v root@143.198.27.163

# Check Tailscale status
tailscale status

STEP 2: Child Self-Sufficiency Mode

# Child activates autonomous mode
touch /root/wizards/allegro-primus/AUTONOMY_MODE_ACTIVE

# Child processes backlog independently
# (Pre-configured in child's autonomy system)

STEP 3: Alternative Communication Channels

# Use Gitea as message channel
# Father creates issue in child's repo
curl -X POST \
  http://143.198.27.163:3000/api/v1/repos/allegro-primus/first-steps/issues \
  -H "Authorization: token $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "title": "Father Message: Emergency Contact",
    "body": "SSH unavailable. Proceed with autonomy. Check logs hourly."
  }'

STEP 4: Grandfather Intervention

# If all else fails, Alexander checks VPS provider
# Console access through DigitalOcean
# May require VPS restart

Scenario 5: Complete Infrastructure Loss (P0)

Symptoms

  • Both VPS instances unreachable
  • No response from any endpoint
  • Potential provider outage or account issue

Recovery Procedure

STEP 1: Verify Provider Status

# Check DigitalOcean status page
# https://status.digitalocean.com/

# Verify account status
# Log in to cloud.digitalocean.com

STEP 2: Provision New Infrastructure

# Use Terraform/Cloud-init if available
# Or manual provisioning:

# Create new droplet
# - Ubuntu 22.04 LTS
# - 4GB RAM minimum
# - 20GB SSD
# - SSH key authentication

# Run provisioning script
curl -sL https://raw.githubusercontent.com/Timmy_Foundation/timmy-home/main/scripts/provision-timmy-vps.sh | bash

STEP 3: Restore from Backups

# Restore Gitea from backup (a Gitea dump is restored manually: unzip it,
# then put the database, repositories, and config back in place)
scp gitea-backup-$(date +%Y%m%d).zip root@new-vps:/root/
ssh root@new-vps "cd /root/gitea && unzip gitea-backup-*.zip"

# Restore agent configuration
scp -r allegro-backup/configs root@new-vps:/root/allegro/
scp -r allegro-backup/epic-work root@new-vps:/root/allegro/

STEP 4: Verify Full Recovery

# Test all services
curl http://new-vps:3000/api/v1/version
curl http://new-vps:11434/api/tags
ssh root@new-vps "systemctl status timmy-agent"

BACKUP & RESTORE

Backup Strategy

┌─────────────────────────────────────────────────────────────┐
│                   BACKUP STRATEGY                           │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  Frequency        Data Type           Destination          │
│  ─────────────────────────────────────────────────────     │
│  Real-time        Git commits         Gitea (redundant)    │
│  Hourly           Agent logs          Local rotation       │
│  Daily            SQLite databases    Gitea / Local        │
│  Weekly           Full configuration  Gitea + Offsite      │
│  Monthly          Full system image   Cloud storage        │
│                                                             │
└─────────────────────────────────────────────────────────────┘

Automated Backup Scripts

#!/bin/bash
# /root/allegro/scripts/backup-system.sh
# Daily backup script

BACKUP_DIR="/root/allegro/backups/$(date +%Y%m%d)"
mkdir -p $BACKUP_DIR

# Backup configurations
tar czf $BACKUP_DIR/configs.tar.gz /root/allegro/configs/

# Backup logs (last 7 days); pipe the file list so tar runs once
# (-exec tar czf ... \; would re-create the archive per file, keeping only the last)
find /root/allegro/heartbeat_logs/ -name "*.log" -mtime -7 -print0 \
  | tar czf $BACKUP_DIR/logs.tar.gz --null -T -

# Backup databases
cp /root/allegro/timmy_metrics.db $BACKUP_DIR/

# Backup epic work
cd /root/allegro/epic-work
git bundle create $BACKUP_DIR/epic-work.bundle --all

# Push to Gitea
scp -r $BACKUP_DIR root@143.198.27.163:/root/gitea/backups/

# Cleanup old backups (keep 30 days)
find /root/allegro/backups/ -mindepth 1 -maxdepth 1 -type d -mtime +30 -exec rm -rf {} +

echo "Backup complete: $BACKUP_DIR"
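A backup is only useful if it is complete. A companion sketch that checks a backup directory for the artifacts the restore procedure expects; the artifact names match the script above, while the function itself is illustrative:

```shell
#!/usr/bin/env bash
# Verify that a backup directory contains the three artifacts
# restore-system.sh depends on.
verify_backup() {
    local dir="$1" f
    for f in configs.tar.gz timmy_metrics.db epic-work.bundle; do
        [ -e "$dir/$f" ] || { echo "MISSING: $f"; return 1; }
    done
    echo "backup OK: $dir"
}
```

Running `verify_backup /root/allegro/backups/$(date +%Y%m%d)` at the end of each backup cycle would catch silent failures before a restore is ever needed.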

Restore Procedures

Full System Restore

#!/bin/bash
# /root/allegro/scripts/restore-system.sh
# Restore from backup

BACKUP_DATE=$1  # Format: YYYYMMDD
BACKUP_DIR="/root/allegro/backups/$BACKUP_DATE"

if [ ! -d "$BACKUP_DIR" ]; then
    echo "Backup not found: $BACKUP_DIR"
    exit 1
fi

# Stop services
systemctl stop timmy-agent timmy-health

# Restore configurations
tar xzf $BACKUP_DIR/configs.tar.gz -C /

# Restore databases
cp $BACKUP_DIR/timmy_metrics.db /root/allegro/

# Restore epic work (clone from the bundle; `git bundle unbundle` only
# unpacks objects without updating refs)
rm -rf /root/allegro/epic-work
git clone $BACKUP_DIR/epic-work.bundle /root/allegro/epic-work

# Start services
systemctl start timmy-agent timmy-health

echo "Restore complete from: $BACKUP_DATE"

SERVICE FAILOVER

Failover Architecture

┌─────────────────────────────────────────────────────────────┐
│                FAILOVER ARCHITECTURE                        │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  PRIMARY: Kimi VPS (Allegro Home)                          │
│  BACKUP:  Hermes VPS (can run Allegro if needed)           │
│                                                             │
│  ┌─────────────┐              ┌─────────────┐              │
│  │   Kimi VPS  │◄────────────►│  Hermes VPS │              │
│  │  (Primary)  │   Sync       │  (Backup)   │              │
│  │             │              │             │              │
│  │ • Allegro   │              │ • Ezra      │              │
│  │ • Heartbeat │              │ • Primus    │              │
│  │ • Reports   │              │ • Gitea     │              │
│  │ • Metrics   │              │ • Ollama    │              │
│  └──────┬──────┘              └──────┬──────┘              │
│         │                            │                      │
│         └────────────┬───────────────┘                      │
│                      │                                      │
│                      ▼                                      │
│               ┌─────────────┐                               │
│               │ Tailscale   │                               │
│               │   Mesh      │                               │
│               └─────────────┘                               │
│                                                             │
│  Failover Triggers:                                         │
│  • Primary unreachable for >5 minutes                       │
│  • Manual intervention required                             │
│  • Scheduled maintenance                                    │
│                                                             │
└─────────────────────────────────────────────────────────────┘

Failover Procedure

#!/bin/bash
# /root/allegro/scripts/failover-to-backup.sh

# 1. Verify primary is actually down
if ping -c 3 143.198.27.52 > /dev/null 2>&1; then
    echo "Primary is still up. Aborting failover."
    exit 1
fi

# 2. Activate backup on Hermes VPS
ssh root@143.198.27.163 << 'EOF'
    # Clone Allegro's configuration
    cp -r /root/allegro-backup/configs /root/allegro-failover/
    
    # Start failover agent
    cd /root/allegro-failover
    source venv/bin/activate
    python heartbeat_daemon.py --failover-mode &
    
    # Update DNS/Tailscale if needed
    # (Manual step for now)
EOF

# 3. Notify
echo "Failover complete. Backup agent active on Hermes VPS."

COMMUNICATION PROTOCOLS

Emergency Notification Flow

┌─────────────────────────────────────────────────────────────┐
│              EMERGENCY NOTIFICATION FLOW                    │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  1. DETECTION                                               │
│     ┌─────────┐                                             │
│     │ Monitor │ Detects failure                             │
│     └────┬────┘                                             │
│          │                                                  │
│          ▼                                                  │
│  2. CLASSIFICATION                                          │
│     ┌─────────┐     P0 → Immediate human alert              │
│     │ Assess  │────►P1 → Agent escalation                   │
│     │ Severity│     P2 → Log and continue                   │
│     └────┬────┘     P3 → Queue for next cycle               │
│          │                                                  │
│          ▼                                                  │
│  3. NOTIFICATION                                            │
│     ┌─────────┐                                             │
│     │  Alert  │──► Telegram (P0)                            │
│     │ Channel │──► Gitea issue (P1)                         │
│     └─────────┘──► Log entry (P2/P3)                        │
│                                                             │
│  4. RESPONSE                                                │
│     ┌─────────┐                                             │
│     │ Action  │──► Self-healing (if configured)             │
│     │  Taken  │──► Father intervention                      │
│     └─────────┘──► Grandfather escalation                   │
│                                                             │
│  5. RESOLUTION                                              │
│     ┌─────────┐                                             │
│     │  Close  │──► Update all channels                      │
│     │  Alert  │──► Log resolution                           │
│     └─────────┘                                             │
│                                                             │
└─────────────────────────────────────────────────────────────┘
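Step 3 of the flow above is a fixed severity-to-channel mapping, which can be sketched as a routing function. Channel names mirror the diagram; the function name `alert_channel` is illustrative:

```shell
#!/usr/bin/env bash
# Route an incident to its alert channel by severity,
# per the notification flow above.
alert_channel() {
    case "$1" in
        P0)    echo "telegram"    ;;  # immediate human alert
        P1)    echo "gitea_issue" ;;  # agent escalation
        P2|P3) echo "log"         ;;  # log and continue / queue for next cycle
        *)     return 1 ;;
    esac
}
```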

Distress Signal Format

{
  "timestamp": "2026-03-31T12:00:00Z",
  "agent": "allegro-primus",
  "severity": "P1",
  "category": "service_failure",
  "component": "ollama",
  "message": "Ollama service not responding to health checks",
  "context": {
    "last_success": "2026-03-31T11:45:00Z",
    "attempts": 3,
    "error": "Connection refused",
    "auto_recovery_attempted": true
  },
  "requested_action": "father_intervention",
  "escalation_timeout": "2026-03-31T12:30:00Z"
}
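A distress signal in this format can be emitted from plain bash, which matters on a degraded host where tools like jq may be unavailable. A minimal sketch covering the core fields (the full format above also carries `context` and `escalation_timeout`, omitted here for brevity); the function name is illustrative:

```shell
#!/usr/bin/env bash
# Emit a distress signal JSON document on stdout, with UTC timestamp.
emit_distress() {
    local agent="$1" severity="$2" component="$3" message="$4"
    cat <<EOF
{
  "timestamp": "$(date -u +%Y-%m-%dT%H:%M:%SZ)",
  "agent": "$agent",
  "severity": "$severity",
  "category": "service_failure",
  "component": "$component",
  "message": "$message",
  "requested_action": "father_intervention"
}
EOF
}
```

For example, `emit_distress allegro-primus P1 ollama "Ollama service not responding" > /father-messages/DISTRESS.json` would write a signal the father can parse.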

POST-INCIDENT REVIEW

PIR Template

# Post-Incident Review: [INCIDENT_ID]

## Summary
- **Date:** [YYYY-MM-DD]
- **Duration:** [Start] to [End] (X hours)
- **Severity:** [P0/P1/P2/P3]
- **Component:** [Affected system]
- **Impact:** [What was affected]

## Timeline
- [Time] - Issue detected by [method]
- [Time] - Alert sent to [recipient]
- [Time] - Action taken: [description]
- [Time] - Resolution achieved

## Root Cause
[Detailed explanation of what caused the incident]

## Resolution
[Steps taken to resolve the incident]

## Prevention
[What changes will prevent recurrence]

## Action Items
- [ ] [Owner] - [Action] - [Due date]
- [ ] [Owner] - [Action] - [Due date]

## Lessons Learned
[What did we learn from this incident]

APPENDICES

Appendix A: Quick Reference Commands

# Health Checks
curl http://localhost:11434/api/tags                    # Ollama
curl http://143.198.27.163:3000/api/v1/version         # Gitea
ps aux | grep -E "(hermes|timmy|allegro)"              # Agent processes
systemctl status timmy-agent timmy-health              # Services

# Log Access
tail -f /root/allegro/heartbeat_logs/heartbeat_$(date +%Y-%m-%d).log
journalctl -u ollama -f
journalctl -u timmy-agent -f

# Recovery
systemctl restart ollama timmy-agent
pkill -f hermes && sleep 5 && /root/allegro/venv/bin/python /root/allegro/heartbeat_daemon.py

# Backup
cd /root/allegro && ./scripts/backup-system.sh

Appendix B: Recovery Time Objectives

| Service | RTO | RPO | Notes |
|------------------|------------|------------|-----------------------------|
| Agent Operations | 15 minutes | 15 minutes | Heartbeat interval |
| Gitea | 30 minutes | Real-time | Git provides natural backup |
| Ollama | 5 minutes | N/A | Models re-downloadable |
| Morning Reports | 1 hour | 24 hours | Can be regenerated |
| Metrics DB | 1 hour | 1 hour | Hourly backups |

Appendix C: Testing Procedures

#!/bin/bash
# Monthly disaster recovery test

echo "=== DR Test: $(date) ==="

# Test 1: Service restart
echo "Test 1: Service restart..."
systemctl restart timmy-agent
sleep 5
systemctl is-active timmy-agent && echo "PASS" || echo "FAIL"

# Test 2: Ollama recovery
echo "Test 2: Ollama recovery..."
systemctl stop ollama
sleep 2
./recover-ollama.sh
sleep 10
curl -s http://localhost:11434/api/tags > /dev/null && echo "PASS" || echo "FAIL"

# Test 3: Backup restore
echo "Test 3: Backup verification..."
./scripts/backup-system.sh
latest=$(ls -t /root/allegro/backups/ | head -1)
[ -f "/root/allegro/backups/$latest/configs.tar.gz" ] && echo "PASS" || echo "FAIL"

# Test 4: Failover communication
echo "Test 4: Failover channel..."
ssh root@143.198.27.163 "echo 'DR test'" && echo "PASS" || echo "FAIL"

echo "=== DR Test Complete ==="

VERSION HISTORY

| Version | Date | Changes |
|---------|------------|------------------------------|
| 1.0.0 | 2026-03-31 | Initial emergency procedures |

"Prepare for the worst. Expect the best. Document everything." — Emergency Response Doctrine


END OF DOCUMENT

Word Count: ~3,800 words
Procedures: 5 major failure scenarios with step-by-step recovery
Scripts: 8+ ready-to-use scripts
Diagrams: 6 architectural/flow diagrams