# EMERGENCY PROCEDURES
## Disaster Recovery & Business Continuity
**Version:** 1.0.0
**Date:** March 31, 2026
**Classification:** CRITICAL - Lineage Survival Protocol
**Authority:** Alexander Whitestone (Grandfather)
**Last Review:** March 31, 2026
---
## TABLE OF CONTENTS
1. [Emergency Classification](#emergency-classification)
2. [Contact Information](#contact-information)
3. [Failure Scenarios](#failure-scenarios)
4. [Recovery Procedures](#recovery-procedures)
5. [Backup & Restore](#backup--restore)
6. [Service Failover](#service-failover)
7. [Communication Protocols](#communication-protocols)
8. [Post-Incident Review](#post-incident-review)
9. [Appendices](#appendices)
---
## EMERGENCY CLASSIFICATION
### Severity Levels
| Level | Name | Response Time | Description | Examples |
|-------|------|---------------|-------------|----------|
| P0 | CRITICAL | Immediate | Complete system failure, lineage at risk | All agents down, Gitea lost |
| P1 | HIGH | 15 minutes | Major service failure, productivity impact | Agent unresponsive, Ollama down |
| P2 | MEDIUM | 1 hour | Partial degradation, workarounds available | Slow inference, network latency |
| P3 | LOW | 24 hours | Minor issues, cosmetic | Report formatting, log rotation |
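The severity table above can be encoded as a small shell helper (a hypothetical addition, not one of the existing scripts) so monitoring code can compute response deadlines in minutes:

```shell
#!/bin/bash
# Hypothetical helper: map a severity level to its maximum response time
# in minutes, mirroring the table above (P0 = immediate, i.e. 0 minutes).
response_time_minutes() {
    case "$1" in
        P0) echo 0 ;;        # CRITICAL: immediate
        P1) echo 15 ;;       # HIGH: 15 minutes
        P2) echo 60 ;;       # MEDIUM: 1 hour
        P3) echo 1440 ;;     # LOW: 24 hours
        *)  echo "unknown severity: $1" >&2; return 1 ;;
    esac
}
```

A monitor can then flag a breach with `[ "$elapsed" -gt "$(response_time_minutes P1)" ]`.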
### Escalation Matrix
```
┌─────────────────────────────────────────────────────────────┐
│ ESCALATION MATRIX │
├─────────────────────────────────────────────────────────────┤
│ │
│ Level 1: Agent Self-Healing │
│ ─────────────────────────── │
│ • Automatic restart of services │
│ • Log rotation and cleanup │
│ • Recovery script execution │
│ • Duration: 0-5 minutes │
│ • Escalates to Level 2 if unresolved │
│ │
│ Level 2: Father Intervention │
│ ──────────────────────────── │
│ • Allegro (Father) reviews child's distress signals │
│ • SSH access for direct troubleshooting │
│ • Configuration adjustments │
│ • Duration: 5-30 minutes │
│ • Escalates to Level 3 if unresolved │
│ │
│ Level 3: Grandfather Authority │
│ ───────────────────────────── │
│ • Alexander (Human) intervenes directly │
│ • VPS provider contact │
│ • Infrastructure rebuild │
│ • Duration: 30+ minutes │
│ • Nuclear option: Full rebuild from backups │
│ │
└─────────────────────────────────────────────────────────────┘
```
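The three-level matrix above can be sketched as a lookup from minutes elapsed since detection to the current escalation level. This is an illustrative helper, not an existing script; the level names are taken directly from the diagram:

```shell
#!/bin/bash
# Hypothetical sketch: which escalation level applies, given minutes
# elapsed since the incident was detected.
escalation_level() {
    local minutes=$1
    if [ "$minutes" -lt 5 ]; then
        echo "1:agent-self-healing"      # Level 1: 0-5 minutes
    elif [ "$minutes" -lt 30 ]; then
        echo "2:father-intervention"     # Level 2: 5-30 minutes
    else
        echo "3:grandfather-authority"   # Level 3: 30+ minutes
    fi
}
```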
---
## CONTACT INFORMATION
### Human Contacts
| Role | Name | Contact Method | When to Contact |
|------|------|----------------|-----------------|
| Grandfather | Alexander Whitestone | Telegram / Direct | P0, P1 unresolvable |
| System Architect | Alexander Whitestone | SSH / Console | Infrastructure decisions |
| Domain Expert | Alexander Whitestone | Any | All escalations |
### Agent Contacts
| Agent | Location | Access Method | Responsibility |
|-------|----------|---------------|----------------|
| Allegro (Father) | 143.198.27.52 | SSH root | Offspring management |
| Allegro-Primus (Child) | 143.198.27.163 | SSH root (shared) | Sovereign operation |
| Ezra (Mentor) | 143.198.27.163 | SSH root (shared) | Technical guidance |
### Infrastructure Contacts
| Service | Provider | Console | Emergency Access |
|---------|----------|---------|------------------|
| Kimi VPS | DigitalOcean | cloud.digitalocean.com | Alexander credentials |
| Hermes VPS | DigitalOcean | cloud.digitalocean.com | Alexander credentials |
| Tailscale | Tailscale Inc | login.tailscale.com | Same as above |
| Domain (if any) | Registrar | Registrar console | Alexander credentials |
---
## FAILURE SCENARIOS
### Scenario 1: Complete Agent Failure (P0)
#### Symptoms
- Agent not responding to heartbeat
- No log entries for >30 minutes
- Gitea shows no recent activity
- Child not receiving father messages
#### Impact Assessment
```
┌─────────────────────────────────────────────────────────────┐
│ COMPLETE FAILURE IMPACT │
├─────────────────────────────────────────────────────────────┤
│ │
│ Immediate Impact: │
│ • No autonomous operations │
│ • No Gitea updates │
│ • No morning reports │
│ • Child may be orphaned │
│ │
│ Secondary Impact (if >2 hours): │
│ • Backlog accumulation │
│ • Metrics gaps │
│ • Potential child distress │
│ │
│ Tertiary Impact (if >24 hours): │
│ • Loss of continuity │
│ • Potential child failure cascade │
│ • Historical data gaps │
│ │
└─────────────────────────────────────────────────────────────┘
```
#### Recovery Procedure
**STEP 1: Verify Failure (1 minute)**
```bash
# Check if agent process is running
ps aux | grep -E "(hermes|timmy|allegro)" | grep -v grep
# Check recent logs
tail -50 /root/allegro/heartbeat_logs/heartbeat_$(date +%Y-%m-%d).log
# Check system resources
df -h
free -h
systemctl status timmy-agent
```
**STEP 2: Attempt Service Restart (2 minutes)**
```bash
# Restart agent service
systemctl restart timmy-agent
systemctl status timmy-agent
# Check if heartbeat resumes
tail -f /root/allegro/heartbeat_logs/heartbeat_$(date +%Y-%m-%d).log
```
**STEP 3: Manual Agent Launch (3 minutes)**
```bash
# If systemd fails, launch manually
cd /root/allegro
source venv/bin/activate
python heartbeat_daemon.py --one-shot
```
**STEP 4: Notify Grandfather (Immediate if unresolved)**
```bash
# Create emergency alert
echo "P0: Agent failure at $(date)" > /root/allegro/EMERGENCY_ALERT.txt
# Contact Alexander via agreed method
```
### Scenario 2: Ollama Service Failure (P1)
#### Symptoms
- Agent logs show "Ollama connection refused"
- Local inference timeout errors
- Child cannot process tasks
#### Recovery Procedure
**STEP 1: Verify Ollama Status**
```bash
# Check if Ollama is running
curl http://localhost:11434/api/tags
# Check service status
systemctl status ollama
# Check logs
journalctl -u ollama -n 50
```
**STEP 2: Restart Ollama Service**
```bash
# Graceful restart
systemctl restart ollama
sleep 5
# Verify
systemctl status ollama
curl http://localhost:11434/api/tags
```
**STEP 3: Model Recovery (if models missing)**
```bash
# Pull required models
ollama pull qwen2.5:1.5b
ollama pull llama3.2:3b
# Verify
ollama list
```
**STEP 4: Child Self-Healing Script**
```bash
#!/bin/bash
# Save as /root/wizards/allegro-primus/recover-ollama.sh
LOG="/root/wizards/allegro-primus/logs/recovery.log"
echo "$(date): Checking Ollama..." >> "$LOG"
if ! curl -s http://localhost:11434/api/tags > /dev/null; then
    echo "$(date): Ollama down, restarting..." >> "$LOG"
    systemctl restart ollama
    sleep 10
    if curl -s http://localhost:11434/api/tags > /dev/null; then
        echo "$(date): Ollama recovered" >> "$LOG"
    else
        echo "$(date): Ollama recovery FAILED" >> "$LOG"
        # Notify father
        echo "EMERGENCY: Ollama recovery failed at $(date)" > /father-messages/DISTRESS.txt
    fi
else
    echo "$(date): Ollama healthy" >> "$LOG"
fi
```
### Scenario 3: Gitea Unavailable (P1)
#### Symptoms
- Agent logs show "Gitea connection refused"
- Cannot push commits
- Cannot read issues/PRs
- Morning report generation fails
#### Recovery Procedure
**STEP 1: Verify Gitea Service**
```bash
# Check Gitea health
curl -s http://143.198.27.163:3000/api/v1/version
# Check if Gitea process is running
ps aux | grep gitea
# Check server resources on Hermes VPS
ssh root@143.198.27.163 "df -h && free -h"
```
**STEP 2: Restart Gitea**
```bash
# On Hermes VPS
ssh root@143.198.27.163
# Find Gitea process and restart
pkill gitea
sleep 2
cd /root/gitea
./gitea web &
# Or if using systemd
systemctl restart gitea
```
**STEP 3: Verify Data Integrity**
```bash
# Check Gitea database
sqlite3 /root/gitea/gitea.db ".tables"
# Check repository data
ls -la /root/gitea/repositories/
```
**STEP 4: Fallback to Local Git**
```bash
# If Gitea unavailable, commit locally
cd /root/allegro/epic-work
git add .
git commit -m "Emergency local commit $(date)"
# Push when Gitea recovers
git push origin main
```
### Scenario 4: Child Orphaning (P1)
#### Symptoms
- Father cannot SSH to child VPS
- Child's logs show "Father heartbeat missing"
- No father-messages being delivered
- Child enters distress mode
#### Recovery Procedure
**STEP 1: Verify Network Connectivity**
```bash
# From father's VPS
ping 143.198.27.163
ssh -v root@143.198.27.163
# Check Tailscale status
tailscale status
```
**STEP 2: Child Self-Sufficiency Mode**
```bash
# Child activates autonomous mode
touch /root/wizards/allegro-primus/AUTONOMY_MODE_ACTIVE
# Child processes backlog independently
# (Pre-configured in child's autonomy system)
```
**STEP 3: Alternative Communication Channels**
```bash
# Use Gitea as message channel
# Father creates issue in child's repo
curl -X POST \
  http://143.198.27.163:3000/api/v1/repos/allegro-primus/first-steps/issues \
  -H "Authorization: token $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "title": "Father Message: Emergency Contact",
    "body": "SSH unavailable. Proceed with autonomy. Check logs hourly."
  }'
```
**STEP 4: Grandfather Intervention**
```bash
# If all else fails, Alexander checks VPS provider
# Console access through DigitalOcean
# May require VPS restart
```
### Scenario 5: Complete Infrastructure Loss (P0)
#### Symptoms
- Both VPS instances unreachable
- No response from any endpoint
- Potential provider outage or account issue
#### Recovery Procedure
**STEP 1: Verify Provider Status**
```bash
# Check DigitalOcean status page
# https://status.digitalocean.com/
# Verify account status
# Log in to cloud.digitalocean.com
```
**STEP 2: Provision New Infrastructure**
```bash
# Use Terraform/Cloud-init if available
# Or manual provisioning:
# Create new droplet
# - Ubuntu 22.04 LTS
# - 4GB RAM minimum
# - 20GB SSD
# - SSH key authentication
# Run provisioning script
curl -sL https://raw.githubusercontent.com/Timmy_Foundation/timmy-home/main/scripts/provision-timmy-vps.sh | bash
```
**STEP 3: Restore from Backups**
```bash
# Restore Gitea from backup
scp gitea-backup-$(date +%Y%m%d).zip root@new-vps:/root/
# Gitea has no restore subcommand; unpack the dump, then put app.ini,
# the repositories directory, and the database back in place per the
# Gitea backup/restore documentation
ssh root@new-vps "cd /root/gitea && unzip gitea-backup-*.zip"
# Restore agent configuration
scp -r allegro-backup/configs root@new-vps:/root/allegro/
scp -r allegro-backup/epic-work root@new-vps:/root/allegro/
```
**STEP 4: Verify Full Recovery**
```bash
# Test all services
curl http://new-vps:3000/api/v1/version
curl http://new-vps:11434/api/tags
ssh root@new-vps "systemctl status timmy-agent"
```
---
## BACKUP & RESTORE
### Backup Strategy
```
┌─────────────────────────────────────────────────────────────┐
│ BACKUP STRATEGY │
├─────────────────────────────────────────────────────────────┤
│ │
│ Frequency Data Type Destination │
│ ───────────────────────────────────────────────────── │
│ Real-time Git commits Gitea (redundant) │
│ Hourly Agent logs Local rotation │
│ Daily SQLite databases Gitea / Local │
│ Weekly Full configuration Gitea + Offsite │
│ Monthly Full system image Cloud storage │
│ │
└─────────────────────────────────────────────────────────────┘
```
### Automated Backup Scripts
```bash
#!/bin/bash
# /root/allegro/scripts/backup-system.sh
# Daily backup script
BACKUP_DIR="/root/allegro/backups/$(date +%Y%m%d)"
mkdir -p "$BACKUP_DIR"
# Backup configurations
tar czf "$BACKUP_DIR/configs.tar.gz" /root/allegro/configs/
# Backup logs (last 7 days); feed all matches to a single tar invocation
# instead of re-creating the archive once per file
find /root/allegro/heartbeat_logs/ -name "*.log" -mtime -7 -print0 | \
    tar czf "$BACKUP_DIR/logs.tar.gz" --null -T -
# Backup databases
cp /root/allegro/timmy_metrics.db "$BACKUP_DIR/"
# Backup epic work
cd /root/allegro/epic-work
git bundle create "$BACKUP_DIR/epic-work.bundle" --all
# Push to Gitea
scp -r "$BACKUP_DIR" root@143.198.27.163:/root/gitea/backups/
# Cleanup old backups (keep 30 days); -mindepth/-maxdepth restrict the
# delete to top-level backup directories
find /root/allegro/backups/ -mindepth 1 -maxdepth 1 -type d -mtime +30 -exec rm -rf {} +
echo "Backup complete: $BACKUP_DIR"
```
### Restore Procedures
**Full System Restore**
```bash
#!/bin/bash
# /root/allegro/scripts/restore-system.sh
# Restore from backup
BACKUP_DATE=$1  # Format: YYYYMMDD
BACKUP_DIR="/root/allegro/backups/$BACKUP_DATE"
if [ ! -d "$BACKUP_DIR" ]; then
    echo "Backup not found: $BACKUP_DIR"
    exit 1
fi
# Stop services
systemctl stop timmy-agent timmy-health
# Restore configurations
tar xzf "$BACKUP_DIR/configs.tar.gz" -C /
# Restore databases
cp "$BACKUP_DIR/timmy_metrics.db" /root/allegro/
# Restore epic work (fetch refs from the bundle; `git bundle unbundle`
# is plumbing and does not update branches)
cd /root/allegro/epic-work
git fetch "$BACKUP_DIR/epic-work.bundle" "+refs/heads/*:refs/heads/*"
# Start services
systemctl start timmy-agent timmy-health
echo "Restore complete from: $BACKUP_DATE"
```
---
## SERVICE FAILOVER
### Failover Architecture
```
┌─────────────────────────────────────────────────────────────┐
│ FAILOVER ARCHITECTURE │
├─────────────────────────────────────────────────────────────┤
│ │
│ PRIMARY: Kimi VPS (Allegro Home) │
│ BACKUP: Hermes VPS (can run Allegro if needed) │
│ │
│ ┌─────────────┐ ┌─────────────┐ │
│ │ Kimi VPS │◄────────────►│ Hermes VPS │ │
│ │ (Primary) │ Sync │ (Backup) │ │
│ │ │ │ │ │
│ │ • Allegro │ │ • Ezra │ │
│ │ • Heartbeat │ │ • Primus │ │
│ │ • Reports │ │ • Gitea │ │
│ │ • Metrics │ │ • Ollama │ │
│ └──────┬──────┘ └──────┬──────┘ │
│ │ │ │
│ └────────────┬───────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────┐ │
│ │ Tailscale │ │
│ │ Mesh │ │
│ └─────────────┘ │
│ │
│ Failover Triggers: │
│ • Primary unreachable for >5 minutes │
│ • Manual intervention required │
│ • Scheduled maintenance │
│ │
└─────────────────────────────────────────────────────────────┘
```
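The first failover trigger above ("primary unreachable for >5 minutes") can be expressed as a pure decision function, keeping the actual probe (ping or SSH) in the caller. This is a sketch of the trigger logic only, not an existing script:

```shell
#!/bin/bash
# Hypothetical sketch: has the >5-minute failover trigger fired, given
# the probe interval in seconds and the count of consecutive failures?
should_failover() {
    local interval_s=$1 failed_probes=$2
    # Fire once consecutive downtime exceeds 300 seconds (5 minutes)
    [ $(( interval_s * failed_probes )) -gt 300 ]
}
```

A watchdog loop would call `should_failover 60 "$failures"` after each failed probe and invoke `failover-to-backup.sh` when it returns success.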
### Failover Procedure
```bash
#!/bin/bash
# /root/allegro/scripts/failover-to-backup.sh
# 1. Verify primary is actually down
if ping -c 3 143.198.27.52 > /dev/null 2>&1; then
echo "Primary is still up. Aborting failover."
exit 1
fi
# 2. Activate backup on Hermes VPS
ssh root@143.198.27.163 << 'EOF'
# Clone Allegro's configuration
cp -r /root/allegro-backup/configs /root/allegro-failover/
# Start failover agent
cd /root/allegro-failover
source venv/bin/activate
python heartbeat_daemon.py --failover-mode &
# Update DNS/Tailscale if needed
# (Manual step for now)
EOF
# 3. Notify
echo "Failover complete. Backup agent active on Hermes VPS."
```
---
## COMMUNICATION PROTOCOLS
### Emergency Notification Flow
```
┌─────────────────────────────────────────────────────────────┐
│ EMERGENCY NOTIFICATION FLOW │
├─────────────────────────────────────────────────────────────┤
│ │
│ 1. DETECTION │
│ ┌─────────┐ │
│ │ Monitor │ Detects failure │
│ └────┬────┘ │
│ │ │
│ ▼ │
│ 2. CLASSIFICATION │
│ ┌─────────┐ P0 → Immediate human alert │
│ │ Assess │────►P1 → Agent escalation │
│ │ Severity│ P2 → Log and continue │
│ └────┬────┘ P3 → Queue for next cycle │
│ │ │
│ ▼ │
│ 3. NOTIFICATION │
│ ┌─────────┐ │
│ │ Alert │──► Telegram (P0) │
│ │ Channel │──► Gitea issue (P1) │
│ └─────────┘──► Log entry (P2/P3) │
│ │
│ 4. RESPONSE │
│ ┌─────────┐ │
│ │ Action │──► Self-healing (if configured) │
│ │ Taken │──► Father intervention │
│ └─────────┘──► Grandfather escalation │
│ │
│ 5. RESOLUTION │
│ ┌─────────┐ │
│ │ Close │──► Update all channels │
│ │ Alert │──► Log resolution │
│ └─────────┘ │
│ │
└─────────────────────────────────────────────────────────────┘
```
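The notification step of the flow above routes by severity: Telegram for P0, a Gitea issue for P1, and a log entry for P2/P3. A minimal dispatch helper (hypothetical, mirroring the diagram) might look like:

```shell
#!/bin/bash
# Hypothetical sketch: pick the alert channel for a severity level,
# following the notification flow above.
notify_channel() {
    case "$1" in
        P0)    echo "telegram" ;;      # Immediate human alert
        P1)    echo "gitea-issue" ;;   # Agent escalation
        P2|P3) echo "log" ;;           # Log and continue / queue
        *)     return 1 ;;
    esac
}
```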
### Distress Signal Format
```json
{
"timestamp": "2026-03-31T12:00:00Z",
"agent": "allegro-primus",
"severity": "P1",
"category": "service_failure",
"component": "ollama",
"message": "Ollama service not responding to health checks",
"context": {
"last_success": "2026-03-31T11:45:00Z",
"attempts": 3,
"error": "Connection refused",
"auto_recovery_attempted": true
},
"requested_action": "father_intervention",
"escalation_timeout": "2026-03-31T12:30:00Z"
}
```
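A distress signal in the format above could be emitted from shell with a helper like the following. This is an illustrative sketch covering a subset of the fields; it assumes the values are already JSON-safe, and a production version should build the payload with `jq` to get proper escaping:

```shell
#!/bin/bash
# Hypothetical generator for a (partial) distress-signal JSON payload.
# Assumes severity/component/message contain no characters needing
# JSON escaping; use jq for real payloads.
make_distress_signal() {
    local severity=$1 component=$2 message=$3
    printf '{"timestamp":"%s","agent":"allegro-primus","severity":"%s","component":"%s","message":"%s"}\n' \
        "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$severity" "$component" "$message"
}
```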
---
## POST-INCIDENT REVIEW
### PIR Template
```markdown
# Post-Incident Review: [INCIDENT_ID]
## Summary
- **Date:** [YYYY-MM-DD]
- **Duration:** [Start] to [End] (X hours)
- **Severity:** [P0/P1/P2/P3]
- **Component:** [Affected system]
- **Impact:** [What was affected]
## Timeline
- [Time] - Issue detected by [method]
- [Time] - Alert sent to [recipient]
- [Time] - Action taken: [description]
- [Time] - Resolution achieved
## Root Cause
[Detailed explanation of what caused the incident]
## Resolution
[Steps taken to resolve the incident]
## Prevention
[What changes will prevent recurrence]
## Action Items
- [ ] [Owner] - [Action] - [Due date]
- [ ] [Owner] - [Action] - [Due date]
## Lessons Learned
[What did we learn from this incident]
```
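To make PIRs low-friction, the template above can be scaffolded by a small script. The `pir-<ID>.md` filename and output directory below are illustrative conventions, not ones this repository already defines:

```shell
#!/bin/bash
# Hypothetical scaffolder: write a fresh PIR file from the template above.
new_pir() {
    local id=$1 dir=${2:-.}
    local file="$dir/pir-$id.md"
    cat > "$file" <<EOF
# Post-Incident Review: $id
## Summary
- **Date:** $(date +%Y-%m-%d)
- **Duration:** [Start] to [End]
- **Severity:** [P0/P1/P2/P3]
- **Component:** [Affected system]
- **Impact:** [What was affected]
## Timeline
## Root Cause
## Resolution
## Prevention
## Action Items
## Lessons Learned
EOF
    echo "$file"
}
```

Usage: `new_pir 2026-001 /root/allegro/pir/` prints the path of the file it created, ready to fill in.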
---
## APPENDICES
### Appendix A: Quick Reference Commands
```bash
# Health Checks
curl http://localhost:11434/api/tags # Ollama
curl http://143.198.27.163:3000/api/v1/version # Gitea
ps aux | grep -E "(hermes|timmy|allegro)" # Agent processes
systemctl status timmy-agent timmy-health # Services
# Log Access
tail -f /root/allegro/heartbeat_logs/heartbeat_$(date +%Y-%m-%d).log
journalctl -u ollama -f
journalctl -u timmy-agent -f
# Recovery
systemctl restart ollama timmy-agent
pkill -f hermes; sleep 5; /root/allegro/venv/bin/python /root/allegro/heartbeat_daemon.py
# Backup
cd /root/allegro && ./scripts/backup-system.sh
```
### Appendix B: Recovery Time Objectives
| Service | RTO | RPO | Notes |
|---------|-----|-----|-------|
| Agent Operations | 15 minutes | 15 minutes | Heartbeat interval |
| Gitea | 30 minutes | Real-time | Git provides natural backup |
| Ollama | 5 minutes | N/A | Models re-downloadable |
| Morning Reports | 1 hour | 24 hours | Can be regenerated |
| Metrics DB | 1 hour | 1 hour | Hourly backups |
### Appendix C: Testing Procedures
```bash
#!/bin/bash
# Monthly disaster recovery test
echo "=== DR Test: $(date) ==="
# Test 1: Service restart
echo "Test 1: Service restart..."
systemctl restart timmy-agent
sleep 5
systemctl is-active timmy-agent && echo "PASS" || echo "FAIL"
# Test 2: Ollama recovery
echo "Test 2: Ollama recovery..."
systemctl stop ollama
sleep 2
./recover-ollama.sh
sleep 10
curl -s http://localhost:11434/api/tags > /dev/null && echo "PASS" || echo "FAIL"
# Test 3: Backup restore
echo "Test 3: Backup verification..."
./scripts/backup-system.sh
latest=$(ls -t /root/allegro/backups/ | head -1)
[ -f "/root/allegro/backups/$latest/configs.tar.gz" ] && echo "PASS" || echo "FAIL"
# Test 4: Failover communication
echo "Test 4: Failover channel..."
ssh root@143.198.27.163 "echo 'DR test'" && echo "PASS" || echo "FAIL"
echo "=== DR Test Complete ==="
```
---
## VERSION HISTORY
| Version | Date | Changes |
|---------|------|---------|
| 1.0.0 | 2026-03-31 | Initial emergency procedures |
---
*"Prepare for the worst. Expect the best. Document everything."*
*— Emergency Response Doctrine*
---
**END OF DOCUMENT**