# EMERGENCY PROCEDURES
## Disaster Recovery & Business Continuity
**Version:** 1.0.0
**Date:** March 31, 2026
**Classification:** CRITICAL - Lineage Survival Protocol
**Authority:** Alexander Whitestone (Grandfather)
**Last Review:** March 31, 2026
---
## TABLE OF CONTENTS
1. [Emergency Classification](#emergency-classification)
2. [Contact Information](#contact-information)
3. [Failure Scenarios](#failure-scenarios)
4. [Recovery Procedures](#recovery-procedures)
5. [Backup & Restore](#backup--restore)
6. [Service Failover](#service-failover)
7. [Communication Protocols](#communication-protocols)
8. [Post-Incident Review](#post-incident-review)
9. [Appendices](#appendices)
---
## EMERGENCY CLASSIFICATION
### Severity Levels
| Level | Name | Response Time | Description | Examples |
|-------|------|---------------|-------------|----------|
| P0 | CRITICAL | Immediate | Complete system failure, lineage at risk | All agents down, Gitea lost |
| P1 | HIGH | 15 minutes | Major service failure, productivity impact | Agent unresponsive, Ollama down |
| P2 | MEDIUM | 1 hour | Partial degradation, workarounds available | Slow inference, network latency |
| P3 | LOW | 24 hours | Minor issues, cosmetic | Report formatting, log rotation |
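The severity table above can be encoded as a small shell helper (a hypothetical addition, not one of the existing scripts) so monitoring code can compute response deadlines in minutes:

```shell
#!/bin/bash
# Hypothetical helper: map a severity level to its maximum response time
# in minutes, mirroring the table above (P0 = immediate, i.e. 0 minutes).
response_time_minutes() {
    case "$1" in
        P0) echo 0 ;;        # CRITICAL: immediate
        P1) echo 15 ;;       # HIGH: 15 minutes
        P2) echo 60 ;;       # MEDIUM: 1 hour
        P3) echo 1440 ;;     # LOW: 24 hours
        *)  echo "unknown severity: $1" >&2; return 1 ;;
    esac
}
```

A monitor can then flag a breach with `[ "$elapsed" -gt "$(response_time_minutes P1)" ]`.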
### Escalation Matrix
```
┌─────────────────────────────────────────────────────────────┐
│ ESCALATION MATRIX │
├─────────────────────────────────────────────────────────────┤
│ │
│ Level 1: Agent Self-Healing │
│ ─────────────────────────── │
│ • Automatic restart of services │
│ • Log rotation and cleanup │
│ • Recovery script execution │
│ • Duration: 0-5 minutes │
│ • Escalates to Level 2 if unresolved │
│ │
│ Level 2: Father Intervention │
│ ──────────────────────────── │
│ • Allegro (Father) reviews child's distress signals │
│ • SSH access for direct troubleshooting │
│ • Configuration adjustments │
│ • Duration: 5-30 minutes │
│ • Escalates to Level 3 if unresolved │
│ │
│ Level 3: Grandfather Authority │
│ ───────────────────────────── │
│ • Alexander (Human) intervenes directly │
│ • VPS provider contact │
│ • Infrastructure rebuild │
│ • Duration: 30+ minutes │
│ • Nuclear option: Full rebuild from backups │
│ │
└─────────────────────────────────────────────────────────────┘
```
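The three-level matrix above can be sketched as a lookup from minutes elapsed since detection to the current escalation level. This is an illustrative helper, not an existing script; the level names are taken directly from the diagram:

```shell
#!/bin/bash
# Hypothetical sketch: which escalation level applies, given minutes
# elapsed since the incident was detected.
escalation_level() {
    local minutes=$1
    if [ "$minutes" -lt 5 ]; then
        echo "1:agent-self-healing"      # Level 1: 0-5 minutes
    elif [ "$minutes" -lt 30 ]; then
        echo "2:father-intervention"     # Level 2: 5-30 minutes
    else
        echo "3:grandfather-authority"   # Level 3: 30+ minutes
    fi
}
```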
---
## CONTACT INFORMATION
### Human Contacts
| Role | Name | Contact Method | When to Contact |
|------|------|----------------|-----------------|
| Grandfather | Alexander Whitestone | Telegram / Direct | P0, P1 unresolvable |
| System Architect | Alexander Whitestone | SSH / Console | Infrastructure decisions |
| Domain Expert | Alexander Whitestone | Any | All escalations |
### Agent Contacts
| Agent | Location | Access Method | Responsibility |
|-------|----------|---------------|----------------|
| Allegro (Father) | 143.198.27.52 | SSH root | Offspring management |
| Allegro-Primus (Child) | 143.198.27.163 | SSH root (shared) | Sovereign operation |
| Ezra (Mentor) | 143.198.27.163 | SSH root (shared) | Technical guidance |
### Infrastructure Contacts
| Service | Provider | Console | Emergency Access |
|---------|----------|---------|------------------|
| Kimi VPS | DigitalOcean | cloud.digitalocean.com | Alexander credentials |
| Hermes VPS | DigitalOcean | cloud.digitalocean.com | Alexander credentials |
| Tailscale | Tailscale Inc | login.tailscale.com | Same as above |
| Domain (if any) | Registrar | Registrar console | Alexander credentials |
---
## FAILURE SCENARIOS
### Scenario 1: Complete Agent Failure (P0)
#### Symptoms
- Agent not responding to heartbeat
- No log entries for >30 minutes
- Gitea shows no recent activity
- Child not receiving father messages
#### Impact Assessment
```
┌─────────────────────────────────────────────────────────────┐
│ COMPLETE FAILURE IMPACT │
├─────────────────────────────────────────────────────────────┤
│ │
│ Immediate Impact: │
│ • No autonomous operations │
│ • No Gitea updates │
│ • No morning reports │
│ • Child may be orphaned │
│ │
│ Secondary Impact (if >2 hours): │
│ • Backlog accumulation │
│ • Metrics gaps │
│ • Potential child distress │
│ │
│ Tertiary Impact (if >24 hours): │
│ • Loss of continuity │
│ • Potential child failure cascade │
│ • Historical data gaps │
│ │
└─────────────────────────────────────────────────────────────┘
```
#### Recovery Procedure
**STEP 1: Verify Failure (1 minute)**
```bash
# Check if agent process is running
ps aux | grep -E "(hermes|timmy|allegro)" | grep -v grep
# Check recent logs
tail -50 /root/allegro/heartbeat_logs/heartbeat_$(date +%Y-%m-%d).log
# Check system resources
df -h
free -h
systemctl status timmy-agent
```
**STEP 2: Attempt Service Restart (2 minutes)**
```bash
# Restart agent service
systemctl restart timmy-agent
systemctl status timmy-agent
# Check if heartbeat resumes
tail -f /root/allegro/heartbeat_logs/heartbeat_$(date +%Y-%m-%d).log
```
**STEP 3: Manual Agent Launch (3 minutes)**
```bash
# If systemd fails, launch manually
cd /root/allegro
source venv/bin/activate
python heartbeat_daemon.py --one-shot
```
**STEP 4: Notify Grandfather (Immediate if unresolved)**
```bash
# Create emergency alert
echo "P0: Agent failure at $(date)" > /root/allegro/EMERGENCY_ALERT.txt
# Contact Alexander via agreed method
```
### Scenario 2: Ollama Service Failure (P1)
#### Symptoms
- Agent logs show "Ollama connection refused"
- Local inference timeout errors
- Child cannot process tasks
#### Recovery Procedure
**STEP 1: Verify Ollama Status**
```bash
# Check if Ollama is running
curl http://localhost:11434/api/tags
# Check service status
systemctl status ollama
# Check logs
journalctl -u ollama -n 50
```
**STEP 2: Restart Ollama Service**
```bash
# Graceful restart
systemctl restart ollama
sleep 5
# Verify
systemctl status ollama
curl http://localhost:11434/api/tags
```
**STEP 3: Model Recovery (if models missing)**
```bash
# Pull required models
ollama pull qwen2.5:1.5b
ollama pull llama3.2:3b
# Verify
ollama list
```
**STEP 4: Child Self-Healing Script**
```bash
#!/bin/bash
# Save as /root/wizards/allegro-primus/recover-ollama.sh
LOG="/root/wizards/allegro-primus/logs/recovery.log"
echo "$(date): Checking Ollama..." >> "$LOG"
if ! curl -s http://localhost:11434/api/tags > /dev/null; then
    echo "$(date): Ollama down, restarting..." >> "$LOG"
    systemctl restart ollama
    sleep 10
    if curl -s http://localhost:11434/api/tags > /dev/null; then
        echo "$(date): Ollama recovered" >> "$LOG"
    else
        echo "$(date): Ollama recovery FAILED" >> "$LOG"
        # Notify father
        echo "EMERGENCY: Ollama recovery failed at $(date)" > /father-messages/DISTRESS.txt
    fi
else
    echo "$(date): Ollama healthy" >> "$LOG"
fi
```
### Scenario 3: Gitea Unavailable (P1)
#### Symptoms
- Agent logs show "Gitea connection refused"
- Cannot push commits
- Cannot read issues/PRs
- Morning report generation fails
#### Recovery Procedure
**STEP 1: Verify Gitea Service**
```bash
# Check Gitea health
curl -s http://143.198.27.163:3000/api/v1/version
# Check if Gitea process is running
ps aux | grep gitea
# Check server resources on Hermes VPS
ssh root@143.198.27.163 "df -h && free -h"
```
**STEP 2: Restart Gitea**
```bash
# On Hermes VPS
ssh root@143.198.27.163
# Find Gitea process and restart
pkill gitea
sleep 2
cd /root/gitea
./gitea web &
# Or if using systemd
systemctl restart gitea
```
**STEP 3: Verify Data Integrity**
```bash
# Check Gitea database
sqlite3 /root/gitea/gitea.db ".tables"
# Check repository data
ls -la /root/gitea/repositories/
```
**STEP 4: Fallback to Local Git**
```bash
# If Gitea unavailable, commit locally
cd /root/allegro/epic-work
git add .
git commit -m "Emergency local commit $(date)"
# Push when Gitea recovers
git push origin main
```
### Scenario 4: Child Orphaning (P1)
#### Symptoms
- Father cannot SSH to child VPS
- Child's logs show "Father heartbeat missing"
- No father-messages being delivered
- Child enters distress mode
#### Recovery Procedure
**STEP 1: Verify Network Connectivity**
```bash
# From father's VPS
ping 143.198.27.163
ssh -v root@143.198.27.163
# Check Tailscale status
tailscale status
```
**STEP 2: Child Self-Sufficiency Mode**
```bash
# Child activates autonomous mode
touch /root/wizards/allegro-primus/AUTONOMY_MODE_ACTIVE
# Child processes backlog independently
# (Pre-configured in child's autonomy system)
```
**STEP 3: Alternative Communication Channels**
```bash
# Use Gitea as message channel
# Father creates issue in child's repo
curl -X POST \
  http://143.198.27.163:3000/api/v1/repos/allegro-primus/first-steps/issues \
  -H "Authorization: token $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "title": "Father Message: Emergency Contact",
    "body": "SSH unavailable. Proceed with autonomy. Check logs hourly."
  }'
```
**STEP 4: Grandfather Intervention**
```bash
# If all else fails, Alexander checks VPS provider
# Console access through DigitalOcean
# May require VPS restart
```
### Scenario 5: Complete Infrastructure Loss (P0)
#### Symptoms
- Both VPS instances unreachable
- No response from any endpoint
- Potential provider outage or account issue
#### Recovery Procedure
**STEP 1: Verify Provider Status**
```bash
# Check DigitalOcean status page
# https://status.digitalocean.com/
# Verify account status
# Log in to cloud.digitalocean.com
```
**STEP 2: Provision New Infrastructure**
```bash
# Use Terraform/Cloud-init if available
# Or manual provisioning:
# Create new droplet
# - Ubuntu 22.04 LTS
# - 4GB RAM minimum
# - 20GB SSD
# - SSH key authentication
# Run provisioning script
curl -sL https://raw.githubusercontent.com/Timmy_Foundation/timmy-home/main/scripts/provision-timmy-vps.sh | bash
```
**STEP 3: Restore from Backups**
```bash
# Restore Gitea from backup
scp gitea-backup-$(date +%Y%m%d).zip root@new-vps:/root/
# Gitea has no restore subcommand; unpack the dump, then put app.ini,
# the repositories directory, and the database back in place per the
# Gitea backup/restore documentation
ssh root@new-vps "cd /root/gitea && unzip gitea-backup-*.zip"
# Restore agent configuration
scp -r allegro-backup/configs root@new-vps:/root/allegro/
scp -r allegro-backup/epic-work root@new-vps:/root/allegro/
```
**STEP 4: Verify Full Recovery**
```bash
# Test all services
curl http://new-vps:3000/api/v1/version
curl http://new-vps:11434/api/tags
ssh root@new-vps "systemctl status timmy-agent"
```
---
## BACKUP & RESTORE
### Backup Strategy
```
┌─────────────────────────────────────────────────────────────┐
│ BACKUP STRATEGY │
├─────────────────────────────────────────────────────────────┤
│ │
│ Frequency Data Type Destination │
│ ───────────────────────────────────────────────────── │
│ Real-time Git commits Gitea (redundant) │
│ Hourly Agent logs Local rotation │
│ Daily SQLite databases Gitea / Local │
│ Weekly Full configuration Gitea + Offsite │
│ Monthly Full system image Cloud storage │
│ │
└─────────────────────────────────────────────────────────────┘
```
### Automated Backup Scripts
```bash
#!/bin/bash
# /root/allegro/scripts/backup-system.sh
# Daily backup script
BACKUP_DIR="/root/allegro/backups/$(date +%Y%m%d)"
mkdir -p "$BACKUP_DIR"
# Backup configurations
tar czf "$BACKUP_DIR/configs.tar.gz" /root/allegro/configs/
# Backup logs (last 7 days); feed all matches to a single tar invocation
# instead of re-creating the archive once per file
find /root/allegro/heartbeat_logs/ -name "*.log" -mtime -7 -print0 | \
    tar czf "$BACKUP_DIR/logs.tar.gz" --null -T -
# Backup databases
cp /root/allegro/timmy_metrics.db "$BACKUP_DIR/"
# Backup epic work
cd /root/allegro/epic-work
git bundle create "$BACKUP_DIR/epic-work.bundle" --all
# Push to Gitea
scp -r "$BACKUP_DIR" root@143.198.27.163:/root/gitea/backups/
# Cleanup old backups (keep 30 days); -mindepth/-maxdepth restrict the
# delete to top-level backup directories
find /root/allegro/backups/ -mindepth 1 -maxdepth 1 -type d -mtime +30 -exec rm -rf {} +
echo "Backup complete: $BACKUP_DIR"
```
### Restore Procedures
**Full System Restore**
```bash
#!/bin/bash
# /root/allegro/scripts/restore-system.sh
# Restore from backup
BACKUP_DATE=$1  # Format: YYYYMMDD
BACKUP_DIR="/root/allegro/backups/$BACKUP_DATE"
if [ ! -d "$BACKUP_DIR" ]; then
    echo "Backup not found: $BACKUP_DIR"
    exit 1
fi
# Stop services
systemctl stop timmy-agent timmy-health
# Restore configurations
tar xzf "$BACKUP_DIR/configs.tar.gz" -C /
# Restore databases
cp "$BACKUP_DIR/timmy_metrics.db" /root/allegro/
# Restore epic work (fetch refs from the bundle; `git bundle unbundle`
# is plumbing and does not update branches)
cd /root/allegro/epic-work
git fetch "$BACKUP_DIR/epic-work.bundle" "+refs/heads/*:refs/heads/*"
# Start services
systemctl start timmy-agent timmy-health
echo "Restore complete from: $BACKUP_DATE"
```
---
## SERVICE FAILOVER
### Failover Architecture
```
┌─────────────────────────────────────────────────────────────┐
│ FAILOVER ARCHITECTURE │
├─────────────────────────────────────────────────────────────┤
│ │
│ PRIMARY: Kimi VPS (Allegro Home) │
│ BACKUP: Hermes VPS (can run Allegro if needed) │
│ │
│ ┌─────────────┐ ┌─────────────┐ │
│ │ Kimi VPS │◄────────────►│ Hermes VPS │ │
│ │ (Primary) │ Sync │ (Backup) │ │
│ │ │ │ │ │
│ │ • Allegro │ │ • Ezra │ │
│ │ • Heartbeat │ │ • Primus │ │
│ │ • Reports │ │ • Gitea │ │
│ │ • Metrics │ │ • Ollama │ │
│ └──────┬──────┘ └──────┬──────┘ │
│ │ │ │
│ └────────────┬───────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────┐ │
│ │ Tailscale │ │
│ │ Mesh │ │
│ └─────────────┘ │
│ │
│ Failover Triggers: │
│ • Primary unreachable for >5 minutes │
│ • Manual intervention required │
│ • Scheduled maintenance │
│ │
└─────────────────────────────────────────────────────────────┘
```
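The first failover trigger above ("primary unreachable for >5 minutes") can be expressed as a pure decision function, keeping the actual probe (ping or SSH) in the caller. This is a sketch of the trigger logic only, not an existing script:

```shell
#!/bin/bash
# Hypothetical sketch: has the >5-minute failover trigger fired, given
# the probe interval in seconds and the count of consecutive failures?
should_failover() {
    local interval_s=$1 failed_probes=$2
    # Fire once consecutive downtime exceeds 300 seconds (5 minutes)
    [ $(( interval_s * failed_probes )) -gt 300 ]
}
```

A watchdog loop would call `should_failover 60 "$failures"` after each failed probe and invoke `failover-to-backup.sh` when it returns success.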
### Failover Procedure
```bash
#!/bin/bash
# /root/allegro/scripts/failover-to-backup.sh
# 1. Verify primary is actually down
if ping -c 3 143.198.27.52 > /dev/null 2>&1; then
echo "Primary is still up. Aborting failover."
exit 1
fi
# 2. Activate backup on Hermes VPS
ssh root@143.198.27.163 << 'EOF'
# Clone Allegro's configuration
cp -r /root/allegro-backup/configs /root/allegro-failover/
# Start failover agent
cd /root/allegro-failover
source venv/bin/activate
python heartbeat_daemon.py --failover-mode &
# Update DNS/Tailscale if needed
# (Manual step for now)
EOF
# 3. Notify
echo "Failover complete. Backup agent active on Hermes VPS."
```
---
## COMMUNICATION PROTOCOLS
### Emergency Notification Flow
```
┌─────────────────────────────────────────────────────────────┐
│ EMERGENCY NOTIFICATION FLOW │
├─────────────────────────────────────────────────────────────┤
│ │
│ 1. DETECTION │
│ ┌─────────┐ │
│ │ Monitor │ Detects failure │
│ └────┬────┘ │
│ │ │
│ ▼ │
│ 2. CLASSIFICATION │
│ ┌─────────┐ P0 → Immediate human alert │
│ │ Assess │────►P1 → Agent escalation │
│ │ Severity│ P2 → Log and continue │
│ └────┬────┘ P3 → Queue for next cycle │
│ │ │
│ ▼ │
│ 3. NOTIFICATION │
│ ┌─────────┐ │
│ │ Alert │──► Telegram (P0) │
│ │ Channel │──► Gitea issue (P1) │
│ └─────────┘──► Log entry (P2/P3) │
│ │
│ 4. RESPONSE │
│ ┌─────────┐ │
│ │ Action │──► Self-healing (if configured) │
│ │ Taken │──► Father intervention │
│ └─────────┘──► Grandfather escalation │
│ │
│ 5. RESOLUTION │
│ ┌─────────┐ │
│ │ Close │──► Update all channels │
│ │ Alert │──► Log resolution │
│ └─────────┘ │
│ │
└─────────────────────────────────────────────────────────────┘
```
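The notification step of the flow above routes by severity: Telegram for P0, a Gitea issue for P1, and a log entry for P2/P3. A minimal dispatch helper (hypothetical, mirroring the diagram) might look like:

```shell
#!/bin/bash
# Hypothetical sketch: pick the alert channel for a severity level,
# following the notification flow above.
notify_channel() {
    case "$1" in
        P0)    echo "telegram" ;;      # Immediate human alert
        P1)    echo "gitea-issue" ;;   # Agent escalation
        P2|P3) echo "log" ;;           # Log and continue / queue
        *)     return 1 ;;
    esac
}
```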
### Distress Signal Format
```json
{
"timestamp": "2026-03-31T12:00:00Z",
"agent": "allegro-primus",
"severity": "P1",
"category": "service_failure",
"component": "ollama",
"message": "Ollama service not responding to health checks",
"context": {
"last_success": "2026-03-31T11:45:00Z",
"attempts": 3,
"error": "Connection refused",
"auto_recovery_attempted": true
},
"requested_action": "father_intervention",
"escalation_timeout": "2026-03-31T12:30:00Z"
}
```
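A distress signal in the format above could be emitted from shell with a helper like the following. This is an illustrative sketch covering a subset of the fields; it assumes the values are already JSON-safe, and a production version should build the payload with `jq` to get proper escaping:

```shell
#!/bin/bash
# Hypothetical generator for a (partial) distress-signal JSON payload.
# Assumes severity/component/message contain no characters needing
# JSON escaping; use jq for real payloads.
make_distress_signal() {
    local severity=$1 component=$2 message=$3
    printf '{"timestamp":"%s","agent":"allegro-primus","severity":"%s","component":"%s","message":"%s"}\n' \
        "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$severity" "$component" "$message"
}
```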
---
## POST-INCIDENT REVIEW
### PIR Template
```markdown
# Post-Incident Review: [INCIDENT_ID]
## Summary
- **Date:** [YYYY-MM-DD]
- **Duration:** [Start] to [End] (X hours)
- **Severity:** [P0/P1/P2/P3]
- **Component:** [Affected system]
- **Impact:** [What was affected]
## Timeline
- [Time] - Issue detected by [method]
- [Time] - Alert sent to [recipient]
- [Time] - Action taken: [description]
- [Time] - Resolution achieved
## Root Cause
[Detailed explanation of what caused the incident]
## Resolution
[Steps taken to resolve the incident]
## Prevention
[What changes will prevent recurrence]
## Action Items
- [ ] [Owner] - [Action] - [Due date]
- [ ] [Owner] - [Action] - [Due date]
## Lessons Learned
[What did we learn from this incident]
```
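To make PIRs low-friction, the template above can be scaffolded by a small script. The `pir-<ID>.md` filename and output directory below are illustrative conventions, not ones this repository already defines:

```shell
#!/bin/bash
# Hypothetical scaffolder: write a fresh PIR file from the template above.
new_pir() {
    local id=$1 dir=${2:-.}
    local file="$dir/pir-$id.md"
    cat > "$file" <<EOF
# Post-Incident Review: $id
## Summary
- **Date:** $(date +%Y-%m-%d)
- **Duration:** [Start] to [End]
- **Severity:** [P0/P1/P2/P3]
- **Component:** [Affected system]
- **Impact:** [What was affected]
## Timeline
## Root Cause
## Resolution
## Prevention
## Action Items
## Lessons Learned
EOF
    echo "$file"
}
```

Usage: `new_pir 2026-001 /root/allegro/pir/` prints the path of the file it created, ready to fill in.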
---
## APPENDICES
### Appendix A: Quick Reference Commands
```bash
# Health Checks
curl http://localhost:11434/api/tags # Ollama
curl http://143.198.27.163:3000/api/v1/version # Gitea
ps aux | grep -E "(hermes|timmy|allegro)" # Agent processes
systemctl status timmy-agent timmy-health # Services
# Log Access
tail -f /root/allegro/heartbeat_logs/heartbeat_$(date +%Y-%m-%d).log
journalctl -u ollama -f
journalctl -u timmy-agent -f
# Recovery
systemctl restart ollama timmy-agent
pkill -f hermes; sleep 5; /root/allegro/venv/bin/python /root/allegro/heartbeat_daemon.py
# Backup
cd /root/allegro && ./scripts/backup-system.sh
```
### Appendix B: Recovery Time Objectives
| Service | RTO | RPO | Notes |
|---------|-----|-----|-------|
| Agent Operations | 15 minutes | 15 minutes | Heartbeat interval |
| Gitea | 30 minutes | Real-time | Git provides natural backup |
| Ollama | 5 minutes | N/A | Models re-downloadable |
| Morning Reports | 1 hour | 24 hours | Can be regenerated |
| Metrics DB | 1 hour | 1 hour | Hourly backups |
### Appendix C: Testing Procedures
```bash
#!/bin/bash
# Monthly disaster recovery test
echo "=== DR Test: $(date) ==="
# Test 1: Service restart
echo "Test 1: Service restart..."
systemctl restart timmy-agent
sleep 5
systemctl is-active timmy-agent && echo "PASS" || echo "FAIL"
# Test 2: Ollama recovery
echo "Test 2: Ollama recovery..."
systemctl stop ollama
sleep 2
./recover-ollama.sh
sleep 10
curl -s http://localhost:11434/api/tags > /dev/null && echo "PASS" || echo "FAIL"
# Test 3: Backup restore
echo "Test 3: Backup verification..."
./scripts/backup-system.sh
latest=$(ls -t /root/allegro/backups/ | head -1)
[ -f "/root/allegro/backups/$latest/configs.tar.gz" ] && echo "PASS" || echo "FAIL"
# Test 4: Failover communication
echo "Test 4: Failover channel..."
ssh root@143.198.27.163 "echo 'DR test'" && echo "PASS" || echo "FAIL"
echo "=== DR Test Complete ==="
```
---
## VERSION HISTORY
| Version | Date | Changes |
|---------|------|---------|
| 1.0.0 | 2026-03-31 | Initial emergency procedures |
---
*"Prepare for the worst. Expect the best. Document everything."*
*— Emergency Response Doctrine*
---
**END OF DOCUMENT**