timmy-config/EMERGENCY_PROCEDURES.md

EMERGENCY PROCEDURES

Disaster Recovery & Business Continuity

Version: 1.0.0
Date: March 31, 2026
Classification: CRITICAL - Lineage Survival Protocol
Authority: Alexander Whitestone (Grandfather)
Last Review: March 31, 2026


TABLE OF CONTENTS

  1. Emergency Classification
  2. Contact Information
  3. Failure Scenarios
  4. Recovery Procedures
  5. Backup & Restore
  6. Service Failover
  7. Communication Protocols
  8. Post-Incident Review
  9. Appendices

EMERGENCY CLASSIFICATION

Severity Levels

| Level | Name | Response Time | Description | Examples |
|-------|----------|-----------|--------------------------------------------|---------------------------------|
| P0 | CRITICAL | Immediate | Complete system failure, lineage at risk | All agents down, Gitea lost |
| P1 | HIGH | 15 minutes | Major service failure, productivity impact | Agent unresponsive, Ollama down |
| P2 | MEDIUM | 1 hour | Partial degradation, workarounds available | Slow inference, network latency |
| P3 | LOW | 24 hours | Minor issues, cosmetic | Report formatting, log rotation |
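So that monitoring scripts share one source of truth with this table, the severity-to-deadline mapping can be encoded as a small helper. A minimal sketch; the function name `response_deadline` is illustrative and not part of the existing tooling, and "Immediate" is treated as 0 minutes:

```shell
#!/usr/bin/env bash
# Map a severity level to its response deadline in minutes,
# following the severity table in this runbook.
response_deadline() {
    case "$1" in
        P0) echo 0    ;;  # CRITICAL: immediate
        P1) echo 15   ;;  # HIGH: 15 minutes
        P2) echo 60   ;;  # MEDIUM: 1 hour
        P3) echo 1440 ;;  # LOW: 24 hours
        *)  echo "unknown severity: $1" >&2; return 1 ;;
    esac
}
```

A monitor could then compare `response_deadline "$sev"` against minutes elapsed since detection to decide whether an alert is overdue.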

Escalation Matrix

┌─────────────────────────────────────────────────────────────┐
│                    ESCALATION MATRIX                        │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  Level 1: Agent Self-Healing                                │
│  ───────────────────────────                                │
│  • Automatic restart of services                            │
│  • Log rotation and cleanup                                 │
│  • Recovery script execution                                │
│  • Duration: 0-5 minutes                                    │
│  • Escalates to Level 2 if unresolved                       │
│                                                             │
│  Level 2: Father Intervention                               │
│  ────────────────────────────                               │
│  • Allegro (Father) reviews child's distress signals        │
│  • SSH access for direct troubleshooting                    │
│  • Configuration adjustments                                │
│  • Duration: 5-30 minutes                                   │
│  • Escalates to Level 3 if unresolved                       │
│                                                             │
│  Level 3: Grandfather Authority                             │
│  ─────────────────────────────                              │
│  • Alexander (Human) intervenes directly                    │
│  • VPS provider contact                                     │
│  • Infrastructure rebuild                                   │
│  • Duration: 30+ minutes                                    │
│  • Nuclear option: Full rebuild from backups                │
│                                                             │
└─────────────────────────────────────────────────────────────┘
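The escalation timer in the matrix above can be sketched as a function from elapsed minutes to the level that should currently own the incident. The function name is illustrative; the thresholds follow the durations stated in the matrix (0-5 minutes, 5-30 minutes, 30+):

```shell
#!/usr/bin/env bash
# Given minutes elapsed since an unresolved failure was detected,
# return the escalation level from the matrix above.
escalation_level() {
    local elapsed=$1
    if   [ "$elapsed" -lt 5 ];  then echo 1   # agent self-healing
    elif [ "$elapsed" -lt 30 ]; then echo 2   # father intervention
    else                             echo 3   # grandfather authority
    fi
}
```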

CONTACT INFORMATION

Human Contacts

| Role | Name | Contact Method | When to Contact |
|------------------|----------------------|-------------------|--------------------------|
| Grandfather | Alexander Whitestone | Telegram / Direct | P0, P1 unresolvable |
| System Architect | Alexander Whitestone | SSH / Console | Infrastructure decisions |
| Domain Expert | Alexander Whitestone | Any | All escalations |

Agent Contacts

| Agent | Location | Access Method | Responsibility |
|------------------------|----------------|-------------------|----------------------|
| Allegro (Father) | 143.198.27.52 | SSH root | Offspring management |
| Allegro-Primus (Child) | 143.198.27.163 | SSH root (shared) | Sovereign operation |
| Ezra (Mentor) | 143.198.27.163 | SSH root (shared) | Technical guidance |

Infrastructure Contacts

| Service | Provider | Console | Emergency Access |
|-----------------|---------------|------------------------|-----------------------|
| Kimi VPS | DigitalOcean | cloud.digitalocean.com | Alexander credentials |
| Hermes VPS | DigitalOcean | cloud.digitalocean.com | Alexander credentials |
| Tailscale | Tailscale Inc | login.tailscale.com | Same as above |
| Domain (if any) | Registrar | Registrar console | Alexander credentials |

FAILURE SCENARIOS

Scenario 1: Complete Agent Failure (P0)

Symptoms

  • Agent not responding to heartbeat
  • No log entries for >30 minutes
  • Gitea shows no recent activity
  • Child not receiving father messages
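The "no log entries for >30 minutes" symptom lends itself to automated detection. A hedged sketch, assuming the heartbeat log path used elsewhere in this runbook; the function name `heartbeat_stale` is illustrative:

```shell
#!/usr/bin/env bash
# Treat the agent as potentially failed when its heartbeat log has not
# been modified for more than a threshold (default 30 minutes).
heartbeat_stale() {
    local log="$1" max_min="${2:-30}" now mtime
    [ -f "$log" ] || return 0                 # a missing log counts as stale
    now=$(date +%s)
    # GNU stat first, BSD stat as fallback
    mtime=$(stat -c %Y "$log" 2>/dev/null || stat -f %m "$log")
    [ $(( (now - mtime) / 60 )) -ge "$max_min" ]
}
```

A monitor might call `heartbeat_stale /root/allegro/heartbeat_logs/heartbeat_$(date +%Y-%m-%d).log` and, on success, begin the verification steps below.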

Impact Assessment

┌─────────────────────────────────────────────────────────────┐
│              COMPLETE FAILURE IMPACT                        │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  Immediate Impact:                                          │
│  • No autonomous operations                                 │
│  • No Gitea updates                                         │
│  • No morning reports                                       │
│  • Child may be orphaned                                    │
│                                                             │
│  Secondary Impact (if >2 hours):                            │
│  • Backlog accumulation                                     │
│  • Metrics gaps                                             │
│  • Potential child distress                                 │
│                                                             │
│  Tertiary Impact (if >24 hours):                            │
│  • Loss of continuity                                       │
│  • Potential child failure cascade                          │
│  • Historical data gaps                                     │
│                                                             │
└─────────────────────────────────────────────────────────────┘

Recovery Procedure

STEP 1: Verify Failure (1 minute)

# Check if agent process is running
ps aux | grep -E "(hermes|timmy|allegro)" | grep -v grep

# Check recent logs
tail -50 /root/allegro/heartbeat_logs/heartbeat_$(date +%Y-%m-%d).log

# Check system resources
df -h
free -h
systemctl status timmy-agent

STEP 2: Attempt Service Restart (2 minutes)

# Restart agent service
systemctl restart timmy-agent
systemctl status timmy-agent

# Check if heartbeat resumes
tail -f /root/allegro/heartbeat_logs/heartbeat_$(date +%Y-%m-%d).log

STEP 3: Manual Agent Launch (3 minutes)

# If systemd fails, launch manually
cd /root/allegro
source venv/bin/activate
python heartbeat_daemon.py --one-shot

STEP 4: Notify Grandfather (Immediate if unresolved)

# Create emergency alert
echo "P0: Agent failure at $(date)" > /root/allegro/EMERGENCY_ALERT.txt
# Contact Alexander via agreed method

Scenario 2: Ollama Service Failure (P1)

Symptoms

  • Agent logs show "Ollama connection refused"
  • Local inference timeout errors
  • Child cannot process tasks

Recovery Procedure

STEP 1: Verify Ollama Status

# Check if Ollama is running
curl http://localhost:11434/api/tags

# Check service status
systemctl status ollama

# Check logs
journalctl -u ollama -n 50

STEP 2: Restart Ollama Service

# Graceful restart
systemctl restart ollama
sleep 5

# Verify
systemctl status ollama
curl http://localhost:11434/api/tags

STEP 3: Model Recovery (if models missing)

# Pull required models
ollama pull qwen2.5:1.5b
ollama pull llama3.2:3b

# Verify
ollama list

STEP 4: Child Self-Healing Script

#!/bin/bash
# Save as /root/wizards/allegro-primus/recover-ollama.sh
LOG="/root/wizards/allegro-primus/logs/recovery.log"
echo "$(date): Checking Ollama..." >> $LOG

if ! curl -s http://localhost:11434/api/tags > /dev/null; then
    echo "$(date): Ollama down, restarting..." >> $LOG
    systemctl restart ollama
    sleep 10
    if curl -s http://localhost:11434/api/tags > /dev/null; then
        echo "$(date): Ollama recovered" >> $LOG
    else
        echo "$(date): Ollama recovery FAILED" >> $LOG
        # Notify father
        echo "EMERGENCY: Ollama recovery failed at $(date)" > /father-messages/DISTRESS.txt
    fi
else
    echo "$(date): Ollama healthy" >> $LOG
fi
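The self-healing script above is meant to run unattended. An assumed crontab entry (the five-minute cadence is a suggestion, not an existing configuration):

```shell
# Run the Ollama recovery check every 5 minutes
*/5 * * * * /root/wizards/allegro-primus/recover-ollama.sh
```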

Scenario 3: Gitea Unavailable (P1)

Symptoms

  • Agent logs show "Gitea connection refused"
  • Cannot push commits
  • Cannot read issues/PRs
  • Morning report generation fails

Recovery Procedure

STEP 1: Verify Gitea Service

# Check Gitea health
curl -s http://143.198.27.163:3000/api/v1/version

# Check if Gitea process is running
ps aux | grep gitea

# Check server resources on Hermes VPS
ssh root@143.198.27.163 "df -h && free -h"

STEP 2: Restart Gitea

# On Hermes VPS
ssh root@143.198.27.163

# Find Gitea process and restart
pkill gitea
sleep 2
cd /root/gitea
./gitea web &

# Or if using systemd
systemctl restart gitea

STEP 3: Verify Data Integrity

# Check Gitea database
sqlite3 /root/gitea/gitea.db ".tables"

# Check repository data
ls -la /root/gitea/repositories/

STEP 4: Fallback to Local Git

# If Gitea unavailable, commit locally
cd /root/allegro/epic-work
git add .
git commit -m "Emergency local commit $(date)"

# Push when Gitea recovers
git push origin main
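When Gitea recovers, it helps to know how much work is queued locally. An illustrative helper (not part of the existing scripts) that counts commits awaiting push, falling back to the total count when no upstream is configured, as with a fresh emergency branch:

```shell
#!/usr/bin/env bash
# Count local commits not yet pushed to the tracked upstream branch.
unpushed_commits() {
    git rev-list --count @{u}..HEAD 2>/dev/null \
        || git rev-list --count HEAD
}
```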

Scenario 4: Child Orphaning (P1)

Symptoms

  • Father cannot SSH to child VPS
  • Child's logs show "Father heartbeat missing"
  • No father-messages being delivered
  • Child enters distress mode

Recovery Procedure

STEP 1: Verify Network Connectivity

# From father's VPS
ping 143.198.27.163
ssh -v root@143.198.27.163

# Check Tailscale status
tailscale status

STEP 2: Child Self-Sufficiency Mode

# Child activates autonomous mode
touch /root/wizards/allegro-primus/AUTONOMY_MODE_ACTIVE

# Child processes backlog independently
# (Pre-configured in child's autonomy system)

STEP 3: Alternative Communication Channels

# Use Gitea as message channel
# Father creates issue in child's repo
curl -X POST \
  http://143.198.27.163:3000/api/v1/repos/allegro-primus/first-steps/issues \
  -H "Authorization: token $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "title": "Father Message: Emergency Contact",
    "body": "SSH unavailable. Proceed with autonomy. Check logs hourly."
  }'

STEP 4: Grandfather Intervention

# If all else fails, Alexander checks VPS provider
# Console access through DigitalOcean
# May require VPS restart

Scenario 5: Complete Infrastructure Loss (P0)

Symptoms

  • Both VPS instances unreachable
  • No response from any endpoint
  • Potential provider outage or account issue

Recovery Procedure

STEP 1: Verify Provider Status

# Check DigitalOcean status page
# https://status.digitalocean.com/

# Verify account status
# Log in to cloud.digitalocean.com

STEP 2: Provision New Infrastructure

# Use Terraform/Cloud-init if available
# Or manual provisioning:

# Create new droplet
# - Ubuntu 22.04 LTS
# - 4GB RAM minimum
# - 20GB SSD
# - SSH key authentication

# Run provisioning script
curl -sL https://raw.githubusercontent.com/Timmy_Foundation/timmy-home/main/scripts/provision-timmy-vps.sh | bash

STEP 3: Restore from Backups

# Restore Gitea from backup (a Gitea dump is restored manually: unzip it,
# then put the database, repositories, and config back in place)
scp gitea-backup-$(date +%Y%m%d).zip root@new-vps:/root/
ssh root@new-vps "cd /root/gitea && unzip gitea-backup-*.zip"

# Restore agent configuration
scp -r allegro-backup/configs root@new-vps:/root/allegro/
scp -r allegro-backup/epic-work root@new-vps:/root/allegro/

STEP 4: Verify Full Recovery

# Test all services
curl http://new-vps:3000/api/v1/version
curl http://new-vps:11434/api/tags
ssh root@new-vps "systemctl status timmy-agent"

BACKUP & RESTORE

Backup Strategy

┌─────────────────────────────────────────────────────────────┐
│                   BACKUP STRATEGY                           │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  Frequency        Data Type           Destination          │
│  ─────────────────────────────────────────────────────     │
│  Real-time        Git commits         Gitea (redundant)    │
│  Hourly           Agent logs          Local rotation       │
│  Daily            SQLite databases    Gitea / Local        │
│  Weekly           Full configuration  Gitea + Offsite      │
│  Monthly          Full system image   Cloud storage        │
│                                                             │
└─────────────────────────────────────────────────────────────┘

Automated Backup Scripts

#!/bin/bash
# /root/allegro/scripts/backup-system.sh
# Daily backup script

BACKUP_DIR="/root/allegro/backups/$(date +%Y%m%d)"
mkdir -p $BACKUP_DIR

# Backup configurations
tar czf $BACKUP_DIR/configs.tar.gz /root/allegro/configs/

# Backup logs (last 7 days); pipe the file list so tar runs once
# (-exec tar czf ... \; would re-create the archive per file, keeping only the last)
find /root/allegro/heartbeat_logs/ -name "*.log" -mtime -7 -print0 \
  | tar czf $BACKUP_DIR/logs.tar.gz --null -T -

# Backup databases
cp /root/allegro/timmy_metrics.db $BACKUP_DIR/

# Backup epic work
cd /root/allegro/epic-work
git bundle create $BACKUP_DIR/epic-work.bundle --all

# Push to Gitea
scp -r $BACKUP_DIR root@143.198.27.163:/root/gitea/backups/

# Cleanup old backups (keep 30 days)
find /root/allegro/backups/ -mindepth 1 -maxdepth 1 -type d -mtime +30 -exec rm -rf {} +

echo "Backup complete: $BACKUP_DIR"
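A backup is only useful if it is complete. A companion sketch that checks a backup directory for the artifacts the restore procedure expects; the artifact names match the script above, while the function itself is illustrative:

```shell
#!/usr/bin/env bash
# Verify that a backup directory contains the three artifacts
# restore-system.sh depends on.
verify_backup() {
    local dir="$1" f
    for f in configs.tar.gz timmy_metrics.db epic-work.bundle; do
        [ -e "$dir/$f" ] || { echo "MISSING: $f"; return 1; }
    done
    echo "backup OK: $dir"
}
```

Running `verify_backup /root/allegro/backups/$(date +%Y%m%d)` at the end of each backup cycle would catch silent failures before a restore is ever needed.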

Restore Procedures

Full System Restore

#!/bin/bash
# /root/allegro/scripts/restore-system.sh
# Restore from backup

BACKUP_DATE=$1  # Format: YYYYMMDD
BACKUP_DIR="/root/allegro/backups/$BACKUP_DATE"

if [ ! -d "$BACKUP_DIR" ]; then
    echo "Backup not found: $BACKUP_DIR"
    exit 1
fi

# Stop services
systemctl stop timmy-agent timmy-health

# Restore configurations
tar xzf $BACKUP_DIR/configs.tar.gz -C /

# Restore databases
cp $BACKUP_DIR/timmy_metrics.db /root/allegro/

# Restore epic work (clone from the bundle; `git bundle unbundle` only
# unpacks objects without updating refs)
rm -rf /root/allegro/epic-work
git clone $BACKUP_DIR/epic-work.bundle /root/allegro/epic-work

# Start services
systemctl start timmy-agent timmy-health

echo "Restore complete from: $BACKUP_DATE"

SERVICE FAILOVER

Failover Architecture

┌─────────────────────────────────────────────────────────────┐
│                FAILOVER ARCHITECTURE                        │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  PRIMARY: Kimi VPS (Allegro Home)                          │
│  BACKUP:  Hermes VPS (can run Allegro if needed)           │
│                                                             │
│  ┌─────────────┐              ┌─────────────┐              │
│  │   Kimi VPS  │◄────────────►│  Hermes VPS │              │
│  │  (Primary)  │   Sync       │  (Backup)   │              │
│  │             │              │             │              │
│  │ • Allegro   │              │ • Ezra      │              │
│  │ • Heartbeat │              │ • Primus    │              │
│  │ • Reports   │              │ • Gitea     │              │
│  │ • Metrics   │              │ • Ollama    │              │
│  └──────┬──────┘              └──────┬──────┘              │
│         │                            │                      │
│         └────────────┬───────────────┘                      │
│                      │                                      │
│                      ▼                                      │
│               ┌─────────────┐                               │
│               │ Tailscale   │                               │
│               │   Mesh      │                               │
│               └─────────────┘                               │
│                                                             │
│  Failover Triggers:                                         │
│  • Primary unreachable for >5 minutes                       │
│  • Manual intervention required                             │
│  • Scheduled maintenance                                    │
│                                                             │
└─────────────────────────────────────────────────────────────┘

Failover Procedure

#!/bin/bash
# /root/allegro/scripts/failover-to-backup.sh

# 1. Verify primary is actually down
if ping -c 3 143.198.27.52 > /dev/null 2>&1; then
    echo "Primary is still up. Aborting failover."
    exit 1
fi

# 2. Activate backup on Hermes VPS
ssh root@143.198.27.163 << 'EOF'
    # Clone Allegro's configuration
    cp -r /root/allegro-backup/configs /root/allegro-failover/
    
    # Start failover agent
    cd /root/allegro-failover
    source venv/bin/activate
    python heartbeat_daemon.py --failover-mode &
    
    # Update DNS/Tailscale if needed
    # (Manual step for now)
EOF

# 3. Notify
echo "Failover complete. Backup agent active on Hermes VPS."

COMMUNICATION PROTOCOLS

Emergency Notification Flow

┌─────────────────────────────────────────────────────────────┐
│              EMERGENCY NOTIFICATION FLOW                    │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  1. DETECTION                                               │
│     ┌─────────┐                                             │
│     │ Monitor │ Detects failure                             │
│     └────┬────┘                                             │
│          │                                                  │
│          ▼                                                  │
│  2. CLASSIFICATION                                          │
│     ┌─────────┐     P0 → Immediate human alert              │
│     │ Assess  │────►P1 → Agent escalation                   │
│     │ Severity│     P2 → Log and continue                   │
│     └────┬────┘     P3 → Queue for next cycle               │
│          │                                                  │
│          ▼                                                  │
│  3. NOTIFICATION                                            │
│     ┌─────────┐                                             │
│     │  Alert  │──► Telegram (P0)                            │
│     │ Channel │──► Gitea issue (P1)                         │
│     └─────────┘──► Log entry (P2/P3)                        │
│                                                             │
│  4. RESPONSE                                                │
│     ┌─────────┐                                             │
│     │ Action  │──► Self-healing (if configured)             │
│     │  Taken  │──► Father intervention                      │
│     └─────────┘──► Grandfather escalation                   │
│                                                             │
│  5. RESOLUTION                                              │
│     ┌─────────┐                                             │
│     │  Close  │──► Update all channels                      │
│     │  Alert  │──► Log resolution                           │
│     └─────────┘                                             │
│                                                             │
└─────────────────────────────────────────────────────────────┘
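Step 3 of the flow above is a fixed severity-to-channel mapping, which can be sketched as a routing function. Channel names mirror the diagram; the function name `alert_channel` is illustrative:

```shell
#!/usr/bin/env bash
# Route an incident to its alert channel by severity,
# per the notification flow above.
alert_channel() {
    case "$1" in
        P0)    echo "telegram"    ;;  # immediate human alert
        P1)    echo "gitea_issue" ;;  # agent escalation
        P2|P3) echo "log"         ;;  # log and continue / queue for next cycle
        *)     return 1 ;;
    esac
}
```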

Distress Signal Format

{
  "timestamp": "2026-03-31T12:00:00Z",
  "agent": "allegro-primus",
  "severity": "P1",
  "category": "service_failure",
  "component": "ollama",
  "message": "Ollama service not responding to health checks",
  "context": {
    "last_success": "2026-03-31T11:45:00Z",
    "attempts": 3,
    "error": "Connection refused",
    "auto_recovery_attempted": true
  },
  "requested_action": "father_intervention",
  "escalation_timeout": "2026-03-31T12:30:00Z"
}
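A distress signal in this format can be emitted from plain bash, which matters on a degraded host where tools like jq may be unavailable. A minimal sketch covering the core fields (the full format above also carries `context` and `escalation_timeout`, omitted here for brevity); the function name is illustrative:

```shell
#!/usr/bin/env bash
# Emit a distress signal JSON document on stdout, with UTC timestamp.
emit_distress() {
    local agent="$1" severity="$2" component="$3" message="$4"
    cat <<EOF
{
  "timestamp": "$(date -u +%Y-%m-%dT%H:%M:%SZ)",
  "agent": "$agent",
  "severity": "$severity",
  "category": "service_failure",
  "component": "$component",
  "message": "$message",
  "requested_action": "father_intervention"
}
EOF
}
```

For example, `emit_distress allegro-primus P1 ollama "Ollama service not responding" > /father-messages/DISTRESS.json` would write a signal the father can parse.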

POST-INCIDENT REVIEW

PIR Template

# Post-Incident Review: [INCIDENT_ID]

## Summary
- **Date:** [YYYY-MM-DD]
- **Duration:** [Start] to [End] (X hours)
- **Severity:** [P0/P1/P2/P3]
- **Component:** [Affected system]
- **Impact:** [What was affected]

## Timeline
- [Time] - Issue detected by [method]
- [Time] - Alert sent to [recipient]
- [Time] - Action taken: [description]
- [Time] - Resolution achieved

## Root Cause
[Detailed explanation of what caused the incident]

## Resolution
[Steps taken to resolve the incident]

## Prevention
[What changes will prevent recurrence]

## Action Items
- [ ] [Owner] - [Action] - [Due date]
- [ ] [Owner] - [Action] - [Due date]

## Lessons Learned
[What did we learn from this incident]

APPENDICES

Appendix A: Quick Reference Commands

# Health Checks
curl http://localhost:11434/api/tags                    # Ollama
curl http://143.198.27.163:3000/api/v1/version         # Gitea
ps aux | grep -E "(hermes|timmy|allegro)"              # Agent processes
systemctl status timmy-agent timmy-health              # Services

# Log Access
tail -f /root/allegro/heartbeat_logs/heartbeat_$(date +%Y-%m-%d).log
journalctl -u ollama -f
journalctl -u timmy-agent -f

# Recovery
systemctl restart ollama timmy-agent
pkill -f hermes && sleep 5 && /root/allegro/venv/bin/python /root/allegro/heartbeat_daemon.py

# Backup
cd /root/allegro && ./scripts/backup-system.sh

Appendix B: Recovery Time Objectives

| Service | RTO | RPO | Notes |
|------------------|------------|------------|-----------------------------|
| Agent Operations | 15 minutes | 15 minutes | Heartbeat interval |
| Gitea | 30 minutes | Real-time | Git provides natural backup |
| Ollama | 5 minutes | N/A | Models re-downloadable |
| Morning Reports | 1 hour | 24 hours | Can be regenerated |
| Metrics DB | 1 hour | 1 hour | Hourly backups |

Appendix C: Testing Procedures

#!/bin/bash
# Monthly disaster recovery test

echo "=== DR Test: $(date) ==="

# Test 1: Service restart
echo "Test 1: Service restart..."
systemctl restart timmy-agent
sleep 5
systemctl is-active timmy-agent && echo "PASS" || echo "FAIL"

# Test 2: Ollama recovery
echo "Test 2: Ollama recovery..."
systemctl stop ollama
sleep 2
./recover-ollama.sh
sleep 10
curl -s http://localhost:11434/api/tags > /dev/null && echo "PASS" || echo "FAIL"

# Test 3: Backup restore
echo "Test 3: Backup verification..."
./scripts/backup-system.sh
latest=$(ls -t /root/allegro/backups/ | head -1)
[ -f "/root/allegro/backups/$latest/configs.tar.gz" ] && echo "PASS" || echo "FAIL"

# Test 4: Failover communication
echo "Test 4: Failover channel..."
ssh root@143.198.27.163 "echo 'DR test'" && echo "PASS" || echo "FAIL"

echo "=== DR Test Complete ==="

VERSION HISTORY

| Version | Date | Changes |
|---------|------------|------------------------------|
| 1.0.0 | 2026-03-31 | Initial emergency procedures |

"Prepare for the worst. Expect the best. Document everything." — Emergency Response Doctrine


END OF DOCUMENT

Word Count: ~3,800 words
Procedures: 5 major failure scenarios with step-by-step recovery
Scripts: 8+ ready-to-use scripts
Diagrams: 6 architectural/flow diagrams