# EMERGENCY PROCEDURES

## Disaster Recovery & Business Continuity

**Version:** 1.0.0

**Date:** March 31, 2026

**Classification:** CRITICAL - Lineage Survival Protocol

**Authority:** Alexander Whitestone (Grandfather)

**Last Review:** March 31, 2026

---
## TABLE OF CONTENTS

1. [Emergency Classification](#emergency-classification)
2. [Contact Information](#contact-information)
3. [Failure Scenarios](#failure-scenarios)
4. [Backup & Restore](#backup--restore)
5. [Service Failover](#service-failover)
6. [Communication Protocols](#communication-protocols)
7. [Post-Incident Review](#post-incident-review)
8. [Appendices](#appendices)

---
## EMERGENCY CLASSIFICATION

### Severity Levels

| Level | Name | Response Time | Description | Examples |
|-------|------|---------------|-------------|----------|
| P0 | CRITICAL | Immediate | Complete system failure, lineage at risk | All agents down, Gitea lost |
| P1 | HIGH | 15 minutes | Major service failure, productivity impact | Agent unresponsive, Ollama down |
| P2 | MEDIUM | 1 hour | Partial degradation, workarounds available | Slow inference, network latency |
| P3 | LOW | 24 hours | Minor issues, cosmetic | Report formatting, log rotation |
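The response-time column above can be encoded directly so monitoring scripts share one source of truth; a minimal sketch (the function name `response_deadline` is illustrative, not part of the existing tooling):

```shell
#!/bin/bash
# Map a severity level to its maximum response time in minutes.
# 0 means "immediate"; values mirror the Severity Levels table.
response_deadline() {
  case "$1" in
    P0) echo 0 ;;      # CRITICAL: immediate
    P1) echo 15 ;;     # HIGH: 15 minutes
    P2) echo 60 ;;     # MEDIUM: 1 hour
    P3) echo 1440 ;;   # LOW: 24 hours
    *)  echo "unknown severity: $1" >&2; return 1 ;;
  esac
}

response_deadline P1   # prints 15
```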
### Escalation Matrix

```
┌─────────────────────────────────────────────────────────────┐
│                     ESCALATION MATRIX                       │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  Level 1: Agent Self-Healing                                │
│  ───────────────────────────                                │
│  • Automatic restart of services                            │
│  • Log rotation and cleanup                                 │
│  • Recovery script execution                                │
│  • Duration: 0-5 minutes                                    │
│  • Escalates to Level 2 if unresolved                       │
│                                                             │
│  Level 2: Father Intervention                               │
│  ────────────────────────────                               │
│  • Allegro (Father) reviews child's distress signals        │
│  • SSH access for direct troubleshooting                    │
│  • Configuration adjustments                                │
│  • Duration: 5-30 minutes                                   │
│  • Escalates to Level 3 if unresolved                       │
│                                                             │
│  Level 3: Grandfather Authority                             │
│  ─────────────────────────────                              │
│  • Alexander (Human) intervenes directly                    │
│  • VPS provider contact                                     │
│  • Infrastructure rebuild                                   │
│  • Duration: 30+ minutes                                    │
│  • Nuclear option: Full rebuild from backups                │
│                                                             │
└─────────────────────────────────────────────────────────────┘
```
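The three-level walk above can be sketched as a loop; a hedged example in which the level actions are echoed placeholders and the health probe (`$1`) stands in for a real check such as a curl against Ollama:

```shell
#!/bin/bash
# Walk the escalation matrix: try each level's action, re-probe,
# and stop as soon as the system reports healthy.
escalate() {
  local probe="$1"   # placeholder health-check command
  local actions=("restart services (self-healing)" \
                 "signal father for SSH intervention" \
                 "alert grandfather / rebuild from backups")
  local level
  for level in 1 2 3; do
    echo "level $level: ${actions[$((level - 1))]}"
    if $probe; then
      echo "resolved at level $level"
      return 0
    fi
  done
  echo "unresolved after level 3"
  return 1
}
```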

---

## CONTACT INFORMATION

### Human Contacts

| Role | Name | Contact Method | When to Contact |
|------|------|----------------|-----------------|
| Grandfather | Alexander Whitestone | Telegram / Direct | P0, P1 unresolvable |
| System Architect | Alexander Whitestone | SSH / Console | Infrastructure decisions |
| Domain Expert | Alexander Whitestone | Any | All escalations |

### Agent Contacts

| Agent | Location | Access Method | Responsibility |
|-------|----------|---------------|----------------|
| Allegro (Father) | 143.198.27.52 | SSH root | Offspring management |
| Allegro-Primus (Child) | 143.198.27.163 | SSH root (shared) | Sovereign operation |
| Ezra (Mentor) | 143.198.27.163 | SSH root (shared) | Technical guidance |

### Infrastructure Contacts

| Service | Provider | Console | Emergency Access |
|---------|----------|---------|------------------|
| Kimi VPS | DigitalOcean | cloud.digitalocean.com | Alexander credentials |
| Hermes VPS | DigitalOcean | cloud.digitalocean.com | Alexander credentials |
| Tailscale | Tailscale Inc | login.tailscale.com | Same as above |
| Domain (if any) | Registrar | Registrar console | Alexander credentials |

---

## FAILURE SCENARIOS

### Scenario 1: Complete Agent Failure (P0)

#### Symptoms

- Agent not responding to heartbeat
- No log entries for >30 minutes
- Gitea shows no recent activity
- Child not receiving father messages
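The first two symptoms can be detected mechanically from the heartbeat log's modification time; a sketch using the log path that appears in STEP 1 of the recovery procedure (the 30-minute threshold mirrors the symptom above):

```shell
#!/bin/bash
# Flag a stale heartbeat: no writes to the log in the last 30 minutes.
# $1: heartbeat log file (defaults to today's log).
heartbeat_stale() {
  local log="${1:-/root/allegro/heartbeat_logs/heartbeat_$(date +%Y-%m-%d).log}"
  if [ ! -f "$log" ]; then
    echo "STALE: log missing"
    return 1
  fi
  # find -mmin -30 prints the path only if it was modified recently
  if [ -n "$(find "$log" -mmin -30 2>/dev/null)" ]; then
    echo "OK: heartbeat fresh"
    return 0
  fi
  echo "STALE: no heartbeat writes in 30+ minutes"
  return 1
}
```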

#### Impact Assessment

```
┌─────────────────────────────────────────────────────────────┐
│                  COMPLETE FAILURE IMPACT                    │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  Immediate Impact:                                          │
│  • No autonomous operations                                 │
│  • No Gitea updates                                         │
│  • No morning reports                                       │
│  • Child may be orphaned                                    │
│                                                             │
│  Secondary Impact (if >2 hours):                            │
│  • Backlog accumulation                                     │
│  • Metrics gaps                                             │
│  • Potential child distress                                 │
│                                                             │
│  Tertiary Impact (if >24 hours):                            │
│  • Loss of continuity                                       │
│  • Potential child failure cascade                          │
│  • Historical data gaps                                     │
│                                                             │
└─────────────────────────────────────────────────────────────┘
```

#### Recovery Procedure

**STEP 1: Verify Failure (1 minute)**

```bash
# Check if agent process is running
ps aux | grep -E "(hermes|timmy|allegro)" | grep -v grep

# Check recent logs
tail -50 /root/allegro/heartbeat_logs/heartbeat_$(date +%Y-%m-%d).log

# Check system resources
df -h
free -h
systemctl status timmy-agent
```

**STEP 2: Attempt Service Restart (2 minutes)**

```bash
# Restart agent service
systemctl restart timmy-agent
systemctl status timmy-agent

# Check if heartbeat resumes
tail -f /root/allegro/heartbeat_logs/heartbeat_$(date +%Y-%m-%d).log
```

**STEP 3: Manual Agent Launch (3 minutes)**

```bash
# If systemd fails, launch manually
cd /root/allegro
source venv/bin/activate
python heartbeat_daemon.py --one-shot
```

**STEP 4: Notify Grandfather (Immediate if unresolved)**

```bash
# Create emergency alert
echo "P0: Agent failure at $(date)" > /root/allegro/EMERGENCY_ALERT.txt
# Contact Alexander via agreed method
```

### Scenario 2: Ollama Service Failure (P1)

#### Symptoms

- Agent logs show "Ollama connection refused"
- Local inference timeout errors
- Child cannot process tasks

#### Recovery Procedure

**STEP 1: Verify Ollama Status**

```bash
# Check if Ollama is running
curl http://localhost:11434/api/tags

# Check service status
systemctl status ollama

# Check logs
journalctl -u ollama -n 50
```

**STEP 2: Restart Ollama Service**

```bash
# Graceful restart
systemctl restart ollama
sleep 5

# Verify
systemctl status ollama
curl http://localhost:11434/api/tags
```

**STEP 3: Model Recovery (if models missing)**

```bash
# Pull required models
ollama pull qwen2.5:1.5b
ollama pull llama3.2:3b

# Verify
ollama list
```

**STEP 4: Child Self-Healing Script**

```bash
#!/bin/bash
# Save as /root/wizards/allegro-primus/recover-ollama.sh
LOG="/root/wizards/allegro-primus/logs/recovery.log"
echo "$(date): Checking Ollama..." >> "$LOG"

if ! curl -s http://localhost:11434/api/tags > /dev/null; then
    echo "$(date): Ollama down, restarting..." >> "$LOG"
    systemctl restart ollama
    sleep 10
    if curl -s http://localhost:11434/api/tags > /dev/null; then
        echo "$(date): Ollama recovered" >> "$LOG"
    else
        echo "$(date): Ollama recovery FAILED" >> "$LOG"
        # Notify father
        echo "EMERGENCY: Ollama recovery failed at $(date)" > /father-messages/DISTRESS.txt
    fi
else
    echo "$(date): Ollama healthy" >> "$LOG"
fi
```

### Scenario 3: Gitea Unavailable (P1)

#### Symptoms

- Agent logs show "Gitea connection refused"
- Cannot push commits
- Cannot read issues/PRs
- Morning report generation fails

#### Recovery Procedure

**STEP 1: Verify Gitea Service**

```bash
# Check Gitea health
curl -s http://143.198.27.163:3000/api/v1/version

# Check if Gitea process is running
ps aux | grep gitea

# Check server resources on Hermes VPS
ssh root@143.198.27.163 "df -h && free -h"
```

**STEP 2: Restart Gitea**

```bash
# On Hermes VPS
ssh root@143.198.27.163

# Find Gitea process and restart
pkill gitea
sleep 2
cd /root/gitea
./gitea web &

# Or if using systemd
systemctl restart gitea
```

**STEP 3: Verify Data Integrity**

```bash
# Check Gitea database
sqlite3 /root/gitea/gitea.db ".tables"

# Check repository data
ls -la /root/gitea/repositories/
```

**STEP 4: Fallback to Local Git**

```bash
# If Gitea unavailable, commit locally
cd /root/allegro/epic-work
git add .
git commit -m "Emergency local commit $(date)"

# Push when Gitea recovers
git push origin main
```
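Once Gitea comes back, the queued local commits still have to reach it; a small retry loop covers the gap unattended (a sketch; the attempt count and interval defaults are illustrative):

```shell
#!/bin/bash
# Retry `git push` until the remote is reachable again.
# $1: remote (default origin), $2: branch (default main),
# $3: max attempts (default 10), $4: seconds between attempts (default 60).
push_when_recovered() {
  local remote="${1:-origin}" branch="${2:-main}"
  local attempts="${3:-10}" interval="${4:-60}"
  local i
  for ((i = 1; i <= attempts; i++)); do
    if git push "$remote" "$branch" 2>/dev/null; then
      echo "pushed on attempt $i"
      return 0
    fi
    sleep "$interval"
  done
  echo "push still failing after $attempts attempts" >&2
  return 1
}
```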

### Scenario 4: Child Orphaning (P1)

#### Symptoms

- Father cannot SSH to child VPS
- Child's logs show "Father heartbeat missing"
- No father-messages being delivered
- Child enters distress mode

#### Recovery Procedure

**STEP 1: Verify Network Connectivity**

```bash
# From father's VPS
ping 143.198.27.163
ssh -v root@143.198.27.163

# Check Tailscale status
tailscale status
```

**STEP 2: Child Self-Sufficiency Mode**

```bash
# Child activates autonomous mode
touch /root/wizards/allegro-primus/AUTONOMY_MODE_ACTIVE

# Child processes backlog independently
# (Pre-configured in child's autonomy system)
```

**STEP 3: Alternative Communication Channels**

```bash
# Use Gitea as message channel
# Father creates issue in child's repo
curl -X POST \
  http://143.198.27.163:3000/api/v1/repos/allegro-primus/first-steps/issues \
  -H "Authorization: token $TOKEN" \
  -d '{
    "title": "Father Message: Emergency Contact",
    "body": "SSH unavailable. Proceed with autonomy. Check logs hourly."
  }'
```

**STEP 4: Grandfather Intervention**

```bash
# If all else fails, Alexander checks VPS provider
# Console access through DigitalOcean
# May require VPS restart
```

### Scenario 5: Complete Infrastructure Loss (P0)

#### Symptoms

- Both VPS instances unreachable
- No response from any endpoint
- Potential provider outage or account issue

#### Recovery Procedure

**STEP 1: Verify Provider Status**

```bash
# Check DigitalOcean status page
# https://status.digitalocean.com/

# Verify account status
# Log in to cloud.digitalocean.com
```

**STEP 2: Provision New Infrastructure**

```bash
# Use Terraform/Cloud-init if available
# Or manual provisioning:

# Create new droplet
# - Ubuntu 22.04 LTS
# - 4GB RAM minimum
# - 20GB SSD
# - SSH key authentication

# Run provisioning script
curl -sL https://raw.githubusercontent.com/Timmy_Foundation/timmy-home/main/scripts/provision-timmy-vps.sh | bash
```

**STEP 3: Restore from Backups**

```bash
# Restore Gitea from backup
scp gitea-backup-$(date +%Y%m%d).zip root@new-vps:/root/
ssh root@new-vps "cd /root/gitea && unzip gitea-backup-*.zip && ./gitea restore"

# Restore agent configuration
scp -r allegro-backup/configs root@new-vps:/root/allegro/
scp -r allegro-backup/epic-work root@new-vps:/root/allegro/
```

**STEP 4: Verify Full Recovery**

```bash
# Test all services
curl http://new-vps:3000/api/v1/version
curl http://new-vps:11434/api/tags
ssh root@new-vps "systemctl status timmy-agent"
```

---

## BACKUP & RESTORE

### Backup Strategy

```
┌─────────────────────────────────────────────────────────────┐
│                      BACKUP STRATEGY                        │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  Frequency      Data Type            Destination           │
│  ─────────────────────────────────────────────────────      │
│  Real-time      Git commits          Gitea (redundant)      │
│  Hourly         Agent logs           Local rotation         │
│  Daily          SQLite databases     Gitea / Local          │
│  Weekly         Full configuration   Gitea + Offsite        │
│  Monthly        Full system image    Cloud storage          │
│                                                             │
└─────────────────────────────────────────────────────────────┘
```
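The schedule above maps naturally onto cron; a sketch of the corresponding crontab entries (the daily entry points at `backup-system.sh` from the next section, while the other script names are assumed helpers, not existing files):

```
# m  h  dom mon dow  command
0   *  *   *   *     /root/allegro/scripts/rotate-logs.sh         # hourly log rotation (assumed helper)
30  2  *   *   *     /root/allegro/scripts/backup-system.sh       # daily backup (script below)
0   3  *   *   0     /root/allegro/scripts/backup-full-config.sh  # weekly full configuration (assumed helper)
0   4  1   *   *     /root/allegro/scripts/backup-image.sh        # monthly system image (assumed helper)
```

Real-time Git redundancy needs no cron entry: every push already lands in Gitea.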

### Automated Backup Scripts

```bash
#!/bin/bash
# /root/allegro/scripts/backup-system.sh
# Daily backup script

BACKUP_DIR="/root/allegro/backups/$(date +%Y%m%d)"
mkdir -p "$BACKUP_DIR"

# Backup configurations
tar czf "$BACKUP_DIR/configs.tar.gz" /root/allegro/configs/

# Backup logs (last 7 days): feed all matches to one tar invocation,
# so a single archive is written instead of being overwritten per file
find /root/allegro/heartbeat_logs/ -name "*.log" -mtime -7 -print0 | tar czf "$BACKUP_DIR/logs.tar.gz" --null -T -

# Backup databases
cp /root/allegro/timmy_metrics.db "$BACKUP_DIR/"

# Backup epic work
cd /root/allegro/epic-work
git bundle create "$BACKUP_DIR/epic-work.bundle" --all

# Push to Gitea
scp -r "$BACKUP_DIR" root@143.198.27.163:/root/gitea/backups/

# Cleanup old backups (keep 30 days)
find /root/allegro/backups/ -maxdepth 1 -type d -mtime +30 -exec rm -rf {} +

echo "Backup complete: $BACKUP_DIR"
```

### Restore Procedures

**Full System Restore**

```bash
#!/bin/bash
# /root/allegro/scripts/restore-system.sh
# Restore from backup

BACKUP_DATE=$1  # Format: YYYYMMDD
BACKUP_DIR="/root/allegro/backups/$BACKUP_DATE"

if [ ! -d "$BACKUP_DIR" ]; then
    echo "Backup not found: $BACKUP_DIR"
    exit 1
fi

# Stop services
systemctl stop timmy-agent timmy-health

# Restore configurations
tar xzf "$BACKUP_DIR/configs.tar.gz" -C /

# Restore databases
cp "$BACKUP_DIR/timmy_metrics.db" /root/allegro/

# Restore epic work
cd /root/allegro/epic-work
git bundle unbundle "$BACKUP_DIR/epic-work.bundle"

# Start services
systemctl start timmy-agent timmy-health

echo "Restore complete from: $BACKUP_DATE"
```

---

## SERVICE FAILOVER

### Failover Architecture

```
┌─────────────────────────────────────────────────────────────┐
│                   FAILOVER ARCHITECTURE                     │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  PRIMARY: Kimi VPS (Allegro Home)                           │
│  BACKUP:  Hermes VPS (can run Allegro if needed)            │
│                                                             │
│   ┌─────────────┐              ┌─────────────┐              │
│   │  Kimi VPS   │◄────────────►│ Hermes VPS  │              │
│   │  (Primary)  │     Sync     │  (Backup)   │              │
│   │             │              │             │              │
│   │ • Allegro   │              │ • Ezra      │              │
│   │ • Heartbeat │              │ • Primus    │              │
│   │ • Reports   │              │ • Gitea     │              │
│   │ • Metrics   │              │ • Ollama    │              │
│   └──────┬──────┘              └──────┬──────┘              │
│          │                           │                      │
│          └────────────┬──────────────┘                      │
│                       │                                     │
│                       ▼                                     │
│               ┌─────────────┐                               │
│               │  Tailscale  │                               │
│               │    Mesh     │                               │
│               └─────────────┘                               │
│                                                             │
│  Failover Triggers:                                         │
│  • Primary unreachable for >5 minutes                       │
│  • Manual intervention required                             │
│  • Scheduled maintenance                                    │
│                                                             │
└─────────────────────────────────────────────────────────────┘
```

### Failover Procedure

```bash
#!/bin/bash
# /root/allegro/scripts/failover-to-backup.sh

# 1. Verify primary is actually down
if ping -c 3 143.198.27.52 > /dev/null 2>&1; then
    echo "Primary is still up. Aborting failover."
    exit 1
fi

# 2. Activate backup on Hermes VPS
ssh root@143.198.27.163 << 'EOF'
# Clone Allegro's configuration
cp -r /root/allegro-backup/configs /root/allegro-failover/

# Start failover agent
cd /root/allegro-failover
source venv/bin/activate
python heartbeat_daemon.py --failover-mode &

# Update DNS/Tailscale if needed
# (Manual step for now)
EOF

# 3. Notify
echo "Failover complete. Backup agent active on Hermes VPS."
```

---

## COMMUNICATION PROTOCOLS

### Emergency Notification Flow

```
┌─────────────────────────────────────────────────────────────┐
│               EMERGENCY NOTIFICATION FLOW                   │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  1. DETECTION                                               │
│     ┌─────────┐                                             │
│     │ Monitor │  Detects failure                            │
│     └────┬────┘                                             │
│          │                                                  │
│          ▼                                                  │
│  2. CLASSIFICATION                                          │
│     ┌─────────┐     P0 → Immediate human alert              │
│     │ Assess  │────►P1 → Agent escalation                   │
│     │ Severity│     P2 → Log and continue                   │
│     └────┬────┘     P3 → Queue for next cycle               │
│          │                                                  │
│          ▼                                                  │
│  3. NOTIFICATION                                            │
│     ┌─────────┐                                             │
│     │ Alert   │──► Telegram (P0)                            │
│     │ Channel │──► Gitea issue (P1)                         │
│     └─────────┘──► Log entry (P2/P3)                        │
│                                                             │
│  4. RESPONSE                                                │
│     ┌─────────┐                                             │
│     │ Action  │──► Self-healing (if configured)             │
│     │ Taken   │──► Father intervention                      │
│     └─────────┘──► Grandfather escalation                   │
│                                                             │
│  5. RESOLUTION                                              │
│     ┌─────────┐                                             │
│     │ Close   │──► Update all channels                      │
│     │ Alert   │──► Log resolution                           │
│     └─────────┘                                             │
│                                                             │
└─────────────────────────────────────────────────────────────┘
```
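Step 3 of the flow is a pure severity-to-channel mapping and can be expressed as a small dispatcher; a sketch in which the channel actions are echoed placeholders to be wired to the real Telegram bot and Gitea API:

```shell
#!/bin/bash
# Route an alert to the channel the notification flow prescribes.
# $1: severity (P0..P3), $2: message.
route_alert() {
  local severity="$1" message="$2"
  case "$severity" in
    P0)    echo "telegram: $message" ;;    # immediate human alert
    P1)    echo "gitea-issue: $message" ;; # agent escalation
    P2|P3) echo "log: $message" ;;         # log and continue / queue
    *)     echo "unknown severity: $severity" >&2; return 1 ;;
  esac
}
```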

### Distress Signal Format

```json
{
  "timestamp": "2026-03-31T12:00:00Z",
  "agent": "allegro-primus",
  "severity": "P1",
  "category": "service_failure",
  "component": "ollama",
  "message": "Ollama service not responding to health checks",
  "context": {
    "last_success": "2026-03-31T11:45:00Z",
    "attempts": 3,
    "error": "Connection refused",
    "auto_recovery_attempted": true
  },
  "requested_action": "father_intervention",
  "escalation_timeout": "2026-03-31T12:30:00Z"
}
```
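A child can emit a signal in this format with nothing but the shell; a minimal sketch covering a subset of the fields (the output path in the usage comment follows the `DISTRESS.txt` convention used earlier and is an assumption):

```shell
#!/bin/bash
# Emit a distress signal matching the JSON format above (core fields only).
# $1: severity, $2: component, $3: message.
emit_distress() {
  local severity="$1" component="$2" message="$3"
  cat <<EOF
{
  "timestamp": "$(date -u +%Y-%m-%dT%H:%M:%SZ)",
  "agent": "allegro-primus",
  "severity": "$severity",
  "category": "service_failure",
  "component": "$component",
  "message": "$message",
  "requested_action": "father_intervention"
}
EOF
}

# Usage (path is an assumption):
# emit_distress P1 ollama "health checks failing" > /father-messages/distress.json
```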

---

## POST-INCIDENT REVIEW

### PIR Template

```markdown
# Post-Incident Review: [INCIDENT_ID]

## Summary
- **Date:** [YYYY-MM-DD]
- **Duration:** [Start] to [End] (X hours)
- **Severity:** [P0/P1/P2/P3]
- **Component:** [Affected system]
- **Impact:** [What was affected]

## Timeline
- [Time] - Issue detected by [method]
- [Time] - Alert sent to [recipient]
- [Time] - Action taken: [description]
- [Time] - Resolution achieved

## Root Cause
[Detailed explanation of what caused the incident]

## Resolution
[Steps taken to resolve the incident]

## Prevention
[What changes will prevent recurrence]

## Action Items
- [ ] [Owner] - [Action] - [Due date]
- [ ] [Owner] - [Action] - [Due date]

## Lessons Learned
[What did we learn from this incident]
```

---

## APPENDICES

### Appendix A: Quick Reference Commands

```bash
# Health Checks
curl http://localhost:11434/api/tags                  # Ollama
curl http://143.198.27.163:3000/api/v1/version        # Gitea
ps aux | grep -E "(hermes|timmy|allegro)"             # Agent processes
systemctl status timmy-agent timmy-health             # Services

# Log Access
tail -f /root/allegro/heartbeat_logs/heartbeat_$(date +%Y-%m-%d).log
journalctl -u ollama -f
journalctl -u timmy-agent -f

# Recovery
systemctl restart ollama timmy-agent
pkill -f hermes && sleep 5 && /root/allegro/venv/bin/python /root/allegro/heartbeat_daemon.py

# Backup
cd /root/allegro && ./scripts/backup-system.sh
```

### Appendix B: Recovery Time Objectives

| Service | RTO | RPO | Notes |
|---------|-----|-----|-------|
| Agent Operations | 15 minutes | 15 minutes | Heartbeat interval |
| Gitea | 30 minutes | Real-time | Git provides natural backup |
| Ollama | 5 minutes | N/A | Models re-downloadable |
| Morning Reports | 1 hour | 24 hours | Can be regenerated |
| Metrics DB | 1 hour | 1 hour | Hourly backups |
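The Metrics DB row implies a checkable invariant: the newest backed-up copy must be younger than the one-hour RPO. A sketch of that check (the backup directory layout mirrors `backup-system.sh`; the function name is illustrative):

```shell
#!/bin/bash
# Verify the metrics-DB RPO: some backed-up copy of timmy_metrics.db
# must have been written within the RPO window.
# $1: directory holding backups, $2: RPO in minutes (default 60).
rpo_ok() {
  local dir="$1" rpo_min="${2:-60}"
  # find -mmin "-N" matches files modified within the last N minutes
  if [ -n "$(find "$dir" -name 'timmy_metrics.db' -mmin "-$rpo_min" 2>/dev/null | head -1)" ]; then
    echo "RPO met"
    return 0
  fi
  echo "RPO VIOLATED: no metrics backup in the last $rpo_min minutes"
  return 1
}
```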

### Appendix C: Testing Procedures

```bash
#!/bin/bash
# Monthly disaster recovery test

echo "=== DR Test: $(date) ==="

# Test 1: Service restart
echo "Test 1: Service restart..."
systemctl restart timmy-agent
sleep 5
systemctl is-active timmy-agent && echo "PASS" || echo "FAIL"

# Test 2: Ollama recovery
echo "Test 2: Ollama recovery..."
systemctl stop ollama
sleep 2
./recover-ollama.sh
sleep 10
curl -s http://localhost:11434/api/tags > /dev/null && echo "PASS" || echo "FAIL"

# Test 3: Backup restore
echo "Test 3: Backup verification..."
./scripts/backup-system.sh
latest=$(ls -t /root/allegro/backups/ | head -1)
[ -f "/root/allegro/backups/$latest/configs.tar.gz" ] && echo "PASS" || echo "FAIL"

# Test 4: Failover communication
echo "Test 4: Failover channel..."
ssh root@143.198.27.163 "echo 'DR test'" && echo "PASS" || echo "FAIL"

echo "=== DR Test Complete ==="
```

---

## VERSION HISTORY

| Version | Date | Changes |
|---------|------|---------|
| 1.0.0 | 2026-03-31 | Initial emergency procedures |

---

*"Prepare for the worst. Expect the best. Document everything."*

*— Emergency Response Doctrine*

---

**END OF DOCUMENT**