# [PHASE-1] Survival - Keep the Lights On
Phase 1 is the manual-clicker stage of the fleet. The machines exist. The services exist. The human is still the automation loop.
## Phase Definition
- Current state: Fleet is operational. Three VPS wizards run. Gitea hosts 16 repos. Agents burn through issues nightly.
- The problem: Everything important still depends on human vigilance. When an agent dies at 2 AM, nobody notices until morning.
- Resources tracked: Uptime, Capacity Utilization.
- Next phase: [PHASE-2] Automation - Self-Healing Infrastructure
## What We Have

### Infrastructure
- VPS hosts: Ezra (143.198.27.163), Allegro, Bezalel (167.99.126.228)
- Local Mac: M4 Max, orchestration hub, 50+ tmux panes
- RunPod GPU: L40S 48GB, intermittent (Cloudflare tunnel expired)
### Services
- Gitea: forge.alexanderwhitestone.com -- 16 repos, 500+ open issues, branch protection enabled
- Ollama: 6 models loaded (~37GB), local inference
- Hermes: Agent orchestration, cron system (90+ jobs, 6 workers)
- Evennia: The Tower MUD world, federation capable
### Agents
- Timmy: Local harness, primary orchestrator
- Bezalel, Ezra, Allegro: VPS workers dispatched via Gitea issues
- Code Claw, Gemini: Specialized workers
## Current Resource Snapshot
| Resource | Value | Target | Status |
|---|---|---|---|
| Fleet operational | Yes | Yes | MET |
| Uptime (30d average) | ~78% | >= 95% | NOT MET |
| Days at 95%+ uptime | 0 | 30 | NOT MET |
| Capacity utilization | ~35% | > 60% | NOT MET |
Phase 2 trigger: NOT READY
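The trigger is just the conjunction of the three targets above. A minimal sketch of that gate in Python, assuming hypothetical snapshot field names (the actual schema read by `scripts/fleet_phase_status.py` may differ):

```python
import json

def phase2_ready(snapshot: dict) -> bool:
    """True only when every Phase 1 target in the table above is met."""
    return (
        snapshot.get("uptime_30d_avg", 0.0) >= 0.95          # uptime >= 95%
        and snapshot.get("days_at_95_uptime", 0) >= 30       # held for 30 days
        and snapshot.get("capacity_utilization", 0.0) > 0.60 # > 60% utilized
    )

# Field names above are illustrative assumptions, not the real schema.
with open("configs/phase-1-snapshot.json") as f:
    print("READY" if phase2_ready(json.load(f)) else "NOT READY")
```

With the current snapshot (~78% uptime, 0 days, ~35% utilization), all three conjuncts fail, hence NOT READY.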
## What's Still Manual
Every one of these is a "click" that a human must make:
- Restart dead agents -- SSH into VPS, check process, restart hermes
- Health checks -- SSH to each VPS, verify disk/memory/services (a scripted version of this click is sketched after this list)
- Dead pane recovery -- tmux pane dies, nobody notices, work stops
- Provider failover -- Nous API goes down, agents stop, human reconfigures
- PR triage -- 80% auto-merge, but 20% need human review
- Backlog management -- 500+ issues, burn loops help but need supervision
- Nightly retro -- manually run and push results
- Config drift -- agent runs on wrong model, human discovers later
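To show how much of the health-check click can be scripted, here is a hedged sketch. The host IPs and the hermes process name come from this doc; the SSH setup, Allegro's address, and the exact checks are assumptions:

```python
import subprocess

# Hosts from the Infrastructure list above. Allegro's IP is not listed
# in this doc, so an ssh-config alias stands in as a placeholder.
HOSTS = {
    "ezra": "143.198.27.163",
    "bezalel": "167.99.126.228",
    "allegro": "allegro",  # placeholder alias, resolve via ~/.ssh/config
}

# The three things the manual click verifies: disk, memory, services.
CHECKS = {
    "disk": "df -h / | tail -1",
    "memory": "free -m | awk 'NR==2'",
    "hermes": "pgrep -f hermes > /dev/null && echo UP || echo DOWN",
}

for name, host in HOSTS.items():
    print(f"== {name} ({host}) ==")
    for check, cmd in CHECKS.items():
        try:
            result = subprocess.run(
                ["ssh", host, cmd], capture_output=True, text=True, timeout=30
            )
            print(f"  {check}: {result.stdout.strip() or result.stderr.strip()}")
        except subprocess.TimeoutExpired:
            print(f"  {check}: TIMEOUT -- host unreachable?")
```

Run from the Mac orchestration hub, this turns three SSH sessions into one command. It still reports rather than repairs, which is exactly the Phase 1 / Phase 2 boundary.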
## The Gap to Phase 2
To unlock Phase 2 (Automation), we need:
| Requirement | Current | Gap |
|---|---|---|
| 30 days at 95% uptime | 0 days | Need deadman switch, auto-respawn, provider failover |
| Capacity > 60% | ~35% | Need more agents doing work, less idle time |
### What closes the gap
- Deadman switch in cron (fleet-ops#168) -- detect dead agents within 5 minutes
- Auto-respawn (fleet-ops#173) -- restart dead tmux panes automatically
- Provider failover -- switch to fallback model/provider when primary fails
- Heartbeat monitoring -- read heartbeat files and alert on staleness (see the cron-ready sketch below)
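Of these, heartbeat monitoring is the cheapest to sketch, since the heartbeat files already exist (see Notes). A minimal staleness check, assuming one mtime-touched file per agent in a directory whose path is hypothetical here:

```python
import sys
import time
from pathlib import Path

HEARTBEAT_DIR = Path("/var/run/fleet/heartbeats")  # assumed location
MAX_AGE_S = 5 * 60  # matches the 5-minute detection target above

# An agent is presumed dead if its heartbeat file has not been
# touched within the threshold.
now = time.time()
stale = [
    p.name
    for p in sorted(HEARTBEAT_DIR.glob("*"))
    if now - p.stat().st_mtime > MAX_AGE_S
]

if stale:
    print(f"STALE: {', '.join(stale)}")
    sys.exit(1)  # nonzero exit gives cron something to alert on
print("all heartbeats fresh")
```

Dropped into the existing cron system on a short interval, the nonzero exit is the hook a deadman switch or auto-respawn job (fleet-ops#168, fleet-ops#173) could act on.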
## How to Run the Phase Report
```bash
# Render with default (zero) snapshot
python3 scripts/fleet_phase_status.py

# Render with real snapshot
python3 scripts/fleet_phase_status.py --snapshot configs/phase-1-snapshot.json

# Output as JSON
python3 scripts/fleet_phase_status.py --snapshot configs/phase-1-snapshot.json --json

# Write to file
python3 scripts/fleet_phase_status.py --snapshot configs/phase-1-snapshot.json --output docs/FLEET_PHASE_1_SURVIVAL.md
```
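For reference, a snapshot along these lines would drive the table above. The field names are illustrative assumptions, not the schema the script actually parses:

```json
{
  "fleet_operational": true,
  "uptime_30d_avg": 0.78,
  "days_at_95_uptime": 0,
  "capacity_utilization": 0.35
}
```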
## Manual Clicker Interpretation
Paperclips analogy: Phase 1 = Manual clicker. You ARE the automation. Every restart, every SSH, every check is a manual click.
The goal of Phase 1 is not to automate. It's to name what needs automating. Every manual click documented here is a Phase 2 ticket.
## Notes
- Fleet is operational but fragile -- most recovery is manual
- Overnight burns work ~70% of the time; 30% need morning rescue
- The deadman switch exists but is not in cron
- Heartbeat files exist but no automated monitoring reads them
- Provider failover is manual -- Nous goes down = agents stop