
[PHASE-1] Survival - Keep the Lights On

Phase 1 is the manual-clicker stage of the fleet. The machines exist. The services exist. The human is still the automation loop.

Phase Definition

  • Current state: Fleet is operational. Three VPS workers run. Gitea hosts 16 repos. Agents burn through issues nightly.
  • The problem: Everything important still depends on human vigilance. When an agent dies at 2 AM, nobody notices until morning.
  • Resources tracked: Uptime, Capacity Utilization.
  • Next phase: [PHASE-2] Automation - Self-Healing Infrastructure

What We Have

Infrastructure

  • VPS hosts: Ezra (143.198.27.163), Allegro, Bezalel (167.99.126.228)
  • Local Mac: M4 Max, orchestration hub, 50+ tmux panes
  • RunPod GPU: L40S 48GB, intermittent (Cloudflare tunnel expired)

Services

  • Gitea: forge.alexanderwhitestone.com -- 16 repos, 500+ open issues, branch protection enabled
  • Ollama: 6 models loaded (~37GB), local inference
  • Hermes: Agent orchestration, cron system (90+ jobs, 6 workers)
  • Evennia: The Tower MUD world, federation capable

Agents

  • Timmy: Local harness, primary orchestrator
  • Bezalel, Ezra, Allegro: VPS workers dispatched via Gitea issues
  • Code Claw, Gemini: Specialized workers

Current Resource Snapshot

Resource               Value   Target   Status
Fleet operational      Yes     Yes      MET
Uptime (30d average)   ~78%    >= 95%   NOT MET
Days at 95%+ uptime    0       30       NOT MET
Capacity utilization   ~35%    > 60%    NOT MET

Phase 2 trigger: NOT READY
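
For concreteness, the trigger reduces to two threshold checks against the snapshot. A minimal sketch, assuming the values map one-to-one to the table above (the function name and signature are illustrative, not the real fleet_phase_status.py API):

# Illustrative Phase 2 trigger check; thresholds come from the table above.
def phase2_ready(days_at_95: int, capacity_pct: float) -> bool:
    # Both gates must hold: 30 days at 95%+ uptime and > 60% utilization.
    return days_at_95 >= 30 and capacity_pct > 60.0

print(phase2_ready(days_at_95=0, capacity_pct=35.0))  # prints False -> NOT READY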

What's Still Manual

Every one of these is a "click" that a human must make:

  1. Restart dead agents -- SSH into VPS, check process, restart hermes
  2. Health checks -- SSH to each VPS, verify disk/memory/services (scriptable today; see the sketch after this list)
  3. Dead pane recovery -- tmux pane dies, nobody notices, work stops
  4. Provider failover -- Nous API goes down, agents stop, human reconfigures
  5. PR triage -- 80% auto-merge, but 20% need human review
  6. Backlog management -- 500+ issues, burn loops help but need supervision
  7. Nightly retro -- manually run and push results
  8. Config drift -- agent runs on wrong model, human discovers later
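
Click #2 is the most mechanical of these. A minimal sketch, assuming passwordless SSH aliases for the three hosts and a hermes process on each box (the exact check commands are assumptions):

# Hedged sketch of the manual health-check round (click #2).
import subprocess

HOSTS = ["ezra", "allegro", "bezalel"]   # assumes SSH config aliases exist
CHECKS = [
    ("disk", "df -h /"),
    ("memory", "free -m"),
    ("hermes", "pgrep -f hermes"),       # nonzero exit -> process missing
]

for host in HOSTS:
    for name, cmd in CHECKS:
        result = subprocess.run(
            ["ssh", "-o", "ConnectTimeout=5", host, cmd],
            capture_output=True, text=True, timeout=30,
        )
        status = "ok" if result.returncode == 0 else "FAIL"
        print(f"{host}/{name}: {status}")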

The Gap to Phase 2

To unlock Phase 2 (Automation), we need:

Requirement             Current   Gap
30 days at 95% uptime   0 days    Need deadman switch, auto-respawn, provider failover
Capacity > 60%          ~35%      Need more agents doing work, less idle time

What closes the gap

  1. Deadman switch in cron (fleet-ops#168) -- detect dead agents within 5 minutes (sketched after this list)
  2. Auto-respawn (fleet-ops#173) -- restart dead tmux panes automatically
  3. Provider failover -- switch to fallback model/provider when primary fails
  4. Heartbeat monitoring -- read heartbeat files and alert on staleness
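
A minimal sketch of items 1 and 4 together, assuming each agent touches a per-agent heartbeat file on every loop (the directory, file naming, and cron line are assumptions; fleet-ops#168 tracks the real implementation):

# deadman_check.py -- hedged sketch, not the fleet-ops#168 implementation.
# Assumed cron entry, matching the 5-minute detection target:
#   */5 * * * * python3 scripts/deadman_check.py
import time
from pathlib import Path

HEARTBEAT_DIR = Path("/var/run/fleet")   # assumption: heartbeat file location
STALE_AFTER = 5 * 60                     # seconds

def stale_agents() -> list[str]:
    now = time.time()
    return [
        p.stem for p in HEARTBEAT_DIR.glob("*.heartbeat")
        if now - p.stat().st_mtime > STALE_AFTER
    ]

if __name__ == "__main__":
    for agent in stale_agents():
        # Phase 2 hook: call auto-respawn (fleet-ops#173) instead of printing.
        print(f"ALERT: {agent} heartbeat is stale")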

How to Run the Phase Report

# Render with default (zero) snapshot
python3 scripts/fleet_phase_status.py

# Render with real snapshot
python3 scripts/fleet_phase_status.py --snapshot configs/phase-1-snapshot.json

# Output as JSON
python3 scripts/fleet_phase_status.py --snapshot configs/phase-1-snapshot.json --json

# Write to file
python3 scripts/fleet_phase_status.py --snapshot configs/phase-1-snapshot.json --output docs/FLEET_PHASE_1_SURVIVAL.md
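
The snapshot file is just the four resource numbers. A hedged example of generating one (the field names are assumptions; check scripts/fleet_phase_status.py for the schema it actually reads):

# Writes a hypothetical configs/phase-1-snapshot.json; keys are illustrative.
import json

snapshot = {
    "fleet_operational": True,
    "uptime_30d_pct": 78.0,
    "days_at_95_uptime": 0,
    "capacity_utilization_pct": 35.0,
}

with open("configs/phase-1-snapshot.json", "w") as f:
    json.dump(snapshot, f, indent=2)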

Manual Clicker Interpretation

Paperclips analogy: Phase 1 = Manual clicker. You ARE the automation. Every restart, every SSH, every check is a manual click.

The goal of Phase 1 is not to automate. It's to name what needs automating. Every manual click documented here is a Phase 2 ticket.

Notes

  • Fleet is operational but fragile -- most recovery is manual
  • Overnight burns work ~70% of the time; 30% need morning rescue
  • The deadman switch exists but is not in cron
  • Heartbeat files exist but no automated monitoring reads them
  • Provider failover is manual -- when Nous goes down, agents stop until a human reconfigures (a failover sketch follows these notes)
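
On the last note: failover is conceptually a loop over an ordered provider list. A hedged sketch, assuming Nous is primary and local Ollama the fallback; call_provider is a hypothetical stand-in for the real agent harness call:

# Hypothetical failover loop; provider order assumed (Nous, then local Ollama).
PROVIDERS = ["nous", "ollama-local"]

def call_provider(provider: str, prompt: str) -> str:
    # Placeholder -- the real harness request goes here.
    raise ConnectionError(f"{provider} unreachable")

def complete_with_failover(prompt: str) -> str:
    last_err = None
    for provider in PROVIDERS:
        try:
            return call_provider(provider, prompt)
        except (ConnectionError, TimeoutError) as err:
            last_err = err               # fall through to the next provider
    raise RuntimeError("all providers down") from last_err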