# [PHASE-1] Survival - Keep the Lights On
Phase 1 is the manual-clicker stage of the fleet. The machines exist. The services exist. The human is still the automation loop.
## Phase Definition
- Current state: Fleet is operational. Three VPS wizards run. Gitea hosts 16 repos. Agents burn through issues nightly.
- The problem: Everything important still depends on human vigilance. When an agent dies at 2 AM, nobody notices until morning.
- Resources tracked: Uptime, Capacity Utilization.
- Next phase: [PHASE-2] Automation - Self-Healing Infrastructure
## What We Have

### Infrastructure
- VPS hosts: Ezra (143.198.27.163), Allegro, Bezalel (167.99.126.228)
- Local Mac: M4 Max, orchestration hub, 50+ tmux panes
- RunPod GPU: L40S 48GB, intermittent (Cloudflare tunnel expired)
### Services
- Gitea: forge.alexanderwhitestone.com -- 16 repos, 500+ open issues, branch protection enabled
- Ollama: 6 models loaded (~37GB), local inference
- Hermes: Agent orchestration, cron system (90+ jobs, 6 workers)
- Evennia: The Tower MUD world, federation capable
### Agents
- Timmy: Local harness, primary orchestrator
- Bezalel, Ezra, Allegro: VPS workers dispatched via Gitea issues
- Code Claw, Gemini: Specialized workers
## Current Resource Snapshot
| Resource | Value | Target | Status |
|---|---|---|---|
| Fleet operational | Yes | Yes | MET |
| Uptime (30d average) | ~78% | >= 95% | NOT MET |
| Days at 95%+ uptime | 0 | 30 | NOT MET |
| Capacity utilization | ~35% | > 60% | NOT MET |
Phase 2 trigger: NOT READY
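The trigger is just the conjunction of the three targets above. A minimal sketch of that gate in Python, assuming hypothetical snapshot field names (the actual schema read by `scripts/fleet_phase_status.py` may differ):

```python
import json

def phase2_ready(snapshot: dict) -> bool:
    """True only when every Phase 1 target in the table above is met."""
    return (
        snapshot.get("uptime_30d_avg", 0.0) >= 0.95          # uptime >= 95%
        and snapshot.get("days_at_95_uptime", 0) >= 30       # held for 30 days
        and snapshot.get("capacity_utilization", 0.0) > 0.60 # > 60% utilized
    )

# Field names above are illustrative assumptions, not the real schema.
with open("configs/phase-1-snapshot.json") as f:
    print("READY" if phase2_ready(json.load(f)) else "NOT READY")
```

With the current snapshot (~78% uptime, 0 days, ~35% utilization), all three conjuncts fail, hence NOT READY.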
## What's Still Manual
Every one of these is a "click" that a human must make:
- Restart dead agents -- SSH into VPS, check process, restart hermes
- Health checks -- SSH to each VPS, verify disk/memory/services (a scripted version of this click is sketched after this list)
- Dead pane recovery -- tmux pane dies, nobody notices, work stops
- Provider failover -- Nous API goes down, agents stop, human reconfigures
- PR triage -- 80% auto-merge, but 20% need human review
- Backlog management -- 500+ issues, burn loops help but need supervision
- Nightly retro -- manually run and push results
- Config drift -- agent runs on wrong model, human discovers later
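To show how much of the health-check click can be scripted, here is a hedged sketch. The host IPs and the hermes process name come from this doc; the SSH setup, Allegro's address, and the exact checks are assumptions:

```python
import subprocess

# Hosts from the Infrastructure list above. Allegro's IP is not listed
# in this doc, so an ssh-config alias stands in as a placeholder.
HOSTS = {
    "ezra": "143.198.27.163",
    "bezalel": "167.99.126.228",
    "allegro": "allegro",  # placeholder alias, resolve via ~/.ssh/config
}

# The three things the manual click verifies: disk, memory, services.
CHECKS = {
    "disk": "df -h / | tail -1",
    "memory": "free -m | awk 'NR==2'",
    "hermes": "pgrep -f hermes > /dev/null && echo UP || echo DOWN",
}

for name, host in HOSTS.items():
    print(f"== {name} ({host}) ==")
    for check, cmd in CHECKS.items():
        try:
            result = subprocess.run(
                ["ssh", host, cmd], capture_output=True, text=True, timeout=30
            )
            print(f"  {check}: {result.stdout.strip() or result.stderr.strip()}")
        except subprocess.TimeoutExpired:
            print(f"  {check}: TIMEOUT -- host unreachable?")
```

Run from the Mac orchestration hub, this turns three SSH sessions into one command. It still reports rather than repairs, which is exactly the Phase 1 / Phase 2 boundary.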
## The Gap to Phase 2
To unlock Phase 2 (Automation), we need:
| Requirement | Current | Gap |
|---|---|---|
| 30 days at 95% uptime | 0 days | Need deadman switch, auto-respawn, provider failover |
| Capacity > 60% | ~35% | Need more agents doing work, less idle time |
### What closes the gap
- Deadman switch in cron (fleet-ops#168) -- detect dead agents within 5 minutes
- Auto-respawn (fleet-ops#173) -- restart dead tmux panes automatically
- Provider failover -- switch to fallback model/provider when primary fails
- Heartbeat monitoring -- read heartbeat files and alert on staleness (see the cron-ready sketch below)
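Of these, heartbeat monitoring is the cheapest to sketch, since the heartbeat files already exist (see Notes). A minimal staleness check, assuming one mtime-touched file per agent in a directory whose path is hypothetical here:

```python
import sys
import time
from pathlib import Path

HEARTBEAT_DIR = Path("/var/run/fleet/heartbeats")  # assumed location
MAX_AGE_S = 5 * 60  # matches the 5-minute detection target above

# An agent is presumed dead if its heartbeat file has not been
# touched within the threshold.
now = time.time()
stale = [
    p.name
    for p in sorted(HEARTBEAT_DIR.glob("*"))
    if now - p.stat().st_mtime > MAX_AGE_S
]

if stale:
    print(f"STALE: {', '.join(stale)}")
    sys.exit(1)  # nonzero exit gives cron something to alert on
print("all heartbeats fresh")
```

Dropped into the existing cron system on a short interval, the nonzero exit is the hook a deadman switch or auto-respawn job (fleet-ops#168, fleet-ops#173) could act on.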
## How to Run the Phase Report
```bash
# Render with default (zero) snapshot
python3 scripts/fleet_phase_status.py

# Render with real snapshot
python3 scripts/fleet_phase_status.py --snapshot configs/phase-1-snapshot.json

# Output as JSON
python3 scripts/fleet_phase_status.py --snapshot configs/phase-1-snapshot.json --json

# Write to file
python3 scripts/fleet_phase_status.py --snapshot configs/phase-1-snapshot.json --output docs/FLEET_PHASE_1_SURVIVAL.md
```
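For reference, a snapshot along these lines would drive the table above. The field names are illustrative assumptions, not the schema the script actually parses:

```json
{
  "fleet_operational": true,
  "uptime_30d_avg": 0.78,
  "days_at_95_uptime": 0,
  "capacity_utilization": 0.35
}
```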
## Manual Clicker Interpretation
Paperclips analogy: Phase 1 = Manual clicker. You ARE the automation. Every restart, every SSH, every check is a manual click.
The goal of Phase 1 is not to automate. It's to name what needs automating. Every manual click documented here is a Phase 2 ticket.
## Notes
- Fleet is operational but fragile -- most recovery is manual
- Overnight burns work ~70% of the time; 30% need morning rescue
- The deadman switch exists but is not in cron
- Heartbeat files exist but no automated monitoring reads them
- Provider failover is manual -- Nous goes down = agents stop