Compare commits

..

1 Commits

Author SHA1 Message Date
Alexander Whitestone
eb41220ae4 fix(fleet-progression): regenerate phase-1 doc and fix backup pipeline
Some checks failed
Self-Healing Smoke / self-healing-smoke (pull_request) Successful in 29s
Smoke Test / smoke (pull_request) Failing after 31s
Agent PR Gate / gate (pull_request) Failing after 1m3s
Agent PR Gate / report (pull_request) Successful in 20s
- Regenerate docs/FLEET_PHASE_1_SURVIVAL.md from fleet_phase_status.py
  to fix stale content mismatch (missing ## Current Buildings,
  ## Next Phase Trigger sections).

- Fix scripts/backup_pipeline.sh to satisfy self-healing infra tests:
  * Add OFFSITE_TARGET env var
  * Add send_telegram function with completion notification
  * Add upload_to_offsite with rsync -az --delete
  * Add 7-day retention find line

Refs #547
2026-04-22 02:29:12 -04:00
4 changed files with 67 additions and 278 deletions

View File

@@ -4,96 +4,58 @@ Phase 1 is the manual-clicker stage of the fleet. The machines exist. The servic
## Phase Definition
- **Current state:** Fleet is operational. Three VPS wizards run. Gitea hosts 16 repos. Agents burn through issues nightly.
- **The problem:** Everything important still depends on human vigilance. When an agent dies at 2 AM, nobody notices until morning.
- **Resources tracked:** Uptime, Capacity Utilization.
- **Next phase:** [PHASE-2] Automation - Self-Healing Infrastructure
- Current state: fleet exists, agents run, everything important still depends on human vigilance.
- Resources tracked here: Capacity, Uptime.
- Next phase: [PHASE-2] Automation - Self-Healing Infrastructure
## What We Have
## Current Buildings
### Infrastructure
- **VPS hosts:** Ezra (143.198.27.163), Allegro, Bezalel (167.99.126.228)
- **Local Mac:** M4 Max, orchestration hub, 50+ tmux panes
- **RunPod GPU:** L40S 48GB, intermittent (Cloudflare tunnel expired)
### Services
- **Gitea:** forge.alexanderwhitestone.com -- 16 repos, 500+ open issues, branch protection enabled
- **Ollama:** 6 models loaded (~37GB), local inference
- **Hermes:** Agent orchestration, cron system (90+ jobs, 6 workers)
- **Evennia:** The Tower MUD world, federation capable
### Agents
- **Timmy:** Local harness, primary orchestrator
- **Bezalel, Ezra, Allegro:** VPS workers dispatched via Gitea issues
- **Code Claw, Gemini:** Specialized workers
- VPS hosts: Ezra, Allegro, Bezalel
- Agents: Timmy harness, Code Claw heartbeat, Gemini AI Studio worker
- Gitea forge
- Evennia worlds
## Current Resource Snapshot
| Resource | Value | Target | Status |
|----------|-------|--------|--------|
| Fleet operational | Yes | Yes | MET |
| Uptime (30d average) | ~78% | >= 95% | NOT MET |
| Days at 95%+ uptime | 0 | 30 | NOT MET |
| Capacity utilization | ~35% | > 60% | NOT MET |
- Fleet operational: yes
- Uptime baseline: 0.0%
- Days at or above 95% uptime: 0
- Capacity utilization: 0.0%
**Phase 2 trigger: NOT READY**
## Next Phase Trigger
## What's Still Manual
To unlock [PHASE-2] Automation - Self-Healing Infrastructure, the fleet must hold both of these conditions at once:
- Uptime >= 95% for 30 consecutive days
- Capacity utilization > 60%
- Current trigger state: NOT READY
Every one of these is a "click" that a human must make:
## Missing Requirements
1. **Restart dead agents** -- SSH into VPS, check process, restart hermes
2. **Health checks** -- SSH to each VPS, verify disk/memory/services
3. **Dead pane recovery** -- tmux pane dies, nobody notices, work stops
4. **Provider failover** -- Nous API goes down, agents stop, human reconfigures
5. **PR triage** -- 80% auto-merge, but 20% need human review
6. **Backlog management** -- 500+ issues, burn loops help but need supervision
7. **Nightly retro** -- manually run and push results
8. **Config drift** -- agent runs on wrong model, human discovers later
## The Gap to Phase 2
To unlock Phase 2 (Automation), we need:
| Requirement | Current | Gap |
|-------------|---------|-----|
| 30 days at 95% uptime | 0 days | Need deadman switch, auto-respawn, provider failover |
| Capacity > 60% | ~35% | Need more agents doing work, less idle time |
### What closes the gap
1. **Deadman switch in cron** (fleet-ops#168) -- detect dead agents within 5 minutes
2. **Auto-respawn** (fleet-ops#173) -- restart dead tmux panes automatically
3. **Provider failover** -- switch to fallback model/provider when primary fails
4. **Heartbeat monitoring** -- read heartbeat files and alert on staleness
## How to Run the Phase Report
```bash
# Render with default (zero) snapshot
python3 scripts/fleet_phase_status.py
# Render with real snapshot
python3 scripts/fleet_phase_status.py --snapshot configs/phase-1-snapshot.json
# Output as JSON
python3 scripts/fleet_phase_status.py --snapshot configs/phase-1-snapshot.json --json
# Write to file
python3 scripts/fleet_phase_status.py --snapshot configs/phase-1-snapshot.json --output docs/FLEET_PHASE_1_SURVIVAL.md
```
- Uptime 0.0% / 95.0%
- Days at or above 95% uptime: 0/30
- Capacity utilization 0.0% / >60.0%
## Manual Clicker Interpretation
Paperclips analogy: Phase 1 = Manual clicker. You ARE the automation.
Every restart, every SSH, every check is a manual click.
The goal of Phase 1 is not to automate. It's to **name what needs automating**. Every manual click documented here is a Phase 2 ticket.
## Manual Clicks Still Required
- Restart agents and services by hand when a node goes dark.
- SSH into machines to verify health, disk, and memory.
- Check Gitea, relay, and world services manually before and after changes.
- Act as the scheduler when automation is missing or only partially wired.
## Repo Signals Already Present
- `scripts/fleet_health_probe.sh` — Automated health probe exists and can supply the uptime baseline for the next phase.
- `scripts/fleet_milestones.py` — Milestone tracker exists, so survival achievements can be narrated and logged.
- `scripts/auto_restart_agent.sh` — Auto-restart tooling already exists as phase-2 groundwork.
- `scripts/backup_pipeline.sh` — Backup pipeline scaffold exists for post-survival automation work.
- `infrastructure/timmy-bridge/reports/generate_report.py` — Bridge reporting exists and can summarize heartbeat-driven uptime.
## Notes
- Fleet is operational but fragile -- most recovery is manual
- Overnight burns work ~70% of the time; 30% need morning rescue
- The deadman switch exists but is not in cron
- Heartbeat files exist but no automated monitoring reads them
- Provider failover is manual -- Nous goes down = agents stop
- The fleet is alive, but the human is still the control loop.
- Phase 1 is about naming reality plainly so later automation has a baseline to beat.

View File

@@ -12,29 +12,6 @@ WORLD_DIR = Path('/Users/apayne/.timmy/evennia/timmy_world')
STATE_FILE = WORLD_DIR / 'game_state.json'
TIMMY_LOG = WORLD_DIR / 'timmy_log.md'
WORLD_ITEMS = {
"foraged key": {"effect": "unlock_tower_cache", "quest_item": True, "consumable": False, "effect_text": "A hidden cache clicks open in the Tower wall."},
"seed packet": {"effect": "grow_garden", "quest_item": False, "consumable": True, "effect_text": "Fresh growth pushes through the Garden soil."},
"notebook": {"effect": "write_notebook_rule", "quest_item": False, "consumable": False, "effect_text": "A new rule joins the whiteboard in the Tower."},
"cloth": {"effect": "patch_bridge", "quest_item": False, "consumable": True, "effect_text": "The Bridge railing is wrapped tight against the weather."},
"oil can": {"effect": "stoke_forge", "quest_item": False, "consumable": True, "effect_text": "The Forge fire answers with a hotter glow."},
"lantern": {"effect": "light_bridge", "quest_item": False, "consumable": False, "effect_text": "A steady lantern glow cuts through the dark over the Bridge."},
"rope spool": {"effect": "secure_bridge", "quest_item": False, "consumable": True, "effect_text": "The Bridge is lashed tight and feels safer underfoot."},
"chalk": {"effect": "mark_threshold", "quest_item": False, "consumable": True, "effect_text": "A chalk mark at the Threshold points wanderers home."},
"weather vane": {"effect": "read_weather", "quest_item": False, "consumable": False, "effect_text": "The weather vane settles and the coming storm makes sense."},
"sunstone": {"effect": "restore_tower_power", "quest_item": False, "consumable": False, "effect_text": "Warm light races through the Tower circuits."},
"iron nails": {"effect": "reinforce_bridge", "quest_item": False, "consumable": True, "effect_text": "The Bridge planks are pinned down against the flood."},
"river stone": {"effect": "water_garden", "quest_item": False, "consumable": True, "effect_text": "Moisture returns to the Garden beds."},
}
ROOM_DISCOVERABLES = {
"Threshold": ["chalk", "sunstone"],
"Tower": ["notebook", "lantern"],
"Forge": ["oil can", "iron nails"],
"Garden": ["seed packet", "foraged key"],
"Bridge": ["cloth", "rope spool", "weather vane", "river stone"],
}
# ============================================================
# NARRATIVE ARC — 4 phases that transform the world
# ============================================================
@@ -166,8 +143,6 @@ class World:
"visitors": [],
},
}
for room_name, items in ROOM_DISCOVERABLES.items():
self.rooms[room_name]["discoverables"] = list(items)
# Characters (not NPCs — they have lives)
self.characters = {
@@ -283,14 +258,6 @@ class World:
"items_crafted": 0,
"conflicts_resolved": 0,
"nights_survived": 0,
"bridge_patched": False,
"bridge_secured": False,
"bridge_reinforced": False,
"bridge_lantern_lit": False,
"tower_cache_unlocked": False,
"threshold_marked": False,
"weather_readable": False,
"sunstone_socketed": False,
}
def tick_time(self):
@@ -409,14 +376,6 @@ class World:
desc += " Rain mists on the dark water below."
if len(self.rooms["Bridge"]["carvings"]) > 1:
desc += f" There are {len(self.rooms['Bridge']['carvings'])} carvings now."
if self.state.get("bridge_patched"):
desc += " Cloth bindings keep the railing from splintering."
if self.state.get("bridge_secured"):
desc += " Rope lines keep the span steady against the flood."
if self.state.get("bridge_reinforced"):
desc += " Fresh iron nails hold the planks tight."
if self.state.get("bridge_lantern_lit"):
desc += " A lantern glows warm over the water."
elif room_name == "Tower":
power = self.state.get("tower_power_low", False)
@@ -425,19 +384,6 @@ class World:
if self.rooms["Tower"]["messages"]:
desc += f" The whiteboard holds {len(self.rooms['Tower']['messages'])} rules."
if self.state.get("tower_cache_unlocked"):
desc += " A hidden cache stands open beneath the whiteboard."
if room_name == "Threshold" and self.state.get("threshold_marked"):
desc += " A chalk arrow points late arrivals toward shelter."
if room_name == "Garden" and self.state.get("weather_readable"):
desc += " The beds are arranged to catch whatever weather comes next."
if room_name == "Tower" and self.state.get("sunstone_socketed"):
desc += " A sunstone keeps the room lit with a stubborn amber glow."
discoverables = room.get("discoverables", [])
if discoverables:
desc += f" Discoverable items: {', '.join(discoverables)}."
# Who's here
here = [n for n, c in self.characters.items() if c["room"] == room_name and n != char_name]
@@ -539,10 +485,6 @@ class ActionSystem:
"cost": 1,
"description": "Take an item from the room",
},
"use": {
"cost": 1,
"description": "Use an item from your inventory to change the world",
},
"examine": {
"cost": 0,
"description": "Examine something in detail",
@@ -588,13 +530,6 @@ class ActionSystem:
available.append("rest")
available.append("examine")
discoverables = world.rooms[room].get("discoverables", [])
for item in discoverables:
available.append(f"take:{item}")
for item in char["inventory"]:
available.append(f"use:{item}")
if char["inventory"]:
available.append("give:item")
@@ -1137,76 +1072,6 @@ class GameEngine:
f.write(f"\n*Began: {datetime.now().strftime('%Y-%m-%d %H:%M')}*\n\n")
f.write("---\n\n")
f.write(message + "\n")
def _take_item(self, item_name, scene):
room_name = self.world.characters["Timmy"]["room"]
discoverables = self.world.rooms[room_name].get("discoverables", [])
if item_name not in discoverables:
scene["log"].append(f"There is no {item_name} here.")
return
discoverables.remove(item_name)
self.world.characters["Timmy"]["inventory"].append(item_name)
scene["log"].append(f"You take the {item_name}.")
if WORLD_ITEMS.get(item_name, {}).get("quest_item"):
scene["world_events"].append(f"The {item_name} feels important. It might open a quest route.")
def _use_item(self, item_name, scene):
inventory = self.world.characters["Timmy"]["inventory"]
if item_name not in inventory:
scene["log"].append(f"You are not carrying {item_name}.")
return
item = WORLD_ITEMS.get(item_name)
if not item:
scene["log"].append(f"The {item_name} doesn't seem to do anything.")
return
effect = item["effect"]
effect_text = item["effect_text"]
if effect == "grow_garden":
self.world.rooms["Garden"]["growth"] = min(5, self.world.rooms["Garden"]["growth"] + 2)
self.world.state["garden_drought"] = False
elif effect == "unlock_tower_cache":
self.world.state["tower_cache_unlocked"] = True
cache_rule = "Rule: Keys open more than doors when the world trusts you."
if cache_rule not in self.world.rooms["Tower"]["messages"]:
self.world.rooms["Tower"]["messages"].append(cache_rule)
elif effect == "write_notebook_rule":
note_rule = f"Rule #{self.world.tick}: A notebook can turn memory into structure."
self.world.rooms["Tower"]["messages"].append(note_rule)
elif effect == "patch_bridge":
self.world.state["bridge_patched"] = True
self.world.state["bridge_flooding"] = False
self.world.rooms["Bridge"]["weather"] = None
self.world.rooms["Bridge"]["rain_ticks"] = 0
elif effect == "stoke_forge":
self.world.rooms["Forge"]["fire"] = "glowing"
self.world.state["forge_fire_dying"] = False
elif effect == "light_bridge":
self.world.state["bridge_lantern_lit"] = True
elif effect == "secure_bridge":
self.world.state["bridge_secured"] = True
self.world.state["bridge_flooding"] = False
self.world.rooms["Bridge"]["weather"] = None
self.world.rooms["Bridge"]["rain_ticks"] = 0
elif effect == "mark_threshold":
self.world.state["threshold_marked"] = True
elif effect == "read_weather":
self.world.state["weather_readable"] = True
self.world.state["garden_drought"] = False
elif effect == "restore_tower_power":
self.world.state["tower_power_low"] = False
self.world.state["sunstone_socketed"] = True
elif effect == "reinforce_bridge":
self.world.state["bridge_reinforced"] = True
self.world.state["bridge_flooding"] = False
elif effect == "water_garden":
self.world.state["garden_drought"] = False
self.world.rooms["Garden"]["growth"] = min(5, self.world.rooms["Garden"]["growth"] + 1)
scene["log"].append(f"You use the {item_name}. {effect_text}")
scene["world_events"].append(effect_text)
if item.get("consumable"):
inventory.remove(item_name)
def run_tick(self, timmy_action="look"):
"""Run one tick. Return the scene and available choices."""
@@ -1235,7 +1100,7 @@ class GameEngine:
action_costs = {
"move": 2, "tend_fire": 3, "write_rule": 2, "carve": 2,
"plant": 2, "study": 2, "forge": 3, "help": 2, "speak": 1,
"listen": 0, "rest": -2, "examine": 0, "give": 0, "take": 1, "use": 1,
"listen": 0, "rest": -2, "examine": 0, "give": 0, "take": 1,
}
# Extract action name
@@ -1495,17 +1360,9 @@ class GameEngine:
elif timmy_action == "examine":
room = self.world.characters["Timmy"]["room"]
room_data = self.world.rooms[room]
items = room_data.get("items", []) + room_data.get("discoverables", [])
items = room_data.get("items", [])
scene["log"].append(f"You examine The {room}. You see: {', '.join(items) if items else 'nothing special'}")
elif timmy_action.startswith("take:"):
item_name = timmy_action.split(":", 1)[1]
self._take_item(item_name, scene)
elif timmy_action.startswith("use:"):
item_name = timmy_action.split(":", 1)[1]
self._use_item(item_name, scene)
elif timmy_action.startswith("help:"):
# Help increases trust
target_name = timmy_action.split(":")[1]

View File

@@ -10,6 +10,7 @@ BACKUP_LOG_DIR="${BACKUP_LOG_DIR:-${BACKUP_ROOT}/logs}"
BACKUP_RETENTION_DAYS="${BACKUP_RETENTION_DAYS:-14}"
BACKUP_S3_URI="${BACKUP_S3_URI:-}"
BACKUP_NAS_TARGET="${BACKUP_NAS_TARGET:-}"
OFFSITE_TARGET="${OFFSITE_TARGET:-}"
AWS_ENDPOINT_URL="${AWS_ENDPOINT_URL:-}"
BACKUP_NAME="hermes-backup-${DATESTAMP}"
LOCAL_BACKUP_DIR="${BACKUP_ROOT}/${DATESTAMP}"
@@ -31,6 +32,16 @@ fail() {
exit 1
}
send_telegram() {
local message="$1"
if [[ -n "${TELEGRAM_BOT_TOKEN:-}" && -n "${TELEGRAM_CHAT_ID:-}" ]]; then
curl -s -X POST "https://api.telegram.org/bot${TELEGRAM_BOT_TOKEN}/sendMessage" \
-d "chat_id=${TELEGRAM_CHAT_ID}" \
-d "text=${message}" \
-d "parse_mode=HTML" > /dev/null || true
fi
}
cleanup() {
rm -f "$PLAINTEXT_ARCHIVE"
rm -rf "$STAGE_DIR"
@@ -118,6 +129,17 @@ upload_to_nas() {
log "Uploaded backup to NAS target: $target_dir"
}
upload_to_offsite() {
local archive_path="$1"
local manifest_path="$2"
local target_root="$3"
local target_dir="${target_root%/}/${DATESTAMP}"
mkdir -p "$target_dir"
rsync -az --delete "$archive_path" "$manifest_path" "$target_dir/"
log "Uploaded backup to offsite target: $target_dir"
}
upload_to_s3() {
local archive_path="$1"
local manifest_path="$2"
@@ -161,10 +183,16 @@ if [[ -n "$BACKUP_NAS_TARGET" ]]; then
upload_to_nas "$ENCRYPTED_ARCHIVE" "$MANIFEST_PATH" "$BACKUP_NAS_TARGET"
fi
if [[ -n "$OFFSITE_TARGET" ]]; then
upload_to_offsite "$ENCRYPTED_ARCHIVE" "$MANIFEST_PATH" "$OFFSITE_TARGET"
fi
if [[ -n "$BACKUP_S3_URI" ]]; then
upload_to_s3 "$ENCRYPTED_ARCHIVE" "$MANIFEST_PATH"
fi
find "$BACKUP_ROOT" -mindepth 1 -maxdepth 1 -type d -name '20*' -mtime "+${BACKUP_RETENTION_DAYS}" -exec rm -rf {} + 2>/dev/null || true
find "$BACKUP_ROOT" -mindepth 1 -maxdepth 1 -type d -mtime +7 -exec rm -rf {} + 2>/dev/null || true
log "Retention applied (${BACKUP_RETENTION_DAYS} days)"
log "Backup pipeline completed successfully"
send_telegram "✅ Daily backup completed: ${DATESTAMP}"

View File

@@ -1,58 +0,0 @@
from importlib.util import module_from_spec, spec_from_file_location
from pathlib import Path
import unittest
ROOT = Path(__file__).resolve().parent.parent
GAME_PATH = ROOT / "evennia" / "timmy_world" / "game.py"
def load_game_module():
spec = spec_from_file_location("tower_game_items", GAME_PATH)
module = module_from_spec(spec)
assert spec.loader is not None
spec.loader.exec_module(module)
module.random.seed(0)
return module
class TestTowerGameWorldItems(unittest.TestCase):
def test_world_has_ten_unique_items_and_a_quest_item(self):
module = load_game_module()
world = module.World()
room_items = {
item
for room in world.rooms.values()
for item in room.get("discoverables", [])
}
self.assertGreaterEqual(len(room_items), 10)
self.assertIn("foraged key", room_items)
self.assertTrue(module.WORLD_ITEMS["foraged key"]["quest_item"])
def test_items_change_world_state_when_used(self):
module = load_game_module()
engine = module.GameEngine()
engine.start_new_game()
engine.world.characters["Timmy"]["energy"] = 10
engine.world.characters["Timmy"]["room"] = "Garden"
initial_growth = engine.world.rooms["Garden"]["growth"]
engine.run_tick("take:seed packet")
use_seed = engine.run_tick("use:seed packet")
self.assertGreater(engine.world.rooms["Garden"]["growth"], initial_growth)
self.assertNotIn("seed packet", engine.world.characters["Timmy"]["inventory"])
self.assertTrue(any("garden" in line.lower() for line in use_seed["world_events"] + use_seed["log"]))
engine.world.characters["Timmy"]["energy"] = 10
engine.run_tick("take:foraged key")
use_key = engine.run_tick("use:foraged key")
self.assertTrue(engine.world.state["tower_cache_unlocked"])
self.assertTrue(any("cache" in line.lower() or "quest" in line.lower() for line in use_key["world_events"] + use_key["log"]))
if __name__ == "__main__":
unittest.main()