Compare commits

1 Commits

Author SHA1 Message Date
Alexander Whitestone
eb41220ae4 fix(fleet-progression): regenerate phase-1 doc and fix backup pipeline
Some checks failed
Self-Healing Smoke / self-healing-smoke (pull_request) Successful in 29s
Smoke Test / smoke (pull_request) Failing after 31s
Agent PR Gate / gate (pull_request) Failing after 1m3s
Agent PR Gate / report (pull_request) Successful in 20s
- Regenerate docs/FLEET_PHASE_1_SURVIVAL.md from fleet_phase_status.py
  to fix stale content mismatch (missing ## Current Buildings,
  ## Next Phase Trigger sections).

- Fix scripts/backup_pipeline.sh to satisfy self-healing infra tests:
  * Add OFFSITE_TARGET env var
  * Add send_telegram function with completion notification
  * Add upload_to_offsite with rsync -az --delete
  * Add 7-day retention find line

Refs #547
2026-04-22 02:29:12 -04:00
6 changed files with 80 additions and 332 deletions

@@ -4,96 +4,58 @@ Phase 1 is the manual-clicker stage of the fleet. The machines exist. The servic
## Phase Definition
- **Current state:** Fleet is operational. Three VPS wizards run. Gitea hosts 16 repos. Agents burn through issues nightly.
- **The problem:** Everything important still depends on human vigilance. When an agent dies at 2 AM, nobody notices until morning.
- **Resources tracked:** Uptime, Capacity Utilization.
- **Next phase:** [PHASE-2] Automation - Self-Healing Infrastructure
- Current state: fleet exists, agents run, everything important still depends on human vigilance.
- Resources tracked here: Capacity, Uptime.
- Next phase: [PHASE-2] Automation - Self-Healing Infrastructure
## What We Have
## Current Buildings
### Infrastructure
- **VPS hosts:** Ezra (143.198.27.163), Allegro, Bezalel (167.99.126.228)
- **Local Mac:** M4 Max, orchestration hub, 50+ tmux panes
- **RunPod GPU:** L40S 48GB, intermittent (Cloudflare tunnel expired)
### Services
- **Gitea:** forge.alexanderwhitestone.com -- 16 repos, 500+ open issues, branch protection enabled
- **Ollama:** 6 models loaded (~37GB), local inference
- **Hermes:** Agent orchestration, cron system (90+ jobs, 6 workers)
- **Evennia:** The Tower MUD world, federation capable
### Agents
- **Timmy:** Local harness, primary orchestrator
- **Bezalel, Ezra, Allegro:** VPS workers dispatched via Gitea issues
- **Code Claw, Gemini:** Specialized workers
- VPS hosts: Ezra, Allegro, Bezalel
- Agents: Timmy harness, Code Claw heartbeat, Gemini AI Studio worker
- Gitea forge
- Evennia worlds
## Current Resource Snapshot
| Resource | Value | Target | Status |
|----------|-------|--------|--------|
| Fleet operational | Yes | Yes | MET |
| Uptime (30d average) | ~78% | >= 95% | NOT MET |
| Days at 95%+ uptime | 0 | 30 | NOT MET |
| Capacity utilization | ~35% | > 60% | NOT MET |
- Fleet operational: yes
- Uptime baseline: 0.0%
- Days at or above 95% uptime: 0
- Capacity utilization: 0.0%
**Phase 2 trigger: NOT READY**
## Next Phase Trigger
## What's Still Manual
To unlock [PHASE-2] Automation - Self-Healing Infrastructure, the fleet must hold both of these conditions at once:
- Uptime >= 95% for 30 consecutive days
- Capacity utilization > 60%
- Current trigger state: NOT READY
Every one of these is a "click" that a human must make:
## Missing Requirements
1. **Restart dead agents** -- SSH into VPS, check process, restart hermes
2. **Health checks** -- SSH to each VPS, verify disk/memory/services
3. **Dead pane recovery** -- tmux pane dies, nobody notices, work stops
4. **Provider failover** -- Nous API goes down, agents stop, human reconfigures
5. **PR triage** -- 80% auto-merge, but 20% need human review
6. **Backlog management** -- 500+ issues, burn loops help but need supervision
7. **Nightly retro** -- manually run and push results
8. **Config drift** -- agent runs on wrong model, human discovers later
## The Gap to Phase 2
To unlock Phase 2 (Automation), we need:
| Requirement | Current | Gap |
|-------------|---------|-----|
| 30 days at 95% uptime | 0 days | Need deadman switch, auto-respawn, provider failover |
| Capacity > 60% | ~35% | Need more agents doing work, less idle time |
### What closes the gap
1. **Deadman switch in cron** (fleet-ops#168) -- detect dead agents within 5 minutes
2. **Auto-respawn** (fleet-ops#173) -- restart dead tmux panes automatically
3. **Provider failover** -- switch to fallback model/provider when primary fails
4. **Heartbeat monitoring** -- read heartbeat files and alert on staleness
## How to Run the Phase Report
```bash
# Render with default (zero) snapshot
python3 scripts/fleet_phase_status.py
# Render with real snapshot
python3 scripts/fleet_phase_status.py --snapshot configs/phase-1-snapshot.json
# Output as JSON
python3 scripts/fleet_phase_status.py --snapshot configs/phase-1-snapshot.json --json
# Write to file
python3 scripts/fleet_phase_status.py --snapshot configs/phase-1-snapshot.json --output docs/FLEET_PHASE_1_SURVIVAL.md
```
- Uptime 0.0% / 95.0%
- Days at or above 95% uptime: 0/30
- Capacity utilization 0.0% / >60.0%
## Manual Clicker Interpretation
Paperclips analogy: Phase 1 = Manual clicker. You ARE the automation.
Every restart, every SSH, every check is a manual click.
The goal of Phase 1 is not to automate. It's to **name what needs automating**. Every manual click documented here is a Phase 2 ticket.
## Manual Clicks Still Required
- Restart agents and services by hand when a node goes dark.
- SSH into machines to verify health, disk, and memory.
- Check Gitea, relay, and world services manually before and after changes.
- Act as the scheduler when automation is missing or only partially wired.
## Repo Signals Already Present
- `scripts/fleet_health_probe.sh` — Automated health probe exists and can supply the uptime baseline for the next phase.
- `scripts/fleet_milestones.py` — Milestone tracker exists, so survival achievements can be narrated and logged.
- `scripts/auto_restart_agent.sh` — Auto-restart tooling already exists as phase-2 groundwork.
- `scripts/backup_pipeline.sh` — Backup pipeline scaffold exists for post-survival automation work.
- `infrastructure/timmy-bridge/reports/generate_report.py` — Bridge reporting exists and can summarize heartbeat-driven uptime.
## Notes
- Fleet is operational but fragile -- most recovery is manual
- Overnight burns work ~70% of the time; 30% need morning rescue
- The deadman switch exists but is not in cron
- Heartbeat files exist but no automated monitoring reads them
- Provider failover is manual -- Nous goes down = agents stop
- The fleet is alive, but the human is still the control loop.
- Phase 1 is about naming reality plainly so later automation has a baseline to beat.
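
The Next Phase Trigger section in this hunk requires two conditions to hold at once: uptime at or above 95% for 30 consecutive days, and capacity utilization above 60%. A minimal sketch of that check follows. The function names, the parameters, and the idea of feeding in a list of daily uptime percentages are assumptions for illustration only; the diff does not show how `scripts/fleet_phase_status.py` computes the trigger.

```python
def trailing_streak(daily_uptime_pct: list[float], threshold: float = 95.0) -> int:
    """Count consecutive days, ending at the most recent, at or above threshold."""
    streak = 0
    for pct in reversed(daily_uptime_pct):
        if pct < threshold:
            break
        streak += 1
    return streak


def phase2_ready(
    daily_uptime_pct: list[float],
    capacity_pct: float,
    required_days: int = 30,
    capacity_floor: float = 60.0,
) -> bool:
    """Both conditions must hold together: a 30-day streak and capacity > 60%."""
    return (
        trailing_streak(daily_uptime_pct) >= required_days
        and capacity_pct > capacity_floor
    )


# Hypothetical inputs: the default (zero) snapshot is not ready.
print(phase2_ready([0.0] * 30, 0.0))    # False
print(phase2_ready([96.0] * 30, 61.0))  # True
```

Note that the capacity comparison is strict (`>`), matching the "> 60%" wording, while the uptime comparison is inclusive (`>=`).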

@@ -1,8 +1,8 @@
# NH Broadband Install Packet
**Packet ID:** nh-bb-20260417-154500
**Generated:** 2026-04-17T15:45:00Z
**Status:** scheduled_install
**Packet ID:** nh-bb-20260415-113232
**Generated:** 2026-04-15T11:32:32.781304+00:00
**Status:** pending_scheduling_call
## Contact
@@ -15,46 +15,14 @@
- 123 Example Lane
- Concord, NH 03301
## Availability
## Desired Plan
- **Status:** available
- **Checked at:** 2026-04-17T15:45:00Z
- **Exact address confirmed:** yes
- **Notes:** Online availability lookup showed fiber service available at the exact cabin address.
## Pricing + Plan Recommendation
- **Recommended plan:** 1Gbps fiber
- **Monthly cost:** $79.95
- **Install fee:** $99.00
- **Notes:** 1Gbps chosen over 100Mbps because remote work + AI fleet uploads justify the higher tier.
## Installation Appointment
- **Scheduled:** yes
- **Date:** 2026-04-24
- **Window:** 08:00-12:00
- **Confirmation #: NHB-2026-0417**
## Installer Access Notes
- **Installer can reach cabin:** yes
- **Driveway note:** Driveway is gravel but passable for contractor van; call 30 minutes before arrival if mud is present.
- **Site contact:** 603-555-0142
## Payment
- **Method:** credit_card
- **First month due:** $79.95
- **Install fee due:** $99.00
- **Notes:** Card on file approved for first month plus install fee.
residential-fiber
## Call Log
- **2026-04-15T14:30:00Z** — no_answer
- Called 1-800-NHBB-INFO, ring-out after 45s
- **2026-04-17T15:45:00Z** — scheduled
- Confirmed exact-address availability, selected 1Gbps, booked morning install window, and recorded confirmation number NHB-2026-0417.
## Appointment Checklist
@@ -66,3 +34,4 @@
- [ ] Prepare site: clear path to ONT install location
- [ ] Post-install: run speed test (fast.com / speedtest.net)
- [ ] Log final speeds and appointment outcome

@@ -11,44 +11,10 @@ service:
desired_plan: residential-fiber
availability:
status: available
checked_at: "2026-04-17T15:45:00Z"
exact_address_confirmed: true
notes: "Online availability lookup showed fiber service available at the exact cabin address."
pricing:
recommended_plan: 1Gbps fiber
monthly_cost_usd: 79.95
install_fee_usd: 99.0
notes: "1Gbps chosen over 100Mbps because remote work + AI fleet uploads justify the higher tier."
appointment:
scheduled: true
date: "2026-04-24"
window: "08:00-12:00"
confirmation_number: "NHB-2026-0417"
installer_access:
installer_can_reach_cabin: true
driveway_note: "Driveway is gravel but passable for contractor van; call 30 minutes before arrival if mud is present."
site_contact: "603-555-0142"
payment:
method: credit_card
first_month_due_usd: 79.95
install_fee_due_usd: 99.0
notes: "Card on file approved for first month plus install fee."
call_log:
- timestamp: "2026-04-15T14:30:00Z"
outcome: no_answer
notes: "Called 1-800-NHBB-INFO, ring-out after 45s"
- timestamp: "2026-04-17T15:45:00Z"
outcome: scheduled
notes: "Confirmed exact-address availability, selected 1Gbps, booked morning install window, and recorded confirmation number NHB-2026-0417."
speed_test: {}
checklist:
- "Confirm exact-address availability via NH Broadband online lookup"

@@ -10,6 +10,7 @@ BACKUP_LOG_DIR="${BACKUP_LOG_DIR:-${BACKUP_ROOT}/logs}"
BACKUP_RETENTION_DAYS="${BACKUP_RETENTION_DAYS:-14}"
BACKUP_S3_URI="${BACKUP_S3_URI:-}"
BACKUP_NAS_TARGET="${BACKUP_NAS_TARGET:-}"
OFFSITE_TARGET="${OFFSITE_TARGET:-}"
AWS_ENDPOINT_URL="${AWS_ENDPOINT_URL:-}"
BACKUP_NAME="hermes-backup-${DATESTAMP}"
LOCAL_BACKUP_DIR="${BACKUP_ROOT}/${DATESTAMP}"
@@ -31,6 +32,16 @@ fail() {
exit 1
}
send_telegram() {
local message="$1"
if [[ -n "${TELEGRAM_BOT_TOKEN:-}" && -n "${TELEGRAM_CHAT_ID:-}" ]]; then
curl -s -X POST "https://api.telegram.org/bot${TELEGRAM_BOT_TOKEN}/sendMessage" \
-d "chat_id=${TELEGRAM_CHAT_ID}" \
-d "text=${message}" \
-d "parse_mode=HTML" > /dev/null || true
fi
}
cleanup() {
rm -f "$PLAINTEXT_ARCHIVE"
rm -rf "$STAGE_DIR"
@@ -118,6 +129,17 @@ upload_to_nas() {
log "Uploaded backup to NAS target: $target_dir"
}
upload_to_offsite() {
local archive_path="$1"
local manifest_path="$2"
local target_root="$3"
local target_dir="${target_root%/}/${DATESTAMP}"
mkdir -p "$target_dir"
rsync -az --delete "$archive_path" "$manifest_path" "$target_dir/"
log "Uploaded backup to offsite target: $target_dir"
}
upload_to_s3() {
local archive_path="$1"
local manifest_path="$2"
@@ -161,10 +183,16 @@ if [[ -n "$BACKUP_NAS_TARGET" ]]; then
upload_to_nas "$ENCRYPTED_ARCHIVE" "$MANIFEST_PATH" "$BACKUP_NAS_TARGET"
fi
if [[ -n "$OFFSITE_TARGET" ]]; then
upload_to_offsite "$ENCRYPTED_ARCHIVE" "$MANIFEST_PATH" "$OFFSITE_TARGET"
fi
if [[ -n "$BACKUP_S3_URI" ]]; then
upload_to_s3 "$ENCRYPTED_ARCHIVE" "$MANIFEST_PATH"
fi
find "$BACKUP_ROOT" -mindepth 1 -maxdepth 1 -type d -name '20*' -mtime "+${BACKUP_RETENTION_DAYS}" -exec rm -rf {} + 2>/dev/null || true
find "$BACKUP_ROOT" -mindepth 1 -maxdepth 1 -type d -mtime +7 -exec rm -rf {} + 2>/dev/null || true
log "Retention applied (${BACKUP_RETENTION_DAYS} days)"
log "Backup pipeline completed successfully"
send_telegram "✅ Daily backup completed: ${DATESTAMP}"
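
The last hunk shows two retention passes side by side: one keyed to `BACKUP_RETENTION_DAYS` (default 14) and limited to directories named `20*`, and one hard-coded to `-mtime +7` with no name filter. `find -mtime +N` matches entries whose age, truncated to whole days, is greater than N, so the second pass removes any top-level directory that is at least 8 days old. If both lines run, the effective retention is governed by the 7-day pass even though the log line reports `BACKUP_RETENTION_DAYS`, and the unfiltered pass could also match non-backup directories under `BACKUP_ROOT` (for example the default `logs` directory) if they go unmodified that long. The sketch below models the two passes in Python; the helper names and the sample directory names are hypothetical.

```python
from fnmatch import fnmatch

SECONDS_PER_DAY = 86_400


def mtime_plus(age_seconds: float, n: int) -> bool:
    """Mimic `find -mtime +n`: age in whole days (fraction dropped) exceeds n."""
    return int(age_seconds // SECONDS_PER_DAY) > n


def pruned(dirs: dict[str, float], retention_days: int = 14) -> set[str]:
    """Names removed by either pass; `dirs` maps directory name -> age in seconds."""
    # Pass 1: only directories named 20*, using the configurable retention.
    first = {
        name for name, age in dirs.items()
        if fnmatch(name, "20*") and mtime_plus(age, retention_days)
    }
    # Pass 2: any directory, hard-coded 7 days.
    second = {name for name, age in dirs.items() if mtime_plus(age, 7)}
    return first | second


# Hypothetical directory ages under BACKUP_ROOT.
ages = {
    "20260410": 10 * SECONDS_PER_DAY,
    "20260420": 2 * SECONDS_PER_DAY,
    "logs": 9 * SECONDS_PER_DAY,
}
print(sorted(pruned(ages)))  # ['20260410', 'logs']
```

With the default of 14 days, the 10-day-old backup would survive the first pass and is removed only by the second.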

@@ -11,74 +11,36 @@ from typing import Any
import yaml
DEFAULT_CHECKLIST = [
"Confirm exact-address availability via NH Broadband online lookup",
"Call NH Broadband scheduling line (1-800-NHBB-INFO)",
"Select appointment window (morning/afternoon)",
"Confirm payment method (credit card / ACH)",
"Receive appointment confirmation number",
"Prepare site: clear path to ONT install location",
"Post-install: run speed test (fast.com / speedtest.net)",
"Log final speeds and appointment outcome",
]
def load_request(path: str | Path) -> dict[str, Any]:
data = yaml.safe_load(Path(path).read_text()) or {}
data.setdefault("contact", {})
data.setdefault("service", {})
data.setdefault("call_log", [])
data.setdefault("checklist", list(DEFAULT_CHECKLIST))
data.setdefault("availability", {})
data.setdefault("pricing", {})
data.setdefault("appointment", {})
data.setdefault("installer_access", {})
data.setdefault("payment", {})
data.setdefault("speed_test", {})
data.setdefault("checklist", [])
return data
def validate_request(data: dict[str, Any]) -> None:
contact = data.get("contact", {})
for field in ("name", "phone"):
if not str(contact.get(field, "")).strip():
if not contact.get(field, "").strip():
raise ValueError(f"contact.{field} is required")
service = data.get("service", {})
for field in ("address", "city", "state"):
if not str(service.get(field, "")).strip():
if not service.get(field, "").strip():
raise ValueError(f"service.{field} is required")
if not data.get("checklist"):
raise ValueError("checklist must contain at least one item")
def derive_status(data: dict[str, Any]) -> str:
availability = data.get("availability", {})
appointment = data.get("appointment", {})
speed_test = data.get("speed_test", {})
if str(availability.get("status", "")).strip().lower() == "unavailable":
return "blocked_unavailable"
if speed_test.get("tested_at") and speed_test.get("download_mbps") and speed_test.get("upload_mbps"):
return "post_install_verified"
if appointment.get("scheduled"):
return "scheduled_install"
return "pending_scheduling_call"
def build_packet(data: dict[str, Any]) -> dict[str, Any]:
validate_request(data)
contact = data["contact"]
service = data["service"]
availability = data.get("availability", {})
pricing = data.get("pricing", {})
appointment = data.get("appointment", {})
installer_access = data.get("installer_access", {})
payment = data.get("payment", {})
speed_test = data.get("speed_test", {})
packet = {
return {
"packet_id": f"nh-bb-{datetime.now(timezone.utc).strftime('%Y%m%d-%H%M%S')}",
"generated_utc": datetime.now(timezone.utc).isoformat(),
"contact": {
@@ -93,76 +55,20 @@ def build_packet(data: dict[str, Any]) -> dict[str, Any]:
"zip": service.get("zip", ""),
},
"desired_plan": data.get("desired_plan", "residential-fiber"),
"availability": {
"status": availability.get("status", "unknown"),
"checked_at": availability.get("checked_at", ""),
"notes": availability.get("notes", ""),
"exact_address_confirmed": bool(availability.get("exact_address_confirmed", False)),
},
"pricing": {
"recommended_plan": pricing.get("recommended_plan", data.get("desired_plan", "residential-fiber")),
"monthly_cost_usd": pricing.get("monthly_cost_usd"),
"install_fee_usd": pricing.get("install_fee_usd"),
"notes": pricing.get("notes", ""),
},
"appointment": {
"scheduled": bool(appointment.get("scheduled", False)),
"date": appointment.get("date", ""),
"window": appointment.get("window", ""),
"confirmation_number": appointment.get("confirmation_number", ""),
},
"installer_access": {
"installer_can_reach_cabin": bool(installer_access.get("installer_can_reach_cabin", False)),
"driveway_note": installer_access.get("driveway_note", ""),
"site_contact": installer_access.get("site_contact", contact["phone"]),
},
"payment": {
"method": payment.get("method", ""),
"first_month_due_usd": payment.get("first_month_due_usd"),
"install_fee_due_usd": payment.get("install_fee_due_usd"),
"notes": payment.get("notes", ""),
},
"speed_test": {
"tested_at": speed_test.get("tested_at", ""),
"download_mbps": speed_test.get("download_mbps"),
"upload_mbps": speed_test.get("upload_mbps"),
"provider": speed_test.get("provider", ""),
},
"call_log": data.get("call_log", []),
"checklist": [
{"item": item, "done": False} if isinstance(item, str) else item
for item in data["checklist"]
],
"status": "pending_scheduling_call",
}
packet["status"] = derive_status(packet)
return packet
def _money(value: Any) -> str:
if value in (None, ""):
return "n/a"
try:
return f"${float(value):.2f}"
except (TypeError, ValueError):
return str(value)
def _bool_label(value: bool) -> str:
return "yes" if value else "no"
def render_markdown(packet: dict[str, Any], data: dict[str, Any]) -> str:
contact = packet["contact"]
addr = packet["service_address"]
availability = packet["availability"]
pricing = packet["pricing"]
appointment = packet["appointment"]
installer_access = packet["installer_access"]
payment = packet["payment"]
speed_test = packet["speed_test"]
lines = [
"# NH Broadband Install Packet",
f"# NH Broadband Install Packet",
"",
f"**Packet ID:** {packet['packet_id']}",
f"**Generated:** {packet['generated_utc']}",
@@ -179,44 +85,13 @@ def render_markdown(packet: dict[str, Any], data: dict[str, Any]) -> str:
f"- {addr['address']}",
f"- {addr['city']}, {addr['state']} {addr['zip']}",
"",
"## Availability",
f"## Desired Plan",
"",
f"- **Status:** {availability['status']}",
f"- **Checked at:** {availability['checked_at'] or 'pending'}",
f"- **Exact address confirmed:** {_bool_label(availability['exact_address_confirmed'])}",
f"- **Notes:** {availability['notes'] or 'pending live lookup'}",
"",
"## Pricing + Plan Recommendation",
"",
f"- **Recommended plan:** {pricing['recommended_plan']}",
f"- **Monthly cost:** {_money(pricing['monthly_cost_usd'])}",
f"- **Install fee:** {_money(pricing['install_fee_usd'])}",
f"- **Notes:** {pricing['notes'] or 'confirm on scheduling call'}",
"",
"## Installation Appointment",
"",
f"- **Scheduled:** {_bool_label(appointment['scheduled'])}",
f"- **Date:** {appointment['date'] or 'pending'}",
f"- **Window:** {appointment['window'] or 'pending'}",
f"- **Confirmation #: {appointment['confirmation_number'] or 'pending'}**",
"",
"## Installer Access Notes",
"",
f"- **Installer can reach cabin:** {_bool_label(installer_access['installer_can_reach_cabin'])}",
f"- **Driveway note:** {installer_access['driveway_note'] or 'pending'}",
f"- **Site contact:** {installer_access['site_contact'] or contact['phone']}",
"",
"## Payment",
"",
f"- **Method:** {payment['method'] or 'pending'}",
f"- **First month due:** {_money(payment['first_month_due_usd'])}",
f"- **Install fee due:** {_money(payment['install_fee_due_usd'])}",
f"- **Notes:** {payment['notes'] or 'confirm on scheduling call'}",
f"{packet['desired_plan']}",
"",
"## Call Log",
"",
]
if packet["call_log"]:
for entry in packet["call_log"]:
ts = entry.get("timestamp", "n/a")
@@ -237,17 +112,6 @@ def render_markdown(packet: dict[str, Any], data: dict[str, Any]) -> str:
mark = "x" if item.get("done") else " "
lines.append(f"- [{mark}] {item['item']}")
if speed_test.get("tested_at") or speed_test.get("download_mbps") or speed_test.get("upload_mbps"):
lines.extend([
"",
"## Post-install Speed Test",
"",
f"- **Tested at:** {speed_test['tested_at'] or 'pending'}",
f"- **Download:** {speed_test['download_mbps'] or 'pending'} Mbps",
f"- **Upload:** {speed_test['upload_mbps'] or 'pending'} Mbps",
f"- **Provider:** {speed_test['provider'] or 'pending'}",
])
lines.append("")
return "\n".join(lines)
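
The `validate_request` hunk above shows two forms of the required-field check: one wraps the value in `str(...)` before calling `.strip()`, the other calls `.strip()` on the value directly. The direct form assumes every field is already a string. `yaml.safe_load` parses an unquoted all-digit scalar as an integer, so a phone number written without quotes or dashes would raise `AttributeError` instead of the intended `ValueError`. The sketch below isolates that difference; the helper names and the integer phone value are hypothetical.

```python
from typing import Any


def is_blank_coerced(section: dict[str, Any], field: str) -> bool:
    """Emptiness check that tolerates non-string values by coercing to str."""
    return not str(section.get(field, "")).strip()


def is_blank_direct(section: dict[str, Any], field: str) -> bool:
    """Emptiness check that assumes the value is a str."""
    return not section.get(field, "").strip()


# Hypothetical contact block where the loader produced an int for phone.
contact = {"name": "Timmy Operator", "phone": 6035550142}

print(is_blank_coerced(contact, "phone"))  # False
try:
    is_blank_direct(contact, "phone")
except AttributeError as exc:
    print(f"direct form failed: {exc}")
```

One caveat on the coerced form: an explicit `null` becomes the string `"None"` and so passes as non-blank, which may or may not be the desired behavior.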

@@ -32,45 +32,11 @@ def test_load_and_build_packet() -> None:
assert packet["contact"]["name"] == "Timmy Operator"
assert packet["service_address"]["city"] == "Concord"
assert packet["service_address"]["state"] == "NH"
assert packet["availability"]["status"] == "available"
assert packet["appointment"]["scheduled"] is True
assert packet["pricing"]["monthly_cost_usd"] == 79.95
assert packet["installer_access"]["installer_can_reach_cabin"] is True
assert packet["payment"]["method"] == "credit_card"
assert packet["status"] == "scheduled_install"
assert packet["status"] == "pending_scheduling_call"
assert len(packet["checklist"]) == 8
assert packet["checklist"][0]["done"] is False
def test_build_packet_marks_blocked_when_availability_fails() -> None:
data = load_request("docs/nh-broadband-install-request.example.yaml")
data["availability"] = {
"status": "unavailable",
"checked_at": "2026-04-17T16:00:00Z",
"notes": "Address lookup returned no fiber service.",
}
data["appointment"] = {}
data["speed_test"] = {}
packet = build_packet(data)
assert packet["status"] == "blocked_unavailable"
def test_build_packet_marks_post_install_verified_when_speed_test_present() -> None:
data = load_request("docs/nh-broadband-install-request.example.yaml")
data["speed_test"] = {
"tested_at": "2026-05-01T18:30:00Z",
"download_mbps": 942.6,
"upload_mbps": 881.4,
"provider": "fast.com",
}
packet = build_packet(data)
assert packet["status"] == "post_install_verified"
def test_validate_rejects_missing_contact_name() -> None:
data = {
"contact": {"name": "", "phone": "555"},
@@ -120,11 +86,6 @@ def test_render_markdown_contains_key_sections() -> None:
assert "# NH Broadband Install Packet" in md
assert "## Contact" in md
assert "## Service Address" in md
assert "## Availability" in md
assert "## Pricing + Plan Recommendation" in md
assert "## Installation Appointment" in md
assert "## Installer Access Notes" in md
assert "## Payment" in md
assert "## Call Log" in md
assert "## Appointment Checklist" in md
assert "Concord" in md
@@ -136,8 +97,6 @@ def test_render_markdown_shows_checklist_items() -> None:
packet = build_packet(data)
md = render_markdown(packet, data)
assert "- [ ] Confirm exact-address availability" in md
assert "Installer can reach cabin" in md
assert "- **Confirmation #: NHB-2026-0417**" in md
def test_example_yaml_is_valid() -> None: