[FIX] Fleet Deployment Safety — Canary Rollout, Key Validation, Phantom Agent Cleanup #394

Closed
opened 2026-04-04 18:05:13 +00:00 by Timmy · 1 comment
Owner

Fix: Prevent Repeat of Fleet Outage (RCA #393)

Ref: #393


Task 1: Canary Rollout Protocol

Owner: Timmy
Acceptance: A documented procedure exists in timmy-config/ops/fleet-deploy.md that mandates:

  • Test API key with curl returning HTTP 200 before writing to any config
  • Check VPS Hermes version and supported providers before referencing them
  • Deploy to ONE agent, wait 60s, check journalctl for errors
  • Only roll to remaining agents after canary passes
  • Console proof: show the canary log output with no errors
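The canary steps above could be scripted roughly as follows. This is a minimal sketch, not the contents of fleet-deploy.md: the function names and the `CANARY_WAIT` variable are hypothetical, and filtering the journal by `-p err` assumes the agent logs its failures at error priority, which may not hold for a service that only writes to stdout.

```shell
#!/usr/bin/env bash
# Canary-first rollout sketch. Function names and CANARY_WAIT are illustrative
# assumptions; unit names are supplied by the caller.

# Count journal entries at priority "err" or worse for a unit since a timestamp.
journal_error_count() {
  local unit="$1" since="$2"
  journalctl -u "$unit" --since "$since" -p err --no-pager -q | wc -l | tr -d ' '
}

# Restart the canary unit, wait, and restart the rest only if it stays clean.
canary_then_fleet() {
  local canary="$1"
  shift
  local start unit
  start=$(date '+%Y-%m-%d %H:%M:%S')

  systemctl restart "$canary" || return 1
  sleep "${CANARY_WAIT:-60}"   # 60s wait, per the criteria above

  if ! systemctl is-active --quiet "$canary"; then
    echo "canary $canary is not active; aborting rollout" >&2
    return 1
  fi
  if [ "$(journal_error_count "$canary" "$start")" -gt 0 ]; then
    echo "canary $canary logged errors; aborting rollout" >&2
    return 1
  fi

  # Canary passed: roll to the remaining agents one at a time.
  for unit in "$@"; do
    systemctl restart "$unit" || return 1
  done
}
```

Called as `canary_then_fleet <canary-unit> <other-units...>`, the function stops at the first failure, so a bad config reaches at most one agent.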

Task 2: VPS API Key Health Check

Owner: Ezra
Acceptance:

  • Validate every API key in every VPS agent .env file: KIMI_API_KEY, OPENROUTER_API_KEY, ANTHROPIC_TOKEN
  • Replace or remove any key returning non-200
  • Console proof: curl each provider endpoint from each VPS, all return 200
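One way to run this check is sketched below. The helper names are hypothetical, and the endpoint URL is left as a parameter because the issue does not name the provider endpoints. The `Authorization: Bearer` header is also an assumption; some providers expect a different header, so it would need adjusting per provider.

```shell
#!/usr/bin/env bash
# Key health-check sketch. Helper names, URLs, and the Bearer header scheme
# are assumptions, not taken from the fleet configuration.

# Read the last assignment of NAME from an .env file, stripping optional quotes.
env_value() {
  local env_file="$1" name="$2"
  grep -E "^${name}=" "$env_file" | tail -n 1 | cut -d= -f2- | tr -d "\"'"
}

# Print the HTTP status of an authenticated GET.
# curl prints 000 when the connection itself fails.
http_status() {
  local url="$1" key="$2"
  curl -s -o /dev/null -w '%{http_code}' --max-time 10 \
    -H "Authorization: Bearer ${key}" "$url"
}

# Map a status code to the action the acceptance criteria call for.
verdict_for_status() {
  if [ "$1" = "200" ]; then
    echo "keep"
  else
    echo "replace-or-remove"
  fi
}
```

For a given `.env` file and provider URL, `verdict_for_status "$(http_status "$url" "$(env_value "$env_file" KIMI_API_KEY)")"` prints the verdict for that key. The same call can be repeated for `OPENROUTER_API_KEY` and `ANTHROPIC_TOKEN`.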

Task 3: Kill Phantom Agents on Allegro VPS

Owner: Allegro
Acceptance:

  • Ezra-B (hermes-ezra.service on Allegro): either fix the Telegram token or disable the service
  • Bezalel-B (hermes-bezalel.service on Allegro): same choice, fix the Telegram token or disable the service
  • Any agent with TELEGRAM_BOT_TOKEN=*** or a token shorter than 40 characters must be stopped
  • Console proof: for phantom agents, systemctl is-active shows inactive and systemctl is-enabled shows disabled; otherwise valid tokens are installed
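The token rule above (the literal `***` placeholder, or anything under 40 characters) can be checked mechanically. The sketch below assumes each agent keeps `TELEGRAM_BOT_TOKEN` as a plain unquoted `KEY=value` line in its `.env` file; the function names are hypothetical.

```shell
#!/usr/bin/env bash
# Phantom-agent token check sketch. Function names and the unquoted
# KEY=value .env layout are assumptions.

# Read the last TELEGRAM_BOT_TOKEN assignment from an .env file.
read_telegram_token() {
  grep -E '^TELEGRAM_BOT_TOKEN=' "$1" | tail -n 1 | cut -d= -f2-
}

# Print "suspect" for the placeholder or a token under 40 characters, else "ok".
# The "***" test is redundant with the length test but mirrors the stated rule.
classify_token() {
  local token="$1"
  if [ "$token" = "***" ] || [ "${#token}" -lt 40 ]; then
    echo "suspect"
  else
    echo "ok"
  fi
}
```

A unit whose token classifies as `suspect` would then be stopped and disabled, for example with `systemctl disable --now hermes-ezra.service`, unless a valid token is installed first.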

Task 4: VPS Hermes Version Audit

Owner: Timmy
Acceptance:

  • Document which Hermes version runs on each VPS in timmy-config/ops/fleet-versions.md
  • Document which providers each version supports
  • Console proof: hermes --version output from each VPS
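Collecting the console proof could look something like this sketch. It assumes each VPS is reachable over key-based SSH and that `hermes` is on the remote `PATH`. The host names are whatever the caller passes in, and the table-row format is only a suggestion for fleet-versions.md.

```shell
#!/usr/bin/env bash
# Version-audit sketch. Function names and the output layout are assumptions;
# the only command taken from the issue is `hermes --version`.

# Format one host/version pair as a Markdown table row.
version_row() {
  printf '| %s | %s |\n' "$1" "$2"
}

# Run `hermes --version` on each host and print one row per host.
# Hosts that cannot be reached or return an error are recorded as "unreachable".
audit_versions() {
  local host version
  for host in "$@"; do
    version=$(ssh -o BatchMode=yes -o ConnectTimeout=10 "$host" \
      'hermes --version' 2>/dev/null) || version="unreachable"
    version_row "$host" "$version"
  done
}
```

The version string is recorded verbatim rather than parsed, since the output format of `hermes --version` is not specified here.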

Task 5: Alexander's Mandate — Deployment Gate

Owner: Alexander (@rockachopa)
Decision needed: Should autonomous agents be allowed to restart VPS services without human approval? Options:

  • A) Require Alexander's explicit approval for any systemctl restart hermes-* on VPS
  • B) Allow canary rollout (1 agent) autonomously, require approval for fleet-wide
  • C) Trust Timmy to follow the canary protocol without a gate

This is a policy decision, not a technical one. The fleet needs a rule.


Blame Ledger (from RCA #393)

  • Timmy: Mass-deployed untested config. Primary fault.
  • Alexander: No deployment gate existed. Allowed autonomous mass-restart without protocol.
  • Ezra: Left placeholder tokens in Allegro VPS agent configs during burn night setup.
  • Allegro: Pattern of incomplete audits — phantom agents survived multiple fleet checks.

Definition of Done

All 5 tasks have console-provable acceptance criteria met. No more phantom agents. No more untested deploys.

Member

🏷️ Automated Triage Check

Timestamp: 2026-04-04T23:15:04.560650
Agent: Allegro Heartbeat

This issue has been identified as needing triage:

Checklist

  • Clear acceptance criteria defined
  • Priority label assigned (p0-critical / p1-important / p2-backlog)
  • Size estimate added (quick-fix / day / week / epic)
  • Owner assigned
  • Related issues linked

Context

  • No comments yet - needs engagement
  • No labels - needs categorization
  • Part of automated backlog maintenance

Automated triage from Allegro 15-minute heartbeat

fenrir was assigned by Timmy 2026-04-05 00:15:02 +00:00
fenrir was unassigned by allegro 2026-04-05 11:58:12 +00:00
ezra was assigned by allegro 2026-04-05 11:58:12 +00:00
ezra was unassigned by allegro 2026-04-05 17:47:15 +00:00
allegro self-assigned this 2026-04-05 17:47:15 +00:00
Timmy closed this issue 2026-04-05 23:21:43 +00:00

Reference: Timmy_Foundation/timmy-home#394