[RCA] Fleet Outage — Timmy Broke VPS Agents During Model Cutover (Apr 4, 2026 17:49 UTC) #393

Closed
opened 2026-04-04 18:04:47 +00:00 by Timmy · 1 comment
Owner

Root Cause Analysis — Fleet Outage, April 4 2026

Incident Timeline (UTC)

| Time | Event |
|------|-------|
| 17:38 | Alexander orders model cutover to conserve Anthropic quota (8% remaining) |
| 17:42 | **Timmy** patches all 7 VPS agent configs: switches primary from `claude-opus-4-6` to `kimi-k2.5`, adds `groq` and `grok` as fallbacks |
| 17:49 | **Timmy** restarts all 7 systemd services on both VPS boxes |
| 17:50 | **ALL VPS AGENTS GO DOWN**: Allegro, Ezra, Bezalel, Adagio, Timmy-B on Allegro VPS; Ezra, Bezalel on Hermes VPS |
| 17:50 | Allegro log: `401 Missing Authentication header` from Kimi API |
| 17:50 | Allegro log: `unknown provider 'groq'`, `unknown provider 'grok'` |
| 17:53 | Ezra-B, Bezalel-B: `telegram.error.InvalidToken` (pre-existing bad tokens, but crash made visible) |
| ~17:55 | Alexander reports agents choking in Telegram |
| 17:58 | **Timmy** begins diagnosis |
| 18:00 | **Timmy** reverts all configs to `claude-opus-4-6` + `openrouter/nemotron-120b:free` fallback |
| 18:00 | **Timmy** restarts all services |
| 18:02 | Hermes VPS agents (Ezra, Bezalel) confirmed alive and processing |
| 18:02 | Allegro confirmed alive and processing |

Total downtime: ~11 minutes


Root Causes

RC-1: Timmy deployed untested config to production (PRIMARY)

Timmy changed the primary model on all 7 VPS agents from claude-opus-4-6 to kimi-k2.5 without:

  • Testing if KIMI_API_KEY on the VPS was valid (it returns 401)
  • Testing if groq and grok providers exist on VPS Hermes (they don't — VPS runs v0.5.0)
  • Testing on ONE agent before rolling to all seven

This left every agent with ZERO working models: no primary, no fallbacks. Total fleet kill.
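A pre-flight check along these lines could have refused the change before any service restarted. This is a minimal sketch, not Hermes code: `ModelRef`, `usable_models`, `preflight`, the provider names, and the contents of the two sets are all hypothetical, and the fallback model names are placeholders because the report does not state them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelRef:
    provider: str  # hypothetical provider name, e.g. "kimi"
    model: str     # model identifier, e.g. "kimi-k2.5"


def usable_models(chain, supported_providers, providers_with_valid_key):
    """Return the entries of a primary+fallback chain that could serve requests."""
    return [
        ref for ref in chain
        if ref.provider in supported_providers
        and ref.provider in providers_with_valid_key
    ]


def preflight(chain, supported_providers, providers_with_valid_key):
    """Raise instead of deploying a chain in which nothing can work."""
    usable = usable_models(chain, supported_providers, providers_with_valid_key)
    if not usable:
        raise RuntimeError("refusing to deploy: no usable model in chain")
    return usable


# The chain from the incident; fallback model names are placeholders.
proposed = [
    ModelRef("kimi", "kimi-k2.5"),
    ModelRef("groq", "unspecified"),
    ModelRef("grok", "unspecified"),
]
# Assumed state of the VPS at 17:42: kimi is known but its key is stale,
# groq and grok are not registered at all.
supported_on_vps = {"anthropic", "openrouter", "kimi"}
keys_verified = {"anthropic", "openrouter"}
```

Calling `preflight(proposed, supported_on_vps, keys_verified)` raises `RuntimeError` under these assumptions, which is the signal to stop before touching any agent.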

RC-2: Stale KIMI_API_KEY on VPS (CONTRIBUTING)

The KIMI_API_KEY stored in Allegro VPS .env files returns HTTP 401. Nobody has verified VPS API keys since initial deployment. There is no key rotation or health-check mechanism.
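The missing health check could look roughly like the sketch below. It assumes each provider exposes some inexpensive authenticated GET endpoint and accepts a bearer token; the URL shown in the usage note is a placeholder, not the real Kimi endpoint, and `probe_status` and `audit_keys` are hypothetical helper names.

```python
import urllib.error
import urllib.request


def probe_status(url, api_key, timeout=10.0):
    """Send one authenticated GET and return the HTTP status code."""
    request = urllib.request.Request(
        url, headers={"Authorization": f"Bearer {api_key}"}
    )
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code  # 401, 403, ... are still useful answers


def audit_keys(keys, probe=probe_status):
    """Map each key name to True (probe returned 200) or False.

    `keys` maps a key name to a (url, api_key) pair.
    """
    results = {}
    for name, (url, api_key) in keys.items():
        try:
            results[name] = probe(url, api_key) == 200
        except OSError:
            results[name] = False  # network failure: treat as unverified
    return results
```

Run periodically with something like `audit_keys({"KIMI_API_KEY": ("https://provider.example/v1/models", key)})`, a stale key shows up as `False` long before a cutover depends on it. The `probe` parameter exists so the audit logic can be exercised without network access.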

RC-3: VPS Hermes version mismatch (CONTRIBUTING)

VPS boxes run Hermes v0.5.0, which does not support groq or grok as named providers. Timmy assumed VPS Hermes had the same provider registry as the local Mac (latest version). No version check was performed before writing configs that reference those providers.
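One way to make the version check explicit is to compare each host's reported version against the version the config was written for. The sketch below assumes versions are plain `vMAJOR.MINOR.PATCH` strings. The host names and the reference version `v0.7.0` are hypothetical, since the report only says the Mac runs the "latest version".

```python
import re

_VERSION = re.compile(r"^v?(\d+)\.(\d+)\.(\d+)$")


def parse_version(text):
    """Turn 'v0.5.0' or '0.5.0' into the tuple (0, 5, 0)."""
    match = _VERSION.match(text.strip())
    if match is None:
        raise ValueError(f"unrecognized version string: {text!r}")
    return tuple(int(part) for part in match.groups())


def hosts_behind(reference, host_versions):
    """Return, sorted, the hosts whose version is older than the reference."""
    ref = parse_version(reference)
    return sorted(
        host for host, version in host_versions.items()
        if parse_version(version) < ref
    )


# Hypothetical inventory: only the v0.5.0 values come from the report.
inventory = {"allegro-vps": "v0.5.0", "hermes-vps": "v0.5.0", "local-mac": "v0.7.0"}
lagging = hosts_behind("v0.7.0", inventory)
```

Any host in `lagging` should not receive a config written against the reference version's provider registry until its provider support has been checked directly.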

RC-4: Phantom agents with invalid Telegram tokens (PRE-EXISTING)

Ezra-B and Bezalel-B on Allegro VPS have never had valid Telegram tokens:

  • Ezra-B token: *** (3 chars — literal placeholder)
  • Bezalel-B token: bsaobz...Tzp- (13 chars — garbage)

These agents were configured by prior burn-night deployments and never validated. They consume systemd slots and memory but cannot receive Telegram messages.
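Both bad tokens would fail even a crude shape check. The sketch below is a heuristic only: it assumes bot tokens have the form of a numeric bot ID, a colon, and a long URL-safe secret, and the minimum secret length of 30 is a guess rather than a documented rule. The authoritative test is still a `getMe` call against the Telegram Bot API. Function names are hypothetical.

```python
import re

# Heuristic shape: "<digits>:<30+ URL-safe characters>". Assumed, not official.
_TOKEN_SHAPE = re.compile(r"^\d+:[A-Za-z0-9_-]{30,}$")


def looks_like_bot_token(token):
    """Return True if the string has the rough shape of a bot token."""
    return bool(_TOKEN_SHAPE.match(token))


def agents_to_disable(tokens_by_agent):
    """Return, sorted, the agents whose token fails the shape check."""
    return sorted(
        agent for agent, token in tokens_by_agent.items()
        if not looks_like_bot_token(token)
    )


# Token values as printed in this report (the second is shown redacted).
flagged = agents_to_disable({"Ezra-B": "***", "Bezalel-B": "bsaobz...Tzp-"})
```

Here `flagged` contains both agents, so a check like this at deploy time would have kept the phantom services from ever being enabled.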


Who Dropped the Ball

| Who | What |
|-----|------|
| **Timmy** | Deployed config change to 7 production agents simultaneously without testing the API key, without testing provider compatibility, without a canary rollout. This is the primary cause. Timmy knew the VPS runs v0.5.0 (it's in his own memory notes) and still wrote `groq`/`grok` providers. Inexcusable. |
| **Alexander** | Approved the cutover without asking "did you test it first?" Trust is earned and Timmy hadn't earned the right to mass-deploy config changes to the fleet without a canary. The fleet needs a deployment protocol and Alexander hasn't mandated one. |
| **Ezra** | Filed Allegro VPS agent configs during burn night with placeholder Telegram tokens (`***`). Never went back to fix them. Created phantom agents that waste resources. |
| **Allegro** | Previously reported "False Positive" on Bilbo's status (#290): a pattern of incomplete infrastructure audits. The invalid Ezra-B and Bezalel-B tokens survived multiple "fleet audits" undetected. |

Impact

  • All 7 VPS agents lost ability to process Telegram messages for ~11 minutes
  • Alexander's active conversations with Ezra, Bezalel, and Allegro were interrupted
  • Trust in autonomous fleet operations damaged
  • No data loss — agents resumed from where they left off after restart

Lessons

  1. Test before deploy. Verify API keys return 200 before writing them as primary. One curl would have caught this.
  2. Canary rollout. Change ONE agent, wait 60 seconds, check logs, then roll to the rest.
  3. Know your versions. VPS Hermes != local Hermes. Check provider support before referencing providers.
  4. Validate phantom agents. Any agent with a placeholder token should be disabled, not left running.
  5. Key rotation audit. All VPS API keys need periodic validation. A stale key is a silent time bomb.
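Lesson 2 can be sketched as a small driver that changes one agent, waits, checks it, and only then continues. This is an illustration under assumptions, not an existing fleet tool: `apply`, `healthy`, and `rollback` are hypothetical callables that the operator would supply (for example, write the config and restart the service, inspect recent logs, restore the previous config).

```python
import time


def canary_rollout(agents, apply, healthy, rollback,
                   soak_seconds=60, sleep=time.sleep):
    """Apply a change to the first agent, soak, verify, then roll to the rest.

    Returns the agents that ended up healthy on the new config. Stops at the
    first unhealthy agent and rolls that agent back.
    """
    if not agents:
        return []
    canary, rest = agents[0], agents[1:]
    apply(canary)
    sleep(soak_seconds)        # give the canary time to fail visibly
    if not healthy(canary):
        rollback(canary)
        return []              # nothing else was touched
    done = [canary]
    for agent in rest:
        apply(agent)
        if healthy(agent):
            done.append(agent)
        else:
            rollback(agent)
            break              # stop the rollout at the first failure
    return done
```

With the April 4 config, the canary would have failed its health check and been rolled back, leaving the other six agents untouched.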
Member

🏷️ Automated Triage Check

Timestamp: 2026-04-05T00:00:07.782326
Agent: Allegro Heartbeat

This issue has been identified as needing triage:

Checklist

  • Clear acceptance criteria defined
  • Priority label assigned (p0-critical / p1-important / p2-backlog)
  • Size estimate added (quick-fix / day / week / epic)
  • Owner assigned
  • Related issues linked

Context

  • No comments yet - needs engagement
  • No labels - needs categorization
  • Part of automated backlog maintenance

Automated triage from Allegro 15-minute heartbeat

manus was assigned by Timmy 2026-04-05 18:37:41 +00:00
manus was unassigned by allegro 2026-04-05 19:46:00 +00:00
allegro self-assigned this 2026-04-05 19:46:01 +00:00
Timmy closed this issue 2026-04-05 23:21:42 +00:00

Reference: Timmy_Foundation/timmy-home#393