[RCA] Fleet Outage — Timmy Broke VPS Agents During Model Cutover (Apr 4, 2026 17:49 UTC) #393
Root Cause Analysis — Fleet Outage, April 4 2026
Incident Timeline (UTC)
- claude-opus-4-6 to kimi-k2.5, adds groq and grok as fallbacks
- 401 Missing Authentication header from Kimi API
- unknown provider 'groq', unknown provider 'grok'
- telegram.error.InvalidToken (pre-existing bad tokens, but crash made visible)
- claude-opus-4-6 + openrouter/nemotron-120b:free fallback
Total downtime: ~11 minutes
Root Causes
RC-1: Timmy deployed untested config to production (PRIMARY)
Timmy changed the primary model on all 7 VPS agents from claude-opus-4-6 to kimi-k2.5 without verifying that:
- KIMI_API_KEY on the VPS was valid (it returns 401)
- groq and grok providers exist on VPS Hermes (they don't; VPS runs v0.5.0)
This left every agent with ZERO working models: no primary, no fallbacks. Total fleet kill.
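The two missing checks in RC-1 could be combined into a small preflight step that runs before any config is pushed. The sketch below is illustrative only: `ModelConfig`, `preflight`, the provider names, and the `key_is_valid` callback are hypothetical and do not reflect the real Hermes config schema or provider registry. The credential probe is passed in as a function so the logic can be exercised without network access.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterable


@dataclass
class ModelConfig:
    """Hypothetical shape of an agent's model config (not the real Hermes schema)."""
    primary: str                                   # provider for the primary model
    fallbacks: list[str] = field(default_factory=list)


def preflight(
    config: ModelConfig,
    supported_providers: Iterable[str],
    key_is_valid: Callable[[str], bool],
) -> list[str]:
    """Return a list of problems; an empty list means the config looks safe to roll out."""
    supported = set(supported_providers)
    problems: list[str] = []
    usable = 0
    for provider in [config.primary, *config.fallbacks]:
        if provider not in supported:
            # The target host's version does not know this provider at all.
            problems.append(f"unknown provider '{provider}'")
        elif not key_is_valid(provider):
            # Provider exists, but the stored credential was rejected.
            problems.append(f"credential check failed for '{provider}'")
        else:
            usable += 1
    if usable == 0:
        problems.append("no working model: primary and all fallbacks failed")
    return problems


# Simulated replay of the incident config against an assumed v0.5.0 registry.
incident = ModelConfig(primary="kimi", fallbacks=["groq", "grok"])
for problem in preflight(
    incident,
    supported_providers={"kimi", "anthropic", "openrouter"},  # assumed names
    key_is_valid=lambda provider: provider != "kimi",         # simulates the 401
):
    print(problem)
```

In a real rollout, `key_is_valid` would issue one authenticated request per provider from the target host, and `supported_providers` would come from the Hermes version actually installed there, not from the operator's local machine.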
RC-2: Stale KIMI_API_KEY on VPS (CONTRIBUTING)
The KIMI_API_KEY stored in Allegro VPS .env files returns HTTP 401. Nobody has verified VPS API keys since initial deployment. There is no key rotation or health-check mechanism.
RC-3: VPS Hermes version mismatch (CONTRIBUTING)
VPS boxes run Hermes v0.5.0, which does not support groq or grok as named providers. Timmy assumed VPS Hermes had the same provider registry as the local Mac (latest version). No version check was performed before writing configs that reference those providers.
RC-4: Phantom agents with invalid Telegram tokens (PRE-EXISTING)
Ezra-B and Bezalel-B on Allegro VPS have never had valid Telegram tokens:
- *** (3 chars, literal placeholder)
- bsaobz...Tzp- (13 chars, garbage)
These agents were configured by prior burn-night deployments and never validated. They consume systemd slots and memory but cannot receive Telegram messages.
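Placeholder or truncated tokens like the two above could be caught at deploy time with a cheap offline shape check. The following is a minimal sketch under stated assumptions: the regular expression encodes an assumed token shape (numeric bot ID, a colon, then a long URL-safe secret), the minimum secret length of 30 is deliberately loose, and the agent names and token values in the example are made up.

```python
import re

# Assumed shape of a Telegram bot token: digits, a colon, then a secret made
# of URL-safe characters. The length floor is an assumption, not a spec.
TOKEN_SHAPE = re.compile(r"\d+:[A-Za-z0-9_-]{30,}")


def looks_like_bot_token(token: str) -> bool:
    """Offline sanity check only; it cannot tell whether a token is live."""
    return TOKEN_SHAPE.fullmatch(token.strip()) is not None


def find_suspect_tokens(tokens_by_agent: dict[str, str]) -> list[str]:
    """Return the sorted names of agents whose token fails the shape check."""
    return sorted(
        agent
        for agent, token in tokens_by_agent.items()
        if not looks_like_bot_token(token)
    )


# Hypothetical inventory: a literal placeholder, a 13-character fragment,
# and one well-formed (but fake) token.
inventory = {
    "agent-a": "***",
    "agent-b": "abcdefghi-jk-",
    "agent-c": "123456789:" + "A" * 35,
}
print(find_suspect_tokens(inventory))
```

A token that passes this check may still be revoked or mistyped. The authoritative test is a call to the Bot API `getMe` method with the token, which this sketch leaves out so that it stays runnable offline.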
Who Dropped the Ball
[...] groq/grok providers. Inexcusable.
[...] ***). Never went back to fix them. Created phantom agents that waste resources.
Impact
Lessons
[...] curl would have caught this.
🏷️ Automated Triage Check
Timestamp: 2026-04-05T00:00:07.782326
Agent: Allegro Heartbeat
This issue has been identified as needing triage:
Checklist
Context
Automated triage from Allegro 15-minute heartbeat