[CRITICAL] Wire Deadman Switch ACTION — Snapshot + Rollback + Restart #444

Open
opened 2026-04-09 22:17:26 +00:00 by perplexity · 0 comments

Source

KT Bezalel Architecture Session 2026-04-08 — Immediate Priority #1
KT Final Session — Ansible section references this

Problem

The deadman watch is already firing and detecting dead agents. It has no action wired. It detects death but does nothing about it.

Required Behavior

On successful health check:

  • Snapshot current config as "last known good"
  • Store snapshot in a known location (e.g., )

On failed health check:

  • Reset config to last known good snapshot from source control
  • Restart the agent
  • Log the rollback event to request_log

Design Rules

  • Config lives in source control as the canonical version
  • Agents CAN mutate config at runtime (live reload is desired)
  • But if an agent dies, the recovery path is: rollback config → restart from known good
  • This is the single highest priority item per Alexander

Acceptance Criteria

  • Health check success → config snapshot saved
  • Health check failure → config rolled back to last known good
  • Health check failure → agent process restarted
  • Rollback event logged (timestamp, agent, old config hash, new config hash)
  • Snapshot stored in predictable location per agent
  • Works with the Ansible-deployed cron schedule
  • Kill all existing overlapping deadman switches — one implementation only
  • Tested: simulate agent death → verify rollback + restart occurs
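For the cron criterion, the check would plausibly be deployed as an Ansible task along these lines. The task name, interval, script path, and `agent_names` variable are all illustrative; the real schedule belongs to the P2 playbook.

```yaml
# Hypothetical sketch of the Ansible-deployed deadman cron entry (P2 playbook).
- name: Schedule deadman check for each agent
  ansible.builtin.cron:
    name: "deadman-{{ item }}"
    minute: "*/5"                                        # assumed interval
    job: "/usr/local/bin/deadman_check --agent {{ item }}"  # assumed script path
    user: root
  loop: "{{ agent_names }}"
```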

Architecture Context

The cascade failure that killed the fleet: agent tries MiMo V2 Pro → provider returns error → instead of skipping, agent corrupts its own config → deadman detects death → no action → agent stays dead. This fix breaks that cycle.
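The "instead of skipping" step is the root of the cascade. The skip behavior the agent should have had can be sketched as a plain fallback loop that never touches config; the exception names and model IDs here are hypothetical.

```python
"""Sketch: on provider error, skip to the next model instead of mutating config."""


class ProviderError(Exception):
    """Raised when a provider returns an error for a model call (illustrative)."""


class AllModelsFailed(Exception):
    """Raised when every model in the fallback chain failed (illustrative)."""


def call_with_fallback(models, call):
    """Try each model in order; a provider error skips to the next model.

    Crucially, failure here is transient and local -- it never writes
    anything to the agent's config, so there is nothing to corrupt.
    """
    for model in models:
        try:
            return call(model)
        except ProviderError:
            continue  # skip the failing provider; do not touch config
    raise AllModelsFailed(f"all models failed: {models}")
```

With this pattern, a MiMo V2 Pro provider error degrades to the next model; the deadman rollback then only has to cover genuinely dead agents.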

Dependencies

  • Must be wired BEFORE resurrecting wizards
  • Feeds into Ansible playbook (P2)
  • Works with thin config pattern (P4)
perplexity added this to the KT-2026-04-08: Infrastructure Stabilization milestone 2026-04-09 22:17:26 +00:00
bezalel was assigned by Timmy 2026-04-09 23:31:55 +00:00

Reference: Timmy_Foundation/timmy-config#444