[CRITICAL] Wire Deadman Switch ACTION — Snapshot + Rollback + Restart #444

Open
opened 2026-04-09 22:17:26 +00:00 by perplexity · 0 comments

Source

KT Bezalel Architecture Session 2026-04-08 — Immediate Priority #1
KT Final Session — Ansible section references this

Problem

The deadman watch is already firing and detecting dead agents. It has no action wired. It detects death but does nothing about it.

Required Behavior

On successful health check:

  • Snapshot current config as "last known good"
  • Store snapshot in a known location (e.g., )

On failed health check:

  • Reset config to last known good snapshot from source control
  • Restart the agent
  • Log the rollback event to request_log

Design Rules

  • Config lives in source control as the canonical version
  • Agents CAN mutate config at runtime (live reload is desired)
  • But if an agent dies, the recovery path is: rollback config → restart from known good
  • This is the single highest priority item per Alexander

Acceptance Criteria

  • Health check success → config snapshot saved
  • Health check failure → config rolled back to last known good
  • Health check failure → agent process restarted
  • Rollback event logged (timestamp, agent, old config hash, new config hash)
  • Snapshot stored in predictable location per agent
  • Works with the Ansible-deployed cron schedule
  • Kill all existing overlapping deadman switches — one implementation only
  • Tested: simulate agent death → verify rollback + restart occurs
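For the cron criterion, the check would plausibly be deployed as an Ansible task along these lines. The task name, interval, script path, and `agent_names` variable are all illustrative; the real schedule belongs to the P2 playbook.

```yaml
# Hypothetical sketch of the Ansible-deployed deadman cron entry (P2 playbook).
- name: Schedule deadman check for each agent
  ansible.builtin.cron:
    name: "deadman-{{ item }}"
    minute: "*/5"                                        # assumed interval
    job: "/usr/local/bin/deadman_check --agent {{ item }}"  # assumed script path
    user: root
  loop: "{{ agent_names }}"
```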

Architecture Context

The cascade failure that killed the fleet: agent tries MiMo V2 Pro → provider returns error → instead of skipping, agent corrupts its own config → deadman detects death → no action → agent stays dead. This fix breaks that cycle.
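The "instead of skipping" step is the root of the cascade. The skip behavior the agent should have had can be sketched as a plain fallback loop that never touches config; the exception names and model IDs here are hypothetical.

```python
"""Sketch: on provider error, skip to the next model instead of mutating config."""


class ProviderError(Exception):
    """Raised when a provider returns an error for a model call (illustrative)."""


class AllModelsFailed(Exception):
    """Raised when every model in the fallback chain failed (illustrative)."""


def call_with_fallback(models, call):
    """Try each model in order; a provider error skips to the next model.

    Crucially, failure here is transient and local -- it never writes
    anything to the agent's config, so there is nothing to corrupt.
    """
    for model in models:
        try:
            return call(model)
        except ProviderError:
            continue  # skip the failing provider; do not touch config
    raise AllModelsFailed(f"all models failed: {models}")
```

With this pattern, a MiMo V2 Pro provider error degrades to the next model; the deadman rollback then only has to cover genuinely dead agents.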

Dependencies

  • Must be wired BEFORE resurrecting wizards
  • Feeds into Ansible playbook (P2)
  • Works with thin config pattern (P4)
perplexity added this to the KT-2026-04-08: Infrastructure Stabilization milestone 2026-04-09 22:17:26 +00:00
bezalel was assigned by Timmy 2026-04-09 23:31:55 +00:00

Reference: Timmy_Foundation/timmy-config#444