
Deadman Switch — Test & Verification Procedure

Issue: #444 — Wire Deadman Switch ACTION (Snapshot + Rollback + Restart)
Last updated: 2026-04-30 (STEP35 burn contribution)

This document describes how to verify that the deadman switch is operational end-to-end on the wizards fleet. All tests assume the Ansible deployment has already been run (ansible-playbook -i ansible/inventory/hosts.yml ansible/playbooks/deadman_switch.yml).

Architecture (post #444)

  • deadman_action.sh — deployed to {{ wizard_home }}/deadman_action.sh by Ansible.
  • Scheduling: Ansible cron_manager role installs a cron entry */5 * * * * that runs deadman_action.sh and logs to {{ timmy_log_dir }}/deadman-<wizard>.log.
  • No systemd timer, no launchd plist, no separate deadman-switch.sh watch — single implementation via universal cron (cron_manager).
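
For reference, the installed crontab entry should look roughly like the line below. This is a sketch: the schedule, script path, and log path follow the cron_manager description above, but the exact output-redirection form is an assumption.

*/5 * * * * {{ wizard_home }}/deadman_action.sh >> {{ timmy_log_dir }}/deadman-<wizard>.log 2>&1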

Acceptance Criteria Test Plan

1. Health-check success → config snapshot saved

Test:

  1. Ensure the wizard agent is healthy.
  2. Manually run: {{ wizard_home }}/deadman_action.sh
  3. Verify: {{ deadman_snapshot_dir }}/config.yaml.known_good exists and matches the current config (a one-line check is sketched below).
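
A minimal sketch of that check, assuming the snapshot is a byte-for-byte copy of the live config:

# Silent diff means the snapshot exists and matches; any output or a non-zero exit means it does not.
diff {{ wizard_home }}/config.yaml {{ deadman_snapshot_dir }}/config.yaml.known_good \
  && echo "OK: snapshot matches current config"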

Expected log output:

[timestamp] [deadman] [<WIZARD>] Health check starting...
[timestamp] [deadman] [<WIZARD>] HEALTHY — snapshotting config.
[timestamp] [deadman] [<WIZARD>] Config snapshot saved.
[timestamp] [deadman] [<WIZARD>] Health check complete.

2. Health-check failure → config rolled back + agent restarted + rollback event logged

Test:

  1. Corrupt agent config to trigger failure:
    echo 'provider: anthropic' >> {{ wizard_home }}/config.yaml
    
  2. Run {{ wizard_home }}/deadman_action.sh.
  3. Verify:
    • grep -q anthropic {{ wizard_home }}/config.yaml returns false, i.e. exits non-zero (the injected line was removed)
    • systemctl status hermes-{{ wizard_name | lower }} shows active (or recent restart)
    • Log contains Rollback event: agent=... old_hash=... new_hash=... (a grep for it is sketched below)
  4. Optionally check telemetry:
    sqlite3 {{ request_log_path }} "SELECT status,error_message FROM request_log WHERE endpoint='health_check' ORDER BY timestamp DESC LIMIT 1;"
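
To pull the most recent rollback event out of the log (step 3 above), a grep like the following works; the log filename follows the cron_manager convention described earlier:

grep 'Rollback event' {{ timmy_log_dir }}/deadman-<wizard>.log | tail -n 1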
    

Expected log snippet:

[timestamp] [deadman] [<WIZARD>] FAIL: Config contains banned provider...
[timestamp] [deadman] [<WIZARD>] UNHEALTHY — initiating recovery.
[timestamp] [deadman] [<WIZARD>] Rolling back config to last known good...
[timestamp] [deadman] [<WIZARD>] Config rolled back.
[timestamp] [deadman] [<WIZARD>] Rollback event: agent=<WIZARD> old_hash=<sha> new_hash=<sha>
[timestamp] [deadman] [<WIZARD>] Restarting hermes-<wizard>...
[timestamp] [deadman] [<WIZARD>] Agent restarted via systemd.

3. Simulate full cascade-failure death → verify rollback+restart

  • Stop the agent: systemctl stop hermes-<wizard>, or kill the process.
  • Modify the config to inject a banned provider.
  • Run deadman_action.sh manually (or wait for cron).
  • Verify that the config is rolled back and the agent is restarted (a combined sketch follows).
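
The whole simulation can be run as one sequence. A sketch, reusing the service name and banned-provider line from the examples above:

# 1. Kill the agent and corrupt its config.
systemctl stop hermes-<wizard>
echo 'provider: anthropic' >> {{ wizard_home }}/config.yaml

# 2. Trigger the deadman action directly instead of waiting for cron.
{{ wizard_home }}/deadman_action.sh

# 3. Confirm recovery: injected line gone, agent active again.
grep -c anthropic {{ wizard_home }}/config.yaml   # expect 0
systemctl is-active hermes-<wizard>               # expect: active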

4. Snapshot stored in predictable per-agent location

Check on each wizard:

ls -la {{ deadman_snapshot_dir }}/
# Expected: config.yaml.known_good  +  config.yaml.<timestamp>  (rolling, max {{ deadman_max_snapshots }})
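
To check the retention cap as well (assuming the rolling snapshots all match the config.yaml.<timestamp> pattern shown above):

ls {{ deadman_snapshot_dir }}/config.yaml.* | wc -l
# Expect at most {{ deadman_max_snapshots }} timestamped copies plus config.yaml.known_good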

5. Works with Ansible-deployed cron schedule

Run:

ansible-playbook -i ansible/inventory/hosts.yml ansible/playbooks/deadman_switch.yml

Verify (a combined check is sketched after this list):

  • {{ wizard_home }}/deadman_action.sh exists, mode 0755
  • {{ deadman_snapshot_dir }} exists
  • No systemd timer named deadman-<wizard>.timer exists (per the one-implementation rule)
  • Cron entry present: crontab -l | grep deadman
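
A minimal script covering all four checks (the exact timer-name pattern passed to grep is an assumption):

set -e
test -x {{ wizard_home }}/deadman_action.sh   # deployed and executable (expected mode 0755)
test -d {{ deadman_snapshot_dir }}            # snapshot directory present
if systemctl list-timers --all | grep -q 'deadman-'; then
  echo "FAIL: a deadman systemd timer exists (violates the one-implementation rule)"; exit 1
fi
crontab -l | grep deadman                     # cron entry present
echo "All deadman deploy checks passed."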

6. One implementation only — all overlaps killed

  • Deleted bin/deadman-switch.sh — central watch removed.
  • Deleted bin/deadman-fallback.py — provider fallback not part of deadman recovery.
  • No systemd timer deployed (deadman_switch role no longer deploys service/timer).
  • Scheduling handled exclusively by cron_manager's universal cron (5 min interval).
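
A repo-side spot check that the removed implementations stay removed (paths are relative to the repository root):

# Both should report "gone"; anything else means an overlap crept back in.
for f in bin/deadman-switch.sh bin/deadman-fallback.py; do
  test -e "$f" && echo "$f: STILL PRESENT" || echo "$f: gone"
done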

Troubleshooting

| Symptom | Likely cause | Check |
| --- | --- | --- |
| No snapshot created | Config unreadable or permissions wrong | ls -l {{ wizard_home }}/config.yaml; run as root |
| Rollback doesn't restore | No snapshot file exists | ls -l {{ deadman_snapshot_dir }}/config.yaml.known_good |
| Agent not restarting | systemd not available or service name mismatch | systemctl status hermes-{{ wizard_name \| lower }} |
| SSH watch never fires | Cron not running or Gitea token missing | service cron status; check ~/.hermes/gitea_token_vps |

Notes

This follows the KT Bezalel Architecture Session (2026-04-08) design. The deadman switch now closes the loop from detection to recovery automatically on a 5-minute cadence. See issue #444 for full acceptance criteria and design rules.