# Deadman Switch — Test & Verification Procedure

Issue: #444 — Wire Deadman Switch ACTION (Snapshot + Rollback + Restart)
Last updated: 2026-04-30 (STEP35 burn contribution)
This document describes how to verify that the deadman switch is operational
end-to-end on the wizards fleet. All tests assume the Ansible deployment has
been run:

```bash
ansible-playbook -i ansible/inventory/hosts.yml ansible/playbooks/deadman_switch.yml
```
## Architecture (post #444)

- `deadman_action.sh` — deployed to `{{ wizard_home }}/deadman_action.sh` by Ansible.
- Scheduling: the Ansible `cron_manager` role installs a cron entry (`*/5 * * * *`) that runs `deadman_action.sh` and logs to `{{ timmy_log_dir }}/deadman-<wizard>.log`.
- No systemd timer, no launchd plist, no separate `deadman-switch.sh` watch — a single implementation via universal cron (`cron_manager`).
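For reference, the installed crontab entry would look roughly like the line below. This is illustrative only — the exact redirection, log filename, and cron user are assumptions; `crontab -l` on a deployed wizard shows the authoritative entry.

```
*/5 * * * * {{ wizard_home }}/deadman_action.sh >> {{ timmy_log_dir }}/deadman-<wizard>.log 2>&1
```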
## Acceptance Criteria Test Plan
### 1. Health-check success → config snapshot saved

Test:

- Ensure the wizard agent is healthy.
- Manually run `{{ wizard_home }}/deadman_action.sh`.
- Verify that `{{ deadman_snapshot_dir }}/config.yaml.known_good` exists and matches the current config.

Expected log output:

```
[timestamp] [deadman] [<WIZARD>] Health check starting...
[timestamp] [deadman] [<WIZARD>] HEALTHY — snapshotting config.
[timestamp] [deadman] [<WIZARD>] Config snapshot saved.
[timestamp] [deadman] [<WIZARD>] Health check complete.
```
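The healthy path can be sketched as follows. This is a minimal sketch, not the deployed script: the temp-dir setup and demo config exist only so it runs standalone, and all variable names stand in for the real templated paths (`{{ wizard_home }}`, `{{ deadman_snapshot_dir }}`).

```sh
#!/bin/sh
# Sketch of the snapshot step (hypothetical layout; deadman_action.sh
# as deployed by Ansible is the source of truth).
set -eu

WIZARD_HOME="$(mktemp -d)"          # stands in for {{ wizard_home }}
SNAP_DIR="$WIZARD_HOME/snapshots"   # stands in for {{ deadman_snapshot_dir }}
CONFIG="$WIZARD_HOME/config.yaml"

printf 'provider: approved\n' > "$CONFIG"   # demo config so the sketch runs standalone

mkdir -p "$SNAP_DIR"
# Healthy check: preserve the current config as the known-good copy,
# plus a timestamped copy for the rolling history.
cp "$CONFIG" "$SNAP_DIR/config.yaml.known_good"
cp "$CONFIG" "$SNAP_DIR/config.yaml.$(date +%Y%m%d%H%M%S)"
echo "Config snapshot saved."
```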
### 2. Health-check failure → config rolled back + agent restarted + rollback event logged

Test:

- Corrupt the agent config to trigger a failure:
  `echo 'provider: anthropic' >> {{ wizard_home }}/config.yaml`
- Run `{{ wizard_home }}/deadman_action.sh`.
- Verify:
  - `grep -q anthropic {{ wizard_home }}/config.yaml` → false (the banned line was removed)
  - `systemctl status hermes-{{ wizard_name | lower }}` shows active (or a recent restart)
  - the log contains `Rollback event: agent=... old_hash=... new_hash=...`
- Optionally check telemetry:
  `sqlite3 {{ request_log_path }} "SELECT status,error_message FROM request_log WHERE endpoint='health_check' ORDER BY timestamp DESC LIMIT 1;"`

Expected log snippet:

```
[timestamp] [deadman] [<WIZARD>] FAIL: Config contains banned provider...
[timestamp] [deadman] [<WIZARD>] UNHEALTHY — initiating recovery.
[timestamp] [deadman] [<WIZARD>] Rolling back config to last known good...
[timestamp] [deadman] [<WIZARD>] Config rolled back.
[timestamp] [deadman] [<WIZARD>] Rollback event: agent=<WIZARD> old_hash=<sha> new_hash=<sha>
[timestamp] [deadman] [<WIZARD>] Restarting hermes-<wizard>...
[timestamp] [deadman] [<WIZARD>] Agent restarted via systemd.
```
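The recovery path above can be sketched like this. All names are assumptions for illustration; the deployed script is authoritative, and the corruption/setup lines exist only so the sketch runs standalone.

```sh
#!/bin/sh
# Sketch of the unhealthy path: detect a banned provider, roll back to the
# known-good snapshot, and log old/new config hashes. Hypothetical layout.
set -eu

WIZARD_HOME="$(mktemp -d)"          # stands in for {{ wizard_home }}
SNAP_DIR="$WIZARD_HOME/snapshots"; mkdir -p "$SNAP_DIR"
CONFIG="$WIZARD_HOME/config.yaml"

printf 'provider: approved\n' > "$CONFIG"
cp "$CONFIG" "$SNAP_DIR/config.yaml.known_good"   # last known good
printf 'provider: anthropic\n' >> "$CONFIG"       # simulate the corruption step

if grep -q 'provider: anthropic' "$CONFIG"; then
    old_hash=$(sha256sum "$CONFIG" | cut -d' ' -f1)
    cp "$SNAP_DIR/config.yaml.known_good" "$CONFIG"   # roll back
    new_hash=$(sha256sum "$CONFIG" | cut -d' ' -f1)
    echo "Rollback event: agent=DEMO old_hash=$old_hash new_hash=$new_hash"
    # A real run would now restart the agent, e.g.:
    # systemctl restart "hermes-<wizard>"
fi
```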
### 3. Simulate full cascade-failure death → verify rollback + restart

- Stop the agent: `systemctl stop hermes-<wizard>`, or kill the process.
- Modify the config to inject a banned provider.
- Run `deadman_action.sh` manually (or wait for cron).
- Verify that the config is rolled back and the agent is restarted.
### 4. Snapshot stored in predictable per-agent location

Check on each wizard:

```bash
ls -la {{ deadman_snapshot_dir }}/
# Expected: config.yaml.known_good + config.yaml.<timestamp> (rolling, max {{ deadman_max_snapshots }})
```
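The rolling cap implied by `{{ deadman_max_snapshots }}` can be sketched as a prune of the oldest timestamped copies. This is a sketch under assumed naming (`config.yaml.<YYYYMMDDHHMMSS>`), not the deployed pruning logic; the demo files exist only so it runs standalone.

```sh
#!/bin/sh
# Sketch: keep config.yaml.known_good plus the newest MAX_SNAPSHOTS
# timestamped snapshots; delete older timestamped copies.
set -eu

SNAP_DIR="$(mktemp -d)"   # stands in for {{ deadman_snapshot_dir }}
MAX_SNAPSHOTS=3           # stands in for {{ deadman_max_snapshots }}

# Demo content: a known-good copy and five timestamped snapshots.
: > "$SNAP_DIR/config.yaml.known_good"
for ts in 20260430010000 20260430020000 20260430030000 20260430040000 20260430050000; do
    : > "$SNAP_DIR/config.yaml.$ts"
done

# Newest first; everything past the first $MAX_SNAPSHOTS entries is pruned.
ls -1 "$SNAP_DIR" | grep '^config\.yaml\.[0-9]*$' | sort -r \
    | tail -n +$((MAX_SNAPSHOTS + 1)) \
    | while read -r f; do rm "$SNAP_DIR/$f"; done
```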
### 5. Works with Ansible-deployed cron schedule

Run:

```bash
ansible-playbook -i ansible/inventory/hosts.yml ansible/playbooks/deadman_switch.yml
```

Verify:

- `{{ wizard_home }}/deadman_action.sh` exists, mode 0755
- `{{ deadman_snapshot_dir }}` exists
- No systemd timer named `deadman-<wizard>.timer` exists (per the one-implementation rule)
- Cron entry present: `crontab -l | grep deadman`
### 6. One implementation only — all overlaps killed

- Deleted `bin/deadman-switch.sh` — the central watch is removed.
- Deleted `bin/deadman-fallback.py` — provider fallback is not part of deadman recovery.
- No systemd timer deployed (the deadman_switch role no longer deploys a service/timer).
- Scheduling is handled exclusively by `cron_manager`'s universal cron (5-minute interval).
## Troubleshooting

| Symptom | Likely cause | Check |
|---|---|---|
| No snapshot created | Config unreadable or permissions wrong | `ls -l {{ wizard_home }}/config.yaml`; run as root |
| Rollback doesn't restore | No snapshot file exists | `ls -l {{ deadman_snapshot_dir }}/config.yaml.known_good` |
| Agent not restarting | systemd not available or service name mismatch | `systemctl status hermes-{{ wizard_name \| lower }}` |
| SSH watch never fires | Cron not running or Gitea token missing | `service cron status`; check `~/.hermes/gitea_token_vps` |
## Notes

This follows the KT Bezalel Architecture Session (2026-04-08) design. The deadman switch now closes the loop from detection to recovery automatically on a 5-minute cadence. See issue #444 for the full acceptance criteria and design rules.