Add a comprehensive testing document covering all acceptance criteria for issue #444 (Wire Deadman Switch ACTION). Includes AI-, test plan, troubleshooting, and verification steps for operators. Refs: #444
113 lines
4.7 KiB
Markdown
# Deadman Switch — Test & Verification Procedure

**Issue:** #444 — Wire Deadman Switch ACTION (Snapshot + Rollback + Restart)
**Last updated:** 2026-04-30 (STEP35 burn contribution)

This document describes how to verify that the deadman switch is operational
end-to-end on the wizards fleet. All tests assume Ansible deployment has been
run (`ansible-playbook -i ansible/inventory/hosts.yml ansible/playbooks/deadman_switch.yml`).

## Architecture (post #444)

- `deadman_action.sh`: deployed to `{{ wizard_home }}/deadman_action.sh` by Ansible.
- Scheduling: Ansible `cron_manager` role installs a cron entry `*/5 * * * *` that runs
  `deadman_action.sh` and logs to `{{ timmy_log_dir }}/deadman-<wizard>.log`.
- No systemd timer, no launchd plist, no separate `deadman-switch.sh` watch: single
  implementation via universal cron (cron_manager).

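The scheduling rule above can be spot-checked mechanically. The following is a minimal sketch, not part of the repository: `has_deadman_cron` is a hypothetical helper that reads crontab text on stdin and succeeds only when an uncommented `*/5 * * * *` entry invokes `deadman_action.sh`. It assumes the schedule is written literally as `*/5 * * * *`, as described here; the exact line emitted by `cron_manager` may differ.

```bash
#!/bin/sh
# Hypothetical helper (name is an assumption, not from the repo).
# Reads crontab text on stdin; exit status 0 means an active (uncommented)
# entry runs deadman_action.sh on the */5 * * * * schedule.
has_deadman_cron() {
    grep -Eq '^[[:space:]]*\*/5([[:space:]]+\*){4}[[:space:]]+.*deadman_action\.sh'
}

# Usage on a wizard host:
#   crontab -l | has_deadman_cron && echo "deadman cron present"
```

Anchoring the pattern at the start of the line means a commented-out entry does not count as present.
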
## Acceptance Criteria Test Plan

### 1. Health-check success → config snapshot saved

**Test:**
1. Ensure wizard agent is healthy.
2. Manually run: `{{ wizard_home }}/deadman_action.sh`
3. Verify: `{{ deadman_snapshot_dir }}/config.yaml.known_good` exists and matches current config.

Expected log output:
```
[timestamp] [deadman] [<WIZARD>] Health check starting...
[timestamp] [deadman] [<WIZARD>] HEALTHY — snapshotting config.
[timestamp] [deadman] [<WIZARD>] Config snapshot saved.
[timestamp] [deadman] [<WIZARD>] Health check complete.
```

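The "exists and matches current config" check in step 3 can be done with a byte-for-byte comparison. The sketch below assumes the snapshot is a plain copy of the config file; `snapshot_matches` is a hypothetical helper name, and the two paths are passed as arguments in place of the rendered Ansible variables.

```bash
#!/bin/sh
# Hypothetical helper (not from the repo): compares the live config with the
# known-good snapshot. Assumes the snapshot is an unmodified copy of the file.
# Returns 0 if identical, 1 if they differ, 2 if the snapshot is missing.
snapshot_matches() {
    config=$1
    snapshot=$2
    if [ ! -f "$snapshot" ]; then
        echo "missing snapshot: $snapshot" >&2
        return 2
    fi
    # cmp -s is silent and reports the result through its exit status.
    cmp -s "$config" "$snapshot"
}

# Usage, with the rendered values of wizard_home and deadman_snapshot_dir:
#   snapshot_matches "$WIZARD_HOME/config.yaml" "$SNAPSHOT_DIR/config.yaml.known_good"
```
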
### 2. Health-check failure → config rolled back + agent restarted + rollback event logged

**Test:**
1. Corrupt agent config to trigger failure:
   ```bash
   echo 'provider: anthropic' >> {{ wizard_home }}/config.yaml
   ```
2. Run `{{ wizard_home }}/deadman_action.sh`.
3. Verify:
   - `grep -q anthropic {{ wizard_home }}/config.yaml` → **false** (removed)
   - `systemctl status hermes-{{ wizard_name | lower }}` shows active (or recent restart)
   - Log contains `Rollback event: agent=... old_hash=... new_hash=...`
4. Optionally check telemetry:
   ```bash
   sqlite3 {{ request_log_path }} "SELECT status,error_message FROM request_log WHERE endpoint='health_check' ORDER BY timestamp DESC LIMIT 1;"
   ```

Expected log snippet:
```
[timestamp] [deadman] [<WIZARD>] FAIL: Config contains banned provider...
[timestamp] [deadman] [<WIZARD>] UNHEALTHY — initiating recovery.
[timestamp] [deadman] [<WIZARD>] Rolling back config to last known good...
[timestamp] [deadman] [<WIZARD>] Config rolled back.
[timestamp] [deadman] [<WIZARD>] Rollback event: agent=<WIZARD> old_hash=<sha> new_hash=<sha>
[timestamp] [deadman] [<WIZARD>] Restarting hermes-<wizard>...
[timestamp] [deadman] [<WIZARD>] Agent restarted via systemd.
```

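The file-based checks from step 3 can be combined into one pass. This is a sketch under stated assumptions: `verify_rollback` is a hypothetical helper, the injected marker is the literal string `anthropic` used in step 1, and the log line follows the `Rollback event: agent=... old_hash=... new_hash=...` shape shown in the expected snippet. The `systemctl status` check is left out because it needs a live host.

```bash
#!/bin/sh
# Hypothetical helper (not from the repo): checks the two file-based
# conditions after a forced failure. Arguments: config path, deadman log path.
verify_rollback() {
    config=$1
    log=$2
    # The injected line must be gone once the config has been rolled back.
    if grep -q 'anthropic' "$config"; then
        echo "FAIL: injected provider still present in $config" >&2
        return 1
    fi
    # The log must carry a rollback event with agent and both hashes.
    if ! grep -Eq 'Rollback event: agent=[^ ]+ old_hash=[^ ]+ new_hash=[^ ]+' "$log"; then
        echo "FAIL: no rollback event found in $log" >&2
        return 1
    fi
    echo "OK: config restored and rollback event logged"
}
```

Run it right after step 2, passing the rendered config path and the wizard's deadman log.
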
### 3. Simulate full cascade-failure death → verify rollback+restart

- Stop agent: `systemctl stop hermes-<wizard>` or kill process.
- Modify config to inject banned provider.
- Run `deadman_action.sh` manually (or wait for cron).
- Verify that config is rolled back and agent is restarted.

### 4. Snapshot stored in predictable per-agent location

**Check on each wizard:**
```bash
ls -la {{ deadman_snapshot_dir }}/
# Expected: config.yaml.known_good + config.yaml.<timestamp> (rolling, max {{ deadman_max_snapshots }})
```

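The expected directory layout can also be checked without reading `ls` output by eye. The sketch below is a hypothetical helper, `check_snapshots`, which confirms that `config.yaml.known_good` exists and counts the other `config.yaml.*` files against a limit. It assumes the limit applies only to the timestamped copies and not to `known_good`; the document does not say which, so adjust the comparison if the script counts differently.

```bash
#!/bin/sh
# Hypothetical helper (not from the repo). Arguments: snapshot dir, max count.
# Prints the number of rolling snapshots; fails if known_good is missing or
# the rolling count exceeds the limit (assumed to exclude known_good).
check_snapshots() {
    dir=$1
    max=$2
    if [ ! -f "$dir/config.yaml.known_good" ]; then
        echo "FAIL: $dir/config.yaml.known_good is missing" >&2
        return 1
    fi
    count=0
    for f in "$dir"/config.yaml.*; do
        [ -e "$f" ] || continue          # glob matched nothing
        case $f in
            */config.yaml.known_good) ;; # not a rolling snapshot
            *) count=$((count + 1)) ;;
        esac
    done
    if [ "$count" -gt "$max" ]; then
        echo "FAIL: $count rolling snapshots, limit is $max" >&2
        return 1
    fi
    echo "$count"
}
```
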
### 5. Works with Ansible-deployed cron schedule

Run:
```bash
ansible-playbook -i ansible/inventory/hosts.yml ansible/playbooks/deadman_switch.yml
```
Verify:
- `{{ wizard_home }}/deadman_action.sh` exists, mode 0755
- `{{ deadman_snapshot_dir }}` exists
- No systemd timer named `deadman-<wizard>.timer` exists (per one-impl rule)
- Cron entry present: `crontab -l | grep deadman`

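The first two items in the list above are plain filesystem checks and can be scripted. This is a rough sketch with a hypothetical helper name, `check_deploy`; it takes the rendered script path and snapshot directory as arguments and uses `find -perm 755` to require an exact mode match.

```bash
#!/bin/sh
# Hypothetical helper (not from the repo). Arguments: path to the deployed
# deadman_action.sh, path to the snapshot directory.
check_deploy() {
    script=$1
    snapdir=$2
    # find prints the path only when it is a regular file with mode 0755.
    if [ -z "$(find "$script" -prune -type f -perm 755 2>/dev/null)" ]; then
        echo "FAIL: $script missing or not mode 0755" >&2
        return 1
    fi
    if [ ! -d "$snapdir" ]; then
        echo "FAIL: snapshot directory $snapdir missing" >&2
        return 1
    fi
    echo "OK: script and snapshot directory deployed"
}
```

The timer and cron items still need `systemctl` and `crontab` on the host itself.
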
### 6. One implementation only: all overlaps killed

- **Deleted** `bin/deadman-switch.sh`: central watch removed.
- **Deleted** `bin/deadman-fallback.py`: provider fallback not part of deadman recovery.
- No systemd timer deployed (deadman_switch role no longer deploys service/timer).
- Scheduling handled exclusively by `cron_manager`'s universal cron (5 min interval).

---

## Troubleshooting

| Symptom | Likely cause | Check |
|---------|--------------|-------|
| No snapshot created | Config unreadable or permissions wrong | `ls -l {{ wizard_home }}/config.yaml`; run as root |
| Rollback doesn't restore | No snapshot file exists | `ls -l {{ deadman_snapshot_dir }}/config.yaml.known_good` |
| Agent not restarting | systemd not available or service name mismatch | `systemctl status hermes-{{ wizard_name \| lower }}` |
| SSH watch never fires | Cron not running or Gitea token missing | `service cron status`; check `~/.hermes/gitea_token_vps` |

## Notes

This follows the KT Bezalel Architecture Session (2026-04-08) design. The deadman
switch now closes the loop from detection to recovery automatically on a 5-minute
cadence. See issue #444 for full acceptance criteria and design rules.