
Deadman Switch — Test & Verification Procedure

Issue: #444 — Wire Deadman Switch ACTION (Snapshot + Rollback + Restart)
Last updated: 2026-04-30 (STEP35 burn contribution)

This document describes how to verify that the deadman switch is operational end-to-end on the wizards fleet. All tests assume the Ansible deployment has already been run (ansible-playbook -i ansible/inventory/hosts.yml ansible/playbooks/deadman_switch.yml).

Architecture (post #444)

  • deadman_action.sh — deployed to {{ wizard_home }}/deadman_action.sh by Ansible.
  • Scheduling: Ansible cron_manager role installs a cron entry */5 * * * * that runs deadman_action.sh and logs to {{ timmy_log_dir }}/deadman-<wizard>.log.
  • No systemd timer, no launchd plist, no separate deadman-switch.sh watch — single implementation via universal cron (cron_manager).
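
For reference, the installed crontab entry should look roughly like the line below. This is a sketch: the schedule, script path, and log path follow the cron_manager description above, but the exact output-redirection form is an assumption.

*/5 * * * * {{ wizard_home }}/deadman_action.sh >> {{ timmy_log_dir }}/deadman-<wizard>.log 2>&1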

Acceptance Criteria Test Plan

1. Health-check success → config snapshot saved

Test:

  1. Ensure the wizard agent is healthy.
  2. Manually run: {{ wizard_home }}/deadman_action.sh
  3. Verify: {{ deadman_snapshot_dir }}/config.yaml.known_good exists and matches the current config (a one-line check is sketched below).
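
A minimal sketch of that check, assuming the snapshot is a byte-for-byte copy of the live config:

# Silent diff means the snapshot exists and matches; any output or a non-zero exit means it does not.
diff {{ wizard_home }}/config.yaml {{ deadman_snapshot_dir }}/config.yaml.known_good \
  && echo "OK: snapshot matches current config"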

Expected log output:

[timestamp] [deadman] [<WIZARD>] Health check starting...
[timestamp] [deadman] [<WIZARD>] HEALTHY — snapshotting config.
[timestamp] [deadman] [<WIZARD>] Config snapshot saved.
[timestamp] [deadman] [<WIZARD>] Health check complete.

2. Health-check failure → config rolled back + agent restarted + rollback event logged

Test:

  1. Corrupt agent config to trigger failure:
    echo 'provider: anthropic' >> {{ wizard_home }}/config.yaml
    
  2. Run {{ wizard_home }}/deadman_action.sh.
  3. Verify:
    • grep -q anthropic {{ wizard_home }}/config.yaml returns false, i.e. exits non-zero (the injected line was removed)
    • systemctl status hermes-{{ wizard_name | lower }} shows active (or recent restart)
    • Log contains Rollback event: agent=... old_hash=... new_hash=... (a grep for it is sketched below)
  4. Optionally check telemetry:
    sqlite3 {{ request_log_path }} "SELECT status,error_message FROM request_log WHERE endpoint='health_check' ORDER BY timestamp DESC LIMIT 1;"
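
To pull the most recent rollback event out of the log (step 3 above), a grep like the following works; the log filename follows the cron_manager convention described earlier:

grep 'Rollback event' {{ timmy_log_dir }}/deadman-<wizard>.log | tail -n 1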
    

Expected log snippet:

[timestamp] [deadman] [<WIZARD>] FAIL: Config contains banned provider...
[timestamp] [deadman] [<WIZARD>] UNHEALTHY — initiating recovery.
[timestamp] [deadman] [<WIZARD>] Rolling back config to last known good...
[timestamp] [deadman] [<WIZARD>] Config rolled back.
[timestamp] [deadman] [<WIZARD>] Rollback event: agent=<WIZARD> old_hash=<sha> new_hash=<sha>
[timestamp] [deadman] [<WIZARD>] Restarting hermes-<wizard>...
[timestamp] [deadman] [<WIZARD>] Agent restarted via systemd.

3. Simulate full cascade-failure death → verify rollback+restart

  • Stop the agent: systemctl stop hermes-<wizard>, or kill the process.
  • Modify the config to inject a banned provider.
  • Run deadman_action.sh manually (or wait for cron).
  • Verify that the config is rolled back and the agent is restarted (a combined sketch follows).
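
The whole simulation can be run as one sequence. A sketch, reusing the service name and banned-provider line from the examples above:

# 1. Kill the agent and corrupt its config.
systemctl stop hermes-<wizard>
echo 'provider: anthropic' >> {{ wizard_home }}/config.yaml

# 2. Trigger the deadman action directly instead of waiting for cron.
{{ wizard_home }}/deadman_action.sh

# 3. Confirm recovery: injected line gone, agent active again.
grep -c anthropic {{ wizard_home }}/config.yaml   # expect 0
systemctl is-active hermes-<wizard>               # expect: active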

4. Snapshot stored in predictable per-agent location

Check on each wizard:

ls -la {{ deadman_snapshot_dir }}/
# Expected: config.yaml.known_good  +  config.yaml.<timestamp>  (rolling, max {{ deadman_max_snapshots }})
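
To check the retention cap as well (assuming the rolling snapshots all match the config.yaml.<timestamp> pattern shown above):

ls {{ deadman_snapshot_dir }}/config.yaml.* | wc -l
# Expect at most {{ deadman_max_snapshots }} timestamped copies plus config.yaml.known_good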

5. Works with Ansible-deployed cron schedule

Run:

ansible-playbook -i ansible/inventory/hosts.yml ansible/playbooks/deadman_switch.yml

Verify (a combined check is sketched after this list):

  • {{ wizard_home }}/deadman_action.sh exists, mode 0755
  • {{ deadman_snapshot_dir }} exists
  • No systemd timer named deadman-<wizard>.timer exists (per the one-implementation rule)
  • Cron entry present: crontab -l | grep deadman
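
A minimal script covering all four checks (the exact timer-name pattern passed to grep is an assumption):

set -e
test -x {{ wizard_home }}/deadman_action.sh   # deployed and executable (expected mode 0755)
test -d {{ deadman_snapshot_dir }}            # snapshot directory present
if systemctl list-timers --all | grep -q 'deadman-'; then
  echo "FAIL: a deadman systemd timer exists (violates the one-implementation rule)"; exit 1
fi
crontab -l | grep deadman                     # cron entry present
echo "All deadman deploy checks passed."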

6. One implementation only — all overlaps killed

  • Deleted bin/deadman-switch.sh — central watch removed.
  • Deleted bin/deadman-fallback.py — provider fallback not part of deadman recovery.
  • No systemd timer deployed (deadman_switch role no longer deploys service/timer).
  • Scheduling handled exclusively by cron_manager's universal cron (5 min interval).
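
A repo-side spot check that the removed implementations stay removed (paths are relative to the repository root):

# Both should report "gone"; anything else means an overlap crept back in.
for f in bin/deadman-switch.sh bin/deadman-fallback.py; do
  test -e "$f" && echo "$f: STILL PRESENT" || echo "$f: gone"
done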

Troubleshooting

| Symptom | Likely cause | Check |
| --- | --- | --- |
| No snapshot created | Config unreadable or permissions wrong | ls -l {{ wizard_home }}/config.yaml; run as root |
| Rollback doesn't restore | No snapshot file exists | ls -l {{ deadman_snapshot_dir }}/config.yaml.known_good |
| Agent not restarting | systemd not available or service name mismatch | systemctl status hermes-{{ wizard_name \| lower }} |
| SSH watch never fires | Cron not running or Gitea token missing | service cron status; check ~/.hermes/gitea_token_vps |

Notes

This follows the KT Bezalel Architecture Session (2026-04-08) design. The deadman switch now closes the loop from detection to recovery automatically on a 5-minute cadence. See issue #444 for full acceptance criteria and design rules.