[claude] Poka-yoke runner health: provision + health probe scripts (#1097) #1101
Fixes #1097
What this does
Makes unregistered Gitea runners impossible to miss by implementing the full poka-yoke triad (Prevention / Detection / Correction).
Prevention
scripts/provision-runner.sh: idempotent VPS provisioning script that installs act_runner, registers it with Gitea (exits non-zero on failure), enables the systemd service, and installs the health-probe timer in one pass. Runner registration is now a hard gate: provisioning cannot finish without a registered runner.
Detection
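The registration hard gate from the Prevention step could look roughly like the sketch below. It is illustrative only: `hard_gate`, `GITEA_URL` and `RUNNER_TOKEN` are hypothetical names, not taken from scripts/provision-runner.sh. The idea is that any failing step aborts the script with that step's exit code, so nothing after registration runs unless registration succeeded.

```shell
#!/usr/bin/env bash
# Sketch only: hard_gate is a hypothetical helper, not the PR's actual code.

# Run one provisioning step; abort the whole script if it fails.
# $1 = human-readable description, remaining args = command to run.
hard_gate() {
    local description="$1"
    shift
    if "$@"; then
        echo "OK: ${description}"
    else
        local rc=$?
        echo "FAILED: ${description} (exit ${rc})" >&2
        exit "${rc}"
    fi
}

# Possible use inside a provisioning script (GITEA_URL and RUNNER_TOKEN are
# assumed to be set by the caller):
#
#   hard_gate "register runner with Gitea" \
#       act_runner register --no-interactive \
#           --instance "${GITEA_URL}" \
#           --token "${RUNNER_TOKEN}" \
#           --name "$(hostname)"
```

Because the helper calls `exit` rather than `return`, a failed registration stops provisioning before the service is enabled or the timer is installed.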
scripts/runner-health-probe.sh: queries the Gitea admin API for the active runner count every 5 min. Reports the count to the systemd journal on every run (always visible via journalctl -u runner-health-probe). On zero runners, writes a P1 alert marker to /var/lib/act_runner/alerts/p1-zero-runners.alert.
scripts/systemd/runner-health-probe.timer: fires 2 min after boot, then every 5 min (OnUnitActiveSec=5min). Persistent=true catches missed runs after downtime.
Correction
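The Detection behaviour of the probe (report the count on every run, drop a P1 marker on zero) might be sketched as follows. `handle_runner_count` and the `ALERT_DIR` variable are hypothetical, and the sketch deliberately leaves out the API query itself, since this description does not show which Gitea admin endpoint the probe calls.

```shell
#!/usr/bin/env bash
# Sketch only: names are hypothetical; the marker path matches the description.
ALERT_DIR="${ALERT_DIR:-/var/lib/act_runner/alerts}"

# Log the active runner count; on zero, write the P1 marker and return 1.
# $1 = runner count, obtained elsewhere from the Gitea admin API.
handle_runner_count() {
    local count="$1"
    # When run from a systemd oneshot service, stdout ends up in the journal.
    echo "runner-health-probe: active runners = ${count}"
    if [ "${count}" -eq 0 ]; then
        mkdir -p "${ALERT_DIR}"
        printf 'P1: zero active Gitea runners at %s\n' \
            "$(date -u +%Y-%m-%dT%H:%M:%SZ)" \
            > "${ALERT_DIR}/p1-zero-runners.alert"
        return 1
    fi
    return 0
}
```

Returning non-zero on zero runners lets the caller branch straight into the restart path, and the unconditional `echo` is what keeps a healthy count visible in journalctl.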
The probe attempts to restart the act_runner service on zero-runner detection, re-queries to verify recovery, and logs success or failure with actionable next steps. A 10-min cooldown prevents restart storms.
Acceptance criteria
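One way the 10-min restart cooldown from the Correction step could work is with a timestamp file, as in this sketch. `restart_allowed`, `record_restart`, `COOLDOWN_SECS` and the stamp-file approach are all assumptions for illustration; the PR may track the cooldown differently.

```shell
#!/usr/bin/env bash
# Sketch only: function names and the stamp-file mechanism are hypothetical.
COOLDOWN_SECS="${COOLDOWN_SECS:-600}"   # 10 minutes, per the description

# Succeed if no restart was recorded within the cooldown window.
# $1 = stamp file holding the epoch seconds of the last restart attempt
# $2 = current time in epoch seconds (defaults to now)
restart_allowed() {
    local stamp_file="$1"
    local now="${2:-$(date +%s)}"
    local last=0
    if [ -r "${stamp_file}" ]; then
        last="$(cat "${stamp_file}")"
        last="${last:-0}"               # treat an empty file as "never"
    fi
    [ $(( now - last )) -ge "${COOLDOWN_SECS}" ]
}

# Record that a restart attempt is being made.
record_restart() {
    local stamp_file="$1"
    local now="${2:-$(date +%s)}"
    printf '%s\n' "${now}" > "${stamp_file}"
}

# Possible use in the zero-runner branch (STAMP is an assumed path):
#
#   if restart_allowed "${STAMP}"; then
#       record_restart "${STAMP}"
#       systemctl restart act_runner
#   else
#       echo "restart skipped: still inside cooldown window" >&2
#   fi
```

Recording the attempt before restarting means a restart that hangs or fails still counts against the cooldown, which is what prevents a storm.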
act_runner registration step
Deployment
To write the probe token separately:
Add three deliverables to make unregistered Gitea runners impossible to miss:

1. scripts/provision-runner.sh
   - Downloads and installs the act_runner binary at a pinned version
   - Registers the runner with Gitea (exits non-zero if registration fails)
   - Creates and enables the act_runner systemd service
   - Installs the runner-health-probe timer as part of the same provisioning pass
   - Prints a poka-yoke checklist summary at the end

2. scripts/runner-health-probe.sh
   - Queries the Gitea admin API for the active runner count on every run
   - Reports the count to the systemd journal (visible via journalctl)
   - On zero runners: writes a P1 alert marker to /var/lib/act_runner/alerts/, attempts to restart the act_runner service, re-queries to verify recovery, and logs recovery or continued failure with actionable next steps
   - Cooldown (10 min) prevents a restart storm

3. scripts/systemd/runner-health-probe.{service,timer}
   - Systemd oneshot service wrapping the probe script
   - Timer fires 2 min after boot, then every 5 min (OnUnitActiveSec=5min)
   - Persistent=true so missed runs fire on recovery from downtime

Fixes #1097
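For orientation, a timer unit with the settings listed in deliverable 3 could be generated as in the sketch below. `render_probe_timer`, the Description= text and the use of OnBootSec= for the 2-minute boot delay are assumptions; only OnUnitActiveSec=5min and Persistent=true are quoted from the description. Note that systemd.timer(5) documents Persistent= as affecting only OnCalendar= timers, so it may be worth confirming that catch-up after downtime behaves as intended with a purely monotonic timer like this one.

```shell
#!/usr/bin/env bash
# Sketch only: render_probe_timer is a hypothetical helper.

# Print a timer unit matching the schedule described above.
render_probe_timer() {
    cat <<'EOF'
[Unit]
Description=Gitea runner health probe timer

[Timer]
OnBootSec=2min
OnUnitActiveSec=5min
Persistent=true

[Install]
WantedBy=timers.target
EOF
}

# Possible installation steps (run as root):
#
#   render_probe_timer > /etc/systemd/system/runner-health-probe.timer
#   systemctl daemon-reload
#   systemctl enable --now runner-health-probe.timer
```

A timer named runner-health-probe.timer activates runner-health-probe.service by default, so no explicit Unit= line is needed when the two files share a base name.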