feat(ci): poka-yoke runner health — provision + probe scripts (#1097)

Add three deliverables to make unregistered Gitea runners impossible to miss:

1. scripts/provision-runner.sh
   - Downloads and installs act_runner binary at a pinned version
   - Registers the runner with Gitea (fails non-zero if registration fails)
   - Creates and enables act_runner systemd service
   - Installs runner-health-probe timer as part of the same provisioning pass
   - Prints a poka-yoke checklist summary at the end

2. scripts/runner-health-probe.sh
   - Queries Gitea admin API for active runner count every run
   - Reports count to systemd journal (visible via journalctl)
   - On zero runners: writes P1 alert marker to /var/lib/act_runner/alerts/,
     attempts to restart act_runner service, re-queries to verify recovery,
     logs recovery or continued failure with actionable next steps
   - Cooldown (10 min) prevents restart storm

3. scripts/systemd/runner-health-probe.{service,timer}
   - Systemd oneshot service wrapping the probe script
   - Timer fires 2 min after boot then every 5 min (OnUnitActiveSec=5min)
   - Persistent=true is set, but systemd applies it only to OnCalendar= timers,
     so catch-up after downtime relies on the OnBootSec trigger at the next boot
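
For item 1, a minimal sketch of how a checklist summary can double as the exit status of a provisioning script. This is not taken from scripts/provision-runner.sh; the helper names `check` and `summary`, the `FAILED` counter, and the output format are assumptions made for illustration.

```shell
#!/usr/bin/env bash
# Sketch only: helper names and output format are illustrative assumptions,
# not the contents of scripts/provision-runner.sh.

FAILED=0

# Runs the given command quietly and prints one checklist line for it.
check() {
  local label="$1"
  shift
  if "$@" >/dev/null 2>&1; then
    printf '[PASS] %s\n' "$label"
  else
    printf '[FAIL] %s\n' "$label"
    FAILED=$(( FAILED + 1 ))
  fi
}

# Prints the summary line and returns non-zero if any check failed.
summary() {
  if [ "$FAILED" -eq 0 ]; then
    echo "all checks passed"
  else
    echo "$FAILED check(s) failed"
    return 1
  fi
}
```

Each provisioning step would be wrapped in a `check` call, for example `check "act_runner service enabled" systemctl is-enabled act_runner`, with `summary || exit 1` as the last line so that a failed registration cannot pass silently.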

Fixes #1097
Alexander Whitestone
2026-04-07 10:31:58 -04:00
parent 36222e2bc6
commit 31d81e76b3
4 changed files with 449 additions and 0 deletions

@@ -0,0 +1,16 @@
[Unit]
Description=Gitea Runner Health Probe (poka-yoke zero-runner detection)
Documentation=https://forge.alexanderwhitestone.com/Timmy_Foundation/the-nexus/issues/1097
After=network.target act_runner.service

[Service]
Type=oneshot
ExecStart=/root/wizards/the-nexus/scripts/runner-health-probe.sh
StandardOutput=journal
StandardError=journal
Environment=HOME=/root
# Token can be set here or via /etc/act_runner/gitea-probe-token file
# EnvironmentFile=/etc/act_runner/probe.env

[Install]
WantedBy=multi-user.target
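
The unit above only wraps runner-health-probe.sh, whose source is not shown in this view. Going by the commit message (alert marker on zero runners, restart attempt, 10 minute cooldown), the core decision could be sketched as follows. The function names, the marker and stamp file names, and the `RESTART_CMD` override are assumptions; the runner count is passed in as an argument rather than fetched from the Gitea API, and the re-query after restart is left out.

```shell
#!/usr/bin/env bash
# Sketch only: function names, file names, and defaults are illustrative
# assumptions, not the contents of scripts/runner-health-probe.sh.

ALERT_DIR="${ALERT_DIR:-/var/lib/act_runner/alerts}"  # directory named in the commit message
STAMP_FILE="${STAMP_FILE:-$ALERT_DIR/last-restart}"   # hypothetical stamp file
COOLDOWN_SECS="${COOLDOWN_SECS:-600}"                 # 10 minute cooldown
RESTART_CMD="${RESTART_CMD:-systemctl restart act_runner}"

# Returns 0 while the last restart attempt is younger than COOLDOWN_SECS.
cooldown_active() {
  [ -f "$STAMP_FILE" ] || return 1
  local last now
  last="$(cat "$STAMP_FILE")"
  now="$(date +%s)"
  [ $(( now - last )) -lt "$COOLDOWN_SECS" ]
}

# Takes the active runner count, prints the action taken,
# and returns non-zero whenever the count is zero.
handle_runner_count() {
  local count="$1"
  if [ "$count" -gt 0 ]; then
    echo "ok: $count active runner(s)"
    return 0
  fi
  mkdir -p "$ALERT_DIR"
  date +%s > "$ALERT_DIR/P1-zero-runners"   # hypothetical marker name
  if cooldown_active; then
    echo "skip: zero runners, restart suppressed by cooldown"
    return 1
  fi
  date +%s > "$STAMP_FILE"                  # record this attempt for the cooldown
  if $RESTART_CMD; then
    echo "restart: zero runners, restart attempted"
  else
    echo "fail: zero runners, restart command failed"
  fi
  return 1
}
```

Storing the epoch timestamp in the stamp file, rather than relying on the file's mtime, keeps the cooldown check independent of any particular `stat` implementation. Overriding `RESTART_CMD` (for example with `true`) lets the logic be exercised without touching the real service.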

@@ -0,0 +1,14 @@
[Unit]
Description=Gitea Runner Health Probe — fires every 5 minutes (poka-yoke #1097)
Documentation=https://forge.alexanderwhitestone.com/Timmy_Foundation/the-nexus/issues/1097

[Timer]
# Start 2 minutes after boot (let network and act_runner settle)
OnBootSec=2min
# Then fire every 5 minutes
OnUnitActiveSec=5min
# Persistent= only affects OnCalendar= timers; it is a no-op with the monotonic triggers above
Persistent=true

[Install]
WantedBy=timers.target