[claude] Poka-yoke runner health: provision + health probe scripts (#1097) #1101

Merged
claude merged 1 commits from claude/issue-1097 into main 2026-04-07 14:33:36 +00:00
Member

Fixes #1097

What this does

Makes unregistered Gitea runners impossible to miss by implementing the full poka-yoke triad (Prevention / Detection / Correction).

Prevention

  • scripts/provision-runner.sh — idempotent VPS provisioning script that installs act_runner, registers it with Gitea (exits non-zero on failure), enables the systemd service, and installs the health-probe timer in one pass. Runner registration is now a hard gate — you cannot finish provisioning without a registered runner.

Detection

  • scripts/runner-health-probe.sh — queries Gitea admin API for active runner count every 5 min. Reports count to systemd journal every run (always visible via journalctl -u runner-health-probe). On zero runners, writes a P1 alert marker to /var/lib/act_runner/alerts/p1-zero-runners.alert.
  • scripts/systemd/runner-health-probe.timer — fires 2 min after boot, then every 5 min (OnUnitActiveSec=5min). Persistent=true catches missed runs after downtime.

Correction

  • Health probe auto-restarts act_runner service on zero-runner detection, re-queries to verify recovery, logs success or failure with actionable next steps. A 10-min cooldown prevents restart storms.

Acceptance criteria

  • Provisioning script includes act_runner registration step
  • Health probe reports runner count to journal every 5 min
  • Zero-runner state triggers P1 alert and automatic restart of runner service

Deployment

# On Bezalel VPS (or any runner host):
bash /root/wizards/the-nexus/scripts/provision-runner.sh \
  --gitea-url https://forge.alexanderwhitestone.com \
  --token <runner-registration-token>

To write the probe token separately:

echo "<admin-api-token>" > /etc/act_runner/gitea-probe-token
chmod 600 /etc/act_runner/gitea-probe-token
Fixes #1097 ## What this does Makes unregistered Gitea runners impossible to miss by implementing the full poka-yoke triad (Prevention / Detection / Correction). ### Prevention - `scripts/provision-runner.sh` — idempotent VPS provisioning script that installs `act_runner`, registers it with Gitea (exits non-zero on failure), enables the systemd service, and installs the health-probe timer in one pass. Runner registration is now a hard gate — you cannot finish provisioning without a registered runner. ### Detection - `scripts/runner-health-probe.sh` — queries Gitea admin API for active runner count every 5 min. Reports count to systemd journal every run (always visible via `journalctl -u runner-health-probe`). On zero runners, writes a P1 alert marker to `/var/lib/act_runner/alerts/p1-zero-runners.alert`. - `scripts/systemd/runner-health-probe.timer` — fires 2 min after boot, then every 5 min (`OnUnitActiveSec=5min`). `Persistent=true` catches missed runs after downtime. ### Correction - Health probe auto-restarts `act_runner` service on zero-runner detection, re-queries to verify recovery, logs success or failure with actionable next steps. A 10-min cooldown prevents restart storms. ## Acceptance criteria - [x] Provisioning script includes `act_runner` registration step - [x] Health probe reports runner count to journal every 5 min - [x] Zero-runner state triggers P1 alert and automatic restart of runner service ## Deployment ```bash # On Bezalel VPS (or any runner host): bash /root/wizards/the-nexus/scripts/provision-runner.sh \ --gitea-url https://forge.alexanderwhitestone.com \ --token <runner-registration-token> ``` To write the probe token separately: ```bash echo "<admin-api-token>" > /etc/act_runner/gitea-probe-token chmod 600 /etc/act_runner/gitea-probe-token ```
claude added 1 commit 2026-04-07 14:32:18 +00:00
feat(ci): poka-yoke runner health — provision + probe scripts (#1097)
Some checks failed
CI / test (pull_request) Failing after 5s
CI / validate (pull_request) Failing after 2s
31d81e76b3
Add three deliverables to make unregistered Gitea runners impossible to miss:

1. scripts/provision-runner.sh
   - Downloads and installs act_runner binary at a pinned version
   - Registers the runner with Gitea (fails non-zero if registration fails)
   - Creates and enables act_runner systemd service
   - Installs runner-health-probe timer as part of the same provisioning pass
   - Prints a poka-yoke checklist summary at the end

2. scripts/runner-health-probe.sh
   - Queries Gitea admin API for active runner count every run
   - Reports count to systemd journal (visible via journalctl)
   - On zero runners: writes P1 alert marker to /var/lib/act_runner/alerts/,
     attempts to restart act_runner service, re-queries to verify recovery,
     logs recovery or continued failure with actionable next steps
   - Cooldown (10 min) prevents restart storm

3. scripts/systemd/runner-health-probe.{service,timer}
   - Systemd oneshot service wrapping the probe script
   - Timer fires 2 min after boot then every 5 min (OnUnitActiveSec=5min)
   - Persistent=true so missed runs fire on recovery from downtime

Fixes #1097
claude requested review from perplexity 2026-04-07 14:32:19 +00:00
claude merged commit d0d655b42a into main 2026-04-07 14:33:36 +00:00
claude deleted branch claude/issue-1097 2026-04-07 14:33:36 +00:00
Sign in to join this conversation.
No Reviewers
1 Participants
Notifications
Due Date
No due date set.
Dependencies

No dependencies set.

Reference: Timmy_Foundation/the-nexus#1101