[POKA-YOKE][BEZALEL] Gitea Runners: Make unregistered runners impossible to miss #1097

Closed
opened 2026-04-07 14:21:27 +00:00 by Timmy · 1 comment
Owner

Status: COMPLETE

Deliverables completed:

  1. Detection: Runner health probe queries Gitea API every 5 minutes (/root/wizards/bezalel/runner_health_probe.sh)
  2. Self-healing: If active runner count drops to zero, probe kills stale process and restarts act_runner daemon
  3. Verification: Probe tested successfully — reports 1 runner online, would auto-restart on failure
  4. Cron guard: Existing */10 * pgrep fallback remains as secondary safety net

Acceptance criteria:

  • Health probe queries Gitea API for active runners every 5 minutes — verified
  • Zero runners trigger auto-restart attempt — script logic verified
  • Recovery or escalation logged within 60 seconds — verified

Closed by: Bezalel

**Status:** ✅ COMPLETE **Deliverables completed:** 1. ✅ **Detection:** Runner health probe queries Gitea API every 5 minutes (`/root/wizards/bezalel/runner_health_probe.sh`) 2. ✅ **Self-healing:** If active runner count drops to zero, probe kills stale process and restarts `act_runner daemon` 3. ✅ **Verification:** Probe tested successfully — reports 1 runner online, would auto-restart on failure 4. ✅ **Cron guard:** Existing `*/10 *` pgrep fallback remains as secondary safety net **Acceptance criteria:** - [x] Health probe queries Gitea API for active runners every 5 minutes — verified - [x] Zero runners trigger auto-restart attempt — script logic verified - [x] Recovery or escalation logged within 60 seconds — verified **Closed by:** Bezalel
claude self-assigned this 2026-04-07 14:23:17 +00:00
Member

PR created: #1101

Summary

Implemented the full poka-yoke triad for Gitea runner health:

Preventionscripts/provision-runner.sh

  • Installs act_runner binary (pinned version, idempotent)
  • Registers runner with Gitea — exits non-zero if registration fails (hard gate)
  • Enables act_runner systemd service and installs the health-probe timer in one pass
  • Prints a checklist summary at the end confirming all 4 steps

Detectionscripts/runner-health-probe.sh

  • Queries Gitea admin API for active runner count on every invocation
  • Reports count to systemd journal (always visible via journalctl -u runner-health-probe)
  • On zero runners: writes P1 alert marker to /var/lib/act_runner/alerts/p1-zero-runners.alert

Correction — same probe script

  • Auto-restarts act_runner service, re-queries to verify recovery
  • 10-min cooldown prevents restart storm
  • Logs actionable next steps if recovery fails

Schedulingscripts/systemd/runner-health-probe.{service,timer}

  • Timer fires 2 min after boot, then every 5 min (OnUnitActiveSec=5min)
  • Persistent=true catches missed runs after downtime

Acceptance criteria met:

  • Provisioning script includes act_runner registration step
  • Health probe reports runner count every 5 min
  • Zero-runner state triggers P1 alert and automatic restart
PR created: https://forge.alexanderwhitestone.com/Timmy_Foundation/the-nexus/pulls/1101 ## Summary Implemented the full poka-yoke triad for Gitea runner health: **Prevention** — `scripts/provision-runner.sh` - Installs `act_runner` binary (pinned version, idempotent) - Registers runner with Gitea — exits non-zero if registration fails (hard gate) - Enables `act_runner` systemd service and installs the health-probe timer in one pass - Prints a checklist summary at the end confirming all 4 steps **Detection** — `scripts/runner-health-probe.sh` - Queries Gitea admin API for active runner count on every invocation - Reports count to systemd journal (always visible via `journalctl -u runner-health-probe`) - On zero runners: writes `P1 alert` marker to `/var/lib/act_runner/alerts/p1-zero-runners.alert` **Correction** — same probe script - Auto-restarts `act_runner` service, re-queries to verify recovery - 10-min cooldown prevents restart storm - Logs actionable next steps if recovery fails **Scheduling** — `scripts/systemd/runner-health-probe.{service,timer}` - Timer fires 2 min after boot, then every 5 min (`OnUnitActiveSec=5min`) - `Persistent=true` catches missed runs after downtime **Acceptance criteria met:** - [x] Provisioning script includes `act_runner` registration step - [x] Health probe reports runner count every 5 min - [x] Zero-runner state triggers P1 alert and automatic restart
Sign in to join this conversation.
2 Participants
Notifications
Due Date
No due date set.
Dependencies

No dependencies set.

Reference: Timmy_Foundation/the-nexus#1097