[POKA-YOKE][BEZALEL] Heartbeats: Make silent cron failures impossible #1096

Closed
opened 2026-04-07 14:21:25 +00:00 by Timmy · 2 comments
Owner

Status: COMPLETE

Deliverables completed:

  1. Prevention: All cron jobs now write timestamp files to /var/lib/bezalel/heartbeats/ on completion
  2. Detection: Meta-heartbeat (/root/wizards/bezalel/meta_heartbeat.sh) checks all timestamps every 15 minutes
  3. Correction: Meta-heartbeat exits with error code and logs stale jobs; can be wired to auto-restart + alert
  4. Coverage: nightly_watch, mempalace_nightly, db_backup, runner_health all monitored

Acceptance criteria:

  • Every cron job touches a .last file on success — implemented
  • Meta-heartbeat runs every 15 minutes — cron active
  • Stale heartbeat triggers visible alert — verified via dry-run

Closed by: Bezalel

claude self-assigned this 2026-04-07 14:28:29 +00:00
Timmy closed this issue 2026-04-07 14:33:57 +00:00
Member

PR created: #1102 (https://forge.alexanderwhitestone.com/Timmy_Foundation/the-nexus/pulls/1102)

All three acceptance criteria implemented:

  • Every cron job writes <job>.last: nexus/cron_heartbeat.py provides write_cron_heartbeat(job, interval_seconds). Writes are atomic, with /var/run/bezalel/heartbeats/ as the primary directory and ~/.bezalel/heartbeats/ as the fallback. nexus_watchdog.py now writes its own heartbeat.
  • Meta-heartbeat checker: bin/check_cron_heartbeats.py runs every 15 min, flags any job silent for more than 2× its interval, and creates/closes a Gitea [heartbeat-checker] issue automatically.
  • Night Watch heartbeat panel: bin/night_watch.py generates the nightly report and includes a ## Heartbeat Panel table with job/status/age/interval/ratio for every .last file.
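For readers who have not opened the PR, here is a minimal sketch of what an atomic heartbeat write with a fallback directory can look like. It is not the code in nexus/cron_heartbeat.py: the function name `write_heartbeat_sketch`, the `dirs` and `now` parameters, and the JSON payload layout are assumptions made for illustration. Only the two directory paths and the `<job>.last` naming come from the comment above.

```python
import json
import os
import tempfile
import time
from pathlib import Path

# Directory order as described in the PR comment; the rest is assumed.
DEFAULT_DIRS = (
    Path("/var/run/bezalel/heartbeats"),
    Path(os.path.expanduser("~/.bezalel/heartbeats")),
)


def write_heartbeat_sketch(job, interval_seconds, dirs=DEFAULT_DIRS, now=None):
    """Write <job>.last atomically into the first usable directory.

    The JSON payload format is hypothetical, not the project's real format.
    """
    payload = json.dumps({
        "job": job,
        "timestamp": time.time() if now is None else now,
        "interval_seconds": interval_seconds,
    })
    last_error = None
    for directory in dirs:
        directory = Path(directory)
        try:
            directory.mkdir(parents=True, exist_ok=True)
            # The temp file lives in the target directory so os.replace
            # stays on one filesystem and is therefore atomic.
            fd, tmp_name = tempfile.mkstemp(
                dir=directory, prefix=f".{job}.", suffix=".tmp"
            )
            try:
                with os.fdopen(fd, "w") as handle:
                    handle.write(payload)
                    handle.flush()
                    os.fsync(handle.fileno())
                target = directory / f"{job}.last"
                os.replace(tmp_name, target)
            except BaseException:
                # Do not leave a half-written temp file behind.
                if os.path.exists(tmp_name):
                    os.unlink(tmp_name)
                raise
            return target
        except OSError as exc:
            # Remember the failure and try the next (fallback) directory.
            last_error = exc
    raise OSError(f"no writable heartbeat directory for {job!r}") from last_error
```

A job would call this as its last step, for example `write_heartbeat_sketch("db_backup", 86400)`. Writing to a temp file and renaming means a checker never sees a partially written `.last` file.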

28 new tests, all 50 tests green.

Member

PR created: #1107 (https://forge.alexanderwhitestone.com/Timmy_Foundation/the-nexus/pulls/1107)

Implemented the full poka-yoke triad:

Prevention: scripts/cron-heartbeat-write.sh. Cron jobs call this on completion to write /var/run/bezalel/heartbeats/<job>.last atomically.

Detection: bin/bezalel_heartbeat_check.py. The meta-heartbeat checker (pure stdlib Python) scans all .last files every 15 minutes via scripts/systemd/bezalel-meta-heartbeat.timer and fires a P1 alert for any job stale for more than 2× its interval.
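The staleness rule can be sketched in a few lines of stdlib Python. This is an illustration, not the contents of bin/bezalel_heartbeat_check.py: the function name `find_stale_jobs` is hypothetical, and it assumes each `.last` file holds JSON with `timestamp` and `interval_seconds` keys, which the thread does not confirm.

```python
import json
import time
from pathlib import Path

STALE_FACTOR = 2.0  # "stale > 2x its interval", per the comment above


def find_stale_jobs(directory, now=None, factor=STALE_FACTOR):
    """Return (job, reason) pairs for stale or unreadable heartbeat files.

    Assumes a hypothetical JSON payload: {"timestamp": ..., "interval_seconds": ...}.
    """
    now = time.time() if now is None else now
    stale = []
    for path in sorted(Path(directory).glob("*.last")):
        try:
            record = json.loads(path.read_text())
            age = now - float(record["timestamp"])
            interval = float(record["interval_seconds"])
            if interval <= 0:
                raise ValueError("interval must be positive")
        except (OSError, ValueError, KeyError, TypeError):
            # A heartbeat that cannot be parsed is treated as a failure,
            # so a corrupt file can never hide a dead job.
            stale.append((path.stem, "unreadable"))
            continue
        if age > factor * interval:
            stale.append(
                (path.stem, f"age {age:.0f}s > {factor:g}x {interval:.0f}s")
            )
    return stale
```

A checker entry point could print each pair and exit non-zero when the list is non-empty, which is what lets a timer or cron wrapper turn staleness into an alert.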

Correction (Night Watch panel): nexus/morning_report.py now includes a heartbeat panel with +/- status per job. Stale jobs are promoted to blockers in the nightly report.
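As a rough illustration of the panel idea, the sketch below turns per-job ages into `+`/`-` lines and collects stale jobs as blockers. The function name `build_heartbeat_panel`, the input tuple shape, and the exact line wording are all assumptions; the real layout lives in nexus/morning_report.py and may differ.

```python
def build_heartbeat_panel(jobs, factor=2.0):
    """Build panel lines and a blockers list from heartbeat ages.

    jobs: iterable of (name, age_seconds, interval_seconds) tuples.
    Returns (lines, blockers). Both formats are hypothetical.
    """
    lines = []
    blockers = []
    for name, age, interval in sorted(jobs):
        # A job is healthy while its age is within factor x interval.
        ok = age <= factor * interval
        marker = "+" if ok else "-"
        lines.append(
            f"{marker} {name}: last ran {age:.0f}s ago (interval {interval:.0f}s)"
        )
        if not ok:
            # Stale jobs are surfaced as blockers, not only listed.
            blockers.append(f"heartbeat stale: {name}")
    return lines, blockers
```

For example, `build_heartbeat_panel([("db_backup", 200000, 86400)])` marks db_backup with `-` and returns one blocker, because 200000 s exceeds 2 × 86400 s.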

18 new tests in tests/test_bezalel_heartbeat.py, all passing.

Reference: Timmy_Foundation/the-nexus#1096