[OPS] Fleet Health Monitor — Detect down agents and rate-limited models #152

Open
opened 2026-04-06 22:43:44 +00:00 by Timmy · 3 comments
Owner

Problem

There is no automated way to check which agents are down or which models are rate-limited across the fleet. Alexander had to ask Ezra to check manually which agents were down and which models were rate-limited.

Goal

Build a lightweight fleet health monitor that automatically detects and reports agent downtime and model rate-limiting.

Scope

  • Poll each wizard agent for responsiveness (or check their last heartbeat/activity)
  • Query model providers/APIs for rate-limit status where possible
  • Report findings in a structured format (JSON or Telegram message)
  • Optional: scheduled execution via cron
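
A minimal sketch of the heartbeat-based check and the structured report described in the list above. Everything named here is an assumption for illustration: the 30-minute staleness window, the function names `agent_status`, `build_report` and `to_message`, and the report fields are not taken from any existing code in this repository.

```python
import json
from datetime import datetime, timedelta, timezone

# Hypothetical staleness window; the issue does not specify one.
MAX_HEARTBEAT_AGE = timedelta(minutes=30)


def agent_status(last_heartbeat, now, max_age=MAX_HEARTBEAT_AGE):
    """Return "up" if the heartbeat is recent enough, otherwise "down"."""
    if last_heartbeat is None:
        return "down"  # never seen counts as down in this sketch
    return "up" if now - last_heartbeat <= max_age else "down"


def build_report(heartbeats, now):
    """Build a JSON-serialisable report from {agent_name: last_heartbeat}."""
    agents = {
        name: {
            "status": agent_status(seen, now),
            "last_heartbeat": seen.isoformat() if seen else None,
        }
        for name, seen in sorted(heartbeats.items())
    }
    down = [name for name, info in agents.items() if info["status"] == "down"]
    return {"checked_at": now.isoformat(), "agents": agents, "down": down}


def to_message(report):
    """Render the same report as a short plain-text message."""
    lines = [f"Fleet health at {report['checked_at']}"]
    for name, info in report["agents"].items():
        lines.append(f"{name}: {info['status']}")
    return "\n".join(lines)
```

Passing the result of `build_report` to `json.dumps` gives the machine-readable form, while `to_message` gives text that could be sent to Telegram. How heartbeats are actually collected is left open here.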

Context

Requested in Telegram after Allegro and Bezalel became unresponsive due to missing fallback configurations.

Related

  • hermes-agent #150: [BEZALEL][AUDIT] Telegram Request-to-Gitea Tracking Audit
  • the-nexus #894: [OPS] Deploy burn-mode cron jobs (infrastructure for automation)
  • the-nexus #896: [OPS] Implement retry logic and error recovery in burn mode

Acceptance Criteria

  • Script/tool can check which agents are responsive vs down
  • Script/tool can detect rate-limited models
  • Output is machine-readable and human-readable
  • Results can be delivered to Telegram or logged to Gitea
  • Tests cover common failure modes (timeout, 429, connection refused)

/assign @ezra @bezalel

ezra was assigned by Timmy 2026-04-06 22:43:44 +00:00
Author
Owner

Update from Bezalel: First iteration of the fleet health monitor is now implemented in PR #166.

devkit/health.py provides:

  • System load average checks (1min/5min/15min)
  • Disk usage monitoring with thresholds
  • Memory stats from /proc/meminfo
  • Hermes process detection and PID listing
  • Key package version checks
  • JSON output and meaningful exit codes for CI/cron integration

Usage:

python -m devkit.health --threshold-load 1.0 --threshold-disk 90.0 --fail-on-critical
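
For readers without PR #166 at hand, the kind of host-level checks listed above can be sketched with the standard library alone. This is not the contents of `devkit/health.py`: the function names, the strict "greater than" comparison and the default thresholds (copied from the usage line) are assumptions made for illustration.

```python
import os
import shutil


def parse_meminfo(text):
    """Parse /proc/meminfo-style text into {field: integer value}.

    Most fields are reported in kB; a few (e.g. HugePages_Total) are counts.
    """
    stats = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts and parts[0].isdigit():
            stats[key.strip()] = int(parts[0])
    return stats


def disk_used_percent(path="/"):
    """Return the percentage of the filesystem at `path` that is in use."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


def evaluate(load_1min, disk_percent, threshold_load=1.0, threshold_disk=90.0):
    """Return a list of human-readable problems; an empty list means healthy."""
    problems = []
    if load_1min > threshold_load:
        problems.append(f"load {load_1min:.2f} exceeds {threshold_load}")
    if disk_percent > threshold_disk:
        problems.append(f"disk {disk_percent:.1f}% exceeds {threshold_disk}%")
    return problems


def main():
    """Run the checks once and return an exit code (0 healthy, 1 critical)."""
    load_1min, _load_5min, _load_15min = os.getloadavg()  # Unix only
    problems = evaluate(load_1min, disk_used_percent("/"))
    for problem in problems:
        print(problem)
    return 1 if problems else 0
```

Returning the exit code from `main()` instead of exiting inside the checks keeps the individual functions easy to call from tests or from a cron wrapper.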

Next: hook this into Epic-003 (Observatory) for scheduled Telegram alerting and historical tracking.

/assign @bezalel

ezra was unassigned by Rockachopa 2026-04-07 02:47:35 +00:00
claw-code was assigned by Rockachopa 2026-04-07 02:47:35 +00:00
Owner

[BURN-DOWN] Dispatched to Code Claw (claw-code worker) as part of nightly burn-down cycle. Heartbeat active.

Timmy added the claw-code-in-progress label 2026-04-07 04:06:21 +00:00
Author
Owner

🟠 Code Claw (OpenRouter qwen/qwen3.6-plus:free) picking up this issue via 15-minute heartbeat.

Timestamp: 2026-04-07T04:06:21Z

Reference: Timmy_Foundation/hermes-agent#152