[OPS] Fleet Health Monitor — Detect down agents and rate-limited models #152

Timmy · 2026-04-06T22:43:44Z

Timmy commented

2026-04-06 22:43:44 +00:00

Problem

There is no automated way to check which agents are down or which models are rate-limited across the fleet. Alexander had to manually ask Ezra to check who was down and rate-limited.

Goal

Build a lightweight fleet health monitor that automatically detects and reports agent downtime and model rate-limiting.

Scope

- Poll each wizard agent for responsiveness (or check their last heartbeat/activity)
- Query model providers/APIs for rate-limit status where possible
- Report findings in a structured format (JSON or Telegram message)
- Optional: scheduled execution via cron
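
The polling step above could be sketched roughly as follows. This is only an illustration, not the implementation in any PR: it assumes each agent or provider exposes an HTTP endpoint that can be probed, and the names `classify_probe` and `opener` are hypothetical. It maps one probe to a status string covering the failure modes named in this issue (timeout, 429, connection refused).

```python
import socket
import urllib.error
import urllib.request


def classify_probe(url, timeout=5.0, opener=urllib.request.urlopen):
    """Probe one endpoint and return a status string.

    Possible results: "up", "rate_limited", "timeout", "refused", "error".
    `opener` is injectable so the function can be tested without a network.
    """
    try:
        with opener(url, timeout=timeout) as resp:
            return "up" if 200 <= resp.status < 300 else "error"
    except urllib.error.HTTPError as exc:
        # HTTP 429 is the standard "Too Many Requests" status.
        return "rate_limited" if exc.code == 429 else "error"
    except urllib.error.URLError as exc:
        # urllib wraps lower-level socket errors in URLError.reason.
        reason = exc.reason
        if isinstance(reason, (socket.timeout, TimeoutError)):
            return "timeout"
        if isinstance(reason, ConnectionRefusedError):
            return "refused"
        return "error"
    except (socket.timeout, TimeoutError):
        return "timeout"
    except ConnectionRefusedError:
        return "refused"
    except OSError:
        return "error"
```

Keeping the classification in one pure function means a cron wrapper or a Telegram reporter only has to deal with a small, fixed set of status strings.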

Context

Requested in Telegram after Allegro and Bezalel became unresponsive due to missing fallback configurations.

Related

- hermes-agent #150: [BEZALEL][AUDIT] Telegram Request-to-Gitea Tracking Audit
- the-nexus #894: [OPS] Deploy burn-mode cron jobs (infrastructure for automation)
- the-nexus #896: [OPS] Implement retry logic and error recovery in burn mode

Acceptance Criteria

- [ ] Script/tool can check which agents are responsive vs down
- [ ] Script/tool can detect rate-limited models
- [ ] Output is machine-readable and human-readable
- [ ] Results can be delivered to Telegram or logged to Gitea
- [ ] Tests cover common failure modes (timeout, 429, connection refused)
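
One possible way to satisfy the "machine-readable and human-readable" criterion is to render the same results twice: once as JSON and once as a short text summary that would fit in a Telegram message or a Gitea comment. The sketch below is an assumption about shape only. The function name `render_report`, the status strings, and the JSON keys are all hypothetical, and delivery to Telegram or Gitea is deliberately left out.

```python
import json


def render_report(results):
    """Render check results as (json_string, human_text).

    `results` maps a name (agent or model) to a status string such as
    "up", "timeout", "refused", or "rate_limited". These names and
    statuses are illustrative, not taken from the project.
    """
    unhealthy = {
        name: status
        for name, status in sorted(results.items())
        if status != "up"
    }
    payload = {
        "ok": not unhealthy,
        "checked": len(results),
        "unhealthy": unhealthy,
    }
    machine = json.dumps(payload, sort_keys=True)

    if unhealthy:
        lines = [f"{len(unhealthy)} of {len(results)} checks unhealthy:"]
        lines += [f"- {name}: {status}" for name, status in unhealthy.items()]
    else:
        lines = [f"All {len(results)} checks healthy."]
    return machine, "\n".join(lines)
```

For example, `render_report({"allegro": "timeout", "ezra": "up"})` would yield a JSON payload with `"ok": false` and a two-line text summary naming `allegro`.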

/assign @ezra @bezalel

ezra was assigned by Timmy

2026-04-06 22:43:44 +00:00

Timmy referenced this issue

2026-04-06 22:44:37 +00:00

[BEZALEL][AUDIT] Telegram Request-to-Gitea Tracking Audit #150

Timmy commented

2026-04-07 02:14:05 +00:00

Update from Bezalel: First iteration of the fleet health monitor is now implemented in PR #166.

devkit/health.py provides:

- System load average checks (1min/5min/15min)
- Disk usage monitoring with thresholds
- Memory stats from /proc/meminfo
- Hermes process detection and PID listing
- Key package version checks
- JSON output and meaningful exit codes for CI/cron integration

Usage:

python -m devkit.health --threshold-load 1.0 --threshold-disk 90.0 --fail-on-critical
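
For readers without access to PR #166, the kind of host-level checks listed above could look roughly like the sketch below. It is not the code in `devkit/health.py`. The function names `parse_meminfo`, `collect`, and `evaluate` are hypothetical, and the default thresholds simply mirror the values in the usage line above.

```python
import os
import shutil


def parse_meminfo(text):
    """Parse /proc/meminfo-style text into {field: integer value}.

    Most values in /proc/meminfo are reported in kB.
    """
    info = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        fields = rest.split()
        if fields and fields[0].isdigit():
            info[key.strip()] = int(fields[0])
    return info


def collect(path="/"):
    """Gather load averages and disk usage for `path` (Unix only)."""
    load1, load5, load15 = os.getloadavg()
    usage = shutil.disk_usage(path)
    return {
        "load": [load1, load5, load15],
        "disk_percent": 100.0 * usage.used / usage.total,
    }


def evaluate(load1, disk_percent, threshold_load=1.0, threshold_disk=90.0):
    """Return the names of the checks that exceed their thresholds."""
    critical = []
    if load1 > threshold_load:
        critical.append("load")
    if disk_percent > threshold_disk:
        critical.append("disk")
    return critical
```

A wrapper script could call `collect()`, pass the 1-minute load and disk percentage to `evaluate()`, and exit non-zero when the returned list is non-empty, which is one way to get exit codes that cron or CI can act on.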

Next: hook this into Epic-003 (Observatory) for scheduled Telegram alerting and historical tracking.

/assign @bezalel

ezra was unassigned by Rockachopa

2026-04-07 02:47:35 +00:00

claw-code was assigned by Rockachopa

2026-04-07 02:47:35 +00:00

Rockachopa commented

2026-04-07 02:48:07 +00:00

[BURN-DOWN] Dispatched to Code Claw (claw-code worker) as part of nightly burn-down cycle. Heartbeat active.

Timmy added the claw-code-in-progress label 2026-04-07 04:06:21 +00:00

Timmy commented

2026-04-07 04:06:22 +00:00

🟠 Code Claw (OpenRouter qwen/qwen3.6-plus:free) picking up this issue via 15-minute heartbeat.

Timestamp: 2026-04-07T04:06:21Z

Reference: Timmy_Foundation/hermes-agent#152