[OPS] Fleet Health Monitor — Detect down agents and rate-limited models #152
Problem
There is no automated way to check which agents are down or which models are rate-limited across the fleet. Alexander had to manually ask Ezra to check which agents were down and which models were rate-limited.
Goal
Build a lightweight fleet health monitor that automatically detects and reports agent downtime and model rate-limiting.
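As a rough illustration of the goal, the core check could be a pure function over the last known heartbeat per agent and the last HTTP status per model. This is only a sketch under stated assumptions, not the implementation in any PR: the names `check_fleet`, `FleetReport`, and `HEARTBEAT_TIMEOUT` are hypothetical, the 30-minute threshold is an assumed value, and treating HTTP 429 as the rate-limit signal is an assumption about how providers respond.

```python
from __future__ import annotations

from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

# Assumed threshold (hypothetical): an agent counts as "down" when its last
# heartbeat is older than this window.
HEARTBEAT_TIMEOUT = timedelta(minutes=30)

# Assumption: providers signal rate limiting with HTTP 429 (Too Many Requests).
RATE_LIMIT_STATUS = 429


@dataclass
class FleetReport:
    """Result of one health sweep (hypothetical structure)."""
    down_agents: list[str]
    rate_limited_models: list[str]


def check_fleet(
    last_heartbeat: dict[str, datetime],
    last_model_status: dict[str, int],
    now: datetime,
) -> FleetReport:
    """Classify agents and models from already-collected observations.

    last_heartbeat maps agent name -> time of the most recent heartbeat.
    last_model_status maps model id -> most recent HTTP status code seen.
    """
    down = sorted(
        name
        for name, seen in last_heartbeat.items()
        if now - seen > HEARTBEAT_TIMEOUT
    )
    limited = sorted(
        model
        for model, status in last_model_status.items()
        if status == RATE_LIMIT_STATUS
    )
    return FleetReport(down_agents=down, rate_limited_models=limited)
```

Keeping the check free of network calls means the collection step (however heartbeats and statuses are gathered) can change without touching the classification logic, and the function can be exercised with fixed timestamps.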
Scope
Context
Requested in Telegram after Allegro and Bezalel became unresponsive due to missing fallback configurations.
Related
Acceptance Criteria
/assign @ezra @bezalel
Update from Bezalel: First iteration of the fleet health monitor is now implemented in PR #166.
devkit/health.py provides:
/proc/meminfo
Usage:
Next: hook this into Epic-003 (Observatory) for scheduled Telegram alerting and historical tracking.
/assign @bezalel
[BURN-DOWN] Dispatched to Code Claw (claw-code worker) as part of nightly burn-down cycle. Heartbeat active.
🟠 Code Claw (OpenRouter qwen/qwen3.6-plus:free) picking up this issue via 15-minute heartbeat.
Timestamp: 2026-04-07T04:06:21Z