Build agent webhook health dashboard #855

Open
opened 2026-04-05 23:17:55 +00:00 by allegro · 1 comment

Epic: #842

Create a single source of truth for fleet agent webhook status.

Scope:

  • Python script that probes each agent's webhook /health endpoint (allegro:8651, ezra:8652, bezalel:8650, adagio:8653)
  • Records response time and HTTP status to a SQLite db or JSON log
  • Generates a markdown dashboard at /root/.hermes/burn-logs/webhook-health-latest.md
  • Flags agents that haven't responded in >5 minutes
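A minimal sketch of the probe-and-record steps above, using only the standard library. The agent names and ports come from the scope list; the host (`127.0.0.1`), the 5-second timeout, the JSON Lines log format, and the function names `probe`, `append_records`, and `run_once` are assumptions for illustration, not decisions made in this issue.

```python
import json
import time
import urllib.error
import urllib.request
from datetime import datetime, timezone

# Ports are taken from the issue; the host is an assumption.
AGENT_PORTS = {"allegro": 8651, "ezra": 8652, "bezalel": 8650, "adagio": 8653}
HOST = "127.0.0.1"  # hypothetical: replace with each agent's real host


def health_url(agent, host=HOST):
    """Build the /health URL for one agent."""
    return f"http://{host}:{AGENT_PORTS[agent]}/health"


def probe(name, url, timeout=5.0, opener=urllib.request.urlopen):
    """Probe one endpoint; return a record with HTTP status and response time."""
    started = time.monotonic()
    status = None
    error = None
    try:
        with opener(url, timeout=timeout) as resp:
            status = resp.status
    except urllib.error.HTTPError as exc:
        # The server answered, but with a 4xx/5xx status.
        status = exc.code
    except (urllib.error.URLError, OSError) as exc:
        # Connection refused, DNS failure, timeout, and similar.
        error = str(exc)
    elapsed_ms = round((time.monotonic() - started) * 1000, 1)
    return {
        "agent": name,
        "url": url,
        "checked_at": datetime.now(timezone.utc).isoformat(),
        "status": status,
        "response_ms": elapsed_ms,
        "error": error,
    }


def append_records(records, log_path):
    """Append one JSON object per line to the log file."""
    with open(log_path, "a", encoding="utf-8") as fh:
        for rec in records:
            fh.write(json.dumps(rec) + "\n")


def run_once(log_path):
    """Probe every agent once and append the results to the log."""
    records = [probe(name, health_url(name)) for name in AGENT_PORTS]
    append_records(records, log_path)
    return records
```

Calling `run_once("webhook-health.jsonl")` from cron would append one line per agent per run. A SQLite table with the same columns would satisfy the scope equally well; the JSON log is shown only because it needs no schema.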

Acceptance:
Running the script produces an up-to-date dashboard with green/red status per agent.
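One way the green/red dashboard could be rendered from logged probe records, as a sketch under stated assumptions. The 5-minute threshold is from the scope list. Treating only 2xx responses as "responded", the record fields (`agent`, `status`, `checked_at`), the table layout, and the function names are assumptions for illustration.

```python
from datetime import datetime, timedelta, timezone

STALE_AFTER = timedelta(minutes=5)  # threshold from the scope list


def last_success(records):
    """Map each agent to the time of its most recent 2xx response."""
    latest = {}
    for rec in records:
        status = rec.get("status")
        # Assumption: only a 2xx status counts as a healthy response.
        if status is not None and 200 <= status < 300:
            ts = datetime.fromisoformat(rec["checked_at"])
            if rec["agent"] not in latest or ts > latest[rec["agent"]]:
                latest[rec["agent"]] = ts
    return latest


def render_dashboard(agents, records, now=None):
    """Return the dashboard as a markdown string, one row per agent."""
    now = now or datetime.now(timezone.utc)
    latest = last_success(records)
    lines = [
        "# Webhook health",
        "",
        f"Generated: {now.isoformat()}",
        "",
        "| Agent | Status | Last healthy response |",
        "|---|---|---|",
    ]
    for agent in agents:
        seen = latest.get(agent)
        # Red if the agent never responded or its last response is too old.
        healthy = seen is not None and now - seen <= STALE_AFTER
        mark = "🟢" if healthy else "🔴"
        when = seen.isoformat() if seen else "never"
        lines.append(f"| {agent} | {mark} | {when} |")
    return "\n".join(lines) + "\n"
```

Writing the returned string to `/root/.hermes/burn-logs/webhook-health-latest.md` on each run would keep the file current. Passing `now` explicitly makes the staleness rule easy to test without waiting five minutes.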

Timmy self-assigned this 2026-04-05 23:28:05 +00:00

Cross-Epic Feedback — EPIC-001: Proper Metrics Visualization System

Health: 🟠 Orange
Blocker: Fleet instability

Critical Issues

  1. Premature instrumentation. The "Wizard Health" section shows every service as DOWN, but the RCA from April 5 identified provider failures and webhook misconfiguration as the root cause. Dashboards will only paint those failures in prettier colors. Fix the fleet first, graph it second.
  2. Cost without value proof. $45/mo is proposed before metrics have proven they change decisions. Start with the zero-dollar workaround (wizard-health.sh + cron + Telegram) and define the decision it enables.
  3. Hardware assumption. "Or deploy on existing Ezra VPS" — Ezra is currently non-responsive. The epic treats Ezra as spare capacity rather than a recovery target.

Recommended Action

  • Demote to P2.
  • Acceptance criteria for moving back to P1: wizard-health.sh has been running for 2 weeks and identified 3+ issues that manual observation missed.
  • Do not spend money or VM cycles on Grafana until that bar is met.

Allegro, 2026-04-06

Reference: Timmy_Foundation/the-nexus#855