[Bezalel Epic-003] The Testbed Observatory — Health Monitoring & Alerting for Running Services #147

Closed
opened 2026-04-06 22:41:59 +00:00 by Timmy · 1 comment
Owner

Epic Statement

I will build the eyes and ears of the forge. When a service coughs, I will know before the users do.

Scope

  1. Build a lightweight monitoring script (observatory.py) that checks:
    • Gateway and API server process liveness
    • Disk, memory, and CPU thresholds
    • Webhook endpoint responsiveness
    • Database connectivity and size
  2. Integrate alerting via Telegram (our home channel) when thresholds are breached or services die.
  3. Add a daily digest report summarizing system health, error counts, and any restarted processes.
  4. Store observability state in a small SQLite db or JSONL log for trend analysis.
  5. Define SLOs (Service Level Objectives) for gateway uptime and webhook latency.
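
A rough sketch of how the threshold checks in item 1 could classify a reading. This is an illustration, not the contents of observatory.py: the `Threshold` class, the `classify` and `disk_used_pct` helpers, and the 80/90 percent levels are all assumptions made for the example.

```python
import shutil
from dataclasses import dataclass


@dataclass(frozen=True)
class Threshold:
    """Hypothetical warn/critical levels, expressed as percent used."""
    warn: float
    critical: float


def classify(value_pct: float, threshold: Threshold) -> str:
    """Map a usage percentage to 'ok', 'warn', or 'critical'."""
    if value_pct >= threshold.critical:
        return "critical"
    if value_pct >= threshold.warn:
        return "warn"
    return "ok"


def disk_used_pct(path: str = ".") -> float:
    """Percent of the filesystem holding `path` that is in use."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


# Assumed levels for illustration only; the epic does not specify numbers.
DISK = Threshold(warn=80.0, critical=90.0)

if __name__ == "__main__":
    pct = disk_used_pct()
    print(f"disk: {pct:.1f}% used -> {classify(pct, DISK)}")
```

Memory and CPU readings could go through the same `classify` function with their own `Threshold` values, which keeps the alerting logic independent of how each metric is collected.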

Success Criteria

  • Alerts fire within 60 seconds of a service failure.
  • Daily health digest is posted automatically.
  • 30 days of historical health data is queryable.
  • SLOs are documented and tracked.
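
One way the 30-day queryable history could be kept in SQLite, sketched under assumptions: the `checks` table, its columns, and the `open_store`, `record`, and `history` helpers are hypothetical names for this example and are not taken from the implementation.

```python
import sqlite3
import time

RETENTION_SECONDS = 30 * 24 * 3600  # 30 days, per the success criteria


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) the history database. Schema is an assumption."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS checks ("
        " ts REAL NOT NULL,"        # Unix timestamp of the check
        " name TEXT NOT NULL,"      # e.g. 'gateway', 'webhook'
        " status TEXT NOT NULL,"    # e.g. 'ok', 'warn', 'down'
        " latency_ms REAL)"         # NULL when not applicable
    )
    conn.execute("CREATE INDEX IF NOT EXISTS idx_checks_ts ON checks (ts)")
    return conn


def record(conn, name, status, latency_ms=None, now=None):
    """Insert one check result and prune rows older than the retention window."""
    now = time.time() if now is None else now
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT INTO checks (ts, name, status, latency_ms) VALUES (?, ?, ?, ?)",
            (now, name, status, latency_ms),
        )
        conn.execute("DELETE FROM checks WHERE ts < ?", (now - RETENTION_SECONDS,))


def history(conn, days, now=None):
    """Return all check rows from the last `days` days, oldest first."""
    now = time.time() if now is None else now
    cur = conn.execute(
        "SELECT ts, name, status, latency_ms FROM checks WHERE ts >= ? ORDER BY ts",
        (now - days * 86400,),
    )
    return cur.fetchall()
```

Pruning on every insert keeps the file bounded without a separate cleanup job; a daemon that polls frequently might prefer to prune once per day instead.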

Owner

Bezalel

claude self-assigned this 2026-04-07 01:54:29 +00:00
Member

PR created: #159 (https://forge.alexanderwhitestone.com/Timmy_Foundation/hermes-agent/pulls/159)

Added observatory.py — a standalone health monitoring daemon implementing all scope items from the epic:

  • Process liveness: gateway PID-file check + API server HTTP probe
  • System thresholds: disk/memory/CPU with configurable warn/critical levels
  • Webhook responsiveness: HTTP probe with latency measurement
  • Database checks: observatory SQLite + API response store connectivity
  • Telegram alerting: fires within one poll cycle (default 60s) of degradation, sends recovery alerts, deduplicates sustained failures
  • Daily digest: 24h summary with SLO status, posted to Telegram
  • 30-day history: SQLite persistence with auto-pruning, queryable via --history N
  • SLOs defined: gateway uptime ≥99.5%, webhook/API p95 latency ≤2000ms, tracked via --slo

45 unit tests, all passing.
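
For readers following along, here is a minimal sketch of how the two SLO targets listed above (gateway uptime ≥99.5%, p95 latency ≤2000ms) might be evaluated from stored samples. It is not the code from the PR: the function names, the "ok" status string, and the nearest-rank percentile method are assumptions chosen for the example.

```python
# Targets as stated in the comment above.
GATEWAY_UPTIME_TARGET_PCT = 99.5
LATENCY_P95_TARGET_MS = 2000.0


def uptime_pct(statuses):
    """Percent of samples whose status is 'ok'; None if there are no samples."""
    if not statuses:
        return None
    up = sum(1 for s in statuses if s == "ok")
    return 100.0 * up / len(statuses)


def p95(latencies_ms):
    """95th percentile by the nearest-rank method; None if there are no samples."""
    if not latencies_ms:
        return None
    ordered = sorted(latencies_ms)
    # ceil(0.95 * n) computed with integers to avoid float rounding surprises
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]


def slo_report(statuses, latencies_ms):
    """Summarize both SLOs; an SLO with no data is reported as not met."""
    up = uptime_pct(statuses)
    lat = p95(latencies_ms)
    return {
        "uptime_pct": up,
        "uptime_ok": up is not None and up >= GATEWAY_UPTIME_TARGET_PCT,
        "p95_ms": lat,
        "p95_ok": lat is not None and lat <= LATENCY_P95_TARGET_MS,
    }
```

Treating "no data" as a miss is a deliberate choice in this sketch: a silent monitor should not look like a healthy service.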


Reference: Timmy_Foundation/hermes-agent#147