[claude] The Testbed Observatory — Health Monitoring & Alerting (#147) #159

Merged

claude merged 1 commit from claude/issue-147 into main

2026-04-07 02:00:41 +00:00


Author: Alexander Whitestone
SHA1: 9fa0a59761
Message: feat: add Observatory health monitoring & alerting for running services

Checks:
- Nix / nix (macos-latest) (pull_request): Waiting to run
- Dependency Audit / Audit Python dependencies (pull_request): Failing after 4s
- Docker Build and Publish / build-and-push (pull_request): Failing after 19s
- Nix / nix (ubuntu-latest) (pull_request): Failing after 2s
- Secret Scan / Scan for secrets (pull_request): Failing after 2s
- Supply Chain Audit / Scan PR for supply chain risks (pull_request): Failing after 2s
- Tests / test (pull_request): Failing after 6s

Implements Bezalel Epic-003 — a lightweight monitoring script that:

- Checks gateway and API server process liveness
- Monitors disk, memory, and CPU thresholds (warn/critical levels)
- Probes webhook and API server HTTP endpoints for responsiveness
- Verifies SQLite database connectivity and size
- Sends Telegram alerts when checks degrade or recover (within 60s)
- Posts daily digest reports summarising 24h health, SLO status, error counts
- Persists 30 days of health snapshots in SQLite (~/.hermes/observatory.db)
- Tracks alerts_sent for trend analysis
- Defines and tracks SLOs: gateway uptime ≥99.5%, webhook p95 latency ≤2s
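
The warn/critical thresholds, the degrade/recover alerts, and the two SLOs listed above could be sketched roughly as follows. This is an illustration only, not the code in observatory.py: the function names, the 80/90 percent thresholds, and the nearest-rank percentile method are all assumptions; only the SLO targets (99.5% uptime, 2s p95) come from the commit message.

```python
from typing import Optional, Sequence

# Hypothetical thresholds (percent used); the commit message does not state the real values.
WARN_PCT = 80.0
CRIT_PCT = 90.0

# SLO targets taken from the commit message.
SLO_GATEWAY_UPTIME_PCT = 99.5
SLO_WEBHOOK_P95_S = 2.0


def classify(used_pct: float, warn: float = WARN_PCT, crit: float = CRIT_PCT) -> str:
    """Map a usage percentage onto 'ok', 'warn', or 'critical'."""
    if used_pct >= crit:
        return "critical"
    if used_pct >= warn:
        return "warn"
    return "ok"


def transition_alert(name: str, previous: Optional[str], current: str) -> Optional[str]:
    """Return alert text only when a check changes state (degrades or recovers)."""
    if previous is None or previous == current:
        return None  # first observation or no change: stay quiet
    if current == "ok":
        return f"RECOVERED: {name} is ok (was {previous})"
    return f"ALERT: {name} is {current} (was {previous})"


def uptime_pct(samples: Sequence[bool]) -> float:
    """Percentage of polls in which the service was up."""
    if not samples:
        return 100.0
    return 100.0 * sum(samples) / len(samples)


def p95(latencies_s: Sequence[float]) -> float:
    """Nearest-rank 95th percentile of the observed latencies."""
    if not latencies_s:
        return 0.0
    ordered = sorted(latencies_s)
    rank = (95 * len(ordered) + 99) // 100  # integer ceil(0.95 * n)
    return ordered[rank - 1]


def slo_met(up_samples: Sequence[bool], latencies_s: Sequence[float]) -> bool:
    """True when both SLO targets hold for the given window."""
    return (
        uptime_pct(up_samples) >= SLO_GATEWAY_UPTIME_PCT
        and p95(latencies_s) <= SLO_WEBHOOK_P95_S
    )
```

Alerting only on state transitions, rather than on every failing poll, is one way to keep a 60s poll loop from repeating the same alert each minute.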

Usage:
  python observatory.py --check        # one-shot check
  python observatory.py --daemon       # continuous 60s poll
  python observatory.py --digest       # print daily digest
  python observatory.py --send-digest  # send digest via Telegram
  python observatory.py --slo          # print SLO report
  python observatory.py --history 20   # show last 20 records
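
A command-line interface with the flags shown above might be wired up with argparse along these lines. The flag names come from the usage text; treating them as mutually exclusive modes, the help strings, and the `build_parser` name are assumptions made for this sketch.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser for the flags listed in the usage text (hypothetical layout)."""
    parser = argparse.ArgumentParser(
        prog="observatory.py",
        description="Health monitoring and alerting for running services",
    )
    # Assumption: exactly one mode is chosen per invocation.
    mode = parser.add_mutually_exclusive_group(required=True)
    mode.add_argument("--check", action="store_true", help="run one check and exit")
    mode.add_argument("--daemon", action="store_true", help="poll continuously every 60s")
    mode.add_argument("--digest", action="store_true", help="print the daily digest")
    mode.add_argument("--send-digest", action="store_true", help="send the digest via Telegram")
    mode.add_argument("--slo", action="store_true", help="print the SLO report")
    mode.add_argument("--history", type=int, metavar="N", help="show the last N records")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(args)
```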

Config via env: OBSERVATORY_ALERT_CHAT_ID, TELEGRAM_BOT_TOKEN, etc.
Adds psutil optional dependency group in pyproject.toml.
45 unit tests covering checks, persistence, alerting, digest, SLOs, and CLI.
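
For the environment-based configuration mentioned above, a minimal sketch of reading the two named variables is given below. `AlertConfig` and `load_alert_config` are hypothetical names, and disabling alerts when either variable is missing is an assumption, not documented behaviour. The variables covered by "etc." are left out because the commit message does not name them.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class AlertConfig:
    """Telegram alert settings read from the environment (hypothetical container)."""

    bot_token: Optional[str]
    chat_id: Optional[str]

    @property
    def enabled(self) -> bool:
        # Assumption: alerting needs both values to be set.
        return bool(self.bot_token and self.chat_id)


def load_alert_config(env: Mapping[str, str] = os.environ) -> AlertConfig:
    """Read TELEGRAM_BOT_TOKEN and OBSERVATORY_ALERT_CHAT_ID; empty strings count as unset."""
    return AlertConfig(
        bot_token=env.get("TELEGRAM_BOT_TOKEN") or None,
        chat_id=env.get("OBSERVATORY_ALERT_CHAT_ID") or None,
    )
```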

Fixes #147

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Date: 2026-04-06 21:59:14 -04:00
