Alexander Whitestone
|
9fa0a59761
|
feat: add Observatory health monitoring & alerting for running services
Nix / nix (macos-latest) (pull_request) Waiting to run
Dependency Audit / Audit Python dependencies (pull_request) Failing after 4s
Docker Build and Publish / build-and-push (pull_request) Failing after 19s
Nix / nix (ubuntu-latest) (pull_request) Failing after 2s
Secret Scan / Scan for secrets (pull_request) Failing after 2s
Supply Chain Audit / Scan PR for supply chain risks (pull_request) Failing after 2s
Tests / test (pull_request) Failing after 6s
Implements Bezalel Epic-003 — a lightweight monitoring script that:
- Checks gateway and API server process liveness
- Monitors disk, memory, and CPU thresholds (warn/critical levels)
- Probes webhook and API server HTTP endpoints for responsiveness
- Verifies SQLite database connectivity and size
- Sends Telegram alerts when checks degrade or recover (within 60s)
- Posts daily digest reports summarising 24h health, SLO status, error counts
- Persists 30 days of health snapshots in SQLite (~/.hermes/observatory.db)
- Tracks alerts_sent for trend analysis
- Defines and tracks SLOs: gateway uptime ≥99.5%, webhook p95 latency ≤2s
Usage:
python observatory.py --check # one-shot check
python observatory.py --daemon # continuous 60s poll
python observatory.py --digest # print daily digest
python observatory.py --send-digest # send digest via Telegram
python observatory.py --slo # print SLO report
python observatory.py --history 20 # show last 20 records
Config via env: OBSERVATORY_ALERT_CHAT_ID, TELEGRAM_BOT_TOKEN, etc.
Adds psutil optional dependency group in pyproject.toml.
45 unit tests covering checks, persistence, alerting, digest, SLOs, and CLI.
Fixes #147
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
2026-04-06 21:59:14 -04:00 |
|