Files
Alexander Whitestone 9fa0a59761
Some checks failed
Dependency Audit / Audit Python dependencies (pull_request) Failing after 4s
Docker Build and Publish / build-and-push (pull_request) Failing after 19s
Nix / nix (ubuntu-latest) (pull_request) Failing after 2s
Secret Scan / Scan for secrets (pull_request) Failing after 2s
Supply Chain Audit / Scan PR for supply chain risks (pull_request) Failing after 2s
Tests / test (pull_request) Failing after 6s
Nix / nix (macos-latest) (pull_request) Has been cancelled
feat: add Observatory health monitoring & alerting for running services
Implements Bezalel Epic-003 — a lightweight monitoring script that:

- Checks gateway and API server process liveness
- Monitors disk, memory, and CPU thresholds (warn/critical levels)
- Probes webhook and API server HTTP endpoints for responsiveness
- Verifies SQLite database connectivity and size
- Sends Telegram alerts when checks degrade or recover (within 60s)
- Posts daily digest reports summarising 24h health, SLO status, error counts
- Persists 30 days of health snapshots in SQLite (~/.hermes/observatory.db)
- Tracks alerts_sent for trend analysis
- Defines and tracks SLOs: gateway uptime ≥99.5%, webhook p95 latency ≤2s

Usage:
  python observatory.py --check        # one-shot check
  python observatory.py --daemon       # continuous 60s poll
  python observatory.py --digest       # print daily digest
  python observatory.py --send-digest  # send digest via Telegram
  python observatory.py --slo          # print SLO report
  python observatory.py --history 20   # show last 20 records

Config via env: OBSERVATORY_ALERT_CHAT_ID, TELEGRAM_BOT_TOKEN, etc.
Adds psutil optional dependency group in pyproject.toml.
45 unit tests covering checks, persistence, alerting, digest, SLOs, and CLI.

Fixes #147

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-06 21:59:14 -04:00
..