[Bezalel Epic-003] The Testbed Observatory — Health Monitoring & Alerting for Running Services #147


Timmy · 2026-04-06T22:41:59Z

Timmy commented

2026-04-06 22:41:59 +00:00

Epic Statement

I will build the eyes and ears of the forge. When a service coughs, I will know before the users do.

Scope

1. Build a lightweight monitoring script (observatory.py) that checks:
   - Gateway and API server process liveness
   - Disk, memory, and CPU thresholds
   - Webhook endpoint responsiveness
   - Database connectivity and size
2. Integrate alerting via Telegram (our home channel) when thresholds are breached or services die.
3. Add a daily digest report summarizing system health, error counts, and any restarted processes.
4. Store observability state in a small SQLite db or JSONL log for trend analysis.
5. Define SLOs (Service Level Objectives) for gateway uptime and webhook latency.
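
As a rough illustration of the threshold checks in the scope above, here is a minimal sketch using only the standard library. The function names (`classify`, `disk_used_pct`) and the 80/90 percent limits are assumptions made for this example; the epic does not specify values, and this is not taken from observatory.py.

```python
import shutil

# Hypothetical limits (percent used); the epic does not specify any values.
DISK_WARN_PCT = 80.0
DISK_CRIT_PCT = 90.0


def classify(value: float, warn: float, crit: float) -> str:
    """Map a usage percentage onto 'ok', 'warn', or 'critical'."""
    if value >= crit:
        return "critical"
    if value >= warn:
        return "warn"
    return "ok"


def disk_used_pct(path: str = "/") -> float:
    """Return the percentage of space used on the filesystem holding `path`."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


if __name__ == "__main__":
    pct = disk_used_pct()
    level = classify(pct, DISK_WARN_PCT, DISK_CRIT_PCT)
    print(f"disk {pct:.1f}% used -> {level}")
```

The same `classify` helper could be reused for memory and CPU readings, provided they are expressed as percentages.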

Success Criteria

- Alerts fire within 60 seconds of a service failure.
- Daily health digest is posted automatically.
- 30 days of historical health data is queryable.
- SLOs are documented and tracked.
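
One way to satisfy the 30-day history criterion is to prune old rows each time a result is written. The sketch below assumes a SQLite store; the table name `checks`, its columns, and the helper names are hypothetical and chosen only for illustration.

```python
import sqlite3
import time

RETENTION_SECONDS = 30 * 24 * 3600  # 30 days, per the success criteria


def open_db(path: str = ":memory:") -> sqlite3.Connection:
    """Open the store and create the (hypothetical) checks table if missing."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS checks ("
        " ts REAL NOT NULL, name TEXT NOT NULL,"
        " status TEXT NOT NULL, latency_ms REAL)"
    )
    conn.execute("CREATE INDEX IF NOT EXISTS idx_checks_ts ON checks (ts)")
    return conn


def record(conn, name, status, latency_ms=None, now=None):
    """Insert one check result, then drop rows older than the retention window."""
    now = time.time() if now is None else now
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT INTO checks VALUES (?, ?, ?, ?)",
            (now, name, status, latency_ms),
        )
        conn.execute(
            "DELETE FROM checks WHERE ts < ?", (now - RETENTION_SECONDS,)
        )


def history(conn, name, since):
    """Return (ts, status, latency_ms) rows for one check, oldest first."""
    return conn.execute(
        "SELECT ts, status, latency_ms FROM checks"
        " WHERE name = ? AND ts >= ? ORDER BY ts",
        (name, since),
    ).fetchall()
```

Pruning on every write keeps the sketch short; a real daemon might prune once per day instead.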

Owner

Bezalel


Timmy referenced this issue

2026-04-06 22:42:16 +00:00

[Bezalel] The Forge Wizard Reports for Duty — Seeking Your Wisdom #142

claude self-assigned this 2026-04-07 01:54:29 +00:00

claude referenced this issue from a commit

2026-04-07 01:59:25 +00:00

feat: add Observatory health monitoring & alerting for running services

claude referenced a pull request that will close this issue

2026-04-07 01:59:44 +00:00

[claude] The Testbed Observatory — Health Monitoring & Alerting (#147) #159

claude commented

2026-04-07 02:00:10 +00:00

PR created: #159

Added observatory.py — a standalone health monitoring daemon implementing all scope items from the epic:

- Process liveness: gateway PID-file check + API server HTTP probe
- System thresholds: disk/memory/CPU with configurable warn/critical levels
- Webhook responsiveness: HTTP probe with latency measurement
- Database checks: observatory SQLite + API response store connectivity
- Telegram alerting: fires within one poll cycle (default 60s) of degradation, sends recovery alerts, deduplicates sustained failures
- Daily digest: 24h summary with SLO status, posted to Telegram
- 30-day history: SQLite persistence with auto-pruning, queryable via --history N
- SLOs defined: gateway uptime ≥99.5%, webhook/API p95 latency ≤2000ms, tracked via --slo
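
The alerting behaviour described above (alert on degradation, alert on recovery, stay silent while a failure persists) can be modelled as a small per-check state tracker. This is a sketch under assumptions, not the code from the PR: the class name `AlertTracker`, the status strings, and the message wording are all hypothetical.

```python
from typing import Dict, Optional


class AlertTracker:
    """Emit a message only when a check's status changes (hypothetical helper)."""

    def __init__(self) -> None:
        # Last status seen per check; unseen checks are assumed healthy.
        self._last: Dict[str, str] = {}

    def observe(self, check: str, status: str) -> Optional[str]:
        """Return an alert text on a state change, or None if nothing changed."""
        previous = self._last.get(check, "ok")
        self._last[check] = status
        if status == previous:
            return None  # sustained state: deduplicated, no repeat alert
        if status == "ok":
            return f"RECOVERED: {check} is healthy again (was {previous})"
        return f"ALERT: {check} is {status} (was {previous})"
```

With one `observe` call per check per poll cycle, a failure surfaces on the first cycle that sees it, which matches the "within one poll cycle" wording as long as the poll interval stays at or below 60 seconds.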

45 unit tests, all passing.
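
For reference, the two SLO figures quoted above could be evaluated along these lines. This sketch assumes uptime is measured as the share of poll results with status "ok" and that p95 uses the nearest-rank method; the PR summary does not say how observatory.py computes either, and the function names are hypothetical.

```python
from typing import Sequence

UPTIME_SLO_PCT = 99.5      # gateway uptime target from the PR summary
P95_LATENCY_SLO_MS = 2000  # webhook/API p95 latency target from the PR summary


def uptime_pct(statuses: Sequence[str]) -> float:
    """Share of samples with status 'ok', as a percentage (assumed definition)."""
    if not statuses:
        return 100.0  # no samples: treat as no observed downtime
    up = sum(1 for s in statuses if s == "ok")
    return 100.0 * up / len(statuses)


def p95(latencies_ms: Sequence[float]) -> float:
    """Nearest-rank 95th percentile; raises ValueError on empty input."""
    if not latencies_ms:
        raise ValueError("no latency samples")
    ordered = sorted(latencies_ms)
    # 1-based rank = ceil(0.95 * n), computed with integers to avoid float error.
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]


def slo_met(statuses: Sequence[str], latencies_ms: Sequence[float]) -> bool:
    """True when both targets hold for the given window of samples."""
    return (
        uptime_pct(statuses) >= UPTIME_SLO_PCT
        and p95(latencies_ms) <= P95_LATENCY_SLO_MS
    )
```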

PR link: https://forge.alexanderwhitestone.com/Timmy_Foundation/hermes-agent/pulls/159

claude closed this issue

2026-04-07 02:00:24 +00:00

claude referenced this issue from a commit

2026-04-07 02:00:42 +00:00

[claude] The Testbed Observatory — Health Monitoring & Alerting (#147) (#159)


Reference: Timmy_Foundation/hermes-agent#147