# EPIC-001: Proper Metrics Visualization System

**Status:** PROPOSED
**Priority:** P1 - Infrastructure
**Owner:** TBD (Allegro proposing, needs assignment)
**Created:** 2026-04-01

---

## Current State (REAL METRICS - NO FAKES)

### Test Suite Metrics (Hermes Agent)

```
Total Tests:     4,884
Passed:          3,930 (80.5%)
Failed:          680 (13.9%)
Errors:          274 (5.6%)
Skipped:         166 (3.4%)
Execution Time:  139.79s (2:19)
```

**Breakdown by Category:**

- ACP Tests: 29 passed, 2 failed, 5 skipped (after fixes)
- Core Agent: ~85% pass rate
- Tool Tests: voice mode and web tools showing errors (config-related)
- E2E Tests: excluded (require external services)

### Provider Trait Spike Performance

```
Mock Provider:    <1ms response
Kimi Provider:    Ready (needs aiohttp)
Ollama Provider:  Ready (needs local server)
Factory Pattern:  Working
```

### System Resources (Allegro VM - 167.99.126.228)

```
Memory:  7.8 GB total
  Used:  980 MB (12.5%)
  Free:  5.7 GB (73%)
  Cache: 1.4 GB (18%)
Disk:    154 GB total
  Used:  12 GB (8%)
  Free:  143 GB (92%)
Load:    0.97 (1-min avg)
Uptime:  1:56
```

### Wizard Health (From Monitoring Script)

```
Allegro-Primus Gateway: DOWN (port 8644)
Ezra Gateway:           DOWN (port 8643)
Ollama:                 DOWN (port 11434)
Gitea:                  DOWN (port 3000)
SearXNG:                DOWN (port 8080)

Note: All services are operational on Ezra (143.198.27.163).
Allegro cannot reach them due to Tailscale isolation.
```

### Claw Code Identity Activation

```
User Created:    claw-code (ID: 24)
Token Generated: 5466b7a69b62677bc869778a3d9d12e8f6ee40b4
Token Deployed:  Allegro .env
Test Issue:      #799 (Timmy_Foundation/the-nexus)
Status:          OPERATIONAL
```

---

## Problem Statement

**No centralized metrics infrastructure exists.** Current limitations:

- Manual test runs are required to get coverage numbers
- No historical trend tracking
- No alerting on failure thresholds
- No performance benchmarking over time
- Service health is checked via bash scripts, not a time-series database
- Metrics are scattered across multiple VMs and files

---

## Proposed Solution

### Phase 1: Metrics Collection Layer (Week 1)

**Components:**

1. **Prometheus** - time-series metrics database
   - Port: 9090
   - Retention: 30 days local, archive to S3
2. **Node Exporter** - system metrics
   - CPU, memory, disk, network
   - Per-process metrics for wizards
3. **Custom exporters:**
   - `hermes_metrics_exporter` - test pass/fail rates, execution times
   - `gitea_metrics_exporter` - issue counts, PR merge rates, CI times
   - `wizard_health_exporter` - gateway status, response times

### Phase 2: Visualization Layer (Week 2)

**Grafana** - dashboards

- Port: 3001 (Grafana's default 3000 is taken by Gitea)
- Dashboards:
  - "Wizard Fleet Health" - all wizards, all services
  - "Test Suite Quality" - pass rates, coverage trends, flaky tests
  - "Performance Metrics" - response times, token usage, costs
  - "Infrastructure" - CPU, memory, disk across all VMs

### Phase 3: Alerting Layer (Week 3)

**AlertManager** - notifications

- P0: service down > 5 min
- P1: test pass rate < 70%
- P2: disk usage > 85%
- P3: response time > 10s

Delivery: Telegram, email, and an auto-created Gitea issue

---

## Technical Architecture

```
┌─────────────────────────────────────────────────────────┐
│                    METRICS PIPELINE                     │
├─────────────────────────────────────────────────────────┤
│  Sources         Collectors        Storage              │
│  ───────         ──────────        ───────              │
│  Hermes Agent →  Prometheus    →   Prometheus TSDB      │
│  Gitea        →  Exporters         (local + S3)         │
│  System       →  Node Exporter                          │
│  Wizards      →  Health Checks                          │
│                                                         │
│  Visualization: Grafana (port 3001)                     │
│  Alerting:      AlertManager → Telegram/Gitea           │
└─────────────────────────────────────────────────────────┘
```

---

## Metrics to Track

### Code Quality Metrics

| Metric | Target | Alert If |
|--------|--------|----------|
| Test Pass Rate | > 85% | < 70% |
| Code Coverage | > 80% | < 60% |
| Flaky Tests | 0 | > 5 |
| ACP Fix Rate | 100% | < 95% |

### Performance Metrics

| Metric | Target | Alert If |
|--------|--------|----------|
| Provider Response | < 5s | > 10s |
| Gateway Uptime | > 99% | < 95% |
| Memory Usage | < 80% | > 90% |
| Disk Usage | < 80% | > 90% |

### Business Metrics

| Metric | Target | Alert If |
|--------|--------|----------|
| Issues Resolved/Day | > 5 | < 2 |
| PR Merge Time | < 24h | > 72h |
| Auto-Dispatch Success | > 95% | < 80% |

---

## Implementation Plan

### Week 1: Foundation

- [ ] Deploy Prometheus (Docker)
- [ ] Deploy Node Exporter on all VMs
- [ ] Create `hermes_metrics_exporter`
- [ ] Configure scraping

### Week 2: Visualization

- [ ] Deploy Grafana
- [ ] Create "Wizard Fleet Health" dashboard
- [ ] Create "Test Suite Quality" dashboard
- [ ] Create "Performance Metrics" dashboard

### Week 3: Alerting

- [ ] Deploy AlertManager
- [ ] Configure Telegram notifications
- [ ] Configure Gitea issue creation
- [ ] Test the alert pipeline

### Week 4: Polish

- [ ] Document dashboards
- [ ] Train team on usage
- [ ] Archive old metrics
- [ ] Hand off to operations

---

## Acceptance Criteria

- [ ] Prometheus collecting metrics from all sources
- [ ] Grafana accessible at http://metrics.timmy.local:3001
- [ ] 3+ dashboards operational
- [ ] Alerts firing and delivering to Telegram
- [ ] Historical data retained for 30 days
- [ ] Runbook documented

---

## Resources Required

| Resource | Spec | Cost |
|----------|------|------|
| Prometheus | 2 CPU, 4 GB RAM, 100 GB disk | $20/mo |
| Grafana | 1 CPU, 2 GB RAM | $10/mo |
| Storage | 500 GB for 90-day retention | $15/mo |
| **Total** | | **$45/mo** |

Alternatively, deploy on the existing Ezra VPS, which has spare capacity.

---

## Current Workaround

Until this EPIC is complete, use:

- `wizard-health.sh` - basic HTML dashboard
- Manual pytest runs - test metrics
- `claw-code-dispatcher.py` logs - dispatch metrics
- System `top`/`free`/`df` - resource metrics

---

## Related

- Issue #799 - Claw Code identity (metrics will track its dispatch success)
- ADR-001 - Harness engineering (metrics prove value)
- Provider spike - performance baselines established

---

**Truth in metrics. No fakes.**
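The Week 1 "Deploy Prometheus (Docker)" task could start from a compose file along these lines. This is a sketch: image tags and volume names are assumptions; only the 9090/3001 port choices and the 30-day local retention come from this EPIC:

```yaml
# docker-compose.yml -- monitoring stack sketch (image tags are assumptions)
services:
  prometheus:
    image: prom/prometheus:v2.53.0
    ports: ["9090:9090"]
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml:ro
      - prom-data:/prometheus
    command:
      - --config.file=/etc/prometheus/prometheus.yml
      - --storage.tsdb.retention.time=30d   # 30-day local retention per Phase 1
  node-exporter:
    image: prom/node-exporter:v1.8.1
    ports: ["9100:9100"]
  grafana:
    image: grafana/grafana:11.1.0
    ports: ["3001:3000"]   # host 3001; Grafana's default 3000 is taken by Gitea
  alertmanager:
    image: prom/alertmanager:v0.27.0
    ports: ["9093:9093"]
volumes:
  prom-data:
```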
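Prometheus's scrape configuration for the Phase 1 collectors might look like the following sketch. The two node targets use the Allegro and Ezra IPs from this document; the custom exporter ports (9101-9103) and job names are assumptions, while 9100 is node_exporter's conventional default:

```yaml
# prometheus.yml -- scrape targets (custom exporter ports are assumptions)
global:
  scrape_interval: 30s
scrape_configs:
  - job_name: node
    static_configs:
      - targets: ["167.99.126.228:9100", "143.198.27.163:9100"]  # Allegro, Ezra
  - job_name: hermes_metrics
    static_configs:
      - targets: ["localhost:9101"]
  - job_name: gitea_metrics
    static_configs:
      - targets: ["localhost:9102"]
  - job_name: wizard_health
    static_configs:
      - targets: ["localhost:9103"]
```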
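The Phase 3 thresholds translate almost directly into Prometheus alerting rules. In this sketch, `up` and the `node_filesystem_*` series are standard Prometheus/node_exporter metrics; `hermes_test_pass_rate` and `provider_response_seconds` are hypothetical names that the custom exporters would have to publish:

```yaml
# alert-rules.yml -- P0-P3 thresholds from Phase 3
groups:
  - name: wizard-fleet
    rules:
      - alert: ServiceDown              # P0: service down > 5 min
        expr: up == 0
        for: 5m
        labels: {severity: P0}
      - alert: TestPassRateLow          # P1: test pass rate < 70%
        expr: hermes_test_pass_rate < 0.70
        labels: {severity: P1}
      - alert: DiskUsageHigh            # P2: disk usage > 85%
        expr: >
          1 - node_filesystem_avail_bytes{mountpoint="/"}
            / node_filesystem_size_bytes{mountpoint="/"} > 0.85
        labels: {severity: P2}
      - alert: SlowProviderResponse     # P3: response time > 10s
        expr: provider_response_seconds > 10
        labels: {severity: P3}
```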
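Telegram delivery can use Alertmanager's built-in `telegram_configs` receiver; the bot token and chat ID below are placeholders. Gitea issue auto-creation has no built-in receiver, so this sketch assumes a small bridge service reachable via `webhook_configs` (the URL is hypothetical):

```yaml
# alertmanager.yml -- routing sketch (credentials and webhook URL are placeholders)
route:
  receiver: telegram
  group_by: [alertname]
  routes:
    - receiver: gitea-webhook
      matchers: ['severity = "P0"']
      continue: true     # P0 alerts also fall through to Telegram
receivers:
  - name: telegram
    telegram_configs:
      - bot_token: "<BOT_TOKEN>"    # placeholder
        chat_id: -1001234567890     # placeholder
  - name: gitea-webhook
    webhook_configs:
      - url: "http://localhost:9200/gitea-issue"  # hypothetical bridge service
```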
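The Wizard Health report and the planned `wizard_health_exporter` both reduce to a TCP port probe. A minimal sketch in Python follows; the service names and ports come from the health report above, while everything else (function names, timeout, localhost target) is assumed, not taken from the actual `wizard-health.sh`:

```python
import socket

SERVICES = {  # name -> port, from the wizard health report
    "Allegro-Primus Gateway": 8644,
    "Ezra Gateway": 8643,
    "Ollama": 11434,
    "Gitea": 3000,
    "SearXNG": 8080,
}

def port_status(host: str, port: int, timeout: float = 2.0) -> str:
    """Return 'UP' if a TCP connection succeeds, else 'DOWN'."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "UP"
    except OSError:  # refused, timed out, or unresolvable
        return "DOWN"

if __name__ == "__main__":
    for name, port in SERVICES.items():
        print(f"{name}: {port_status('127.0.0.1', port)} (port {port})")
```

The same `port_status` helper can later feed a Prometheus gauge instead of printing, which is the Phase 1 exporter's job.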
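The `hermes_metrics_exporter` from Phase 1 could be sketched as below. It assumes the official `prometheus_client` library and a pytest-style summary line ("3930 passed, 680 failed, ..."); the metric names, port 9101, and the 30-minute re-run interval are all assumptions:

```python
import re

def parse_pytest_summary(line: str) -> dict:
    """Extract outcome counts from a summary like
    '3930 passed, 680 failed, 274 errors, 166 skipped'."""
    counts = {}
    for num, label in re.findall(r"(\d+)\s+(passed|failed|errors?|skipped)", line):
        key = "error" if label.startswith("error") else label
        counts[key] = int(num)
    return counts

def pass_rate(counts: dict) -> float:
    """Passed tests as a fraction of all reported outcomes."""
    total = sum(counts.values())
    return counts.get("passed", 0) / total if total else 0.0

if __name__ == "__main__":
    # Requires the prometheus_client package; 9101 is an assumed exporter port.
    from prometheus_client import Gauge, start_http_server
    import subprocess, time

    rate = Gauge("hermes_test_pass_rate", "Fraction of tests passing (0-1)")
    duration = Gauge("hermes_test_duration_seconds", "Wall time of last pytest run")

    start_http_server(9101)
    while True:
        started = time.time()
        out = subprocess.run(["pytest", "-q"], capture_output=True, text=True).stdout
        duration.set(time.time() - started)
        last = out.splitlines()[-1] if out else ""
        rate.set(pass_rate(parse_pytest_summary(last)))
        time.sleep(1800)  # re-run every 30 minutes
```

Running the suite inside the exporter is the simplest possible design; a CI hook pushing results to a Pushgateway would avoid double test runs, at the cost of more moving parts.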