EPIC-001: Proper Metrics Visualization System
Status: PROPOSED
Priority: P1 - Infrastructure
Owner: TBD (Allegro proposing, needs assignment)
Created: 2026-04-01
Current State (REAL METRICS - NO FAKES)
Test Suite Metrics (Hermes Agent)
Total Tests: 4,884 (passed + failed + errors; skipped tracked separately)
Passed: 3,930 (80.5%)
Failed: 680 (13.9%)
Errors: 274 (5.6%)
Skipped: 166 (3.4%)
Execution Time: 139.79s (2:19)
Breakdown by Category:
- ACP Tests: 29 passed, 2 failed, 5 skipped (after fixes)
- Core Agent: ~85% pass rate
- Tool Tests: Voice mode and web tools showing errors (config-related)
- E2E Tests: Excluded (require external services)
Provider Trait Spike Performance
Mock Provider: <1ms response
Kimi Provider: Ready (needs aiohttp)
Ollama Provider: Ready (needs local server)
Factory Pattern: Working
System Resources (Allegro VM - 167.99.126.228)
Memory: 7.8 GB total
Used: 980 MB (12.5%)
Free: 5.7 GB (73%)
Cache: 1.4 GB (18%)
Disk: 154 GB total
Used: 12 GB (8%)
Free: 143 GB (92%)
Load: 0.97 (1-min avg)
Uptime: 1:56
Wizard Health (From Monitoring Script)
Allegro-Primus Gateway: DOWN (port 8644)
Ezra Gateway: DOWN (port 8643)
Ollama: DOWN (port 11434)
Gitea: DOWN (port 3000)
SearXNG: DOWN (port 8080)
Note: All services operational on Ezra (143.198.27.163).
Allegro cannot reach them due to Tailscale isolation.
Claw Code Identity Activation
User Created: claw-code (ID: 24)
Token Generated: 5466b7a69b62677bc869778a3d9d12e8f6ee40b4
Token Deployed: Allegro .env
Test Issue: #799 (Timmy_Foundation/the-nexus)
Status: OPERATIONAL
Problem Statement
No centralized metrics infrastructure exists.
Current limitations:
- Manual test runs required to get coverage
- No historical trend tracking
- No alerting on failure thresholds
- No performance benchmarking over time
- Service health checked via bash scripts, not time-series DB
- Metrics scattered across multiple VMs and files
Proposed Solution
Phase 1: Metrics Collection Layer (Week 1)
Components:
- Prometheus - Time-series metrics database
  - Port: 9090
  - Retention: 30 days local, archive to S3
- Node Exporter - System metrics
  - CPU, memory, disk, network
  - Per-process metrics for wizards
- Custom Exporters (sketch of the first one after this list):
  - hermes_metrics_exporter - Test pass/fail rates, execution times
  - gitea_metrics_exporter - Issue counts, PR merge rates, CI times
  - wizard_health_exporter - Gateway status, response times
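As a sketch, hermes_metrics_exporter could parse the pytest JUnit XML report and serve the counts via prometheus_client; the report path, metric names, and exporter port (9101) are assumptions to be settled at implementation time:

```python
# hermes_metrics_exporter (sketch): parse a pytest JUnit XML report and
# expose pass/fail/error counts plus execution time on /metrics.
# Assumes: prometheus_client installed, pytest run with --junitxml=report.xml.
import time
import xml.etree.ElementTree as ET

from prometheus_client import Gauge, start_http_server

TESTS_TOTAL = Gauge("hermes_tests_total", "Total tests in last run")
TESTS_PASSED = Gauge("hermes_tests_passed", "Tests passed in last run")
TESTS_FAILED = Gauge("hermes_tests_failed", "Tests failed in last run")
TESTS_ERRORS = Gauge("hermes_tests_errors", "Tests errored in last run")
RUN_SECONDS = Gauge("hermes_test_run_seconds", "Execution time of last run")

REPORT_PATH = "report.xml"  # assumed location of the pytest JUnit report


def update_from_report(path: str) -> None:
    suite = ET.parse(path).getroot()
    if suite.tag == "testsuites":  # pytest wraps a single <testsuite>
        suite = suite.find("testsuite")
    total = int(suite.get("tests", 0))
    failed = int(suite.get("failures", 0))
    errors = int(suite.get("errors", 0))
    skipped = int(suite.get("skipped", 0))
    TESTS_TOTAL.set(total)
    TESTS_FAILED.set(failed)
    TESTS_ERRORS.set(errors)
    TESTS_PASSED.set(total - failed - errors - skipped)
    RUN_SECONDS.set(float(suite.get("time", 0.0)))


if __name__ == "__main__":
    start_http_server(9101)  # assumed exporter port, scraped by Prometheus
    while True:
        update_from_report(REPORT_PATH)
        time.sleep(60)  # re-read the report once a minute
```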
Phase 2: Visualization Layer (Week 2)
Grafana - Dashboards
- Port: 3001 (Grafana's default 3000 would conflict with Gitea)
- Dashboards:
- "Wizard Fleet Health" - All wizards, all services
- "Test Suite Quality" - Pass rates, coverage trends, flaky tests
- "Performance Metrics" - Response times, token usage, costs
- "Infrastructure" - CPU, memory, disk across all VMs
Phase 3: Alerting Layer (Week 3)
AlertManager - Notifications
- P0: Service down > 5 min
- P1: Test pass rate < 70%
- P2: Disk usage > 85%
- P3: Response time > 10s
Delivery: Telegram, email, Gitea issue auto-created
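AlertManager handles grouping and routing; Telegram and Gitea delivery would go through a small webhook receiver it posts to. A minimal sketch (bot token, chat ID, receiver port, and the severity label name are assumptions):

```python
# Webhook receiver (sketch): AlertManager POSTs grouped alerts here; we fan
# them out to Telegram and, for P0s, open a Gitea issue automatically.
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

import requests

TELEGRAM_TOKEN = "..."   # hypothetical bot token
TELEGRAM_CHAT = "..."    # hypothetical chat id
GITEA_URL = "http://143.198.27.163:3000"  # Gitea on Ezra, per above
GITEA_TOKEN = "..."      # hypothetical API token
GITEA_REPO = "Timmy_Foundation/the-nexus"


class AlertHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        body = self.rfile.read(int(self.headers.get("Content-Length", 0)))
        payload = json.loads(body)
        # AlertManager webhook payload: {"status": ..., "alerts": [...]}
        for alert in payload.get("alerts", []):
            name = alert["labels"].get("alertname", "unknown")
            summary = alert.get("annotations", {}).get("summary", "")
            text = f"[{payload.get('status', 'firing').upper()}] {name}: {summary}"
            requests.post(
                f"https://api.telegram.org/bot{TELEGRAM_TOKEN}/sendMessage",
                json={"chat_id": TELEGRAM_CHAT, "text": text},
                timeout=10,
            )
            # Assumes rules carry the P0-P3 priority in a "severity" label.
            if alert["labels"].get("severity") == "P0":
                requests.post(
                    f"{GITEA_URL}/api/v1/repos/{GITEA_REPO}/issues",
                    headers={"Authorization": f"token {GITEA_TOKEN}"},
                    json={"title": text, "body": json.dumps(alert, indent=2)},
                    timeout=10,
                )
        self.send_response(200)
        self.end_headers()


if __name__ == "__main__":
    HTTPServer(("0.0.0.0", 9095), AlertHandler).serve_forever()
```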
Technical Architecture
┌─────────────────────────────────────────────────────────┐
│ METRICS PIPELINE │
├─────────────────────────────────────────────────────────┤
│ Sources Collectors Storage │
│ ─────── ───────── ──────── │
│ Hermes Agent → Prometheus → Prometheus TSDB │
│ Gitea → Exporters (local + S3) │
│ System → Node Exporter │
│ Wizards → Health Checks │
│ │
│ Visualization: Grafana (port 3001) │
│ Alerting: AlertManager → Telegram/Gitea │
└─────────────────────────────────────────────────────────┘
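The "Health Checks" collector in the diagram is what replaces the current bash script: an exporter that TCP-probes each gateway port and publishes up/down plus connect time. A minimal sketch of wizard_health_exporter, with the service list taken from the Wizard Health section above (hosts and exporter port are placeholders):

```python
# wizard_health_exporter (sketch): TCP-probe each service port and expose
# wizard_up{service=...} and wizard_probe_seconds{service=...} gauges.
import socket
import time

from prometheus_client import Gauge, start_http_server

SERVICES = {  # ports mirror the Wizard Health section; hosts depend on
    # where the exporter runs (Ezra vs Allegro, given Tailscale isolation)
    "allegro_primus_gateway": ("127.0.0.1", 8644),
    "ezra_gateway": ("127.0.0.1", 8643),
    "ollama": ("127.0.0.1", 11434),
    "gitea": ("127.0.0.1", 3000),
    "searxng": ("127.0.0.1", 8080),
}

UP = Gauge("wizard_up", "1 if the service port accepts connections", ["service"])
PROBE = Gauge("wizard_probe_seconds", "TCP connect time", ["service"])


def probe(host: str, port: int) -> float | None:
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=2):
            return time.monotonic() - start
    except OSError:
        return None  # connection refused or timed out


if __name__ == "__main__":
    start_http_server(9102)  # assumed exporter port
    while True:
        for name, (host, port) in SERVICES.items():
            elapsed = probe(host, port)
            UP.labels(service=name).set(0 if elapsed is None else 1)
            PROBE.labels(service=name).set(elapsed or 0)
        time.sleep(15)  # probe interval; Prometheus scrapes latest values
```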
Metrics to Track
Code Quality Metrics
| Metric | Target | Alert If |
|---|---|---|
| Test Pass Rate | > 85% | < 70% |
| Code Coverage | > 80% | < 60% |
| Flaky Tests | 0 | > 5 |
| ACP Fix Rate | 100% | < 95% |
Performance Metrics
| Metric | Target | Alert If |
|---|---|---|
| Provider Response | < 5s | > 10s |
| Gateway Uptime | > 99% | < 95% |
| Memory Usage | < 80% | > 90% |
| Disk Usage | < 80% | > 90% |
Business Metrics
| Metric | Target | Alert If |
|---|---|---|
| Issues Resolved/Day | > 5 | < 2 |
| PR Merge Time | < 24h | > 72h |
| Auto-Dispatch Success | > 95% | < 80% |
Implementation Plan
Week 1: Foundation
- Deploy Prometheus (Docker)
- Deploy Node Exporter on all VMs
- Create hermes_metrics_exporter
- Configure scraping
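"Configure scraping" can be smoke-tested against Prometheus's HTTP API instead of eyeballing the UI; a minimal check, assuming Prometheus on the port above:

```python
# Verify every scrape target is healthy via Prometheus's /api/v1/targets.
import requests

PROM_URL = "http://localhost:9090"

resp = requests.get(f"{PROM_URL}/api/v1/targets", timeout=10)
resp.raise_for_status()
targets = resp.json()["data"]["activeTargets"]

down = [t for t in targets if t["health"] != "up"]
for t in down:
    print(f"DOWN: {t['scrapeUrl']} - {t.get('lastError', '')}")
print(f"{len(targets) - len(down)}/{len(targets)} targets up")
```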
Week 2: Visualization
- Deploy Grafana
- Create "Wizard Fleet Health" dashboard
- Create "Test Suite Quality" dashboard
- Create "Performance Metrics" dashboard
Week 3: Alerting
- Deploy AlertManager
- Configure Telegram notifications
- Configure Gitea issue creation
- Test alert pipeline
Week 4: Polish
- Document dashboards
- Train team on usage
- Archive old metrics
- Handoff to operations
Acceptance Criteria
- Prometheus collecting metrics from all sources
- Grafana accessible at http://metrics.timmy.local:3001
- 3+ dashboards operational
- Alerts firing and delivering to Telegram
- Historical data retained for 30 days
- Runbook documented
Resources Required
| Resource | Spec | Cost |
|---|---|---|
| Prometheus | 2 CPU, 4GB RAM, 100GB disk | $20/mo |
| Grafana | 1 CPU, 2GB RAM | $10/mo |
| Storage | 500GB for 90-day retention | $15/mo |
| Total | | $45/mo |
Or deploy on existing Ezra VPS (has capacity).
Current Workaround
Until this EPIC is complete, use:
- wizard-health.sh - Basic HTML dashboard
- Manual pytest runs - Test metrics
- claw-code-dispatcher.py logs - Dispatch metrics
- System top/free/df - Resource metrics
Related
- Issue #799 - Claw Code identity (metrics will track its dispatch success)
- ADR-001 - Harness engineering (metrics prove value)
- Provider spike - Performance baselines established
Truth in metrics. No fakes.