# EPIC-001: Proper Metrics Visualization System
**Status:** PROPOSED
**Priority:** P1 - Infrastructure
**Owner:** TBD (proposed by Allegro; needs assignment)
**Created:** 2026-04-01
---
## Current State (REAL METRICS - NO FAKES)
### Test Suite Metrics (Hermes Agent)
```
Total Tests: 4,884
Passed: 3,930 (80.5%)
Failed: 680 (13.9%)
Errors: 274 (5.6%)
Skipped: 166 (3.4%)
Execution Time: 139.79s (2:19)
```
**Breakdown by Category:**
- ACP Tests: 29 passed, 2 failed, 5 skipped (after fixes)
- Core Agent: ~85% pass rate
- Tool Tests: Voice mode and web tools showing errors (config-related)
- E2E Tests: Excluded (require external services)
### Provider Trait Spike Performance
```
Mock Provider: <1ms response
Kimi Provider: Ready (needs aiohttp)
Ollama Provider: Ready (needs local server)
Factory Pattern: Working
```
### System Resources (Allegro VM - 167.99.126.228)
```
Memory: 7.8 GB total
Used: 980 MB (12.5%)
Free: 5.7 GB (73%)
Cache: 1.4 GB (18%)
Disk: 154 GB total
Used: 12 GB (8%)
Free: 143 GB (92%)
Load: 0.97 (1-min avg)
Uptime: 1:56
```
### Wizard Health (From Monitoring Script)
```
Allegro-Primus Gateway: DOWN (port 8644)
Ezra Gateway: DOWN (port 8643)
Ollama: DOWN (port 11434)
Gitea: DOWN (port 3000)
SearXNG: DOWN (port 8080)
Note: All services operational on Ezra (143.198.27.163).
Allegro cannot reach them due to Tailscale isolation.
```
### Claw Code Identity Activation
```
User Created: claw-code (ID: 24)
Token Generated: 5466b7a69b62677bc869778a3d9d12e8f6ee40b4
Token Deployed: Allegro .env
Test Issue: #799 (Timmy_Foundation/the-nexus)
Status: OPERATIONAL
```
---
## Problem Statement
**No centralized metrics infrastructure exists.**
Current limitations:
- Manual test runs required to get coverage
- No historical trend tracking
- No alerting on failure thresholds
- No performance benchmarking over time
- Service health checked via bash scripts, not time-series DB
- Metrics scattered across multiple VMs and files
---
## Proposed Solution
### Phase 1: Metrics Collection Layer (Week 1)
**Components:**
1. **Prometheus** - Time-series metrics database
- Port: 9090
- Retention: 30 days local, archive to S3
2. **Node Exporter** - System metrics
- CPU, memory, disk, network
- Per-process metrics for wizards
3. **Custom Exporters:**
- `hermes_metrics_exporter` - Test pass/fail rates, execution times
- `gitea_metrics_exporter` - Issue counts, PR merge rates, CI times
- `wizard_health_exporter` - Gateway status, response times
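The custom exporters in item 3 do not exist yet. A minimal, stdlib-only sketch of what `hermes_metrics_exporter` could look like, rendering the test-suite numbers above in the Prometheus text exposition format and serving them on `/metrics`; the metric names, the port (9101), and the summary-dict shape are all assumptions, not a final design:

```python
# Minimal sketch of hermes_metrics_exporter using only the stdlib:
# renders pytest summary numbers in the Prometheus text exposition
# format and serves them on /metrics. Names and port are illustrative.
from http.server import BaseHTTPRequestHandler, HTTPServer


def render_metrics(summary: dict) -> str:
    """Render one pytest run summary as Prometheus exposition text."""
    total = summary["total"]
    passed = summary["passed"]
    rate = passed / total if total else 0.0
    lines = [
        "# TYPE hermes_tests_total gauge",
        f"hermes_tests_total {total}",
        "# TYPE hermes_tests_passed gauge",
        f"hermes_tests_passed {passed}",
        "# TYPE hermes_test_pass_rate gauge",
        f"hermes_test_pass_rate {rate:.4f}",
        "# TYPE hermes_test_duration_seconds gauge",
        f"hermes_test_duration_seconds {summary['duration_seconds']}",
    ]
    return "\n".join(lines) + "\n"


class MetricsHandler(BaseHTTPRequestHandler):
    # Placeholder data: in practice this would come from the latest
    # pytest run (the numbers here are the current Hermes figures).
    summary = {"total": 4884, "passed": 3930, "duration_seconds": 139.79}

    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics(self.summary).encode()
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


if __name__ == "__main__":
    HTTPServer(("", 9101), MetricsHandler).serve_forever()  # assumed port
```

In production the `prometheus_client` library would likely replace the hand-rolled handler, but the exposition format is simple enough that the sketch shows the whole contract Prometheus scrapes against.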
### Phase 2: Visualization Layer (Week 2)
**Grafana** - Dashboards
- Port: 3001 (Grafana's default 3000 conflicts with Gitea)
- Dashboards:
- "Wizard Fleet Health" - All wizards, all services
- "Test Suite Quality" - Pass rates, coverage trends, flaky tests
- "Performance Metrics" - Response times, token usage, costs
- "Infrastructure" - CPU, memory, disk across all VMs
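So the dashboards survive container rebuilds, the Prometheus datasource can ship with Grafana via file provisioning rather than manual clicks. A hypothetical provisioning file (path and URL are assumptions based on the ports proposed above):

```yaml
# provisioning/datasources/prometheus.yml (hypothetical path)
# Pre-wires the Prometheus datasource into Grafana on startup.
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://localhost:9090   # Prometheus from Phase 1
    isDefault: true
```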
### Phase 3: Alerting Layer (Week 3)
**AlertManager** - Notifications
- P0: Service down > 5 min
- P1: Test pass rate < 70%
- P2: Disk usage > 85%
- P3: Response time > 10s
Delivery channels: Telegram, email, and an auto-created Gitea issue
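The four thresholds above translate directly into Prometheus alerting rules. A sketch, with the caveat that the metric names (`hermes_test_pass_rate`, `wizard_gateway_response_seconds`) are assumptions until the Phase 1 exporters exist; `up` and the `node_filesystem_*` metrics come from Prometheus and node_exporter respectively:

```yaml
# Sketch of the P0-P3 thresholds as Prometheus alerting rules.
groups:
  - name: wizard-fleet
    rules:
      - alert: ServiceDown
        expr: up == 0
        for: 5m                      # P0: down > 5 min
        labels: {severity: P0}
      - alert: TestPassRateLow
        expr: hermes_test_pass_rate < 0.70   # assumed metric name
        labels: {severity: P1}
      - alert: DiskUsageHigh
        expr: (1 - node_filesystem_avail_bytes / node_filesystem_size_bytes) > 0.85
        labels: {severity: P2}
      - alert: SlowResponse
        expr: wizard_gateway_response_seconds > 10   # assumed metric name
        labels: {severity: P3}
```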
---
## Technical Architecture
```
┌────────────────────────────────────────────────────────┐
│                    METRICS PIPELINE                    │
├────────────────────────────────────────────────────────┤
│  Sources           Collectors         Storage          │
│  ───────           ──────────         ───────          │
│  Hermes Agent   →  Prometheus      →  Prometheus TSDB  │
│  Gitea          →  Exporters          (local + S3)     │
│  System         →  Node Exporter                       │
│  Wizards        →  Health Checks                       │
│                                                        │
│  Visualization:  Grafana (port 3001)                   │
│  Alerting:       AlertManager → Telegram/Gitea         │
└────────────────────────────────────────────────────────┘
```
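On the collection side, the pipeline reduces to a `prometheus.yml` scrape config. A sketch, assuming each source runs an exporter on its own port; node_exporter's 9100 is its default, while the custom exporter ports (9101, 9102) are placeholders:

```yaml
# Sketch of prometheus.yml for the pipeline above.
global:
  scrape_interval: 15s
scrape_configs:
  - job_name: node
    static_configs:
      - targets: ["localhost:9100"]   # node_exporter default port
  - job_name: hermes
    static_configs:
      - targets: ["localhost:9101"]   # hermes_metrics_exporter (assumed port)
  - job_name: wizard-health
    static_configs:
      - targets: ["localhost:9102"]   # wizard_health_exporter (assumed port)
```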
---
## Metrics to Track
### Code Quality Metrics
| Metric | Target | Alert If |
|--------|--------|----------|
| Test Pass Rate | > 85% | < 70% |
| Code Coverage | > 80% | < 60% |
| Flaky Tests | 0 | > 5 |
| ACP Fix Rate | 100% | < 95% |
### Performance Metrics
| Metric | Target | Alert If |
|--------|--------|----------|
| Provider Response | < 5s | > 10s |
| Gateway Uptime | > 99% | < 95% |
| Memory Usage | < 80% | > 90% |
| Disk Usage | < 80% | > 90% |
### Business Metrics
| Metric | Target | Alert If |
|--------|--------|----------|
| Issues Resolved/Day | > 5 | < 2 |
| PR Merge Time | < 24h | > 72h |
| Auto-Dispatch Success | > 95% | < 80% |
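All three tables share one shape: a metric, a target, and an alert threshold that breaches in one of two directions (too low for pass rate or uptime, too high for memory or disk). A sketch of evaluating them generically, using the Performance Metrics thresholds; rule names are illustrative:

```python
# Evaluate the target/alert thresholds from the tables above.
# Each rule records which direction is "bad": some metrics alert when
# they fall below the threshold (pass rate, uptime), others when they
# rise above it (memory, disk). Values are fractions in [0, 1].
from dataclasses import dataclass


@dataclass
class Rule:
    name: str
    alert_threshold: float
    alert_when_below: bool  # True: alert if value < threshold

    def check(self, value: float) -> bool:
        """Return True if this value should fire an alert."""
        if self.alert_when_below:
            return value < self.alert_threshold
        return value > self.alert_threshold


# Thresholds mirror the Performance Metrics table.
RULES = [
    Rule("test_pass_rate", 0.70, alert_when_below=True),
    Rule("gateway_uptime", 0.95, alert_when_below=True),
    Rule("memory_usage", 0.90, alert_when_below=False),
    Rule("disk_usage", 0.90, alert_when_below=False),
]


def firing(values: dict) -> list:
    """Names of rules whose current value breaches its alert threshold."""
    return [r.name for r in RULES if r.name in values and r.check(values[r.name])]
```

With today's real numbers (80.5% pass rate, 8% disk), nothing fires; the same structure extends to the Code Quality and Business tables by adding rules.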
---
## Implementation Plan
### Week 1: Foundation
- [ ] Deploy Prometheus (Docker)
- [ ] Deploy Node Exporter on all VMs
- [ ] Create hermes_metrics_exporter
- [ ] Configure scraping
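The Week 1 items above can be expressed as one compose file. A sketch only: image tags, volume paths, and the bind-mounted `prometheus.yml` location are placeholders, not the final deployment config, and the 30-day retention flag matches the Phase 1 retention target:

```yaml
# Sketch of the Week 1 deployment (docker-compose).
services:
  prometheus:
    image: prom/prometheus:latest
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml   # placeholder path
      - prom-data:/prometheus
    command:
      - --config.file=/etc/prometheus/prometheus.yml
      - --storage.tsdb.retention.time=30d    # Phase 1 retention target
  node-exporter:
    image: prom/node-exporter:latest
    ports:
      - "9100:9100"
volumes:
  prom-data:
```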
### Week 2: Visualization
- [ ] Deploy Grafana
- [ ] Create "Wizard Fleet Health" dashboard
- [ ] Create "Test Suite Quality" dashboard
- [ ] Create "Performance Metrics" dashboard
### Week 3: Alerting
- [ ] Deploy AlertManager
- [ ] Configure Telegram notifications
- [ ] Configure Gitea issue creation
- [ ] Test alert pipeline
### Week 4: Polish
- [ ] Document dashboards
- [ ] Train team on usage
- [ ] Archive old metrics
- [ ] Handoff to operations
---
## Acceptance Criteria
- [ ] Prometheus collecting metrics from all sources
- [ ] Grafana accessible at http://metrics.timmy.local:3001
- [ ] 3+ dashboards operational
- [ ] Alerts firing and delivering to Telegram
- [ ] Historical data retained for 30 days
- [ ] Runbook documented
---
## Resources Required
| Resource | Spec | Cost |
|----------|------|------|
| Prometheus | 2 CPU, 4GB RAM, 100GB disk | $20/mo |
| Grafana | 1 CPU, 2GB RAM | $10/mo |
| Storage | 500GB for 90-day retention | $15/mo |
| **Total** | | **$45/mo** |
Alternatively, deploy on the existing Ezra VPS (143.198.27.163), which has spare capacity.
---
## Current Workaround
Until this EPIC is complete, use:
- `wizard-health.sh` - Basic HTML dashboard
- Manual pytest runs - Test metrics
- `claw-code-dispatcher.py` logs - Dispatch metrics
- System `top`/`free`/`df` - Resource metrics
---
## Related
- Issue #799 - Claw Code identity (metrics will track its dispatch success)
- ADR-001 - Harness engineering (metrics prove value)
- Provider spike - Performance baselines established
---
**Truth in metrics. No fakes.**