# EPIC-001: Proper Metrics Visualization System
**Status:** PROPOSED
**Priority:** P1 - Infrastructure
**Owner:** TBD (proposed by Allegro; needs assignment)
**Created:** 2026-04-01
---
## Current State (REAL METRICS - NO FAKES)
### Test Suite Metrics (Hermes Agent)
```
Total Tests: 4,884
Passed: 3,930 (80.5%)
Failed: 680 (13.9%)
Errors: 274 (5.6%)
Skipped: 166 (3.4%)
Execution Time: 139.79s (2:19)
```
**Breakdown by Category:**
- ACP Tests: 29 passed, 2 failed, 5 skipped (after fixes)
- Core Agent: ~85% pass rate
- Tool Tests: Voice mode and web tools showing errors (config-related)
- E2E Tests: Excluded (require external services)
### Provider Trait Spike Performance
```
Mock Provider: <1ms response
Kimi Provider: Ready (needs aiohttp)
Ollama Provider: Ready (needs local server)
Factory Pattern: Working
```
### System Resources (Allegro VM - 167.99.126.228)
```
Memory: 7.8 GB total
Used: 980 MB (12.5%)
Free: 5.7 GB (73%)
Cache: 1.4 GB (18%)
Disk: 154 GB total
Used: 12 GB (8%)
Free: 143 GB (92%)
Load: 0.97 (1-min avg)
Uptime: 1:56
```
### Wizard Health (From Monitoring Script)
```
Allegro-Primus Gateway: DOWN (port 8644)
Ezra Gateway: DOWN (port 8643)
Ollama: DOWN (port 11434)
Gitea: DOWN (port 3000)
SearXNG: DOWN (port 8080)
Note: All services operational on Ezra (143.198.27.163).
Allegro cannot reach them due to Tailscale isolation.
```
### Claw Code Identity Activation
```
User Created: claw-code (ID: 24)
Token Generated: 5466b7a69b62677bc869778a3d9d12e8f6ee40b4
Token Deployed: Allegro .env
Test Issue: #799 (Timmy_Foundation/the-nexus)
Status: OPERATIONAL
```
---
## Problem Statement
**No centralized metrics infrastructure exists.**
Current limitations:
- Manual test runs required to get coverage
- No historical trend tracking
- No alerting on failure thresholds
- No performance benchmarking over time
- Service health checked via bash scripts, not time-series DB
- Metrics scattered across multiple VMs and files
---
## Proposed Solution
### Phase 1: Metrics Collection Layer (Week 1)
**Components:**
1. **Prometheus** - Time-series metrics database
- Port: 9090
- Retention: 30 days local, archive to S3
2. **Node Exporter** - System metrics
- CPU, memory, disk, network
- Per-process metrics for wizards
3. **Custom Exporters:**
- `hermes_metrics_exporter` - Test pass/fail rates, execution times
- `gitea_metrics_exporter` - Issue counts, PR merge rates, CI times
- `wizard_health_exporter` - Gateway status, response times
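The custom exporters in item 3 do not exist yet. A minimal, stdlib-only sketch of what `hermes_metrics_exporter` could look like, rendering the test-suite numbers above in the Prometheus text exposition format and serving them on `/metrics`; the metric names, the port (9101), and the summary-dict shape are all assumptions, not a final design:

```python
# Minimal sketch of hermes_metrics_exporter using only the stdlib:
# renders pytest summary numbers in the Prometheus text exposition
# format and serves them on /metrics. Names and port are illustrative.
from http.server import BaseHTTPRequestHandler, HTTPServer


def render_metrics(summary: dict) -> str:
    """Render one pytest run summary as Prometheus exposition text."""
    total = summary["total"]
    passed = summary["passed"]
    rate = passed / total if total else 0.0
    lines = [
        "# TYPE hermes_tests_total gauge",
        f"hermes_tests_total {total}",
        "# TYPE hermes_tests_passed gauge",
        f"hermes_tests_passed {passed}",
        "# TYPE hermes_test_pass_rate gauge",
        f"hermes_test_pass_rate {rate:.4f}",
        "# TYPE hermes_test_duration_seconds gauge",
        f"hermes_test_duration_seconds {summary['duration_seconds']}",
    ]
    return "\n".join(lines) + "\n"


class MetricsHandler(BaseHTTPRequestHandler):
    # Placeholder data: in practice this would come from the latest
    # pytest run (the numbers here are the current Hermes figures).
    summary = {"total": 4884, "passed": 3930, "duration_seconds": 139.79}

    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics(self.summary).encode()
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


if __name__ == "__main__":
    HTTPServer(("", 9101), MetricsHandler).serve_forever()  # assumed port
```

In production the `prometheus_client` library would likely replace the hand-rolled handler, but the exposition format is simple enough that the sketch shows the whole contract Prometheus scrapes against.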
### Phase 2: Visualization Layer (Week 2)
**Grafana** - Dashboards
- Port: 3001 (Grafana's default 3000 conflicts with Gitea)
- Dashboards:
- "Wizard Fleet Health" - All wizards, all services
- "Test Suite Quality" - Pass rates, coverage trends, flaky tests
- "Performance Metrics" - Response times, token usage, costs
- "Infrastructure" - CPU, memory, disk across all VMs
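So the dashboards survive container rebuilds, the Prometheus datasource can ship with Grafana via file provisioning rather than manual clicks. A hypothetical provisioning file (path and URL are assumptions based on the ports proposed above):

```yaml
# provisioning/datasources/prometheus.yml (hypothetical path)
# Pre-wires the Prometheus datasource into Grafana on startup.
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://localhost:9090   # Prometheus from Phase 1
    isDefault: true
```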
### Phase 3: Alerting Layer (Week 3)
**AlertManager** - Notifications
- P0: Service down > 5 min
- P1: Test pass rate < 70%
- P2: Disk usage > 85%
- P3: Response time > 10s
Delivery channels: Telegram, email, and an auto-created Gitea issue
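The four thresholds above translate directly into Prometheus alerting rules. A sketch, with the caveat that the metric names (`hermes_test_pass_rate`, `wizard_gateway_response_seconds`) are assumptions until the Phase 1 exporters exist; `up` and the `node_filesystem_*` metrics come from Prometheus and node_exporter respectively:

```yaml
# Sketch of the P0-P3 thresholds as Prometheus alerting rules.
groups:
  - name: wizard-fleet
    rules:
      - alert: ServiceDown
        expr: up == 0
        for: 5m                      # P0: down > 5 min
        labels: {severity: P0}
      - alert: TestPassRateLow
        expr: hermes_test_pass_rate < 0.70   # assumed metric name
        labels: {severity: P1}
      - alert: DiskUsageHigh
        expr: (1 - node_filesystem_avail_bytes / node_filesystem_size_bytes) > 0.85
        labels: {severity: P2}
      - alert: SlowResponse
        expr: wizard_gateway_response_seconds > 10   # assumed metric name
        labels: {severity: P3}
```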
---
## Technical Architecture
```
┌────────────────────────────────────────────────────────┐
│                    METRICS PIPELINE                    │
├────────────────────────────────────────────────────────┤
│  Sources           Collectors         Storage          │
│  ───────           ──────────         ───────          │
│  Hermes Agent   →  Prometheus      →  Prometheus TSDB  │
│  Gitea          →  Exporters          (local + S3)     │
│  System         →  Node Exporter                       │
│  Wizards        →  Health Checks                       │
│                                                        │
│  Visualization:  Grafana (port 3001)                   │
│  Alerting:       AlertManager → Telegram/Gitea         │
└────────────────────────────────────────────────────────┘
```
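On the collection side, the pipeline reduces to a `prometheus.yml` scrape config. A sketch, assuming each source runs an exporter on its own port; node_exporter's 9100 is its default, while the custom exporter ports (9101, 9102) are placeholders:

```yaml
# Sketch of prometheus.yml for the pipeline above.
global:
  scrape_interval: 15s
scrape_configs:
  - job_name: node
    static_configs:
      - targets: ["localhost:9100"]   # node_exporter default port
  - job_name: hermes
    static_configs:
      - targets: ["localhost:9101"]   # hermes_metrics_exporter (assumed port)
  - job_name: wizard-health
    static_configs:
      - targets: ["localhost:9102"]   # wizard_health_exporter (assumed port)
```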
---
## Metrics to Track
### Code Quality Metrics
| Metric | Target | Alert If |
|--------|--------|----------|
| Test Pass Rate | > 85% | < 70% |
| Code Coverage | > 80% | < 60% |
| Flaky Tests | 0 | > 5 |
| ACP Fix Rate | 100% | < 95% |
### Performance Metrics
| Metric | Target | Alert If |
|--------|--------|----------|
| Provider Response | < 5s | > 10s |
| Gateway Uptime | > 99% | < 95% |
| Memory Usage | < 80% | > 90% |
| Disk Usage | < 80% | > 90% |
### Business Metrics
| Metric | Target | Alert If |
|--------|--------|----------|
| Issues Resolved/Day | > 5 | < 2 |
| PR Merge Time | < 24h | > 72h |
| Auto-Dispatch Success | > 95% | < 80% |
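All three tables share one shape: a metric, a target, and an alert threshold that breaches in one of two directions (too low for pass rate or uptime, too high for memory or disk). A sketch of evaluating them generically, using the Performance Metrics thresholds; rule names are illustrative:

```python
# Evaluate the target/alert thresholds from the tables above.
# Each rule records which direction is "bad": some metrics alert when
# they fall below the threshold (pass rate, uptime), others when they
# rise above it (memory, disk). Values are fractions in [0, 1].
from dataclasses import dataclass


@dataclass
class Rule:
    name: str
    alert_threshold: float
    alert_when_below: bool  # True: alert if value < threshold

    def check(self, value: float) -> bool:
        """Return True if this value should fire an alert."""
        if self.alert_when_below:
            return value < self.alert_threshold
        return value > self.alert_threshold


# Thresholds mirror the Performance Metrics table.
RULES = [
    Rule("test_pass_rate", 0.70, alert_when_below=True),
    Rule("gateway_uptime", 0.95, alert_when_below=True),
    Rule("memory_usage", 0.90, alert_when_below=False),
    Rule("disk_usage", 0.90, alert_when_below=False),
]


def firing(values: dict) -> list:
    """Names of rules whose current value breaches its alert threshold."""
    return [r.name for r in RULES if r.name in values and r.check(values[r.name])]
```

With today's real numbers (80.5% pass rate, 8% disk), nothing fires; the same structure extends to the Code Quality and Business tables by adding rules.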
---
## Implementation Plan
### Week 1: Foundation
- [ ] Deploy Prometheus (Docker)
- [ ] Deploy Node Exporter on all VMs
- [ ] Create hermes_metrics_exporter
- [ ] Configure scraping
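The Week 1 items above can be expressed as one compose file. A sketch only: image tags, volume paths, and the bind-mounted `prometheus.yml` location are placeholders, not the final deployment config, and the 30-day retention flag matches the Phase 1 retention target:

```yaml
# Sketch of the Week 1 deployment (docker-compose).
services:
  prometheus:
    image: prom/prometheus:latest
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml   # placeholder path
      - prom-data:/prometheus
    command:
      - --config.file=/etc/prometheus/prometheus.yml
      - --storage.tsdb.retention.time=30d    # Phase 1 retention target
  node-exporter:
    image: prom/node-exporter:latest
    ports:
      - "9100:9100"
volumes:
  prom-data:
```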
### Week 2: Visualization
- [ ] Deploy Grafana
- [ ] Create "Wizard Fleet Health" dashboard
- [ ] Create "Test Suite Quality" dashboard
- [ ] Create "Performance Metrics" dashboard
### Week 3: Alerting
- [ ] Deploy AlertManager
- [ ] Configure Telegram notifications
- [ ] Configure Gitea issue creation
- [ ] Test alert pipeline
### Week 4: Polish
- [ ] Document dashboards
- [ ] Train team on usage
- [ ] Archive old metrics
- [ ] Handoff to operations
---
## Acceptance Criteria
- [ ] Prometheus collecting metrics from all sources
- [ ] Grafana accessible at http://metrics.timmy.local:3001
- [ ] 3+ dashboards operational
- [ ] Alerts firing and delivering to Telegram
- [ ] Historical data retained for 30 days
- [ ] Runbook documented
---
## Resources Required
| Resource | Spec | Cost |
|----------|------|------|
| Prometheus | 2 CPU, 4GB RAM, 100GB disk | $20/mo |
| Grafana | 1 CPU, 2GB RAM | $10/mo |
| Storage | 500GB for 90-day retention | $15/mo |
| **Total** | | **$45/mo** |
Alternatively, deploy on the existing Ezra VPS (143.198.27.163), which has spare capacity.
---
## Current Workaround
Until this EPIC is complete, use:
- `wizard-health.sh` - Basic HTML dashboard
- Manual pytest runs - Test metrics
- `claw-code-dispatcher.py` logs - Dispatch metrics
- System `top`/`free`/`df` - Resource metrics
---
## Related
- Issue #799 - Claw Code identity (metrics will track its dispatch success)
- ADR-001 - Harness engineering (metrics prove value)
- Provider spike - Performance baselines established
---
**Truth in metrics. No fakes.**