# EPIC-001: Proper Metrics Visualization System

**Status:** PROPOSED

**Priority:** P1 - Infrastructure

**Owner:** TBD (Allegro proposing, needs assignment)

**Created:** 2026-04-01

---

## Current State (REAL METRICS - NO FAKES)

### Test Suite Metrics (Hermes Agent)

```
Total Tests:      4,884
Passed:           3,930 (80.5%)
Failed:             680 (13.9%)
Errors:             274 (5.6%)
Skipped:            166 (3.4%)
Execution Time:  139.79s (2:19)
```

**Breakdown by Category:**

- ACP Tests: 29 passed, 2 failed, 5 skipped (after fixes)
- Core Agent: ~85% pass rate
- Tool Tests: voice mode and web tools showing errors (config-related)
- E2E Tests: excluded (require external services)


### Provider Trait Spike Performance

```
Mock Provider:   <1ms response
Kimi Provider:   Ready (needs aiohttp)
Ollama Provider: Ready (needs local server)
Factory Pattern: Working
```

### System Resources (Allegro VM - 167.99.126.228)

```
Memory: 7.8 GB total
  Used:  980 MB (12.5%)
  Free:  5.7 GB (73%)
  Cache: 1.4 GB (18%)

Disk: 154 GB total
  Used: 12 GB (8%)
  Free: 143 GB (92%)

Load:   0.97 (1-min avg)
Uptime: 1:56
```

### Wizard Health (From Monitoring Script)

```
Allegro-Primus Gateway: DOWN (port 8644)
Ezra Gateway:           DOWN (port 8643)
Ollama:                 DOWN (port 11434)
Gitea:                  DOWN (port 3000)
SearXNG:                DOWN (port 8080)

Note: All services operational on Ezra (143.198.27.163).
Allegro cannot reach them due to Tailscale isolation.
```

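The per-port probes behind the monitoring script amount to "can I open a TCP connection". A minimal stdlib sketch of that check; the `localhost` targets and the `check()` helper name are assumptions, not the real script's internals (the ports match the list above):

```python
# Minimal TCP port probe, a sketch of wizard-health-style checks.
# Hosts are assumed to be localhost; ports are from the status list above.
import socket

SERVICES = {
    "Allegro-Primus Gateway": ("localhost", 8644),
    "Ezra Gateway":           ("localhost", 8643),
    "Ollama":                 ("localhost", 11434),
    "Gitea":                  ("localhost", 3000),
    "SearXNG":                ("localhost", 8080),
}

def check(host, port, timeout=2.0):
    """Return "UP" if a TCP connection succeeds within timeout, else "DOWN"."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "UP"
    except OSError:
        return "DOWN"

if __name__ == "__main__":
    for name, (host, port) in SERVICES.items():
        print(f"{name}: {check(host, port)} (port {port})")
```

A probe like this only proves the port accepts connections; the Prometheus exporters proposed below replace it with real health semantics.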
### Claw Code Identity Activation

```
User Created:    claw-code (ID: 24)
Token Generated: 5466b7a69b62677bc869778a3d9d12e8f6ee40b4
Token Deployed:  Allegro .env
Test Issue:      #799 (Timmy_Foundation/the-nexus)
Status:          OPERATIONAL
```

---

## Problem Statement

**No centralized metrics infrastructure exists.**

Current limitations:

- Manual test runs required to get coverage numbers
- No historical trend tracking
- No alerting on failure thresholds
- No performance benchmarking over time
- Service health checked via bash scripts, not a time-series database
- Metrics scattered across multiple VMs and files

---

## Proposed Solution

### Phase 1: Metrics Collection Layer (Week 1)

**Components:**

1. **Prometheus** - time-series metrics database
   - Port: 9090
   - Retention: 30 days local, archive to S3
2. **Node Exporter** - system metrics
   - CPU, memory, disk, network
   - Per-process metrics for wizards
3. **Custom Exporters:**
   - `hermes_metrics_exporter` - test pass/fail rates, execution times
   - `gitea_metrics_exporter` - issue counts, PR merge rates, CI times
   - `wizard_health_exporter` - gateway status, response times

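A custom exporter is just an HTTP endpoint serving Prometheus' text exposition format. A stdlib-only sketch of what `hermes_metrics_exporter` could look like, assuming a pytest-style summary line as input and an unallocated port; the metric name and summary shape are assumptions, not the real Hermes output:

```python
# Stdlib-only sketch of hermes_metrics_exporter: parse a pytest-style
# summary line and serve it in Prometheus' text exposition format.
# SUMMARY, the metric name, and the port are illustrative assumptions.
import re
from http.server import BaseHTTPRequestHandler, HTTPServer

SUMMARY = "3930 passed, 680 failed, 274 errors, 166 skipped in 139.79s"

def parse_summary(text):
    """Extract outcome counts from a pytest-style summary line."""
    counts = {}
    for key in ("passed", "failed", "errors", "skipped"):
        m = re.search(rf"(\d+) {key}", text)
        counts[key] = int(m.group(1)) if m else 0
    return counts

def to_prometheus(counts):
    """Render counts in Prometheus' text exposition format."""
    lines = [
        "# HELP hermes_tests_total Test results by outcome.",
        "# TYPE hermes_tests_total gauge",
    ]
    for outcome, n in counts.items():
        lines.append(f'hermes_tests_total{{outcome="{outcome}"}} {n}')
    return "\n".join(lines) + "\n"

class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = to_prometheus(parse_summary(SUMMARY)).encode()
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.end_headers()
        self.wfile.write(body)

def serve(port=9105):  # port is an assumption, not yet allocated
    HTTPServer(("", port), MetricsHandler).serve_forever()
```

Prometheus would then scrape `/metrics` on the chosen port; a labeled gauge per outcome lets Grafana plot pass/fail trends without any custom query logic.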
### Phase 2: Visualization Layer (Week 2)

**Grafana** - dashboards
- Port: 3001 (3000 is already taken by Gitea)
- Dashboards:
  - "Wizard Fleet Health" - all wizards, all services
  - "Test Suite Quality" - pass rates, coverage trends, flaky tests
  - "Performance Metrics" - response times, token usage, costs
  - "Infrastructure" - CPU, memory, disk across all VMs

### Phase 3: Alerting Layer (Week 3)

**AlertManager** - notifications
- P0: service down > 5 min
- P1: test pass rate < 70%
- P2: disk usage > 85%
- P3: response time > 10s

Delivery: Telegram, email, and an auto-created Gitea issue

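For the Gitea delivery channel, a small webhook receiver would turn a fired alert into an issue via Gitea's issue-creation endpoint (`POST /api/v1/repos/{owner}/{repo}/issues`). A hedged sketch; the helper names and payload shape here are illustrative assumptions, not existing code:

```python
# Sketch of the "Gitea issue auto-created" delivery path. Helper names
# are assumptions; the endpoint is Gitea's standard issue-creation API.
import json
from urllib import request

def alert_to_issue_payload(priority, summary, detail):
    """Build the JSON body for a Gitea issue from a fired alert."""
    return {"title": f"[{priority}] {summary}", "body": detail}

def create_issue(base_url, owner, repo, token, payload):
    """POST the payload to Gitea; requires a live server and a valid token."""
    req = request.Request(
        f"{base_url}/api/v1/repos/{owner}/{repo}/issues",
        data=json.dumps(payload).encode(),
        headers={
            "Authorization": f"token {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
    with request.urlopen(req) as resp:
        return json.load(resp)
```

The claw-code token already deployed on Allegro could authenticate these calls, so alert-created issues would show up under the same bot identity as dispatched work.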
---

## Technical Architecture

```
┌─────────────────────────────────────────────────────────┐
│                    METRICS PIPELINE                     │
├─────────────────────────────────────────────────────────┤
│  Sources          Collectors        Storage             │
│  ───────          ─────────         ────────            │
│  Hermes Agent  →  Prometheus     →  Prometheus TSDB     │
│  Gitea         →  Exporters         (local + S3)        │
│  System        →  Node Exporter                         │
│  Wizards       →  Health Checks                         │
│                                                         │
│  Visualization: Grafana (port 3001)                     │
│  Alerting:      AlertManager → Telegram/Gitea           │
└─────────────────────────────────────────────────────────┘
```

---

## Metrics to Track

### Code Quality Metrics

| Metric | Target | Alert If |
|--------|--------|----------|
| Test Pass Rate | > 85% | < 70% |
| Code Coverage | > 80% | < 60% |
| Flaky Tests | 0 | > 5 |
| ACP Fix Rate | 100% | < 95% |

### Performance Metrics

| Metric | Target | Alert If |
|--------|--------|----------|
| Provider Response | < 5s | > 10s |
| Gateway Uptime | > 99% | < 95% |
| Memory Usage | < 80% | > 90% |
| Disk Usage | < 80% | > 90% |

### Business Metrics

| Metric | Target | Alert If |
|--------|--------|----------|
| Issues Resolved/Day | > 5 | < 2 |
| PR Merge Time | < 24h | > 72h |
| Auto-Dispatch Success | > 95% | < 80% |

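The "Alert If" columns above reduce to one small rule set: each rule names a metric, a direction, a threshold, and a priority. A sketch of that encoding; metric names, the `AlertRule` shape, and priority assignments are illustrative assumptions (in practice these live in AlertManager config, not Python):

```python
# Illustrative encoding of the alert thresholds from the tables above.
# Metric names and priority assignments are assumptions.
from dataclasses import dataclass

@dataclass
class AlertRule:
    metric: str
    op: str          # "<" fires when the value drops below threshold, ">" when it rises above
    threshold: float
    priority: str

RULES = [
    AlertRule("test_pass_rate",    "<", 0.70, "P1"),
    AlertRule("gateway_uptime",    "<", 0.95, "P0"),
    AlertRule("disk_usage",        ">", 0.90, "P2"),
    AlertRule("provider_response", ">", 10.0, "P3"),
]

def evaluate(samples):
    """Return (priority, metric) for every rule breached by the samples."""
    fired = []
    for rule in RULES:
        value = samples.get(rule.metric)
        if value is None:
            continue  # metric not scraped this round
        breached = value < rule.threshold if rule.op == "<" else value > rule.threshold
        if breached:
            fired.append((rule.priority, rule.metric))
    return fired
```

Keeping target and alert thresholds apart (as the tables do) leaves headroom: a metric can miss its target for a while before paging anyone.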
---

## Implementation Plan

### Week 1: Foundation

- [ ] Deploy Prometheus (Docker)
- [ ] Deploy Node Exporter on all VMs
- [ ] Create `hermes_metrics_exporter`
- [ ] Configure scraping

### Week 2: Visualization

- [ ] Deploy Grafana
- [ ] Create "Wizard Fleet Health" dashboard
- [ ] Create "Test Suite Quality" dashboard
- [ ] Create "Performance Metrics" dashboard

### Week 3: Alerting

- [ ] Deploy AlertManager
- [ ] Configure Telegram notifications
- [ ] Configure Gitea issue creation
- [ ] Test the alert pipeline end to end

### Week 4: Polish

- [ ] Document dashboards
- [ ] Train team on usage
- [ ] Archive old metrics
- [ ] Hand off to operations

---

## Acceptance Criteria

- [ ] Prometheus collecting metrics from all sources
- [ ] Grafana accessible at http://metrics.timmy.local:3001
- [ ] 3+ dashboards operational
- [ ] Alerts firing and delivering to Telegram
- [ ] Historical data retained for 30 days
- [ ] Runbook documented

---

## Resources Required

| Resource | Spec | Cost |
|----------|------|------|
| Prometheus | 2 CPU, 4 GB RAM, 100 GB disk | $20/mo |
| Grafana | 1 CPU, 2 GB RAM | $10/mo |
| Storage | 500 GB for 90-day retention | $15/mo |
| **Total** | | **$45/mo** |

Alternatively, deploy on the existing Ezra VPS, which has spare capacity.

---

## Current Workaround

Until this EPIC is complete, use:

- `wizard-health.sh` - basic HTML dashboard
- Manual pytest runs - test metrics
- `claw-code-dispatcher.py` logs - dispatch metrics
- System `top`/`free`/`df` - resource metrics

---

## Related

- Issue #799 - Claw Code identity (metrics will track its dispatch success)
- ADR-001 - Harness engineering (metrics prove value)
- Provider spike - performance baselines established

---

**Truth in metrics. No fakes.**