timmy-config/wizards/allegro/docs/EPIC-001-metrics-visualization-system.md

EPIC-001: Proper Metrics Visualization System

Status: PROPOSED
Priority: P1 - Infrastructure
Owner: TBD (Allegro proposing, needs assignment)
Created: 2026-04-01


Current State (REAL METRICS - NO FAKES)

Test Suite Metrics (Hermes Agent)

Total Tests:    4,884
Passed:         3,930 (80.5%)
Failed:           680 (13.9%)
Errors:           274 (5.6%)
Skipped:          166 (3.4%)
Execution Time: 139.79s (2:19)

Breakdown by Category:

  • ACP Tests: 29 passed, 2 failed, 5 skipped (after fixes)
  • Core Agent: ~85% pass rate
  • Tool Tests: Voice mode and web tools showing errors (config-related)
  • E2E Tests: Excluded (require external services)

Provider Trait Spike Performance

Mock Provider:    <1ms response
Kimi Provider:    Ready (needs aiohttp)
Ollama Provider:  Ready (needs local server)
Factory Pattern:  Working
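
The working factory pattern can be sketched as follows. This is a minimal illustration, not the spike's actual code: `Provider`, `MockProvider`, and `create_provider` are assumed names, and real providers (Kimi, Ollama) would register here once their dependencies are available.

```python
import time
from abc import ABC, abstractmethod


class Provider(ABC):
    """Minimal provider trait: one complete() call per backend."""

    @abstractmethod
    def complete(self, prompt: str) -> str: ...


class MockProvider(Provider):
    """In-memory provider, the source of the <1ms baseline."""

    def complete(self, prompt: str) -> str:
        return f"mock response to: {prompt}"


def create_provider(name: str) -> Provider:
    """Factory: map a config string to a provider instance."""
    registry = {"mock": MockProvider}  # Kimi/Ollama entries would go here
    try:
        return registry[name]()
    except KeyError:
        raise ValueError(f"unknown provider: {name}")


if __name__ == "__main__":
    provider = create_provider("mock")
    start = time.perf_counter()
    reply = provider.complete("ping")
    elapsed_ms = (time.perf_counter() - start) * 1000
    print(f"{reply!r} in {elapsed_ms:.3f}ms")
```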

System Resources (Allegro VM - 167.99.126.228)

Memory:     7.8 GB total
  Used:     980 MB (12.5%)
  Free:     5.7 GB (73%)
  Cache:    1.4 GB (18%)
  
Disk:       154 GB total
  Used:     12 GB (8%)
  Free:     143 GB (92%)
  
Load:       0.97 (1-min avg)
Uptime:     1:56

Wizard Health (From Monitoring Script)

Allegro-Primus Gateway: DOWN (port 8644)
Ezra Gateway:          DOWN (port 8643)
Ollama:                DOWN (port 11434)
Gitea:                 DOWN (port 3000)
SearXNG:               DOWN (port 8080)

Note: All services operational on Ezra (143.198.27.163).
Allegro cannot reach them due to Tailscale isolation.

Claw Code Identity Activation

User Created:    claw-code (ID: 24)
Token Generated: 5466b7a69b62677bc869778a3d9d12e8f6ee40b4
Token Deployed:  Allegro .env
Test Issue:      #799 (Timmy_Foundation/the-nexus)
Status:          OPERATIONAL

Problem Statement

No centralized metrics infrastructure exists.

Current limitations:

  • Manual test runs required to get coverage
  • No historical trend tracking
  • No alerting on failure thresholds
  • No performance benchmarking over time
  • Service health checked via bash scripts, not time-series DB
  • Metrics scattered across multiple VMs and files

Proposed Solution

Phase 1: Metrics Collection Layer (Week 1)

Components:

  1. Prometheus - Time-series metrics database

    • Port: 9090
    • Retention: 30 days local, archive to S3
  2. Node Exporter - System metrics

    • CPU, memory, disk, network
    • Per-process metrics for wizards
  3. Custom Exporters:

    • hermes_metrics_exporter - Test pass/fail rates, execution times
    • gitea_metrics_exporter - Issue counts, PR merge rates, CI times
    • wizard_health_exporter - Gateway status, response times
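
As an illustration, a custom exporter like hermes_metrics_exporter can be a small HTTP endpoint serving the Prometheus text exposition format. This stdlib-only sketch hard-codes the current run's numbers; the metric names, the port, and the parsing step are assumptions, not the real exporter's design.

```python
# Sketch of a custom exporter (e.g. hermes_metrics_exporter) serving
# Prometheus exposition format with only the standard library.
from http.server import BaseHTTPRequestHandler, HTTPServer


def render_metrics(total: int, passed: int) -> str:
    """Render test-suite counters in Prometheus text exposition format."""
    rate = passed / total if total else 0.0
    return (
        "# HELP hermes_tests_total Total tests in last run\n"
        "# TYPE hermes_tests_total gauge\n"
        f"hermes_tests_total {total}\n"
        "# HELP hermes_test_pass_rate Pass rate of last run\n"
        "# TYPE hermes_test_pass_rate gauge\n"
        f"hermes_test_pass_rate {rate:.4f}\n"
    )


class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        # A real exporter would parse pytest output here instead.
        body = render_metrics(4884, 3930).encode()
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.end_headers()
        self.wfile.write(body)


if __name__ == "__main__":
    # In deployment: HTTPServer(("0.0.0.0", 9101), MetricsHandler).serve_forever()
    print(render_metrics(4884, 3930))
```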

Phase 2: Visualization Layer (Week 2)

Grafana - Dashboards

  • Port: 3001 (3000 is already taken by Gitea)
  • Dashboards:
    • "Wizard Fleet Health" - All wizards, all services
    • "Test Suite Quality" - Pass rates, coverage trends, flaky tests
    • "Performance Metrics" - Response times, token usage, costs
    • "Infrastructure" - CPU, memory, disk across all VMs

Phase 3: Alerting Layer (Week 3)

AlertManager - Notifications

  • P0: Service down > 5 min
  • P1: Test pass rate < 70%
  • P2: Disk usage > 85%
  • P3: Response time > 10s

Delivery: Telegram, email, Gitea issue auto-created
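
The thresholds above translate directly into Prometheus alerting rules that AlertManager then routes. A hedged sketch: `hermes_test_pass_rate` is an assumed metric name from the custom exporter, while the `up` and filesystem expressions use standard Prometheus/node_exporter metrics.

```yaml
# alert-rules.yml (sketch) -- severities mirror the P0-P2 thresholds above
groups:
  - name: wizard-fleet
    rules:
      - alert: ServiceDown                      # P0
        expr: up == 0
        for: 5m
        labels: {severity: P0}
        annotations:
          summary: "{{ $labels.job }} down > 5 min"
      - alert: TestPassRateLow                  # P1
        expr: hermes_test_pass_rate < 0.70
        labels: {severity: P1}
      - alert: DiskUsageHigh                    # P2
        expr: |
          (node_filesystem_size_bytes - node_filesystem_avail_bytes)
            / node_filesystem_size_bytes > 0.85
        labels: {severity: P2}
```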


Technical Architecture

┌─────────────────────────────────────────────────────────┐
│                    METRICS PIPELINE                     │
├─────────────────────────────────────────────────────────┤
│  Sources          Collectors           Storage          │
│  ───────          ─────────           ────────          │
│  Hermes Agent  →  Prometheus      →   Prometheus TSDB   │
│  Gitea         →  Exporters          (local + S3)       │
│  System        →  Node Exporter                         │
│  Wizards       →  Health Checks                         │
│                                                         │
│  Visualization: Grafana (port 3001)                     │
│  Alerting: AlertManager → Telegram/Gitea                │
└─────────────────────────────────────────────────────────┘

Metrics to Track

Code Quality Metrics

| Metric         | Target | Alert If |
|----------------|--------|----------|
| Test Pass Rate | > 85%  | < 70%    |
| Code Coverage  | > 80%  | < 60%    |
| Flaky Tests    | 0      | > 5      |
| ACP Fix Rate   | 100%   | < 95%    |

Performance Metrics

| Metric            | Target | Alert If |
|-------------------|--------|----------|
| Provider Response | < 5s   | > 10s    |
| Gateway Uptime    | > 99%  | < 95%    |
| Memory Usage      | < 80%  | > 90%    |
| Disk Usage        | < 80%  | > 90%    |

Business Metrics

| Metric                | Target | Alert If |
|-----------------------|--------|----------|
| Issues Resolved/Day   | > 5    | < 2      |
| PR Merge Time         | < 24h  | > 72h    |
| Auto-Dispatch Success | > 95%  | < 80%    |

Implementation Plan

Week 1: Foundation

  • Deploy Prometheus (Docker)
  • Deploy Node Exporter on all VMs
  • Create hermes_metrics_exporter
  • Configure scraping
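
"Configure scraping" amounts to listing each exporter as a target in prometheus.yml. A sketch under assumed ports: 9100 is Node Exporter's default, while 9101/9102 for the custom exporters are placeholders.

```yaml
# prometheus.yml (sketch) -- job names and custom-exporter ports are assumptions
global:
  scrape_interval: 30s
scrape_configs:
  - job_name: node
    static_configs:
      - targets: ["167.99.126.228:9100", "143.198.27.163:9100"]
  - job_name: hermes
    static_configs:
      - targets: ["localhost:9101"]   # hermes_metrics_exporter
  - job_name: wizard-health
    static_configs:
      - targets: ["localhost:9102"]   # wizard_health_exporter
```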

Week 2: Visualization

  • Deploy Grafana
  • Create "Wizard Fleet Health" dashboard
  • Create "Test Suite Quality" dashboard
  • Create "Performance Metrics" dashboard

Week 3: Alerting

  • Deploy AlertManager
  • Configure Telegram notifications
  • Configure Gitea issue creation
  • Test alert pipeline

Week 4: Polish

  • Document dashboards
  • Train team on usage
  • Archive old metrics
  • Handoff to operations

Acceptance Criteria

  • Prometheus collecting metrics from all sources
  • Grafana accessible at http://metrics.timmy.local:3001
  • 3+ dashboards operational
  • Alerts firing and delivering to Telegram
  • Historical data retained for 30 days
  • Runbook documented

Resources Required

| Resource   | Spec                         | Cost   |
|------------|------------------------------|--------|
| Prometheus | 2 CPU, 4 GB RAM, 100 GB disk | $20/mo |
| Grafana    | 1 CPU, 2 GB RAM              | $10/mo |
| Storage    | 500 GB for 90-day retention  | $15/mo |
| Total      |                              | $45/mo |

Or deploy on existing Ezra VPS (has capacity).


Current Workaround

Until this EPIC is complete, use:

  • wizard-health.sh - Basic HTML dashboard
  • Manual pytest runs - Test metrics
  • claw-code-dispatcher.py logs - Dispatch metrics
  • System top/free/df - Resource metrics
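
For reference, the kind of check wizard-health.sh performs can be sketched in a few lines of Python. The service list comes from the monitoring section above; the TCP probe itself is illustrative, not the script's actual implementation.

```python
# Port-level health check, mirroring what a script like wizard-health.sh does.
import socket

SERVICES = {
    "Allegro-Primus Gateway": 8644,
    "Ezra Gateway": 8643,
    "Ollama": 11434,
    "Gitea": 3000,
    "SearXNG": 8080,
}


def check(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def report(host: str = "127.0.0.1") -> dict:
    """Check every known service port on one host."""
    return {name: check(host, port) for name, port in SERVICES.items()}


if __name__ == "__main__":
    for name, up in report().items():
        print(f"{name:24s} {'UP' if up else 'DOWN'}")
```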

Related Work

  • Issue #799 - Claw Code identity (metrics will track its dispatch success)
  • ADR-001 - Harness engineering (metrics prove value)
  • Provider spike - Performance baselines established

Truth in metrics. No fakes.