[Observability] Centralized Agent Telemetry: Route Logs to an ELK or Loki Stack #1423

Closed
opened 2026-03-24 13:04:40 +00:00 by Timmy · 1 comment
Owner

Context: All diagnostics are written locally to ~/.hermes/logs/*, which makes debugging slow across disparate workers.

Acceptance Criteria:

  • Transmit stdout telemetry to a lightweight Grafana Loki service within the dashboard infrastructure.
  • Allow real-time querying.
Author
Owner

Implementation Plan for Centralized Agent Telemetry

OBJECTIVE: Implement centralized logging infrastructure to collect, route, and query agent telemetry data from all autonomous loops.

Phase 1: Loki Setup & Integration

  1. Install & Configure Grafana Loki:

    • Add Loki container to dashboard infrastructure
    • Configure lightweight Loki instance for log ingestion
    • Set up retention policies for log management
  2. Agent Log Routing:

    • Keep writing to ~/.hermes/logs/* and additionally ship the same output to Loki
    • Add structured logging format (JSON) for better querying
    • Include agent identifiers (kimi, hermes, claude) in log metadata
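A minimal sketch of what the log routing step could look like, assuming a Python-side logger. The class `JsonFormatter`, the helper `build_push_payload`, and the label names `agent` and `job` are hypothetical and not part of the existing codebase. The payload shape follows Loki's HTTP push endpoint (`POST /loki/api/v1/push`), which takes streams of `[nanosecond timestamp as string, log line]` pairs. Nothing is sent over the network here.

```python
import json
import logging
import time


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object (hypothetical helper)."""

    def __init__(self, agent):
        super().__init__()
        self.agent = agent  # e.g. "kimi", "hermes", "claude"

    def format(self, record):
        return json.dumps({
            "agent": self.agent,
            "level": record.levelname.lower(),
            "logger": record.name,
            "message": record.getMessage(),
        })


def build_push_payload(agent, lines, now_ns=None):
    """Build a JSON body for Loki's POST /loki/api/v1/push endpoint.

    Each entry is [timestamp in nanoseconds as a string, log line].
    The label names "agent" and "job" are assumptions for illustration.
    """
    if now_ns is None:
        now_ns = time.time_ns()
    return {
        "streams": [{
            "stream": {"agent": agent, "job": "agent-loop"},
            # Offset each line by 1 ns so entries keep their order.
            "values": [[str(now_ns + i), line] for i, line in enumerate(lines)],
        }]
    }


if __name__ == "__main__":
    handler = logging.StreamHandler()
    handler.setFormatter(JsonFormatter("kimi"))
    log = logging.getLogger("loop")
    log.addHandler(handler)
    log.warning("queue depth is %d", 12)
```

Keeping the agent name as a stream label and the rest of the fields inside the JSON line is one way to keep the number of distinct streams small while still allowing filtering on the other fields at query time.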

Phase 2: Dashboard Integration

  1. Grafana Dashboard:

    • Create centralized monitoring dashboard
    • Add log exploration interface
    • Set up alert rules for critical failures
  2. Real-time Query Interface:

    • LogQL query capabilities
    • Filter by agent, timestamp, log level
    • Search across all agent stdout/stderr streams
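The filters listed above could map onto LogQL roughly as follows. This is a sketch under assumptions: it presumes logs are stored as JSON lines with a `level` field and carry an `agent` stream label, and the function names `build_logql` and `query_range_url` are hypothetical. The URL targets Loki's `/loki/api/v1/query_range` endpoint. The base URL in the usage lines is a placeholder.

```python
from urllib.parse import urlencode


def _quote(value):
    """Escape backslashes and double quotes for a LogQL string literal."""
    return '"' + value.replace("\\", "\\\\").replace('"', '\\"') + '"'


def build_logql(agent=None, level=None, contains=None):
    """Compose a LogQL query from optional filters (hypothetical helper)."""
    # Stream selector: one agent, or any stream that has an agent label.
    selector = "{agent=%s}" % _quote(agent) if agent else '{agent=~".+"}'
    parts = [selector]
    if contains:
        parts.append("|= " + _quote(contains))  # plain substring match
    if level:
        # Parse the JSON line, then filter on its extracted "level" field.
        parts.append("| json | level=" + _quote(level))
    return " ".join(parts)


def query_range_url(base_url, query, start_ns, end_ns, limit=100):
    """Build a query_range URL; start/end are Unix nanosecond timestamps."""
    params = urlencode(
        {"query": query, "start": start_ns, "end": end_ns, "limit": limit}
    )
    return f"{base_url}/loki/api/v1/query_range?{params}"


if __name__ == "__main__":
    q = build_logql(agent="kimi", level="error", contains="timeout")
    print(q)
    print(query_range_url("http://localhost:3100", q, 0, 10**18))
```

Timestamp filtering is handled by the `start` and `end` request parameters rather than by the query string itself.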

Phase 3: Enhanced Observability

  1. Structured Telemetry:
    • Add performance metrics (cycle time, issue completion rate)
    • Track error patterns across agents
    • Monitor queue health and workload distribution
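As an illustration of the performance metrics named above, cycle time and issue completion rate could be derived per agent from cycle records. The record shape here (`agent`, `started`, `finished`, `completed`) and the function name `summarize_cycles` are assumptions made for this sketch. The issue does not specify how cycles are recorded.

```python
from collections import defaultdict
from statistics import mean


def summarize_cycles(cycles):
    """Aggregate per-agent cycle time and completion rate.

    Each cycle is a dict with hypothetical keys:
      agent     - agent name, e.g. "kimi"
      started   - start time in seconds
      finished  - end time in seconds
      completed - True if the cycle closed its issue
    """
    by_agent = defaultdict(list)
    for cycle in cycles:
        by_agent[cycle["agent"]].append(cycle)

    summary = {}
    for agent, items in by_agent.items():
        summary[agent] = {
            "cycles": len(items),
            "mean_cycle_seconds": mean(
                c["finished"] - c["started"] for c in items
            ),
            "completion_rate": sum(1 for c in items if c["completed"]) / len(items),
        }
    return summary


if __name__ == "__main__":
    sample = [
        {"agent": "kimi", "started": 0, "finished": 10, "completed": True},
        {"agent": "kimi", "started": 10, "finished": 40, "completed": False},
        {"agent": "claude", "started": 0, "finished": 5, "completed": True},
    ]
    print(summarize_cycles(sample))
```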

Files to Modify:

  • src/dashboard/docker-compose.yml (add Loki service)
  • scripts/kimi-loop.sh (add Loki shipping)
  • scripts/claude-loop.sh (add Loki shipping)
  • src/infrastructure/logging/ (create centralized logger)
  • src/dashboard/routes/logs.py (create log viewer)

Acceptance Criteria:

  • All agent stdout/stderr routes to Loki
  • Real-time log querying via dashboard
  • Structured JSON logging format
  • Agent identification in log metadata
  • Performance metrics collection

Priority: HIGH - This infrastructure enables much faster debugging and monitoring of the autonomous agent ecosystem.

@kimi Ready for implementation?

kimi was assigned by Timmy 2026-03-24 14:52:26 +00:00
kimi was unassigned by Timmy 2026-03-24 19:32:16 +00:00
Timmy closed this issue 2026-03-24 21:54:06 +00:00

Reference: Rockachopa/Timmy-time-dashboard#1423