[FLEET-007] Implement Auto-Restart Agent #560

Closed
opened 2026-04-07 04:04:42 +00:00 by Rockachopa · 1 comment
Owner

Phase 2: Automation | Requires: #006 Health Checks | Capacity cost: 20

When a health check detects a dead process, restart it automatically. No more 3am manual intervention.

Acceptance

  • Agent monitors health check results
  • Dead process -> auto restart within 60 seconds
  • 3 restart attempts before escalating (prevents restart loops)
  • Telegram notification: restarted X process on Y machine
  • MILESTONE: A process failed at 3am and restarted itself. You found out in the morning.

Paperclips: WireBuyer auto-purchases wire when low. This is WireBuyer for processes.
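
The acceptance criteria above could be sketched roughly as follows. This is a minimal illustration, not the code that shipped: `restart_with_escalation`, `RestartResult`, the callback parameters, and the 10-second cooldown are all hypothetical names and values chosen for the example. Only the limit of 3 attempts, the 60-second target, and the notification wording come from the issue.

```python
import time
from dataclasses import dataclass
from typing import Callable

MAX_ATTEMPTS = 3  # from the acceptance criteria: 3 restart attempts before escalating


@dataclass
class RestartResult:
    service: str
    attempts: int
    recovered: bool


def restart_with_escalation(
    service: str,
    is_alive: Callable[[str], bool],   # hypothetical health-check hook
    restart: Callable[[str], None],    # hypothetical restart hook (e.g. wraps a service manager)
    notify: Callable[[str], None],     # hypothetical notifier (e.g. a Telegram sender)
    machine: str = "unknown-host",
    cooldown_seconds: float = 10.0,    # assumed value; 3 x 10s stays under the 60s target
    sleep: Callable[[float], None] = time.sleep,
) -> RestartResult:
    """Try to bring a dead service back, escalating after MAX_ATTEMPTS failures."""
    if is_alive(service):
        return RestartResult(service, 0, True)

    for attempt in range(1, MAX_ATTEMPTS + 1):
        restart(service)
        sleep(cooldown_seconds)  # give the process time to come up before re-checking
        if is_alive(service):
            notify(f"restarted {service} on {machine} (attempt {attempt})")
            return RestartResult(service, attempt, True)

    # Bounded retries prevent a restart loop; hand the problem to a human instead.
    notify(
        f"ESCALATION: {service} on {machine} still down "
        f"after {MAX_ATTEMPTS} restart attempts"
    )
    return RestartResult(service, MAX_ATTEMPTS, False)
```

Passing the health check, restart, and notification steps in as callables keeps the retry policy testable without touching real processes.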

Rockachopa added the phase-2, project, fleet labels 2026-04-07 04:04:42 +00:00
perplexity was assigned by allegro 2026-04-07 13:49:39 +00:00
Timmy closed this issue 2026-04-07 15:43:46 +00:00
Author
Owner

Completed. Auto-restart agent in fleet/auto_restart.py (272 lines).

Monitors: hermes-gateway, ollama, codeclaw-heartbeat (local) + gitea, nginx, hermes-agent, evennia (VPS). Max 3 retries with cooldown. Telegram escalation on persistent failures. PRs: #353 (includes auto_restart, topology, health_check, capacity, milestones) and #359 (resource_tracker).
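
For readers who want a picture of the Telegram escalation step, here is one way it might look using only the standard library and the Telegram Bot API `sendMessage` method. The function names `build_telegram_request` and `send_telegram` are hypothetical, and the token and chat ID are placeholders. The actual code in `fleet/auto_restart.py` may differ.

```python
import json
import urllib.request


def build_telegram_request(token: str, chat_id: str, text: str) -> urllib.request.Request:
    """Build (but do not send) a Bot API sendMessage request."""
    url = f"https://api.telegram.org/bot{token}/sendMessage"
    payload = json.dumps({"chat_id": chat_id, "text": text}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_telegram(token: str, chat_id: str, text: str, timeout: float = 10.0) -> bool:
    """Send the message; return the 'ok' flag from the Bot API response."""
    request = build_telegram_request(token, chat_id, text)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return bool(json.load(response).get("ok", False))


# Example (placeholders, not real credentials):
# send_telegram("<bot-token>", "<chat-id>", "ESCALATION: nginx on VPS still down after 3 attempts")
```

Splitting request construction from sending makes the message format checkable without network access.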


Reference: Timmy_Foundation/timmy-home#560