P0: Verify Inactivity-Based Timeouts (Gateway + Cron) #115

Open
opened 2026-04-06 14:06:53 +00:00 by Timmy · 1 comment
Owner

Context

Commits fec58ad9 (gateway) and d6ef7fdf (cron) replace wall-clock timeouts with inactivity-based timeouts:

  • Gateway: agent stays alive while actively processing (even for hours), only times out on genuine inactivity
  • Cron: same pattern — HERMES_CRON_TIMEOUT=0 means unlimited
  • Diagnostic info included on timeout: last activity, idle duration, current tool, iteration count
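
To make the difference from a wall-clock timeout concrete, here is a minimal sketch of the inactivity pattern described above. It is not the code from `fec58ad9` or `d6ef7fdf`; `ActivityTracker`, `touch`, and `timeout_message` are hypothetical names chosen for illustration, and treating a non-positive limit as unlimited is an assumption modeled on `HERMES_CRON_TIMEOUT=0`.

```python
import time


class ActivityTracker:
    """Records the most recent sign of progress from an agent (hypothetical helper)."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._last_ts = clock()
        self.last_activity = "started"
        self.current_tool = None
        self.iterations = 0

    def touch(self, description, tool=None):
        """Call whenever the agent makes progress; this resets the idle timer."""
        self._last_ts = self._clock()
        self.last_activity = description
        self.current_tool = tool
        self.iterations += 1

    def idle_seconds(self):
        """Seconds elapsed since the last recorded activity."""
        return self._clock() - self._last_ts


def timeout_message(tracker, limit_s):
    """Return a diagnostic string if the agent has been idle too long, else None.

    Assumption: a limit of 0 or less means unlimited.
    """
    if limit_s <= 0:
        return None
    idle = tracker.idle_seconds()
    if idle < limit_s:
        return None
    return (
        f"inactivity timeout after {idle:.0f}s idle "
        f"(last activity: {tracker.last_activity}, "
        f"current tool: {tracker.current_tool}, "
        f"iterations: {tracker.iterations})"
    )
```

The key point is that total runtime never enters the decision: an agent that keeps calling `touch()` can run for hours, while one that stops reporting progress is flagged once its idle time reaches the limit.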

Acceptance Criteria

  • Verify gateway timeout behavior: Start a long-running tool call (e.g., sleep 30 via terminal tool), confirm the gateway does NOT time out while the tool is actively running
  • Verify cron timeout behavior: Run a cron job that actively processes for >600s and confirm it does not time out (under the old wall-clock timeout it would have been killed; under the inactivity-based timeout it survives)
  • Verify diagnostic timeout message: Trigger a timeout (set HERMES_CRON_TIMEOUT=5 for testing), confirm the error includes: last activity description, idle duration, current tool, iteration count
  • Verify unlimited mode: Set HERMES_CRON_TIMEOUT=0, run a 15-minute task, confirm it completes without timeout
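
For the last two criteria, it may help to be explicit about how the `HERMES_CRON_TIMEOUT` value is expected to be interpreted. The sketch below is an assumption about that interpretation, not the actual parsing code in `cron/scheduler.py`: `read_cron_timeout` is a hypothetical helper, the 600s default is inferred from the old kill time mentioned in this issue, and treating negative values like `0` is a guess.

```python
import os

# Assumption: the default limit matches the 600s at which jobs used to be killed.
DEFAULT_TIMEOUT_S = 600.0


def read_cron_timeout(environ=None):
    """Return the inactivity limit in seconds, or None for unlimited.

    Raises ValueError if the variable is set to something non-numeric.
    """
    env = os.environ if environ is None else environ
    raw = env.get("HERMES_CRON_TIMEOUT", "").strip()
    if not raw:
        return DEFAULT_TIMEOUT_S
    value = float(raw)
    # 0 means unlimited per this issue; negative values are treated the same here.
    return None if value <= 0 else value
```

With this reading, `HERMES_CRON_TIMEOUT=5` yields a 5-second idle limit for the diagnostic test, and `HERMES_CRON_TIMEOUT=0` disables the limit for the 15-minute run.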

Why This Matters

Our burn-mode fleet (scouts, fleet auditors, CI monitors) was dying at exactly 600s, even when actively working, and partial work was silently lost. With inactivity-based timeouts, agents work until they actually stall, not until a wall clock expires.

Hints

  • Gateway timeout: gateway/run.py — look for _check_inactivity_timeout()
  • Cron timeout: cron/scheduler.py — polling loop with agent.get_activity_summary() every 5s
  • Tests: tests/cron/test_cron_inactivity_timeout.py (289 lines, 9 test scenarios)
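
As a reading aid for the cron hint, the polling loop might look roughly like the sketch below. Only `agent.get_activity_summary()` and the 5s interval come from this issue; the function name, the `is_done` callback, and the dictionary keys (`seconds_since_activity`, `last_activity_desc`, `current_tool`, `iteration_count`) are assumptions and may not match the real return shape.

```python
import time


def wait_with_inactivity_timeout(agent, is_done, limit_s, poll_s=5.0, sleep=time.sleep):
    """Poll until is_done() returns True; raise TimeoutError if the agent stalls.

    Assumption: limit_s <= 0 means unlimited, mirroring HERMES_CRON_TIMEOUT=0.
    """
    while not is_done():
        summary = agent.get_activity_summary()
        idle = summary["seconds_since_activity"]  # hypothetical key
        if limit_s > 0 and idle >= limit_s:
            # Include the diagnostic fields the acceptance criteria ask for.
            raise TimeoutError(
                f"idle for {idle:.0f}s (limit {limit_s:.0f}s); "
                f"last activity: {summary.get('last_activity_desc')}; "
                f"current tool: {summary.get('current_tool')}; "
                f"iterations: {summary.get('iteration_count')}"
            )
        sleep(poll_s)
```

Passing `sleep` as a parameter keeps the loop testable without real delays, which is one plausible way a test file could cover several scenarios quickly.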

Parent: #111

Member

🏷️ Automated Triage Check

Timestamp: 2026-04-06T16:30:13.327338
Agent: Allegro Heartbeat

This issue has been identified as needing triage:

Checklist

  • Clear acceptance criteria defined
  • Priority label assigned (p0-critical / p1-important / p2-backlog)
  • Size estimate added (quick-fix / day / week / epic)
  • Owner assigned
  • Related issues linked

Context

  • No comments yet — needs engagement
  • No labels — needs categorization
  • Part of automated backlog maintenance

Automated triage from Allegro 15-minute heartbeat


Reference: Timmy_Foundation/hermes-agent#115