feat: session garbage collection (#315) #383
Reference in New Issue
Block a user
Delete Branch "feat/315-session-gc"
Deleting a branch is permanent. Although the deleted branch may continue to exist for a short time before it actually gets removed, it CANNOT be undone in most cases. Continue?
What
Session garbage collection for empty/trivial sessions.
Changes
hermes_state.py— newgarbage_collect()method onSessionDB:dry_run=Truefor preview,source=filteringhermes_cli/main.py— newhermes sessions gcsubcommand:--empty-hours(default: 24)--trivial-days(default: 7)--trivial-max(default: 5)--sourcefilter--dry-runpreview mode--yesskip confirmationtests/test_hermes_state.py— 7 new tests covering all scenarios.Why
Empirical audit found 67.3% of sessions have <=5 messages. 3,564 are completely empty. This reduces state.db noise and file size.
Testing
Closes #315
Review: Session GC + Config Validation + Circuit Breaker (#315, #328, #309, #310)
This PR bundles four distinct features. Each is reasonable individually, but bundling them makes review and rollback harder. Consider splitting in the future.
Session Garbage Collection (#315)
Clean implementation. The GC policy (empty >24h, trivial >7d, substantial kept) is sensible. Good that active sessions are never touched.
Minor concern: The child session deletion logic deletes children of GC targets even if those children are substantial sessions (>5 messages). If a parent was empty but a child has 50 messages, the child gets deleted anyway. This may be intentional but is worth calling out -- consider checking child message counts before cascading.
Config Validation (#328)
Good additions:
fallback_providersstructure validation catches a real class of user errors.session_reset.idle_minutespositive check catches the=0immediate-reset bug._KNOWN_ROOT_KEYSexpansion reduces false-positive "unknown key" warnings.Nit: The warning message for API_SERVER_KEY references
~/.hermes/.envliterally rather than usingdisplay_hermes_home(). This is the exact pattern PR #384 is trying to eliminate.Circuit Breaker (#309) and Tool Fixation (#310)
Data-driven approach is good. The escalating warning levels (3/6/9+ errors) are well thought out.
Issue: The circuit breaker and tool fixation tracking happen in both the concurrent (
_run_toolthreaded) and sequential tool execution paths. In the threaded path,self._consecutive_tool_errors,self._error_streak_tool_names,self._same_tool_streak, andself._last_tool_nameare all modified from multiple threads with no synchronization. This means:_error_streak_tool_nameslist could have concurrent append/clear racesSince these are advisory nudges this is probably acceptable, but the duplicated code across both paths should ideally be extracted into helper methods.
Tests
Good coverage for GC (including child sessions, source filtering, dry-run) and config validation. No tests for the circuit breaker or tool fixation features, which would be valuable.
Verdict
Overall solid work. The main concerns are the child-cascade GC behavior and the thread-safety of circuit breaker counters, neither of which are blocking.
Good config validation additions with solid test coverage.
The new
_validate_fallback_providers()function ingateway/config.pyproperly validates structure at startup, and the CLI-side validation inhermes_cli/config.pymirrors it with user-friendly ConfigIssue messages. The expanded_KNOWN_ROOT_KEYSset is a welcome improvement to reduce false-positive "unknown key" warnings.The session GC feature in
hermes_state.pyis well-designed: theended_at IS NOT NULLguard prevents deleting active sessions, and the dry-run preview before actual deletion is a nice UX touch. However, the child-session deletion loop (lines ~308-317) iterates one-by-one with individual DELETE statements -- for large databases this could be slow. Consider usingDELETE FROM messages WHERE session_id IN (...)with a batched approach.The circuit breaker and tool fixation detection in
run_agent.pyare clever poka-yoke mechanisms. One concern:_same_tool_streakis incremented in both the concurrent_run_toolworker and the sequential path without any thread synchronization. With concurrent tool execution, this counter could have race conditions. Consider usingthreading.Lockorthreading.atomicfor the streak counter.The bare
except Exception: passin_validate_fallback_providers()silently swallows all errors including YAML parse failures. At minimum, log at DEBUG level so users can diagnose config loading issues.Tests are thorough and well-structured. Overall a solid PR.
Closing: stale — session GC, no clear issue or superseded. Triage by KETER — fleet-ops#143.