feat: session garbage collection (#315) #383

Rockachopa · 2026-04-13T21:31:25Z

Rockachopa commented

2026-04-13 21:31:25 +00:00

What

Session garbage collection for empty/trivial sessions.

Changes

hermes_state.py — new garbage_collect() method on SessionDB:

Empty sessions (0 messages) older than 24 hours → deleted
Trivial sessions (1–5 messages) older than 7 days → deleted
Sessions with >5 messages → kept indefinitely
Active (not ended) sessions → never touched
Handles child session FK constraints
Supports dry_run=True for preview, source= filtering

hermes_cli/main.py — new hermes sessions gc subcommand:

--empty-hours (default: 24)
--trivial-days (default: 7)
--trivial-max (default: 5)
--source filter
--dry-run preview mode
--yes skip confirmation

tests/test_hermes_state.py — 7 new tests covering all scenarios.

Why

Empirical audit found 67.3% of sessions have <=5 messages. 3,564 are completely empty. This reduces state.db noise and file size.

Testing

pytest tests/test_hermes_state.py::TestGarbageCollect -v
→ 7 passed
pytest tests/test_hermes_state.py::TestPruneSessions -v
→ 3 passed (no regression)

Closes #315

## What Session garbage collection for empty/trivial sessions. ## Changes **`hermes_state.py`** — new `garbage_collect()` method on `SessionDB`: - Empty sessions (0 messages) older than 24 hours → deleted - Trivial sessions (1–5 messages) older than 7 days → deleted - Sessions with >5 messages → kept indefinitely - Active (not ended) sessions → never touched - Handles child session FK constraints - Supports `dry_run=True` for preview, `source=` filtering **`hermes_cli/main.py`** — new `hermes sessions gc` subcommand: - `--empty-hours` (default: 24) - `--trivial-days` (default: 7) - `--trivial-max` (default: 5) - `--source` filter - `--dry-run` preview mode - `--yes` skip confirmation **`tests/test_hermes_state.py`** — 7 new tests covering all scenarios. ## Why Empirical audit found 67.3% of sessions have <=5 messages. 3,564 are completely empty. This reduces state.db noise and file size. ## Testing ``` pytest tests/test_hermes_state.py::TestGarbageCollect -v → 7 passed pytest tests/test_hermes_state.py::TestPruneSessions -v → 3 passed (no regression) ``` Closes #315

Rockachopa added 4 commits 2026-04-13 21:31:26 +00:00

fix(poka-yoke): add circuit breaker for error cascading (#309 )

Forge CI / smoke-and-build (pull_request) Failing after 28s

Details

110642d86a

After 3 consecutive tool errors, inject a warning into the tool result
advising the agent to switch strategies. Escalates at 6 and 9+ errors.

Empirical data from audit:
- P(error | prev error) = 58.6% vs P(error | prev success) = 25.2%
- 2.33x cascade amplification factor
- Max observed streak: 31 consecutive errors

Intervention tiers:
- 3 errors: advisory warning (try different tool, use terminal, simplify)
- 6 errors: urgent stop (halt retries, investigate or switch)
- 9+ errors: terminal-only recovery path

Tracks errors in both sequential and concurrent execution paths.

fix(poka-yoke): add tool fixation detection (#310 )

Forge CI / smoke-and-build (pull_request) Failing after 26s

Details

ec3cd2081b

Detect when the same tool is called 5+ times consecutively and inject
a nudge advising the agent to diversify its approach.

Evidence from empirical audit:
- Top marathon session (qwen, 1643 msgs): execute_code streak of 20
- Opus session (1472 msgs): terminal streak of 10

The nudge fires every 5 consecutive calls (5, 10, 15...) so it
persists without being spammy. Tracks independently in both
sequential and concurrent execution paths.

fix: gateway config debt - validation, defaults, fallback chain checks (#328 )

Forge CI / smoke-and-build (pull_request) Failing after 1m32s

Details

992498463e

- Expand validate_config_structure() to catch:
  - fallback_providers format errors (non-list, missing provider/model)
  - session_reset.idle_minutes <= 0 (causes immediate resets)
  - session_reset.at_hour out of 0-23 range
  - API_SERVER enabled without API_SERVER_KEY
  - Unknown root-level keys that look like misplaced custom_providers fields
- Add _validate_fallback_providers() in gateway/config.py to validate
  fallback chain at gateway startup (logs warnings for malformed entries)
- Add API_SERVER_KEY check in gateway config loader (warns on unauthenticated endpoint)
- Expand _KNOWN_ROOT_KEYS to include all valid top-level config sections
  (session_reset, browser, checkpoints, voice, stt, tts, etc.)
- Add 13 new tests for fallback_providers and session_reset validation
- All existing tests pass (47/47)

Closes #328

feat: session garbage collection (#315 )

Forge CI / smoke-and-build (pull_request) Failing after 14s

Details

69e10967bd

Add garbage_collect() method to SessionDB that cleans up empty and
trivial sessions based on age:
- Empty sessions (0 messages) older than 24h
- Trivial sessions (1-5 messages) older than 7 days
- Sessions with >5 messages kept indefinitely

Add `hermes sessions gc` CLI command with:
- --empty-hours (default: 24)
- --trivial-days (default: 7)
- --trivial-max (default: 5)
- --source filter
- --dry-run preview mode
- --yes skip confirmation

The dry-run flow: preview what would be deleted, ask for confirmation,
then execute. Handles child session FK constraints properly.

7 tests covering: empty/trivial deletion, active session protection,
substantial session preservation, dry-run, source filtering, and child
session handling.

Closes #315

Rockachopa referenced this pull request

2026-04-13 21:31:43 +00:00

[Infra] Session garbage collection for empty/trivial sessions #315

Timmy approved these changes 2026-04-13 22:08:55 +00:00

Dismissed

Timmy left a comment

Review: Session GC + Config Validation + Circuit Breaker (#315, #328, #309, #310)

This PR bundles four distinct features. Each is reasonable individually, but bundling them makes review and rollback harder. Consider splitting in the future.

Session Garbage Collection (#315)

Clean implementation. The GC policy (empty >24h, trivial >7d, substantial kept) is sensible. Good that active sessions are never touched.

Minor concern: The child session deletion logic deletes children of GC targets even if those children are substantial sessions (>5 messages). If a parent was empty but a child has 50 messages, the child gets deleted anyway. This may be intentional but is worth calling out -- consider checking child message counts before cascading.

Config Validation (#328)

Good additions:

fallback_providers structure validation catches a real class of user errors.
session_reset.idle_minutes positive check catches the =0 immediate-reset bug.
_KNOWN_ROOT_KEYS expansion reduces false-positive "unknown key" warnings.
API_SERVER_KEY check is a nice security nudge.

Nit: The warning message for API_SERVER_KEY references ~/.hermes/.env literally rather than using display_hermes_home(). This is the exact pattern PR #384 is trying to eliminate.

Circuit Breaker (#309) and Tool Fixation (#310)

Data-driven approach is good. The escalating warning levels (3/6/9+ errors) are well thought out.

Issue: The circuit breaker and tool fixation tracking happen in both the concurrent (_run_tool threaded) and sequential tool execution paths. In the threaded path, self._consecutive_tool_errors, self._error_streak_tool_names, self._same_tool_streak, and self._last_tool_name are all modified from multiple threads with no synchronization. This means:

Counters may increment incorrectly under concurrency
The _error_streak_tool_names list could have concurrent append/clear races

Since these are advisory nudges this is probably acceptable, but the duplicated code across both paths should ideally be extracted into helper methods.

Tests

Good coverage for GC (including child sessions, source filtering, dry-run) and config validation. No tests for the circuit breaker or tool fixation features, which would be valuable.

Verdict

Overall solid work. The main concerns are the child-cascade GC behavior and the thread-safety of circuit breaker counters, neither of which are blocking.

## Review: Session GC + Config Validation + Circuit Breaker (#315, #328, #309, #310) This PR bundles four distinct features. Each is reasonable individually, but bundling them makes review and rollback harder. Consider splitting in the future. ### Session Garbage Collection (#315) Clean implementation. The GC policy (empty >24h, trivial >7d, substantial kept) is sensible. Good that active sessions are never touched. **Minor concern**: The child session deletion logic deletes children of GC targets even if those children are *substantial* sessions (>5 messages). If a parent was empty but a child has 50 messages, the child gets deleted anyway. This may be intentional but is worth calling out -- consider checking child message counts before cascading. ### Config Validation (#328) Good additions: - `fallback_providers` structure validation catches a real class of user errors. - `session_reset.idle_minutes` positive check catches the `=0` immediate-reset bug. - `_KNOWN_ROOT_KEYS` expansion reduces false-positive "unknown key" warnings. - API_SERVER_KEY check is a nice security nudge. **Nit**: The warning message for API_SERVER_KEY references `~/.hermes/.env` literally rather than using `display_hermes_home()`. This is the exact pattern PR #384 is trying to eliminate. ### Circuit Breaker (#309) and Tool Fixation (#310) Data-driven approach is good. The escalating warning levels (3/6/9+ errors) are well thought out. **Issue**: The circuit breaker and tool fixation tracking happen in both the concurrent (`_run_tool` threaded) and sequential tool execution paths. In the threaded path, `self._consecutive_tool_errors`, `self._error_streak_tool_names`, `self._same_tool_streak`, and `self._last_tool_name` are all modified from multiple threads with no synchronization. This means: - Counters may increment incorrectly under concurrency - The `_error_streak_tool_names` list could have concurrent append/clear races Since these are advisory nudges this is probably acceptable, but the duplicated code across both paths should ideally be extracted into helper methods. ### Tests Good coverage for GC (including child sessions, source filtering, dry-run) and config validation. No tests for the circuit breaker or tool fixation features, which would be valuable. ### Verdict Overall solid work. The main concerns are the child-cascade GC behavior and the thread-safety of circuit breaker counters, neither of which are blocking.

Timmy referenced this pull request

2026-04-13 22:09:26 +00:00

fix: gateway config debt - validation, defaults, fallback chain checks (#328) #381

Timmy approved these changes 2026-04-14 00:11:53 +00:00

Timmy left a comment

Good config validation additions with solid test coverage.

The new _validate_fallback_providers() function in gateway/config.py properly validates structure at startup, and the CLI-side validation in hermes_cli/config.py mirrors it with user-friendly ConfigIssue messages. The expanded _KNOWN_ROOT_KEYS set is a welcome improvement to reduce false-positive "unknown key" warnings.

The session GC feature in hermes_state.py is well-designed: the ended_at IS NOT NULL guard prevents deleting active sessions, and the dry-run preview before actual deletion is a nice UX touch. However, the child-session deletion loop (lines ~308-317) iterates one-by-one with individual DELETE statements -- for large databases this could be slow. Consider using DELETE FROM messages WHERE session_id IN (...) with a batched approach.

The circuit breaker and tool fixation detection in run_agent.py are clever poka-yoke mechanisms. One concern: _same_tool_streak is incremented in both the concurrent _run_tool worker and the sequential path without any thread synchronization. With concurrent tool execution, this counter could have race conditions. Consider using threading.Lock or threading.atomic for the streak counter.

The bare except Exception: pass in _validate_fallback_providers() silently swallows all errors including YAML parse failures. At minimum, log at DEBUG level so users can diagnose config loading issues.

Tests are thorough and well-structured. Overall a solid PR.

**Good config validation additions with solid test coverage.** The new `_validate_fallback_providers()` function in `gateway/config.py` properly validates structure at startup, and the CLI-side validation in `hermes_cli/config.py` mirrors it with user-friendly ConfigIssue messages. The expanded `_KNOWN_ROOT_KEYS` set is a welcome improvement to reduce false-positive "unknown key" warnings. The session GC feature in `hermes_state.py` is well-designed: the `ended_at IS NOT NULL` guard prevents deleting active sessions, and the dry-run preview before actual deletion is a nice UX touch. However, the child-session deletion loop (lines ~308-317) iterates one-by-one with individual DELETE statements -- for large databases this could be slow. Consider using `DELETE FROM messages WHERE session_id IN (...)` with a batched approach. The circuit breaker and tool fixation detection in `run_agent.py` are clever poka-yoke mechanisms. One concern: `_same_tool_streak` is incremented in both the concurrent `_run_tool` worker and the sequential path without any thread synchronization. With concurrent tool execution, this counter could have race conditions. Consider using `threading.Lock` or `threading.atomic` for the streak counter. The bare `except Exception: pass` in `_validate_fallback_providers()` silently swallows all errors including YAML parse failures. At minimum, log at DEBUG level so users can diagnose config loading issues. Tests are thorough and well-structured. Overall a solid PR.

Rockachopa merged commit 37af40a38e into main

2026-04-14 00:33:17 +00:00

Rockachopa referenced this issue from a commit

2026-04-14 00:33:20 +00:00

Merge pull request 'feat: session garbage collection (#315)' (#383) from feat/315-session-gc into main

Rockachopa commented

2026-04-14 00:33:22 +00:00

Closing: stale — session GC, no clear issue or superseded. Triage by KETER — fleet-ops#143.

Sign in to join this conversation.

2 Participants

Notifications

Due Date

No due date set.

Dependencies

No dependencies set.

Reference: Timmy_Foundation/hermes-agent#383