feat: session garbage collection (#315)

Add garbage_collect() method to SessionDB that cleans up empty and trivial sessions based on age:
- Empty sessions (0 messages) older than 24h
- Trivial sessions (1-5 messages) older than 7 days
- Sessions with >5 messages kept indefinitely

Add `hermes sessions gc` CLI command with:
- --empty-hours (default: 24)
- --trivial-days (default: 7)
- --trivial-max (default: 5)
- --source filter
- --dry-run preview mode
- --yes skip confirmation

The dry-run flow: preview what would be deleted, ask for confirmation, then execute.
Handles child session FK constraints properly.

7 tests covering: empty/trivial deletion, active session protection, substantial session preservation, dry-run, source filtering, and child session handling.

Closes #315
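The age policy described above can be sketched as a pure function, which may help when reviewing the SQL in `garbage_collect()`. This is only an illustration under stated assumptions: `SessionRow` and `classify` are hypothetical names that do not exist in the codebase, and the real method selects rows with SQL rather than in Python.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SessionRow:
    """Hypothetical stand-in for one row of the sessions table."""
    id: str
    message_count: int
    started_at: float           # Unix timestamp
    ended_at: Optional[float]   # None means the session is still active


def classify(row: SessionRow, now: float, empty_hours: int = 24,
             trivial_days: int = 7, trivial_max: int = 5) -> str:
    """Return 'empty', 'trivial', or 'keep' for a single session."""
    if row.ended_at is None:
        return "keep"  # active sessions are never collected
    age = now - row.started_at
    if row.message_count == 0:
        return "empty" if age > empty_hours * 3600 else "keep"
    if 1 <= row.message_count <= trivial_max:
        return "trivial" if age > trivial_days * 86400 else "keep"
    return "keep"  # more than trivial_max messages: kept indefinitely
```

Age is measured from `started_at`, mirroring the `started_at < cutoff` comparison in the queries, so a long-running session that ended recently is still judged by when it began.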
fix: gateway config debt - validation, defaults, fallback chain checks (#328)
2026-04-13 17:30:39 -04:00 · 2026-04-13 17:29:20 -04:00 · 2026-04-13 10:16:11 -04:00 · 2026-04-13 10:12:24 -04:00
7 changed files with 595 additions and 1 deletions
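The fallback chain checks in this change follow one pattern in both `gateway/config.py` and `hermes_cli/config.py`: the value must be a list, each entry must be a dict, and each dict needs non-empty `provider` and `model` fields. A minimal sketch of that pattern, assuming a hypothetical helper `check_fallback_providers` that returns plain strings (the real code logs warnings or builds `ConfigIssue` objects instead, and the two call sites differ slightly in how they treat falsy non-list values):

```python
from typing import Any, List


def check_fallback_providers(value: Any) -> List[str]:
    """Return human-readable problems found in a fallback_providers value."""
    if value is None or value == []:
        return []  # unset or empty: fallback is simply disabled
    if not isinstance(value, list):
        return [f"fallback_providers should be a list, got {type(value).__name__}"]
    problems: List[str] = []
    for i, entry in enumerate(value):
        if not isinstance(entry, dict):
            problems.append(
                f"fallback_providers[{i}] is not a dict (got {type(entry).__name__})"
            )
            continue
        # A missing key and an empty string are treated the same way.
        for field in ("provider", "model"):
            if not entry.get(field):
                problems.append(f"fallback_providers[{i}] missing '{field}' field")
    return problems
```

Collecting every problem instead of stopping at the first lets one run report all malformed entries in the chain.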
--- a/gateway/config.py
+++ b/gateway/config.py
@@ -412,6 +412,52 @@ class GatewayConfig:
        return self.unauthorized_dm_behavior


+def _validate_fallback_providers() -> None:
+    """Validate fallback_providers from config.yaml at gateway startup.
+
+    Checks that each entry has 'provider' and 'model' fields and logs
+    warnings for malformed entries.  This catches broken fallback chains
+    before they silently degrade into no-fallback mode.
+    """
+    try:
+        _home = get_hermes_home()
+        _config_path = _home / "config.yaml"
+        if not _config_path.exists():
+            return
+        import yaml
+        with open(_config_path, encoding="utf-8") as _f:
+            _cfg = yaml.safe_load(_f) or {}
+        fbp = _cfg.get("fallback_providers")
+        if not fbp:
+            return
+        if not isinstance(fbp, list):
+            logger.warning(
+                "fallback_providers should be a YAML list, got %s. "
+                "Fallback chain will be disabled.",
+                type(fbp).__name__,
+            )
+            return
+        for i, entry in enumerate(fbp):
+            if not isinstance(entry, dict):
+                logger.warning(
+                    "fallback_providers[%d] is not a dict (got %s). Skipping entry.",
+                    i, type(entry).__name__,
+                )
+                continue
+            if not entry.get("provider"):
+                logger.warning(
+                    "fallback_providers[%d] missing 'provider' field. Skipping entry.",
+                    i,
+                )
+            if not entry.get("model"):
+                logger.warning(
+                    "fallback_providers[%d] missing 'model' field. Skipping entry.",
+                    i,
+                )
+    except Exception as exc:
+        logger.debug("fallback_providers validation skipped: %s", exc)  # Non-fatal; validation is advisory
+
+
 def load_gateway_config() -> GatewayConfig:
    """
    Load gateway configuration from multiple sources.
@@ -645,6 +691,19 @@ def load_gateway_config() -> GatewayConfig:
                platform.value, env_name,
            )

+    # Warn about API Server enabled without a key (unauthenticated endpoint)
+    if Platform.API_SERVER in config.platforms:
+        api_cfg = config.platforms[Platform.API_SERVER]
+        if api_cfg.enabled and not api_cfg.extra.get("key"):
+            logger.warning(
+                "api_server is enabled but API_SERVER_KEY is not set. "
+                "The API endpoint will run unauthenticated. "
+                "Set API_SERVER_KEY in ~/.hermes/.env to secure it.",
+            )
+
+    # Validate fallback_providers structure from config.yaml
+    _validate_fallback_providers()
+
    return config


--- a/hermes_cli/config.py
+++ b/hermes_cli/config.py
@@ -1338,6 +1338,11 @@ _KNOWN_ROOT_KEYS = {
    "fallback_providers", "credential_pool_strategies", "toolsets",
    "agent", "terminal", "display", "compression", "delegation",
    "auxiliary", "custom_providers", "memory", "gateway",
+    "session_reset", "browser", "checkpoints", "smart_model_routing",
+    "voice", "stt", "tts", "human_delay", "security", "privacy",
+    "cron", "logging", "approvals", "command_allowlist", "quick_commands",
+    "personalities", "skills", "honcho", "timezone", "discord",
+    "whatsapp", "prefill_messages_file", "file_read_max_chars",
 }

 # Valid fields inside a custom_providers list entry
@@ -1478,6 +1483,72 @@ def validate_config_structure(config: Optional[Dict[str, Any]] = None) -> List["
                f"Move '{key}' under the appropriate section",
            ))

+    # ── fallback_providers must be a list of dicts with provider + model ─
+    fbp = config.get("fallback_providers")
+    if fbp is not None:
+        if not isinstance(fbp, list):
+            issues.append(ConfigIssue(
+                "error",
+                f"fallback_providers should be a YAML list, got {type(fbp).__name__}",
+                "Change to:\n"
+                "  fallback_providers:\n"
+                "    - provider: openrouter\n"
+                "      model: google/gemini-3-flash-preview",
+            ))
+        elif fbp:
+            for i, entry in enumerate(fbp):
+                if not isinstance(entry, dict):
+                    issues.append(ConfigIssue(
+                        "warning",
+                        f"fallback_providers[{i}] is not a dict (got {type(entry).__name__})",
+                        "Each entry needs at minimum: provider, model",
+                    ))
+                    continue
+                if not entry.get("provider"):
+                    issues.append(ConfigIssue(
+                        "warning",
+                        f"fallback_providers[{i}] is missing 'provider' field — this fallback will be skipped",
+                        "Add: provider: openrouter (or another provider name)",
+                    ))
+                if not entry.get("model"):
+                    issues.append(ConfigIssue(
+                        "warning",
+                        f"fallback_providers[{i}] is missing 'model' field — this fallback will be skipped",
+                        "Add: model: google/gemini-3-flash-preview (or another model slug)",
+                    ))
+
+    # ── session_reset validation ─────────────────────────────────────────
+    session_reset = config.get("session_reset", {})
+    if isinstance(session_reset, dict):
+        idle_minutes = session_reset.get("idle_minutes")
+        if idle_minutes is not None:
+            if not isinstance(idle_minutes, (int, float)) or idle_minutes <= 0:
+                issues.append(ConfigIssue(
+                    "warning",
+                    f"session_reset.idle_minutes={idle_minutes} is invalid (must be a positive number)",
+                    "Set to a positive integer, e.g. 1440 (24 hours). Using 0 causes immediate resets.",
+                ))
+        at_hour = session_reset.get("at_hour")
+        if at_hour is not None:
+            if not isinstance(at_hour, (int, float)) or not (0 <= at_hour <= 23):
+                issues.append(ConfigIssue(
+                    "warning",
+                    f"session_reset.at_hour={at_hour} is invalid (must be 0-23)",
+                    "Set to an hour between 0 and 23, e.g. 4 for 4am",
+                ))
+
+    # ── API Server key check ─────────────────────────────────────────────
+    # If api_server is enabled via env, but no key is set, warn.
+    # This catches the "API_SERVER_KEY not configured" error from gateway logs.
+    api_server_enabled = os.getenv("API_SERVER_ENABLED", "").lower() in ("true", "1", "yes")
+    api_server_key = os.getenv("API_SERVER_KEY", "").strip()
+    if api_server_enabled and not api_server_key:
+        issues.append(ConfigIssue(
+            "warning",
+            "API_SERVER is enabled but API_SERVER_KEY is not set — the API server will run unauthenticated",
+            "Set API_SERVER_KEY in ~/.hermes/.env to secure the API endpoint",
+        ))
+
    return issues


--- a/hermes_cli/main.py
+++ b/hermes_cli/main.py
@@ -5004,7 +5004,7 @@ For more help on a command:
    # =========================================================================
    sessions_parser = subparsers.add_parser(
        "sessions",
-        help="Manage session history (list, rename, export, prune, delete)",
+        help="Manage session history (list, rename, export, prune, gc, delete)",
        description="View and manage the SQLite session store"
    )
    sessions_subparsers = sessions_parser.add_subparsers(dest="sessions_action")
@@ -5027,6 +5027,14 @@ For more help on a command:
    sessions_prune.add_argument("--source", help="Only prune sessions from this source")
    sessions_prune.add_argument("--yes", "-y", action="store_true", help="Skip confirmation")

+    sessions_gc = sessions_subparsers.add_parser("gc", help="Garbage-collect empty/trivial sessions")
+    sessions_gc.add_argument("--empty-hours", type=int, default=24, help="Delete empty (0-msg) sessions older than N hours (default: 24)")
+    sessions_gc.add_argument("--trivial-days", type=int, default=7, help="Delete trivial (1 to --trivial-max msg) sessions older than N days (default: 7)")
+    sessions_gc.add_argument("--trivial-max", type=int, default=5, help="Max messages to consider trivial (default: 5)")
+    sessions_gc.add_argument("--source", help="Only GC sessions from this source")
+    sessions_gc.add_argument("--dry-run", action="store_true", help="Show what would be deleted without deleting")
+    sessions_gc.add_argument("--yes", "-y", action="store_true", help="Skip confirmation")
+
    sessions_stats = sessions_subparsers.add_parser("stats", help="Show session store statistics")

    sessions_rename = sessions_subparsers.add_parser("rename", help="Set or change a session's title")
@@ -5196,6 +5204,49 @@ For more help on a command:
                size_mb = os.path.getsize(db_path) / (1024 * 1024)
                print(f"Database size: {size_mb:.1f} MB")

+        elif action == "gc":
+            dry_run = getattr(args, "dry_run", False)
+            if dry_run:
+                counts = db.garbage_collect(
+                    empty_older_than_hours=args.empty_hours,
+                    trivial_max_messages=args.trivial_max,
+                    trivial_older_than_days=args.trivial_days,
+                    source=args.source,
+                    dry_run=True,
+                )
+                print(f"[dry-run] Would delete {counts['total']} session(s):")
+                print(f"  Empty (0 msgs, >{args.empty_hours}h old): {counts['empty']}")
+                print(f"  Trivial (<={args.trivial_max} msgs, >{args.trivial_days}d old): {counts['trivial']}")
+            else:
+                # Preview first
+                preview = db.garbage_collect(
+                    empty_older_than_hours=args.empty_hours,
+                    trivial_max_messages=args.trivial_max,
+                    trivial_older_than_days=args.trivial_days,
+                    source=args.source,
+                    dry_run=True,
+                )
+                if preview["total"] == 0:
+                    print("Nothing to collect.")
+                else:
+                    if not args.yes:
+                        if not _confirm_prompt(
+                            f"Delete {preview['total']} session(s) "
+                            f"({preview['empty']} empty, {preview['trivial']} trivial)? [y/N] "
+                        ):
+                            print("Cancelled.")
+                            return
+                    counts = db.garbage_collect(
+                        empty_older_than_hours=args.empty_hours,
+                        trivial_max_messages=args.trivial_max,
+                        trivial_older_than_days=args.trivial_days,
+                        source=args.source,
+                        dry_run=False,
+                    )
+                    print(f"Collected {counts['total']} session(s):")
+                    print(f"  Empty: {counts['empty']}")
+                    print(f"  Trivial: {counts['trivial']}")
+
        else:
            sessions_parser.print_help()

--- a/hermes_state.py
+++ b/hermes_state.py
@@ -1303,3 +1303,78 @@ class SessionDB:
            return len(session_ids)

        return self._execute_write(_do)
+
+    def garbage_collect(
+        self,
+        empty_older_than_hours: int = 24,
+        trivial_max_messages: int = 5,
+        trivial_older_than_days: int = 7,
+        source: str = None,
+        dry_run: bool = False,
+    ) -> Dict[str, int]:
+        """Delete empty and trivial sessions based on age.
+
+        Policy (matches #315):
+        - Empty sessions (0 messages) older than ``empty_older_than_hours``
+        - Trivial sessions (1..``trivial_max_messages`` msgs) older than
+          ``trivial_older_than_days``
+        - Sessions with more than ``trivial_max_messages`` are kept indefinitely
+        - Active (not ended) sessions are never deleted
+
+        Returns a dict with counts: ``empty``, ``trivial``, ``total``.
+        """
+        now = time.time()
+        empty_cutoff = now - (empty_older_than_hours * 3600)
+        trivial_cutoff = now - (trivial_older_than_days * 86400)
+
+        def _do(conn):
+            # --- Find empty sessions ---
+            empty_q = (
+                "SELECT id FROM sessions "
+                "WHERE message_count = 0 AND started_at < ? AND ended_at IS NOT NULL"
+            )
+            params = [empty_cutoff]
+            if source:
+                empty_q += " AND source = ?"
+                params.append(source)
+            empty_ids = [r[0] for r in conn.execute(empty_q, params).fetchall()]
+
+            # --- Find trivial sessions ---
+            trivial_q = (
+                "SELECT id FROM sessions "
+                "WHERE message_count BETWEEN 1 AND ? AND started_at < ? AND ended_at IS NOT NULL"
+            )
+            t_params = [trivial_max_messages, trivial_cutoff]
+            if source:
+                trivial_q += " AND source = ?"
+                t_params.append(source)
+            trivial_ids = [r[0] for r in conn.execute(trivial_q, t_params).fetchall()]
+
+            all_ids = set(empty_ids) | set(trivial_ids)
+
+            if dry_run:
+                return {"empty": len(empty_ids), "trivial": len(trivial_ids),
+                        "total": len(all_ids)}
+
+            # --- Collect direct children to delete first (FK constraint); they go regardless of their own age or size ---
+            child_ids = set()
+            for sid in all_ids:
+                for r in conn.execute(
+                    "SELECT id FROM sessions WHERE parent_session_id = ?", (sid,)
+                ).fetchall():
+                    child_ids.add(r[0])
+
+            # Delete children
+            for cid in child_ids:
+                conn.execute("DELETE FROM messages WHERE session_id = ?", (cid,))
+                conn.execute("DELETE FROM sessions WHERE id = ?", (cid,))
+
+            # Delete targets
+            for sid in all_ids:
+                conn.execute("DELETE FROM messages WHERE session_id = ?", (sid,))
+                conn.execute("DELETE FROM sessions WHERE id = ?", (sid,))
+
+            return {"empty": len(empty_ids), "trivial": len(trivial_ids),
+                    "total": len(all_ids)}
+
+        return self._execute_write(_do)
--- a/run_agent.py
+++ b/run_agent.py
@@ -721,6 +721,19 @@ class AIAgent:
        self._current_tool: str | None = None
        self._api_call_count: int = 0

+        # Poka-yoke #309: Circuit breaker for error cascading
+        # P(error | prev was error) = 58.6% vs P(error | prev was success) = 25.2%
+        # After 3+ consecutive errors, inject guidance to break the cascade.
+        self._consecutive_tool_errors: int = 0
+        self._error_streak_tool_names: list = []  # track which tools are in the streak
+
+        # Poka-yoke #310: Tool fixation detection
+        # Marathon sessions show tool fixation - same tool called 8-25 times in a row.
+        # After 5 consecutive calls to the same tool, nudge the agent to diversify.
+        self._last_tool_name: str | None = None
+        self._same_tool_streak: int = 0
+        self._tool_fixation_threshold: int = 5
+
        # Centralized logging — agent.log (INFO+) and errors.log (WARNING+)
        # both live under ~/.hermes/logs/.  Idempotent, so gateway mode
        # (which creates a new AIAgent per message) won't duplicate handlers.
@@ -6238,6 +6251,12 @@ class AIAgent:
        def _run_tool(index, tool_call, function_name, function_args):
            """Worker function executed in a thread."""
            start = time.time()
+            # Poka-yoke #310: Tool fixation detection (concurrent path)
+            if function_name == self._last_tool_name:
+                self._same_tool_streak += 1
+            else:
+                self._last_tool_name = function_name
+                self._same_tool_streak = 1
            try:
                result = self._invoke_tool(function_name, function_args, effective_task_id, tool_call.id)
            except Exception as tool_error:
@@ -6288,6 +6307,13 @@ class AIAgent:
                if is_error:
                    result_preview = function_result[:200] if len(function_result) > 200 else function_result
                    logger.warning("Tool %s returned error (%.2fs): %s", function_name, tool_duration, result_preview)
+                    # Circuit breaker: track consecutive errors
+                    self._consecutive_tool_errors += 1
+                    self._error_streak_tool_names.append(function_name)
+                else:
+                    # Reset circuit breaker on success
+                    self._consecutive_tool_errors = 0
+                    self._error_streak_tool_names = []

                if self.tool_progress_callback:
                    try:
@@ -6331,6 +6357,41 @@ class AIAgent:
            if subdir_hints:
                function_result += subdir_hints

+            # Circuit breaker: inject warning after 3+ consecutive errors
+            if self._consecutive_tool_errors >= 3:
+                streak_info = self._error_streak_tool_names[-self._consecutive_tool_errors:]
+                unique_tools = list(dict.fromkeys(streak_info))
+                if self._consecutive_tool_errors == 3:
+                    cb_msg = (
+                        f"\n\n⚠️ CIRCUIT BREAKER: You have had {self._consecutive_tool_errors} consecutive tool errors "
+                        f"({', '.join(unique_tools)}). Errors cascade — P(error|error) is 2.33x higher than normal. "
+                        f"Consider: (1) trying a different tool type, (2) using terminal to debug, "
+                        f"(3) simplifying your approach, or (4) asking the user for guidance."
+                    )
+                    function_result += cb_msg
+                elif self._consecutive_tool_errors == 6:
+                    cb_msg = (
+                        f"\n\n🛑 CIRCUIT BREAKER: {self._consecutive_tool_errors} consecutive errors. "
+                        f"The error cascade is severe. STOP retrying the same approach. "
+                        f"Use terminal to investigate, or switch strategies entirely."
+                    )
+                    function_result += cb_msg
+                elif self._consecutive_tool_errors >= 9 and self._consecutive_tool_errors % 3 == 0:
+                    cb_msg = (
+                        f"\n\n🔴 CIRCUIT BREAKER: {self._consecutive_tool_errors} consecutive errors. "
+                        f"Terminal is your only reliable recovery path. Use it now."
+                    )
+                    function_result += cb_msg
+
+            # Poka-yoke #310: Tool fixation nudge
+            if self._same_tool_streak >= self._tool_fixation_threshold and self._same_tool_streak % self._tool_fixation_threshold == 0:
+                fixation_msg = (
+                    f"\n\n🔄 TOOL FIXATION: You have called `{function_name}` {self._same_tool_streak} times consecutively. "
+                    f"Consider: (1) trying a different tool, (2) using `terminal` to verify your approach, "
+                    f"(3) stepping back to reassess the task."
+                )
+                function_result += fixation_msg
+
            # Append tool result message in order
            tool_msg = {
                "role": "tool",
@@ -6416,6 +6477,13 @@ class AIAgent:
            self._current_tool = function_name
            self._touch_activity(f"executing tool: {function_name}")

+            # Poka-yoke #310: Tool fixation detection
+            if function_name == self._last_tool_name:
+                self._same_tool_streak += 1
+            else:
+                self._last_tool_name = function_name
+                self._same_tool_streak = 1
+
            if self.tool_progress_callback:
                try:
                    preview = _build_tool_preview(function_name, function_args)
@@ -6609,8 +6677,14 @@ class AIAgent:
            _is_error_result, _ = _detect_tool_failure(function_name, function_result)
            if _is_error_result:
                logger.warning("Tool %s returned error (%.2fs): %s", function_name, tool_duration, result_preview)
+                # Circuit breaker: track consecutive errors
+                self._consecutive_tool_errors += 1
+                self._error_streak_tool_names.append(function_name)
            else:
                logger.info("tool %s completed (%.2fs, %d chars)", function_name, tool_duration, len(function_result))
+                # Reset circuit breaker on success
+                self._consecutive_tool_errors = 0
+                self._error_streak_tool_names = []

            if self.tool_progress_callback:
                try:
@@ -6642,6 +6716,41 @@ class AIAgent:
            if subdir_hints:
                function_result += subdir_hints

+            # Circuit breaker: inject warning after 3+ consecutive errors
+            if self._consecutive_tool_errors >= 3:
+                streak_info = self._error_streak_tool_names[-self._consecutive_tool_errors:]
+                unique_tools = list(dict.fromkeys(streak_info))  # preserve order, deduplicate
+                if self._consecutive_tool_errors == 3:
+                    cb_msg = (
+                        f"\n\n⚠️ CIRCUIT BREAKER: You have had {self._consecutive_tool_errors} consecutive tool errors "
+                        f"({', '.join(unique_tools)}). Errors cascade — P(error|error) is 2.33x higher than normal. "
+                        f"Consider: (1) trying a different tool type, (2) using terminal to debug, "
+                        f"(3) simplifying your approach, or (4) asking the user for guidance."
+                    )
+                    function_result += cb_msg
+                elif self._consecutive_tool_errors == 6:
+                    cb_msg = (
+                        f"\n\n🛑 CIRCUIT BREAKER: {self._consecutive_tool_errors} consecutive errors. "
+                        f"The error cascade is severe. STOP retrying the same approach. "
+                        f"Use terminal to investigate, or switch strategies entirely."
+                    )
+                    function_result += cb_msg
+                elif self._consecutive_tool_errors >= 9 and self._consecutive_tool_errors % 3 == 0:
+                    cb_msg = (
+                        f"\n\n🔴 CIRCUIT BREAKER: {self._consecutive_tool_errors} consecutive errors. "
+                        f"Terminal is your only reliable recovery path. Use it now."
+                    )
+                    function_result += cb_msg
+
+            # Poka-yoke #310: Tool fixation nudge
+            if self._same_tool_streak >= self._tool_fixation_threshold and self._same_tool_streak % self._tool_fixation_threshold == 0:
+                fixation_msg = (
+                    f"\n\n🔄 TOOL FIXATION: You have called `{function_name}` {self._same_tool_streak} times consecutively. "
+                    f"Consider: (1) trying a different tool, (2) using `terminal` to verify your approach, "
+                    f"(3) stepping back to reassess the task."
+                )
+                function_result += fixation_msg
+
            tool_msg = {
                "role": "tool",
                "content": function_result,
--- a/tests/hermes_cli/test_config_validation.py
+++ b/tests/hermes_cli/test_config_validation.py
@@ -172,3 +172,111 @@ class TestConfigIssueDataclass:
        a = ConfigIssue("error", "msg", "hint")
        b = ConfigIssue("error", "msg", "hint")
        assert a == b
+
+
+class TestFallbackProvidersValidation:
+    """fallback_providers must be a list of dicts with provider + model."""
+
+    def test_non_list(self):
+        """fallback_providers as string should error."""
+        issues = validate_config_structure({
+            "fallback_providers": "openrouter:google/gemini-3-flash-preview",
+        })
+        errors = [i for i in issues if i.severity == "error"]
+        assert any("fallback_providers" in i.message and "list" in i.message for i in errors)
+
+    def test_dict_instead_of_list(self):
+        """fallback_providers as dict should error."""
+        issues = validate_config_structure({
+            "fallback_providers": {"provider": "openrouter", "model": "test"},
+        })
+        errors = [i for i in issues if i.severity == "error"]
+        assert any("fallback_providers" in i.message and "dict" in i.message for i in errors)
+
+    def test_entry_missing_provider(self):
+        """Entry without provider should warn."""
+        issues = validate_config_structure({
+            "fallback_providers": [{"model": "google/gemini-3-flash-preview"}],
+        })
+        assert any("missing 'provider'" in i.message for i in issues)
+
+    def test_entry_missing_model(self):
+        """Entry without model should warn."""
+        issues = validate_config_structure({
+            "fallback_providers": [{"provider": "openrouter"}],
+        })
+        assert any("missing 'model'" in i.message for i in issues)
+
+    def test_entry_not_dict(self):
+        """Non-dict entries should warn."""
+        issues = validate_config_structure({
+            "fallback_providers": ["not-a-dict"],
+        })
+        assert any("not a dict" in i.message for i in issues)
+
+    def test_valid_entries(self):
+        """Valid fallback_providers should produce no fallback-related issues."""
+        issues = validate_config_structure({
+            "fallback_providers": [
+                {"provider": "openrouter", "model": "google/gemini-3-flash-preview"},
+                {"provider": "gemini", "model": "gemini-2.5-flash"},
+            ],
+        })
+        fb_issues = [i for i in issues if "fallback_providers" in i.message]
+        assert len(fb_issues) == 0
+
+    def test_empty_list_no_issues(self):
+        """Empty list is valid (fallback disabled)."""
+        issues = validate_config_structure({
+            "fallback_providers": [],
+        })
+        fb_issues = [i for i in issues if "fallback_providers" in i.message]
+        assert len(fb_issues) == 0
+
+
+class TestSessionResetValidation:
+    """session_reset.idle_minutes must be positive."""
+
+    def test_zero_idle_minutes(self):
+        """idle_minutes=0 should warn."""
+        issues = validate_config_structure({
+            "session_reset": {"idle_minutes": 0},
+        })
+        assert any("idle_minutes=0" in i.message for i in issues)
+
+    def test_negative_idle_minutes(self):
+        """idle_minutes=-5 should warn."""
+        issues = validate_config_structure({
+            "session_reset": {"idle_minutes": -5},
+        })
+        assert any("idle_minutes=-5" in i.message for i in issues)
+
+    def test_string_idle_minutes(self):
+        """idle_minutes as string should warn."""
+        issues = validate_config_structure({
+            "session_reset": {"idle_minutes": "abc"},
+        })
+        assert any("idle_minutes=" in i.message for i in issues)
+
+    def test_valid_idle_minutes(self):
+        """Valid idle_minutes should not warn."""
+        issues = validate_config_structure({
+            "session_reset": {"idle_minutes": 1440},
+        })
+        idle_issues = [i for i in issues if "idle_minutes" in i.message]
+        assert len(idle_issues) == 0
+
+    def test_invalid_at_hour(self):
+        """at_hour=25 should warn."""
+        issues = validate_config_structure({
+            "session_reset": {"at_hour": 25},
+        })
+        assert any("at_hour=25" in i.message for i in issues)
+
+    def test_valid_at_hour(self):
+        """Valid at_hour should not warn."""
+        issues = validate_config_structure({
+            "session_reset": {"at_hour": 4},
+        })
+        hour_issues = [i for i in issues if "at_hour" in i.message]
+        assert len(hour_issues) == 0
--- a/tests/test_hermes_state.py
+++ b/tests/test_hermes_state.py
@@ -665,6 +665,127 @@ class TestPruneSessions:


 # =========================================================================
+# =========================================================================
+# Garbage Collect
+# =========================================================================
+
+class TestGarbageCollect:
+    def test_gc_deletes_empty_old_sessions(self, db):
+        """Empty sessions (0 messages) older than 24h should be deleted."""
+        db.create_session(session_id="empty_old", source="cli")
+        db.end_session("empty_old", end_reason="done")
+        db._conn.execute(
+            "UPDATE sessions SET started_at = ? WHERE id = ?",
+            (time.time() - 48 * 3600, "empty_old"),  # 48 hours ago
+        )
+        db._conn.commit()
+
+        # Recent empty session should be kept
+        db.create_session(session_id="empty_new", source="cli")
+        db.end_session("empty_new", end_reason="done")
+
+        result = db.garbage_collect()
+        assert result["empty"] == 1
+        assert result["trivial"] == 0
+        assert result["total"] == 1
+        assert db.get_session("empty_old") is None
+        assert db.get_session("empty_new") is not None
+
+    def test_gc_deletes_trivial_old_sessions(self, db):
+        """Sessions with 1-5 messages older than 7 days should be deleted."""
+        db.create_session(session_id="trivial_old", source="cli")
+        for i in range(3):
+            db.append_message("trivial_old", role="user", content=f"msg {i}")
+        db.end_session("trivial_old", end_reason="done")
+        db._conn.execute(
+            "UPDATE sessions SET started_at = ? WHERE id = ?",
+            (time.time() - 10 * 86400, "trivial_old"),  # 10 days ago
+        )
+        db._conn.commit()
+
+        result = db.garbage_collect()
+        assert result["trivial"] == 1
+        assert result["total"] == 1
+        assert db.get_session("trivial_old") is None
+
+    def test_gc_keeps_active_sessions(self, db):
+        """Active (not ended) sessions should never be deleted."""
+        db.create_session(session_id="active_old", source="cli")
+        # Backdate but don't end
+        db._conn.execute(
+            "UPDATE sessions SET started_at = ? WHERE id = ?",
+            (time.time() - 48 * 3600, "active_old"),
+        )
+        db._conn.commit()
+
+        result = db.garbage_collect()
+        assert result["total"] == 0
+        assert db.get_session("active_old") is not None
+
+    def test_gc_keeps_substantial_sessions(self, db):
+        """Sessions with >5 messages should never be deleted."""
+        db.create_session(session_id="big_old", source="cli")
+        for i in range(10):
+            db.append_message("big_old", role="user", content=f"msg {i}")
+        db.end_session("big_old", end_reason="done")
+        db._conn.execute(
+            "UPDATE sessions SET started_at = ? WHERE id = ?",
+            (time.time() - 365 * 86400, "big_old"),  # 1 year ago
+        )
+        db._conn.commit()
+
+        result = db.garbage_collect()
+        assert result["total"] == 0
+        assert db.get_session("big_old") is not None
+
+    def test_gc_dry_run_does_not_delete(self, db):
+        """dry_run=True should return counts but not delete anything."""
+        db.create_session(session_id="empty_old", source="cli")
+        db.end_session("empty_old", end_reason="done")
+        db._conn.execute(
+            "UPDATE sessions SET started_at = ? WHERE id = ?",
+            (time.time() - 48 * 3600, "empty_old"),
+        )
+        db._conn.commit()
+
+        result = db.garbage_collect(dry_run=True)
+        assert result["total"] == 1
+        assert db.get_session("empty_old") is not None  # Still exists
+
+    def test_gc_with_source_filter(self, db):
+        """source= (the CLI's --source) should only GC sessions from that source."""
+        for sid, src in [("old_cli", "cli"), ("old_tg", "telegram")]:
+            db.create_session(session_id=sid, source=src)
+            db.end_session(sid, end_reason="done")
+            db._conn.execute(
+                "UPDATE sessions SET started_at = ? WHERE id = ?",
+                (time.time() - 48 * 3600, sid),
+            )
+        db._conn.commit()
+
+        result = db.garbage_collect(source="cli")
+        assert result["total"] == 1
+        assert db.get_session("old_cli") is None
+        assert db.get_session("old_tg") is not None
+
+    def test_gc_handles_child_sessions(self, db):
+        """Child sessions should be deleted when parent is GC'd."""
+        db.create_session(session_id="parent_old", source="cli")
+        db.end_session("parent_old", end_reason="done")
+        db._conn.execute(
+            "UPDATE sessions SET started_at = ? WHERE id = ?",
+            (time.time() - 48 * 3600, "parent_old"),
+        )
+        # Create child session
+        db.create_session(session_id="child", source="cli", parent_session_id="parent_old")
+        db.end_session("child", end_reason="done")
+        db._conn.commit()
+
+        result = db.garbage_collect()
+        assert result["total"] == 1  # only the parent is counted; the child is removed with it
+        assert db.get_session("parent_old") is None
+        assert db.get_session("child") is None
+
 # Schema and WAL mode
 # =========================================================================
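The tests above pin down the selection rules: only ended sessions are candidates, empty sessions expire after 24 hours, sessions with 1-5 messages expire after 7 days, and children are removed together with their parent. The following is a minimal sketch of those rules, not the actual `SessionDB.garbage_collect()` implementation. The function name `garbage_collect_sketch`, the `SCHEMA` constant, the `ended_at` column, and the `now` parameter are assumptions made for illustration; only `started_at`, `source`, `parent_session_id`, and the result keys `empty`, `trivial`, and `total` appear in the diff.

```python
import sqlite3
import time

# Hypothetical schema; the real SessionDB tables may differ.
SCHEMA = """
CREATE TABLE sessions (
    id TEXT PRIMARY KEY,
    source TEXT,
    parent_session_id TEXT REFERENCES sessions(id),
    started_at REAL,
    ended_at REAL
);
CREATE TABLE messages (
    id INTEGER PRIMARY KEY,
    session_id TEXT REFERENCES sessions(id),
    role TEXT,
    content TEXT
);
"""


def garbage_collect_sketch(conn, empty_hours=24, trivial_days=7, trivial_max=5,
                           source=None, dry_run=False, now=None):
    """Illustrative stand-in for garbage_collect(); not the project's code."""
    now = time.time() if now is None else now
    empty_cutoff = now - empty_hours * 3600
    trivial_cutoff = now - trivial_days * 86400

    # Count messages per ended session; active sessions are never candidates.
    query = (
        "SELECT s.id, s.started_at, COUNT(m.id) "
        "FROM sessions s LEFT JOIN messages m ON m.session_id = s.id "
        "WHERE s.ended_at IS NOT NULL"
    )
    params = []
    if source is not None:
        query += " AND s.source = ?"
        params.append(source)
    query += " GROUP BY s.id"

    empty, trivial = [], []
    for sid, started_at, n_msgs in conn.execute(query, params).fetchall():
        if n_msgs == 0 and started_at < empty_cutoff:
            empty.append(sid)
        elif 1 <= n_msgs <= trivial_max and started_at < trivial_cutoff:
            trivial.append(sid)

    result = {"empty": len(empty), "trivial": len(trivial),
              "total": len(empty) + len(trivial)}
    if dry_run:
        return result  # report counts only, delete nothing

    for sid in empty + trivial:
        # Delete direct children first so the parent FK is never violated.
        children = [row[0] for row in conn.execute(
            "SELECT id FROM sessions WHERE parent_session_id = ?", (sid,))]
        for target in children + [sid]:
            conn.execute("DELETE FROM messages WHERE session_id = ?", (target,))
            conn.execute("DELETE FROM sessions WHERE id = ?", (target,))
    conn.commit()
    return result
```

The sketch only handles one level of children. A session tree deeper than that would need a recursive walk, and whether the real implementation does this is not visible in the diff.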
Author	SHA1	Message	Date
Alexander Whitestone	69e10967bd	feat: session garbage collection (#315) Add garbage_collect() method to SessionDB that cleans up empty and trivial sessions based on age: - Empty sessions (0 messages) older than 24h - Trivial sessions (1-5 messages) older than 7 days - Sessions with >5 messages kept indefinitely Add `hermes sessions gc` CLI command with: - --empty-hours (default: 24) - --trivial-days (default: 7) - --trivial-max (default: 5) - --source filter - --dry-run preview mode - --yes skip confirmation The dry-run flow: preview what would be deleted, ask for confirmation, then execute. Handles child session FK constraints properly. 7 tests covering: empty/trivial deletion, active session protection, substantial session preservation, dry-run, source filtering, and child session handling. Closes #315	2026-04-13 17:30:39 -04:00
Alexander Whitestone	992498463e	fix: gateway config debt - validation, defaults, fallback chain checks (#328) - Expand validate_config_structure() to catch: - fallback_providers format errors (non-list, missing provider/model) - session_reset.idle_minutes <= 0 (causes immediate resets) - session_reset.at_hour out of 0-23 range - API_SERVER enabled without API_SERVER_KEY - Unknown root-level keys that look like misplaced custom_providers fields - Add _validate_fallback_providers() in gateway/config.py to validate fallback chain at gateway startup (logs warnings for malformed entries) - Add API_SERVER_KEY check in gateway config loader (warns on unauthenticated endpoint) - Expand _KNOWN_ROOT_KEYS to include all valid top-level config sections (session_reset, browser, checkpoints, voice, stt, tts, etc.) - Add 13 new tests for fallback_providers and session_reset validation - All existing tests pass (47/47) Closes #328	2026-04-13 17:29:20 -04:00
Alexander Whitestone	ec3cd2081b	fix(poka-yoke): add tool fixation detection (#310) Detect when the same tool is called 5+ times consecutively and inject a nudge advising the agent to diversify its approach. Evidence from empirical audit: - Top marathon session (qwen, 1643 msgs): execute_code streak of 20 - Opus session (1472 msgs): terminal streak of 10 The nudge fires every 5 consecutive calls (5, 10, 15...) so it persists without being spammy. Tracks independently in both sequential and concurrent execution paths.	2026-04-13 10:16:11 -04:00
Alexander Whitestone	110642d86a	fix(poka-yoke): add circuit breaker for error cascading (#309) After 3 consecutive tool errors, inject a warning into the tool result advising the agent to switch strategies. Escalates at 6 and 9+ errors. Empirical data from audit: - P(error | prev error) = 58.6% vs P(error | prev success) = 25.2% - 2.33x cascade amplification factor - Max observed streak: 31 consecutive errors Intervention tiers: - 3 errors: advisory warning (try different tool, use terminal, simplify) - 6 errors: urgent stop (halt retries, investigate or switch) - 9+ errors: terminal-only recovery path Tracks errors in both sequential and concurrent execution paths.	2026-04-13 10:12:24 -04:00
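The two poka-yoke commits in the table describe threshold rules: intervention tiers at 3, 6, and 9+ consecutive tool errors, and a nudge on every fifth consecutive call to the same tool. A rough sketch of how such counters could be tracked follows. The names `error_tier`, `should_nudge`, and `StreakTracker`, and the tier labels, are hypothetical and do not come from the commits. The sketch also assumes the tier applies on every call at or above its threshold, which the commit messages do not confirm.

```python
from typing import Optional, Tuple


def error_tier(consecutive_errors: int) -> Optional[str]:
    """Map a consecutive-error count to a (hypothetical) intervention tier."""
    if consecutive_errors >= 9:
        return "terminal_only"   # 9+ errors: terminal-only recovery path
    if consecutive_errors >= 6:
        return "urgent_stop"     # 6 errors: halt retries, investigate or switch
    if consecutive_errors >= 3:
        return "advisory"        # 3 errors: advisory warning
    return None


def should_nudge(same_tool_streak: int, every: int = 5) -> bool:
    """True at streaks of 5, 10, 15, ... calls to the same tool."""
    return same_tool_streak > 0 and same_tool_streak % every == 0


class StreakTracker:
    """Tracks consecutive errors and same-tool streaks across tool calls."""

    def __init__(self) -> None:
        self.consecutive_errors = 0
        self.last_tool: Optional[str] = None
        self.same_tool_streak = 0

    def record(self, tool_name: str, is_error: bool) -> Tuple[Optional[str], bool]:
        # A success resets the error streak; an error extends it.
        self.consecutive_errors = self.consecutive_errors + 1 if is_error else 0
        # A different tool resets the fixation streak to 1.
        if tool_name == self.last_tool:
            self.same_tool_streak += 1
        else:
            self.last_tool = tool_name
            self.same_tool_streak = 1
        return error_tier(self.consecutive_errors), should_nudge(self.same_tool_streak)
```

Calling `record("terminal", True)` five times in a row yields the `"advisory"` tier from the third call onward and a nudge on the fifth.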