When resolve_runtime_provider is called with requested='auto' and
auth.json has a stale active_provider (nous or openai-codex) whose
OAuth refresh token has been revoked, the AuthError now falls through
to the next provider in the chain (e.g. OpenRouter via env vars)
instead of propagating to the user as a blocking error.
When the user explicitly requested the OAuth provider, the error
still propagates so they know to re-authenticate.
Root cause: resolve_provider('auto') checks auth.json for an active
OAuth provider before checking env vars. get_nous_auth_status()
reports logged_in=True if any access_token exists (even expired),
so the Nous path is taken. resolve_nous_runtime_credentials() then
tries to refresh the token, fails with 'Refresh session has been
revoked', and the AuthError bubbles up to the CLI bold-red display.
Adds 3 tests: Nous fallthrough, Codex fallthrough, explicit-request
still raises.
Cherry-picked from PR #5580 by MestreY0d4-Uninter.
- Share parent's credential pool with child agents for key rotation
- Leasing layer spreads parallel children across keys (least-loaded)
- Thread-safe acquire_lease/release_lease in CredentialPool
- Reverted sneaked-in tool-name restoration change (kept original
getattr + isinstance guard pattern)
When streaming was enabled on the gateway, the stream consumer created a
single message at the start and kept editing it as tokens arrived. Tool
progress messages were sent as separate messages below it. Since edits
don't change message position on Telegram/Matrix/Discord, the final
response ended up stuck above all tool progress messages — users had to
scroll up past potentially dozens of tool call lines to read the answer.
The agent already sends stream_delta_callback(None) at tool boundaries
(before _execute_tool_calls). The stream consumer was ignoring this
signal. Now it treats None as a segment break: finalizes the current
message (removes cursor), resets _message_id, and the next text chunk
creates a fresh message below the tool progress messages.
Timeline before:
[msg 1: 'Let me search...' → edits → 'Here is the answer'] ← top
[msg 2: tool progress lines] ← bottom
Timeline after:
[msg 1: 'Let me search...'] ← top
[msg 2: tool progress lines]
[msg 3: 'Here is the answer'] ← bottom (visible)
Reported by SkyLinx on Discord.
When a user replies in a Slack thread where the bot has an active
conversation session, the bot now processes the message even without
an explicit @mention. This improves UX for ongoing threaded
discussions.
Changes:
- Added set_session_store() to BasePlatformAdapter for adapters to
check active sessions
- Modified SlackAdapter to detect thread replies and check if a
session exists for that thread before requiring @mentions
- Updated GatewayRunner to inject the session store into adapters
- Added comprehensive tests for the new behavior
Fixes: Thread replies without @jarvis are now processed if there is
an active session, matching user expectations for conversation flow
Add fine-grained authorization policies per Feishu group chat via
platforms.feishu.extra configuration.
- Add global bot-level admins that bypass all group restrictions
- Add per-group policies: open, allowlist, blacklist, admin_only, disabled
- Add default_group_policy fallback for chats without explicit rules
- Thread chat_id through group message gate for per-chat rule selection
- Match both open_id and user_id for backward compatibility
- Preserve existing FEISHU_ALLOWED_USERS / FEISHU_GROUP_POLICY behavior
- Add focused regression tests for all policy modes
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Consolidate coercion functions, extract loop readiness check, and deduplicate test mock setup to improve maintainability without changing behavior.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Reapply local reconnect and ping settings after the Feishu SDK refreshes its client config so user-provided websocket tuning actually takes effect.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Allow Feishu websocket keepalive timing to be configured via platform
extra config so disconnects can be detected faster in unstable networks.
New optional extra settings:
- ws_ping_interval
- ws_ping_timeout
These values are applied only when explicitly configured. Invalid values
fall back to the websocket library defaults by leaving the options unset.
This complements the reconnect timing settings added previously and helps
reduce total recovery time after network interruptions.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Allow users to configure websocket reconnect behavior via platform extra
config to reduce reconnect latency in production environments.
The official Feishu SDK defaults to:
- First reconnect: random jitter 0-30 seconds
- Subsequent retries: 120 second intervals
This can cause 20-30 second delays before reconnection after network
interruptions. This commit makes these values configurable while keeping
the SDK defaults for backward compatibility.
Configuration via ~/.hermes/config.yaml:
```yaml
platforms:
feishu:
extra:
ws_reconnect_nonce: 0 # Disable first-reconnect jitter (default: 30)
ws_reconnect_interval: 3 # Retry every 3 seconds (default: 120)
```
Invalid values (negative numbers, non-integers) fall back to SDK defaults.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
This commit fixes two critical bugs in the Feishu adapter that affect
message reliability and process lifecycle.
**Bug Fix 1: Intermittent Message Drops**
Root cause: Event handler was created once in __init__ and reused across
reconnects, causing callbacks to capture stale loop references. When the
adapter disconnected and reconnected, old callbacks continued firing with
invalid loop references, resulting in dropped messages with warnings:
"[Feishu] Dropping inbound message before adapter loop is ready"
Fix:
- Rebuild event handler on each connect (websocket/webhook)
- Clear handler on disconnect
- Ensure callbacks always capture current valid loop
- Add defensive loop.is_closed() checks with getattr for test compatibility
- Unify webhook dispatch path to use same loop checks as websocket mode
**Bug Fix 2: Process Hangs on Ctrl+C / SIGTERM**
Root cause: Feishu SDK's websocket client runs in a background thread with
an infinite _select() loop that never exits naturally. The thread was never
properly joined on disconnect, causing processes to hang indefinitely after
Ctrl+C or gateway stop commands.
Fix:
- Store reference to thread-local event loop (_ws_thread_loop)
- On disconnect, cancel all tasks in thread loop and stop it gracefully
via call_soon_threadsafe()
- Await thread future with 10s timeout
- Clean up pending tasks in thread's finally block before closing loop
- Add detailed debug logging for disconnect flow
**Additional Improvements:**
- Add regression tests for disconnect cleanup and webhook dispatch
- Ensure all event callbacks check loop readiness before dispatching
Tested on Linux with websocket mode. All Feishu tests pass.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Two issues caused Matrix E2EE to silently not work in encrypted rooms:
1. When matrix-nio is installed without the [e2e] extra (no python-olm /
libolm), nio.crypto.ENCRYPTION_ENABLED is False and client.olm is
never initialized. The adapter logged warnings but returned True from
connect(), so the bot appeared online but could never decrypt messages.
Now: check_matrix_requirements() and connect() both hard-fail with a
clear error message when MATRIX_ENCRYPTION=true but E2EE deps are
missing.
2. Without a stable device_id, the bot gets a new device identity on each
restart. Other clients see it as "unknown device" and refuse to share
Megolm session keys. Now: MATRIX_DEVICE_ID env var lets users pin a
stable device identity that persists across restarts and is passed to
nio.AsyncClient constructor + restore_login().
Changes:
- gateway/platforms/matrix.py: add _check_e2ee_deps(), hard-fail in
connect() and check_matrix_requirements(), MATRIX_DEVICE_ID support
in constructor + restore_login
- gateway/config.py: plumb MATRIX_DEVICE_ID into platform extras
- hermes_cli/config.py: add MATRIX_DEVICE_ID to OPTIONAL_ENV_VARS
Closes#3521
The _async_flush_memories() helper accepts (session_id) but both the
/new and /resume handlers passed two arguments (session_id, session_key).
The TypeError was silently swallowed at DEBUG level, so memory extraction
never ran when users typed /new or /resume.
One call site (the session expiry watcher) was already fixed in 9c96f669,
but /new and /resume were missed.
- gateway/run.py:3247 — remove stray session_key from /new handler
- gateway/run.py:4989 — remove stray session_key from /resume handler
- tests/gateway/test_resume_command.py:222 — update test assertion
Previously the scheduler checked startswith('[SILENT]'), so agents that
appended [SILENT] after an explanation (e.g. 'N items filtered.\n\n[SILENT]')
would still trigger delivery.
Change the check to 'in' so the marker is caught regardless of position.
Add test_silent_trailing_suppresses_delivery to cover this case.
- {__raw__} in webhook prompt templates dumps the full JSON payload (truncated at 4000 chars)
- _deliver_cross_platform now passes thread_id/message_thread_id from deliver_extra as metadata, enabling Telegram forum topic delivery
- Tests for both features
* feat(tools): add Firecrawl cloud browser provider
Adds Firecrawl (https://firecrawl.dev) as a cloud browser provider
alongside Browserbase and Browser Use. All browser tools route through
Firecrawl's cloud browser via CDP when selected.
- tools/browser_providers/firecrawl.py — FirecrawlProvider
- tools/browser_tool.py — register in _PROVIDER_REGISTRY
- hermes_cli/tools_config.py — add to onboarding provider picker
- hermes_cli/setup.py — add to setup summary
- hermes_cli/config.py — add FIRECRAWL_BROWSER_TTL config
- website/docs/ — browser docs and env var reference
Based on #4490 by @developersdigest.
Co-Authored-By: Developers Digest <124798203+developersdigest@users.noreply.github.com>
* refactor: simplify FirecrawlProvider.emergency_cleanup
Use self._headers() and self._api_url() instead of duplicating
env-var reads and header construction.
* fix: recognize Firecrawl in subscription browser detection
_resolve_browser_feature_state() now handles "firecrawl" as a direct
browser provider (same pattern as "browser-use"), so hermes setup
summary correctly shows "Browser Automation (Firecrawl)" instead of
misreporting as "Local browser".
Also fixes test_config_version_unchanged assertion (11 → 12).
---------
Co-authored-by: Developers Digest <124798203+developersdigest@users.noreply.github.com>
launchctl kickstart returns exit code 113 ("Could not find service") when
the plist exists but the job hasn't been bootstrapped into the runtime domain.
The existing recovery path only caught exit code 3 ("unloaded"), causing an
unhandled CalledProcessError.
Exit code 113 means the same thing practically -- the service definition needs
bootstrapping before it can be kicked. Add it to the same recovery path that
already handles exit 3, matching the existing pattern in launchd_stop().
Follow-up: add a unit test covering the 113 recovery path.
When a user runs out of OpenRouter credits and switches to Codex (or any
other provider), auxiliary tasks (compression, vision, web_extract) would
still try OpenRouter first and fail with 402. Two fixes:
1. Payment fallback in call_llm(): When a resolved provider returns HTTP 402
or a credit-related error, automatically retry with the next available
provider in the auto-detection chain. Skips the depleted provider and
tries Nous → Custom → Codex → API-key providers.
2. Remove hardcoded OpenRouter fallback: The old code fell back specifically
to OpenRouter when auto/custom resolution returned no client. Now falls
back to the full auto-detection chain, which handles any available
provider — not just OpenRouter.
Also extracts _get_provider_chain() as a shared function (replaces inline
tuple in _resolve_auto and the new fallback), built at call time so test
patches on _try_* functions remain visible.
Adds 16 tests covering _is_payment_error(), _get_provider_chain(),
_try_payment_fallback(), and call_llm() integration with 402 retry.
Centralize the skill → slash command registration that Telegram already had
in commands.py so Discord uses the exact same priority system, filtering,
and cap enforcement:
1. Core/built-in commands (never trimmed)
2. Plugin commands (never trimmed)
3. Skill commands (fill remaining slots, alphabetical, only tier trimmed)
Changes:
hermes_cli/commands.py:
- Rename _TG_NAME_LIMIT → _CMD_NAME_LIMIT (32 chars shared by both platforms)
- Rename _clamp_telegram_names → _clamp_command_names (generic)
- Extract _collect_gateway_skill_entries() — shared plugin + skill
collection with platform filtering, name sanitization, description
truncation, and cap enforcement
- Refactor telegram_menu_commands() to use the shared helper
- Add discord_skill_commands() that returns (name, desc, cmd_key) triples
- Preserve _sanitize_telegram_name() for Telegram-specific name cleaning
gateway/platforms/discord.py:
- Call discord_skill_commands() from _register_slash_commands()
- Create app_commands.Command per skill entry with cmd_key callback
- Respect 100-command global Discord limit
- Log warning when skills are skipped due to cap
Backward-compat aliases preserved for _TG_NAME_LIMIT and
_clamp_telegram_names.
Tests: 9 new tests (7 Discord + 2 backward-compat), 98 total pass.
Inspired by PR #5498 (sprmn24). Closes#5480.
The cron scheduler delivery path passed raw text including MEDIA: tags
to _send_to_platform(), so media attachments were delivered as literal
text instead of actual files. The send function already supports
media_files= but the cron path never used it.
Now calls BasePlatformAdapter.extract_media() to split media paths
from text before sending, matching the gateway's normal message flow.
Salvaged from PR #4877 by robert-hoffmann.
The Signal adapter inherited base class defaults for send_image_file(),
send_voice(), and send_video() which only sent the file path as text
(e.g. '🖼️ Image: /tmp/chart.png') instead of actually delivering the file
as a Signal attachment.
When agent responses contain MEDIA:/path/to/file tags, the gateway
media pipeline extracts them and routes through these methods by file
type. Without proper overrides, image/audio/video files were never
actually delivered to Signal users.
Extract a shared _send_attachment() helper that handles all file
validation, size checking, group/DM routing, and RPC dispatch. The four
public methods (send_document, send_image_file, send_voice, send_video)
now delegate to this helper, following the same pattern used by WhatsApp
(_send_media_to_bridge) and Discord (_send_file_attachment).
The helper also uses a single stat() call with try/except FileNotFoundError
instead of the previous exists() + stat() two-syscall pattern, eliminating
a TOCTOU race. As a bonus, send_document() now gains the 100MB size check
that was previously missing (inconsistency with send_image).
Add 25 tests covering all methods plus MEDIA: tag extraction integration,
method-override guards, and send_document's new size check.
Fixes#5105
Telegram Bot API requires command names to contain only lowercase a-z,
digits 0-9, and underscores. Skill/plugin names containing characters
like +, /, @, or . caused set_my_commands to fail with
Bot_command_invalid.
Two-layer fix:
- scan_skill_commands(): strip non-alphanumeric/non-hyphen chars from
cmd_key at source, collapse consecutive hyphens, trim edges, skip
names that sanitize to empty string
- _sanitize_telegram_name(): centralized helper used by all 3 Telegram
name generation sites (core commands, plugin commands, skill commands)
with empty-name guard at each call site
Closes#5534
OpenCode Go model names with dots (minimax-m2.7, glm-4.5, kimi-k2.5)
were being mangled to hyphens (minimax-m2-7), causing HTTP 401 errors.
Two code paths were affected:
1. model_normalize.py: opencode-go was incorrectly in DOT_TO_HYPHEN_PROVIDERS
2. run_agent.py: _anthropic_preserve_dots() did not check for opencode-go
Fix:
- Remove opencode-go from _DOT_TO_HYPHEN_PROVIDERS (dots are correct for Go)
- Add opencode-go to _anthropic_preserve_dots() provider check
- Add opencode.ai/zen/go to base_url fallback check
- Add regression tests in tests/test_model_normalize.py
Co-authored-by: jacob3712 <jacob3712@users.noreply.github.com>
Grok models (x-ai/grok-4.20-beta, grok-code-fast-1) now receive tool-use
enforcement guidance, steering them to actually call tools instead of
describing intended actions. Matches both OpenRouter (x-ai/grok-*) and
direct xAI API usage.
After restarting a service-managed gateway (systemd/launchd), the
stale-process sweep calls find_gateway_pids() which returns ALL gateway
PIDs via ps aux — including the one just spawned by the service manager.
The sweep kills it, leaving the user with a stopped gateway and a
confusing 'Restart manually' message.
Fix: add _get_service_pids() to query systemd MainPID and launchd PID
for active gateway services, then exclude those PIDs from the sweep.
Also add exclude_pids parameter to find_gateway_pids() and
kill_gateway_processes() so callers can skip known service-managed PIDs.
Adds 9 targeted tests covering:
- _get_service_pids() for systemd, launchd, empty, and zero-PID cases
- find_gateway_pids() exclude_pids filtering
- cmd_update integration: service PID not killed after restart
- cmd_update integration: manual PID killed while service PID preserved
Cron agents were burning iterations trying to use send_message (which is
disabled via messaging toolset) because their prompts said things like
'send the report to Telegram'. The scheduler handles delivery
automatically via the deliver setting, but nothing told the agent that.
Add a delivery guidance hint to _build_job_prompt alongside the existing
[SILENT] hint: tells agents their final response is auto-delivered and
they should NOT use send_message.
Before: only [SILENT] suppression hint
After: delivery guidance ('do NOT use send_message') + [SILENT] hint
Port the gateway's inactivity-based timeout pattern (PR #5389) to the
cron scheduler. The agent can now run for hours if it's actively calling
tools or receiving stream tokens — only genuine inactivity (no activity
for HERMES_CRON_TIMEOUT seconds, default 600s) triggers a timeout.
This fixes the Sunday PR scouts (openclaw, nanoclaw, ironclaw) which
all hit the hard 600s wall-clock limit while actively working.
Changes:
- Replace flat future.result(timeout=N) with a polling loop that checks
agent.get_activity_summary() every 5s (same pattern as gateway)
- Timeout error now includes diagnostic info: last activity description,
idle duration, current tool, iteration count
- HERMES_CRON_TIMEOUT=0 means unlimited (no timeout)
- Move sys.path.insert before repo-level imports to fix
ModuleNotFoundError for hermes_time on stale gateway processes
- Add time import needed by the polling loop
- Add 9 tests covering active/idle/unlimited/env-var/diagnostic scenarios
- Rename per-LLM-call hooks from pre_llm_request/post_llm_request for clarity vs pre_llm_call
- Emit summary kwargs only (counts, usage dict from normalize_usage); keep env_var_enabled for HERMES_DUMP_REQUESTS
- Add is_truthy_value/env_var_enabled to utils; wire hermes_cli.plugins._env_enabled through it
- Update Langfuse local setup doc; add scripts/langfuse_smoketest.py and optional ~/.hermes plugin tests
Made-with: Cursor
Add validate_config_structure() that catches common config.yaml mistakes:
- custom_providers as dict instead of list (missing '-' in YAML)
- fallback_model accidentally nested inside another section
- custom_providers entries missing required fields (name, base_url)
- Missing model section when custom_providers is configured
- Root-level keys that look like misplaced custom_providers fields
Surface these diagnostics at three levels:
1. Startup: print_config_warnings() runs at CLI and gateway module load,
so users see issues before hitting cryptic errors
2. Error time: 'Unknown provider' errors in auth.py and model_switch.py
now include config diagnostics with fix suggestions
3. Doctor: 'hermes doctor' shows a Config Structure section with all
issues and fix hints
Also adds a warning log in runtime_provider.py when custom_providers
is a dict (previously returned None silently).
Motivated by a Discord user who had malformed custom_providers YAML
and got only 'Unknown Provider' with no guidance on what was wrong.
17 new tests covering all validation paths.
Consolidated salvage from PRs #5301 (qaqcvc), #5339 (lance0),
#5058 and #5098 (maymuneth).
Mem0 API v2 compatibility (#5301):
- All reads use filters={user_id: ...} instead of bare user_id= kwarg
- All writes use filters with user_id + agent_id for attribution
- Response unwrapping for v2 dict format {results: [...]}
- Split _read_filters() vs _write_filters() — reads are user-scoped
only for cross-session recall, writes include agent_id
- Preserved 'hermes-user' default (no breaking change for existing users)
- Omitted run_id scoping from #5301 — cross-session memory is Mem0's
core value, session-scoping reads would defeat that purpose
Memory prefetch context fencing (#5339):
- Wraps prefetched memory in <memory-context> fenced blocks with system
note marking content as recalled context, NOT user input
- Sanitizes provider output to strip fence-escape sequences, preventing
injection where memory content breaks out of the fence
- API-call-time only — never persisted to session history
Secret redaction (#5058, #5098):
- Added prefix patterns for Groq (gsk_), Matrix (syt_), RetainDB
(retaindb_), Hindsight (hsk-), Mem0 (mem0_), ByteRover (brv_)
All 35 subprocess.run() calls in hermes_cli/gateway.py lacked timeout
parameters. If systemctl, launchctl, loginctl, wmic, or ps blocks,
hermes gateway start/stop/restart/status/install/uninstall hangs
indefinitely with no feedback.
Timeouts tiered by operation type:
- 10s: instant queries (is-active, status, list, ps, tail, journalctl)
- 30s: fast lifecycle (daemon-reload, enable, start, bootstrap, kickstart)
- 90s: graceful shutdown (stop, restart, bootout, kickstart -k) — exceeds
our TimeoutStopSec=60 to avoid premature timeout during shutdown
Special handling: _is_service_running() and launchd_status() catch
TimeoutExpired and treat it as not-running/not-loaded, consistent with
how non-zero return codes are already handled.
Inspired by PR #3732 (dlkakbs) and issue #4057 (SHL0MS).
Reimplemented on current main which has significantly changed launchctl
handling (bootout/bootstrap/kickstart vs legacy load/unload/start/stop).