dict.get(key, default) returns None — not the default — when the key IS
present but explicitly set to null/~ in YAML. Calling .lower() on that
raises AttributeError.
Use (config.get(key) or fallback) so both missing keys and explicit nulls
coalesce to the intended default.
Files fixed:
- tools/tts_tool.py — _get_provider()
- tools/web_tools.py — _get_backend()
- tools/mcp_tool.py — MCPServerTask auth config
- trajectory_compressor.py — _detect_provider() and config loading
Co-authored-by: dieutx <dangtc94@gmail.com>
V4A patches with only + lines (no context or - lines) were silently
dropped because search_lines was empty and the 'if search_lines:' block
was the only code path. Addition-only hunks are common when the model
generates patches for new functions or blocks.
Adds an else branch that inserts at the context_hint position when
available, or appends at end of file.
Includes 2 regression tests for addition-only hunks with and without
context hints.
Salvaged from PR #3092 by thakoreh.
Co-authored-by: Hiren <hiren.thakore58@gmail.com>
* fix(gateway): add media download retry to Mattermost, Slack, and base cache
Media downloads on Mattermost and Slack fail permanently on transient
errors (timeouts, 429 rate limits, 5xx server errors). Telegram and
WhatsApp already have retry logic, but these platforms had single-attempt
downloads with hardcoded 30s timeouts.
Changes:
- base.py cache_image_from_url: add retry with exponential backoff
(covers Signal and any platform using the shared cache helper)
- mattermost.py _send_media_url: retry on 429/5xx/timeout (3 attempts)
- slack.py _download_slack_file: retry on timeout/5xx (3 attempts)
- slack.py _download_slack_file_bytes: same retry pattern
* test: add tests for media download retry
---------
Co-authored-by: dieutx <dangtc94@gmail.com>
When a new session starts in the gateway (via /new, /reset, or
auto-reset), send the user a summary of the detected configuration:
✨ Session reset! Starting fresh.
◆ Model: qwen3.5:27b-q4_K_M
◆ Provider: custom
◆ Context: 8K tokens (config)
◆ Endpoint: http://localhost:11434/v1
This makes misconfigured context length immediately visible — a user
running a local 8K model that falls to the 128K default will see:
◆ Context: 128K tokens (default — set model.context_length in config to override)
Instead of silently getting no compression and degrading responses.
- _format_session_info() resolves model, provider, context length,
and endpoint from config + runtime, matching the hygiene code's
resolution chain
- Local/custom endpoints shown; cloud endpoints hidden (not useful)
- Context source annotated: config, detected, or default with hint
- Appended to /new and /reset responses, and auto-reset notifications
- 9 tests covering all formatting paths and failure resilience
Addresses the user-facing side of #2708 — instead of trying to fix
every edge case in context detection, surface the values so users
can immediately see when something is wrong.
When user messages have empty content (e.g., Discord @mention-only
messages, unrecognized attachments), the Anthropic API rejects the
request with 'user messages must have non-empty content'.
Changes:
- anthropic_adapter.py: Add empty content validation for user messages
(string and list formats), matching the existing pattern for assistant
and tool messages. Empty content gets '(empty message)' placeholder.
- discord.py: Defense-in-depth check at gateway layer to catch empty
messages before they enter session history.
- Add 4 regression tests covering empty string, whitespace-only,
empty list, and empty text block scenarios.
Fixes#3143
Co-authored-by: Bartok9 <bartok9@users.noreply.github.com>
Two remaining CI failures:
1. agent-client-protocol 0.9.0 removed AuthMethod (replaced with
AuthMethodAgent/EnvVar/Terminal). Pin to <0.9 until the new API
is evaluated — our usage doesn't map 1:1 to the new types.
2. test_429_exhausts_all_retries_before_raising expected pytest.raises
but the agent now catches 429s after max retries, tries fallback,
then returns a result dict. Updated to check final_response.
The gateway's update_session() used += for token counts, but the cached
agent's session_prompt_tokens / session_completion_tokens are cumulative
totals that grow across messages. Each update_session call re-added the
running total, inflating usage stats with every message (1.7x after 3
messages, worse over longer conversations).
Fix: change += to = for in-memory entry fields, add set_token_counts()
to SessionDB that uses direct assignment instead of SQL increment, and
switch the gateway to call it.
CLI mode continues using update_token_counts() (increment) since it
tracks per-API-call deltas — that path is unchanged.
Based on analysis from PR #3222 by @zaycruz (closed).
Co-authored-by: zaycruz <zay@users.noreply.github.com>
signal-cli sends SSE comment lines (':') as keepalives every ~15s. The
SSE listener only counted 'data:' lines as activity, so the health
monitor reported false idle warnings every 2 minutes during quiet
periods. Recognize ':' lines as valid activity per the SSE spec.
Salvaged from PR #2938 by ticketclosed-wontfix.
The cached agent accumulates session_input_tokens across messages, so
run_conversation() returns cumulative totals. But update_session() used
+= (increment), double-counting on every message after the first.
- session.py: change in-memory entry updates from += to = (direct
assignment for cumulative values)
- hermes_state.py: add absolute=True flag to update_token_counts()
that uses SET column = ? instead of SET column = column + ?
- session.py: pass absolute=True to the DB call
CLI path is unchanged — it passes per-API-call deltas directly to
update_token_counts() with the default absolute=False (increment).
Reported by @zaycruz in #3222. Closes#3222.
Three improvements salvaged from PR #3225 by Mibayy:
1. Add /resume slash command handler in CLI process_command(). The
command was registered in the commands registry but had no handler,
so typing /resume produced 'Unknown command'. The handler resolves
by title or session ID, ends the current session cleanly, loads
conversation history from SQLite, re-opens the target session, and
syncs the AIAgent instance. Follows the same pattern as new_session().
2. Add truncation guard in _save_session_log(). When resuming a session
whose messages weren't fully written to SQLite, the agent starts with
partial history and the first save would overwrite the full JSON log
on disk. The guard reads the existing file and skips the write if it
already has more messages than the current batch.
3. Add reopen_session() method to SessionDB. Proper API for clearing
ended_at/end_reason instead of reaching into _conn directly.
Note: Bug 1 from the original PR (INSERT OR IGNORE + _session_db = None)
is already fixed on main — skipped as redundant.
Closes#3123.
The startup warning 'No user allowlists configured' only checked
GATEWAY_ALLOW_ALL_USERS and per-platform _ALLOWED_USERS vars. It
missed SIGNAL_GROUP_ALLOWED_USERS and per-platform _ALLOW_ALL_USERS
vars (e.g. TELEGRAM_ALLOW_ALL_USERS), causing a false warning even
when users had these configured. The actual auth check in
_is_user_authorized already recognized these vars.
Cherry-picked from PR #3202 by binhnt92.
Co-authored-by: binhnt92 <binhnt.ht.92@gmail.com>
_try_anthropic() caught ImportError on the module import (line 667-669)
but not on the build_anthropic_client() call (line 696). When the
anthropic_adapter module imports fine but the anthropic SDK is missing,
build_anthropic_client() raises ImportError at call time. This escaped
_try_anthropic() entirely, killing get_available_vision_backends() and
cascading to 7 test failures:
- 4 setup wizard tests hit unexpected 'Configure vision:' prompt
- 3 codex-auth-as-vision tests failed check_vision_requirements()
The fix wraps the build_anthropic_client call in try/except ImportError,
returning (None, None) when the SDK is unavailable — consistent with the
existing guard at the top of the function.
rewrite_transcript (used by /retry, /undo, /compress) was calling
append_message without reasoning, reasoning_details, or
codex_reasoning_items — permanently dropping them from SQLite.
Co-authored-by: alireza78a <alireza78.crypto@gmail.com>
When _try_activate_fallback() switches to the fallback model, it
updates the agent's model/provider/client but never touches
self.context_compressor. The compressor keeps the primary model's
context_length and threshold_tokens, so compression decisions use
wrong limits — a 200K primary → 32K fallback still uses 200K-based
thresholds, causing oversized sessions to overflow the fallback.
Update the compressor's model, credentials, context_length, and
threshold_tokens after fallback activation using get_model_context_length()
for the new model.
Cherry-picked from PR #3202 by binhnt92.
Co-authored-by: binhnt92 <binhnt.ht.92@gmail.com>
The API server adapter was creating agents without specifying
enabled_toolsets, causing ALL tools to load — including clarify,
send_message, and text_to_speech which don't work without interactive
callbacks or gateway dispatch.
Changes:
- toolsets.py: Add hermes-api-server toolset (core tools minus clarify,
send_message, text_to_speech)
- api_server.py: Resolve toolsets from config.yaml platform_toolsets
via _get_platform_tools() — same path as all other gateway platforms.
Falls back to hermes-api-server default when no override configured.
- tools_config.py: Add api_server to PLATFORMS dict so users can
customize via 'hermes tools' or platform_toolsets.api_server in
config.yaml
- 12 tests covering toolset definition, config resolution, and
user override
Reported by thatwolfieguy on Discord.
YAML 1.1 parses bare `off` as boolean False, which is falsy in
Python's `or` chain and silently falls through to the 'all' default.
Users setting `display.tool_progress: off` in config.yaml saw no
effect — tool progress stayed on.
Normalise False → 'off' before the or chain in both affected paths:
- gateway/run.py _run_agent() tool progress reader
- cli.py HermesCLI.__init__() tool_progress_mode
Reported by @gibbsoft in #2859. Closes#2859.
Two changes:
1. Fix /queue command: remove the _agent_running guard that rejected
/queue after the agent finished. The prompt was deferred in
_pending_input until the agent completed, then the handler checked
_agent_running (now False) and rejected it. /queue now always queues
regardless of timing.
2. Add display.busy_input_mode config (CLI-only):
- 'interrupt' (default): Enter while busy interrupts the current run
(preserves existing behavior)
- 'queue': Enter while busy queues the message for the next turn,
with a 'Queued for the next turn: ...' confirmation
Ctrl+C always interrupts regardless of this setting.
Salvaged from PR #3037 by StefanoChiodino. Key differences:
- Default is 'interrupt' (preserves existing behavior) not 'queue'
- No config version bump (unnecessary for new key in existing section)
- Simpler normalization (no alias map)
- /queue fix is simpler: just remove the guard instead of intercepting
commands during busy state
* fix(gateway): silence flush agent terminal output
quiet_mode=True only suppresses AIAgent init messages.
Tool call output still leaks to the terminal through
_safe_print → _print_fn during session reset/expiry.
Since #2670 injected live memory state into the flush prompt,
the flush agent now reliably calls memory tools — making the
output leak noticeable for the first time.
Set _print_fn to a no-op so the background flush is fully silent.
* test(gateway): add test for flush agent terminal silence + fix dotenv mock
- Add TestFlushAgentSilenced: verifies _print_fn is set to a no-op on
the flush agent so tool output never leaks to the terminal
- Fix pre-existing test failures: replace patch('run_agent.AIAgent')
with sys.modules mock to avoid importing run_agent (requires openai)
- Add autouse _mock_dotenv fixture so all tests in this file run
without the dotenv package installed
* fix(display): route KawaiiSpinner output through print_fn to fully silence flush agent
The previous fix set tmp_agent._print_fn = no-op on the flush agent but
spinner output and quiet-mode cute messages bypassed _print_fn entirely:
- KawaiiSpinner captured sys.stdout at __init__ and wrote directly to it
- quiet-mode tool results used builtin print() instead of _safe_print()
Add optional print_fn parameter to KawaiiSpinner.__init__; _write routes
through it when set. Pass self._print_fn to all spinner construction sites
in run_agent.py and change the quiet-mode cute message print to _safe_print.
The existing gateway fix (tmp_agent._print_fn = lambda) now propagates
correctly through both paths.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* fix(gateway): silence hygiene and compression background agents
Two more background AIAgent instances in the gateway were created with
quiet_mode=True but without _print_fn = no-op, causing tool output to
leak to the terminal:
- _hyg_agent (in-turn hygiene memory agent)
- tmp_agent (_compress_context path)
Apply the same _print_fn no-op pattern used for the flush agent.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* chore(display): remove unused _last_flush_time from KawaiiSpinner
Attribute was set but never read; upstream already removed it.
Leftover from conflict resolution during rebase onto upstream/main.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
---------
Co-authored-by: Dilee <uzmpsk.dilekakbas@gmail.com>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
The background memory/skill review (_spawn_background_review) runs
after the agent response when turn/iteration counters exceed their
thresholds. It saves memories and skills, then prints a summary like
'💾 Memory updated · User profile updated'. In CLI mode this goes to
the terminal via _safe_print. In gateway mode, _safe_print routes to
print() which goes to stdout — invisible to the user.
Add a background_review_callback attribute to AIAgent. When set, the
background review thread calls it with the summary string after saves
complete. The gateway wires this to adapter.send() via the same
run_coroutine_threadsafe bridge used by status_callback, delivering
the notification to the user's chat.
When send() fails due to a network error (ConnectError, ReadTimeout, etc.),
the failure was silently logged and the user received no feedback — appearing
as a hang. In one reported case, a user waited 1+ hour for a response that
had already been generated but failed to deliver (#2910).
Adds _send_with_retry() to BasePlatformAdapter:
- Transient errors: retry up to 2x with exponential backoff + jitter
- On exhaustion: send delivery-failure notice so user knows to retry
- Permanent errors: fall back to plain-text version (preserves existing behavior)
- SendResult.retryable flag for platform-specific transient errors
All adapters benefit automatically via BasePlatformAdapter inheritance.
Cherry-picked from PR #3108 by Mibayy.
Co-authored-by: Mibayy <mibayy@users.noreply.github.com>
shutil.get_terminal_size() can return stale/fallback values on SSH that
differ from prompt_toolkit's actual terminal width. Fragments built for
the wrong width overflow and wrap onto a second line (wrap_lines=True
default), appearing as progressively degrading duplicates.
- Read width from get_app().output.get_size().columns when inside a
prompt_toolkit TUI, falling back to shutil outside TUI context
- Add wrap_lines=False on the status bar Window as belt-and-suspenders
guard against any future width mismatch
Closes#3130
Co-authored-by: Mibayy <Mibayy@users.noreply.github.com>
Two bugs caused the OpenClaw migration during first-time setup to be
ineffective, forcing users to reconfigure everything manually:
1. The setup wizard created config.yaml with all defaults BEFORE running
the migration, then the migrator ran with overwrite=False. Every config
setting was reported as a 'conflict' against the defaults and skipped.
Fix: use overwrite=True during setup-time migration (safe because only
defaults exist at that point). The hermes claw migrate CLI command
still defaults to overwrite=False for post-setup use.
2. After migration, the full setup wizard ran all 5 sections unconditionally,
forcing the user through model/terminal/agent/messaging/tools configuration
even when those settings were just imported.
Fix: add _get_section_config_summary() and _skip_configured_section()
helpers. After migration, each section checks if it's already configured
(API keys present, non-default values, platform tokens) and offers
'Reconfigure? [y/N]' with default No. Unconfigured sections still run
normally.
Reported by Dev Bredda on social media.
When the homeserver returns an error response, matrix-nio parses it
as a SyncError return value rather than raising an exception. The sync
loop only had backoff in the except handler, so SyncError caused a
tight retry loop (~489 req/s) flooding logs and hammering the
homeserver. Check the return value and sleep 5s before retry.
Cherry-picked from PR #2937 by ticketclosed-wontfix.
Co-authored-by: ticketclosed-wontfix <ticketclosed-wontfix@users.noreply.github.com>
After a Telegram 502, _handle_polling_network_error calls updater.stop()
then start_polling(). If start_polling() also raises, the old code logged
a warning and returned — but the comment 'The next network error will
trigger another attempt' was wrong. The updater loop is dead after stop(),
so no further error callbacks ever fire. The gateway stays alive but
permanently deaf to messages.
Fix: when start_polling() fails in the except branch, schedule a new
_handle_polling_network_error task to continue the exponential backoff
retry chain. The task is tracked in _background_tasks (preventing GC).
Guarded by has_fatal_error to avoid spurious retries during shutdown.
Closes#3173.
Salvaged from PR #3177 by Mibayy.
The delegate_task tool accepts a toolsets parameter directly from the
LLM's function call arguments. When provided, these toolsets are passed
through _strip_blocked_tools but never intersected with the parent
agent's enabled_toolsets. A model can request toolsets the parent does
not have (e.g., web, browser, rl), granting the subagent tools that
were explicitly disabled for the parent.
Intersect LLM-requested toolsets with the parent's enabled set before
applying the blocked-tool filter, so subagents can only receive a
subset of the parent's tools.
Co-authored-by: dieutx <dangtc94@gmail.com>
* feat: config-gated /verbose command for messaging gateway
Add gateway_config_gate field to CommandDef, allowing cli_only commands
to be conditionally available in the gateway based on a config value.
- CommandDef gains gateway_config_gate: str | None — a config dotpath
that, when truthy, overrides cli_only for gateway surfaces
- /verbose uses gateway_config_gate='display.tool_progress_command'
- Default is off (cli_only behavior preserved)
- When enabled, /verbose cycles tool_progress mode (off/new/all/verbose)
in the gateway, saving to config.yaml — same cycle as the CLI
- Gateway helpers (help, telegram menus, slack mapping) dynamically
check config to include/exclude config-gated commands
- GATEWAY_KNOWN_COMMANDS always includes config-gated commands so
the gateway recognizes them and can respond appropriately
- Handles YAML 1.1 bool coercion (bare 'off' parses as False)
- 8 new tests for the config gate mechanism + gateway handler
* docs: document gateway_config_gate and /verbose messaging support
- AGENTS.md: add gateway_config_gate to CommandDef fields
- slash-commands.md: note /verbose can be enabled for messaging, update Notes
- configuration.md: add tool_progress_command to display section + usage note
- cli.md: cross-link to config docs for messaging enablement
- messaging/index.md: show tool_progress_command in config snippet
- plugins.md: add gateway_config_gate to register_command parameter table
Python's asyncio event loop holds only weak references to tasks.
Without a strong reference, the garbage collector can destroy a task
while it's awaiting I/O — silently dropping messages. Python 3.12+
made this more aggressive.
Audit of all gateway platform adapters found 6 untracked create_task
calls across 6 files:
Per-message tasks (tracked via _background_tasks set from base class):
- gateway/platforms/webhook.py: handle_message task
- gateway/platforms/sms.py: handle_message task
- gateway/platforms/signal.py: SSE response aclose task
Long-running infrastructure tasks (stored in named instance vars):
- gateway/platforms/slack.py: Socket Mode handler (_socket_mode_task)
- gateway/platforms/discord.py: bot client (_bot_task)
- gateway/platforms/whatsapp.py: message poll loop (_poll_task, 2 sites)
All other adapters (telegram, mattermost, matrix, email, homeassistant,
dingtalk) already tracked their tasks correctly.
Salvaged from PR #3160 by memosr — expanded from 1 file to 6.
Add timeout=30 to all bare ClientSession, IMAP4_SSL, smtplib.SMTP, and
ws_connect calls that previously had no timeout, preventing indefinite
hangs when an external server is slow or unresponsive.
Adapters hardened:
- HomeAssistant: REST + WS session creation, ws_connect handshake
- Email: all IMAP4_SSL (x2) and smtplib.SMTP (x3) calls
- Mattermost: session creation, _api_get, _api_post, _upload_file (60s)
- SMS: session creation in connect() + fallback session in send()
Salvaged from PRs #3161, #3168, #3170 (memosr) and #3201 (binhnt92).
SMS fallback ClientSession on send() also patched (missed in #3201).
Co-authored-by: memosr <memosr@users.noreply.github.com>
Co-authored-by: nguyen binh <binhnt92@users.noreply.github.com>
When third-party tools (Paperclip orchestrator, etc.) spawn hermes chat
as a subprocess, their sessions pollute user session history and search.
- hermes chat --source <tag> (also HERMES_SESSION_SOURCE env var)
- exclude_sources parameter on list_sessions_rich() and search_messages()
- Sessions with source=tool hidden from sessions list/browse/search
- Third-party adapters pass --source tool to isolate agent sessions
Cherry-picked from PR #3208 by HenkDz.
Co-authored-by: Henkey <noonou7@gmail.com>
except Exception does not catch KeyboardInterrupt (inherits from
BaseException). A second Ctrl+C during exit cleanup aborts pending
writes — Honcho observations dropped, SQLite sessions left unclosed,
cron job sessions never marked ended.
Changed to except (Exception, KeyboardInterrupt) at all five sites:
- cli.py: honcho.shutdown() and end_session() in finally exit block
- run_agent.py: _flush_honcho_on_exit atexit handler
- cron/scheduler.py: end_session() and close() in job finally block
Tests exercise the actual production code paths and confirm
KeyboardInterrupt propagates without the fix.
Co-authored-by: dieutx <dangtc94@gmail.com>
Asyncio tasks created with create_task() but never stored can be
garbage collected mid-execution. Add self._background_tasks set to
hold references, with add_done_callback cleanup. Tracks:
- /background command task
- session-reset memory flush task
- session-resume memory flush task
Cancel all pending tasks in stop().
Update test fixtures that construct GatewayRunner via object.__new__()
to include the new _background_tasks attribute.
Cherry-picked from PR #3167 by memosr. The original PR also deleted
the DM topic auto-skill loading code — that deletion was excluded
from this salvage as it removes a shipped feature (#2598).
Co-authored-by: memosr.eth <96793918+memosr@users.noreply.github.com>
detect_dangerous_command() ran regex patterns against raw command strings
without normalization, allowing bypass via Unicode fullwidth chars,
ANSI escape codes, null bytes, and 8-bit C1 controls.
Adds _normalize_command_for_detection() that:
- Strips ANSI escapes using the full ECMA-48 strip_ansi() from
tools/ansi_strip (CSI, OSC, DCS, 8-bit C1, nF sequences)
- Removes null bytes
- Normalizes Unicode via NFKC (fullwidth Latin → ASCII, etc.)
Includes 12 regression tests covering fullwidth, ANSI, C1, null byte,
and combined obfuscation bypasses.
Salvaged from PR #3089 by thakoreh — improved ANSI stripping to use
existing comprehensive strip_ansi() instead of a weaker hand-rolled
regex, and added test coverage.
Co-authored-by: Hiren <hiren.thakore58@gmail.com>
Nous Portal now passes through OpenRouter model names and routes from
there. Update the static fallback model list and auxiliary client default
to use OpenRouter-format slugs (provider/model) instead of bare names.
- _PROVIDER_MODELS['nous']: full OpenRouter catalog
- _NOUS_MODEL: google/gemini-3-flash-preview (was gemini-3-flash)
- Updated 4 test assertions for the new default model name
* fix(session-db): survive CLI/gateway concurrent write contention
Closes#3139
Three layered fixes for the scenario where CLI and gateway write to
state.db concurrently, causing create_session() to fail with
'database is locked' and permanently disabling session_search on the
gateway side.
1. Increase SQLite connection timeout: 10s -> 30s
hermes_state.py: longer window for the WAL writer to finish a batch
flush before the other process gives up entirely.
2. INSERT OR IGNORE in create_session
hermes_state.py: prevents IntegrityError on duplicate session IDs
(e.g. gateway restarts while CLI session is still alive).
3. Don't null out _session_db on create_session failure (main fix)
run_agent.py: a transient lock at agent startup must not permanently
disable session_search for the lifetime of that agent instance.
_session_db now stays alive so subsequent flushes and searches work
once the lock clears.
4. New ensure_session() helper + call it during flush
hermes_state.py: INSERT OR IGNORE for a minimal session row.
run_agent.py _flush_messages_to_session_db: calls ensure_session()
before appending messages, so the FK constraint is satisfied even
when create_session() failed at startup. No-op when the row exists.
* fix(state): release lock between context queries in search_messages
The context-window queries (one per FTS5 match) were running inside
the same lock acquisition as the primary FTS5 query, holding the lock
for O(N) sequential SQLite round-trips. Move per-match context fetches
outside the outer lock block so each acquires the lock independently,
keeping critical sections short and allowing other threads to interleave.
* fix(session): prefer longer source in load_transcript to prevent legacy truncation
When a long-lived session pre-dates SQLite storage (e.g. sessions
created before the DB layer was introduced, or after a clean
deployment that reset the DB), _flush_messages_to_session_db only
writes the *new* messages from the current turn to SQLite — it skips
messages already present in conversation_history, assuming they are
already persisted.
That assumption fails for legacy JSONL-only sessions:
Turn N (first after DB migration):
load_transcript(id) → SQLite: 0 → falls back to JSONL: 994 ✓
_flush_messages_to_session_db: skip first 994, write 2 new → SQLite: 2
Turn N+1:
load_transcript(id) → SQLite: 2 → returns immediately ✗
Agent sees 2 messages of history instead of 996
The same pattern causes the reported symptom: session JSON truncated
to 4 messages (_save_session_log writes agent.messages which only has
2 history + 2 new = 4).
Fix: always load both sources and return whichever is longer. For a
fully-migrated session SQLite will always be ≥ JSONL, so there is no
regression. For a legacy session that hasn't been bootstrapped yet,
JSONL wins and the full history is restored.
Closes#3212
* test: add load_transcript source preference tests for #3212
Covers: JSONL longer returns JSONL, SQLite longer returns SQLite,
SQLite empty falls back to JSONL, both empty returns empty, equal
length prefers SQLite (richer reasoning fields).
---------
Co-authored-by: Mibayy <mibayy@hermes.ai>
Co-authored-by: kewe63 <kewe.3217@gmail.com>
Co-authored-by: Mibayy <mibayy@users.noreply.github.com>
Validate each ZIP member's resolved path against the extraction directory
before extracting. A crafted ZIP with paths like ../../etc/passwd would
previously write outside the target directory.
Fixes#3075
Co-authored-by: Hiren <hiren.thakore58@gmail.com>
* fix(skills): reduce skills.sh resolution churn and preserve trust for wrapped identifiers
- Accept common skills.sh prefix typos (skils-sh/, skils.sh/)
- Strip skills-sh/ prefix in _resolve_trust_level() so trusted repos
stay trusted when installed through skills.sh
- Use resolved identifier (from bundle/meta) for scan_skill source
- Prefer tree search before root scan in _discover_identifier()
- Add _resolve_github_meta() consolidation for inspect flow
Cherry-picked from PR #3001 by kshitijk4poor.
* fix: restore candidate loop in SkillsShSource.fetch() for consistency
The cherry-picked PR only tried the first candidate identifier in
fetch() while inspect() (via _resolve_github_meta) tried all four.
This meant skills at repo/skills/path would be found by inspect but
missed by fetch, forcing it through the heavier _discover_identifier
flow. Restore the candidate loop so both paths behave identically.
Updated the test assertion to match.
---------
Co-authored-by: kshitijk4poor <82637225+kshitijk4poor@users.noreply.github.com>
Gateway sessions had their own inline toolset resolution that only read
platform_toolsets from config, which never includes MCP server names.
MCP tools were discovered and registered but invisible to the model.
- Replace duplicated gateway toolset resolution in _run_agent() and
_run_background_task() with calls to the shared _get_platform_tools()
- Extend _get_platform_tools() to include globally enabled MCP servers
at runtime (include_default_mcp_servers=True), while config-editing
flows use include_default_mcp_servers=False to avoid persisting
implicit MCP defaults into platform_toolsets
- Add homeassistant to PLATFORMS dict (was missing, caused KeyError)
- Fix CLI entry point to use _get_platform_tools() as well, so MCP
tools are visible in CLI mode too
- Remove redundant platform_key reassignment in _run_background_task
Co-authored-by: kshitijk4poor <kshitijk4poor@users.noreply.github.com>
Anthropic migrated their OAuth infrastructure from console.anthropic.com
to platform.claude.com (Claude Code v2.1.81+). Update _refresh_oauth_token()
to try the new endpoint first, falling back to the old one for tokens
issued before the migration.
Also switches Content-Type from application/x-www-form-urlencoded to
application/json to match current Claude Code behavior.
Salvaged from PR #2741 by kshitijk4poor.
Previously _agent_config_signature() used only the first 8 characters of
the API key, which causes false cache hits for JWT/OAuth tokens that share
a common prefix (e.g. 'eyJhbGci'). This led to cross-account cache
collisions when switching OAuth accounts in multi-user gateway deployments.
Replace the 8-char prefix with a SHA-256 hash of the full key so the
signature is unique per credential while keeping secrets out of the
cache key.
Salvaged from PR #3117 by EmpireOperating.
Co-authored-by: EmpireOperating <EmpireOperating@users.noreply.github.com>
Salvages PR #3005 by web3blind. Cherry-picked onto current main with functional skill binding and docs added.
- DM topic creation via createForumTopic (Bot API 9.4, Feb 2026)
- Config-driven topics with thread_id persistence across restarts
- Session isolation via existing build_session_key thread_id support
- auto_skill field on MessageEvent for topic-skill bindings
- Gateway auto-loads bound skill on new sessions (same as /skill commands)
- Docs: full Private Chat Topics section in Telegram messaging guide
- 20 tests (17 original + 3 for auto_skill)
Closes#2598
Co-authored-by: web3blind <web3blind@users.noreply.github.com>
Two improvements salvaged from PR #2600 (paraddox):
1. Preflight compression now counts tool schema tokens alongside system
prompt and messages. With 50+ tools enabled, schemas can add 20-30K
tokens that were previously invisible to the estimator, delaying
compression until the API rejected the request.
2. Context probe persistence guard: when the agent steps down context
tiers after a context-length error, only provider-confirmed numeric
limits (parsed from the error message) are cached to disk. Guessed
fallback tiers from get_next_probe_tier() stay in-memory only,
preventing wrong values from polluting the persistent cache.
Co-authored-by: paraddox <paraddox@users.noreply.github.com>
_send_discord(), _send_slack(), and _send_twilio() all created
aiohttp.ClientSession() without a timeout, leaving HTTP requests
able to hang indefinitely. _send_whatsapp() already used
aiohttp.ClientTimeout(total=30) — this fix applies the same
pattern consistently to all platform send functions.
- Add ClientTimeout(total=30) to _send_discord() ClientSession
- Add ClientTimeout(total=30) to _send_slack() ClientSession
- Add ClientTimeout(total=30) to _send_twilio() ClientSession
_mute_post_response was set True whenever a turn had both content
and tool_calls, suppressing ALL subsequent _vprint output including
tool completion messages. This meant users only saw "preparing
search_files..." but never the result.
Now only mutes output when every tool in the batch is housekeeping
(memory, todo, skill_manage, session_search). Substantive tools
like search_files, read_file, write_file, terminal etc. keep their
completion messages visible.
Also fixes: run_conversation no longer raises on max retries
(returns graceful error dict instead), and cli.py wraps the agent
thread in try/except as a safety net.
Made-with: Cursor
The default SOUL.md seeded for new users should match
DEFAULT_AGENT_IDENTITY — a short, neutral identity paragraph.
The elaborate voice spec (avoid lists, dialogue examples, symbol
conventions) was never intended as the default for all users.
Users who want a custom persona write their own SOUL.md.
The non-streaming API call path (_interruptible_api_call) had no
wall-clock timeout. When providers keep connections alive with SSE
keep-alive pings but never deliver a response, httpx's inactivity
timeout never fires and the call hangs indefinitely.
Subagents always used the non-streaming path because they have no
stream consumers (quiet_mode=True). This caused delegate_task to
hang for 40+ minutes in production.
The streaming path has two layers of protection:
- httpx read timeout (60s, HERMES_STREAM_READ_TIMEOUT)
- Stale stream detection (90s, HERMES_STREAM_STALE_TIMEOUT)
Both work because streaming sends chunks continuously — a 90-second
gap between chunks genuinely means the connection is broken, even for
reasoning models that take minutes to complete.
Now run_conversation() always prefers the streaming path. The streaming
method falls back to non-streaming automatically if the provider
doesn't support it. Stream delta callbacks are no-ops when no
consumers are registered, so there's no overhead for subagents.
Add _emit_status() helper that sends lifecycle notifications to both
CLI (via _vprint force=True) and gateway (via status_callback). No
retry, fallback, or compression path is silent anymore.
Pathways surfaced:
- General retry backoff: was logger-only, now shows countdown
- Provider fallback: changed raw print() to _emit_status for gateway
- Rate limit eager fallback: new notification before switching
- Empty/malformed response fallback: new notification
- Client error fallback: new notification with HTTP status
- Max retries fallback: new notification before attempting
- Max retries giving up: upgraded from _vprint to _emit_status
- Compression retry (413 + context overflow): upgraded to _emit_status
- Compression success + retry: upgraded to _emit_status (2 instances)
Three categories of cleanup, all zero-behavioral-change:
1. F-strings without placeholders (154 fixes across 29 files)
- Converted f'...' to '...' where no {expression} was present
- Heaviest files: run_agent.py (24), cli.py (20), honcho_integration/cli.py (34)
2. Simplify defensive patterns in run_agent.py
- Added explicit self._is_anthropic_oauth = False in __init__ (before
the api_mode branch that conditionally sets it)
- Replaced 7x getattr(self, '_is_anthropic_oauth', False) with direct
self._is_anthropic_oauth (attribute always initialized now)
- Added _is_openrouter_url() and _is_anthropic_url() helper methods
- Replaced 3 inline 'openrouter' in self._base_url_lower checks
3. Remove dead code in small files
- hermes_cli/claw.py: removed unused 'total' computation
- tools/fuzzy_match.py: removed unused strip_indent() function and
pattern_stripped variable
Full test suite: 6184 passed, 0 failures
E2E PTY: banner clean, tool calls work, zero garbled ANSI
run_conversation raised the raw exception after exhausting retries,
which crashed the background thread in cli.py (unhandled exception
in Thread). Now returns a proper error result dict with failed=True
and persists the session, matching the pattern used by other error
paths (invalid responses, empty content, etc.).
Also wraps cli.py's run_agent thread function in try/except as a
safety net against any future unhandled exceptions from
run_conversation.
Made-with: Cursor
Local models (Ollama, LM Studio) embed reasoning in <think> tags in
delta.content. During streaming, _stream_delta() already displays these
blocks. Then _build_assistant_message() extracts them again and fires
reasoning_callback, causing duplicate display.
Track whether reasoning came from structured fields (reasoning_content)
vs <think> tag extraction. Only fire the callback for <think>-extracted
reasoning when stream_delta_callback is NOT active. Structured reasoning
always fires regardless.
Salvaged from PR #2076 by dusterbloom (Fix A only — Fix B was already
covered by PR #3013's _current_reasoning_callback centralization).
Closes#2069.