When stdout is closed (piped to a dead process, broken terminal),
Python raises ValueError('I/O operation on closed file'), not OSError.
_safe_print and the API error printer only caught OSError, letting the
ValueError propagate and crash the agent.
Salvaged from PR #3760 by @apexscaleai. Fixes#3534.
Co-authored-by: apexscaleai <apexscaleai@users.noreply.github.com>
Tool call previews (paths, commands, queries) were hardcoded to truncate
at 35-40 chars across CLI spinners, completion lines, and gateway progress
messages. Users could not see full file paths in tool output.
New config option: display.tool_preview_length (default 0 = no limit).
Set a positive number to truncate at that length.
Changes:
- display.py: module-level _tool_preview_max_len with getter/setter;
build_tool_preview() and get_cute_tool_message() _trunc/_path respect it
- cli.py: reads config at startup, spinner widget respects config
- gateway/run.py: reads config per-message, progress callback respects config
- run_agent.py: removed redundant 30-char quiet-mode spinner truncation
- config.py: added display.tool_preview_length to DEFAULT_CONFIG
Reported by kriskaminski
When compression creates a child session with a new session_id,
session_log_file was still pointing to the old session's JSON file.
This caused _save_session_log() to write new data to the wrong file.
Closes#3731.
Co-authored-by: kelsia14 <kelsia14@users.noreply.github.com>
Some providers (Fireworks AI) reject tools=null, and others (Anthropic)
reject tools=[]. The safest approach is to not include the key at all
when there are no tools — the OpenAI SDK treats a missing parameter as
NOT_GIVEN and omits it from the request entirely.
Inspired by PR #3736 (@kelsia14).
Extends the single fallback_model mechanism into an ordered chain.
When the primary model fails, Hermes tries each fallback provider in
sequence until one succeeds or the chain is exhausted.
Config format (new):
fallback_providers:
- provider: openrouter
model: anthropic/claude-sonnet-4
- provider: openai
model: gpt-4o
Legacy single-dict fallback_model format still works unchanged.
Key fix vs original PR: the call sites in the retry loop now use
_fallback_index < len(_fallback_chain) instead of the old one-shot
_fallback_activated guard, so the chain actually advances through
all configured providers.
Changes:
- run_agent.py: _fallback_chain list + _fallback_index replaces
one-shot _fallback_model; _try_activate_fallback() advances
through chain; failed provider resolution skips to next entry;
call sites updated to allow chain advancement
- cli.py: reads fallback_providers with legacy fallback_model compat
- gateway/run.py: same
- hermes_cli/config.py: fallback_providers: [] in DEFAULT_CONFIG
- tests: 12 new chain tests + 6 existing test fixtures updated
Co-authored-by: uzaylisak <uzaylisak@users.noreply.github.com>
When hitting rate limits (429), the agent now:
- Extracts the Retry-After header from the provider response and uses it
as the wait time instead of blind exponential backoff (capped at 120s)
- Shows rate-limit-specific messaging: 'Rate limit reached. Waiting Xs
before retry (attempt N/M)...'
- Shows a distinct exhaustion message: 'Rate limit persisted after N
retries. Please try again later.'
Non-429 errors keep the existing exponential backoff and generic messaging.
Co-authored-by: ygd58 <ygd58@users.noreply.github.com>
When a user runs 'hermes update', the Python process caches old modules
in sys.modules. After git pull updates files on disk, lazy imports of
newly-updated modules fail because they try to import display_hermes_home
from the cached (old) hermes_constants which doesn't have the function.
This specifically broke the gateway auto-restart in cmd_update — importing
hermes_cli/gateway.py triggered the top-level 'from hermes_constants
import display_hermes_home' against the cached old module. The ImportError
was silently caught, so the gateway was never restarted after update.
Users with a running gateway then hit the ImportError on their next
Telegram/Discord message when the stale gateway process lazily loaded
run_agent.py (new version) which also had the top-level import.
Fixes:
- hermes_cli/gateway.py: lazy import at call site (line 940)
- run_agent.py: lazy import at call site (line 6927)
- tools/terminal_tool.py: lazy imports at 3 call sites
- tools/tts_tool.py: static schema string (no module-level call)
- hermes_cli/auth.py: lazy import at call site (line 2024)
- hermes_cli/main.py: reload hermes_constants after git pull in cmd_update
Also fixes 4 pre-existing test failures in test_parse_env_var caused by
NameError on display_hermes_home in terminal_tool.py.
Prep for profiles: user-facing messages now use display_hermes_home() so
diagnostic output shows the correct path for each profile.
New helper: display_hermes_home() in hermes_constants.py
12 files swept, ~30 user-facing string replacements.
Includes dynamic TTS schema description.
Self-hosted Honcho on localhost doesn't require authentication, but
both the activation gates and the SDK client required an API key.
Combined fix from three contributor PRs:
- Relax all 8 activation gates to accept (api_key OR base_url) as
valid credentials (#3482 by @cameronbergh)
- Use 'local' placeholder for the SDK client when base_url points to
localhost/127.0.0.1/::1 (#3570 by @ygd58)
Files changed: run_agent.py (2 gates), cli.py (1 gate),
gateway/run.py (1 gate), honcho_integration/cli.py (2 gates),
hermes_cli/doctor.py (2 gates), honcho_integration/client.py (SDK).
Co-authored-by: cameronbergh <cameronbergh@users.noreply.github.com>
Co-authored-by: ygd58 <ygd58@users.noreply.github.com>
Co-authored-by: devorun <devorun@users.noreply.github.com>
Pasting text from rich-text editors (Google Docs, Word, etc.) can inject
lone surrogate characters (U+D800..U+DFFF) that are invalid UTF-8.
The OpenAI SDK serializes messages with ensure_ascii=False, then encodes
to UTF-8 for the HTTP body — surrogates crash this with:
UnicodeEncodeError: 'utf-8' codec can't encode character '\udce2'
Three-layer fix:
1. Primary: sanitize user_message at the top of run_conversation()
2. CLI: sanitize in chat() before appending to conversation_history
3. Safety net: catch UnicodeEncodeError in the API error handler,
sanitize the entire messages list in-place, and retry once.
Also exclude UnicodeEncodeError from is_local_validation_error
so it doesn't get classified as non-retryable.
Includes 14 new tests covering the sanitization helpers and the
integration with run_conversation().
Ollama reuses index 0 for every tool call in a parallel batch,
distinguishing them only by id. The streaming accumulator now
detects a new non-empty id at an already-active index and redirects
it to a fresh slot, preventing names and arguments from being
concatenated into a single tool call.
No-op for normal providers that use incrementing indices.
Co-authored-by: dmater01 <dmater01@users.noreply.github.com>
* fix(provider): remove MiniMax /v1→/anthropic auto-correction to allow user override
The minimax-specific auto-correction in runtime_provider.py was
preventing users from overriding to the OpenAI-compatible endpoint
via MINIMAX_BASE_URL. Users in certain regions get nginx 404 on
api.minimax.io/anthropic and need to switch to api.minimax.chat/v1.
The generic URL-suffix detection already handles /anthropic →
anthropic_messages, so the minimax-specific code was redundant for
the default path and harmful for the override path.
Now: default /anthropic URL works via generic detection, user
override to /v1 gets chat_completions mode naturally.
Closes#3546 (different approach — respects user overrides instead
of changing the default endpoint).
* fix(display): show reasoning during streaming even when tool calls suppress content
When a model generates content (containing <REASONING_SCRATCHPAD> tags)
alongside tool calls in the same API response, content deltas were
suppressed from streaming once any tool call chunk arrived. This
prevented the CLI's tag extraction from running, so reasoning was
never shown during streaming. The post-response fallback then
displayed reasoning AFTER the already-visible streamed response,
creating a confusing reversed order.
Fix: route suppressed content to stream_delta_callback even when tool
calls are present. The CLI's _stream_delta handles tag extraction —
reasoning tags are routed to the reasoning display box, while
non-reasoning text is handled by the existing stream display logic.
This ensures reasoning appears before tool execution and the final
response, matching the expected visual order.
The TOOL_USE_ENFORCEMENT_GUIDANCE injection (added in #3528) was
hardcoded to only match gpt/codex model names. This makes it a
config option so users can turn it on for any model family.
New config key: agent.tool_use_enforcement
- "auto" (default): matches gpt/codex (existing behavior)
- true: inject for all models
- false: never inject
- list of strings: custom model-name substrings to match
e.g. ["gpt", "codex", "deepseek", "qwen"]
No version bump needed — deep merge provides the default
automatically for existing installs.
12 new tests covering all config modes.
The plugin system defined six lifecycle hooks but only pre_tool_call and
post_tool_call were invoked. This activates the remaining four so that
external plugins (e.g. memory systems) can hook into the conversation
loop without touching core code.
Hook semantics:
- on_session_start: fires once when a new session is created
- pre_llm_call: fires once per turn before the tool-calling loop;
plugins can return {"context": "..."} to inject into the ephemeral
system prompt (not cached, not persisted)
- post_llm_call: fires once per turn after the loop completes, with
user_message and assistant_response for sync/storage
- on_session_end: fires at the end of every run_conversation call
invoke_hook() now returns a list of non-None callback return values,
enabling pre_llm_call context injection while remaining backward
compatible (existing hooks that return None are unaffected).
Salvaged from PR #2823.
Co-authored-by: Nicolò Boschi <boschi1997@gmail.com>
Root cause: Anthropic buffers entire tool call arguments and goes silent
for minutes while thinking (verified: 167s gap with zero SSE events on
direct API). OpenRouter's upstream proxy times out after ~125s of
inactivity and drops the connection with 'Network connection lost'.
Fix: Send the x-anthropic-beta: fine-grained-tool-streaming-2025-05-14
header for Claude models on OpenRouter. This makes Anthropic stream
tool call arguments token-by-token instead of buffering them, keeping
the connection alive through OpenRouter's proxy.
Live-tested: the exact prompt that consistently failed at ~128s now
completes successfully — 2,972 lines written, 49K tokens, 8 minutes.
Additional improvements:
1. Send explicit max_tokens for Claude through OpenRouter. Without it,
OpenRouter defaults to 65,536 (confirmed via echo_upstream_body) —
only half of Opus 4.6's 128K limit.
2. Classify SSE 'Network connection lost' as retryable in the streaming
inner retry loop. The OpenAI SDK raises APIError from SSE error
events, which was bypassing our transient error retry logic.
3. Actionable diagnostic guidance when stream-drop retries exhaust.
Cherry-pick of feat/gpt-tool-steering with modifications:
1. Tool-use enforcement prompt (refactored from GPT-specific):
- Renamed GPT_TOOL_USE_GUIDANCE -> TOOL_USE_ENFORCEMENT_GUIDANCE
- Added TOOL_USE_ENFORCEMENT_MODELS tuple: ('gpt', 'codex')
- Injection logic now checks against the tuple instead of hardcoding
'gpt' — adding new model families is a one-line change
- Addresses models describing actions instead of making tool calls
2. Budget warning history stripping:
- _strip_budget_warnings_from_history() strips _budget_warning JSON
keys and [BUDGET WARNING: ...] text from tool results at the start
of run_conversation()
- Prevents old budget warnings from poisoning subsequent turns
Based on PR #3479 by teknium1.
* fix: cap context pressure percentage at 100% in display
The forward-looking token estimate can overshoot the compaction threshold
(e.g. a large tool result pushes it from 70% to 109% in one step). The
progress bar was already capped via min(), but pct_int was not — causing
the user to see '109% to compaction' which is confusing.
Cap pct_int at 100 in both CLI and gateway display functions.
Reported by @JoshExile82.
* refactor: use real API token counts for compression decisions
Replace the rough chars/3 estimation with actual prompt_tokens +
completion_tokens from the API response. The estimation was needed to
predict whether tool results would push context past the threshold, but
the default 50% threshold leaves ample headroom — if tool results push
past it, the next API call reports real usage and triggers compression
then.
This removes all estimation from the compression and context pressure
paths, making both 100% data-driven from provider-reported token counts.
Also removes the dead _msg_count_before_tools variable.
When finish_reason='length' and the response contains only reasoning
(think blocks or empty content), the model exhausted its output token
budget on thinking with nothing left for the actual response.
Previously, this fell into either:
- chat_completions: 3 useless continuation retries (model hits same limit)
- anthropic/codex: generic 'Response truncated' error with rollback
Now: detect the think-only + length condition early and return immediately
with a targeted error message: 'Model used all output tokens on reasoning
with none left for the response. Try lowering reasoning effort or
increasing max_tokens.'
This saves 2 wasted API calls on the chat_completions path and gives
users actionable guidance instead of a cryptic error.
The existing think-only retry logic (finish_reason='stop') is unchanged —
that's a genuine model glitch where retrying can help.
The Anthropic adapter defaulted to max_tokens=16384 when no explicit value
was configured. This severely limits thinking-enabled models where thinking
tokens count toward max_tokens:
- Claude Opus 4.6 supports 128K output but was capped at 16K
- Claude Sonnet 4.6 supports 64K output but was capped at 16K
With extended thinking (adaptive or budget-based), the model could exhaust
the entire 16K on reasoning, leaving zero tokens for the actual response.
This caused two user-visible errors:
- 'Response truncated (finish_reason=length)' — thinking consumed most tokens
- 'Response only contains think block with no content' — thinking consumed all
Fix: add _ANTHROPIC_OUTPUT_LIMITS lookup table (sourced from Anthropic docs
and Cline's model catalog) and use the model's actual output limit as the
default. Unknown future models default to 128K (the current maximum).
Also adds context_length clamping: if the user configured a smaller context
window (e.g. custom endpoint), max_tokens is clamped to context_length - 1
to avoid exceeding the window.
Closes#2706
Models like GLM-5/5.1 can think for 15+ minutes. The previous 900s
(15 min) default for HERMES_API_TIMEOUT killed legitimate requests.
Raised to 1800s (30 min) in both places that read the env var:
- _build_api_kwargs() timeout (non-streaming total timeout)
- _call_chat_completions() write timeout (streaming connection)
The streaming per-chunk read timeout (60s) and stale stream detector
(180-300s) are unchanged — those are appropriate for inter-chunk timing.
Two independent bugs caused the reasoning box to appear three times when
the model produced reasoning + tool_calls:
Bug A: _build_assistant_message() re-fired reasoning_callback with the full
reasoning text even when streaming had already displayed it. The original
guard only checked structured reasoning_content deltas, but reasoning also
arrives via content tag extraction (<REASONING_SCRATCHPAD>/<think> tags
in delta.content), which went through _fire_stream_delta not
_fire_reasoning_delta. Fix: skip the callback entirely when streaming is
active — both paths display reasoning during the stream. Any reasoning not
shown during streaming is caught by the CLI post-response fallback.
Bug B: The post-response reasoning display checked _reasoning_stream_started,
but that flag was reset by _reset_stream_state() during intermediate turn
boundaries (when stream_delta_callback(None) fires between tool calls).
Introduced _reasoning_shown_this_turn flag that persists across the tool
loop and is only reset at the start of each user turn.
Live-tested in PTY: reasoning now shows exactly once per API call, no
duplicates across tool-calling loops.
The stale stream detector (90s timeout) was killing healthy connections
during the model's thinking phase, producing self-inflicted
RemoteProtocolError ("peer closed connection without sending complete
message body"). Three issues:
1. last_chunk_time was never reset between inner stream retries, so
subsequent attempts inherited the previous attempt's stale budget
2. The non-streaming fallback path didn't reset the timer either
3. 90s base timeout was too aggressive for large-context Opus sessions
where thinking time before first token routinely exceeds 90s
Fix: reset last_chunk_time at the start of each streaming attempt and
before the non-streaming fallback. Increase base timeout to 180s and
scale to 300s for >100K token contexts.
Made-with: Cursor
Three improvements salvaged from PR #3225 by Mibayy:
1. Add /resume slash command handler in CLI process_command(). The
command was registered in the commands registry but had no handler,
so typing /resume produced 'Unknown command'. The handler resolves
by title or session ID, ends the current session cleanly, loads
conversation history from SQLite, re-opens the target session, and
syncs the AIAgent instance. Follows the same pattern as new_session().
2. Add truncation guard in _save_session_log(). When resuming a session
whose messages weren't fully written to SQLite, the agent starts with
partial history and the first save would overwrite the full JSON log
on disk. The guard reads the existing file and skips the write if it
already has more messages than the current batch.
3. Add reopen_session() method to SessionDB. Proper API for clearing
ended_at/end_reason instead of reaching into _conn directly.
Note: Bug 1 from the original PR (INSERT OR IGNORE + _session_db = None)
is already fixed on main — skipped as redundant.
Closes#3123.
When _try_activate_fallback() switches to the fallback model, it
updates the agent's model/provider/client but never touches
self.context_compressor. The compressor keeps the primary model's
context_length and threshold_tokens, so compression decisions use
wrong limits — a 200K primary → 32K fallback still uses 200K-based
thresholds, causing oversized sessions to overflow the fallback.
Update the compressor's model, credentials, context_length, and
threshold_tokens after fallback activation using get_model_context_length()
for the new model.
Cherry-picked from PR #3202 by binhnt92.
Co-authored-by: binhnt92 <binhnt.ht.92@gmail.com>
* fix(gateway): silence flush agent terminal output
quiet_mode=True only suppresses AIAgent init messages.
Tool call output still leaks to the terminal through
_safe_print → _print_fn during session reset/expiry.
Since #2670 injected live memory state into the flush prompt,
the flush agent now reliably calls memory tools — making the
output leak noticeable for the first time.
Set _print_fn to a no-op so the background flush is fully silent.
* test(gateway): add test for flush agent terminal silence + fix dotenv mock
- Add TestFlushAgentSilenced: verifies _print_fn is set to a no-op on
the flush agent so tool output never leaks to the terminal
- Fix pre-existing test failures: replace patch('run_agent.AIAgent')
with sys.modules mock to avoid importing run_agent (requires openai)
- Add autouse _mock_dotenv fixture so all tests in this file run
without the dotenv package installed
* fix(display): route KawaiiSpinner output through print_fn to fully silence flush agent
The previous fix set tmp_agent._print_fn = no-op on the flush agent but
spinner output and quiet-mode cute messages bypassed _print_fn entirely:
- KawaiiSpinner captured sys.stdout at __init__ and wrote directly to it
- quiet-mode tool results used builtin print() instead of _safe_print()
Add optional print_fn parameter to KawaiiSpinner.__init__; _write routes
through it when set. Pass self._print_fn to all spinner construction sites
in run_agent.py and change the quiet-mode cute message print to _safe_print.
The existing gateway fix (tmp_agent._print_fn = lambda) now propagates
correctly through both paths.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* fix(gateway): silence hygiene and compression background agents
Two more background AIAgent instances in the gateway were created with
quiet_mode=True but without _print_fn = no-op, causing tool output to
leak to the terminal:
- _hyg_agent (in-turn hygiene memory agent)
- tmp_agent (_compress_context path)
Apply the same _print_fn no-op pattern used for the flush agent.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* chore(display): remove unused _last_flush_time from KawaiiSpinner
Attribute was set but never read; upstream already removed it.
Leftover from conflict resolution during rebase onto upstream/main.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
---------
Co-authored-by: Dilee <uzmpsk.dilekakbas@gmail.com>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
The background memory/skill review (_spawn_background_review) runs
after the agent response when turn/iteration counters exceed their
thresholds. It saves memories and skills, then prints a summary like
'💾 Memory updated · User profile updated'. In CLI mode this goes to
the terminal via _safe_print. In gateway mode, _safe_print routes to
print() which goes to stdout — invisible to the user.
Add a background_review_callback attribute to AIAgent. When set, the
background review thread calls it with the summary string after saves
complete. The gateway wires this to adapter.send() via the same
run_coroutine_threadsafe bridge used by status_callback, delivering
the notification to the user's chat.
When third-party tools (Paperclip orchestrator, etc.) spawn hermes chat
as a subprocess, their sessions pollute user session history and search.
- hermes chat --source <tag> (also HERMES_SESSION_SOURCE env var)
- exclude_sources parameter on list_sessions_rich() and search_messages()
- Sessions with source=tool hidden from sessions list/browse/search
- Third-party adapters pass --source tool to isolate agent sessions
Cherry-picked from PR #3208 by HenkDz.
Co-authored-by: Henkey <noonou7@gmail.com>
except Exception does not catch KeyboardInterrupt (inherits from
BaseException). A second Ctrl+C during exit cleanup aborts pending
writes — Honcho observations dropped, SQLite sessions left unclosed,
cron job sessions never marked ended.
Changed to except (Exception, KeyboardInterrupt) at all five sites:
- cli.py: honcho.shutdown() and end_session() in finally exit block
- run_agent.py: _flush_honcho_on_exit atexit handler
- cron/scheduler.py: end_session() and close() in job finally block
Tests exercise the actual production code paths and confirm
KeyboardInterrupt propagates without the fix.
Co-authored-by: dieutx <dangtc94@gmail.com>
* fix(session-db): survive CLI/gateway concurrent write contention
Closes#3139
Three layered fixes for the scenario where CLI and gateway write to
state.db concurrently, causing create_session() to fail with
'database is locked' and permanently disabling session_search on the
gateway side.
1. Increase SQLite connection timeout: 10s -> 30s
hermes_state.py: longer window for the WAL writer to finish a batch
flush before the other process gives up entirely.
2. INSERT OR IGNORE in create_session
hermes_state.py: prevents IntegrityError on duplicate session IDs
(e.g. gateway restarts while CLI session is still alive).
3. Don't null out _session_db on create_session failure (main fix)
run_agent.py: a transient lock at agent startup must not permanently
disable session_search for the lifetime of that agent instance.
_session_db now stays alive so subsequent flushes and searches work
once the lock clears.
4. New ensure_session() helper + call it during flush
hermes_state.py: INSERT OR IGNORE for a minimal session row.
run_agent.py _flush_messages_to_session_db: calls ensure_session()
before appending messages, so the FK constraint is satisfied even
when create_session() failed at startup. No-op when the row exists.
* fix(state): release lock between context queries in search_messages
The context-window queries (one per FTS5 match) were running inside
the same lock acquisition as the primary FTS5 query, holding the lock
for O(N) sequential SQLite round-trips. Move per-match context fetches
outside the outer lock block so each acquires the lock independently,
keeping critical sections short and allowing other threads to interleave.
* fix(session): prefer longer source in load_transcript to prevent legacy truncation
When a long-lived session pre-dates SQLite storage (e.g. sessions
created before the DB layer was introduced, or after a clean
deployment that reset the DB), _flush_messages_to_session_db only
writes the *new* messages from the current turn to SQLite — it skips
messages already present in conversation_history, assuming they are
already persisted.
That assumption fails for legacy JSONL-only sessions:
Turn N (first after DB migration):
load_transcript(id) → SQLite: 0 → falls back to JSONL: 994 ✓
_flush_messages_to_session_db: skip first 994, write 2 new → SQLite: 2
Turn N+1:
load_transcript(id) → SQLite: 2 → returns immediately ✗
Agent sees 2 messages of history instead of 996
The same pattern causes the reported symptom: session JSON truncated
to 4 messages (_save_session_log writes agent.messages which only has
2 history + 2 new = 4).
Fix: always load both sources and return whichever is longer. For a
fully-migrated session SQLite will always be ≥ JSONL, so there is no
regression. For a legacy session that hasn't been bootstrapped yet,
JSONL wins and the full history is restored.
Closes#3212
* test: add load_transcript source preference tests for #3212
Covers: JSONL longer returns JSONL, SQLite longer returns SQLite,
SQLite empty falls back to JSONL, both empty returns empty, equal
length prefers SQLite (richer reasoning fields).
---------
Co-authored-by: Mibayy <mibayy@hermes.ai>
Co-authored-by: kewe63 <kewe.3217@gmail.com>
Co-authored-by: Mibayy <mibayy@users.noreply.github.com>
Two improvements salvaged from PR #2600 (paraddox):
1. Preflight compression now counts tool schema tokens alongside system
prompt and messages. With 50+ tools enabled, schemas can add 20-30K
tokens that were previously invisible to the estimator, delaying
compression until the API rejected the request.
2. Context probe persistence guard: when the agent steps down context
tiers after a context-length error, only provider-confirmed numeric
limits (parsed from the error message) are cached to disk. Guessed
fallback tiers from get_next_probe_tier() stay in-memory only,
preventing wrong values from polluting the persistent cache.
Co-authored-by: paraddox <paraddox@users.noreply.github.com>
_mute_post_response was set True whenever a turn had both content
and tool_calls, suppressing ALL subsequent _vprint output including
tool completion messages. This meant users only saw "preparing
search_files..." but never the result.
Now only mutes output when every tool in the batch is housekeeping
(memory, todo, skill_manage, session_search). Substantive tools
like search_files, read_file, write_file, terminal etc. keep their
completion messages visible.
Also fixes: run_conversation no longer raises on max retries
(returns graceful error dict instead), and cli.py wraps the agent
thread in try/except as a safety net.
Made-with: Cursor
The non-streaming API call path (_interruptible_api_call) had no
wall-clock timeout. When providers keep connections alive with SSE
keep-alive pings but never deliver a response, httpx's inactivity
timeout never fires and the call hangs indefinitely.
Subagents always used the non-streaming path because they have no
stream consumers (quiet_mode=True). This caused delegate_task to
hang for 40+ minutes in production.
The streaming path has two layers of protection:
- httpx read timeout (60s, HERMES_STREAM_READ_TIMEOUT)
- Stale stream detection (90s, HERMES_STREAM_STALE_TIMEOUT)
Both work because streaming sends chunks continuously — a 90-second
gap between chunks genuinely means the connection is broken, even for
reasoning models that take minutes to complete.
Now run_conversation() always prefers the streaming path. The streaming
method falls back to non-streaming automatically if the provider
doesn't support it. Stream delta callbacks are no-ops when no
consumers are registered, so there's no overhead for subagents.
Add _emit_status() helper that sends lifecycle notifications to both
CLI (via _vprint force=True) and gateway (via status_callback). No
retry, fallback, or compression path is silent anymore.
Pathways surfaced:
- General retry backoff: was logger-only, now shows countdown
- Provider fallback: changed raw print() to _emit_status for gateway
- Rate limit eager fallback: new notification before switching
- Empty/malformed response fallback: new notification
- Client error fallback: new notification with HTTP status
- Max retries fallback: new notification before attempting
- Max retries giving up: upgraded from _vprint to _emit_status
- Compression retry (413 + context overflow): upgraded to _emit_status
- Compression success + retry: upgraded to _emit_status (2 instances)
Three categories of cleanup, all zero-behavioral-change:
1. F-strings without placeholders (154 fixes across 29 files)
- Converted f'...' to '...' where no {expression} was present
- Heaviest files: run_agent.py (24), cli.py (20), honcho_integration/cli.py (34)
2. Simplify defensive patterns in run_agent.py
- Added explicit self._is_anthropic_oauth = False in __init__ (before
the api_mode branch that conditionally sets it)
- Replaced 7x getattr(self, '_is_anthropic_oauth', False) with direct
self._is_anthropic_oauth (attribute always initialized now)
- Added _is_openrouter_url() and _is_anthropic_url() helper methods
- Replaced 3 inline 'openrouter' in self._base_url_lower checks
3. Remove dead code in small files
- hermes_cli/claw.py: removed unused 'total' computation
- tools/fuzzy_match.py: removed unused strip_indent() function and
pattern_stripped variable
Full test suite: 6184 passed, 0 failures
E2E PTY: banner clean, tool calls work, zero garbled ANSI
run_conversation raised the raw exception after exhausting retries,
which crashed the background thread in cli.py (unhandled exception
in Thread). Now returns a proper error result dict with failed=True
and persists the session, matching the pattern used by other error
paths (invalid responses, empty content, etc.).
Also wraps cli.py's run_agent thread function in try/except as a
safety net against any future unhandled exceptions from
run_conversation.
Made-with: Cursor
Local models (Ollama, LM Studio) embed reasoning in <think> tags in
delta.content. During streaming, _stream_delta() already displays these
blocks. Then _build_assistant_message() extracts them again and fires
reasoning_callback, causing duplicate display.
Track whether reasoning came from structured fields (reasoning_content)
vs <think> tag extraction. Only fire the callback for <think>-extracted
reasoning when stream_delta_callback is NOT active. Structured reasoning
always fires regardless.
Salvaged from PR #2076 by dusterbloom (Fix A only — Fix B was already
covered by PR #3013's _current_reasoning_callback centralization).
Closes#2069.
Three problems with API error debugging:
1. Terminal showed str(error)[:200] — raw HTML gibberish for Cloudflare
502/503 pages instead of "502 Bad Gateway"
2. errors.log dumped the entire HTML page as unstructured text
3. _dump_api_request_debug was never called when retries exhausted,
only for non-retryable 4xx errors
Adds _summarize_api_error() that extracts <title> and Cloudflare Ray ID
from HTML error pages, and falls back to SDK error body messages. Now
the terminal shows clean one-liners like:
📝 Error: HTTP 502 — openrouter.ai | 502: Bad gateway — Ray 9e226...
Also calls _dump_api_request_debug on max_retries_exhausted so the full
request context is written to ~/.hermes/sessions/ for post-mortem.
Made-with: Cursor
When fallback activates (e.g. minimax → OpenRouter), self.provider,
self.base_url, self.api_mode, and self._client_kwargs were all updated
but self.api_key was not. delegate_tool.py reads parent_agent.api_key
to pass credentials to child agents, so subagents inherited the stale
pre-fallback key (e.g. a minimax key sent to OpenRouter), causing 401
Missing Authentication errors.
Add self.api_key = ... in both the anthropic_messages and
chat_completions branches of _try_activate_fallback().
reset_session_state() was missing two fields added after it was written:
- _user_turn_count: kept accumulating across sessions, affecting
flush_min_turns guard behavior
- context_compressor._previous_summary: old session's compression
summary leaked into new session's iterative compression
Cherry-picked from PR #2640 by dusterbloom. Closes#2635.
When an API call fails, the terminal output now includes the HTTP status
code in the header line and, for 400 errors, the response body from the
provider (truncated to 300 chars). Makes it much easier to diagnose
issues like invalid model names or malformed requests that were
previously hidden behind generic error messages.
Salvaged from PR #2646 by Mibayy. Fixes#2644.
When API calls fail with HTML error pages (e.g., CloudFlare errors), the CLI
was dumping raw HTML content to users like:
📝 Error: <!DOCTYPE html><!--[if lt IE 7]> <html class="no-js ie6...
This commit adds a _clean_error_message() utility method that:
- Detects HTML content and replaces with user-friendly message
- Collapses multiline errors to single line
- Truncates overly long errors (>150 chars)
- Preserves meaningful error text for regular errors
Applied to all user-facing error displays:
- API call failure messages (line 6314)
- Interrupt error responses (line 6324)
- Invalid response error messages (line 6000)
Before: 📝 Error: <!DOCTYPE html><!--[if lt IE 7]>...
After: 📝 Error: Service temporarily unavailable (HTML error page returned)
When context overflow triggers compression, the outer retry loop
restarts via continue without incrementing retry_count. If compression
reduces messages but not enough to fit the context window, this creates
an infinite loop burning API credits: API call → overflow → compress →
retry → overflow → compress → ...
Increment retry_count on compression restarts so the loop exits after
max_retries total attempts.
Cherry-picked from PR #2766 by dieutx.
Adds a wall-clock stale stream detector (HERMES_STREAM_STALE_TIMEOUT,
default 90s) that force-closes the httpx client when no real chunks
arrive, even if SSE keep-alive pings keep the socket alive. Works
with the existing streaming retry loop to recover via fresh connection.
Made-with: Cursor
Centralizes two widely-duplicated patterns into hermes_constants.py:
1. get_hermes_home() — Path resolution for ~/.hermes (HERMES_HOME env var)
- Was copy-pasted inline across 30+ files as:
Path(os.getenv("HERMES_HOME", Path.home() / ".hermes"))
- Now defined once in hermes_constants.py (zero-dependency module)
- hermes_cli/config.py re-exports it for backward compatibility
- Removed local wrapper functions in honcho_integration/client.py,
tools/website_policy.py, tools/tirith_security.py, hermes_cli/uninstall.py
2. parse_reasoning_effort() — Reasoning effort string validation
- Was copy-pasted in cli.py, gateway/run.py, cron/scheduler.py
- Same validation logic: check against (xhigh, high, medium, low, minimal, none)
- Now defined once in hermes_constants.py, called from all 3 locations
- Warning log for unknown values kept at call sites (context-specific)
31 files changed, net +31 lines (125 insertions, 94 deletions)
Full test suite: 6179 passed, 0 failed
After streaming retries are exhausted on transient errors, fall back to
non-streaming instead of propagating the error. Also fall back for any
other pre-delivery stream error (not just 'streaming not supported').
Added user-facing message when streaming is not supported by a model/
provider, directing users to set display.streaming: false in config.yaml
to avoid the fallback delay.
Cherry-picked from PR #3008 by kshitijk4poor. Added UX message for
streaming-not-supported detection.
Co-authored-by: kshitijk4poor <kshitijk4poor@users.noreply.github.com>
- Add 'prompt exceeds max length' to context overflow detection for
Z.AI/GLM 400 errors
- Extract inline reasoning blocks from assistant content as fallback
when no structured reasoning fields are present
- Guard inline extraction so structured API reasoning takes priority
- Update test for reasoning-only response salvage behavior
Cherry-picked from PR #2993 by kshitijk4poor. Added priority guard
to fix test_structured_reasoning_takes_priority failure.
Co-authored-by: kshitijk4poor <kshitijk4poor@users.noreply.github.com>
Each subagent now gets its own IterationBudget instead of sharing the
parent's. The per-subagent cap is controlled by delegation.max_iterations
in config.yaml (default 50). Total iterations across parent + subagents
can exceed the parent's max_iterations, but the user retains control via
the config setting.
Previously, subagents shared the parent's budget, so three parallel
subagents configured for max_iterations=50 racing against a parent that
already used 60 of 90 would each only get ~10 iterations.
Inspired by PR #2928 (Bartok9) which identified the issue (#2873).