Custom endpoint users (DashScope/Alibaba, Z.AI, Kimi, DeepSeek, etc.)
get wrong context lengths because their provider resolves as "openrouter"
or "custom", skipping the models.dev lookup entirely. For example,
qwen3.5-plus on DashScope falls to the generic "qwen" hardcoded default
(131K) instead of the correct 1M.
Add _infer_provider_from_url() that maps known API hostnames to their
models.dev provider IDs. When the explicit provider is generic
(openrouter/custom/empty), infer from the base URL before the models.dev
lookup. This resolves context lengths correctly for DashScope, Z.AI,
Kimi, MiniMax, DeepSeek, and Nous endpoints without requiring users to
manually set context_length in config.
Also refactors _is_known_provider_base_url() to use the same URL mapping,
removing the duplicated hostname list.
- Updated _stream_delta method in HermesCLI to handle None values, flushing the stream and resetting state for clean tool execution.
- Enhanced quiet mode handling in AIAgent to ensure proper display closure before tool execution, preventing display issues with intermediate streamed content.
These changes improve the robustness of the streaming functionality and ensure a smoother user experience during tool interactions.
Cherry-picked from PR #2169 by @0xbyt4.
1. _strip_provider_prefix: skip Ollama model:tag names (qwen:0.5b)
2. Fuzzy match: remove reverse direction that made claude-sonnet-4
resolve to 1M instead of 200K
3. _has_content_after_think_block: reuse _strip_think_blocks() to
handle all tag variants (thinking, reasoning, REASONING_SCRATCHPAD)
4. models.dev lookup: elif→if so nous provider also queries models.dev
5. Disk cache fallback: use 5-min TTL instead of full hour so network
is retried soon
6. Delegate build: wrap child construction in try/finally so
_last_resolved_tool_names is always restored on exception
Two fixes for Telegram/gateway-specific bugs:
1. Anthropic adapter: strip orphaned tool_result blocks (mirror of
existing tool_use stripping). Context compression or session
truncation can remove an assistant message containing a tool_use
while leaving the subsequent tool_result intact. Anthropic rejects
these with a 400: 'unexpected tool_use_id found in tool_result
blocks'. The adapter now collects all tool_use IDs and filters out
any tool_result blocks referencing IDs not in that set.
2. Gateway: /reset and /new now bypass the running-agent guard (like
/status already does). Previously, sending /reset while an agent
was running caused the raw text to be queued and later fed back as
a user message with the same broken history — replaying the
corrupted session instead of resetting it. Now the running agent is
interrupted, pending messages are cleared, and the reset command
dispatches immediately.
Tests updated: existing tests now include proper tool_use→tool_result
pairs; two new tests cover orphaned tool_result stripping.
Co-authored-by: Test <test@test.com>
- quickstart.md: mention context length prompt for custom endpoints,
link to configuration docs, add Ollama to provider table
- faq.md: rewrite local models section with hermes model flow and
context length prompt example, add Ollama num_ctx tip, expand
context-length-exceeded troubleshooting with detection override
options and config.yaml examples
Co-authored-by: Test <test@test.com>
* feat: context pressure warnings for CLI and gateway
User-facing notifications as context approaches the compaction threshold.
Warnings fire at 60% and 85% of the way to compaction — relative to
the configured compression threshold, not the raw context window.
CLI: Formatted line with a progress bar showing distance to compaction.
Cyan at 60% (approaching), bold yellow at 85% (imminent).
◐ context ▰▰▰▰▰▰▰▰▰▰▰▰▱▱▱▱▱▱▱▱ 60% to compaction 100k threshold (50%) · approaching compaction
⚠ context ▰▰▰▰▰▰▰▰▰▰▰▰▰▰▰▰▰▱▱▱ 85% to compaction 100k threshold (50%) · compaction imminent
Gateway: Plain-text notification sent to the user's chat via the new
status_callback mechanism (asyncio.run_coroutine_threadsafe bridge,
same pattern as step_callback).
Does NOT inject into the message stream. The LLM never sees these
warnings. Flags reset after each compaction cycle.
Files changed:
- agent/display.py — format_context_pressure(), format_context_pressure_gateway()
- run_agent.py — status_callback param, _context_50/70_warned flags,
_emit_context_pressure(), flag reset in _compress_context()
- gateway/run.py — _status_callback_sync bridge, wired to AIAgent
- tests/test_context_pressure.py — 23 tests
* Merge remote-tracking branch 'origin/main' into hermes/hermes-7ea545bf
---------
Co-authored-by: Test <test@test.com>
Matrix is a supported gateway platform but was missing from the
cron scheduler's delivery platform_map, causing cron job results
to silently fail delivery when targeting Matrix rooms.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Replace the fragile hardcoded context length system with a multi-source
resolution chain that correctly identifies context windows per provider.
Key changes:
- New agent/models_dev.py: Fetches and caches the models.dev registry
(3800+ models across 100+ providers with per-provider context windows).
In-memory cache (1hr TTL) + disk cache for cold starts.
- Rewritten get_model_context_length() resolution chain:
0. Config override (model.context_length)
1. Custom providers per-model context_length
2. Persistent disk cache
3. Endpoint /models (local servers)
4. Anthropic /v1/models API (max_input_tokens, API-key only)
5. OpenRouter live API (existing, unchanged)
6. Nous suffix-match via OpenRouter (dot/dash normalization)
7. models.dev registry lookup (provider-aware)
8. Thin hardcoded defaults (broad family patterns)
9. 128K fallback (was 2M)
- Provider-aware context: same model now correctly resolves to different
context windows per provider (e.g. claude-opus-4.6: 1M on Anthropic,
128K on GitHub Copilot). Provider name flows through ContextCompressor.
- DEFAULT_CONTEXT_LENGTHS shrunk from 80+ entries to ~16 broad patterns.
models.dev replaces the per-model hardcoding.
- CONTEXT_PROBE_TIERS changed from [2M, 1M, 512K, 200K, 128K, 64K, 32K]
to [128K, 64K, 32K, 16K, 8K]. Unknown models no longer start at 2M.
- hermes model: prompts for context_length when configuring custom
endpoints. Supports shorthand (32k, 128K). Saved to custom_providers
per-model config.
- custom_providers schema extended with optional models dict for
per-model context_length (backward compatible).
- Nous Portal: suffix-matches bare IDs (claude-opus-4-6) against
OpenRouter's prefixed IDs (anthropic/claude-opus-4.6) with dot/dash
normalization. Handles all 15 current Nous models.
- Anthropic direct: queries /v1/models for max_input_tokens. Only works
with regular API keys (sk-ant-api*), not OAuth tokens. Falls through
to models.dev for OAuth users.
Tests: 5574 passed (18 new tests for models_dev + updated probe tiers)
Docs: Updated configuration.md context length section, AGENTS.md
Co-authored-by: Test <test@test.com>
Cron jobs run unattended with no user present. Previously the agent had
send_message and clarify tools available, which makes no sense — the
final response is auto-delivered, and there's nobody to ask questions to.
Changes:
- Disable messaging and clarify toolsets for cron agent sessions
- Update cron platform hint to emphasize autonomous execution: no user
present, cannot ask questions, must execute fully and make decisions
- Update cronjob tool schema description to match (remove stale
send_message guidance)
When streaming was enabled, two visual feedback mechanisms were
completely suppressed:
1. The thinking spinner (TUI toolbar) was skipped because the entire
spinner block was gated on 'not self._has_stream_consumers()'.
Now the thinking_callback fires in streaming mode too — the
raw KawaiiSpinner is still skipped (would conflict with streamed
tokens) but the TUI toolbar widget works fine alongside streaming.
2. Tool progress lines (the ┊ feed) were invisible because _vprint
was blanket-suppressed when stream consumers existed. But during
tool execution, no tokens are actively streaming, so printing is
safe. Added an _executing_tools flag that _vprint respects to
allow output during tool execution even with stream consumers
registered.
Based on PR #1859 by @magi-morph (too stale to cherry-pick, reimplemented).
GPT-5.x models reject tool calls + reasoning_effort on
/v1/chat/completions with a 400 error directing to /v1/responses.
This auto-detects api.openai.com in the base URL and switches to
codex_responses mode in three places:
- AIAgent.__init__: upgrades chat_completions → codex_responses
- _try_activate_fallback(): same routing for fallback model
- runtime_provider.py: _detect_api_mode_for_url() for both custom
provider and openrouter runtime resolution paths
Also extracts _is_direct_openai_url() helper to replace the inline
check in _max_tokens_param().
Support Signal 'Note to Self' messages in single-number setups where
signal-cli is linked as a secondary device on the user's own account.
syncMessage.sentMessage envelopes addressed to the bot's own account
are now promoted to dataMessage for normal processing, while other
sync events (read receipts, typing, etc.) are still filtered.
Echo-back prevention mirrors the WhatsApp bridge pattern:
- Track timestamps of recently sent messages (bounded set of 50)
- When a Note to Self sync arrives, check if its timestamp matches
a recent outbound — skip if so (agent echo-back)
- Only process sync messages that are genuinely user-initiated
Based on PR #2115 by @Stonelinks with added echo-back protection.
* fix: preserve Ollama model:tag colons in context length detection
The colon-split logic in get_model_context_length() and
_query_local_context_length() assumed any colon meant provider:model
format (e.g. "local:my-model"). But Ollama uses model:tag format
(e.g. "qwen3.5:27b"), so the split turned "qwen3.5:27b" into just
"27b" — which matches nothing, causing a fallback to the 2M token
probe tier.
Now only recognised provider prefixes (local, openrouter, anthropic,
etc.) are stripped. Ollama model:tag names pass through intact.
* fix: update claude-opus-4-6 and claude-sonnet-4-6 context length from 200K to 1M
Both models support 1,000,000 token context windows. The hardcoded defaults
were set before Anthropic expanded the context for the 4.6 generation.
Verified via models.dev and OpenRouter API data.
---------
Co-authored-by: kshitijk4poor <82637225+kshitijk4poor@users.noreply.github.com>
Co-authored-by: Test <test@test.com>
Cherry-picked from PR #2120 by @unclebumpy.
- from_env() now reads HONCHO_BASE_URL and enables Honcho when base_url
is set, even without an API key
- from_global_config() reads baseUrl from config root with
HONCHO_BASE_URL env var as fallback
- get_honcho_client() guard relaxed to allow base_url without api_key
for no-auth local instances
- Added HONCHO_BASE_URL to OPTIONAL_ENV_VARS registry
Result: Setting HONCHO_BASE_URL=http://localhost:8000 in ~/.hermes/.env
now correctly routes the Honcho client to a local instance.
When the user is on a custom provider (provider=custom, localhost, or
127.0.0.1 endpoint), /model <name> no longer tries to auto-detect a
provider switch. The model name changes on the current endpoint as-is.
To switch away from a custom endpoint, users must use explicit
provider:model syntax (e.g. /model openai-codex:gpt-5.2-codex).
A helpful tip is printed when changing models on a custom endpoint.
This prevents the confusing case where someone on LM Studio types
/model gpt-5.2-codex, the auto-detection tries to switch providers,
fails or partially succeeds, and requests still go to the old endpoint.
Also fixes the missing prompt_toolkit.auto_suggest mock stub in
test_cli_init.py (same issue already fixed in test_cli_new_session.py).
Follow-up to PR #2101 (InB4DevOps). Adds three missing context compressor
resets in reset_session_state():
- compression_count (displayed in status bar)
- last_total_tokens
- _context_probed (stale context-error flag)
Also fixes the test_cli_new_session.py prompt_toolkit mock (missing
auto_suggest stub) and adds a regression test for #2099 that verifies
all token counters and compressor state are zeroed on /new.
The colon-split logic in get_model_context_length() and
_query_local_context_length() assumed any colon meant provider:model
format (e.g. "local:my-model"). But Ollama uses model:tag format
(e.g. "qwen3.5:27b"), so the split turned "qwen3.5:27b" into just
"27b" — which matches nothing, causing a fallback to the 2M token
probe tier.
Now only recognised provider prefixes (local, openrouter, anthropic,
etc.) are stripped. Ollama model:tag names pass through intact.
Co-authored-by: kshitijk4poor <82637225+kshitijk4poor@users.noreply.github.com>
- Add <thinking> tag to streaming filter's tag list
- When show_reasoning is on, route XML reasoning content to the
reasoning display box instead of silently discarding it
- Expand _strip_think_blocks to handle all tag variants:
<think>, <thinking>, <THINKING>, <reasoning>, <REASONING_SCRATCHPAD>
Place a sentinel in _running_agents immediately after the "already
running" guard check passes — before any await. Without this, the
numerous await points between the guard (line 1324) and agent
registration (track_agent at line 4790) create a window where a
second message for the same session can bypass the guard and start
a duplicate agent, corrupting the transcript.
The await gap includes: hook emissions, vision enrichment (external
API call), audio transcription (external API call), session hygiene
compression, and the run_in_executor call itself. For messages with
media attachments the window can be several seconds wide.
The sentinel is wrapped in try/finally so it is always cleaned up —
even if the handler raises or takes an early-return path. When the
real AIAgent is created, track_agent() overwrites the sentinel with
the actual instance (preserving interrupt support).
Also handles the edge case where a message arrives while the sentinel
is set but no real agent exists yet: the message is queued via the
adapter's pending-message mechanism instead of attempting to call
interrupt() on the sentinel object.
Add FastMCP skill to optional-skills/mcp/fastmcp/ with:
- SKILL.md with workflow, design patterns, quality checklist
- Templates: API wrapper, database server, file processor
- Scaffold CLI script for template instantiation
- FastMCP CLI reference documentation
Moved to optional-skills (requires pip install fastmcp).
Based on work by kshitijk4poor in PR #2096.
Closes#343
Show complete session IDs in 'hermes sessions list' instead of
truncating to 20 characters. Widens title column from 20→30 chars
and adjusts header widths accordingly.
Fixes#2068. Based on PR #2085 by @Nebula037 with a correction
to preserve the no-titles layout (the original PR accidentally
replaced the Preview/Src header with a duplicate Title/Preview header).
The install script creates venv/ but several docs referenced .venv/,
causing agents to fail with 'No such file or directory' when following
AGENTS.md instructions.
Fixes#2066
MiniMax's default base URL was /v1 which caused runtime_provider to
default to chat_completions mode (OpenAI-style Authorization: Bearer
header). MiniMax rejects this with a 401 because they require the
Anthropic-style x-api-key header.
Changes:
- auth.py: Change default inference_base_url for minimax and minimax-cn
from /v1 to /anthropic
- runtime_provider.py: Auto-correct stale /v1 URLs from existing .env
files to /anthropic, and always default minimax/minimax-cn providers
to anthropic_messages mode
- Update tests to reflect new defaults, add tests for stale URL
auto-correction and explicit api_mode override
Based on PR #2100 by @devorun. Fixes#2094.
Co-authored-by: Test <test@test.com>
Local models (especially Qwen 3.5) sometimes wrap their entire response
inside <think> tags, leaving actual content empty. Previously this caused
3 retries and then an error, wasting tokens and failing the request.
Now when retries are exhausted and reasoning_text contains the response,
it is used as final_response instead of returning an error. The user
sees the actual answer instead of "Model generated only think blocks."
Custom endpoints (LM Studio, Ollama, vLLM, llama.cpp) silently fall
back to 2M tokens when /v1/models doesn't include context_length.
Adds _query_local_context_length() which queries server-specific APIs:
- LM Studio: /api/v1/models (max_context_length + loaded instances)
- Ollama: /api/show (model_info + num_ctx parameters)
- llama.cpp: /props (n_ctx from default_generation_settings)
- vLLM: /v1/models/{model} (max_model_len)
Prefers loaded instance context over max (e.g., 122K loaded vs 1M max).
Results are cached via save_context_length() to avoid repeated queries.
Also fixes detect_local_server_type() misidentifying LM Studio as
Ollama (LM Studio returns 200 for /api/tags with an error body).
When LM Studio has a model loaded with a custom context size (e.g.,
122K), prefer that over the model's max_context_length (e.g., 1M).
This makes the TUI status bar show the actual runtime context window.
Instead of defaulting to 2M for unknown local models, query the server
API for the real context length. Supports Ollama (/api/show), vLLM
(max_model_len), and LM Studio (/v1/models). Results are cached to
avoid repeated queries.
Two issues with /model preventing proper provider switching:
1. Bare provider names not detected: typing '/model nous' treated 'nous'
as a model name instead of triggering a provider switch. Fixed by adding
step 0 in detect_provider_for_model() that checks if the input matches
a known provider name/alias (excluding 'custom'/'openrouter' which need
explicit model names) and returns that provider's default model.
2. Custom endpoint details hidden: /model (no args) showed '[custom]' with
just a usage hint but no endpoint URL or model name. Now displays the
configured base_url for custom providers in both CLI and gateway.
Note: config base_url and OPENAI_BASE_URL are intentionally NOT cleared on
provider switch — dedicated provider paths (nous, anthropic, codex) have
their own credential resolution that ignores these, and clearing them would
destroy the user's custom endpoint config, preventing switching back.
Co-authored-by: Test <test@test.com>