The summary message was always injected as 'user' role, which causes
consecutive user messages when the last preserved head message is also
'user'. Some APIs reject this (400 error), and it produces malformed
training data.
Fix: check the role of the last head message and pick the opposite role
for the summary — 'user' after assistant/tool, 'assistant' after user.
Based on PR #328 by johnh4098. Closes#328.
- Added support for auxiliary model overrides in the configuration, allowing users to specify providers and models for vision and web extraction tasks.
- Updated the CLI configuration example to include new auxiliary model settings.
- Enhanced the environment variable mapping in the CLI to accommodate auxiliary model configurations.
- Improved the resolution logic for auxiliary clients to support task-specific provider overrides.
- Updated relevant documentation and comments for clarity on the new features and their usage.
Updated the _generate_summary method to attempt summary generation using the auxiliary model first, with a fallback to the main model. If both attempts fail, the method now returns None instead of a placeholder, allowing the caller to handle missing summaries appropriately. This change enhances the robustness of context compression and improves logging for failure scenarios.
Enhance message compression by adding a method to clean up orphaned tool-call and tool-result pairs. This ensures that the API receives well-formed messages, preventing errors related to mismatched IDs. The new functionality includes removing orphaned results and adding stub results for missing calls, improving overall message integrity during compression.
Replaces the unsafe 128K fallback for unknown models with a descending
probe strategy (2M → 1M → 512K → 200K → 128K → 64K → 32K). When a
context-length error occurs, the agent steps down tiers and retries.
The discovered limit is cached per model+provider combo in
~/.hermes/context_length_cache.yaml so subsequent sessions skip probing.
Also parses API error messages to extract the actual context limit
(e.g. 'maximum context length is 32768 tokens') for instant resolution.
The CLI banner now displays the context window size next to the model
name (e.g. 'claude-opus-4 · 200K context · Nous Research').
Changes:
- agent/model_metadata.py: CONTEXT_PROBE_TIERS, persistent cache
(save/load/get), parse_context_limit_from_error(), get_next_probe_tier()
- agent/context_compressor.py: accepts base_url, passes to metadata
- run_agent.py: step-down logic in context error handler, caches on success
- cli.py + hermes_cli/banner.py: context length in welcome banner
- tests: 22 new tests for probing, parsing, and caching
Addresses #132. PR #319's approach (8K default) rejected — too conservative.
When the auxiliary client (used for context compression summaries) fails
— e.g. due to a stale OpenRouter API key after switching to a local LLM
— fall back to the user's active endpoint (OPENAI_BASE_URL) instead of
returning a useless static summary string.
This handles the common scenario where a user switches providers via
'hermes model' but the old provider's API key remains in .env. The
auxiliary client picks up the stale key, fails (402/auth error), and
previously compression would produce garbage. Now it gracefully retries
with the working endpoint.
On successful fallback, the working client is cached for future
compressions in the same session so the fallback cost is paid only once.
Ref: #348
The OpenAI API returns content: null on assistant messages that only
contain tool calls. msg.get('content', '') returns None (not '') when
the key exists with value None, causing TypeError on len() and string
concatenation in _generate_summary and compress.
Fix: msg.get('content') or '' — handles both missing keys and None.
Tests from PR #216 (@Farukest). Fix also in PR #215 (@cutepawss).
Both PRs had stale branches and couldn't be merged directly.
Closes#211
- Enhanced Codex model discovery by fetching available models from the API, with fallback to local cache and defaults.
- Updated the context compressor's summary target tokens to 2500 for improved performance.
- Added external credential detection for Codex CLI to streamline authentication.
- Refactored various components to ensure consistent handling of authentication and model selection across the application.
- Added _max_tokens_param method in AIAgent to return appropriate max tokens parameter based on the provider (OpenAI vs. others).
- Updated API calls in AIAgent to utilize the new max tokens handling.
- Introduced auxiliary_max_tokens_param function in auxiliary_client for consistent max tokens management across auxiliary clients.
- Refactored multiple tools to use auxiliary_max_tokens_param for improved compatibility with different models and providers.