* feat: add Vercel AI Gateway as a first-class provider
Adds AI Gateway (ai-gateway.vercel.sh) as a new inference provider
with AI_GATEWAY_API_KEY authentication, live model discovery, and
reasoning support via extra_body.reasoning.
Based on PR #1492 by jerilynzheng.
* feat: add AI Gateway to setup wizard, doctor, and fallback providers
* test: add AI Gateway to api_key_providers test suite
* feat: add AI Gateway to hermes model CLI and model metadata
Wire AI Gateway into the interactive model selection menu and add
context lengths for AI Gateway model IDs in model_metadata.py.
* feat: use claude-haiku-4.5 as AI Gateway auxiliary model
* revert: use gemini-3-flash as AI Gateway auxiliary model
* fix: move AI Gateway below established providers in selection order
---------
Co-authored-by: jerilynzheng <jerilynzheng@users.noreply.github.com>
Co-authored-by: jerilynzheng <zheng.jerilyn@gmail.com>
- Add _fetch_anthropic_models() to hermes_cli/models.py — hits the
Anthropic /v1/models endpoint to get the live model catalog. Handles
both API key and OAuth token auth headers.
- Wire it into provider_model_ids() so both 'hermes model' and
'hermes setup model' show the live list instead of a stale static one.
- Update static _PROVIDER_MODELS fallback with full current catalog:
opus-4-6, sonnet-4-6, opus-4-5, sonnet-4-5, opus-4, sonnet-4, haiku-4-5
- Update model_metadata.py with context lengths for all current models.
- Fix thinking parameter for 4.5+ models: use type='adaptive' instead
of type='enabled' (Anthropic deprecated 'enabled' for newer models,
warns at runtime). Detects model version from the model name string.
Verified live:
hermes model → Anthropic → auto-detected creds → shows 7 live models
hermes chat --provider anthropic --model claude-opus-4-6 → works
* fix: /reasoning command output ordering, display, and inline think extraction
Three issues with the /reasoning command:
1. Output interleaving: The command echo used print() while feedback
used _cprint(), causing them to render out-of-order under
prompt_toolkit's patch_stdout. Changed echo to use _cprint() so
all output renders through the same path in correct order.
2. Reasoning display not working: /reasoning show toggled a flag
but reasoning never appeared for models that embed thinking in
inline <think> blocks rather than structured API fields. Added
fallback extraction in _build_assistant_message to capture
<think> block content as reasoning when no structured reasoning
fields (reasoning, reasoning_content, reasoning_details) are
present. This feeds into both the reasoning callback (during
tool loops) and the post-response reasoning box display.
3. Feedback clarity: Added checkmarks to confirm actions, persisted
show/hide to config (was session-only before), and aligned the
status display for readability.
Tests: 7 new tests for inline think block extraction (41 total).
* feat: add /reasoning command to gateway (Telegram/Discord/etc)
The /reasoning command only existed in the CLI — messaging platforms
had no way to view or change reasoning settings. This adds:
1. /reasoning command handler in the gateway:
- No args: shows current effort level and display state
- /reasoning <level>: sets reasoning effort (none/low/medium/high/xhigh)
- /reasoning show|hide: toggles reasoning display in responses
- All changes saved to config.yaml immediately
2. Reasoning display in gateway responses:
- When show_reasoning is enabled, prepends a 'Reasoning' block
with the model's last_reasoning content before the response
- Collapses long reasoning (>15 lines) to keep messages readable
- Uses last_reasoning from run_conversation result dict
3. Plumbing:
- Added _show_reasoning attribute loaded from config at startup
- Propagated last_reasoning through _run_agent return dict
- Added /reasoning to help text and known_commands set
- Uses getattr for _show_reasoning to handle test stubs
* fix: improve Kimi model selection — auto-detect endpoint, add missing models
Kimi Coding Plan setup:
- New dedicated _model_flow_kimi() replaces the generic API-key flow
for kimi-coding. Removes the confusing 'Base URL' prompt entirely —
the endpoint is auto-detected from the API key prefix:
sk-kimi-* → api.kimi.com/coding/v1 (Kimi Coding Plan)
other → api.moonshot.ai/v1 (legacy Moonshot)
- Shows appropriate models for each endpoint:
Coding Plan: kimi-for-coding, kimi-k2.5, kimi-k2-thinking, kimi-k2-thinking-turbo
Moonshot: full model catalog
- Clears any stale KIMI_BASE_URL override so runtime auto-detection
via _resolve_kimi_base_url() works correctly.
Model catalog updates:
- Added kimi-for-coding (primary Coding Plan model) and kimi-k2-thinking-turbo
to models.py, main.py _PROVIDER_MODELS, and model_metadata.py context windows.
- Updated User-Agent from KimiCLI/1.0 to KimiCLI/1.3 (Kimi's coding
endpoint whitelists known coding agents via User-Agent sniffing).
Adds DEFAULT_CONTEXT_LENGTHS entries for kimi-k2.5 (262144), kimi-k2-thinking
(262144), kimi-k2-turbo-preview (262144), kimi-k2-0905-preview (131072),
MiniMax-M2.5/M2.5-highspeed/M2.1 (204800), and glm-4.5/4.5-flash (131072).
Avoids unnecessary 2M-token probe on first use with direct providers.
Authored by manuelschipper. Adds GLM-4.7 and GLM-5 context lengths (202752)
to model_metadata.py. The key priority fix (prefer OPENAI_API_KEY for
non-OpenRouter endpoints) was already applied in PR #295; merged the Z.ai
mention into the comment.
Replaces the unsafe 128K fallback for unknown models with a descending
probe strategy (2M → 1M → 512K → 200K → 128K → 64K → 32K). When a
context-length error occurs, the agent steps down tiers and retries.
The discovered limit is cached per model+provider combo in
~/.hermes/context_length_cache.yaml so subsequent sessions skip probing.
Also parses API error messages to extract the actual context limit
(e.g. 'maximum context length is 32768 tokens') for instant resolution.
The CLI banner now displays the context window size next to the model
name (e.g. 'claude-opus-4 · 200K context · Nous Research').
Changes:
- agent/model_metadata.py: CONTEXT_PROBE_TIERS, persistent cache
(save/load/get), parse_context_limit_from_error(), get_next_probe_tier()
- agent/context_compressor.py: accepts base_url, passes to metadata
- run_agent.py: step-down logic in context error handler, caches on success
- cli.py + hermes_cli/banner.py: context length in welcome banner
- tests: 22 new tests for probing, parsing, and caching
Addresses #132. PR #319's approach (8K default) rejected — too conservative.
When base_url points to a non-OpenRouter endpoint (e.g. Z.ai),
OPENROUTER_API_KEY incorrectly takes priority over OPENAI_API_KEY,
sending the wrong credentials. This causes 401 errors on the main
inference path and forces users to comment out OPENROUTER_API_KEY,
which then breaks auxiliary clients (compression, vision).
Fix: check whether base_url contains "openrouter" and swap the key
priority accordingly. Also adds GLM-4.7 and GLM-5 context lengths
to DEFAULT_CONTEXT_LENGTHS.