
---
title: AI Providers
sidebar_label: AI Providers
sidebar_position: 1
---

# AI Providers

This page covers setting up inference providers for Hermes Agent — from cloud APIs like OpenRouter and Anthropic, to self-hosted endpoints like Ollama and vLLM, to advanced routing and fallback configurations. You need at least one provider configured to use Hermes.

## Inference Providers

You need at least one way to connect to an LLM. Use hermes model to switch providers and models interactively, or configure directly:

| Provider | Setup |
|---|---|
| Nous Portal | `hermes model` (OAuth, subscription-based) |
| OpenAI Codex | `hermes model` (ChatGPT OAuth, uses Codex models) |
| GitHub Copilot | `hermes model` (OAuth device code flow, `COPILOT_GITHUB_TOKEN`, `GH_TOKEN`, or `gh auth token`) |
| GitHub Copilot ACP | `hermes model` (spawns local `copilot --acp --stdio`) |
| Anthropic | `hermes model` (Claude Pro/Max via Claude Code auth, Anthropic API key, or manual setup-token) |
| OpenRouter | `OPENROUTER_API_KEY` in `~/.hermes/.env` |
| AI Gateway | `AI_GATEWAY_API_KEY` in `~/.hermes/.env` (provider: `ai-gateway`) |
| z.ai / GLM | `GLM_API_KEY` in `~/.hermes/.env` (provider: `zai`) |
| Kimi / Moonshot | `KIMI_API_KEY` in `~/.hermes/.env` (provider: `kimi-coding`) |
| MiniMax | `MINIMAX_API_KEY` in `~/.hermes/.env` (provider: `minimax`) |
| MiniMax China | `MINIMAX_CN_API_KEY` in `~/.hermes/.env` (provider: `minimax-cn`) |
| Alibaba Cloud | `DASHSCOPE_API_KEY` in `~/.hermes/.env` (provider: `alibaba`, aliases: `dashscope`, `qwen`) |
| Kilo Code | `KILOCODE_API_KEY` in `~/.hermes/.env` (provider: `kilocode`) |
| OpenCode Zen | `OPENCODE_ZEN_API_KEY` in `~/.hermes/.env` (provider: `opencode-zen`) |
| OpenCode Go | `OPENCODE_GO_API_KEY` in `~/.hermes/.env` (provider: `opencode-go`) |
| DeepSeek | `DEEPSEEK_API_KEY` in `~/.hermes/.env` (provider: `deepseek`) |
| Hugging Face | `HF_TOKEN` in `~/.hermes/.env` (provider: `huggingface`, aliases: `hf`) |
| Custom Endpoint | `hermes model` (saved in `config.yaml`) or `OPENAI_BASE_URL` + `OPENAI_API_KEY` in `~/.hermes/.env` |

:::tip Model key alias In the model: config section, you can use either default: or model: as the key name for your model ID. Both model: { default: my-model } and model: { model: my-model } work identically. :::

:::info Codex Note The OpenAI Codex provider authenticates via device code (open a URL, enter a code). Hermes stores the resulting credentials in its own auth store under ~/.hermes/auth.json and can import existing Codex CLI credentials from ~/.codex/auth.json when present. No Codex CLI installation is required. :::

:::warning Even when using Nous Portal, Codex, or a custom endpoint, some tools (vision, web summarization, MoA) use a separate "auxiliary" model — by default Gemini Flash via OpenRouter. An OPENROUTER_API_KEY enables these tools automatically. You can also configure which model and provider these tools use — see Auxiliary Models. :::

## Anthropic (Native)

Use Claude models directly through the Anthropic API — no OpenRouter proxy needed. Supports three auth methods:

```bash
# With an API key (pay-per-token)
export ANTHROPIC_API_KEY=***
hermes chat --provider anthropic --model claude-sonnet-4-6

# Preferred: authenticate through `hermes model`
# Hermes will use Claude Code's credential store directly when available
hermes model

# Manual override with a setup-token (fallback / legacy)
export ANTHROPIC_TOKEN=***  # setup-token or manual OAuth token
hermes chat --provider anthropic

# Auto-detect Claude Code credentials (if you already use Claude Code)
hermes chat --provider anthropic  # reads Claude Code credential files automatically
```

When you choose Anthropic OAuth through `hermes model`, Hermes prefers Claude Code's own credential store over copying the token into `~/.hermes/.env`. This way, credentials that Claude Code refreshes automatically stay valid instead of going stale as a static copied token.

Or set it permanently:

```yaml
model:
  provider: "anthropic"
  default: "claude-sonnet-4-6"
```

:::tip Aliases --provider claude and --provider claude-code also work as shorthand for --provider anthropic. :::

## GitHub Copilot

Hermes supports GitHub Copilot as a first-class provider with two modes:

copilot — Direct Copilot API (recommended). Uses your GitHub Copilot subscription to access GPT-5.x, Claude, Gemini, and other models through the Copilot API.

```bash
hermes chat --provider copilot --model gpt-5.4
```

Authentication options (checked in this order):

  1. COPILOT_GITHUB_TOKEN environment variable
  2. GH_TOKEN environment variable
  3. GITHUB_TOKEN environment variable
  4. gh auth token CLI fallback

If no token is found, hermes model offers an OAuth device code login — the same flow used by the Copilot CLI and opencode.

:::warning Token types The Copilot API does not support classic Personal Access Tokens (ghp_*). Supported token types:

| Type | Prefix | How to get |
|---|---|---|
| OAuth token | `gho_` | `hermes model` → GitHub Copilot → Login with GitHub |
| Fine-grained PAT | `github_pat_` | GitHub Settings → Developer settings → Fine-grained tokens (needs Copilot Requests permission) |
| GitHub App token | `ghu_` | Via GitHub App installation |

If your gh auth token returns a ghp_* token, use hermes model to authenticate via OAuth instead. :::
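As a rough illustration of these prefix rules (a sketch derived from the table above, not Hermes's actual validation code):

```python
# Hypothetical sketch of the Copilot token-type rules above.
SUPPORTED_PREFIXES = ("gho_", "github_pat_", "ghu_")

def copilot_token_ok(token: str) -> bool:
    """Return True if the token type is accepted by the Copilot API."""
    if token.startswith("ghp_"):        # classic PAT: explicitly unsupported
        return False
    return token.startswith(SUPPORTED_PREFIXES)

print(copilot_token_ok("gho_abc123"))   # OAuth token -> True
print(copilot_token_ok("ghp_abc123"))   # classic PAT -> False
```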

API routing: GPT-5+ models (except gpt-5-mini) automatically use the Responses API. All other models (GPT-4o, Claude, Gemini, etc.) use Chat Completions. Models are auto-detected from the live Copilot catalog.
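The routing rule can be sketched in a few lines (assumed logic for illustration; the real catalog-driven detection is more involved):

```python
# Illustrative sketch of the API-routing rule described above
# (an assumption, not Hermes's actual implementation).
def copilot_api_for(model: str) -> str:
    if model.startswith("gpt-5") and model != "gpt-5-mini":
        return "responses"          # GPT-5+ models use the Responses API
    return "chat_completions"       # everything else: GPT-4o, Claude, Gemini, ...

print(copilot_api_for("gpt-5.4"))          # responses
print(copilot_api_for("gpt-5-mini"))       # chat_completions
print(copilot_api_for("claude-sonnet-4"))  # chat_completions
```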

copilot-acp — Copilot ACP agent backend. Spawns the local Copilot CLI as a subprocess:

```bash
hermes chat --provider copilot-acp --model copilot-acp
# Requires the GitHub Copilot CLI in PATH and an existing `copilot login` session
```

Permanent config:

```yaml
model:
  provider: "copilot"
  default: "gpt-5.4"
```

| Environment variable | Description |
|---|---|
| `COPILOT_GITHUB_TOKEN` | GitHub token for Copilot API (first priority) |
| `HERMES_COPILOT_ACP_COMMAND` | Override the Copilot CLI binary path (default: `copilot`) |
| `HERMES_COPILOT_ACP_ARGS` | Override ACP args (default: `--acp --stdio`) |

## First-Class Chinese AI Providers

These providers have built-in support with dedicated provider IDs. Set the API key and use --provider to select:

```bash
# z.ai / ZhipuAI GLM
hermes chat --provider zai --model glm-4-plus
# Requires: GLM_API_KEY in ~/.hermes/.env

# Kimi / Moonshot AI
hermes chat --provider kimi-coding --model moonshot-v1-auto
# Requires: KIMI_API_KEY in ~/.hermes/.env

# MiniMax (global endpoint)
hermes chat --provider minimax --model MiniMax-M2.7
# Requires: MINIMAX_API_KEY in ~/.hermes/.env

# MiniMax (China endpoint)
hermes chat --provider minimax-cn --model MiniMax-M2.7
# Requires: MINIMAX_CN_API_KEY in ~/.hermes/.env

# Alibaba Cloud / DashScope (Qwen models)
hermes chat --provider alibaba --model qwen3.5-plus
# Requires: DASHSCOPE_API_KEY in ~/.hermes/.env
```

Or set the provider permanently in config.yaml:

```yaml
model:
  provider: "zai"       # or: kimi-coding, minimax, minimax-cn, alibaba
  default: "glm-4-plus"
```

Base URLs can be overridden with GLM_BASE_URL, KIMI_BASE_URL, MINIMAX_BASE_URL, MINIMAX_CN_BASE_URL, or DASHSCOPE_BASE_URL environment variables.

## Hugging Face Inference Providers

Hugging Face Inference Providers routes to 20+ open models through a unified OpenAI-compatible endpoint (router.huggingface.co/v1). Requests are automatically routed to the fastest available backend (Groq, Together, SambaNova, etc.) with automatic failover.

```bash
# Use any available model
hermes chat --provider huggingface --model Qwen/Qwen3-235B-A22B-Thinking-2507
# Requires: HF_TOKEN in ~/.hermes/.env

# Short alias
hermes chat --provider hf --model deepseek-ai/DeepSeek-V3.2
```

Or set it permanently in config.yaml:

```yaml
model:
  provider: "huggingface"
  default: "Qwen/Qwen3-235B-A22B-Thinking-2507"
```

Get your token at huggingface.co/settings/tokens — make sure to enable the "Make calls to Inference Providers" permission. Free tier included ($0.10/month credit, no markup on provider rates).

You can append routing suffixes to model names: :fastest (default), :cheapest, or :provider_name to force a specific backend.
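A minimal sketch of how these suffixes compose with model IDs (the helper name is hypothetical):

```python
# Hypothetical helper illustrating the ":policy" routing suffixes above.
def split_hf_route(model_id: str) -> tuple:
    """Split 'org/model:policy' into (model, routing policy)."""
    if ":" in model_id:
        base, policy = model_id.rsplit(":", 1)
        return base, policy
    return model_id, "fastest"   # :fastest is the documented default

print(split_hf_route("deepseek-ai/DeepSeek-V3.2:cheapest"))
# ('deepseek-ai/DeepSeek-V3.2', 'cheapest')
```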

The base URL can be overridden with HF_BASE_URL.

## Custom & Self-Hosted LLM Providers

Hermes Agent works with any OpenAI-compatible API endpoint. If a server implements /v1/chat/completions, you can point Hermes at it. This means you can use local models, GPU inference servers, multi-provider routers, or any third-party API.

### General Setup

Two ways to configure a custom endpoint:

**Interactive setup (recommended):**

```bash
hermes model
# Select "Custom endpoint (self-hosted / VLLM / etc.)"
# Enter: API base URL, API key, Model name
```

**Manual config (config.yaml):**

```yaml
# In ~/.hermes/config.yaml
model:
  default: your-model-name
  provider: custom
  base_url: http://localhost:8000/v1
  api_key: your-key-or-leave-empty-for-local
```

:::warning Legacy env vars OPENAI_BASE_URL and LLM_MODEL in .env are deprecated. The CLI ignores LLM_MODEL entirely (only the gateway reads it). Use hermes model or edit config.yaml directly — both persist correctly across restarts and Docker containers. :::

Both approaches persist to config.yaml, which is the source of truth for model, provider, and base URL.

### Switching Models with `/model`

Once a custom endpoint is configured, you can switch models mid-session:

```
/model custom:qwen-2.5            # Switch to a model on your custom endpoint
/model custom                     # Auto-detect the model from the endpoint
/model openrouter:claude-sonnet-4 # Switch back to a cloud provider
```

If you have named custom providers configured (see below), use the triple syntax:

```
/model custom:local:qwen-2.5    # Use the "local" custom provider with model qwen-2.5
/model custom:work:llama3       # Use the "work" custom provider with llama3
```

When switching providers, Hermes persists the base URL and provider to config so the change survives restarts. When switching away from a custom endpoint to a built-in provider, the stale base URL is automatically cleared.

:::tip /model custom (bare, no model name) queries your endpoint's /models API and auto-selects the model if exactly one is loaded. Useful for local servers running a single model. :::
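For illustration, the argument forms above can be parsed roughly like this (a sketch, not Hermes's actual parser):

```python
# Illustrative parser for the /model argument forms documented above.
def parse_model_arg(arg: str) -> dict:
    parts = arg.split(":", 2)
    if parts[0] == "custom" and len(parts) == 3:
        # custom:<named endpoint>:<model> — the "triple syntax"
        return {"provider": "custom", "endpoint": parts[1], "model": parts[2]}
    if parts[0] == "custom" and len(parts) == 2:
        return {"provider": "custom", "endpoint": None, "model": parts[1]}
    if parts[0] == "custom":
        # bare "custom": auto-detect the model from the endpoint's /models API
        return {"provider": "custom", "endpoint": None, "model": None}
    if len(parts) == 2:
        return {"provider": parts[0], "endpoint": None, "model": parts[1]}
    raise ValueError(f"unrecognized model argument: {arg}")

print(parse_model_arg("custom:local:qwen-2.5"))
# {'provider': 'custom', 'endpoint': 'local', 'model': 'qwen-2.5'}
```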

Everything below follows this same pattern — just change the URL, key, and model name.


### Ollama — Local Models, Zero Config

Ollama runs open-weight models locally with one command. Best for: quick local experimentation, privacy-sensitive work, offline use. Supports tool calling via the OpenAI-compatible API.

```bash
# Install and run a model
ollama pull qwen2.5-coder:32b
ollama serve   # Starts on port 11434
```

Then configure Hermes:

```bash
hermes model
# Select "Custom endpoint (self-hosted / VLLM / etc.)"
# Enter URL: http://localhost:11434/v1
# Skip API key (Ollama doesn't need one)
# Enter model name (e.g. qwen2.5-coder:32b)
```

Or configure config.yaml directly:

```yaml
model:
  default: qwen2.5-coder:32b
  provider: custom
  base_url: http://localhost:11434/v1
  context_length: 32768   # See warning below
```

:::caution Ollama defaults to very low context lengths Ollama does not use your model's full context window by default. Depending on your VRAM, the default is:

| Available VRAM | Default context |
|---|---|
| Less than 24 GB | 4,096 tokens |
| 24-48 GB | 32,768 tokens |
| 48+ GB | 256,000 tokens |

For agent use with tools, you need at least 16k-32k of context. At 4k, the system prompt + tool schemas alone can fill the window, leaving no room for conversation.

How to increase it (pick one):

```bash
# Option 1: Set server-wide via environment variable (recommended)
OLLAMA_CONTEXT_LENGTH=32768 ollama serve

# Option 2: For systemd-managed Ollama
sudo systemctl edit ollama.service
# Add: Environment="OLLAMA_CONTEXT_LENGTH=32768"
# Then: sudo systemctl daemon-reload && sudo systemctl restart ollama

# Option 3: Bake it into a custom model (persistent per-model)
echo -e "FROM qwen2.5-coder:32b\nPARAMETER num_ctx 32768" > Modelfile
ollama create qwen2.5-coder-32k -f Modelfile
```

You cannot set context length through the OpenAI-compatible API (/v1/chat/completions). It must be configured server-side or via a Modelfile. This is the #1 source of confusion when integrating Ollama with tools like Hermes. :::
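A back-of-the-envelope budget shows why 4,096 tokens is too tight for agent use (all token counts below are illustrative assumptions, not measured values):

```python
# Illustrative token budget for a single agent turn at Ollama's 4k default.
context_window = 4096   # Ollama default below 24 GB VRAM
system_prompt  = 2500   # assumed size of an agent system prompt
tool_schemas   = 1200   # assumed size of the JSON schemas for the tool set

remaining = context_window - system_prompt - tool_schemas
print(f"Room left for conversation: {remaining} tokens")  # 396 tokens
```

With only a few hundred tokens left, the first user message plus one tool result already forces truncation, which is why 32k is the recommended floor.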

Verify your context is set correctly:

```bash
ollama ps
# Look at the CONTEXT column — it should show your configured value
```

:::tip List available models with ollama list. Pull any model from the Ollama library with ollama pull <model>. Ollama handles GPU offloading automatically — no configuration needed for most setups. :::


### vLLM — High-Performance GPU Inference

vLLM is the standard for production LLM serving. Best for: maximum throughput on GPU hardware, serving large models, continuous batching.

```bash
pip install vllm
vllm serve meta-llama/Llama-3.1-70B-Instruct \
  --port 8000 \
  --max-model-len 65536 \
  --tensor-parallel-size 2 \
  --enable-auto-tool-choice \
  --tool-call-parser hermes
```

Then configure Hermes:

```bash
hermes model
# Select "Custom endpoint (self-hosted / VLLM / etc.)"
# Enter URL: http://localhost:8000/v1
# Skip API key (or enter one if you configured vLLM with --api-key)
# Enter model name: meta-llama/Llama-3.1-70B-Instruct
```

Context length: vLLM reads the model's max_position_embeddings by default. If that exceeds your GPU memory, it errors and asks you to set --max-model-len lower. You can also use --max-model-len auto to automatically find the maximum that fits. Set --gpu-memory-utilization 0.95 (default 0.9) to squeeze more context into VRAM.

Tool calling requires explicit flags:

| Flag | Purpose |
|---|---|
| `--enable-auto-tool-choice` | Required for `tool_choice: "auto"` (the default in Hermes) |
| `--tool-call-parser <name>` | Parser for the model's tool call format |

Supported parsers: hermes (Qwen 2.5, Hermes 2/3), llama3_json (Llama 3.x), mistral, deepseek_v3, deepseek_v31, xlam, pythonic. Without these flags, tool calls won't work — the model will output tool calls as text.

:::tip vLLM supports human-readable sizes: --max-model-len 64k (lowercase k = 1000, uppercase K = 1024). :::


### SGLang — Fast Serving with RadixAttention

SGLang is an alternative to vLLM with RadixAttention for KV cache reuse. Best for: multi-turn conversations (prefix caching), constrained decoding, structured output.

```bash
pip install "sglang[all]"
python -m sglang.launch_server \
  --model meta-llama/Llama-3.1-70B-Instruct \
  --port 30000 \
  --context-length 65536 \
  --tp 2 \
  --tool-call-parser qwen
```

Then configure Hermes:

```bash
hermes model
# Select "Custom endpoint (self-hosted / VLLM / etc.)"
# Enter URL: http://localhost:30000/v1
# Enter model name: meta-llama/Llama-3.1-70B-Instruct
```

Context length: SGLang reads from the model's config by default. Use --context-length to override. If you need to exceed the model's declared maximum, set SGLANG_ALLOW_OVERWRITE_LONGER_CONTEXT_LEN=1.

Tool calling: Use --tool-call-parser with the appropriate parser for your model family: qwen (Qwen 2.5), llama3, llama4, deepseekv3, mistral, glm. Without this flag, tool calls come back as plain text.

:::caution SGLang defaults to 128 max output tokens If responses seem truncated, add max_tokens to your requests or set --default-max-tokens on the server. SGLang's default is only 128 tokens per response if not specified in the request. :::


### llama.cpp / llama-server — CPU & Metal Inference

llama.cpp runs quantized models on CPU, Apple Silicon (Metal), and consumer GPUs. Best for: running models without a datacenter GPU, Mac users, edge deployment.

```bash
# Build and start llama-server
cmake -B build && cmake --build build --config Release
./build/bin/llama-server \
  --jinja -fa \
  -c 32768 \
  -ngl 99 \
  -m models/qwen2.5-coder-32b-instruct-Q4_K_M.gguf \
  --port 8080 --host 0.0.0.0
```

Context length (`-c`): Recent builds default to `0`, which reads the model's training context from the GGUF metadata. For models with 128k+ training context, this can OOM trying to allocate the full KV cache. Set `-c` explicitly to what you need (32k-64k is a good range for agent use). If using parallel slots (`-np`), the total context is divided among slots — with `-c 32768 -np 4`, each slot only gets 8k.
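The per-slot arithmetic is plain integer division:

```python
# How llama-server splits the total context (-c) across parallel slots (-np).
total_ctx = 32768   # -c 32768
slots     = 4       # -np 4

per_slot = total_ctx // slots
print(per_slot)     # 8192 tokens available to each slot
```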

Then configure Hermes to point at it:

```bash
hermes model
# Select "Custom endpoint (self-hosted / VLLM / etc.)"
# Enter URL: http://localhost:8080/v1
# Skip API key (local servers don't need one)
# Enter model name — or leave blank to auto-detect if only one model is loaded
```

This saves the endpoint to config.yaml so it persists across sessions.

:::caution --jinja is required for tool calling Without --jinja, llama-server ignores the tools parameter entirely. The model will try to call tools by writing JSON in its response text, but Hermes won't recognize it as a tool call — you'll see raw JSON like {"name": "web_search", ...} printed as a message instead of an actual search.

Native tool calling support (best performance): Llama 3.x, Qwen 2.5 (including Coder), Hermes 2/3, Mistral, DeepSeek, Functionary. All other models use a generic handler that works but may be less efficient. See the llama.cpp function calling docs for the full list.

You can verify tool support is active by checking http://localhost:8080/props — the chat_template field should be present. :::

:::tip Download GGUF models from Hugging Face. Q4_K_M quantization offers the best balance of quality vs. memory usage. :::


### LM Studio — Desktop App with Local Models

LM Studio is a desktop app for running local models with a GUI. Best for: users who prefer a visual interface, quick model testing, developers on macOS/Windows/Linux.

Start the server from the LM Studio app (Developer tab → Start Server), or use the CLI:

```bash
lms server start                        # Starts on port 1234
lms load qwen2.5-coder --context-length 32768
```

Then configure Hermes:

```bash
hermes model
# Select "Custom endpoint (self-hosted / VLLM / etc.)"
# Enter URL: http://localhost:1234/v1
# Skip API key (LM Studio doesn't require one)
# Enter model name
```

:::caution Context length often defaults to 2048 LM Studio reads context length from the model's metadata, but many GGUF models report low defaults (2048 or 4096). Always set context length explicitly in the LM Studio model settings:

  1. Click the gear icon next to the model picker
  2. Set "Context Length" to at least 16384 (preferably 32768)
  3. Reload the model for the change to take effect

Alternatively, use the CLI: lms load model-name --context-length 32768

To set persistent per-model defaults: My Models tab → gear icon on the model → set context size. :::

Tool calling: Supported since LM Studio 0.3.6. Models with native tool-calling training (Qwen 2.5, Llama 3.x, Mistral, Hermes) are auto-detected and shown with a tool badge. Other models use a generic fallback that may be less reliable.


### Troubleshooting Local Models

These issues affect all local inference servers when used with Hermes.

#### Tool calls appear as text instead of executing

The model outputs something like {"name": "web_search", "arguments": {...}} as a message instead of actually calling the tool.

Cause: Your server doesn't have tool calling enabled, or the model doesn't support it through the server's tool calling implementation.

| Server | Fix |
|---|---|
| llama.cpp | Add `--jinja` to the startup command |
| vLLM | Add `--enable-auto-tool-choice --tool-call-parser hermes` |
| SGLang | Add `--tool-call-parser qwen` (or appropriate parser) |
| Ollama | Tool calling is enabled by default — make sure your model supports it (check with `ollama show model-name`) |
| LM Studio | Update to 0.3.6+ and use a model with native tool support |
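A quick way to recognize this failure mode: the "response" parses as a tool-call-shaped JSON object. A rough diagnostic in that spirit (an assumption for illustration, not a built-in Hermes feature):

```python
import json

# Heuristic check: does a model "message" look like an un-parsed tool call?
def looks_like_raw_tool_call(text: str) -> bool:
    try:
        obj = json.loads(text.strip())
    except ValueError:
        return False   # ordinary prose won't parse as JSON
    return isinstance(obj, dict) and "name" in obj and "arguments" in obj

print(looks_like_raw_tool_call('{"name": "web_search", "arguments": {"query": "x"}}'))  # True
print(looks_like_raw_tool_call("Here are the results..."))                              # False
```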

#### Model seems to forget context or give incoherent responses

Cause: Context window is too small. When the conversation exceeds the context limit, most servers silently drop older messages. Hermes's system prompt + tool schemas alone can use 4k-8k tokens.

Diagnosis:

```bash
# Check what Hermes thinks the context is
# Look at startup line: "Context limit: X tokens"

# Check your server's actual context
# Ollama: ollama ps (CONTEXT column)
# llama.cpp: curl http://localhost:8080/props | jq '.default_generation_settings.n_ctx'
# vLLM: check --max-model-len in startup args
```

Fix: Set context to at least 32,768 tokens for agent use. See each server's section above for the specific flag.

#### "Context limit: 2048 tokens" at startup

Hermes auto-detects context length from your server's /v1/models endpoint. If the server reports a low value (or doesn't report one at all), Hermes uses the model's declared limit which may be wrong.

Fix: Set it explicitly in config.yaml:

```yaml
model:
  default: your-model
  provider: custom
  base_url: http://localhost:11434/v1
  context_length: 32768
```
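For context, here is roughly what detection from `/v1/models` amounts to. Field names vary by server, so the keys and sample payload below are examples rather than a guaranteed schema:

```python
# Sketch of extracting a context hint from a /v1/models-style payload.
sample = {
    "data": [
        {"id": "qwen2.5-coder:32b", "max_model_len": 32768},  # example fields
    ]
}

def context_from_models_payload(payload: dict, model_id: str):
    for entry in payload.get("data", []):
        if entry.get("id") == model_id:
            # Servers disagree on the field name; try a few common candidates.
            for key in ("max_model_len", "context_length", "max_context_length"):
                if key in entry:
                    return entry[key]
    return None   # server reported nothing usable — Hermes must fall back

print(context_from_models_payload(sample, "qwen2.5-coder:32b"))  # 32768
```

If the server omits these fields entirely, the explicit `context_length` override above is the reliable fix.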

#### Responses get cut off mid-sentence

Possible causes:

  1. Low max_tokens on the server — SGLang defaults to 128 tokens per response. Set --default-max-tokens on the server or configure Hermes with model.max_tokens in config.yaml.
  2. Context exhaustion — The model filled its context window. Increase context length or enable context compression in Hermes.

### LiteLLM Proxy — Multi-Provider Gateway

LiteLLM is an OpenAI-compatible proxy that unifies 100+ LLM providers behind a single API. Best for: switching between providers without config changes, load balancing, fallback chains, budget controls.

```bash
# Install and start
pip install "litellm[proxy]"
litellm --model anthropic/claude-sonnet-4 --port 4000

# Or with a config file for multiple models:
litellm --config litellm_config.yaml --port 4000
```

Then configure Hermes with hermes model → Custom endpoint → http://localhost:4000/v1.

Example litellm_config.yaml with fallback:

```yaml
model_list:
  - model_name: "best"
    litellm_params:
      model: anthropic/claude-sonnet-4
      api_key: sk-ant-...
  - model_name: "best"
    litellm_params:
      model: openai/gpt-4o
      api_key: sk-...
router_settings:
  routing_strategy: "latency-based-routing"
```

### ClawRouter — Cost-Optimized Routing

ClawRouter by BlockRunAI is a local routing proxy that auto-selects models based on query complexity. It classifies requests across 14 dimensions and routes to the cheapest model that can handle the task. Payment is via USDC cryptocurrency (no API keys).

```bash
# Install and start
npx @blockrun/clawrouter    # Starts on port 8402
```

Then configure Hermes with hermes model → Custom endpoint → http://localhost:8402/v1 → model name blockrun/auto.

Routing profiles:

| Profile | Strategy | Savings |
|---|---|---|
| `blockrun/auto` | Balanced quality/cost | 74-100% |
| `blockrun/eco` | Cheapest possible | 95-100% |
| `blockrun/premium` | Best quality models | 0% |
| `blockrun/free` | Free models only | 100% |
| `blockrun/agentic` | Optimized for tool use | varies |

:::note ClawRouter requires a USDC-funded wallet on Base or Solana for payment. All requests route through BlockRun's backend API. Run npx @blockrun/clawrouter doctor to check wallet status. :::


### Other Compatible Providers

Any service with an OpenAI-compatible API works. Some popular options:

| Provider | Base URL | Notes |
|---|---|---|
| Together AI | `https://api.together.xyz/v1` | Cloud-hosted open models |
| Groq | `https://api.groq.com/openai/v1` | Ultra-fast inference |
| DeepSeek | `https://api.deepseek.com/v1` | DeepSeek models |
| Fireworks AI | `https://api.fireworks.ai/inference/v1` | Fast open model hosting |
| Cerebras | `https://api.cerebras.ai/v1` | Wafer-scale chip inference |
| Mistral AI | `https://api.mistral.ai/v1` | Mistral models |
| OpenAI | `https://api.openai.com/v1` | Direct OpenAI access |
| Azure OpenAI | `https://YOUR.openai.azure.com/` | Enterprise OpenAI |
| LocalAI | `http://localhost:8080/v1` | Self-hosted, multi-model |
| Jan | `http://localhost:1337/v1` | Desktop app with local models |

Configure any of these with hermes model → Custom endpoint, or in config.yaml:

```yaml
model:
  default: meta-llama/Llama-3.1-70B-Instruct-Turbo
  provider: custom
  base_url: https://api.together.xyz/v1
  api_key: your-together-key
```

## Context Length Detection

Hermes uses a multi-source resolution chain to detect the correct context window for your model and provider:

1. **Config override** — `model.context_length` in `config.yaml` (highest priority)
2. **Custom provider per-model** — `custom_providers[].models.<id>.context_length`
3. **Persistent cache** — previously discovered values (survives restarts)
4. **Endpoint `/models`** — queries your server's API (local/custom endpoints)
5. **Anthropic `/v1/models`** — queries Anthropic's API for `max_input_tokens` (API-key users only)
6. **OpenRouter API** — live model metadata from OpenRouter
7. **Nous Portal** — suffix-matches Nous model IDs against OpenRouter metadata
8. **models.dev** — community-maintained registry with provider-specific context lengths for 3800+ models across 100+ providers
9. **Fallback defaults** — broad model family patterns (128K default)

For most setups this works out of the box. The system is provider-aware — the same model can have different context limits depending on who serves it (e.g., claude-opus-4.6 is 1M on Anthropic direct but 128K on GitHub Copilot).
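The chain above boils down to "first source that answers wins", which can be sketched as (illustrative, not Hermes's implementation):

```python
# Minimal sketch of an ordered resolution chain: each source either answers
# with a context length or returns None to defer to the next one.
def resolve_context_length(sources):
    """sources: ordered list of callables returning an int or None."""
    for source in sources:
        value = source()
        if value:
            return value
    return 128_000  # broad fallback default, as in step 9 above

chain = [
    lambda: None,    # 1. no config override set
    lambda: None,    # 2. no per-model custom provider entry
    lambda: 32768,   # 3. cached value from a previous discovery
    lambda: 4096,    # 4. endpoint /models (never reached here)
]
print(resolve_context_length(chain))  # 32768
```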

To set the context length explicitly, add context_length to your model config:

```yaml
model:
  default: "qwen3.5:9b"
  base_url: "http://localhost:8080/v1"
  context_length: 131072  # tokens
```

For custom endpoints, you can also set context length per model:

```yaml
custom_providers:
  - name: "My Local LLM"
    base_url: "http://localhost:11434/v1"
    models:
      qwen3.5:27b:
        context_length: 32768
      deepseek-r1:70b:
        context_length: 65536
```

hermes model will prompt for context length when configuring a custom endpoint. Leave it blank for auto-detection.

:::tip When to set this manually

  • You're using Ollama with a custom num_ctx that's lower than the model's maximum
  • You want to limit context below the model's maximum (e.g., 8k on a 128k model to save VRAM)
  • You're running behind a proxy that doesn't expose /v1/models :::

## Named Custom Providers

If you work with multiple custom endpoints (e.g., a local dev server and a remote GPU server), you can define them as named custom providers in config.yaml:

```yaml
custom_providers:
  - name: local
    base_url: http://localhost:8080/v1
    # api_key omitted — Hermes uses "no-key-required" for keyless local servers
  - name: work
    base_url: https://gpu-server.internal.corp/v1
    api_key: corp-api-key
    api_mode: chat_completions   # optional, auto-detected from URL
  - name: anthropic-proxy
    base_url: https://proxy.example.com/anthropic
    api_key: proxy-key
    api_mode: anthropic_messages  # for Anthropic-compatible proxies
```

Switch between them mid-session with the triple syntax:

```
/model custom:local:qwen-2.5                   # Use the "local" endpoint with qwen-2.5
/model custom:work:llama3-70b                  # Use the "work" endpoint with llama3-70b
/model custom:anthropic-proxy:claude-sonnet-4  # Use the proxy
```

You can also select named custom providers from the interactive hermes model menu.


## Choosing the Right Setup

| Use Case | Recommended |
|---|---|
| Just want it to work | OpenRouter (default) or Nous Portal |
| Local models, easy setup | Ollama |
| Production GPU serving | vLLM or SGLang |
| Mac / no GPU | Ollama or llama.cpp |
| Multi-provider routing | LiteLLM Proxy or OpenRouter |
| Cost optimization | ClawRouter or OpenRouter with `sort: "price"` |
| Maximum privacy | Ollama, vLLM, or llama.cpp (fully local) |
| Enterprise / Azure | Azure OpenAI with custom endpoint |
| Chinese AI models | z.ai (GLM), Kimi/Moonshot, or MiniMax (first-class providers) |

:::tip You can switch between providers at any time with hermes model — no restart required. Your conversation history, memory, and skills carry over regardless of which provider you use. :::

## Optional API Keys

| Feature | Provider | Env Variable |
|---|---|---|
| Web scraping | Firecrawl | `FIRECRAWL_API_KEY`, `FIRECRAWL_API_URL` |
| Browser automation | Browserbase | `BROWSERBASE_API_KEY`, `BROWSERBASE_PROJECT_ID` |
| Image generation | FAL | `FAL_KEY` |
| Premium TTS voices | ElevenLabs | `ELEVENLABS_API_KEY` |
| OpenAI TTS + voice transcription | OpenAI | `VOICE_TOOLS_OPENAI_KEY` |
| RL Training | Tinker + WandB | `TINKER_API_KEY`, `WANDB_API_KEY` |
| Cross-session user modeling | Honcho | `HONCHO_API_KEY` |

## Self-Hosting Firecrawl

By default, Hermes uses the Firecrawl cloud API for web search and scraping. If you prefer to run Firecrawl locally, you can point Hermes at a self-hosted instance instead. See Firecrawl's SELF_HOST.md for complete setup instructions.

What you get: No API key required, no rate limits, no per-page costs, full data sovereignty.

What you lose: The cloud version uses Firecrawl's proprietary "Fire-engine" for advanced anti-bot bypassing (Cloudflare, CAPTCHAs, IP rotation). Self-hosted uses basic fetch + Playwright, so some protected sites may fail. Search uses DuckDuckGo instead of Google.

Setup:

1. Clone and start the Firecrawl Docker stack (5 containers: API, Playwright, Redis, RabbitMQ, PostgreSQL — requires ~4-8 GB RAM):

   ```bash
   git clone https://github.com/firecrawl/firecrawl
   cd firecrawl
   # In .env, set: USE_DB_AUTHENTICATION=false, HOST=0.0.0.0, PORT=3002
   docker compose up -d
   ```

2. Point Hermes at your instance (no API key needed):

   ```bash
   hermes config set FIRECRAWL_API_URL http://localhost:3002
   ```

You can also set both FIRECRAWL_API_KEY and FIRECRAWL_API_URL if your self-hosted instance has authentication enabled.

## OpenRouter Provider Routing

When using OpenRouter, you can control how requests are routed across providers. Add a provider_routing section to ~/.hermes/config.yaml:

```yaml
provider_routing:
  sort: "throughput"          # "price" (default), "throughput", or "latency"
  # only: ["anthropic"]       # Only use these providers
  # ignore: ["deepinfra"]     # Skip these providers
  # order: ["anthropic", "google"]  # Try providers in this order
  # require_parameters: true  # Only use providers that support all request params
  # data_collection: "deny"   # Exclude providers that may store/train on data
```

Shortcuts: Append :nitro to any model name for throughput sorting (e.g., anthropic/claude-sonnet-4:nitro), or :floor for price sorting.

## Fallback Model

Configure a backup provider:model that Hermes switches to automatically when your primary model fails (rate limits, server errors, auth failures):

```yaml
fallback_model:
  provider: openrouter                    # required
  model: anthropic/claude-sonnet-4        # required
  # base_url: http://localhost:8000/v1    # optional, for custom endpoints
  # api_key_env: MY_CUSTOM_KEY            # optional, env var name for custom endpoint API key
```

When activated, the fallback swaps the model and provider mid-session without losing your conversation. It fires at most once per session.
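The at-most-once behavior can be sketched as follows (hypothetical code; the model names are placeholders, not real Hermes internals):

```python
# Sketch of "fallback fires at most once per session".
class Session:
    def __init__(self, primary, fallback):
        self.model = primary
        self.fallback = fallback
        self.fallback_used = False

    def on_provider_error(self):
        """Swap to the fallback on the first failure; later errors surface."""
        if self.fallback and not self.fallback_used:
            self.model = self.fallback
            self.fallback_used = True
            return True    # request is retried on the fallback
        return False       # no further automatic switching

s = Session("primary-model", "openrouter:anthropic/claude-sonnet-4")
print(s.on_provider_error(), s.model)  # True openrouter:anthropic/claude-sonnet-4
print(s.on_provider_error())           # False — already used this session
```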

Supported providers: openrouter, nous, openai-codex, copilot, anthropic, huggingface, zai, kimi-coding, minimax, minimax-cn, custom.

:::tip Fallback is configured exclusively through config.yaml — there are no environment variables for it. For full details on when it triggers, supported providers, and how it interacts with auxiliary tasks and delegation, see Fallback Providers. :::

## Smart Model Routing

Optional cheap-vs-strong routing lets Hermes keep your main model for complex work while sending very short/simple turns to a cheaper model.

```yaml
smart_model_routing:
  enabled: true
  max_simple_chars: 160
  max_simple_words: 28
  cheap_model:
    provider: openrouter
    model: google/gemini-2.5-flash
    # base_url: http://localhost:8000/v1  # optional custom endpoint
    # api_key_env: MY_CUSTOM_KEY          # optional env var name for that endpoint's API key
```

How it works:

  • If a turn is short, single-line, and does not look code/tool/debug heavy, Hermes may route it to cheap_model
  • If the turn looks complex, Hermes stays on your primary model/provider
  • If the cheap route cannot be resolved cleanly, Hermes falls back to the primary model automatically

This is intentionally conservative. It is meant for quick, low-stakes turns like:

  • short factual questions
  • quick rewrites
  • lightweight summaries

It will avoid routing prompts that look like:

  • coding/debugging work
  • tool-heavy requests
  • long or multi-line analysis asks

Use this when you want lower latency or cost without fully changing your default model.
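Sketched as code, the heuristic might look like this (assumed logic; the thresholds match the example config above, but the keyword list is invented for illustration):

```python
# Hypothetical simple-turn classifier mirroring the routing rules above.
CODE_HINTS = ("```", "def ", "error", "traceback", "stack trace", "debug")

def is_simple_turn(text: str, max_chars: int = 160, max_words: int = 28) -> bool:
    # Multi-line, long, or wordy turns stay on the primary model.
    if "\n" in text or len(text) > max_chars or len(text.split()) > max_words:
        return False
    # Anything that smells like code or debugging also stays on the primary.
    return not any(hint in text.lower() for hint in CODE_HINTS)

print(is_simple_turn("Summarize this in one line, please"))  # True  -> cheap model
print(is_simple_turn("Fix this error in my build"))          # False -> primary model
```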


## See Also

  • Configuration — General configuration (directory structure, config precedence, terminal backends, memory, compression, and more)
  • Environment Variables — Complete reference of all environment variables