
---
sidebar_position: 14
title: API Server
description: Expose hermes-agent as an OpenAI-compatible API for any frontend
---

# API Server

The API server exposes hermes-agent as an OpenAI-compatible HTTP endpoint. Any frontend that speaks the OpenAI format — Open WebUI, LobeChat, LibreChat, NextChat, ChatBox, and hundreds more — can connect to hermes-agent and use it as a backend.

Your agent handles requests with its full toolset (terminal, file operations, web search, memory, skills) and returns the final response. Tool calls execute invisibly server-side.

## Quick Start

### 1. Enable the API server

Add to `~/.hermes/.env`:

```bash
API_SERVER_ENABLED=true
API_SERVER_KEY=change-me-local-dev
# Optional: only if a browser must call Hermes directly
# API_SERVER_CORS_ORIGINS=http://localhost:3000
```

### 2. Start the gateway

```bash
hermes gateway
```

You'll see:

```
[API Server] API server listening on http://127.0.0.1:8642
```

### 3. Connect a frontend

Point any OpenAI-compatible client at `http://localhost:8642/v1`:

```bash
# Test with curl
curl http://localhost:8642/v1/chat/completions \
  -H "Authorization: Bearer change-me-local-dev" \
  -H "Content-Type: application/json" \
  -d '{"model": "hermes-agent", "messages": [{"role": "user", "content": "Hello!"}]}'
```

Or connect Open WebUI, LobeChat, or any other frontend — see the Open WebUI integration guide for step-by-step instructions.
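
If you're scripting against the server rather than wiring up a chat frontend, the OpenAI Python SDK works as a client too. A minimal sketch, assuming the official `openai` package and the key from step 1:

```python
from openai import OpenAI

# Point the SDK at the local hermes-agent API server.
client = OpenAI(
    base_url="http://localhost:8642/v1",
    api_key="change-me-local-dev",  # the value of API_SERVER_KEY
)

response = client.chat.completions.create(
    model="hermes-agent",  # cosmetic; the actual model is configured server-side
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)
```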

## Endpoints

### POST /v1/chat/completions

Standard OpenAI Chat Completions format. Stateless — the full conversation is included in each request via the `messages` array.

Request:

```json
{
  "model": "hermes-agent",
  "messages": [
    {"role": "system", "content": "You are a Python expert."},
    {"role": "user", "content": "Write a fibonacci function"}
  ],
  "stream": false
}
```

Response:

```json
{
  "id": "chatcmpl-abc123",
  "object": "chat.completion",
  "created": 1710000000,
  "model": "hermes-agent",
  "choices": [{
    "index": 0,
    "message": {"role": "assistant", "content": "Here's a fibonacci function..."},
    "finish_reason": "stop"
  }],
  "usage": {"prompt_tokens": 50, "completion_tokens": 200, "total_tokens": 250}
}
```

Streaming ("stream": true): Returns Server-Sent Events (SSE) with token-by-token response chunks. When streaming is enabled in config, tokens are emitted live as the LLM generates them. When disabled, the full response is sent as a single SSE chunk.

### POST /v1/responses

OpenAI Responses API format. Supports server-side conversation state via `previous_response_id` — the server stores full conversation history (including tool calls and results) so multi-turn context is preserved without the client managing it.

Request:

```json
{
  "model": "hermes-agent",
  "input": "What files are in my project?",
  "instructions": "You are a helpful coding assistant.",
  "store": true
}
```

Response:

```json
{
  "id": "resp_abc123",
  "object": "response",
  "status": "completed",
  "model": "hermes-agent",
  "output": [
    {"type": "function_call", "name": "terminal", "arguments": "{\"command\": \"ls\"}", "call_id": "call_1"},
    {"type": "function_call_output", "call_id": "call_1", "output": "README.md src/ tests/"},
    {"type": "message", "role": "assistant", "content": [{"type": "output_text", "text": "Your project has..."}]}
  ],
  "usage": {"input_tokens": 50, "output_tokens": 200, "total_tokens": 250}
}
```

#### Multi-turn with `previous_response_id`

Chain responses to maintain full context (including tool calls) across turns:

```json
{
  "input": "Now show me the README",
  "previous_response_id": "resp_abc123"
}
```

The server reconstructs the full conversation from the stored response chain — all previous tool calls and results are preserved.
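
A sketch of the same chaining from Python, assuming a recent `openai` package (one that exposes the Responses API) and the quick-start key:

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8642/v1", api_key="change-me-local-dev")

# First turn: the server stores the response, including any tool calls it made.
first = client.responses.create(
    model="hermes-agent",
    input="What files are in my project?",
)

# Second turn: chain to the stored response so full context carries over.
second = client.responses.create(
    model="hermes-agent",
    input="Now show me the README",
    previous_response_id=first.id,
)
print(second.output_text)
```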

#### Named conversations

Use the `conversation` parameter instead of tracking response IDs:

```json
{"input": "Hello", "conversation": "my-project"}
{"input": "What's in src/?", "conversation": "my-project"}
{"input": "Run the tests", "conversation": "my-project"}
```

The server automatically chains each request to the latest response in that conversation, similar to how the /title command names gateway sessions.
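
Because the conversation name here is a plain string rather than an OpenAI conversation object, a raw HTTP call is the simplest way to sketch it (assuming the quick-start key and default port):

```python
import requests

URL = "http://localhost:8642/v1/responses"
HEADERS = {"Authorization": "Bearer change-me-local-dev"}

# Both requests name the same conversation; the server chains them automatically.
for prompt in ["Hello", "What's in src/?"]:
    r = requests.post(
        URL,
        headers=HEADERS,
        json={"model": "hermes-agent", "input": prompt, "conversation": "my-project"},
        timeout=300,
    )
    r.raise_for_status()
    print(r.json()["status"])
```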

### GET /v1/responses/{id}

Retrieve a previously stored response by ID.

### DELETE /v1/responses/{id}

Delete a stored response.
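
A quick sketch of storing, retrieving, and deleting a response, assuming a recent `openai` package with Responses API helpers:

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8642/v1", api_key="change-me-local-dev")

resp = client.responses.create(model="hermes-agent", input="Hello!")

# Fetch the stored copy back by ID, then remove it.
stored = client.responses.retrieve(resp.id)
print(stored.status)
client.responses.delete(resp.id)
```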

### GET /v1/models

Lists hermes-agent as an available model. Required by most frontends for model discovery.
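
For example, with the assumed `openai` client from earlier:

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8642/v1", api_key="change-me-local-dev")

# Should print a single "hermes-agent" entry.
for model in client.models.list():
    print(model.id)
```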

### GET /health

Health check. Returns `{"status": "ok"}`. Also available at `GET /v1/health` for OpenAI-compatible clients that expect the `/v1/` prefix.

## System Prompt Handling

When a frontend sends a system message (Chat Completions) or `instructions` field (Responses API), hermes-agent layers it on top of its core system prompt. Your agent keeps all its tools, memory, and skills — the frontend's system prompt adds extra instructions.

This means you can customize behavior per-frontend without losing capabilities (see the sketch below):

- Open WebUI system prompt: "You are a Python expert. Always include type hints."
- The agent still has terminal, file tools, web search, memory, etc.
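
A minimal sketch of that layering from the client side, assuming the `openai` package and the quick-start key; the system message adds instructions on top of the agent's core prompt rather than replacing it:

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8642/v1", api_key="change-me-local-dev")

response = client.chat.completions.create(
    model="hermes-agent",
    messages=[
        # Extra instructions layered on top of hermes-agent's core system prompt.
        {"role": "system", "content": "You are a Python expert. Always include type hints."},
        {"role": "user", "content": "Write a fibonacci function"},
    ],
)
print(response.choices[0].message.content)
```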

## Authentication

Bearer token auth via the Authorization header:

```
Authorization: Bearer ***
```

Configure the key via the `API_SERVER_KEY` env var. If you need a browser to call Hermes directly, also set `API_SERVER_CORS_ORIGINS` to an explicit allowlist.
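
A sketch of what a client should expect, using `requests` and the quick-start key; the exact error code is up to the server, but unauthenticated calls are rejected:

```python
import requests

URL = "http://localhost:8642/v1/models"

# No bearer token: the request should be rejected (typically 401).
print(requests.get(URL).status_code)

# With the key from API_SERVER_KEY, the request succeeds.
authed = requests.get(URL, headers={"Authorization": "Bearer change-me-local-dev"})
print(authed.status_code)  # 200
```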

:::warning Security

The API server gives full access to hermes-agent's toolset, including terminal commands. If you change the bind address to `0.0.0.0` (network-accessible), always set `API_SERVER_KEY` and keep `API_SERVER_CORS_ORIGINS` narrow — without that, remote callers may be able to execute arbitrary commands on your machine.

The default bind address (`127.0.0.1`) is for local-only use. Browser access is disabled by default; enable it only for explicit trusted origins.

:::

## Configuration

### Environment Variables

| Variable | Default | Description |
|----------|---------|-------------|
| `API_SERVER_ENABLED` | `false` | Enable the API server |
| `API_SERVER_PORT` | `8642` | HTTP server port |
| `API_SERVER_HOST` | `127.0.0.1` | Bind address (localhost only by default) |
| `API_SERVER_KEY` | (none) | Bearer token for auth |
| `API_SERVER_CORS_ORIGINS` | (none) | Comma-separated allowed browser origins |

### config.yaml

```yaml
# Not yet supported — use environment variables.
# config.yaml support coming in a future release.
```

## Security Headers

All responses include security headers:

- `X-Content-Type-Options: nosniff` — prevents MIME type sniffing
- `Referrer-Policy: no-referrer` — prevents referrer leakage
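
To confirm from a client, a quick `requests` sketch against the health endpoint (add the Authorization header if your setup requires it):

```python
import requests

r = requests.get("http://localhost:8642/health")
# Both security headers should be present on every response.
print(r.headers.get("X-Content-Type-Options"))  # nosniff
print(r.headers.get("Referrer-Policy"))         # no-referrer
```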

## CORS

The API server does not enable browser CORS by default.

For direct browser access, set an explicit allowlist:

```bash
API_SERVER_CORS_ORIGINS=http://localhost:3000,http://127.0.0.1:3000
```

When CORS is enabled:

- Preflight responses include `Access-Control-Max-Age: 600` (10 minute cache)
- SSE streaming responses include CORS headers so browser EventSource clients work correctly
- `Idempotency-Key` is an allowed request header — clients can send it for deduplication (responses are cached by key for 5 minutes); see the sketch below

Most documented frontends such as Open WebUI connect server-to-server and do not need CORS at all.
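
A sketch of idempotent retries with `requests`; the header name and the 5-minute cache come from the list above, while the UUID value and the assumption that the key is honored on non-browser requests are illustrative:

```python
import uuid
import requests

key = str(uuid.uuid4())
headers = {
    "Authorization": "Bearer change-me-local-dev",
    "Idempotency-Key": key,  # same key => deduplicated while the 5-minute cache holds
}
payload = {"model": "hermes-agent", "messages": [{"role": "user", "content": "Hello!"}]}

# Sending the same request twice with the same key should return the cached response.
for _ in range(2):
    r = requests.post(
        "http://localhost:8642/v1/chat/completions",
        json=payload, headers=headers, timeout=300,
    )
    print(r.json()["id"])
```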

## Compatible Frontends

Any frontend that supports the OpenAI API format works. Tested/documented integrations:

| Frontend | Stars | Connection |
|----------|-------|------------|
| Open WebUI | 126k | Full guide available |
| LobeChat | 73k | Custom provider endpoint |
| LibreChat | 34k | Custom endpoint in `librechat.yaml` |
| AnythingLLM | 56k | Generic OpenAI provider |
| NextChat | 87k | `BASE_URL` env var |
| ChatBox | 39k | API Host setting |
| Jan | 26k | Remote model config |
| HF Chat-UI | 8k | `OPENAI_BASE_URL` |
| big-AGI | 7k | Custom endpoint |
| OpenAI Python SDK | | `OpenAI(base_url="http://localhost:8642/v1")` |
| curl | | Direct HTTP requests |

## Limitations

- **Response storage** — stored responses (for `previous_response_id`) are persisted in SQLite and survive gateway restarts. Max 100 stored responses (LRU eviction).
- **No file upload** — vision/document analysis via uploaded files is not yet supported through the API.
- **Model field is cosmetic** — the `model` field in requests is accepted but the actual LLM model used is configured server-side in config.yaml.