feat: OpenAI-compatible API server + WhatsApp configurable reply prefix (#1756)
* feat: OpenAI-compatible API server platform adapter Salvaged from PR #956, updated for current main. Adds an HTTP API server as a gateway platform adapter that exposes hermes-agent via the OpenAI Chat Completions and Responses APIs. Any OpenAI-compatible frontend (Open WebUI, LobeChat, LibreChat, AnythingLLM, NextChat, ChatBox, etc.) can connect by pointing at http://localhost:8642/v1. Endpoints: - POST /v1/chat/completions — stateless Chat Completions API - POST /v1/responses — stateful Responses API with chaining - GET /v1/responses/{id} — retrieve stored response - DELETE /v1/responses/{id} — delete stored response - GET /v1/models — list hermes-agent as available model - GET /health — health check Features: - Real SSE streaming via stream_delta_callback (uses main's streaming) - In-memory LRU response store for Responses API conversation chaining - Named conversations via 'conversation' parameter - Bearer token auth (optional, via API_SERVER_KEY) - CORS support for browser-based frontends - System prompt layering (frontend system messages on top of core) - Real token usage tracking in responses Integration points: - Platform.API_SERVER in gateway/config.py - _create_adapter() branch in gateway/run.py - API_SERVER_* env vars in hermes_cli/config.py - Env var overrides in gateway/config.py _apply_env_overrides() Changes vs original PR #956: - Removed streaming infrastructure (already on main via stream_consumer.py) - Removed Telegram reply_to_mode (separate feature, not included) - Updated _resolve_model() -> _resolve_gateway_model() - Updated stream_callback -> stream_delta_callback - Updated connect()/disconnect() to use _mark_connected()/_mark_disconnected() - Adapted to current Platform enum (includes MATTERMOST, MATRIX, DINGTALK) Tests: 72 new tests, all passing Docs: API server guide, Open WebUI integration guide, env var reference * feat(whatsapp): make reply prefix configurable via config.yaml Reworked from PR #1764 (ifrederico) to use config.yaml instead of .env. The WhatsApp bridge prepends a header to every outgoing message. This was hardcoded to '⚕ *Hermes Agent*'. Users can now customize or disable it via config.yaml: whatsapp: reply_prefix: '' # disable header reply_prefix: '🤖 *My Bot*\n───\n' # custom prefix How it works: - load_gateway_config() reads whatsapp.reply_prefix from config.yaml and stores it in PlatformConfig.extra['reply_prefix'] - WhatsAppAdapter reads it from config.extra at init - When spawning bridge.js, the adapter passes it as WHATSAPP_REPLY_PREFIX in the subprocess environment - bridge.js handles undefined (default), empty (no header), or custom values with \\n escape support - Self-chat echo suppression uses the configured prefix Also fixes _config_version: was 9 but ENV_VARS_BY_VERSION had a key 10 (TAVILY_API_KEY), so existing users at v9 would never be prompted for Tavily. Bumped to 10 to close the gap. Added a regression test to prevent this from happening again. Credit: ifrederico (PR #1764) for the bridge.js implementation and the config version gap discovery. --------- Co-authored-by: Test <test@test.com>
This commit is contained in:
223
website/docs/user-guide/features/api-server.md
Normal file
223
website/docs/user-guide/features/api-server.md
Normal file
@@ -0,0 +1,223 @@
|
||||
---
|
||||
sidebar_position: 14
|
||||
title: "API Server"
|
||||
description: "Expose hermes-agent as an OpenAI-compatible API for any frontend"
|
||||
---
|
||||
|
||||
# API Server
|
||||
|
||||
The API server exposes hermes-agent as an OpenAI-compatible HTTP endpoint. Any frontend that speaks the OpenAI format — Open WebUI, LobeChat, LibreChat, NextChat, ChatBox, and hundreds more — can connect to hermes-agent and use it as a backend.
|
||||
|
||||
Your agent handles requests with its full toolset (terminal, file operations, web search, memory, skills) and returns the final response. Tool calls execute invisibly server-side.
|
||||
|
||||
## Quick Start
|
||||
|
||||
### 1. Enable the API server
|
||||
|
||||
Add to `~/.hermes/.env`:
|
||||
|
||||
```bash
|
||||
API_SERVER_ENABLED=true
|
||||
```
|
||||
|
||||
### 2. Start the gateway
|
||||
|
||||
```bash
|
||||
hermes gateway
|
||||
```
|
||||
|
||||
You'll see:
|
||||
|
||||
```
|
||||
[API Server] API server listening on http://127.0.0.1:8642
|
||||
```
|
||||
|
||||
### 3. Connect a frontend
|
||||
|
||||
Point any OpenAI-compatible client at `http://localhost:8642/v1`:
|
||||
|
||||
```bash
|
||||
# Test with curl
|
||||
curl http://localhost:8642/v1/chat/completions \
|
||||
-H "Content-Type: application/json" \
|
||||
-d '{"model": "hermes-agent", "messages": [{"role": "user", "content": "Hello!"}]}'
|
||||
```
|
||||
|
||||
Or connect Open WebUI, LobeChat, or any other frontend — see the [Open WebUI integration guide](/docs/user-guide/messaging/open-webui) for step-by-step instructions.
|
||||
|
||||
## Endpoints
|
||||
|
||||
### POST /v1/chat/completions
|
||||
|
||||
Standard OpenAI Chat Completions format. Stateless — the full conversation is included in each request via the `messages` array.
|
||||
|
||||
**Request:**
|
||||
```json
|
||||
{
|
||||
"model": "hermes-agent",
|
||||
"messages": [
|
||||
{"role": "system", "content": "You are a Python expert."},
|
||||
{"role": "user", "content": "Write a fibonacci function"}
|
||||
],
|
||||
"stream": false
|
||||
}
|
||||
```
|
||||
|
||||
**Response:**
|
||||
```json
|
||||
{
|
||||
"id": "chatcmpl-abc123",
|
||||
"object": "chat.completion",
|
||||
"created": 1710000000,
|
||||
"model": "hermes-agent",
|
||||
"choices": [{
|
||||
"index": 0,
|
||||
"message": {"role": "assistant", "content": "Here's a fibonacci function..."},
|
||||
"finish_reason": "stop"
|
||||
}],
|
||||
"usage": {"prompt_tokens": 50, "completion_tokens": 200, "total_tokens": 250}
|
||||
}
|
||||
```
|
||||
|
||||
**Streaming** (`"stream": true`): Returns Server-Sent Events (SSE) with token-by-token response chunks. When streaming is enabled in config, tokens are emitted live as the LLM generates them. When disabled, the full response is sent as a single SSE chunk.
|
||||
|
||||
### POST /v1/responses
|
||||
|
||||
OpenAI Responses API format. Supports server-side conversation state via `previous_response_id` — the server stores full conversation history (including tool calls and results) so multi-turn context is preserved without the client managing it.
|
||||
|
||||
**Request:**
|
||||
```json
|
||||
{
|
||||
"model": "hermes-agent",
|
||||
"input": "What files are in my project?",
|
||||
"instructions": "You are a helpful coding assistant.",
|
||||
"store": true
|
||||
}
|
||||
```
|
||||
|
||||
**Response:**
|
||||
```json
|
||||
{
|
||||
"id": "resp_abc123",
|
||||
"object": "response",
|
||||
"status": "completed",
|
||||
"model": "hermes-agent",
|
||||
"output": [
|
||||
{"type": "function_call", "name": "terminal", "arguments": "{\"command\": \"ls\"}", "call_id": "call_1"},
|
||||
{"type": "function_call_output", "call_id": "call_1", "output": "README.md src/ tests/"},
|
||||
{"type": "message", "role": "assistant", "content": [{"type": "output_text", "text": "Your project has..."}]}
|
||||
],
|
||||
"usage": {"input_tokens": 50, "output_tokens": 200, "total_tokens": 250}
|
||||
}
|
||||
```
|
||||
|
||||
#### Multi-turn with previous_response_id
|
||||
|
||||
Chain responses to maintain full context (including tool calls) across turns:
|
||||
|
||||
```json
|
||||
{
|
||||
"input": "Now show me the README",
|
||||
"previous_response_id": "resp_abc123"
|
||||
}
|
||||
```
|
||||
|
||||
The server reconstructs the full conversation from the stored response chain — all previous tool calls and results are preserved.
|
||||
|
||||
#### Named conversations
|
||||
|
||||
Use the `conversation` parameter instead of tracking response IDs:
|
||||
|
||||
```json
|
||||
{"input": "Hello", "conversation": "my-project"}
|
||||
{"input": "What's in src/?", "conversation": "my-project"}
|
||||
{"input": "Run the tests", "conversation": "my-project"}
|
||||
```
|
||||
|
||||
The server automatically chains to the latest response in that conversation. Like the `/title` command for gateway sessions.
|
||||
|
||||
### GET /v1/responses/{id}
|
||||
|
||||
Retrieve a previously stored response by ID.
|
||||
|
||||
### DELETE /v1/responses/{id}
|
||||
|
||||
Delete a stored response.
|
||||
|
||||
### GET /v1/models
|
||||
|
||||
Lists `hermes-agent` as an available model. Required by most frontends for model discovery.
|
||||
|
||||
### GET /health
|
||||
|
||||
Health check. Returns `{"status": "ok"}`.
|
||||
|
||||
## System Prompt Handling
|
||||
|
||||
When a frontend sends a `system` message (Chat Completions) or `instructions` field (Responses API), hermes-agent **layers it on top** of its core system prompt. Your agent keeps all its tools, memory, and skills — the frontend's system prompt adds extra instructions.
|
||||
|
||||
This means you can customize behavior per-frontend without losing capabilities:
|
||||
- Open WebUI system prompt: "You are a Python expert. Always include type hints."
|
||||
- The agent still has terminal, file tools, web search, memory, etc.
|
||||
|
||||
## Authentication
|
||||
|
||||
Bearer token auth via the `Authorization` header:
|
||||
|
||||
```
|
||||
Authorization: Bearer ***
|
||||
```
|
||||
|
||||
Configure the key via `API_SERVER_KEY` env var. If no key is set, all requests are allowed (for local-only use).
|
||||
|
||||
:::warning Security
|
||||
The API server gives full access to hermes-agent's toolset, **including terminal commands**. If you change the bind address to `0.0.0.0` (network-accessible), **always set `API_SERVER_KEY`** — without it, anyone on your network can execute arbitrary commands on your machine.
|
||||
|
||||
The default bind address (`127.0.0.1`) is safe for local-only use.
|
||||
:::
|
||||
|
||||
## Configuration
|
||||
|
||||
### Environment Variables
|
||||
|
||||
| Variable | Default | Description |
|
||||
|----------|---------|-------------|
|
||||
| `API_SERVER_ENABLED` | `false` | Enable the API server |
|
||||
| `API_SERVER_PORT` | `8642` | HTTP server port |
|
||||
| `API_SERVER_HOST` | `127.0.0.1` | Bind address (localhost only by default) |
|
||||
| `API_SERVER_KEY` | _(none)_ | Bearer token for auth |
|
||||
|
||||
### config.yaml
|
||||
|
||||
```yaml
|
||||
# Not yet supported — use environment variables.
|
||||
# config.yaml support coming in a future release.
|
||||
```
|
||||
|
||||
## CORS
|
||||
|
||||
The API server includes CORS headers on all responses (`Access-Control-Allow-Origin: *`), so browser-based frontends can connect directly.
|
||||
|
||||
## Compatible Frontends
|
||||
|
||||
Any frontend that supports the OpenAI API format works. Tested/documented integrations:
|
||||
|
||||
| Frontend | Stars | Connection |
|
||||
|----------|-------|------------|
|
||||
| [Open WebUI](/docs/user-guide/messaging/open-webui) | 126k | Full guide available |
|
||||
| LobeChat | 73k | Custom provider endpoint |
|
||||
| LibreChat | 34k | Custom endpoint in librechat.yaml |
|
||||
| AnythingLLM | 56k | Generic OpenAI provider |
|
||||
| NextChat | 87k | BASE_URL env var |
|
||||
| ChatBox | 39k | API Host setting |
|
||||
| Jan | 26k | Remote model config |
|
||||
| HF Chat-UI | 8k | OPENAI_BASE_URL |
|
||||
| big-AGI | 7k | Custom endpoint |
|
||||
| OpenAI Python SDK | — | `OpenAI(base_url="http://localhost:8642/v1")` |
|
||||
| curl | — | Direct HTTP requests |
|
||||
|
||||
## Limitations
|
||||
|
||||
- **Response storage is in-memory** — stored responses (for `previous_response_id`) are lost on gateway restart. Max 100 stored responses (LRU eviction).
|
||||
- **No file upload** — vision/document analysis via uploaded files is not yet supported through the API.
|
||||
- **Model field is cosmetic** — the `model` field in requests is accepted but the actual LLM model used is configured server-side in config.yaml.
|
||||
Reference in New Issue
Block a user