| name | description | version | author | license |
|---|---|---|---|---|
| local-llama-tool-calling-debug | Diagnose and fix local llama.cpp models that narrate tool calls instead of executing them through the Hermes agent. Root cause analysis for the Hermes4 + llama-server + OpenAI-compatible API tool chain. | 1.0.0 | Ezra | MIT |
# Local llama.cpp Tool Calling Debug

## When to Use

- Local Hermes/llama-server model narrates tool calls as text instead of executing them
- Model outputs `<tool_call>{"name": "read_file", ...}</tool_call>` as plain text
- Agent treats tool-call text as a normal response and does nothing
- Issue #93 in timmy-config: "Hard local Timmy proof test"
## Root Cause (discovered 2026-03-30)

The Hermes agent expects tool calls via the OpenAI API's structured `tool_calls` field in the JSON response:

```json
{"choices": [{"message": {"tool_calls": [{"id": "...", "function": {"name": "...", "arguments": "..."}}]}}]}
```

Hermes-format models (like Hermes-4-14B) output tool calls as XML-tagged text:

```xml
<tool_call>{"name": "read_file", "arguments": {"path": "..."}}</tool_call>
```

If llama-server doesn't convert those text tags into structured `tool_calls`, the agent sees `message.tool_calls = None` and treats the response as plain text.
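A minimal sketch of the failure mode (function and variable names here are illustrative, not the actual run_agent.py code): the agent only dispatches tools when the structured field is populated, so narrated tag text falls straight through as ordinary content.

```python
# Sketch: the agent branches on the structured tool_calls field only.
# When llama-server returns raw text, that field is None/absent and the
# <tool_call> tag text is passed back to the user as a normal reply.
def handle_response(message: dict) -> str:
    tool_calls = message.get("tool_calls")  # None when the server returned raw text
    if tool_calls:
        return f"dispatching {len(tool_calls)} tool call(s)"
    # Narrated tool-call text falls through here untouched.
    return message.get("content", "")

narrated = {"content": '<tool_call>{"name": "read_file"}</tool_call>'}
structured = {"tool_calls": [{"id": "1", "function": {"name": "read_file", "arguments": "{}"}}]}

print(handle_response(narrated))    # the raw tag text; no tool ever runs
print(handle_response(structured))  # dispatching 1 tool call(s)
```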
## The Tool Chain

```
Hermes Agent (run_agent.py)
  → api_mode: "chat_completions"
  → sends tools as OpenAI function-calling JSON
  → calls client.chat.completions.create(**api_kwargs)
  → expects response.choices[0].message.tool_calls

llama-server (with --jinja)
  → applies model's chat template
  → converts <tool_call> tags → OpenAI tool_calls in response
  → WITHOUT --jinja: returns raw text, no structured tool_calls
```
## Fix: Start llama-server with `--jinja`

The `--jinja` flag tells llama-server to use the model's built-in chat template, which includes Hermes-format tool call parsing.

```sh
llama-server \
  -m /path/to/NousResearch_Hermes-4-14B-Q4_K_M.gguf \
  --port 8081 \
  --jinja \
  -ngl 99 \
  -c 8192 \
  -np 1
```
Key flags:

- `--jinja` — REQUIRED for tool call formatting
- `-ngl 99` — offload all layers to GPU (Apple Silicon)
- `-c 8192` — context length (keep small for speed; 65536 is valid but slow)
- `-np 1` — CRITICAL: single parallel slot. Without this, llama-server defaults to `-np 4` (auto), splitting context evenly across slots. With `-c 8192 -np 4`, each slot only gets 2048 tokens — not enough for tool schemas + system prompt. The server accepts the request but `n_decoded` stays at 0 forever. Use `-np 1` to give the full context to one slot.
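The slot arithmetic above can be sanity-checked in a line or two (a sketch of the even split, not the server's internals):

```python
# llama-server divides the -c context evenly across -np slots.
def per_slot_ctx(c: int, n_parallel: int) -> int:
    return c // n_parallel

print(per_slot_ctx(8192, 4))  # 2048 — too small for tool schemas + system prompt
print(per_slot_ctx(8192, 1))  # 8192 — the single slot gets the full context
```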
## Provider Configuration in Hermes

"local-llama.cpp" is NOT in the `PROVIDER_REGISTRY`. It resolves through `custom_providers` in config.yaml:

```yaml
model:
  default: hermes4:14b
  provider: custom
  base_url: http://localhost:8081/v1

custom_providers:
  - name: Local llama.cpp
    base_url: http://localhost:8081/v1
    api_key: none
    model: hermes4:14b
```

This gives it `api_mode: "chat_completions"`, which is correct for llama-server's OpenAI-compatible endpoint.
## Existing Parser (not wired into main loop)

The codebase has `environments/tool_call_parsers/hermes_parser.py` (`HermesToolCallParser`), which CAN parse `<tool_call>` tags from text. But it's only used for Atropos RL training, NOT in the main agent loop (run_agent.py).

A future code fix could wire this parser in as a fallback in run_agent.py (around line 7352) for the case where `message.tool_calls` is None but the content contains `<tool_call>` tags.
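A hedged sketch of what such a fallback could look like. The regex and function name here are illustrative; they are not the actual `HermesToolCallParser` implementation.

```python
# Hypothetical fallback: extract Hermes-format <tool_call> JSON blobs from
# plain message content when the structured tool_calls field is missing.
import json
import re

TOOL_CALL_RE = re.compile(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", re.DOTALL)

def parse_tool_calls_from_text(content: str) -> list[dict]:
    """Return the parsed JSON payload of each <tool_call> tag in content."""
    calls = []
    for blob in TOOL_CALL_RE.findall(content):
        try:
            calls.append(json.loads(blob))
        except json.JSONDecodeError:
            continue  # malformed blob: leave it as plain text
    return calls

text = '<tool_call>{"name": "read_file", "arguments": {"path": "/tmp/test.txt"}}</tool_call>'
print(parse_tool_calls_from_text(text))
# [{'name': 'read_file', 'arguments': {'path': '/tmp/test.txt'}}]
```

The non-greedy match plus the closing-tag anchor lets the regex cope with nested braces in `arguments`, which a bare `\{.*?\}` alone would truncate.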
## Performance: System Prompt Bloat

The default Timmy config injects SOUL.md (~4K tokens) + full skills list (~6K tokens) + memory (~2K tokens) into the system prompt. On a 14B model with 65K context, this causes:

- Very slow prompt processing (minutes, not seconds)
- Timeouts on 180s SSH commands
- Sessions that appear stuck
Fixes for overnight loops:

- Use `-c 8192` instead of `-c 65536`
- Use `system_prompt_suffix: ""` or a minimal prompt
- Use `-t file` to restrict toolsets (fewer tool schemas in context)
- Keep task prompts short and specific
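Back-of-envelope arithmetic on the estimates above shows why trimming the system prompt matters as much as shrinking the context: the default injection alone would overflow an 8192-token window.

```python
# Rough prompt budget from the approximate numbers in this section.
soul_md, skills_list, memory = 4000, 6000, 2000
default_system_prompt = soul_md + skills_list + memory

print(default_system_prompt)          # 12000 tokens before any task content
print(default_system_prompt < 8192)   # False: -c 8192 needs a trimmed prompt too
```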
## Verification

- Health check: `curl -s http://localhost:8081/health`
- Slot status: `curl -s http://localhost:8081/slots | python3 -m json.tool`
- Quick tool test: `hermes chat -Q -m hermes4:14b -t file -q "Use read_file to read /tmp/test.txt"`
- Check the session JSON for `"tool_calls"` in assistant messages
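The last check can be scripted. This sketch assumes the session file holds an OpenAI-style message list; the exact Hermes session schema may differ.

```python
# Scan an OpenAI-style message list for structured assistant tool calls.
def session_used_tools(messages: list[dict]) -> bool:
    return any(
        m.get("role") == "assistant" and m.get("tool_calls")
        for m in messages
    )

session = [
    {"role": "user", "content": "Use read_file to read /tmp/test.txt"},
    {"role": "assistant", "tool_calls": [
        {"id": "call_1", "function": {"name": "read_file",
                                      "arguments": '{"path": "/tmp/test.txt"}'}},
    ]},
]

print(session_used_tools(session))  # True
```

If this returns False for a session that should have used tools, the model likely narrated the call as text instead of emitting it through the API.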
## Proof That `--jinja` Works

Session 20260329_211125_cb03eb (cron heartbeat) shows the model successfully calling `session_search` via the tool API. The `--jinja` flag enables tool calling.
## Pitfalls

- Without `--jinja`, tools silently fail. The model outputs tool-call text, the agent ignores it and returns the text as the response. No error is thrown.
- Large context = slow. 65K context on 14B Q4 with a full system prompt can take 3+ minutes per turn. Use 8192 for tight loops.
- `-np 4` silently starves slots. llama-server defaults to `n_parallel = auto` (usually 4). With `-c 8192`, each slot gets only 2048 tokens; tool schemas + system prompt exceed that. The request hangs with `n_decoded: 0` and no error. Always use `-np 1` for tool-calling workloads unless you've verified per-slot context is sufficient.
- Custom provider CLI: `--provider custom` is not a valid CLI arg. Use env vars: `OPENAI_BASE_URL=http://localhost:8081/v1 OPENAI_API_KEY=none hermes chat ...`
- System python on macOS is 3.9. Use the venv python (`~/.hermes/hermes-agent/venv/bin/python3`) for scripts that use 3.10+ syntax like `X | None`.
- Cron jobs may inherit cloud config. If `model=null` in a cron job, it inherits the default. After cutting over to local, verify all cron jobs have an explicit local model/provider.
- Gemini fallback may still be active. Check `fallback_model` in config.yaml — it may silently route to cloud when local times out.
- OpenClaw model routing is separate. Even if the Hermes config points local, OpenClaw may still route Telegram DMs to cloud. Check with `openclaw models status --json`.