---
name: local-llama-tool-calling-debug
description: Diagnose and fix local llama.cpp models that narrate tool calls instead of executing them through the Hermes agent. Root cause analysis for the Hermes4 + llama-server + OpenAI-compatible API tool chain.
version: 1.0.0
author: Ezra
license: MIT
metadata:
  tags:
    - llama.cpp
    - tool-calling
    - local-model
    - hermes
    - debugging
    - sovereignty
  related_skills:
    - systematic-debugging
    - wizard-house-remote-triage
---

# Local llama.cpp Tool Calling Debug

## When to Use

- Local Hermes/llama-server model narrates tool calls as text instead of executing them
- Model outputs `<tool_call>{"name": "read_file", ...}</tool_call>` as plain text
- Agent treats tool-call text as a normal response and does nothing
- Issue #93 in timmy-config: "Hard local Timmy proof test"

## Root Cause (discovered 2026-03-30)

The Hermes agent expects tool calls via the OpenAI API's structured `tool_calls` field in the JSON response:

```json
{"choices": [{"message": {"tool_calls": [{"id": "...", "function": {"name": "...", "arguments": "..."}}]}}]}
```

Hermes-format models (like Hermes-4-14B) output tool calls as XML-tagged text:

```xml
<tool_call>{"name": "read_file", "arguments": {"path": "..."}}</tool_call>
```

If llama-server doesn't convert those text tags into structured `tool_calls`, the agent sees `message.tool_calls = None` and treats the response as plain text.
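The two failure modes can be distinguished agent-side with a minimal check. This is a sketch, not the agent's actual code; the message dicts below are illustrative shapes modeled on the two responses described above.

```python
import re

# Hermes-style XML-tagged tool call embedded in plain text.
TOOL_CALL_RE = re.compile(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", re.DOTALL)

def classify_response(message: dict) -> str:
    """Decide whether a chat-completion message carries a structured tool
    call, a narrated (text-only) tool call, or plain text."""
    if message.get("tool_calls"):
        return "structured"          # llama-server ran with --jinja
    content = message.get("content") or ""
    if TOOL_CALL_RE.search(content):
        return "narrated"            # tags left as raw text: missing --jinja
    return "plain"

# Illustrative messages shaped like the two cases above.
with_jinja = {"tool_calls": [{"id": "call_0", "function": {
    "name": "read_file", "arguments": '{"path": "/tmp/test.txt"}'}}]}
without_jinja = {"content":
    '<tool_call>{"name": "read_file", "arguments": {"path": "/tmp/test.txt"}}</tool_call>'}

print(classify_response(with_jinja))     # structured
print(classify_response(without_jinja))  # narrated
```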

## The Tool Chain

```text
Hermes Agent (run_agent.py)
  → api_mode: "chat_completions"
  → sends tools as OpenAI function-calling JSON
  → calls client.chat.completions.create(**api_kwargs)
  → expects response.choices[0].message.tool_calls

llama-server (with --jinja)
  → applies model's chat template
  → converts <tool_call> tags → OpenAI tool_calls in response
  → WITHOUT --jinja: returns raw text, no structured tool_calls
```

## Fix: Start llama-server with --jinja

The `--jinja` flag tells llama-server to use the model's built-in chat template, which includes Hermes-format tool call parsing.

```sh
llama-server \
  -m /path/to/NousResearch_Hermes-4-14B-Q4_K_M.gguf \
  --port 8081 \
  --jinja \
  -ngl 99 \
  -c 8192 \
  -np 1
```

Key flags:

- `--jinja` — REQUIRED for tool call formatting
- `-ngl 99` — offload all layers to GPU (Apple Silicon)
- `-c 8192` — context length (keep small for speed; 65536 is valid but slow)
- `-np 1` — CRITICAL: single parallel slot. Without this, llama-server defaults to `-np 4` (auto), splitting context evenly across slots. With `-c 8192 -np 4`, each slot only gets 2048 tokens — not enough for tool schemas + system prompt. The server accepts the request but `n_decoded` stays at 0 forever. Use `-np 1` to give the full context to one slot.
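The slot-starvation arithmetic behind the last flag can be sketched directly. The prompt-token figure below is illustrative, not a measured count.

```python
def per_slot_context(n_ctx: int, n_parallel: int) -> int:
    """llama-server splits the total context evenly across parallel slots."""
    return n_ctx // n_parallel

# Rough prompt budget: tool schemas + system prompt (illustrative figure).
prompt_tokens = 1500 + 1200

print(per_slot_context(8192, 4))  # 2048 -- too small, request hangs at n_decoded: 0
print(per_slot_context(8192, 1))  # 8192 -- full context for one slot
```

With `-np 4`, the 2700-token prompt exceeds the 2048-token slot; with `-np 1`, it fits with room to spare.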

## Provider Configuration in Hermes

`"local-llama.cpp"` is NOT in the `PROVIDER_REGISTRY`. It resolves through `custom_providers` in `config.yaml`:

```yaml
model:
  default: hermes4:14b
  provider: custom
  base_url: http://localhost:8081/v1

custom_providers:
  - name: Local llama.cpp
    base_url: http://localhost:8081/v1
    api_key: none
    model: hermes4:14b
```

This gives it `api_mode: "chat_completions"`, which is correct for llama-server's OpenAI-compatible endpoint.
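In chat_completions mode the agent sends an OpenAI-style request body with function-calling tool schemas. A sketch of that shape, assuming a minimal illustrative `read_file` schema (not the agent's actual schema):

```python
import json

def build_chat_request(model: str, user_prompt: str, tools: list) -> dict:
    """Assemble an OpenAI-compatible /v1/chat/completions body of the kind
    sent to llama-server's endpoint in api_mode "chat_completions"."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": user_prompt}],
        "tools": tools,           # function-calling JSON schemas
        "tool_choice": "auto",    # let the model decide when to call
    }

# Minimal illustrative tool schema (hypothetical, for demonstration only).
read_file_tool = {
    "type": "function",
    "function": {
        "name": "read_file",
        "description": "Read a file from disk",
        "parameters": {
            "type": "object",
            "properties": {"path": {"type": "string"}},
            "required": ["path"],
        },
    },
}

body = build_chat_request("hermes4:14b", "Use read_file to read /tmp/test.txt",
                          [read_file_tool])
print(json.dumps(body, indent=2)[:80])
```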

## Existing Parser (not wired into main loop)

The codebase has `environments/tool_call_parsers/hermes_parser.py` (`HermesToolCallParser`) that CAN parse `<tool_call>` tags from text. But it's only used for Atropos RL training, NOT in the main agent loop (`run_agent.py`).

A future code fix could wire this parser as a fallback in `run_agent.py` around line 7352 when `message.tool_calls` is None but the content contains `<tool_call>` tags.
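Such a fallback might look like the sketch below. This is not the actual `HermesToolCallParser` API; the function name and tool_calls shape are assumptions based on the OpenAI response format described earlier.

```python
import json
import re
import uuid

TOOL_CALL_RE = re.compile(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", re.DOTALL)

def fallback_parse_tool_calls(content: str) -> list[dict]:
    """Convert narrated <tool_call> tags into OpenAI-style tool_calls
    entries, for use when message.tool_calls is None."""
    calls = []
    for match in TOOL_CALL_RE.finditer(content or ""):
        try:
            parsed = json.loads(match.group(1))
        except json.JSONDecodeError:
            continue  # malformed tag: leave it as plain text
        calls.append({
            "id": f"call_{uuid.uuid4().hex[:8]}",
            "type": "function",
            "function": {
                "name": parsed["name"],
                # OpenAI carries function arguments as a JSON *string*
                "arguments": json.dumps(parsed.get("arguments", {})),
            },
        })
    return calls

text = '<tool_call>{"name": "read_file", "arguments": {"path": "/tmp/test.txt"}}</tool_call>'
calls = fallback_parse_tool_calls(text)
print(calls[0]["function"]["name"])  # read_file
```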

## Performance: System Prompt Bloat

The default Timmy config injects SOUL.md (~4K tokens) + the full skills list (~6K tokens) + memory (~2K tokens) into the system prompt. On a 14B model with 65K context, this causes:

- Very slow prompt processing (minutes, not seconds)
- Timeouts on 180s SSH commands
- Sessions that appear stuck

Fix for overnight loops:

- Use `-c 8192` instead of `-c 65536`
- Use `system_prompt_suffix: ""` or a minimal prompt
- Use `-t file` to restrict toolsets (fewer tool schemas in context)
- Keep task prompts short and specific
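The arithmetic behind the bloat, using the approximate token counts from the figures above (treat them as estimates, not measurements):

```python
# Approximate system-prompt components on the default Timmy config.
SOUL_MD = 4_000
SKILLS_LIST = 6_000
MEMORY = 2_000

system_prompt = SOUL_MD + SKILLS_LIST + MEMORY   # ~12K tokens

# With the full prompt, even -c 8192 cannot hold the system prompt,
# let alone tool schemas and the conversation itself.
print(system_prompt > 8192)   # True

# With a minimal prompt (system_prompt_suffix: "") and a restricted
# toolset, the same 8192-token context leaves room for actual work.
minimal_prompt = 500          # illustrative
print(8192 - minimal_prompt)  # 7692 tokens left for tools + turns
```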

## Verification

1. Health check: `curl -s http://localhost:8081/health`
2. Slot status: `curl -s http://localhost:8081/slots | python3 -m json.tool`
3. Quick tool test:

   ```sh
   hermes chat -Q -m hermes4:14b -t file \
     -q "Use read_file to read /tmp/test.txt"
   ```

4. Check the session JSON for `"tool_calls"` in assistant messages
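The last verification step (checking the session JSON for tool_calls) can be automated with a short check. This is a sketch: the session layout is assumed to be a JSON list of messages, which may differ from the agent's actual on-disk format.

```python
def has_structured_tool_calls(session: list[dict]) -> bool:
    """Return True if any assistant message carries a non-empty tool_calls list."""
    return any(
        msg.get("role") == "assistant" and msg.get("tool_calls")
        for msg in session
    )

# Illustrative session fragment, not a real on-disk file.
session = [
    {"role": "user", "content": "Use read_file to read /tmp/test.txt"},
    {"role": "assistant", "tool_calls": [
        {"id": "call_0", "function": {"name": "read_file",
                                      "arguments": '{"path": "/tmp/test.txt"}'}}]},
]
print(has_structured_tool_calls(session))  # True
```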

## Proof That --jinja Works

Session `20260329_211125_cb03eb` (cron heartbeat) shows the model successfully called `session_search` via the tool API. The `--jinja` flag enables tool calling.

## Pitfalls

1. Without `--jinja`, tools silently fail. The model outputs tool-call text, the agent ignores it and returns the text as the response. No error is thrown.
2. Large context = slow. 65K context on 14B Q4 with a full system prompt can take 3+ minutes per turn. Use 8192 for tight loops.
3. `-np 4` silently starves slots. llama-server defaults to `n_parallel = auto` (usually 4). With `-c 8192`, each slot gets only 2048 tokens. Tool schemas + system prompt exceed that. The request hangs with `n_decoded: 0` and no error. Always use `-np 1` for tool-calling workloads unless you've verified per-slot context is sufficient.
4. Custom provider CLI. `--provider custom` is not a valid CLI arg. Use env vars: `OPENAI_BASE_URL=http://localhost:8081/v1 OPENAI_API_KEY=none hermes chat ...`
5. System python on macOS is 3.9. Use the venv python (`~/.hermes/hermes-agent/venv/bin/python3`) for scripts that use 3.10+ syntax like `X | None`.
6. Cron jobs may inherit cloud config. If `model=null` in a cron job, it inherits the default. After cutting over to local, verify all cron jobs have an explicit local model/provider.
7. Gemini fallback may still be active. Check `fallback_model` in `config.yaml` — it may silently route to cloud when local times out.
8. OpenClaw model routing is separate. Even if the Hermes config points local, OpenClaw may still route Telegram DMs to cloud. Check with `openclaw models status --json`.