---
name: local-llama-tool-calling-debug
description: Diagnose and fix local llama.cpp models that narrate tool calls instead of executing them through the Hermes agent. Root cause analysis for the Hermes4 + llama-server + OpenAI-compatible API tool chain.
version: 1.0.0
author: Ezra
license: MIT
metadata:
  hermes:
    tags: [llama.cpp, tool-calling, local-model, hermes, debugging, sovereignty]
    related_skills: [systematic-debugging, wizard-house-remote-triage]
---

# Local llama.cpp Tool Calling Debug

## When to Use

- Local Hermes/llama-server model narrates tool calls as text instead of executing them
- Model outputs `<tool_call>{"name": "read_file", ...}</tool_call>` as plain text
- Agent treats tool-call text as a normal response and does nothing
- Issue #93 in timmy-config: "Hard local Timmy proof test"
## Root Cause (discovered 2026-03-30)

The Hermes agent expects tool calls via the OpenAI API's structured `tool_calls` field in the JSON response:

```json
{"choices": [{"message": {"tool_calls": [{"id": "...", "function": {"name": "...", "arguments": "..."}}]}}]}
```

Hermes-format models (like Hermes-4-14B) output tool calls as XML-tagged text:

```
<tool_call>{"name": "read_file", "arguments": {"path": "..."}}</tool_call>
```

If llama-server doesn't convert those text tags into structured `tool_calls`, the agent sees `message.tool_calls = None` and treats the response as plain text.
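
A minimal sketch of the mismatch, with both response shapes as plain dicts (field names follow the OpenAI chat-completions schema; the values are illustrative):

```python
# With --jinja: llama-server parses the <tool_call> tags into structured data.
structured = {
    "message": {
        "content": None,
        "tool_calls": [
            {"id": "call_0",
             "function": {"name": "read_file",
                          "arguments": '{"path": "/tmp/test.txt"}'}},
        ],
    }
}

# Without --jinja: the same call comes back as plain narrated text.
narrated = {
    "message": {
        "content": '<tool_call>{"name": "read_file", "arguments": {"path": "/tmp/test.txt"}}</tool_call>',
        "tool_calls": None,
    }
}

def agent_sees_tool_call(response: dict) -> bool:
    """Mirror the agent-side check: only structured tool_calls count."""
    return bool(response["message"].get("tool_calls"))

print(agent_sees_tool_call(structured))  # True
print(agent_sees_tool_call(narrated))    # False -> treated as a normal text reply
```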

## The Tool Chain

```
Hermes Agent (run_agent.py)
  → api_mode: "chat_completions"
  → sends tools as OpenAI function-calling JSON
  → calls client.chat.completions.create(**api_kwargs)
  → expects response.choices[0].message.tool_calls

llama-server (with --jinja)
  → applies model's chat template
  → converts <tool_call> tags → OpenAI tool_calls in response
  → WITHOUT --jinja: returns raw text, no structured tool_calls
```
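
For reference, a hedged sketch of the kind of `tools` payload the agent-side step sends (the `read_file` schema here is illustrative, not copied from the agent; the layout follows the OpenAI function-calling format that llama-server accepts):

```python
# Illustrative OpenAI-style function-calling payload. The actual tool
# schemas the Hermes agent builds may differ in names and fields.
tools = [
    {
        "type": "function",
        "function": {
            "name": "read_file",
            "description": "Read a file from disk and return its contents.",
            "parameters": {
                "type": "object",
                "properties": {
                    "path": {"type": "string",
                             "description": "Absolute path to read"},
                },
                "required": ["path"],
            },
        },
    }
]

api_kwargs = {
    "model": "hermes4:14b",
    "messages": [{"role": "user",
                  "content": "Use read_file to read /tmp/test.txt"}],
    "tools": tools,
}
# client.chat.completions.create(**api_kwargs) would send this to llama-server.
```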

## Fix: Start llama-server with --jinja

The `--jinja` flag tells llama-server to use the model's built-in chat template, which includes Hermes-format tool call parsing.

```bash
llama-server \
  -m /path/to/NousResearch_Hermes-4-14B-Q4_K_M.gguf \
  --port 8081 \
  --jinja \
  -ngl 99 \
  -c 8192 \
  -np 1
```

Key flags:

- `--jinja` — REQUIRED for tool call formatting
- `-ngl 99` — offload all layers to GPU (Apple Silicon)
- `-c 8192` — context length (keep it small for speed; 65536 is valid but slow)
- `-np 1` — CRITICAL: a single parallel slot. Without it, llama-server defaults to `-np 4` (auto) and splits the context evenly across slots. With `-c 8192 -np 4`, each slot gets only 2048 tokens — not enough for the tool schemas plus the system prompt. The server accepts the request, but `n_decoded` stays at 0 forever. Use `-np 1` to give the full context to one slot.
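
The per-slot arithmetic behind the `-np` warning can be sketched as a one-liner (a simplified model of llama-server's even split, not its actual implementation):

```python
def ctx_per_slot(n_ctx: int, n_parallel: int) -> int:
    """Model llama-server splitting the context window evenly across slots."""
    return n_ctx // n_parallel

# The starvation case: four auto slots leave 2048 tokens each.
print(ctx_per_slot(8192, 4))  # 2048 -- too small for tool schemas + system prompt
# With -np 1 the single slot keeps the full window.
print(ctx_per_slot(8192, 1))  # 8192
```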

## Provider Configuration in Hermes

"local-llama.cpp" is NOT in the PROVIDER_REGISTRY. It resolves through `custom_providers` in config.yaml:

```yaml
model:
  default: hermes4:14b
  provider: custom
  base_url: http://localhost:8081/v1

custom_providers:
  - name: Local llama.cpp
    base_url: http://localhost:8081/v1
    api_key: none
    model: hermes4:14b
```

This gives it `api_mode: "chat_completions"`, which is correct for llama-server's OpenAI-compatible endpoint.

## Existing Parser (not wired into main loop)

The codebase has `environments/tool_call_parsers/hermes_parser.py` (HermesToolCallParser), which CAN parse `<tool_call>` tags from text. But it is only used for Atropos RL training, NOT in the main agent loop (`run_agent.py`).

A future code fix could wire this parser in as a fallback in run_agent.py around line 7352, for the case where `message.tool_calls` is None but the content contains `<tool_call>` tags.
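
A minimal sketch of what such a fallback could look like (this is NOT the shipped HermesToolCallParser; the regex and return shape are illustrative):

```python
import json
import re

# Hypothetical fallback sketch: extract <tool_call> JSON bodies from
# assistant text when the API response carries no structured tool_calls.
TOOL_CALL_RE = re.compile(r"<tool_call>(.*?)</tool_call>", re.DOTALL)

def parse_hermes_tool_calls(text: str) -> list[dict]:
    """Return parsed {'name': ..., 'arguments': ...} dicts from <tool_call> tags."""
    calls = []
    for body in TOOL_CALL_RE.findall(text):
        try:
            calls.append(json.loads(body.strip()))
        except json.JSONDecodeError:
            continue  # malformed JSON inside the tags; skip it
    return calls

content = '<tool_call>{"name": "read_file", "arguments": {"path": "/tmp/test.txt"}}</tool_call>'
print(parse_hermes_tool_calls(content))
# [{'name': 'read_file', 'arguments': {'path': '/tmp/test.txt'}}]
```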

## Performance: System Prompt Bloat

The default Timmy config injects SOUL.md (~4K tokens) + full skills list (~6K tokens) + memory (~2K tokens) into the system prompt. On a 14B model with 65K context, this causes:

- Very slow prompt processing (minutes, not seconds)
- Timeouts on 180s SSH commands
- Sessions that appear stuck

Fix for overnight loops:

- Use `-c 8192` instead of `-c 65536`
- Use `system_prompt_suffix: ""` or a minimal prompt
- Use `-t file` to restrict toolsets (fewer tool schemas in context)
- Keep task prompts short and specific

## Verification

1. Health check: `curl -s http://localhost:8081/health`
2. Slot status: `curl -s http://localhost:8081/slots | python3 -m json.tool`
3. Quick tool test:

   ```bash
   hermes chat -Q -m hermes4:14b -t file \
     -q "Use read_file to read /tmp/test.txt"
   ```

4. Check the session JSON for `"tool_calls"` in assistant messages
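
Step 4 can be scripted; a hedged sketch (the session transcript layout here is assumed, not confirmed from the codebase — adapt the message structure to your setup):

```python
# Hypothetical check over a session transcript (a list of message dicts).
def assistant_tool_calls(messages: list[dict]) -> list[dict]:
    """Collect structured tool_calls from assistant messages."""
    calls = []
    for msg in messages:
        if msg.get("role") == "assistant" and msg.get("tool_calls"):
            calls.extend(msg["tool_calls"])
    return calls

# Example transcript: one narrated turn (bad) and one structured turn (good).
session = [
    {"role": "assistant",
     "content": "<tool_call>...</tool_call>",
     "tool_calls": None},
    {"role": "assistant",
     "content": None,
     "tool_calls": [{"id": "call_0", "function": {"name": "read_file"}}]},
]
print(len(assistant_tool_calls(session)))  # 1 -> at least one structured call
```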

## Proof That --jinja Works

Session `20260329_211125_cb03eb` (cron heartbeat) shows the model successfully calling `session_search` via the tool API. The `--jinja` flag enables tool calling.

## Pitfalls

1. **Without --jinja, tools silently fail.** The model outputs tool-call text, the agent ignores it and returns the text as the response. No error is thrown.
2. **Large context = slow.** 65K context on 14B Q4 with a full system prompt can take 3+ minutes per turn. Use 8192 for tight loops.
3. **-np 4 silently starves slots.** llama-server defaults to `n_parallel = auto` (usually 4). With `-c 8192`, each slot gets only 2048 tokens. Tool schemas + system prompt exceed that, so the request hangs with `n_decoded: 0` and no error. Always use `-np 1` for tool-calling workloads unless you've verified the per-slot context is sufficient.
4. **Custom provider CLI.** `--provider custom` is not a valid CLI arg. Use env vars: `OPENAI_BASE_URL=http://localhost:8081/v1 OPENAI_API_KEY=none hermes chat ...`
5. **System python on macOS is 3.9.** Use the venv python (`~/.hermes/hermes-agent/venv/bin/python3`) for scripts that use 3.10+ syntax like `X | None`.
6. **Cron jobs may inherit cloud config.** If `model=null` in a cron job, it inherits the default. After cutting over to local, verify that all cron jobs have an explicit local model/provider.
7. **Gemini fallback may still be active.** Check `fallback_model` in config.yaml — it may silently route to cloud when the local model times out.
8. **OpenClaw model routing is separate.** Even if the Hermes config points local, OpenClaw may still route Telegram DMs to cloud. Check with `openclaw models status --json`.