---
name: local-timmy-overnight-loop
description: Deploy an unattended overnight loop that runs grounded tasks against local llama-server via Hermes, logging every result with timing. Produces rich capability data for morning review.
version: 1.0.0
author: Ezra
license: MIT
metadata:
  hermes:
    tags: [local-model, llama.cpp, overnight, data-generation, sovereignty, timmy]
    related_skills: [local-llama-tool-calling-debug, wizard-house-remote-triage]
---

# Local Timmy Overnight Loop

## When to Use

- Local Timmy needs to generate capability data overnight
- You want to measure tool-call success rates, response times, and failure modes
- The model is too slow for interactive use but can produce useful data unattended
- Issue #93 (proof test) needs empirical evidence from many runs

## Prerequisites

- llama-server running with the `--jinja` flag (required for tool calls)
- Hermes agent installed at `~/.hermes/hermes-agent/`
- Timmy workspace at `~/.timmy/`
- Model path known (e.g., `/Users/apayne/models/hermes4-14b/NousResearch_Hermes-4-14B-Q4_K_M.gguf`)
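
A quick preflight can confirm these before launch. A minimal sketch; port 8081 comes from the loop's `base_url` below, so adjust if yours differs:

```bash
# Preflight checks (sketch) — paths and port are the ones documented in this skill.
curl -s http://localhost:8081/health    # llama-server reachable?
ls ~/.timmy/                            # Timmy workspace present?
ls ~/.hermes/hermes-agent/              # Hermes agent installed?
```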

## Key Design Decisions

### Strip the system prompt

The default Timmy system prompt is ~12K tokens (SOUL.md + skills list + memory). On a 14B Q4 model, this causes multi-minute prompt processing. The overnight loop uses a minimal prompt (~100 tokens):

```
You are Timmy. You run locally on llama.cpp.
You MUST use the tools provided. Do not narrate tool calls as text.
When asked to read a file, call the read_file tool.
When asked to write a file, call the write_file tool.
When asked to search, call the search_files tool.
Be brief. Do the task. Report what you found.
```

### Skip context files and memory

Pass `skip_context_files=True` and `skip_memory=True` to AIAgent to prevent injecting AGENTS.md, project context, skills, and memory into the prompt.
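
A minimal sketch of what that construction might look like. The two skip flags are the documented ones; the import path, the `system_prompt` parameter, and everything else here are assumptions about the hermes-agent API, not its actual signature:

```python
# Sketch only: skip_context_files / skip_memory are the documented flags;
# the import path and other kwargs are illustrative assumptions.
from hermes_agent import AIAgent  # hypothetical import path

MINIMAL_PROMPT = (
    "You are Timmy. You run locally on llama.cpp.\n"
    "You MUST use the tools provided. Do not narrate tool calls as text.\n"
    "Be brief. Do the task. Report what you found."
)

agent = AIAgent(
    system_prompt=MINIMAL_PROMPT,  # ~100 tokens instead of ~12K
    skip_context_files=True,       # no AGENTS.md / project context / skills
    skip_memory=True,              # no memory injection
)
```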

### Restrict toolsets

Each task specifies only the toolsets it needs (usually just `file`). Fewer tool schemas = less context = faster processing.
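
For example, a task entry might carry its own toolset list. A sketch; the field names are assumptions, not the loop's actual schema:

```python
# Hypothetical task entry — only the "file" toolset's schemas get sent.
task = {
    "task_id": "read-soul",
    "prompt": "Read ~/.timmy/SOUL.md. Quote the first sentence of the Prime Directive.",
    "toolsets": ["file"],  # restrict to what the task actually needs
}
```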

### Reduce context length and use single slot

Start llama-server with `-c 8192 -np 1` instead of `-c 65536`. The `-np 1` is critical: without it, llama-server defaults to 4 parallel slots, splitting 8192 into 2048 per slot. That's not enough for tool schemas plus prompt, and the server silently hangs with `n_decoded: 0`. A single slot gives the full context to the loop's requests.
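
Putting the flags together, a launch command might look like this. A sketch; the model path is the example from Prerequisites and the port matches the loop's `base_url`:

```bash
# --jinja enables tool calls; -c 8192 keeps prompt processing fast;
# -np 1 gives one slot the full 8192 tokens instead of 4 x 2048.
llama-server \
  -m /Users/apayne/models/hermes4-14b/NousResearch_Hermes-4-14B-Q4_K_M.gguf \
  --port 8081 --jinja -c 8192 -np 1
```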

### Use the venv python

macOS system Python is 3.9, which lacks the `X | None` union syntax introduced in Python 3.10. Always use `~/.hermes/hermes-agent/venv/bin/python3`.
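
A quick way to confirm the right interpreter before launching:

```bash
# Confirm the venv interpreter is new enough for the agent's type hints.
~/.hermes/hermes-agent/venv/bin/python3 --version   # expect 3.10+
```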

## Script Location

Deploy to: `~/.timmy/scripts/timmy_overnight_loop.py`

Results in: `~/.timmy/overnight-loop/`

## Output Format

- `overnight_run_YYYYMMDD_HHMMSS.jsonl`: one JSON line per task with the full result
- `overnight_summary_YYYYMMDD_HHMMSS.md`: rolling human-readable summary

Each JSONL entry contains:

```json
{
  "task_id": "read-soul",
  "run": 1,
  "started_at": "...",
  "finished_at": "...",
  "elapsed_seconds": 45.2,
  "status": "pass|empty|error",
  "response": "...",
  "session_id": "...",
  "provider": "custom",
  "base_url": "http://localhost:8081/v1",
  "model": "hermes4:14b",
  "prompt": "...",
  "error": null
}
```
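
To eyeball a run without writing a script, jq can flatten each entry. The glob and directory are the documented ones:

```bash
# Task id, status, and timing from the newest run file (requires jq).
jq -r '[.task_id, .status, (.elapsed_seconds | tostring)] | @tsv' \
  "$(ls -t ~/.timmy/overnight-loop/overnight_run_*.jsonl | head -1)"
```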

## Task Design

Good overnight tasks are:

1. **Single tool call**: read one file, search one pattern
2. **Verifiable**: the expected output is known (file exists, content is deterministic)
3. **Varied**: a mix of read_file, write_file, search_files
4. **Grounded**: require actual file operations, not knowledge recall
5. **Short prompt**: under 100 words

Example tasks:

- "Read ~/.timmy/SOUL.md. Quote the first sentence of the Prime Directive."
- "Search ~/.hermes/bin/ for the string 'chatgpt.com'. Report which files."
- "Write a file to ~/.timmy/overnight-loop/timmy_wrote_this.md with content: ..."
- "Read ~/.hermes/config.yaml. What model is configured as default?"
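
In the loop script, a verifiable task can pair its prompt with a known substring to check for. A sketch; the `expect` field is an assumption, not the script's actual schema:

```python
# Hypothetical task entries — "expect" is a substring a grounded answer
# should contain, making each task checkable in the morning analysis.
TASKS = [
    {
        "task_id": "grep-chatgpt",
        "prompt": "Search ~/.hermes/bin/ for the string 'chatgpt.com'. Report which files.",
        "toolsets": ["file"],
        "expect": "chatgpt.com",
    },
    {
        "task_id": "read-config",
        "prompt": "Read ~/.hermes/config.yaml. What model is configured as default?",
        "toolsets": ["file"],
        "expect": "model",
    },
]
```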

## Starting the Loop

```bash
cd ~/.hermes/hermes-agent
nohup venv/bin/python3 ~/.timmy/scripts/timmy_overnight_loop.py \
  > ~/.timmy/overnight-loop/loop_stdout.log 2>&1 &
echo "PID: $!"
```

## Monitoring

```bash
# Check if running
pgrep -f timmy_overnight_loop

# Live progress
tail -f ~/.timmy/overnight-loop/loop_stdout.log

# Latest summary
cat ~/.timmy/overnight-loop/overnight_summary_*.md | tail -30

# Count completed tasks
wc -l ~/.timmy/overnight-loop/overnight_run_*.jsonl
```

## Stopping

```bash
pkill -f timmy_overnight_loop
```

## Morning Analysis

Key metrics to extract:

1. **Tool call success rate**: did the model actually use tools?
2. **Average response time**: the baseline for performance tuning
3. **Error patterns**: which tasks fail, and why?
4. **Pass/empty ratio**: empty responses mean the model answered but didn't use tools
5. **Time-series trend**: does performance degrade over cycles?

```bash
# Quick stats
python3 - <<'EOF'
import glob
import json

# open() doesn't expand globs, so collect every run file explicitly.
results = []
for path in glob.glob('overnight_run_*.jsonl'):
    with open(path) as f:
        results.extend(json.loads(line) for line in f)

passes = sum(1 for r in results if r['status'] == 'pass')
total = len(results)
avg = sum(r.get('elapsed_seconds', 0) for r in results) / max(total, 1)
print(f'Pass: {passes}/{total} ({100 * passes // max(total, 1)}%)')
print(f'Avg time: {avg:.1f}s')
print(f"Errors: {sum(1 for r in results if r['status'] == 'error')}")
EOF
```

## Pitfalls

1. **Kill stale hermes processes first.** Old stuck sessions compete for llama-server slots. Run `pkill -f "hermes chat"` before starting the loop.
2. **Kill legacy loops too.** gemini-loop.sh, ops-dashboard.sh, and timmy-status.sh may still be running and waste resources: `pkill -f gemini-loop; pkill -f ops-dashboard; pkill -f timmy-status`.
3. **Check llama-server health before starting.** `curl -s http://localhost:8081/health`; if the server is stuck processing a stale request, restart it.
4. **The loop sleeps 30s between cycles.** This prevents hammering the model. Adjust if needed.
5. **Gemini fallback may silently activate.** If `fallback_model` in config.yaml points to Gemini, slow or failed local requests may route to the cloud. Check the config before running.
6. **Security guards block remote process kills.** When running remotely via SSH, `pkill` commands on the Mac may need user approval. Have Alexander run kill commands directly.
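
Items 1-3 combine into a pre-launch sweep, using only the commands listed above (run it on the Mac directly, per item 6):

```bash
# Pre-launch sweep: clear stale sessions and legacy loops, then check health.
pkill -f "hermes chat"
pkill -f gemini-loop; pkill -f ops-dashboard; pkill -f timmy-status
curl -s http://localhost:8081/health
```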