| name | description | version | author | license |
|---|---|---|---|---|
| local-timmy-overnight-loop | Deploy an unattended overnight loop that runs grounded tasks against local llama-server via Hermes, logging every result with timing. Produces rich capability data for morning review. | 1.0.0 | Ezra | MIT |
# Local Timmy Overnight Loop

## When to Use
- Local Timmy needs to generate capability data overnight
- You want to measure tool-call success rates, response times, and failure modes
- The model is too slow for interactive use but can produce useful data unattended
- Issue #93 (proof test) needs empirical evidence from many runs
## Prerequisites

- llama-server running with the `--jinja` flag (required for tool calls)
- Hermes agent installed at `~/.hermes/hermes-agent/`
- Timmy workspace at `~/.timmy/`
- Model path known (e.g., `/Users/apayne/models/hermes4-14b/NousResearch_Hermes-4-14B-Q4_K_M.gguf`)
## Key Design Decisions

### Strip the system prompt
The default Timmy system prompt is ~12K tokens (SOUL.md + skills list + memory). On a 14B Q4 model, this causes multi-minute prompt processing. The overnight loop uses a minimal prompt (~100 tokens):
```
You are Timmy. You run locally on llama.cpp.
You MUST use the tools provided. Do not narrate tool calls as text.
When asked to read a file, call the read_file tool.
When asked to write a file, call the write_file tool.
When asked to search, call the search_files tool.
Be brief. Do the task. Report what you found.
```
### Skip context files and memory

Pass `skip_context_files=True` and `skip_memory=True` to `AIAgent` to prevent injecting `AGENTS.md`, project context, skills, and memory into the prompt.
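A minimal sketch of that construction, assuming `AIAgent` accepts a system prompt and these keyword arguments; the import path and anything beyond the two skip flags are assumptions, not the agent's confirmed API:

```python
# Sketch only: the import path and constructor shape are assumptions;
# skip_context_files / skip_memory are the flags described above.
from hermes_agent import AIAgent  # assumed module name

MINIMAL_PROMPT = (
    "You are Timmy. You run locally on llama.cpp. "
    "You MUST use the tools provided. Do not narrate tool calls as text. "
    "Be brief. Do the task. Report what you found."
)

agent = AIAgent(
    system_prompt=MINIMAL_PROMPT,   # ~100-token prompt instead of the ~12K default
    skip_context_files=True,        # no AGENTS.md, project context, or skills injection
    skip_memory=True,               # no memory injection
)
```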
### Restrict toolsets

Each task specifies only the toolsets it needs (usually just `file`). Fewer tool schemas = less context = faster processing.
### Reduce context length and use single slot

Start llama-server with `-c 8192 -np 1` instead of `-c 65536`. The `-np 1` is critical — without it, llama-server defaults to 4 parallel slots, splitting 8192 into 2048 per slot. That's not enough for tool schemas + prompt, and the server silently hangs with `n_decoded: 0`. A single slot gives the full context to the loop's requests.
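To make the slot arithmetic concrete, this small illustration just divides the context the way the note above describes:

```python
# Illustration of how -c is split across -np parallel slots (per the note above).
ctx_total = 8192
for n_parallel in (1, 4):
    print(f"-np {n_parallel}: {ctx_total // n_parallel} tokens per slot")
# -np 1: 8192 tokens per slot  (full context for the loop's requests)
# -np 4: 2048 tokens per slot  (too small for tool schemas + prompt -> silent hang)
```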
### Use the venv python

macOS system Python is 3.9, which lacks the `X | None` union syntax (added in Python 3.10). Always use `~/.hermes/hermes-agent/venv/bin/python3`.
## Script Location

- Deploy to: `~/.timmy/scripts/timmy_overnight_loop.py`
- Results in: `~/.timmy/overnight-loop/`
## Output Format

- `overnight_run_YYYYMMDD_HHMMSS.jsonl` — one JSON line per task with the full result
- `overnight_summary_YYYYMMDD_HHMMSS.md` — rolling human-readable summary
Each JSONL entry contains:
```json
{
  "task_id": "read-soul",
  "run": 1,
  "started_at": "...",
  "finished_at": "...",
  "elapsed_seconds": 45.2,
  "status": "pass|empty|error",
  "response": "...",
  "session_id": "...",
  "provider": "custom",
  "base_url": "http://localhost:8081/v1",
  "model": "hermes4:14b",
  "prompt": "...",
  "error": null
}
```
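A sketch of how one such record could be assembled and appended; `run_task` is a hypothetical stand-in for the actual call into the Hermes agent, and the pass/empty check is simplified (the real loop flags `empty` when the model answered without using tools):

```python
# Sketch: build and append one JSONL record per task, matching the fields above.
# run_task() is a hypothetical stand-in for the call into the Hermes agent;
# session_id comes from the agent session and is omitted here.
import json
import time
from datetime import datetime, timezone

def record_result(path, task_id, run, prompt, run_task):
    started = datetime.now(timezone.utc)
    t0 = time.monotonic()
    entry = {
        "task_id": task_id,
        "run": run,
        "started_at": started.isoformat(),
        "prompt": prompt,
        "provider": "custom",
        "base_url": "http://localhost:8081/v1",
        "model": "hermes4:14b",
        "error": None,
    }
    try:
        response = run_task(prompt)
        entry["response"] = response
        # Simplified: the real loop marks "empty" when no tools were used.
        entry["status"] = "pass" if response.strip() else "empty"
    except Exception as exc:
        entry["response"] = ""
        entry["status"] = "error"
        entry["error"] = str(exc)
    entry["finished_at"] = datetime.now(timezone.utc).isoformat()
    entry["elapsed_seconds"] = round(time.monotonic() - t0, 1)
    with open(path, "a") as f:
        f.write(json.dumps(entry) + "\n")
```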
## Task Design
Good overnight tasks are:
- Single tool call — read one file, search one pattern
- Verifiable — expected output is known (file exists, content is deterministic)
- Varied — mix of read_file, write_file, search_files
- Grounded — require actual file operations, not knowledge recall
- Short prompt — under 100 words
Example tasks:
- "Read ~/.timmy/SOUL.md. Quote the first sentence of the Prime Directive."
- "Search ~/.hermes/bin/ for the string 'chatgpt.com'. Report which files."
- "Write a file to ~/.timmy/overnight-loop/timmy_wrote_this.md with content: ..."
- "Read ~/.hermes/config.yaml. What model is configured as default?"
## Starting the Loop

```bash
cd ~/.hermes/hermes-agent
nohup venv/bin/python3 ~/.timmy/scripts/timmy_overnight_loop.py \
  > ~/.timmy/overnight-loop/loop_stdout.log 2>&1 &
echo "PID: $!"
```
## Monitoring

```bash
# Check if running
pgrep -f timmy_overnight_loop

# Live progress
tail -f ~/.timmy/overnight-loop/loop_stdout.log

# Latest summary
cat ~/.timmy/overnight-loop/overnight_summary_*.md | tail -30

# Count completed tasks
wc -l ~/.timmy/overnight-loop/overnight_run_*.jsonl
```
## Stopping

```bash
pkill -f timmy_overnight_loop
```
## Morning Analysis
Key metrics to extract:
- Tool call success rate — did the model actually use tools?
- Average response time — baseline for performance tuning
- Error patterns — which tasks fail and why?
- Pass/empty ratio — empty responses mean the model responded but didn't use tools
- Time-series trend — does performance degrade over cycles?
```bash
# Quick stats (run from ~/.timmy/overnight-loop/)
python3 -c "
import json, glob
results = []
for path in glob.glob('overnight_run_*.jsonl'):
    results += [json.loads(line) for line in open(path)]
passes = sum(1 for r in results if r['status'] == 'pass')
total = len(results)
avg = sum(r.get('elapsed_seconds', 0) for r in results) / max(total, 1)
print(f'Pass: {passes}/{total} ({100 * passes // max(total, 1)}%)')
print(f'Avg time: {avg:.1f}s')
print(f'Errors: {sum(1 for r in results if r[\"status\"] == \"error\")}')
"
```
## Pitfalls

- Kill stale hermes processes first. Old stuck sessions compete for llama-server slots. Run `pkill -f "hermes chat"` before starting the loop.
- Also kill legacy loops. `gemini-loop.sh`, `ops-dashboard.sh`, and `timmy-status.sh` may still be running and waste resources: `pkill -f gemini-loop; pkill -f ops-dashboard; pkill -f timmy-status`.
- Check llama-server health before starting: `curl -s http://localhost:8081/health` — if it's processing a stale request, restart it (see the sketch after this list).
- The loop sleeps 30s between cycles. This prevents hammering the model. Adjust if needed.
- Gemini fallback may silently activate. If `fallback_model` in `config.yaml` points to Gemini, slow or failed local requests may route to the cloud. Check the config before running.
- Security guards block remote process kills. If running remotely via SSH, `pkill` commands on the Mac may need user approval. Have Alexander run kill commands directly.
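For the health check, a Python pre-flight sketch (endpoint and port taken from the curl command above) that refuses to start if llama-server does not answer:

```python
# Pre-flight sketch: refuse to start the loop unless llama-server answers /health.
import sys
import urllib.request

try:
    with urllib.request.urlopen("http://localhost:8081/health", timeout=5) as resp:
        print("llama-server health:", resp.status, resp.read().decode().strip())
except Exception as exc:
    sys.exit(f"llama-server not healthy ({exc}); restart it before starting the loop")
```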