Add stuck initiatives audit report
@@ -0,0 +1,160 @@
---
name: local-timmy-overnight-loop
description: Deploy an unattended overnight loop that runs grounded tasks against local llama-server via Hermes, logging every result with timing. Produces rich capability data for morning review.
version: 1.0.0
author: Ezra
license: MIT
metadata:
  hermes:
    tags: [local-model, llama.cpp, overnight, data-generation, sovereignty, timmy]
    related_skills: [local-llama-tool-calling-debug, wizard-house-remote-triage]
---

# Local Timmy Overnight Loop

## When to Use

- Local Timmy needs to generate capability data overnight
- You want to measure tool-call success rates, response times, and failure modes
- The model is too slow for interactive use but can produce useful data unattended
- Issue #93 (proof test) needs empirical evidence from many runs

## Prerequisites

- llama-server running with the `--jinja` flag (required for tool calls)
- Hermes agent installed at `~/.hermes/hermes-agent/`
- Timmy workspace at `~/.timmy/`
- Model path known (e.g., `/Users/apayne/models/hermes4-14b/NousResearch_Hermes-4-14B-Q4_K_M.gguf`)

## Key Design Decisions

### Strip the system prompt

The default Timmy system prompt is ~12K tokens (SOUL.md + skills list + memory). On a 14B Q4 model, this causes multi-minute prompt processing. The overnight loop uses a minimal prompt (~100 tokens):

```
You are Timmy. You run locally on llama.cpp.
You MUST use the tools provided. Do not narrate tool calls as text.
When asked to read a file, call the read_file tool.
When asked to write a file, call the write_file tool.
When asked to search, call the search_files tool.
Be brief. Do the task. Report what you found.
```

### Skip context files and memory

Pass `skip_context_files=True` and `skip_memory=True` to `AIAgent` so that AGENTS.md, project context, skills, and memory are not injected into the prompt.

### Restrict toolsets

Each task specifies only the toolsets it needs (usually just `file`). Fewer tool schemas mean less context to process, which means faster prompt processing.

### Reduce context length and use single slot

Start llama-server with `-c 8192 -np 1` instead of `-c 65536`. The `-np 1` is critical: without it, llama-server defaults to 4 parallel slots, splitting the 8192-token context into 2048 tokens per slot. That is not enough for the tool schemas plus the prompt, and the server silently hangs with `n_decoded: 0`. A single slot gives the full context to the loop's requests.

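The slot arithmetic above can be checked with a few lines of Python. This is only an illustrative sketch: the helper names and the 3000-token request size are assumptions, not values measured from llama-server.

```python
def context_per_slot(total_ctx: int, parallel_slots: int) -> int:
    """Tokens of context each slot gets when the total is split evenly."""
    if parallel_slots < 1:
        raise ValueError("parallel_slots must be at least 1")
    return total_ctx // parallel_slots


def fits(total_ctx: int, parallel_slots: int, needed_tokens: int) -> bool:
    """True if one request of needed_tokens fits inside a single slot."""
    return needed_tokens <= context_per_slot(total_ctx, parallel_slots)


# Hypothetical request size: tool schemas + prompt + room for the reply.
NEEDED = 3000

print(context_per_slot(8192, 4))  # 2048
print(context_per_slot(8192, 1))  # 8192
print(fits(8192, 4, NEEDED))      # False
print(fits(8192, 1, NEEDED))      # True
```

With four slots a request of that size cannot fit, which is consistent with the hang described above; with one slot it has room to spare.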
### Use the venv python

The macOS system Python is 3.9, which does not support the `X | None` union syntax (added in Python 3.10). Always use `~/.hermes/hermes-agent/venv/bin/python3`.

## Script Location

Deploy to: `~/.timmy/scripts/timmy_overnight_loop.py`
Results in: `~/.timmy/overnight-loop/`

## Output Format

- `overnight_run_YYYYMMDD_HHMMSS.jsonl`: one JSON line per task with the full result
- `overnight_summary_YYYYMMDD_HHMMSS.md`: rolling human-readable summary, rewritten after every cycle

Each JSONL entry contains:
```json
{
  "task_id": "read-soul",
  "run": 1,
  "started_at": "...",
  "finished_at": "...",
  "elapsed_seconds": 45.2,
  "status": "pass|empty|error",
  "response": "...",
  "session_id": "...",
  "provider": "custom",
  "base_url": "http://localhost:8081/v1",
  "model": "hermes4:14b",
  "prompt": "...",
  "error": null
}
```

The script also records `toolsets`, `tool_calls_made`, and truncated `stdout`/`stderr` on completed runs. Failed runs carry `error` and `traceback` and have no `response`.

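Before analysing a run it can help to confirm that each line parses and carries the fields the analysis relies on. The following is a minimal sketch; `REQUIRED_FIELDS` and `parse_entry` are illustrative names, and the field set is an assumption based on the example entry above.

```python
import json

# Fields every entry is expected to have, whether the task passed or failed.
REQUIRED_FIELDS = {"task_id", "run", "started_at", "finished_at",
                   "elapsed_seconds", "status"}
VALID_STATUSES = {"pass", "empty", "error"}


def parse_entry(line: str) -> dict:
    """Parse one JSONL line and check the fields used by later analysis."""
    entry = json.loads(line)
    missing = REQUIRED_FIELDS - entry.keys()
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    if entry["status"] not in VALID_STATUSES:
        raise ValueError(f"unexpected status: {entry['status']!r}")
    return entry


sample = ('{"task_id": "read-soul", "run": 1, "started_at": "...", '
          '"finished_at": "...", "elapsed_seconds": 45.2, "status": "pass"}')
print(parse_entry(sample)["status"])  # pass
```

Raising on a malformed line is a deliberate choice here: a truncated final line (for example, if the loop was killed mid-write) is worth noticing rather than silently skipping.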
## Task Design

Good overnight tasks are:
1. **Single tool call**: read one file, search one pattern
2. **Verifiable**: expected output is known (file exists, content is deterministic)
3. **Varied**: a mix of read_file, write_file, and search_files
4. **Grounded**: require actual file operations, not knowledge recall
5. **Short prompt**: under 100 words

Example tasks:
- "Read ~/.timmy/SOUL.md. Quote the first sentence of the Prime Directive."
- "Search ~/.hermes/bin/ for the string 'chatgpt.com'. Report which files."
- "Write a file to ~/.timmy/overnight-loop/timmy_wrote_this.md with content: ..."
- "Read ~/.hermes/config.yaml. What model is configured as default?"

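Some of these criteria can be checked mechanically before a task goes into the loop. The sketch below assumes tasks are dicts with `id`, `toolsets`, and `prompt` keys; `check_task` and `MAX_PROMPT_WORDS` are hypothetical names introduced for illustration.

```python
MAX_PROMPT_WORDS = 100  # "under 100 words" from the criteria above


def check_task(task: dict) -> list:
    """Return a list of problems found in a task definition (empty if fine)."""
    problems = []
    for key in ("id", "toolsets", "prompt"):
        if not task.get(key):
            problems.append(f"missing {key}")
    words = len(task.get("prompt", "").split())
    if words >= MAX_PROMPT_WORDS:
        problems.append(
            f"prompt has {words} words, limit is under {MAX_PROMPT_WORDS}"
        )
    return problems


task = {
    "id": "read-soul",
    "toolsets": "file",
    "prompt": "Read ~/.timmy/SOUL.md. Quote the first sentence of the Prime Directive.",
}
print(check_task(task))  # []
```

This only covers the structural criteria. Whether a task is verifiable or grounded still needs a human judgment about the expected output.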
## Starting the Loop

```bash
cd ~/.hermes/hermes-agent
nohup venv/bin/python3 ~/.timmy/scripts/timmy_overnight_loop.py \
  > ~/.timmy/overnight-loop/loop_stdout.log 2>&1 &
echo "PID: $!"
```

## Monitoring

```bash
# Check if running
pgrep -f timmy_overnight_loop

# Live progress
tail -f ~/.timmy/overnight-loop/loop_stdout.log

# Latest summary
cat ~/.timmy/overnight-loop/overnight_summary_*.md | tail -30

# Count completed tasks
wc -l ~/.timmy/overnight-loop/overnight_run_*.jsonl
```

## Stopping

```bash
pkill -f timmy_overnight_loop
```

## Morning Analysis

Key metrics to extract:
1. **Tool call success rate**: did the model actually use tools? (see `tool_calls_made`)
2. **Average response time**: baseline for performance tuning
3. **Error patterns**: which tasks fail and why?
4. **Pass/empty ratio**: `empty` means the run finished without a final response; `pass` only means a response came back, so check `tool_calls_made` to confirm tools were used
5. **Time-series trend**: does performance degrade over cycles?

```bash
# Quick stats (open() does not expand wildcards, so glob the run files first)
python3 -c "
import glob, json, os
paths = sorted(glob.glob(os.path.expanduser('~/.timmy/overnight-loop/overnight_run_*.jsonl')))
results = [json.loads(l) for p in paths for l in open(p) if l.strip()]
passes = sum(1 for r in results if r['status'] == 'pass')
total = len(results)
avg = sum(r.get('elapsed_seconds',0) for r in results) / max(total,1)
print(f'Pass: {passes}/{total} ({100*passes//max(total,1)}%)')
print(f'Avg time: {avg:.1f}s')
print(f'Errors: {sum(1 for r in results if r[\"status\"]==\"error\")}')
"
```

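The time-series trend and per-task breakdown from the metrics list can be computed from the same records. This is a minimal sketch that assumes the entries have already been loaded into a list of dicts; the function names and the sample records are made up for illustration.

```python
from collections import defaultdict


def per_cycle_average(results):
    """Return {cycle: mean elapsed_seconds}, ordered by cycle number."""
    buckets = defaultdict(list)
    for r in results:
        if "elapsed_seconds" in r:
            buckets[r["run"]].append(r["elapsed_seconds"])
    return {cycle: sum(v) / len(v) for cycle, v in sorted(buckets.items())}


def per_task_pass_rate(results):
    """Return {task_id: fraction of runs whose status is 'pass'}."""
    totals = defaultdict(int)
    passes = defaultdict(int)
    for r in results:
        totals[r["task_id"]] += 1
        if r["status"] == "pass":
            passes[r["task_id"]] += 1
    return {task: passes[task] / totals[task] for task in totals}


# Made-up sample records, only to show the shape of the output.
sample = [
    {"task_id": "read-soul", "run": 1, "status": "pass", "elapsed_seconds": 40.0},
    {"task_id": "read-config", "run": 1, "status": "empty", "elapsed_seconds": 60.0},
    {"task_id": "read-soul", "run": 2, "status": "pass", "elapsed_seconds": 50.0},
    {"task_id": "read-config", "run": 2, "status": "pass", "elapsed_seconds": 70.0},
]
print(per_cycle_average(sample))   # {1: 50.0, 2: 60.0}
print(per_task_pass_rate(sample))  # {'read-soul': 1.0, 'read-config': 0.5}
```

A per-cycle average that climbs steadily through the night would suggest degradation worth investigating. A task with a low pass rate points at a prompt or tool schema the model handles poorly.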
## Pitfalls

1. **Kill stale hermes processes first.** Old stuck sessions compete for llama-server slots. Run `pkill -f "hermes chat"` before starting the loop.
2. **Also kill legacy loops.** gemini-loop.sh, ops-dashboard.sh, and timmy-status.sh may still be running and wasting resources. Run `pkill -f gemini-loop; pkill -f ops-dashboard; pkill -f timmy-status`.
3. **Check llama-server health before starting.** Run `curl -s http://localhost:8081/health`. If the server is still processing a stale request, restart it.
4. **The loop sleeps 30s between cycles.** This prevents hammering the model. Adjust if needed.
5. **Gemini fallback may silently activate.** If `fallback_model` in config.yaml points to Gemini, slow or failed local requests may route to the cloud. Check the config before running.
6. **Security guards block remote process kills.** If running remotely via SSH, `pkill` commands on the Mac may need user approval. Have Alexander run kill commands directly.

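The health check in pitfall 3 can also be scripted as a preflight step. The sketch below assumes the `/health` endpoint answers HTTP 200 with a JSON body containing `"status": "ok"` when the server is ready; that contract, and the helper names, are assumptions to verify against the llama-server build in use.

```python
import json
import urllib.error
import urllib.request

HEALTH_URL = "http://localhost:8081/health"  # port taken from the pitfall above


def interpret_health(status_code: int, body: str) -> bool:
    """Assumed contract: HTTP 200 and a JSON object whose "status" is "ok"."""
    if status_code != 200:
        return False
    try:
        return json.loads(body).get("status") == "ok"
    except (json.JSONDecodeError, AttributeError):
        # Body was not JSON, or was JSON but not an object.
        return False


def server_is_healthy(url: str = HEALTH_URL, timeout: float = 5.0) -> bool:
    """Fetch the health endpoint; any connection problem counts as unhealthy."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return interpret_health(resp.status, resp.read().decode("utf-8"))
    except (urllib.error.URLError, OSError):
        return False
```

Calling `server_is_healthy()` before starting the loop gives a simple go/no-go signal. It does not detect a server that reports healthy while stuck on a stale request, so the manual check above still applies.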
@@ -0,0 +1,235 @@
#!/usr/bin/env python3
"""Timmy overnight tightening loop.

Runs a series of small, grounded tasks against local llama-server via Hermes.
Each task is deliberately simple: one or two tool calls max.
Logs every result with timing data.

Goal: rich data about local Timmy's tool-use capability by morning.

Deploy to: ~/.timmy/scripts/timmy_overnight_loop.py
Start with: cd ~/.hermes/hermes-agent && nohup venv/bin/python3 ~/.timmy/scripts/timmy_overnight_loop.py > ~/.timmy/overnight-loop/loop_stdout.log 2>&1 &
"""

import io
import json
import os
import sys
import time
import traceback
from contextlib import redirect_stderr, redirect_stdout
from datetime import datetime
from pathlib import Path

# ── Config ──────────────────────────────────────────────────────────
AGENT_DIR = Path.home() / ".hermes" / "hermes-agent"
RESULTS_DIR = Path.home() / ".timmy" / "overnight-loop"
RESULTS_DIR.mkdir(parents=True, exist_ok=True)

SYSTEM_PROMPT = """You are Timmy. You run locally on llama.cpp.
You MUST use the tools provided. Do not narrate tool calls as text.
When asked to read a file, call the read_file tool.
When asked to write a file, call the write_file tool.
When asked to search, call the search_files tool.
Be brief. Do the task. Report what you found."""

MAX_TURNS_PER_TASK = 5
TASK_TIMEOUT = 120  # seconds; NOTE: defined here but not enforced anywhere below

# ── Tasks ───────────────────────────────────────────────────────────
TASKS = [
    {
        "id": "read-soul",
        "toolsets": "file",
        "prompt": "Read the file ~/.timmy/SOUL.md. Quote the first sentence of the Prime Directive section.",
    },
    {
        "id": "read-operations",
        "toolsets": "file",
        "prompt": "Read the file ~/.timmy/OPERATIONS.md. How many sections does it have? List their headings.",
    },
    {
        "id": "read-decisions",
        "toolsets": "file",
        "prompt": "Read the file ~/.timmy/decisions.md. What is the most recent decision entry? Quote its date and title.",
    },
    {
        "id": "read-config",
        "toolsets": "file",
        "prompt": "Read the file ~/.hermes/config.yaml. What model and provider are configured as default?",
    },
    {
        "id": "write-observation",
        "toolsets": "file",
        "prompt": "Write a file to {results_dir}/timmy_wrote_this.md with exactly this content:\n# Timmy was here\nTimestamp: {{timestamp}}\nI wrote this file using the write_file tool.\nSovereignty and service always.".format(results_dir=RESULTS_DIR),
    },
    {
        "id": "search-cloud-markers",
        "toolsets": "file",
        "prompt": "Search files in ~/.hermes/bin/ for the string 'chatgpt.com'. Report which files contain it and on which lines.",
    },
    {
        "id": "search-soul-keyword",
        "toolsets": "file",
        "prompt": "Search ~/.timmy/SOUL.md for the word 'sovereignty'. How many times does it appear?",
    },
    {
        "id": "list-bin-scripts",
        "toolsets": "file",
        "prompt": "Search for files matching *.sh in ~/.hermes/bin/. List the first 10 filenames.",
    },
    {
        "id": "read-and-summarize",
        "toolsets": "file",
        "prompt": "Read ~/.timmy/SOUL.md. In exactly one sentence, what is Timmy's position on honesty?",
    },
    {
        "id": "multi-read",
        "toolsets": "file",
        "prompt": "Read both ~/.timmy/SOUL.md and ~/.hermes/config.yaml. Does the config honor the soul's requirement to not phone home? Answer yes or no with one sentence of evidence.",
    },
]


def run_task(task, run_number):
    """Run a single task and return result dict."""
    task_id = task["id"]
    # Fill in the {timestamp} placeholder left by the write-observation task.
    prompt = task["prompt"].replace("{timestamp}", datetime.now().isoformat())
    toolsets = task["toolsets"]

    result = {
        "task_id": task_id,
        "run": run_number,
        "started_at": datetime.now().isoformat(),
        "prompt": prompt,
        "toolsets": toolsets,
    }

    # Make the Hermes agent importable; add the path only once per process.
    if str(AGENT_DIR) not in sys.path:
        sys.path.insert(0, str(AGENT_DIR))
    start = time.time()
    try:
        # Imported inside the try so an import failure is logged as an error result.
        from hermes_cli.runtime_provider import resolve_runtime_provider
        from run_agent import AIAgent

        runtime = resolve_runtime_provider()

        buf_out = io.StringIO()
        buf_err = io.StringIO()

        agent = AIAgent(
            model=runtime.get("model", "hermes4:14b"),
            api_key=runtime.get("api_key"),
            base_url=runtime.get("base_url"),
            provider=runtime.get("provider"),
            api_mode=runtime.get("api_mode"),
            max_iterations=MAX_TURNS_PER_TASK,
            quiet_mode=True,
            ephemeral_system_prompt=SYSTEM_PROMPT,
            skip_context_files=True,
            skip_memory=True,
            enabled_toolsets=[toolsets] if toolsets else None,
        )

        # Capture agent console output so it does not flood the loop log.
        with redirect_stdout(buf_out), redirect_stderr(buf_err):
            conv_result = agent.run_conversation(prompt, sync_honcho=False)
        elapsed = time.time() - start

        # final_response may be missing or None; treat both as an empty string.
        final_response = conv_result.get("final_response") or ""

        result["elapsed_seconds"] = round(elapsed, 2)
        result["response"] = final_response[:2000]
        result["session_id"] = getattr(agent, "session_id", None)
        result["provider"] = runtime.get("provider")
        result["base_url"] = runtime.get("base_url")
        result["model"] = runtime.get("model")
        result["tool_calls_made"] = conv_result.get("tool_calls_count", 0)
        result["status"] = "pass" if final_response else "empty"
        result["stdout"] = buf_out.getvalue()[:500]
        result["stderr"] = buf_err.getvalue()[:500]

    except Exception as exc:
        elapsed = time.time() - start
        result["elapsed_seconds"] = round(elapsed, 2)
        result["status"] = "error"
        result["error"] = str(exc)
        result["traceback"] = traceback.format_exc()[-1000:]

    result["finished_at"] = datetime.now().isoformat()
    return result


def main():
    run_id = datetime.now().strftime("%Y%m%d_%H%M%S")
    log_path = RESULTS_DIR / f"overnight_run_{run_id}.jsonl"
    summary_path = RESULTS_DIR / f"overnight_summary_{run_id}.md"

    print(f"=== Timmy Overnight Loop ===")
    print(f"Run ID: {run_id}")
    print(f"Tasks: {len(TASKS)}")
    print(f"Log: {log_path}")
    print(f"Max turns per task: {MAX_TURNS_PER_TASK}")
    print()

    results = []
    cycle = 0

    while True:
        cycle += 1
        print(f"--- Cycle {cycle} ({datetime.now().strftime('%H:%M:%S')}) ---")

        for task in TASKS:
            task_id = task["id"]
            print(f"  [{task_id}] ", end="", flush=True)

            result = run_task(task, cycle)
            results.append(result)

            # Append immediately so results survive if the loop is killed.
            with open(log_path, "a") as f:
                f.write(json.dumps(result) + "\n")

            status = result["status"]
            elapsed = result.get("elapsed_seconds", "?")
            print(f"{status} ({elapsed}s)")

            time.sleep(2)

        # Write summary
        passes = sum(1 for r in results if r["status"] == "pass")
        errors = sum(1 for r in results if r["status"] == "error")
        empties = sum(1 for r in results if r["status"] == "empty")
        total = len(results)
        avg_time = sum(r.get("elapsed_seconds", 0) for r in results) / max(total, 1)

        summary = f"""# Timmy Overnight Loop — Summary
Run ID: {run_id}
Generated: {datetime.now().isoformat()}
Cycles completed: {cycle}
Total tasks run: {total}

## Aggregate
- Pass: {passes}/{total} ({100*passes//max(total,1)}%)
- Empty: {empties}/{total}
- Error: {errors}/{total}
- Avg response time: {avg_time:.1f}s

## Per-task results (latest cycle)
"""
        cycle_results = [r for r in results if r["run"] == cycle]
        for r in cycle_results:
            resp_preview = r.get("response", "")[:100].replace("\n", " ")
            summary += f"- **{r['task_id']}**: {r['status']} ({r.get('elapsed_seconds','?')}s) — {resp_preview}\n"

        summary += f"\n## Error details\n"
        for r in results:
            if r["status"] == "error":
                summary += f"- {r['task_id']} (cycle {r['run']}): {r.get('error','?')}\n"

        with open(summary_path, "w") as f:
            f.write(summary)

        print(f"\n  Cycle {cycle} done. Pass={passes} Error={errors} Empty={empties} Avg={avg_time:.1f}s")
        print(f"  Summary: {summary_path}")
        print(f"  Sleeping 30s before next cycle...\n")
        time.sleep(30)


if __name__ == "__main__":
    main()