# Local Model Integration Sketch v2

# Hermes4-14B in the Heartbeat Loop — No New Telemetry

## Principle

No new inference layer. Huey tasks call `hermes chat -q` pointed at Ollama. Hermes handles sessions, token tracking, cost logging. The dashboard reads what Hermes already stores.

---
## Why Not Ollama Directly?

Ollama is fine as a serving backend. The issue isn't Ollama — it's that calling Ollama directly with urllib bypasses the harness. The harness already tracks sessions, tokens, model/provider, and platform. Building a second telemetry layer is owning code we don't need.

Ollama as a named provider isn't wired into the `--provider` flag yet, but routing works via env vars:
```bash
HERMES_MODEL="hermes4:14b" \
HERMES_PROVIDER="custom" \
HERMES_BASE_URL="http://localhost:11434/v1" \
hermes chat -q "prompt here" -Q
```
This creates a tracked session, logs tokens, and returns the response. That's our local inference call.
### Alternatives to Ollama for serving

- **llama.cpp server** — lighter, no Python, raw HTTP. Good for single-model serving. Less convenient for model switching.
- **vLLM** — best throughput, but needs an NVIDIA GPU. Not for the M3 Mac.
- **MLX serving** — native Apple Silicon, but no OpenAI-compatible API yet. MLX is for training, not serving (our current policy).
- **llamafile** — single binary, portable. Good for distribution.

Verdict: Ollama is fine. It's the standard OpenAI-compatible local server on the Mac. The issue was never Ollama — it was bypassing the harness.

---
## 1. The Call Pattern

One function in `tasks.py` that all Huey tasks use:
```python
import json
import os
import subprocess

HERMES_BIN = "hermes"
LOCAL_ENV = {
    "HERMES_MODEL": "hermes4:14b",
    "HERMES_PROVIDER": "custom",
    "HERMES_BASE_URL": "http://localhost:11434/v1",
}


def hermes_local(prompt, caller_tag=None, max_retries=2):
    """Call hermes with the local Ollama model. Returns response text, or None.

    Every call creates a hermes session with full telemetry.
    caller_tag gets prepended to the prompt for searchability.
    """
    env = os.environ.copy()
    env.update(LOCAL_ENV)

    tagged_prompt = f"[{caller_tag}] {prompt}" if caller_tag else prompt

    for attempt in range(max_retries + 1):
        try:
            result = subprocess.run(
                [HERMES_BIN, "chat", "-q", tagged_prompt, "-Q", "-t", "none"],
                capture_output=True, text=True,
                timeout=120, env=env,
            )
            if result.returncode == 0 and result.stdout.strip():
                # Strip the session_id line from -Q output
                lines = result.stdout.strip().split("\n")
                response_lines = [l for l in lines if not l.startswith("session_id:")]
                return "\n".join(response_lines).strip()
        except subprocess.TimeoutExpired:
            if attempt == max_retries:
                return None
            continue
    return None
```
Notes:

- `-t none` disables all toolsets — the heartbeat model shouldn't have terminal/file access. Pure reasoning only.
- `-Q` quiet mode suppresses banner/spinner, gives clean output.
- Every call creates a session in the Hermes session store. Searchable, exportable, countable.
- The `[caller_tag]` prefix lets you filter sessions by which Huey task generated them: `hermes sessions list | grep heartbeat`

---
## 2. Heartbeat DECIDE Phase

Replace the hardcoded if/else with a model call:
```python
# In heartbeat_tick(), replace the DECIDE + ACT section:

# DECIDE: let hermes4:14b reason about what to do
decide_prompt = f"""System state at {now.isoformat()}:

{json.dumps(perception, indent=2)}

Previous tick: {last_tick.get('tick_id', 'none')}

You are the heartbeat monitor. Based on this state:
1. List any actions needed (alerts, restarts, escalations). Empty if all OK.
2. Rate severity: ok, warning, or critical.
3. One sentence of reasoning.

Respond ONLY with JSON:
{{"actions": [], "severity": "ok", "reasoning": "..."}}"""

decision = None
try:
    raw = hermes_local(decide_prompt, caller_tag="heartbeat_tick")
    if raw:
        # The model might wrap the JSON in markdown, so look for the
        # first line that starts a JSON object before trying to parse
        # the whole response.
        for line in raw.split("\n"):
            line = line.strip()
            if line.startswith("{"):
                decision = json.loads(line)
                break
        if not decision:
            decision = json.loads(raw)
except Exception:
    # json.JSONDecodeError, subprocess failures, anything else — fall back
    decision = None

# Fallback to hardcoded logic if the model fails or is down
if decision is None:
    actions = []
    if not perception.get("gitea_alive"):
        actions.append("ALERT: Gitea unreachable")
    health = perception.get("model_health", {})
    if isinstance(health, dict) and not health.get("ollama_running"):
        actions.append("ALERT: Ollama not running")
    decision = {
        "actions": actions,
        "severity": "fallback",
        "reasoning": "model unavailable, used hardcoded checks",
    }

tick_record["decision"] = decision
actions = decision.get("actions", [])
```
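The line-by-line scan above only catches JSON that the model emits on a single line. A slightly more tolerant extractor could be swapped in — a sketch, assuming the first `{...}` span in the response is the payload (the function name `extract_json` is illustrative, not an existing helper):

```python
import json
import re

def extract_json(raw):
    """Pull the first JSON object out of a model response.

    Tolerates markdown fences and surrounding prose; returns None
    when nothing parses.
    """
    # Grab from the first '{' to the last '}' — good enough for a
    # single-object response, even when wrapped in a ```json fence.
    match = re.search(r"\{.*\}", raw, re.DOTALL)
    if not match:
        return None
    try:
        return json.loads(match.group(0))
    except json.JSONDecodeError:
        return None

raw = 'Sure:\n```json\n{"actions": [], "severity": "ok", "reasoning": "all healthy"}\n```'
print(extract_json(raw)["severity"])  # → ok
```

This would replace both the line scan and the whole-response fallback in the try block above.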
---
## 3. DPO Candidate Collection

No new database. Hermes sessions ARE the DPO candidates.

Every `hermes_local()` call creates a session. To extract DPO pairs:
```bash
# Export all local-model sessions
hermes sessions export --output /tmp/local-sessions.jsonl

# Filter for heartbeat decisions
grep "heartbeat_tick" /tmp/local-sessions.jsonl > heartbeat_decisions.jsonl
```
The existing `session_export` Huey task (runs every 4h) already extracts user→assistant pairs. It just needs to be aware that some sessions are now local-model decisions instead of human conversations.

For DPO annotation, add a simple review script:
```python
# review_decisions.py — reads heartbeat tick logs, shows model decisions,
# asks Alexander to mark chosen/rejected.
# Writes annotations back to the tick log files.

import json
from pathlib import Path

TICK_DIR = Path.home() / ".timmy" / "heartbeat"

for log_file in sorted(TICK_DIR.glob("ticks_*.jsonl")):
    for line in log_file.read_text().strip().split("\n"):
        tick = json.loads(line)
        decision = tick.get("decision", {})
        if decision.get("severity") == "fallback":
            continue  # skip fallback entries

        print(f"\n--- Tick {tick['tick_id']} ---")
        print(f"Perception: {json.dumps(tick['perception'], indent=2)}")
        print(f"Decision: {json.dumps(decision, indent=2)}")

        rating = input("Rate (c=chosen, r=rejected, s=skip): ").strip()
        if rating in ("c", "r"):
            tick["dpo_label"] = "chosen" if rating == "c" else "rejected"
            # write back... (append to annotated file)
```
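Once ticks carry `dpo_label` annotations, turning them into training records is mechanical. A minimal sketch — the output field names (`prompt`/`response`/`label`) are assumptions and should match whatever the pipeline in timmy-config/training/ expects:

```python
import json

def ticks_to_dpo(ticks):
    """Map annotated tick records to labeled DPO-style examples.

    Only ticks that went through review (i.e. have a dpo_label)
    become training data; everything else is skipped.
    """
    pairs = []
    for tick in ticks:
        if "dpo_label" not in tick:
            continue
        pairs.append({
            "prompt": json.dumps(tick["perception"]),   # what the model saw
            "response": json.dumps(tick["decision"]),   # what it decided
            "label": tick["dpo_label"],                 # "chosen" or "rejected"
        })
    return pairs

# Example: one annotated tick, one unlabeled
sample = [
    {"perception": {"gitea_alive": True}, "decision": {"severity": "ok"}, "dpo_label": "chosen"},
    {"perception": {"gitea_alive": False}, "decision": {"severity": "ok"}},
]
print(len(ticks_to_dpo(sample)))  # → 1
```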
---

## 4. Dashboard — Reads Hermes Data
```python
#!/usr/bin/env python3
"""Timmy Model Dashboard — reads from Hermes, owns nothing."""

import json
import subprocess
import sys
import time
import urllib.request
from datetime import datetime
from pathlib import Path

HERMES_HOME = Path.home() / ".hermes"
TIMMY_HOME = Path.home() / ".timmy"


def get_ollama_models():
    """What's available in Ollama."""
    try:
        req = urllib.request.Request("http://localhost:11434/api/tags")
        with urllib.request.urlopen(req, timeout=5) as resp:
            return json.loads(resp.read()).get("models", [])
    except Exception:
        return []


def get_loaded_models():
    """What's actually in VRAM right now."""
    try:
        req = urllib.request.Request("http://localhost:11434/api/ps")
        with urllib.request.urlopen(req, timeout=5) as resp:
            return json.loads(resp.read()).get("models", [])
    except Exception:
        return []


def get_huey_status():
    try:
        r = subprocess.run(["pgrep", "-f", "huey_consumer"],
                           capture_output=True, timeout=5)
        return r.returncode == 0
    except Exception:
        return False


def get_hermes_sessions(hours=24):
    """Read session metadata from the Hermes session store."""
    sessions_file = HERMES_HOME / "sessions" / "sessions.json"
    if not sessions_file.exists():
        return []
    try:
        data = json.loads(sessions_file.read_text())
        return list(data.values())
    except Exception:
        return []


def get_heartbeat_ticks(date_str=None):
    """Read today's heartbeat ticks."""
    if not date_str:
        date_str = datetime.now().strftime("%Y%m%d")
    tick_file = TIMMY_HOME / "heartbeat" / f"ticks_{date_str}.jsonl"
    if not tick_file.exists():
        return []
    ticks = []
    for line in tick_file.read_text().strip().split("\n"):
        try:
            ticks.append(json.loads(line))
        except Exception:
            continue
    return ticks


def render(hours=24):
    models = get_ollama_models()
    loaded = get_loaded_models()
    huey = get_huey_status()
    sessions = get_hermes_sessions(hours)
    ticks = get_heartbeat_ticks()

    loaded_names = {m.get("name", "") for m in loaded}

    print("\033[2J\033[H")
    print("=" * 70)
    print(" TIMMY MODEL DASHBOARD")
    now = datetime.now().strftime("%Y-%m-%d %H:%M:%S")
    print(f" {now} | Huey: {'UP' if huey else 'DOWN'} | Ollama models: {len(models)}")
    print("=" * 70)

    # LOCAL MODELS
    print("\n LOCAL MODELS")
    print(" " + "-" * 55)
    for m in models:
        name = m.get("name", "?")
        size_gb = m.get("size", 0) / 1e9
        status = "IN VRAM" if name in loaded_names else "on disk"
        print(f" {name:35s} {size_gb:5.1f}GB {status}")
    if not models:
        print(" (Ollama not responding)")

    # HERMES SESSION ACTIVITY — count sessions by origin/platform
    print("\n HERMES SESSIONS (recent)")
    print(" " + "-" * 55)
    local_sessions = [s for s in sessions
                      if "localhost" in str(s.get("origin", {}))]
    cli_sessions = [s for s in sessions
                    if s.get("platform") == "cli" or s.get("origin", {}).get("platform") == "cli"]

    total_tokens = sum(s.get("total_tokens", 0) for s in sessions)
    print(f" Total sessions: {len(sessions)}")
    print(f" CLI sessions: {len(cli_sessions)}")
    print(f" Local-model sessions: {len(local_sessions)}")
    print(f" Total tokens: {total_tokens:,}")

    # HEARTBEAT STATUS
    print(f"\n HEARTBEAT ({len(ticks)} ticks today)")
    print(" " + "-" * 55)
    if ticks:
        last = ticks[-1]
        decision = last.get("decision", {})
        severity = decision.get("severity", "unknown")
        reasoning = decision.get("reasoning", "no model decision yet")
        print(f" Last tick: {last.get('tick_id', '?')}")
        print(f" Severity: {severity}")
        print(f" Reasoning: {reasoning[:60]}")

        # Count model vs fallback decisions
        model_decisions = sum(1 for t in ticks
                              if t.get("decision", {}).get("severity") != "fallback")
        fallback = len(ticks) - model_decisions
        print(f" Model decisions: {model_decisions} | Fallback: {fallback}")

        # DPO labels if any
        labeled = sum(1 for t in ticks if "dpo_label" in t)
        if labeled:
            chosen = sum(1 for t in ticks if t.get("dpo_label") == "chosen")
            rejected = sum(1 for t in ticks if t.get("dpo_label") == "rejected")
            print(f" DPO labeled: {labeled} (chosen: {chosen}, rejected: {rejected})")
    else:
        print(" (no ticks today)")

    # ACTIVE LOOPS
    print("\n ACTIVE LOOPS USING LOCAL MODELS")
    print(" " + "-" * 55)
    print(" heartbeat_tick   10m    hermes4:14b    DECIDE phase")
    print(" (future)         15m    hermes4:14b    issue triage")
    print(" (future)         daily  timmy:v0.1     morning report")

    print("\n NON-LOCAL LOOPS (Gemini/Grok API)")
    print(" " + "-" * 55)
    print(" gemini_worker    20m    gemini-2.5-pro   aider")
    print(" grok_worker      20m    grok-3-fast      opencode")
    print(" cross_review     30m    both             PR review")

    print("\n" + "=" * 70)


if __name__ == "__main__":
    watch = "--watch" in sys.argv
    hours = 24
    for a in sys.argv[1:]:
        if a.startswith("--hours="):
            hours = int(a.split("=")[1])
    if watch:
        while True:
            render(hours)
            time.sleep(30)
    else:
        render(hours)
```
---

## 5. Implementation Steps

### Step 1: Add hermes_local() to tasks.py

- One function, ~20 lines
- Calls `hermes chat -q` with Ollama env vars
- All telemetry comes from Hermes for free

### Step 2: Wire heartbeat_tick DECIDE phase

- Replace 6 lines of if/else with a hermes_local() call
- Keep the hardcoded fallback for when the model is down
- Decision stored in the tick record for DPO review

### Step 3: Fix the MCP server warning

- The orchestration MCP server path is broken — harmless but noisy
- Either fix the path or remove it from config

### Step 4: Drop model_dashboard.py in timmy-config/bin/

- Reads the Ollama API, Hermes sessions, heartbeat ticks
- No new data stores — just views over existing ones
- `python3 model_dashboard.py --watch` for a live view

### Step 5: Expand to more Huey tasks

- triage_issues: model reads the issue, picks an agent
- good_morning_report: model writes the "From Timmy" section
- Each expansion is just calling hermes_local() with a different prompt
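The Step 5 expansions all reduce to prompt construction plus the same parse-and-fallback pattern as heartbeat_tick. A sketch of the triage prompt builder — the `build_triage_prompt` name and the issue dict shape (`number`/`title`/`body`) are assumptions, to be matched against whatever the Gitea client actually returns:

```python
def build_triage_prompt(issue):
    """Construct the triage prompt for a hermes_local() call.

    The issue dict shape is a hypothetical stand-in for the Gitea
    issue payload.
    """
    return (
        f"Issue #{issue['number']}: {issue['title']}\n\n"
        f"{issue['body']}\n\n"
        "Which agent should handle this: gemini_worker or grok_worker?\n"
        'Respond ONLY with JSON: {"agent": "...", "reasoning": "..."}'
    )

# Inside the Huey task this becomes one call per open issue:
#   raw = hermes_local(build_triage_prompt(issue), caller_tag="triage_issues")
#   ...then parse the JSON and fall back to a default agent on failure.
issue = {"number": 42, "title": "Dashboard crashes", "body": "Traceback ..."}
print(build_triage_prompt(issue).splitlines()[0])  # → Issue #42: Dashboard crashes
```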
---

## What Gets Hotfixed in Hermes Config

If `hermes insights` is broken (the `cache_read_tokens` column error), that needs a fix. The dashboard falls back to reading sessions.json directly, but insights would be the better data source.

The `providers.ollama` section in config.yaml exists but isn't wired to the `--provider` flag. Filing this upstream or patching locally would let us do `hermes chat -q "..." --provider ollama` cleanly instead of relying on env vars. Not blocking — env vars work today.
||
|
|
---

## What This Owns

- hermes_local() — a 20-line wrapper around a subprocess call
- model_dashboard.py — read-only views over existing data
- review_decisions.py — optional DPO annotation CLI

## What This Does NOT Own

- Inference. Ollama does that.
- Telemetry. Hermes does that.
- Session storage. Hermes does that.
- Token counting. Hermes does that.
- Training pipeline. Already exists in timmy-config/training/.