# Local Model Integration Sketch v2

# Hermes4-14B in the Heartbeat Loop — No New Telemetry

## Principle

No new inference layer. Huey tasks call `hermes chat -q` pointed at Ollama. Hermes handles sessions, token tracking, cost logging. The dashboard reads what Hermes already stores.

---

## Why Not Ollama Directly?

Ollama is fine as a serving backend. The issue isn't Ollama — it's that calling Ollama directly with urllib bypasses the harness. The harness already tracks sessions, tokens, model/provider, and platform. Building a second telemetry layer is owning code we don't need.

Ollama as a named provider isn't wired into the `--provider` flag yet, but routing works via env vars:

```bash
HERMES_MODEL="hermes4:14b" \
HERMES_PROVIDER="custom" \
HERMES_BASE_URL="http://localhost:11434/v1" \
hermes chat -q "prompt here" -Q
```

This creates a tracked session, logs tokens, and returns the response. That's our local inference call.

### Alternatives to Ollama for serving

- **llama.cpp server** — lighter, no Python, raw HTTP. Good for single-model serving; less convenient for model switching.
- **vLLM** — best throughput, but needs an NVIDIA GPU. Not for an M3 Mac.
- **MLX serving** — native Apple Silicon, but no OpenAI-compatible API yet. MLX is for training, not serving (our current policy).
- **llamafile** — single binary, portable. Good for distribution.

Verdict: Ollama is fine. It's the standard OpenAI-compatible local server on Mac. The issue was never Ollama — it was bypassing the harness.

---

## 1. The Call Pattern

One function in tasks.py that all Huey tasks use:

```python
import subprocess
import json

HERMES_BIN = "hermes"
LOCAL_ENV = {
    "HERMES_MODEL": "hermes4:14b",
    "HERMES_PROVIDER": "custom",
    "HERMES_BASE_URL": "http://localhost:11434/v1",
}


def hermes_local(prompt, caller_tag=None, max_retries=2):
    """Call hermes with the local Ollama model. Returns response text.

    Every call creates a hermes session with full telemetry.
    caller_tag gets prepended to the prompt for searchability.
""" import os env = os.environ.copy() env.update(LOCAL_ENV) tagged_prompt = prompt if caller_tag: tagged_prompt = f"[{caller_tag}] {prompt}" for attempt in range(max_retries + 1): try: result = subprocess.run( [HERMES_BIN, "chat", "-q", tagged_prompt, "-Q", "-t", "none"], capture_output=True, text=True, timeout=120, env=env, ) if result.returncode == 0 and result.stdout.strip(): # Strip the session_id line from -Q output lines = result.stdout.strip().split("\n") response_lines = [l for l in lines if not l.startswith("session_id:")] return "\n".join(response_lines).strip() except subprocess.TimeoutExpired: if attempt == max_retries: return None continue return None ``` Notes: - `-t none` disables all toolsets — the heartbeat model shouldn't have terminal/file access. Pure reasoning only. - `-Q` quiet mode suppresses banner/spinner, gives clean output. - Every call creates a session in Hermes session store. Searchable, exportable, countable. - The `[caller_tag]` prefix lets you filter sessions by which Huey task generated them: `hermes sessions list | grep heartbeat` --- ## 2. Heartbeat DECIDE Phase Replace the hardcoded if/else with a model call: ```python # In heartbeat_tick(), replace the DECIDE + ACT section: # DECIDE: let hermes4:14b reason about what to do decide_prompt = f"""System state at {now.isoformat()}: {json.dumps(perception, indent=2)} Previous tick: {last_tick.get('tick_id', 'none')} You are the heartbeat monitor. Based on this state: 1. List any actions needed (alerts, restarts, escalations). Empty if all OK. 2. Rate severity: ok, warning, or critical. 3. One sentence of reasoning. 
Respond ONLY with JSON: {{"actions": [], "severity": "ok", "reasoning": "..."}}"""

decision = None
try:
    raw = hermes_local(decide_prompt, caller_tag="heartbeat_tick")
    if raw:
        # Try to parse JSON from the response.
        # The model might wrap it in markdown, so extract the JSON line first.
        for line in raw.split("\n"):
            line = line.strip()
            if line.startswith("{"):
                decision = json.loads(line)
                break
        if not decision:
            decision = json.loads(raw)
except Exception:
    decision = None

# Fallback to hardcoded logic if the model fails or is down
if decision is None:
    actions = []
    if not perception.get("gitea_alive"):
        actions.append("ALERT: Gitea unreachable")
    health = perception.get("model_health", {})
    if isinstance(health, dict) and not health.get("ollama_running"):
        actions.append("ALERT: Ollama not running")
    decision = {
        "actions": actions,
        "severity": "fallback",
        "reasoning": "model unavailable, used hardcoded checks",
    }

tick_record["decision"] = decision
actions = decision.get("actions", [])
```

---

## 3. DPO Candidate Collection

No new database. Hermes sessions ARE the DPO candidates.

Every `hermes_local()` call creates a session. To extract DPO pairs:

```bash
# Export all local-model sessions
hermes sessions export --output /tmp/local-sessions.jsonl

# Filter for heartbeat decisions
grep "heartbeat_tick" /tmp/local-sessions.jsonl > heartbeat_decisions.jsonl
```

The existing `session_export` Huey task (runs every 4h) already extracts user→assistant pairs. It just needs to be aware that some sessions are now local-model decisions instead of human conversations.
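The grep filter above can also be done in Python, which makes it easier to turn tagged sessions into prompt/response pairs. This is a sketch under an assumed export schema — the `messages`, `role`, and `content` field names are guesses at the export format, not a documented contract, so adjust to whatever `hermes sessions export` actually emits:

```python
import json


def extract_pairs(jsonl_path, tag="heartbeat_tick"):
    """Pull user->assistant pairs from an exported sessions JSONL file.

    Assumes each line is a session object with a `messages` list of
    {"role": ..., "content": ...} dicts (an assumption about the export
    schema). Keeps only pairs whose user turn carries the caller tag.
    """
    pairs = []
    with open(jsonl_path) as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            session = json.loads(line)
            msgs = session.get("messages", [])
            # Walk consecutive message pairs looking for user -> assistant turns
            for user, assistant in zip(msgs, msgs[1:]):
                if (user.get("role") == "user"
                        and assistant.get("role") == "assistant"
                        and tag in user.get("content", "")):
                    pairs.append({
                        "prompt": user["content"],
                        "response": assistant["content"],
                    })
    return pairs
```

Whether this runs as its own script or folds into the existing `session_export` task, the output is the same: a list of candidate pairs ready for chosen/rejected annotation.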
For DPO annotation, add a simple review script:

```python
# review_decisions.py — reads heartbeat tick logs, shows model decisions,
# and asks Alexander to mark them chosen/rejected.
# Writes annotations back to the tick log files.
import json
from pathlib import Path

TICK_DIR = Path.home() / ".timmy" / "heartbeat"

for log_file in sorted(TICK_DIR.glob("ticks_*.jsonl")):
    for line in log_file.read_text().strip().split("\n"):
        tick = json.loads(line)
        decision = tick.get("decision", {})
        if decision.get("severity") == "fallback":
            continue  # skip fallback entries
        print(f"\n--- Tick {tick['tick_id']} ---")
        print(f"Perception: {json.dumps(tick['perception'], indent=2)}")
        print(f"Decision: {json.dumps(decision, indent=2)}")
        rating = input("Rate (c=chosen, r=rejected, s=skip): ").strip()
        if rating in ("c", "r"):
            tick["dpo_label"] = "chosen" if rating == "c" else "rejected"
            # write back... (append to annotated file)
```

---

## 4. Dashboard — Reads Hermes Data

```python
#!/usr/bin/env python3
"""Timmy Model Dashboard — reads from Hermes, owns nothing."""
import json
import os
import subprocess
import sys
import time
import urllib.request
from datetime import datetime
from pathlib import Path

HERMES_HOME = Path.home() / ".hermes"
TIMMY_HOME = Path.home() / ".timmy"


def get_ollama_models():
    """What's available in Ollama."""
    try:
        req = urllib.request.Request("http://localhost:11434/api/tags")
        with urllib.request.urlopen(req, timeout=5) as resp:
            return json.loads(resp.read()).get("models", [])
    except Exception:
        return []


def get_loaded_models():
    """What's actually in VRAM right now."""
    try:
        req = urllib.request.Request("http://localhost:11434/api/ps")
        with urllib.request.urlopen(req, timeout=5) as resp:
            return json.loads(resp.read()).get("models", [])
    except Exception:
        return []


def get_huey_status():
    try:
        r = subprocess.run(["pgrep", "-f", "huey_consumer"],
                           capture_output=True, timeout=5)
        return r.returncode == 0
    except Exception:
        return False


def get_hermes_sessions(hours=24):
    """Read
    session metadata from Hermes session store."""
    sessions_file = HERMES_HOME / "sessions" / "sessions.json"
    if not sessions_file.exists():
        return []
    try:
        data = json.loads(sessions_file.read_text())
        return list(data.values())
    except Exception:
        return []


def get_heartbeat_ticks(date_str=None):
    """Read today's heartbeat ticks."""
    if not date_str:
        date_str = datetime.now().strftime("%Y%m%d")
    tick_file = TIMMY_HOME / "heartbeat" / f"ticks_{date_str}.jsonl"
    if not tick_file.exists():
        return []
    ticks = []
    for line in tick_file.read_text().strip().split("\n"):
        try:
            ticks.append(json.loads(line))
        except Exception:
            continue
    return ticks


def render(hours=24):
    models = get_ollama_models()
    loaded = get_loaded_models()
    huey = get_huey_status()
    sessions = get_hermes_sessions(hours)
    ticks = get_heartbeat_ticks()
    loaded_names = {m.get("name", "") for m in loaded}

    print("\033[2J\033[H")
    print("=" * 70)
    print(" TIMMY MODEL DASHBOARD")
    now = datetime.now().strftime("%Y-%m-%d %H:%M:%S")
    print(f" {now} | Huey: {'UP' if huey else 'DOWN'} | Ollama models: {len(models)}")
    print("=" * 70)

    # DEPLOYMENTS
    print("\n LOCAL MODELS")
    print(" " + "-" * 55)
    for m in models:
        name = m.get("name", "?")
        size_gb = m.get("size", 0) / 1e9
        status = "IN VRAM" if name in loaded_names else "on disk"
        print(f" {name:35s} {size_gb:5.1f}GB {status}")
    if not models:
        print(" (Ollama not responding)")

    # HERMES SESSION ACTIVITY
    # Count sessions by platform/provider
    print(f"\n HERMES SESSIONS (recent)")
    print(" " + "-" * 55)
    local_sessions = [s for s in sessions if "localhost" in str(s.get("origin", {}))]
    cli_sessions = [s for s in sessions
                    if s.get("platform") == "cli"
                    or s.get("origin", {}).get("platform") == "cli"]
    total_tokens = sum(s.get("total_tokens", 0) for s in sessions)
    print(f" Total sessions: {len(sessions)}")
    print(f" CLI sessions: {len(cli_sessions)}")
    print(f" Total tokens: {total_tokens:,}")

    # HEARTBEAT STATUS
    print(f"\n HEARTBEAT ({len(ticks)} ticks today)")
    print(" " + "-" * 55)
    if ticks:
        last = ticks[-1]
        decision = last.get("decision", {})
        severity = decision.get("severity", "unknown")
        reasoning = decision.get("reasoning", "no model decision yet")
        print(f" Last tick: {last.get('tick_id', '?')}")
        print(f" Severity: {severity}")
        print(f" Reasoning: {reasoning[:60]}")
        # Count model vs fallback decisions
        model_decisions = sum(1 for t in ticks
                              if t.get("decision", {}).get("severity") != "fallback")
        fallback = len(ticks) - model_decisions
        print(f" Model decisions: {model_decisions} | Fallback: {fallback}")
        # DPO labels, if any
        labeled = sum(1 for t in ticks if "dpo_label" in t)
        if labeled:
            chosen = sum(1 for t in ticks if t.get("dpo_label") == "chosen")
            rejected = sum(1 for t in ticks if t.get("dpo_label") == "rejected")
            print(f" DPO labeled: {labeled} (chosen: {chosen}, rejected: {rejected})")
    else:
        print(" (no ticks today)")

    # ACTIVE LOOPS
    print(f"\n ACTIVE LOOPS USING LOCAL MODELS")
    print(" " + "-" * 55)
    print(" heartbeat_tick   10m    hermes4:14b     DECIDE phase")
    print(" (future)         15m    hermes4:14b     issue triage")
    print(" (future)         daily  timmy:v0.1      morning report")

    print(f"\n NON-LOCAL LOOPS (Gemini/Grok API)")
    print(" " + "-" * 55)
    print(" gemini_worker    20m    gemini-2.5-pro  aider")
    print(" grok_worker      20m    grok-3-fast     opencode")
    print(" cross_review     30m    both            PR review")

    print("\n" + "=" * 70)


if __name__ == "__main__":
    watch = "--watch" in sys.argv
    hours = 24
    for a in sys.argv[1:]:
        if a.startswith("--hours="):
            hours = int(a.split("=")[1])
    if watch:
        while True:
            render(hours)
            time.sleep(30)
    else:
        render(hours)
```

---

## 5. Implementation Steps

### Step 1: Add hermes_local() to tasks.py

- One function, ~20 lines
- Calls `hermes chat -q` with Ollama env vars
- All telemetry comes from Hermes for free

### Step 2: Wire the heartbeat_tick DECIDE phase

- Replace 6 lines of if/else with a hermes_local() call
- Keep the hardcoded fallback for when the model is down
- Decision stored in the tick record for DPO review

### Step 3: Fix the MCP server warning

- The orchestration MCP server path is broken — harmless but noisy
- Either fix the path or remove it from config

### Step 4: Drop model_dashboard.py in timmy-config/bin/

- Reads the Ollama API, Hermes sessions, and heartbeat ticks
- No new data stores — just views over existing ones
- `python3 model_dashboard.py --watch` for a live view

### Step 5: Expand to more Huey tasks

- triage_issues: model reads the issue, picks an agent
- good_morning_report: model writes the "From Timmy" section
- Each expansion is just calling hermes_local() with a different prompt

---

## What Gets Hotfixed in Hermes Config

If `hermes insights` is broken (the cache_read_tokens column error), that needs a fix. The dashboard falls back to reading sessions.json directly, but insights would be the better data source.

The `providers.ollama` section in config.yaml exists but isn't wired to the `--provider` flag. Filing this upstream or patching locally would let us run `hermes chat -q "..." --provider ollama` cleanly instead of relying on env vars. Not blocking — env vars work today.

---

## What This Owns

- hermes_local() — a 20-line wrapper around a subprocess call
- model_dashboard.py — read-only views over existing data
- review_decisions.py — optional DPO annotation CLI

## What This Does NOT Own

- Inference. Ollama does that.
- Telemetry. Hermes does that.
- Session storage. Hermes does that.
- Token counting. Hermes does that.
- Training pipeline. Already exists in timmy-config/training/.