timmy-config/docs/local-model-integration-sketch.md

Local Model Integration Sketch v2

Hermes4-14B in the Heartbeat Loop — No New Telemetry

Principle

No new inference layer. Huey tasks call hermes chat -q pointed at Ollama. Hermes handles sessions, token tracking, cost logging. The dashboard reads what Hermes already stores.


Why Not Ollama Directly?

Ollama is fine as a serving backend. The issue isn't Ollama — it's that calling Ollama directly with urllib bypasses the harness. The harness already tracks sessions, tokens, model/provider, platform. Building a second telemetry layer is owning code we don't need.

Ollama as a named provider isn't wired into the --provider flag yet, but routing works via env vars:

HERMES_MODEL="hermes4:14b" \
HERMES_PROVIDER="custom" \
HERMES_BASE_URL="http://localhost:11434/v1" \
hermes chat -q "prompt here" -Q

This creates a tracked session, logs tokens, and returns the response. That's our local inference call.

Alternatives to Ollama for serving:

  • llama.cpp server — lighter, no Python, raw HTTP. Good for single model serving. Less convenient for model switching.
  • vLLM — best throughput, but needs NVIDIA GPU. Not for M3 Mac.
  • MLX serving — native Apple Silicon, but no OpenAI-compat API yet. MLX is for training, not serving (our current policy).
  • llamafile — single binary, portable. Good for distribution.

Verdict: Ollama is fine. It's the standard OpenAI-compat local server on Mac. The issue was never Ollama — it was bypassing the harness.
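Since the heartbeat's fallback logic keys off `model_health.ollama_running`, the perception phase needs a cheap liveness probe. A minimal sketch using Ollama's stock `/api/tags` model-list endpoint (the same one the dashboard polls); the default URL and timeout are assumptions:

```python
import json
import urllib.request


def ollama_alive(base_url="http://localhost:11434", timeout=2):
    """Return True if the Ollama server answers its /api/tags endpoint.

    /api/tags is Ollama's model-list route; any valid JSON reply counts
    as alive. Any connection error, timeout, or bad payload counts as down.
    """
    try:
        with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as resp:
            json.loads(resp.read())
        return True
    except Exception:
        return False
```

This stays read-only, so it doesn't conflict with the "route inference through the harness" principle.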


1. The Call Pattern

One function in tasks.py that all Huey tasks use:

import subprocess
import json
import os

HERMES_BIN = "hermes"
LOCAL_ENV = {
    "HERMES_MODEL": "hermes4:14b",
    "HERMES_PROVIDER": "custom",
    "HERMES_BASE_URL": "http://localhost:11434/v1",
}

def hermes_local(prompt, caller_tag=None, max_retries=2):
    """Call hermes with the local Ollama model. Returns response text, or None.

    Every call creates a hermes session with full telemetry.
    caller_tag gets prepended to the prompt for searchability.
    """
    env = os.environ.copy()
    env.update(LOCAL_ENV)

    tagged_prompt = f"[{caller_tag}] {prompt}" if caller_tag else prompt

    for attempt in range(max_retries + 1):
        try:
            result = subprocess.run(
                [HERMES_BIN, "chat", "-q", tagged_prompt, "-Q", "-t", "none"],
                capture_output=True, text=True,
                timeout=120, env=env,
            )
            if result.returncode == 0 and result.stdout.strip():
                # Strip the session_id line from -Q output
                lines = result.stdout.strip().split("\n")
                response = [ln for ln in lines if not ln.startswith("session_id:")]
                return "\n".join(response).strip()
            # Non-zero exit or empty output: fall through and retry
        except subprocess.TimeoutExpired:
            pass  # retry on timeout as well
    return None

Notes:

  • -t none disables all toolsets — the heartbeat model shouldn't have terminal/file access. Pure reasoning only.
  • -Q quiet mode suppresses banner/spinner, gives clean output.
  • Every call creates a session in Hermes session store. Searchable, exportable, countable.
  • The [caller_tag] prefix lets you filter sessions by which Huey task generated them: hermes sessions list | grep heartbeat
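The `-Q` output parsing inside `hermes_local()` can be pulled into a pure helper, which makes it unit-testable without spawning the CLI. A sketch, assuming `-Q` emits at most one `session_id:` metadata line alongside the response (verify against actual `-Q` output):

```python
def strip_session_id(stdout: str) -> str:
    """Remove the session_id: metadata line from hermes -Q output.

    Assumption: -Q prints at most one line starting with 'session_id:';
    every other line is treated as part of the model response.
    """
    lines = stdout.strip().split("\n")
    kept = [ln for ln in lines if not ln.startswith("session_id:")]
    return "\n".join(kept).strip()
```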

2. Heartbeat DECIDE Phase

Replace the hardcoded if/else with a model call:

# In heartbeat_tick(), replace the DECIDE + ACT section:

    # DECIDE: let hermes4:14b reason about what to do
    decide_prompt = f"""System state at {now.isoformat()}:

{json.dumps(perception, indent=2)}

Previous tick: {last_tick.get('tick_id', 'none')}

You are the heartbeat monitor. Based on this state:
1. List any actions needed (alerts, restarts, escalations). Empty if all OK.
2. Rate severity: ok, warning, or critical.
3. One sentence of reasoning.

Respond ONLY with JSON:
{{"actions": [], "severity": "ok", "reasoning": "..."}}"""

    decision = None
    try:
        raw = hermes_local(decide_prompt, caller_tag="heartbeat_tick")
        if raw:
            # Try to parse JSON from the response.
            # The model might wrap it in markdown, so scan line by line.
            for line in raw.split("\n"):
                line = line.strip()
                if line.startswith("{"):
                    decision = json.loads(line)
                    break
            if not decision:
                decision = json.loads(raw)
    except json.JSONDecodeError:
        decision = None

    # Fallback to hardcoded logic if model fails or is down
    if decision is None:
        actions = []
        if not perception.get("gitea_alive"):
            actions.append("ALERT: Gitea unreachable")
        health = perception.get("model_health", {})
        if isinstance(health, dict) and not health.get("ollama_running"):
            actions.append("ALERT: Ollama not running")
        decision = {
            "actions": actions,
            "severity": "fallback",
            "reasoning": "model unavailable, used hardcoded checks"
        }

    tick_record["decision"] = decision
    actions = decision.get("actions", [])
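The line-by-line scan fails when the model pretty-prints the JSON across multiple lines (and the whole-response fallback then usually fails too, since fences or prose surround the object). A more tolerant extractor sketch, slicing from the first `{` to the last `}`; a proper brace-matching parser would be stricter:

```python
import json


def extract_json(raw):
    """Pull the first JSON object out of a model response, or None.

    Tolerates markdown fences and prose around the object by slicing
    from the first '{' to the last '}'. Multiple adjacent objects will
    fail to parse and return None, which routes to the hardcoded fallback.
    """
    if not raw:
        return None
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end <= start:
        return None
    try:
        return json.loads(raw[start:end + 1])
    except json.JSONDecodeError:
        return None
```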

3. DPO Candidate Collection

No new database. Hermes sessions ARE the DPO candidates.

Every hermes_local() call creates a session. To extract DPO pairs:

# Export all local-model sessions
hermes sessions export --output /tmp/local-sessions.jsonl

# Filter for heartbeat decisions
grep "heartbeat_tick" /tmp/local-sessions.jsonl > heartbeat_decisions.jsonl

The existing session_export Huey task (runs every 4h) already extracts user→assistant pairs. It just needs to be aware that some sessions are now local-model decisions instead of human conversations.
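Once ticks carry labels, turning them into training records is a pure transform over the JSONL. A sketch; the output record shape (`prompt`/`completion`/`label`) is an assumption — align it with whatever format `timmy-config/training/` actually consumes:

```python
import json


def ticks_to_dpo_pairs(tick_lines):
    """Convert annotated heartbeat tick JSONL lines into DPO-style records.

    Only ticks carrying a 'dpo_label' of chosen/rejected are kept;
    unlabeled and fallback ticks pass through untouched. The record
    shape here is illustrative, not the training pipeline's contract.
    """
    pairs = []
    for line in tick_lines:
        line = line.strip()
        if not line:
            continue
        tick = json.loads(line)
        label = tick.get("dpo_label")
        if label not in ("chosen", "rejected"):
            continue
        pairs.append({
            "prompt": json.dumps(tick.get("perception", {})),
            "completion": json.dumps(tick.get("decision", {})),
            "label": label,
        })
    return pairs
```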

For DPO annotation, add a simple review script:

# review_decisions.py — reads heartbeat tick logs, shows model decisions,
# asks Alexander to mark chosen/rejected
# Writes annotations back to the tick log files

import json
from pathlib import Path

TICK_DIR = Path.home() / ".timmy" / "heartbeat"

for log_file in sorted(TICK_DIR.glob("ticks_*.jsonl")):
    for line in log_file.read_text().splitlines():
        if not line.strip():
            continue  # tolerate blank lines / empty files
        tick = json.loads(line)
        decision = tick.get("decision", {})
        if decision.get("severity") == "fallback":
            continue  # skip fallback entries
        
        print(f"\n--- Tick {tick['tick_id']} ---")
        print(f"Perception: {json.dumps(tick['perception'], indent=2)}")
        print(f"Decision:   {json.dumps(decision, indent=2)}")
        
        rating = input("Rate (c=chosen, r=rejected, s=skip): ").strip()
        if rating in ("c", "r"):
            tick["dpo_label"] = "chosen" if rating == "c" else "rejected"
            # write back... (append to annotated file)

4. Dashboard — Reads Hermes Data

#!/usr/bin/env python3
"""Timmy Model Dashboard — reads from Hermes, owns nothing."""

import json
import subprocess
import sys
import time
import urllib.request
from datetime import datetime
from pathlib import Path

HERMES_HOME = Path.home() / ".hermes"
TIMMY_HOME = Path.home() / ".timmy"


def get_ollama_models():
    """What's available in Ollama."""
    try:
        req = urllib.request.Request("http://localhost:11434/api/tags")
        with urllib.request.urlopen(req, timeout=5) as resp:
            return json.loads(resp.read()).get("models", [])
    except Exception:
        return []


def get_loaded_models():
    """What's actually in VRAM right now."""
    try:
        req = urllib.request.Request("http://localhost:11434/api/ps")
        with urllib.request.urlopen(req, timeout=5) as resp:
            return json.loads(resp.read()).get("models", [])
    except Exception:
        return []


def get_huey_status():
    try:
        r = subprocess.run(["pgrep", "-f", "huey_consumer"],
                          capture_output=True, timeout=5)
        return r.returncode == 0
    except Exception:
        return False


def get_hermes_sessions(hours=24):
    """Read session metadata from the Hermes session store.

    hours is accepted for symmetry with render() but not applied yet;
    time-filtering depends on the session timestamp schema.
    """
    sessions_file = HERMES_HOME / "sessions" / "sessions.json"
    if not sessions_file.exists():
        return []
    try:
        data = json.loads(sessions_file.read_text())
        return list(data.values())
    except Exception:
        return []


def get_heartbeat_ticks(date_str=None):
    """Read today's heartbeat ticks."""
    if not date_str:
        date_str = datetime.now().strftime("%Y%m%d")
    tick_file = TIMMY_HOME / "heartbeat" / f"ticks_{date_str}.jsonl"
    if not tick_file.exists():
        return []
    ticks = []
    for line in tick_file.read_text().strip().split("\n"):
        try:
            ticks.append(json.loads(line))
        except Exception:
            continue
    return ticks


def render(hours=24):
    models = get_ollama_models()
    loaded = get_loaded_models()
    huey = get_huey_status()
    sessions = get_hermes_sessions(hours)
    ticks = get_heartbeat_ticks()

    loaded_names = {m.get("name", "") for m in loaded}

    print("\033[2J\033[H")
    print("=" * 70)
    print("  TIMMY MODEL DASHBOARD")
    now = datetime.now().strftime("%Y-%m-%d %H:%M:%S")
    print(f"  {now}  |  Huey: {'UP' if huey else 'DOWN'}  |  Ollama models: {len(models)}")
    print("=" * 70)

    # DEPLOYMENTS
    print("\n  LOCAL MODELS")
    print("  " + "-" * 55)
    for m in models:
        name = m.get("name", "?")
        size_gb = m.get("size", 0) / 1e9
        status = "IN VRAM" if name in loaded_names else "on disk"
        print(f"    {name:35s} {size_gb:5.1f}GB  {status}")
    if not models:
        print("    (Ollama not responding)")

    # HERMES SESSION ACTIVITY
    # Count sessions by platform/provider
    print("\n  HERMES SESSIONS (recent)")
    print("  " + "-" * 55)
    local_sessions = [s for s in sessions
                      if "localhost" in str(s.get("origin", {}))]
    cli_sessions = [s for s in sessions
                    if s.get("platform") == "cli" or s.get("origin", {}).get("platform") == "cli"]

    total_tokens = sum(s.get("total_tokens", 0) for s in sessions)
    print(f"    Total sessions: {len(sessions)}")
    print(f"    Local (Ollama) sessions: {len(local_sessions)}")
    print(f"    CLI sessions: {len(cli_sessions)}")
    print(f"    Total tokens: {total_tokens:,}")

    # HEARTBEAT STATUS
    print(f"\n  HEARTBEAT ({len(ticks)} ticks today)")
    print("  " + "-" * 55)
    if ticks:
        last = ticks[-1]
        decision = last.get("decision", {})
        severity = decision.get("severity", "unknown")
        reasoning = decision.get("reasoning", "no model decision yet")
        print(f"    Last tick: {last.get('tick_id', '?')}")
        print(f"    Severity:  {severity}")
        print(f"    Reasoning: {reasoning[:60]}")

        # Count model vs fallback decisions
        model_decisions = sum(1 for t in ticks
                            if t.get("decision", {}).get("severity") != "fallback")
        fallback = len(ticks) - model_decisions
        print(f"    Model decisions: {model_decisions}  |  Fallback: {fallback}")

        # DPO labels if any
        labeled = sum(1 for t in ticks if "dpo_label" in t)
        if labeled:
            chosen = sum(1 for t in ticks if t.get("dpo_label") == "chosen")
            rejected = sum(1 for t in ticks if t.get("dpo_label") == "rejected")
            print(f"    DPO labeled: {labeled} (chosen: {chosen}, rejected: {rejected})")
    else:
        print("    (no ticks today)")

    # ACTIVE LOOPS
    print("\n  ACTIVE LOOPS USING LOCAL MODELS")
    print("  " + "-" * 55)
    print("    heartbeat_tick    10m    hermes4:14b    DECIDE phase")
    print("    (future)          15m    hermes4:14b    issue triage")
    print("    (future)          daily  timmy:v0.1     morning report")

    print("\n  NON-LOCAL LOOPS (Gemini/Grok API)")
    print("  " + "-" * 55)
    print("    gemini_worker     20m    gemini-2.5-pro   aider")
    print("    grok_worker       20m    grok-3-fast      opencode")
    print("    cross_review      30m    both             PR review")

    print("\n" + "=" * 70)


if __name__ == "__main__":
    watch = "--watch" in sys.argv
    hours = 24
    for a in sys.argv[1:]:
        if a.startswith("--hours="):
            hours = int(a.split("=")[1])
    if watch:
        while True:
            render(hours)
            time.sleep(30)
    else:
        render(hours)

5. Implementation Steps

Step 1: Add hermes_local() to tasks.py

  • One function, ~20 lines
  • Calls hermes chat -q with Ollama env vars
  • All telemetry comes from Hermes for free

Step 2: Wire heartbeat_tick DECIDE phase

  • Replace 6 lines of if/else with hermes_local() call
  • Keep hardcoded fallback when model is down
  • Decision stored in tick record for DPO review

Step 3: Fix the MCP server warning

  • The orchestration MCP server path is broken — harmless but noisy
  • Either fix the path or remove from config

Step 4: Drop model_dashboard.py in timmy-config/bin/

  • Reads Ollama API, Hermes sessions, heartbeat ticks
  • No new data stores — just views over existing ones
  • python3 model_dashboard.py --watch for live view

Step 5: Expand to more Huey tasks

  • triage_issues: model reads issue, picks agent
  • good_morning_report: model writes the "From Timmy" section
  • Each expansion is just calling hermes_local() with a different prompt
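Each expansion follows the heartbeat pattern: build a prompt, call `hermes_local()`, parse the JSON. A hypothetical triage prompt builder as a sketch — the issue dict shape is an assumption (adapt to what the Gitea-polling task actually passes), and it reuses the same respond-ONLY-with-JSON contract:

```python
def build_triage_prompt(issue):
    """Build a DECIDE-style prompt for issue triage.

    issue: dict with 'title' and 'body' keys (hypothetical shape).
    Agent names match the existing worker loops; 'human' escalates.
    """
    return (
        f"Issue: {issue.get('title', '(untitled)')}\n\n"
        f"{issue.get('body', '')}\n\n"
        "You are the triage monitor. Pick one agent for this issue:\n"
        "gemini_worker, grok_worker, or human.\n"
        'Respond ONLY with JSON: {"agent": "...", "reasoning": "..."}'
    )
```

The response would then go through the same JSON extraction and hardcoded fallback as the heartbeat DECIDE phase.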

What Gets Hotfixed in Hermes Config

If hermes insights is broken (the cache_read_tokens column error), that needs a fix. The dashboard falls back to reading sessions.json directly, but insights would be the better data source.

The providers.ollama section in config.yaml exists but isn't wired to the --provider flag. Filing this upstream or patching locally would let us do hermes chat -q "..." --provider ollama cleanly instead of relying on env vars. Not blocking — env vars work today.


What This Owns

  • hermes_local() — 20-line wrapper around a subprocess call
  • model_dashboard.py — read-only views over existing data
  • review_decisions.py — optional DPO annotation CLI

What This Does NOT Own

  • Inference. Ollama does that.
  • Telemetry. Hermes does that.
  • Session storage. Hermes does that.
  • Token counting. Hermes does that.
  • Training pipeline. Already exists in timmy-config/training/.