Build Hermes health monitor — system resources + model management #1073

Closed
opened 2026-03-23 13:11:55 +00:00 by perplexity · 1 comment
Collaborator

Parent: #1070

Objective

Timmy monitors the health of Hermes (the M3 Max) and keeps it running smoothly. This includes system resources, Ollama model management, and cleanup.

Monitors

  1. Memory pressure — track unified memory usage, alert if <4GB free, unload models if needed
  2. Disk usage — alert if <10GB free, identify and clean large temp files/logs
  3. Ollama status — verify models are loaded, restart if crashed, manage OLLAMA_MAX_LOADED_MODELS
  4. Process health — detect stuck/zombie processes, kill and restart
  5. Network — verify Gitea connectivity, API endpoints responding
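A minimal sketch of the disk monitor using only the Python standard library. The names `disk_is_low` and `large_files` and the 100 MiB size cutoff are hypothetical and do not come from the issue. Only the 10GB free-space threshold does, read here as 10 GiB.

```python
import shutil
from pathlib import Path

GIB = 1024 ** 3


def disk_is_low(path: str = "/", min_free_bytes: int = 10 * GIB) -> bool:
    """Return True when the filesystem holding `path` has less free space than the threshold."""
    return shutil.disk_usage(path).free < min_free_bytes


def large_files(root: Path, min_size_bytes: int = 100 * 1024 ** 2) -> list[tuple[Path, int]]:
    """List regular files under `root` at or above the cutoff, largest first."""
    found = []
    for path in root.rglob("*"):
        try:
            # Skip symlinks and anything that is not a regular file.
            if path.is_symlink() or not path.is_file():
                continue
            size = path.stat().st_size
        except OSError:
            # The file vanished or is unreadable; ignore it and keep scanning.
            continue
        if size >= min_size_bytes:
            found.append((path, size))
    return sorted(found, key=lambda item: item[1], reverse=True)
```

The sketch only identifies candidates. Deleting them is left to the caller, so a cleanup policy can be reviewed separately from the scan.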

Actions Timmy Can Take

  • Unload unused Ollama models to free memory
  • Clean /tmp, old logs, stale build artifacts
  • Restart Ollama if it crashes
  • Kill stuck agent processes
  • Alert Alex if something needs human intervention

Acceptance Criteria

  • Health check runs every 5 minutes
  • Memory/disk/process issues are auto-resolved where possible
  • Alerts sent to Alex only when Timmy can't fix it himself
  • Health status visible in dashboard or Gitea journal
claude self-assigned this 2026-03-23 13:32:53 +00:00
claude added the harness, inference, and p0-critical labels 2026-03-23 13:53:03 +00:00
Collaborator

PR created: #1133

What was built:

HermesMonitor in src/infrastructure/hermes/monitor.py — a background health monitor running every 5 minutes (configurable) with five checks:

  1. Memory — reads vm_stat + sysctl hw.memsize on macOS; if free memory < 4GB, auto-unloads Ollama models via keep_alive=0 API trick
  2. Disk — uses shutil.disk_usage; if free < 10GB, cleans /tmp files older than 24 hours
  3. Ollama — hits /api/tags and /api/ps; if unreachable, attempts restart via launchctl then brew services
  4. Processes — scans ps aux for state Z (zombie) processes
  5. Network — pings settings.gitea_url/api/v1/version; alerts if unreachable
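The memory check above could be sketched roughly as follows. This is not the code from the PR: the function names are hypothetical, and counting "Pages free" plus "Pages inactive" as reclaimable memory is an assumption about how free memory is defined. The parser takes the text output of `vm_stat` as a string, so it can be exercised without running on macOS.

```python
import re

GIB = 1024 ** 3


def free_bytes_from_vm_stat(
    output: str,
    counted: tuple[str, ...] = ("Pages free", "Pages inactive"),
) -> int:
    """Sum the selected vm_stat page counters and convert them to bytes."""
    header = re.search(r"page size of (\d+) bytes", output)
    if header is None:
        raise ValueError("vm_stat output has no page size header")
    page_size = int(header.group(1))

    total_pages = 0
    for name in counted:
        # Counter lines look like "Pages free:      1000."
        match = re.search(rf"^{re.escape(name)}:\s+(\d+)\.", output, re.MULTILINE)
        if match:
            total_pages += int(match.group(1))
    return total_pages * page_size


def memory_is_low(free_bytes: int, min_free_bytes: int = 4 * GIB) -> bool:
    """Apply the 4GB threshold from the issue (read here as 4 GiB)."""
    return free_bytes < min_free_bytes
```

In a live check the `output` string would come from running `vm_stat`, for example through `subprocess.run(["vm_stat"], capture_output=True, text=True)`.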

All checks run concurrently (asyncio.gather). Auto-fixes fire push notifications via notifier; critical issues trigger native macOS popups. Manual-intervention alerts fire only when Timmy cannot self-heal.
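One way the concurrent run could look, as a sketch only. `CheckResult` and `run_checks` are hypothetical names, and the use of `return_exceptions=True` is an assumption: it lets one failing check be reported as unhealthy instead of cancelling the whole cycle.

```python
import asyncio
from dataclasses import dataclass
from typing import Awaitable, Callable


@dataclass
class CheckResult:
    name: str
    ok: bool
    detail: str = ""


async def run_checks(
    checks: dict[str, Callable[[], Awaitable[CheckResult]]],
) -> list[CheckResult]:
    """Run every check concurrently and return results in the order given."""
    names = list(checks)
    outcomes = await asyncio.gather(
        *(checks[name]() for name in names),
        return_exceptions=True,  # collect failures instead of raising the first one
    )
    results = []
    for name, outcome in zip(names, outcomes):
        if isinstance(outcome, BaseException):
            # A check that raised is recorded as a failed check.
            results.append(CheckResult(name, False, f"check raised: {outcome!r}"))
        else:
            results.append(outcome)
    return results
```

Because `asyncio.gather` preserves argument order, each outcome can be matched back to its check name with a plain `zip`.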

New endpoints: GET /hermes/status (cached last report) and POST /hermes/check (immediate cycle).

28 unit tests cover all code paths. Full suite: 3295 passed.

Reference: Rockachopa/Timmy-time-dashboard#1073