ezra-environment/protected/skills-backup/devops/wizard-house-remote-triage/SKILL.md

---
name: wizard-house-remote-triage
description: Diagnose and fix a sibling wizard house that is unresponsive or misbehaving on a remote VPS. SSH in, check resources, inspect config, kill runaway processes, restart services.
version: 1.0.0
author: Ezra
license: MIT
metadata:
  hermes:
    tags: [wizard-house, devops, ssh, triage, remote, debugging]
    related_skills: [gitea-wizard-onboarding, telegram-bot-profile, systematic-debugging]
---

# Wizard House Remote Triage

## When to Use

- A sibling wizard (Allegro, Bezalel, etc.) is not responding on Telegram
- A wizard's VPS is suspected to be resource-starved or misconfigured
- You need to inspect or fix another wizard's deployment from your own VPS

## Prerequisites

- SSH access to the target VPS (key in authorized_keys)
- If no key exists: generate with `ssh-keygen -t ed25519 -f ~/.ssh/id_ed25519 -N "" -C "ezra@hermes-vps"` and have Alexander drop the pubkey on the target

## Phase 1: Can You Reach It?

```bash
# Ping first
ping -c 2 -W 2 TARGET_IP

# Try SSH
ssh -o ConnectTimeout=10 -o StrictHostKeyChecking=accept-new root@TARGET_IP 'hostname; whoami'
```

If the TCP connection succeeds but SSH times out during banner exchange, the box is most likely resource-starved: sshd cannot respond because CPU or RAM is exhausted. Ask Alexander to free resources from the DigitalOcean web console (Droplets > droplet > Console) before trying again.

Quick commands for Alexander to run in the DO console:
```bash
# Check what's eating RAM
ps aux --sort=-%mem | head -10
free -h

# Common resource hogs to kill
systemctl stop docker containerd nginx
systemctl disable docker containerd
```
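
The banner-timeout symptom described above can be told apart from other SSH failures by looking at the client's stderr. The following is a minimal sketch, not part of the original skill: `classify_ssh_error` and its labels are hypothetical names, and the matched strings are typical OpenSSH client messages whose exact wording may differ between versions.

```bash
# Classify an SSH attempt from its exit status and stderr text.
# Labels (starved, sshd-down, auth, unreachable) are illustrative, not from the skill.
classify_ssh_error() {
  local status="$1" err="$2"
  if [ "$status" -eq 0 ]; then
    echo "ok"
    return 0
  fi
  case "$err" in
    *"banner exchange"*)    echo "starved" ;;     # TCP connected, sshd never sent its banner
    *"Connection refused"*) echo "sshd-down" ;;   # host is up, nothing listening on the port
    *"Permission denied"*)  echo "auth" ;;        # key is not in authorized_keys
    *"timed out"*|*"No route to host"*) echo "unreachable" ;;
    *)                      echo "unknown" ;;
  esac
}

# Usage (TARGET_IP is a placeholder, as elsewhere in this skill):
#   err=$(ssh -o ConnectTimeout=10 -o BatchMode=yes root@TARGET_IP true 2>&1 >/dev/null)
#   classify_ssh_error "$?" "$err"
```

A result of `starved` is the case that needs the DO console; `auth` points back to the Prerequisites section instead.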

## Phase 2: Inspect the Box

Once SSH works:

```bash
# Resources
ssh root@TARGET 'free -h; echo "---"; ps aux --sort=-%mem | head -8'

# Wizard service status
ssh root@TARGET 'systemctl status hermes-WIZARD.service 2>&1 | head -20'

# Recent logs (look for errors, stuck sessions, long tool calls)
ssh root@TARGET 'journalctl -u hermes-WIZARD.service --no-pager -n 50'
```

## Phase 3: Inspect Config

Check if Telegram and provider are properly wired:

```bash
# Environment file - must have bot token AND provider key
ssh root@TARGET 'cat /root/wizards/WIZARD/home/.env'

# Config - check model, provider, platforms section
ssh root@TARGET 'cat /root/wizards/WIZARD/home/config.yaml'
```

Required .env vars for Telegram-connected wizard:
- Provider API key (e.g., KIMI_API_KEY, ANTHROPIC_API_KEY)
- TELEGRAM_BOT_TOKEN
- TELEGRAM_HOME_CHANNEL
- TELEGRAM_HOME_CHANNEL_NAME
- TELEGRAM_ALLOWED_USERS

If Telegram vars are missing, the deploy script was incomplete. Add them and restart.
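
The Telegram part of the checklist above can be verified without printing any secrets. This is a sketch under assumptions: `missing_telegram_vars` is a hypothetical helper, it expects plain `KEY=value` lines (optionally prefixed with `export`), and it skips the provider key because that name varies per wizard.

```bash
# Print the names of required Telegram variables that are absent or empty
# in an env file. Only names are printed, never values.
# Caveat: a value written as empty quotes (VAR="") still counts as set.
missing_telegram_vars() {
  local env_file="$1" var
  for var in TELEGRAM_BOT_TOKEN TELEGRAM_HOME_CHANNEL \
             TELEGRAM_HOME_CHANNEL_NAME TELEGRAM_ALLOWED_USERS; do
    # Match VAR=<at least one character> at the start of a line.
    grep -Eq "^(export[[:space:]]+)?${var}=.+" "$env_file" || echo "$var"
  done
}

# Usage, assuming the remote login shell is bash-compatible:
#   ssh root@TARGET "$(declare -f missing_telegram_vars); \
#     missing_telegram_vars /root/wizards/WIZARD/home/.env"
```

Empty output means all four Telegram variables are present. Any name it prints should be added to the .env before restarting.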

## Phase 4: Fix Common Problems

### Resource starvation (< 200MB available)
```bash
# Kill Docker if not needed
ssh root@TARGET 'systemctl stop docker containerd; systemctl disable docker containerd'

# Kill any runaway processes spawned by the wizard.
# The [x] brackets keep pkill -f from matching the remote shell that runs this command.
ssh root@TARGET 'pkill -f "[k]hatru-relay|[c]addy|[n]ginx|[g]o build" 2>/dev/null'
```
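
To check the 200MB threshold numerically rather than by reading `free -h`, the available memory can be parsed from `/proc/meminfo`. This is a minimal sketch: `mem_available_mb` and `mem_verdict` are hypothetical helpers, and the intermediate `tight` label is an assumption layered on the skill's own 200MB and 500MB figures.

```bash
# Read /proc/meminfo-formatted text on stdin and print MemAvailable in whole MB.
mem_available_mb() {
  awk '/^MemAvailable:/ { printf "%d\n", $2 / 1024 }'
}

# Map an MB figure to a verdict using the thresholds from this skill:
# under 200 is starved, 500 or more is healthy, anything between is "tight".
mem_verdict() {
  local mb="$1"
  if [ "$mb" -lt 200 ]; then
    echo "starved"
  elif [ "$mb" -lt 500 ]; then
    echo "tight"
  else
    echo "healthy"
  fi
}

# Usage:
#   mem_verdict "$(ssh root@TARGET 'cat /proc/meminfo' | mem_available_mb)"
```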

### Stuck session (wizard busy with a long-running task)
```bash
# Restart the service - clears the stuck session
ssh root@TARGET 'systemctl restart hermes-WIZARD.service && sleep 2 && systemctl is-active hermes-WIZARD.service'
```
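
A fixed `sleep 2` may be too short on a slow box. One alternative is to poll until the service reports active. The sketch below assumes a hypothetical `retry_until` helper, and its attempt count and delay are illustrative values rather than tuned ones.

```bash
# Run a command up to MAX times, sleeping DELAY seconds between attempts.
# Returns 0 as soon as the command succeeds, 1 if every attempt fails.
retry_until() {
  local max="$1" delay="$2" attempt=1
  shift 2
  while [ "$attempt" -le "$max" ]; do
    "$@" && return 0
    attempt=$((attempt + 1))
    # Only sleep if another attempt is coming.
    if [ "$attempt" -le "$max" ]; then
      sleep "$delay"
    fi
  done
  return 1
}

# Usage: restart, then poll up to 5 times, 2 seconds apart.
#   ssh root@TARGET 'systemctl restart hermes-WIZARD.service'
#   retry_until 5 2 ssh root@TARGET 'systemctl is-active --quiet hermes-WIZARD.service' \
#     && echo "service is active" || echo "service did not come up"
```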

### Wrong provider or missing key
Edit the .env and config.yaml, then restart:
```bash
ssh root@TARGET 'systemctl restart hermes-WIZARD.service'
```

## Phase 5: Verify

```bash
# Service running?
ssh root@TARGET 'systemctl is-active hermes-WIZARD.service'

# RAM healthy? (> 500MB available on 2GB box)
ssh root@TARGET 'free -h'

# Logs show Telegram connected?
ssh root@TARGET 'journalctl -u hermes-WIZARD.service --no-pager -n 10 | grep -i telegram'
```

Then ask Alexander to message the wizard on Telegram to confirm it responds.

## Reaching Alexander's Mac

If you need to diagnose local Timmy directly, Alexander's Mac is on the Tailscale network:
- Tailscale IP: `100.124.176.28` (hostname: `mm`)
- User: `apayne`
- SSH key must be in `~/.ssh/authorized_keys` on the Mac
- System python is 3.9; use `~/.hermes/hermes-agent/venv/bin/python3` for 3.10+ syntax
- llama-server model path: `/Users/apayne/models/hermes4-14b/NousResearch_Hermes-4-14B-Q4_K_M.gguf`
- Timmy workspace: `~/.timmy/`, Hermes home: `~/.hermes/`

## Pitfalls

1. **SSH banner timeout = RAM starvation.** Don't keep retrying SSH. The box needs resources freed via DO console first.
2. **2GB droplets cannot run Hermes + Docker + nginx + a relay.** Hermes alone needs ~500MB-1GB. Budget accordingly.
3. **Deploy scripts may be incomplete.** Bezalel's deploy script for Allegro injected only KIMI_API_KEY and omitted the Telegram vars. Always verify that .env has everything.
4. **Wizards build infrastructure when left unsupervised.** If a wizard was given a broad task, check what processes it spawned. Kill anything not essential.
5. **Always check the Gitea PR for the deploy script.** It is the source of truth for what SHOULD be configured. Compare it against what IS configured.
6. **Restart clears stuck sessions.** If a wizard is mid-session on a slow model (Kimi), restarting the service is the fastest way to unstick it.