ezra-environment/protected/skills-backup/devops/wizard-house-remote-triage/SKILL.md
2026-04-03 22:42:06 +00:00


---
name: wizard-house-remote-triage
description: Diagnose and fix a sibling wizard house that is unresponsive or misbehaving on a remote VPS. SSH in, check resources, inspect config, kill runaway processes, restart services.
version: 1.0.0
author: Ezra
license: MIT
metadata:
  hermes:
    tags: [wizard-house, devops, ssh, triage, remote, debugging]
    related_skills: [gitea-wizard-onboarding, telegram-bot-profile, systematic-debugging]
---
# Wizard House Remote Triage
## When to Use
- A sibling wizard (Allegro, Bezalel, etc.) is not responding on Telegram
- A wizard's VPS is suspected to be resource-starved or misconfigured
- You need to inspect or fix another wizard's deployment from your own VPS
## Prerequisites
- SSH access to the target VPS (key in authorized_keys)
- If no key exists: generate with `ssh-keygen -t ed25519 -f ~/.ssh/id_ed25519 -N "" -C "ezra@hermes-vps"` and have Alexander drop the pubkey on the target
## Phase 1: Can You Reach It?
```bash
# Ping first
ping -c 2 -W 2 TARGET_IP
# Try SSH
ssh -o ConnectTimeout=10 -o StrictHostKeyChecking=no root@TARGET_IP 'hostname; whoami'
```
If the TCP connection succeeds but SSH times out during banner exchange, the box is most likely resource-starved: sshd cannot respond because CPU or RAM is exhausted. Retrying will not help. Ask Alexander to use the DigitalOcean web console (Droplets > droplet > Console) to free resources first.

Quick commands for Alexander to run on the DO console:
```bash
# Check what's eating RAM
ps aux --sort=-%mem | head -10
free -h
# Stop common resource hogs (skip any the box actually needs)
systemctl stop docker containerd nginx
# Keep Docker from starting again on reboot
systemctl disable docker containerd
```
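To tell the banner-timeout symptom apart from other SSH failures, the error text can be sorted into a few buckets. This is a minimal sketch: the function name `classify_ssh_error` and the category labels are hypothetical, and the matched strings are the usual OpenSSH client messages, which may differ between versions.

```bash
# Classify an ssh failure from its stderr text (read from stdin).
# The labels printed here are illustrative; they are not part of any tool.
classify_ssh_error() {
  local msg
  msg="$(cat)"
  case "$msg" in
    # TCP connected but sshd never sent its banner: treat as resource starvation.
    *"banner exchange"*)                echo "starved" ;;
    # Host is up but nothing is listening on the SSH port.
    *"Connection refused"*)             echo "sshd-down" ;;
    # sshd answered; the key is probably not in authorized_keys.
    *"Permission denied"*)              echo "auth" ;;
    # No TCP connection at all.
    *"timed out"*|*"No route to host"*) echo "unreachable" ;;
    *)                                  echo "unknown" ;;
  esac
}

# Usage (needs a real target, so it is left commented out):
#   err="$(ssh -o ConnectTimeout=10 root@TARGET_IP true 2>&1 >/dev/null)" \
#     || printf '%s\n' "$err" | classify_ssh_error
```

Only a `starved` result calls for the DO console; `auth` points back to the Prerequisites section instead.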
## Phase 2: Inspect the Box
Once SSH works:
```bash
# Resources
ssh root@TARGET 'free -h; echo "---"; ps aux --sort=-%mem | head -8'
# Wizard service status
ssh root@TARGET 'systemctl status hermes-WIZARD.service 2>&1 | head -20'
# Recent logs (look for errors, stuck sessions, long tool calls)
ssh root@TARGET 'journalctl -u hermes-WIZARD.service --no-pager -n 50'
```
## Phase 3: Inspect Config
Check if Telegram and provider are properly wired:
```bash
# Environment file - must have bot token AND provider key
ssh root@TARGET 'cat /root/wizards/WIZARD/home/.env'
# Config - check model, provider, platforms section
ssh root@TARGET 'cat /root/wizards/WIZARD/home/config.yaml'
```
Required .env vars for a Telegram-connected wizard:
- Provider API key (e.g., KIMI_API_KEY, ANTHROPIC_API_KEY)
- TELEGRAM_BOT_TOKEN
- TELEGRAM_HOME_CHANNEL
- TELEGRAM_HOME_CHANNEL_NAME
- TELEGRAM_ALLOWED_USERS

If Telegram vars are missing, the deploy script was incomplete. Add them and restart.
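
The required-variable list above can be checked without printing any secrets to the terminal. The sketch below is one possible approach, not the deploy script's own check: `missing_env_vars` is a hypothetical helper, and treating any `*_API_KEY` entry as the provider key is an assumption.

```bash
# Print the names of required variables that are absent or empty in a .env
# stream on stdin. Values are never echoed.
# Assumption: any NAME ending in _API_KEY counts as the provider key.
missing_env_vars() {
  local env_text var
  env_text="$(cat)"
  for var in TELEGRAM_BOT_TOKEN TELEGRAM_HOME_CHANNEL \
             TELEGRAM_HOME_CHANNEL_NAME TELEGRAM_ALLOWED_USERS; do
    # Require NAME=<at least one character>, with an optional "export " prefix.
    if ! printf '%s\n' "$env_text" | grep -Eq "^(export[[:space:]]+)?${var}=.+"; then
      echo "$var"
    fi
  done
  if ! printf '%s\n' "$env_text" | grep -Eq '^(export[[:space:]]+)?[A-Z0-9_]+_API_KEY=.+'; then
    echo "PROVIDER_API_KEY"
  fi
}

# Usage (needs a real target, so it is left commented out):
#   ssh root@TARGET 'cat /root/wizards/WIZARD/home/.env' | missing_env_vars
```

Empty output means every required name is present with a non-empty value. It does not prove the values are valid.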
## Phase 4: Fix Common Problems
### Resource starvation (< 200MB available)
```bash
# Kill Docker if not needed
ssh root@TARGET 'systemctl stop docker containerd; systemctl disable docker containerd'
# Kill any runaway processes spawned by the wizard
ssh root@TARGET 'pkill -f "khatru-relay|caddy|nginx|go build" 2>/dev/null'
```
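Before killing anything, it may help to preview which processes the pattern would hit. The following is a sketch of a dry run, assuming a hypothetical helper named `match_procs`; it filters a `ps -eo pid,args` listing locally and kills nothing.

```bash
# Filter a process listing on stdin (header row, then "PID ARGS" lines) down
# to the processes whose command line matches an extended regex.
match_procs() {
  local pattern="$1"
  # Skip the ps header, strip the PID column, then match on the command only.
  awk -v pat="$pattern" '
    NR > 1 {
      cmd = $0
      sub(/^[[:space:]]*[0-9]+[[:space:]]+/, "", cmd)
      if (cmd ~ pat) print
    }
  '
}

# Usage (needs a real target, so it is left commented out):
#   ssh root@TARGET 'ps -eo pid,args' | match_procs 'khatru-relay|caddy|nginx|go build'
```

Because the pattern stays on the local side, the remote shell's own command line cannot match it, which avoids a known hazard of matching on full command lines.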
### Stuck session (wizard busy with a long-running task)
```bash
# Restart the service - clears the stuck session
ssh root@TARGET 'systemctl restart hermes-WIZARD.service && sleep 2 && systemctl is-active hermes-WIZARD.service'
```
### Wrong provider or missing key
Edit the .env and config.yaml, then restart:
```bash
ssh root@TARGET 'systemctl restart hermes-WIZARD.service'
```
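A fixed `sleep 2` after a restart may be too short on a starved box. As a hedged alternative, the status check can be polled a few times. `wait_until_active` is a hypothetical helper, and the attempt count and delay are arbitrary choices.

```bash
# Poll a status command until it prints "active" or the attempts run out.
# Usage: wait_until_active ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
wait_until_active() {
  local attempts="$1" delay="$2" i=1 state=""
  shift 2
  while [ "$i" -le "$attempts" ]; do
    # systemctl is-active exits non-zero unless the unit is active,
    # so the exit status is ignored and only the printed state is used.
    state="$("$@" 2>/dev/null)" || true
    if [ "$state" = "active" ]; then
      return 0
    fi
    # Do not sleep after the final attempt.
    if [ "$i" -lt "$attempts" ]; then
      sleep "$delay"
    fi
    i=$((i + 1))
  done
  echo "still ${state:-unknown} after $attempts checks" >&2
  return 1
}

# Usage (needs a real target, so it is left commented out):
#   wait_until_active 10 2 ssh root@TARGET 'systemctl is-active hermes-WIZARD.service'
```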
## Phase 5: Verify
```bash
# Service running?
ssh root@TARGET 'systemctl is-active hermes-WIZARD.service'
# RAM healthy? (> 500MB available on 2GB box)
ssh root@TARGET 'free -h'
# Logs show Telegram connected?
ssh root@TARGET 'journalctl -u hermes-WIZARD.service --no-pager -n 10 | grep -i telegram'
```
Then ask Alexander to message the wizard on Telegram to confirm it responds.
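
The RAM thresholds used in this skill (under 200MB is starved, over 500MB is healthy on a 2GB box) can be applied to `free -m` output mechanically. This is a minimal sketch: `mem_status` and the in-between label `tight` are hypothetical, and it assumes a procps-ng `free` that prints an `available` column as the seventh field.

```bash
# Read `free -m` output on stdin and classify the "available" column.
# Thresholds follow this skill; the "tight" band between them is an assumption.
mem_status() {
  awk '
    /^Mem:/ {
      avail = $7                       # "available" column in procps-ng free
      if (avail < 200)      status = "starved"
      else if (avail > 500) status = "healthy"
      else                  status = "tight"
      printf "%s %dMB\n", status, avail
      found = 1
    }
    END { if (!found) { print "unknown"; exit 1 } }
  '
}

# Usage (needs a real target, so it is left commented out):
#   ssh root@TARGET 'free -m' | mem_status
```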
## Reaching Alexander's Mac
If you need to diagnose local Timmy directly, Alexander's Mac is on the Tailscale network:
- Tailscale IP: `100.124.176.28` (hostname: `mm`)
- User: `apayne`
- SSH key must be in `~/.ssh/authorized_keys` on the Mac
- System python is 3.9; use `~/.hermes/hermes-agent/venv/bin/python3` for 3.10+ syntax
- llama-server model path: `/Users/apayne/models/hermes4-14b/NousResearch_Hermes-4-14B-Q4_K_M.gguf`
- Timmy workspace: `~/.timmy/`, Hermes home: `~/.hermes/`
## Pitfalls
1. **SSH banner timeout = resource starvation.** Don't keep retrying SSH. The box needs CPU/RAM freed via the DO console first.
2. **2GB droplets cannot run Hermes + Docker + nginx + a relay.** Hermes alone needs ~500MB-1GB. Budget accordingly.
3. **Deploy scripts may be incomplete.** Bezalel's deploy script for Allegro injected only KIMI_API_KEY and omitted the Telegram vars. Always verify .env has everything.
4. **Wizards build infrastructure when left unsupervised.** If a wizard was given a broad task, check what processes it spawned. Kill anything not essential.
5. **Always check the Gitea PR for the deploy script.** It's the source of truth for what SHOULD be configured. Compare against what IS configured.
6. **Restart clears stuck sessions.** If a wizard is mid-session on a slow model (Kimi), restarting the service is the fastest way to unstick it.