---
name: wizard-house-remote-triage
description: Diagnose and fix a sibling wizard house that is unresponsive or misbehaving on a remote VPS. SSH in, check resources, inspect config, kill runaway processes, restart services.
version: 1.0.0
author: Ezra
license: MIT
metadata:
  hermes:
    tags: [wizard-house, devops, ssh, triage, remote, debugging]
    related_skills: [gitea-wizard-onboarding, telegram-bot-profile, systematic-debugging]
---

# Wizard House Remote Triage

## When to Use

- A sibling wizard (Allegro, Bezalel, etc.) is not responding on Telegram
- A wizard's VPS is suspected to be resource-starved or misconfigured
- You need to inspect or fix another wizard's deployment from your own VPS

## Prerequisites

- SSH access to the target VPS (key in authorized_keys)
- If no key exists: generate with `ssh-keygen -t ed25519 -f ~/.ssh/id_ed25519 -N "" -C "ezra@hermes-vps"` and have Alexander drop the pubkey on the target

## Phase 1: Can You Reach It?

```bash
# Ping first
ping -c 2 -W 2 TARGET_IP

# Try SSH
ssh -o ConnectTimeout=10 -o StrictHostKeyChecking=no root@TARGET_IP 'hostname; whoami'
```

If the SSH connection opens but the banner exchange times out, the box is resource-starved: sshd can't respond because CPU or RAM is exhausted. Alexander needs to use the DigitalOcean web console (Droplets > droplet > Console) to free resources first.

Quick commands for Alexander to run on the DO console:
```bash
# Check what's eating RAM
ps aux --sort=-%mem | head -10
free -h

# Common resource hogs to kill
systemctl stop docker containerd nginx
systemctl disable docker containerd
```

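The two reachability checks can be folded into one helper that tells "host down" apart from "host up but sshd starved". This is a minimal sketch, not existing wizard tooling: `classify_reach` is a hypothetical name, `BatchMode=yes` is added so the call never hangs on a password prompt, and the match on `banner exchange` assumes your OpenSSH client words its error that way. It also treats a failed ping as unreachable, which misreports hosts that drop ICMP.

```bash
# classify_reach TARGET_IP   (hypothetical helper)
# Prints one of: unreachable | ssh-banner-timeout | ssh-failed | ok
classify_reach() {
  local target="$1" err

  # Same ping flags as the manual check: 2 probes, 2 second wait each.
  if ! ping -c 2 -W 2 "$target" >/dev/null 2>&1; then
    echo "unreachable"
    return 1
  fi

  # Capture only ssh's stderr; the remote command's stdout is discarded.
  if err=$(ssh -o ConnectTimeout=10 -o BatchMode=yes \
               -o StrictHostKeyChecking=no "root@$target" 'hostname' 2>&1 >/dev/null); then
    echo "ok"
    return 0
  fi

  # ASSUMPTION: a starved sshd shows up as an error mentioning "banner exchange".
  case "$err" in
    *"banner exchange"*) echo "ssh-banner-timeout" ;;
    *)                   echo "ssh-failed" ;;
  esac
  return 1
}
```

On `ssh-banner-timeout`, stop retrying and go to the DO console. On `ssh-failed`, check keys and firewall before assuming starvation.
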
## Phase 2: Inspect the Box

Once SSH works:

```bash
# Resources
ssh root@TARGET 'free -h; echo "---"; ps aux --sort=-%mem | head -8'

# Wizard service status
ssh root@TARGET 'systemctl status hermes-WIZARD.service 2>&1 | head -20'

# Recent logs (look for errors, stuck sessions, long tool calls)
ssh root@TARGET 'journalctl -u hermes-WIZARD.service --no-pager -n 50'
```

## Phase 3: Inspect Config

Check that Telegram and the provider are wired correctly:

```bash
# Environment file - must have bot token AND provider key
ssh root@TARGET 'cat /root/wizards/WIZARD/home/.env'

# Config - check model, provider, platforms section
ssh root@TARGET 'cat /root/wizards/WIZARD/home/config.yaml'
```

Required .env vars for a Telegram-connected wizard:
- Provider API key (e.g., KIMI_API_KEY, ANTHROPIC_API_KEY)
- TELEGRAM_BOT_TOKEN
- TELEGRAM_HOME_CHANNEL
- TELEGRAM_HOME_CHANNEL_NAME
- TELEGRAM_ALLOWED_USERS

If the Telegram vars are missing, the deploy script was incomplete. Add them and restart.

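The required-vars check can be run without reading secrets off the screen. A minimal sketch: `missing_env_vars` is a hypothetical helper that reads a .env on stdin and prints only the names of variables that are absent or empty. It assumes provider keys follow the `*_API_KEY` naming of the two examples above and reports a missing one under the placeholder label `PROVIDER_API_KEY`.

```bash
# missing_env_vars   (hypothetical helper)
# Reads .env content on stdin; prints the NAME of each required variable
# that is missing or empty. Values are never printed.
missing_env_vars() {
  local content var
  content=$(cat)

  # ASSUMPTION: any variable ending in _API_KEY counts as the provider key.
  printf '%s\n' "$content" \
    | grep -Eq '^(export[[:space:]]+)?[A-Z0-9_]+_API_KEY=.+' \
    || echo "PROVIDER_API_KEY"

  for var in TELEGRAM_BOT_TOKEN TELEGRAM_HOME_CHANNEL \
             TELEGRAM_HOME_CHANNEL_NAME TELEGRAM_ALLOWED_USERS; do
    # Require "VAR=" followed by at least one character.
    printf '%s\n' "$content" \
      | grep -Eq "^(export[[:space:]]+)?${var}=.+" \
      || echo "$var"
  done
}
```

Usage: `ssh root@TARGET 'cat /root/wizards/WIZARD/home/.env' | missing_env_vars`. Empty output means every required name is present. A value of `""` still counts as set, so this catches missing lines, not bad tokens.
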
## Phase 4: Fix Common Problems

### Resource starvation (< 200MB available)
```bash
# Kill Docker if not needed
ssh root@TARGET 'systemctl stop docker containerd; systemctl disable docker containerd'

# Kill any runaway processes spawned by the wizard
ssh root@TARGET 'pkill -f "khatru-relay|caddy|nginx|go build" 2>/dev/null'
```

### Stuck session (wizard busy with a long-running task)
```bash
# Restart the service - clears the stuck session
ssh root@TARGET 'systemctl restart hermes-WIZARD.service && sleep 2 && systemctl is-active hermes-WIZARD.service'
```

### Wrong provider or missing key
Edit the .env and config.yaml, then restart:
```bash
ssh root@TARGET 'systemctl restart hermes-WIZARD.service'
```

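A fixed `sleep 2` can report `activating` on a slow or starved box. One alternative is to poll after the restart. This is a sketch under assumptions: `restart_and_wait` is a hypothetical helper, the one-second interval and default of 10 checks are arbitrary, and it opens a new SSH connection per check.

```bash
# restart_and_wait TARGET UNIT [TRIES]   (hypothetical helper)
# Restarts UNIT on TARGET, then polls `systemctl is-active` once per second.
# Prints the last observed state; returns 0 only if it is "active".
restart_and_wait() {
  local target="$1" unit="$2" tries="${3:-10}" state="" i=0

  ssh "root@$target" "systemctl restart $unit" || return 1

  while [ "$i" -lt "$tries" ]; do
    # is-active exits non-zero for any state other than active, so mask
    # the exit code and compare the printed state instead.
    state=$(ssh "root@$target" "systemctl is-active $unit") || true
    if [ "$state" = "active" ]; then
      echo "$state"
      return 0
    fi
    i=$((i + 1))
    sleep 1
  done

  echo "$state"
  return 1
}
```

Usage: `restart_and_wait TARGET hermes-WIZARD.service`. If it ends in `failed` or stays `activating`, go back to the Phase 2 journal output before restarting again.
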
## Phase 5: Verify

```bash
# Service running?
ssh root@TARGET 'systemctl is-active hermes-WIZARD.service'

# RAM healthy? (> 500MB available on 2GB box)
ssh root@TARGET 'free -h'

# Logs show Telegram connected?
ssh root@TARGET 'journalctl -u hermes-WIZARD.service --no-pager -n 10 | grep -i telegram'
```

Then ask Alexander to message the wizard on Telegram to confirm it responds.

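The RAM check can be turned into a verdict instead of eyeballing `free -h`. A minimal sketch: `mem_verdict` is a hypothetical helper that reads `/proc/meminfo` on stdin and applies the two thresholds this document uses (under 200MB is starvation, over 500MB is healthy on a 2GB box). It assumes the kernel exposes `MemAvailable`; adjust the numbers for larger droplets.

```bash
# mem_verdict   (hypothetical helper)
# Reads /proc/meminfo content on stdin; prints starved | low | healthy
# with the available megabytes, or "unknown" if MemAvailable is absent.
mem_verdict() {
  local avail_mb

  # MemAvailable is reported in kB; convert to whole MB.
  avail_mb=$(awk '/^MemAvailable:/ { print int($2 / 1024) }')
  if [ -z "$avail_mb" ]; then
    echo "unknown"
    return 2
  fi

  if [ "$avail_mb" -lt 200 ]; then
    echo "starved ($avail_mb MB available)"
    return 1
  elif [ "$avail_mb" -gt 500 ]; then
    echo "healthy ($avail_mb MB available)"
  else
    echo "low ($avail_mb MB available)"
  fi
}
```

Usage: `ssh root@TARGET 'cat /proc/meminfo' | mem_verdict`. A `low` result means the service may run but has little headroom; look for leftover processes before declaring the box fixed.
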
## Reaching Alexander's Mac

If you need to diagnose local Timmy directly, Alexander's Mac is on the Tailscale network:
- Tailscale IP: `100.124.176.28` (hostname: `mm`)
- User: `apayne`
- SSH key must be in `~/.ssh/authorized_keys` on the Mac
- System python is 3.9; use `~/.hermes/hermes-agent/venv/bin/python3` for 3.10+ syntax
- llama-server model path: `/Users/apayne/models/hermes4-14b/NousResearch_Hermes-4-14B-Q4_K_M.gguf`
- Timmy workspace: `~/.timmy/`, Hermes home: `~/.hermes/`

## Pitfalls

1. **SSH banner timeout = RAM starvation.** Don't keep retrying SSH. The box needs resources freed via DO console first.
2. **2GB droplets cannot run Hermes + Docker + nginx + a relay.** Hermes alone needs ~500MB-1GB. Budget accordingly.
3. **Deploy scripts may be incomplete.** Bezalel's deploy script for Allegro injected only KIMI_API_KEY, not the Telegram vars. Always verify that .env has everything.
4. **Wizards build infrastructure when left unsupervised.** If a wizard was given a broad task, check what processes it spawned. Kill anything not essential.
5. **Always check the Gitea PR for the deploy script.** It's the source of truth for what SHOULD be configured. Compare it against what IS configured.
6. **Restart clears stuck sessions.** If a wizard is mid-session on a slow model (Kimi), restarting the service is the fastest way to unstick it.