feat: add desktop automation primitives to Hermes (#1125)

Implements Phase 1 & 2 of the [COMPUTER_USE] epic:

- nexus/computer_use.py — four Hermes tools with safety guards and
  JSONL action logging:
    computer_screenshot(), computer_click(), computer_type(), computer_scroll()
  Poka-yoke: right/middle clicks require confirm=True; text containing
  password/token/key keywords is refused without confirm=True.
  pyautogui.FAILSAFE=True enabled globally (corner-abort).

- nexus/computer_use_demo.py — end-to-end Phase 1 demo: baseline
  screenshot → open browser → navigate to Gitea → evidence screenshot.

- tests/test_computer_use.py — 29 unit tests, fully headless (pyautogui
  mocked); all pass.

- docs/computer-use.md — full Phase 1–3 documentation including API
  reference, safety table, action-log format, and pilot recipes.

- docker-compose.desktop.yml — sandboxed Xvfb + noVNC container for
  safe headless desktop automation.

The existing mcp_servers/desktop_control_server.py is unchanged; it
remains available for external/MCP callers (e.g. the Bannerlord harness).

Fixes #1125
Author: Alexander Whitestone
Date: 2026-04-08 06:29:27 -04:00
Parent: a1c153c095
Commit: a3a28aa4c2
5 changed files with 1128 additions and 0 deletions

docs/computer-use.md (new file, 310 lines)
# Computer Use — Desktop Automation Primitives for Hermes
**Issue:** #1125
**Status:** Phase 1 complete, Phase 2 in progress
**Owner:** Bezalel
**Epic:** #1120
---
## Overview
This document describes how Hermes agents can control a desktop environment
(screenshot, click, type, scroll) for automation and testing. The capability
unlocks:
- Visual regression testing of fleet dashboards
- Automated Gitea workflow verification
- Screenshot-based incident diagnosis
- Driving GUI-only tools from agent code
---
## Architecture
```
┌──────────────────────────────────────────────────────┐
│ Hermes Agent │
│ │
│ computer_screenshot() computer_click(x, y) │
│ computer_type(text) computer_scroll(x, y, n) │
│ │ │
│ nexus/computer_use.py │
│ (safety guards · action log) │
└────────────────────────┬─────────────────────────────┘
┌──────────┴───────────┐
│ pyautogui │
│ (FAILSAFE enabled) │
└──────────┬───────────┘
┌──────────┴───────────┐
│ Desktop environment │
│ (Xvfb · noVNC · │
│ bare metal) │
└──────────────────────┘
```
The MCP server layer (`mcp_servers/desktop_control_server.py`) is still
available for external callers (e.g. the Bannerlord harness). The
`nexus/computer_use.py` module calls pyautogui directly so that safety
guards, logging, and screenshot evidence are applied consistently for every
Hermes agent invocation.
---
## Phase 1 — Environment & Primitives
### Sandboxed Desktop Setup
**Option A — Xvfb (lightweight, Linux/macOS)**
```bash
# Install
sudo apt-get install xvfb # Linux
brew install --cask xquartz # macOS (XQuartz bundles Xvfb)
# Start a virtual display on :99
Xvfb :99 -screen 0 1280x800x24 &
export DISPLAY=:99
# Run the demo
python nexus/computer_use_demo.py
```
**Option B — Docker with noVNC**
```bash
docker-compose -f docker-compose.desktop.yml up
# Open http://localhost:6080 to view the virtual desktop
```
See `docker-compose.desktop.yml` in the repo root.
### Running the Demo
The `nexus/computer_use_demo.py` script exercises the full Phase 1 loop:
```
[1/4] Capturing baseline screenshot
[2/4] Opening browser → https://forge.alexanderwhitestone.com
[3/4] Waiting 3s for page to load
[4/4] Capturing evidence screenshot
```
```bash
# Default target (Gitea forge)
python nexus/computer_use_demo.py
# Custom URL
GITEA_URL=http://localhost:3000 python nexus/computer_use_demo.py
```
---
## Phase 2 — Tool Integration
### API Reference
All four tools live in `nexus/computer_use.py` and follow the same contract:
```python
result = tool(...)
# result is always:
# {"ok": bool, "tool": str, ...fields..., "screenshot": path_or_None}
```
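Because every tool returns the same shape, callers can share a single error-handling wrapper. A minimal sketch (`run_tool` is illustrative, not part of the module):

```python
def run_tool(tool, *args, **kwargs):
    # Call any computer_* tool; raise if the uniform contract reports failure.
    result = tool(*args, **kwargs)
    if not result.get("ok"):
        raise RuntimeError(
            f"{result.get('tool', '?')} failed: {result.get('error', 'unknown error')}"
        )
    return result
```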
#### `computer_screenshot(output_path=None)`
Take a screenshot of the current desktop.
| Parameter | Type | Default | Description |
|---------------|-----------------|----------------------|-----------------------------------|
| `output_path` | `str` or `None` | auto timestamped PNG | Where to save the captured image. |
```python
from nexus.computer_use import computer_screenshot
result = computer_screenshot()
# {"ok": True, "tool": "computer_screenshot", "path": "~/.nexus/nexus_snap_1712345678.png"}
```
#### `computer_click(x, y, *, button="left", confirm=False)`
Click at screen coordinates.
| Parameter | Type | Default | Description |
|-----------|--------|----------|------------------------------------------|
| `x`, `y` | `int` | required | Screen pixel coordinates. |
| `button` | `str` | `"left"` | `"left"`, `"right"`, or `"middle"`. |
| `confirm` | `bool` | `False` | Required for `right`/`middle` clicks. |
```python
from nexus.computer_use import computer_click
# Simple left click (no confirm needed)
result = computer_click(960, 540)
# Right-click requires explicit confirmation
result = computer_click(960, 540, button="right", confirm=True)
```
**Screenshot evidence:** before/after snapshots are captured and logged.
#### `computer_type(text, *, confirm=False)`
Type a string using keyboard simulation.
| Parameter | Type | Default | Description |
|-----------|--------|----------|----------------------------------------------------|
| `text` | `str` | required | Text to type. |
| `confirm` | `bool` | `False` | Required when text contains `password`/`token`/`key`. |
```python
from nexus.computer_use import computer_type
# Safe text — no confirm needed
computer_type("https://forge.alexanderwhitestone.com")
# Sensitive text — confirm required
computer_type("hunter2", confirm=True)
```
#### `computer_scroll(x, y, amount)`
Scroll the mouse wheel at the given position.
| Parameter | Type | Description |
|-----------|-------|-------------------------------------------------|
| `x`, `y` | `int` | Move mouse here before scrolling. |
| `amount` | `int` | Positive = scroll up; negative = scroll down. |
```python
from nexus.computer_use import computer_scroll
computer_scroll(640, 400, -5) # scroll down 5 clicks
computer_scroll(640, 400, 3) # scroll up 3 clicks
```
### Safety (Poka-Yoke)
| Situation | Behavior |
|------------------------------------|-------------------------------------------------|
| `right`/`middle` click w/o confirm | Refused; returns `ok=False` with explanation |
| Text with `password`/`token`/`key` | Refused unless `confirm=True` |
| `FAILSAFE = True` | Move mouse to screen corner (0, 0) to abort |
| pyautogui unavailable | All tools return `ok=False` gracefully |
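The keyword guard for `computer_type` can be pictured as a few lines of logic. This is an illustrative sketch of the rule described above, not the module's actual implementation:

```python
SENSITIVE_KEYWORDS = ("password", "token", "key")

def check_type_guard(text: str, confirm: bool) -> dict:
    # Refuse sensitive-looking text unless the caller passes confirm=True.
    # Note: simple substring match, so e.g. "monkey" would also trip on "key".
    lowered = text.lower()
    hit = next((k for k in SENSITIVE_KEYWORDS if k in lowered), None)
    if hit and not confirm:
        return {"ok": False, "tool": "computer_type",
                "error": f"text contains sensitive keyword '{hit}'; pass confirm=True"}
    return {"ok": True, "tool": "computer_type"}
```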
### Action Log
Every call is appended to `~/.nexus/computer_use_log.jsonl` (one JSON record per line; wrapped below for readability):
```json
{"ok": true, "tool": "computer_click", "x": 960, "y": 540, "button": "left",
"before_screenshot": "/home/user/.nexus/before_click_1712345.png",
"screenshot": "/home/user/.nexus/after_click_1712345.png",
"ts": "2026-04-08T10:30:00+00:00"}
```
Read recent entries from Python:
```python
from nexus.computer_use import read_action_log
for record in read_action_log(last_n=10):
    print(record)
```
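Under the hood this amounts to tailing a JSONL file. A self-contained sketch of the same idea (a hypothetical reimplementation, not the module's code):

```python
import json
from pathlib import Path

def tail_jsonl(log_path: str, last_n: int = 10) -> list:
    # Parse the last N JSONL records; a missing file yields an empty list.
    path = Path(log_path).expanduser()
    if not path.exists():
        return []
    lines = [ln for ln in path.read_text().splitlines() if ln.strip()]
    return [json.loads(ln) for ln in lines[-last_n:]]
```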
---
## Phase 3 — Use-Case Pilots
### Pilot 1: Visual Regression Test (Fleet Dashboard)
Open the fleet health dashboard, take a screenshot, compare pixel-level
hashes against a golden baseline:
```python
from nexus.computer_use import (
    computer_screenshot,
    computer_click,
    computer_type,
)
import hashlib
import time

def screenshot_hash(path: str) -> str:
    return hashlib.md5(open(path, "rb").read()).hexdigest()

# Navigate to the dashboard
computer_click(960, 40)  # address bar
computer_type("http://localhost:7771/health\n")
time.sleep(2)

result = computer_screenshot()
current_hash = screenshot_hash(result["path"])

GOLDEN_HASH = "abc123..."  # established on first run
assert current_hash == GOLDEN_HASH, "Visual regression detected!"
```
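One way to manage the golden baseline is to record it automatically on the first run. A sketch under that assumption (`check_against_golden` and `golden_file` are illustrative names, not part of the module):

```python
import hashlib
from pathlib import Path

def check_against_golden(screenshot_path: str, golden_file: str = "golden.md5") -> bool:
    # First run records the baseline hash; later runs compare against it.
    current = hashlib.md5(Path(screenshot_path).read_bytes()).hexdigest()
    golden = Path(golden_file)
    if not golden.exists():
        golden.write_text(current)  # establish baseline on first run
        return True
    return golden.read_text().strip() == current
```

Note that an exact pixel hash is brittle: any anti-aliasing or timestamp widget on the page will change it, so re-baseline deliberately after intentional UI changes.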
### Pilot 2: Screenshot-Based CI Diagnosis
When a CI workflow fails, agents can screenshot the Gitea workflow page and
use the image to triage:
```python
from nexus.computer_use import (
    computer_screenshot,
    computer_click,
    computer_type,
)
import time

def diagnose_failed_workflow(run_url: str) -> str:
    """
    Navigate to *run_url*, screenshot it, and return the screenshot path
    for downstream LLM-based analysis.
    """
    computer_click(960, 40)  # address bar
    computer_type(run_url + "\n")
    time.sleep(3)
    result = computer_screenshot()
    return result["path"]  # hand off to vision model or OCR
```
---
## MCP Server (External Callers)
The lower-level MCP server (`mcp_servers/desktop_control_server.py`) exposes
the same capabilities over JSON-RPC stdio for callers outside Python:
```bash
# List available tools
echo '{"jsonrpc":"2.0","id":1,"method":"tools/list","params":{}}' \
| python mcp_servers/desktop_control_server.py
# Take a screenshot
echo '{"jsonrpc":"2.0","id":2,"method":"tools/call",
"params":{"name":"take_screenshot","arguments":{"path":"/tmp/snap.png"}}}' \
| python mcp_servers/desktop_control_server.py
```
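For callers scripting the server from Python rather than the shell, each request is just one line of JSON-RPC 2.0 written to the server's stdin. A sketch (the method and tool names match the shell examples above):

```python
import json

def mcp_request(req_id: int, method: str, params: dict) -> str:
    # Build one JSON-RPC 2.0 message, suitable for piping to the server's stdin.
    return json.dumps({"jsonrpc": "2.0", "id": req_id,
                       "method": method, "params": params})
```

For example, `mcp_request(2, "tools/call", {"name": "take_screenshot", "arguments": {"path": "/tmp/snap.png"}})` produces the second request shown above as a single line.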
---
## Docker / Sandboxed Environment
`docker-compose.desktop.yml` provides a safe container with:
- Xvfb virtual display (1280×800)
- noVNC for browser-based viewing
- Python + pyautogui pre-installed
```bash
docker-compose -f docker-compose.desktop.yml up
# noVNC → http://localhost:6080
# Run demo inside container:
docker exec -it nexus-desktop python nexus/computer_use_demo.py
```
---
## Development Notes
- `NEXUS_HOME` env var overrides the log/snapshot directory (default `~/.nexus`)
- `GITEA_URL` env var overrides the target in the demo script
- `BROWSER_OPEN_WAIT` controls how long the demo waits after opening the browser
- Tests in `tests/test_computer_use.py` run headless — pyautogui is fully mocked
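
The `NEXUS_HOME` override presumably resolves along these lines (an assumption about the implementation, shown here only to clarify the precedence):

```python
import os
from pathlib import Path

def nexus_home() -> Path:
    # Log/snapshot directory: NEXUS_HOME if set, otherwise ~/.nexus.
    return Path(os.environ.get("NEXUS_HOME", "~/.nexus")).expanduser()
```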