feat: add desktop automation primitives to Hermes (#1125)

Implements Phase 1 & 2 of the [COMPUTER_USE] epic:

- nexus/computer_use.py — four Hermes tools with safety guards and
  JSONL action logging:
    computer_screenshot(), computer_click(), computer_type(), computer_scroll()
  Poka-yoke: right/middle clicks require confirm=True; text containing
  password/token/key keywords is refused without confirm=True.
  pyautogui.FAILSAFE=True enabled globally (corner-abort).

- nexus/computer_use_demo.py — end-to-end Phase 1 demo: baseline
  screenshot → open browser → navigate to Gitea → evidence screenshot.

- tests/test_computer_use.py — 29 unit tests, fully headless (pyautogui
  mocked); all pass.

- docs/computer-use.md — full Phase 1–3 documentation including API
  reference, safety table, action-log format, and pilot recipes.

- docker-compose.desktop.yml — sandboxed Xvfb + noVNC container for
  safe headless desktop automation.

The existing mcp_servers/desktop_control_server.py is unchanged; it
remains available for external/MCP callers (e.g. the Bannerlord harness).

Fixes #1125
Author: Alexander Whitestone
Date: 2026-04-08 06:29:27 -04:00
Parent: a1c153c095
Commit: a3a28aa4c2
5 changed files with 1128 additions and 0 deletions

docs/computer-use.md (new file, 310 lines)
# Computer Use — Desktop Automation Primitives for Hermes
**Issue:** #1125
**Status:** Phase 1 complete, Phase 2 in progress
**Owner:** Bezalel
**Epic:** #1120
---
## Overview
This document describes how Hermes agents can control a desktop environment
(screenshot, click, type, scroll) for automation and testing. The capability
unlocks:
- Visual regression testing of fleet dashboards
- Automated Gitea workflow verification
- Screenshot-based incident diagnosis
- Driving GUI-only tools from agent code
---
## Architecture
```
┌──────────────────────────────────────────────────────┐
│ Hermes Agent │
│ │
│ computer_screenshot() computer_click(x, y) │
│ computer_type(text) computer_scroll(x, y, n) │
│ │ │
│ nexus/computer_use.py │
│ (safety guards · action log) │
└────────────────────────┬─────────────────────────────┘
┌──────────┴───────────┐
│ pyautogui │
│ (FAILSAFE enabled) │
└──────────┬───────────┘
┌──────────┴───────────┐
│ Desktop environment │
│ (Xvfb · noVNC · │
│ bare metal) │
└──────────────────────┘
```
The MCP server layer (`mcp_servers/desktop_control_server.py`) is still
available for external callers (e.g. the Bannerlord harness). The
`nexus/computer_use.py` module calls pyautogui directly so that safety
guards, logging, and screenshot evidence are applied consistently for every
Hermes agent invocation.
---
## Phase 1 — Environment & Primitives
### Sandboxed Desktop Setup
**Option A — Xvfb (lightweight, Linux/macOS)**
```bash
# Install
sudo apt-get install xvfb # Linux
brew install --cask xquartz # macOS (XQuartz bundles Xvfb)
# Start a virtual display on :99
Xvfb :99 -screen 0 1280x800x24 &
export DISPLAY=:99
# Run the demo
python nexus/computer_use_demo.py
```
**Option B — Docker with noVNC**
```bash
docker-compose -f docker-compose.desktop.yml up
# Open http://localhost:6080 to view the virtual desktop
```
See `docker-compose.desktop.yml` in the repo root.
### Running the Demo
The `nexus/computer_use_demo.py` script exercises the full Phase 1 loop:
```
[1/4] Capturing baseline screenshot
[2/4] Opening browser → https://forge.alexanderwhitestone.com
[3/4] Waiting 3s for page to load
[4/4] Capturing evidence screenshot
```
```bash
# Default target (Gitea forge)
python nexus/computer_use_demo.py
# Custom URL
GITEA_URL=http://localhost:3000 python nexus/computer_use_demo.py
```
---
## Phase 2 — Tool Integration
### API Reference
All four tools live in `nexus/computer_use.py` and follow the same contract:
```python
result = tool(...)
# result is always:
# {"ok": bool, "tool": str, ...fields..., "screenshot": path_or_None}
```
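Because every tool returns the same shape, callers can share a single error-handling wrapper. A minimal sketch (`run_tool` is illustrative, not part of the module):

```python
def run_tool(tool, *args, **kwargs):
    # Call any computer_* tool; raise if the uniform contract reports failure.
    result = tool(*args, **kwargs)
    if not result.get("ok"):
        raise RuntimeError(
            f"{result.get('tool', '?')} failed: {result.get('error', 'unknown error')}"
        )
    return result
```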
#### `computer_screenshot(output_path=None)`
Take a screenshot of the current desktop.
| Parameter | Type | Default | Description |
|---------------|-----------------|----------------------|-----------------------------------|
| `output_path` | `str` or `None` | auto timestamped PNG | Where to save the captured image. |
```python
from nexus.computer_use import computer_screenshot
result = computer_screenshot()
# {"ok": True, "tool": "computer_screenshot", "path": "~/.nexus/nexus_snap_1712345678.png"}
```
#### `computer_click(x, y, *, button="left", confirm=False)`
Click at screen coordinates.
| Parameter | Type | Default | Description |
|-----------|--------|----------|------------------------------------------|
| `x`, `y` | `int` | required | Screen pixel coordinates. |
| `button` | `str` | `"left"` | `"left"`, `"right"`, or `"middle"`. |
| `confirm` | `bool` | `False` | Required for `right`/`middle` clicks. |
```python
from nexus.computer_use import computer_click
# Simple left click (no confirm needed)
result = computer_click(960, 540)
# Right-click requires explicit confirmation
result = computer_click(960, 540, button="right", confirm=True)
```
**Screenshot evidence:** before/after snapshots are captured and logged.
#### `computer_type(text, *, confirm=False)`
Type a string using keyboard simulation.
| Parameter | Type | Default | Description |
|-----------|--------|----------|----------------------------------------------------|
| `text` | `str` | required | Text to type. |
| `confirm` | `bool` | `False` | Required when text contains `password`/`token`/`key`. |
```python
from nexus.computer_use import computer_type
# Safe text — no confirm needed
computer_type("https://forge.alexanderwhitestone.com")
# Sensitive text — confirm required
computer_type("hunter2", confirm=True)
```
#### `computer_scroll(x, y, amount)`
Scroll the mouse wheel at the given position.
| Parameter | Type | Description |
|-----------|-------|-------------------------------------------------|
| `x`, `y` | `int` | Move mouse here before scrolling. |
| `amount` | `int` | Positive = scroll up; negative = scroll down. |
```python
from nexus.computer_use import computer_scroll
computer_scroll(640, 400, -5) # scroll down 5 clicks
computer_scroll(640, 400, 3) # scroll up 3 clicks
```
### Safety (Poka-Yoke)
| Situation | Behavior |
|------------------------------------|-------------------------------------------------|
| `right`/`middle` click w/o confirm | Refused; returns `ok=False` with explanation |
| Text with `password`/`token`/`key` | Refused unless `confirm=True` |
| `FAILSAFE = True` | Move mouse to screen corner (0, 0) to abort |
| pyautogui unavailable | All tools return `ok=False` gracefully |
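The keyword guard for `computer_type` can be pictured as a few lines of logic. This is an illustrative sketch of the rule described above, not the module's actual implementation:

```python
SENSITIVE_KEYWORDS = ("password", "token", "key")

def check_type_guard(text: str, confirm: bool) -> dict:
    # Refuse sensitive-looking text unless the caller passes confirm=True.
    # Note: simple substring match, so e.g. "monkey" would also trip on "key".
    lowered = text.lower()
    hit = next((k for k in SENSITIVE_KEYWORDS if k in lowered), None)
    if hit and not confirm:
        return {"ok": False, "tool": "computer_type",
                "error": f"text contains sensitive keyword '{hit}'; pass confirm=True"}
    return {"ok": True, "tool": "computer_type"}
```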
### Action Log
Every call is appended to `~/.nexus/computer_use_log.jsonl` (one JSON record per line; wrapped below for readability):
```json
{"ok": true, "tool": "computer_click", "x": 960, "y": 540, "button": "left",
"before_screenshot": "/home/user/.nexus/before_click_1712345.png",
"screenshot": "/home/user/.nexus/after_click_1712345.png",
"ts": "2026-04-08T10:30:00+00:00"}
```
Read recent entries from Python:
```python
from nexus.computer_use import read_action_log
for record in read_action_log(last_n=10):
    print(record)
```
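Under the hood this amounts to tailing a JSONL file. A self-contained sketch of the same idea (a hypothetical reimplementation, not the module's code):

```python
import json
from pathlib import Path

def tail_jsonl(log_path: str, last_n: int = 10) -> list:
    # Parse the last N JSONL records; a missing file yields an empty list.
    path = Path(log_path).expanduser()
    if not path.exists():
        return []
    lines = [ln for ln in path.read_text().splitlines() if ln.strip()]
    return [json.loads(ln) for ln in lines[-last_n:]]
```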
---
## Phase 3 — Use-Case Pilots
### Pilot 1: Visual Regression Test (Fleet Dashboard)
Open the fleet health dashboard, take a screenshot, compare pixel-level
hashes against a golden baseline:
```python
from nexus.computer_use import (
    computer_screenshot,
    computer_click,
    computer_type,
)
import hashlib
import time

def screenshot_hash(path: str) -> str:
    return hashlib.md5(open(path, "rb").read()).hexdigest()

# Navigate to the dashboard
computer_click(960, 40)  # address bar
computer_type("http://localhost:7771/health\n")
time.sleep(2)

result = computer_screenshot()
current_hash = screenshot_hash(result["path"])

GOLDEN_HASH = "abc123..."  # established on first run
assert current_hash == GOLDEN_HASH, "Visual regression detected!"
```
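One way to manage the golden baseline is to record it automatically on the first run. A sketch under that assumption (`check_against_golden` and `golden_file` are illustrative names, not part of the module):

```python
import hashlib
from pathlib import Path

def check_against_golden(screenshot_path: str, golden_file: str = "golden.md5") -> bool:
    # First run records the baseline hash; later runs compare against it.
    current = hashlib.md5(Path(screenshot_path).read_bytes()).hexdigest()
    golden = Path(golden_file)
    if not golden.exists():
        golden.write_text(current)  # establish baseline on first run
        return True
    return golden.read_text().strip() == current
```

Note that an exact pixel hash is brittle: any anti-aliasing or timestamp widget on the page will change it, so re-baseline deliberately after intentional UI changes.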
### Pilot 2: Screenshot-Based CI Diagnosis
When a CI workflow fails, agents can screenshot the Gitea workflow page and
use the image to triage:
```python
from nexus.computer_use import (
    computer_screenshot,
    computer_click,
    computer_type,
)
import time

def diagnose_failed_workflow(run_url: str) -> str:
    """
    Navigate to *run_url*, screenshot it, and return the screenshot path
    for downstream LLM-based analysis.
    """
    computer_click(960, 40)  # address bar
    computer_type(run_url + "\n")
    time.sleep(3)
    result = computer_screenshot()
    return result["path"]  # hand off to vision model or OCR
```
---
## MCP Server (External Callers)
The lower-level MCP server (`mcp_servers/desktop_control_server.py`) exposes
the same capabilities over JSON-RPC stdio for callers outside Python:
```bash
# List available tools
echo '{"jsonrpc":"2.0","id":1,"method":"tools/list","params":{}}' \
| python mcp_servers/desktop_control_server.py
# Take a screenshot
echo '{"jsonrpc":"2.0","id":2,"method":"tools/call",
"params":{"name":"take_screenshot","arguments":{"path":"/tmp/snap.png"}}}' \
| python mcp_servers/desktop_control_server.py
```
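For callers scripting the server from Python rather than the shell, each request is just one line of JSON-RPC 2.0 written to the server's stdin. A sketch (the method and tool names match the shell examples above):

```python
import json

def mcp_request(req_id: int, method: str, params: dict) -> str:
    # Build one JSON-RPC 2.0 message, suitable for piping to the server's stdin.
    return json.dumps({"jsonrpc": "2.0", "id": req_id,
                       "method": method, "params": params})
```

For example, `mcp_request(2, "tools/call", {"name": "take_screenshot", "arguments": {"path": "/tmp/snap.png"}})` produces the second request shown above as a single line.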
---
## Docker / Sandboxed Environment
`docker-compose.desktop.yml` provides a safe container with:
- Xvfb virtual display (1280×800)
- noVNC for browser-based viewing
- Python + pyautogui pre-installed
```bash
docker-compose -f docker-compose.desktop.yml up
# noVNC → http://localhost:6080
# Run demo inside container:
docker exec -it nexus-desktop python nexus/computer_use_demo.py
```
---
## Development Notes
- `NEXUS_HOME` env var overrides the log/snapshot directory (default `~/.nexus`)
- `GITEA_URL` env var overrides the target in the demo script
- `BROWSER_OPEN_WAIT` controls how long the demo waits after opening the browser
- Tests in `tests/test_computer_use.py` run headless — pyautogui is fully mocked
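
The `NEXUS_HOME` override presumably resolves along these lines (an assumption about the implementation, shown here only to clarify the precedence):

```python
import os
from pathlib import Path

def nexus_home() -> Path:
    # Log/snapshot directory: NEXUS_HOME if set, otherwise ~/.nexus.
    return Path(os.environ.get("NEXUS_HOME", "~/.nexus")).expanduser()
```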