feat: add desktop automation primitives to Hermes (#1125)

Implements Phase 1 & 2 of the [COMPUTER_USE] epic:

- nexus/computer_use.py — four Hermes tools with safety guards and
  JSONL action logging:
    computer_screenshot(), computer_click(), computer_type(), computer_scroll()
  Poka-yoke: right/middle clicks require confirm=True; text containing
  password/token/key keywords is refused without confirm=True.
  pyautogui.FAILSAFE=True enabled globally (corner-abort).

- nexus/computer_use_demo.py — end-to-end Phase 1 demo: baseline
  screenshot → open browser → navigate to Gitea → evidence screenshot.

- tests/test_computer_use.py — 29 unit tests, fully headless (pyautogui
  mocked); all pass.

- docs/computer-use.md — full Phase 1–3 documentation including API
  reference, safety table, action-log format, and pilot recipes.

- docker-compose.desktop.yml — sandboxed Xvfb + noVNC container for
  safe headless desktop automation.

The existing mcp_servers/desktop_control_server.py is unchanged; it
remains available for external/MCP callers (Bannerlord harness etc).

Fixes #1125
310
docs/computer-use.md
Normal file
@@ -0,0 +1,310 @@
# Computer Use — Desktop Automation Primitives for Hermes

**Issue:** #1125
**Status:** Phase 1 complete, Phase 2 in progress
**Owner:** Bezalel
**Epic:** #1120

---

## Overview

This document describes how Hermes agents can control a desktop environment
(screenshot, click, type, scroll) for automation and testing. The capability
unlocks:

- Visual regression testing of fleet dashboards
- Automated Gitea workflow verification
- Screenshot-based incident diagnosis
- Driving GUI-only tools from agent code

---

## Architecture

```
┌──────────────────────────────────────────────────────┐
│                     Hermes Agent                     │
│                                                      │
│  computer_screenshot()  computer_click(x, y)         │
│  computer_type(text)    computer_scroll(x, y, n)     │
│                        │                             │
│              nexus/computer_use.py                   │
│          (safety guards · action log)                │
└────────────────────────┬─────────────────────────────┘
                         │
              ┌──────────┴───────────┐
              │      pyautogui       │
              │  (FAILSAFE enabled)  │
              └──────────┬───────────┘
                         │
              ┌──────────┴───────────┐
              │  Desktop environment │
              │  (Xvfb · noVNC ·     │
              │   bare metal)        │
              └──────────────────────┘
```

The MCP server layer (`mcp_servers/desktop_control_server.py`) is still
available for external callers (e.g. the Bannerlord harness). The
`nexus/computer_use.py` module calls pyautogui directly so that safety
guards, logging, and screenshot evidence are applied consistently for every
Hermes agent invocation.
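
The layering above can be illustrated with a small sketch. This is not the code in `nexus/computer_use.py`; the `guarded_call` helper, its parameters, and the `error` field are hypothetical, and a plain callable stands in for pyautogui so the snippet runs without a display.

```python
from typing import Any, Callable, Dict, List, Tuple


def guarded_call(
    tool: str,
    action: Callable[[], None],
    *,
    allowed: bool,
    log: List[Dict[str, Any]],
) -> Dict[str, Any]:
    """Run `action` only if `allowed`, and record the outcome either way."""
    if not allowed:
        # "error" is an assumed field name, used here for illustration only.
        result = {"ok": False, "tool": tool, "error": "refused by safety guard"}
    else:
        try:
            action()  # in a real module this would be the pyautogui call
            result = {"ok": True, "tool": tool}
        except Exception as exc:  # report failures in the result instead of raising
            result = {"ok": False, "tool": tool, "error": str(exc)}
    log.append(result)
    return result


# Usage with a stand-in backend instead of pyautogui:
clicks: List[Tuple[int, int]] = []
log: List[Dict[str, Any]] = []
guarded_call("computer_click", lambda: clicks.append((960, 540)), allowed=True, log=log)
guarded_call("computer_click", lambda: clicks.append((10, 10)), allowed=False, log=log)
```

Routing every tool through one wrapper of this kind is one way to make sure a refused call is logged just like a successful one.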

---

## Phase 1 — Environment & Primitives

### Sandboxed Desktop Setup

**Option A — Xvfb (lightweight, Linux/macOS)**

```bash
# Install
sudo apt-get install xvfb         # Linux
brew install --cask xquartz       # macOS (XQuartz provides Xvfb)

# Start a virtual display on :99
Xvfb :99 -screen 0 1280x800x24 &
export DISPLAY=:99

# Run the demo
python nexus/computer_use_demo.py
```
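
Before running the demo against Xvfb, it can help to confirm that the environment matches the steps above. The helper below is a hypothetical preflight check, not part of the repository; it only inspects `DISPLAY` and looks for an `Xvfb` binary on `PATH`.

```python
import os
import shutil
from typing import List, Mapping, Optional


def display_problems(env: Optional[Mapping[str, str]] = None) -> List[str]:
    """Return human-readable problems with the X11 setup (empty list if none)."""
    env = os.environ if env is None else env
    problems: List[str] = []
    if not env.get("DISPLAY"):
        problems.append("DISPLAY is not set; start Xvfb and run `export DISPLAY=:99`")
    # shutil.which falls back to the process PATH when `path` is None.
    if shutil.which("Xvfb", path=env.get("PATH")) is None:
        problems.append("Xvfb binary was not found")
    return problems


if __name__ == "__main__":
    for problem in display_problems():
        print("preflight:", problem)
```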

**Option B — Docker with noVNC**

```bash
docker-compose -f docker-compose.desktop.yml up
# Open http://localhost:6080 to view the virtual desktop
```

See `docker-compose.desktop.yml` in the repo root.

### Running the Demo

The `nexus/computer_use_demo.py` script exercises the full Phase 1 loop:

```
[1/4] Capturing baseline screenshot
[2/4] Opening browser → https://forge.alexanderwhitestone.com
[3/4] Waiting 3s for page to load
[4/4] Capturing evidence screenshot
```

```bash
# Default target (Gitea forge)
python nexus/computer_use_demo.py

# Custom URL
GITEA_URL=http://localhost:3000 python nexus/computer_use_demo.py
```

---

## Phase 2 — Tool Integration

### API Reference

All four tools live in `nexus/computer_use.py` and follow the same contract:

```python
result = tool(...)
# result is always:
# {"ok": bool, "tool": str, ...fields..., "screenshot": path_or_None}
```
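
Because every tool reports failure through `ok` rather than by raising, callers may want a small helper that turns a failed result into an exception. The sketch below is an illustration only: `ToolError` and `require_ok` are hypothetical names, and the `error` field is an assumption about how a refusal might be explained.

```python
from typing import Any, Dict


class ToolError(RuntimeError):
    """Raised when a tool result reports ok=False."""


def require_ok(result: Dict[str, Any]) -> Dict[str, Any]:
    """Return `result` unchanged if it succeeded, otherwise raise ToolError."""
    if not result.get("ok"):
        tool = result.get("tool", "<unknown tool>")
        # "error" is an assumed field name; adjust to whatever the module returns.
        reason = result.get("error", "no reason given")
        raise ToolError(f"{tool} failed: {reason}")
    return result


# Usage with hand-written result dicts shaped like the contract above:
good = require_ok({"ok": True, "tool": "computer_click", "screenshot": None})
try:
    require_ok({"ok": False, "tool": "computer_type", "screenshot": None})
    raised = False
except ToolError:
    raised = True
```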

#### `computer_screenshot(output_path=None)`

Take a screenshot of the current desktop.

| Parameter     | Type            | Default              | Description                       |
|---------------|-----------------|----------------------|-----------------------------------|
| `output_path` | `str` or `None` | auto timestamped PNG | Where to save the captured image. |

```python
from nexus.computer_use import computer_screenshot

result = computer_screenshot()
# {"ok": True, "tool": "computer_screenshot", "path": "~/.nexus/nexus_snap_1712345678.png"}
```

#### `computer_click(x, y, *, button="left", confirm=False)`

Click at screen coordinates.

| Parameter | Type   | Default  | Description                            |
|-----------|--------|----------|----------------------------------------|
| `x`, `y`  | `int`  | required | Screen pixel coordinates.              |
| `button`  | `str`  | `"left"` | `"left"`, `"right"`, or `"middle"`.    |
| `confirm` | `bool` | `False`  | Required for `right`/`middle` clicks.  |

```python
from nexus.computer_use import computer_click

# Simple left click (no confirm needed)
result = computer_click(960, 540)

# Right-click requires explicit confirmation
result = computer_click(960, 540, button="right", confirm=True)
```

**Screenshot evidence:** before/after snapshots are captured and logged.

#### `computer_type(text, *, confirm=False)`

Type a string using keyboard simulation.

| Parameter | Type   | Default  | Description                                           |
|-----------|--------|----------|-------------------------------------------------------|
| `text`    | `str`  | required | Text to type.                                         |
| `confirm` | `bool` | `False`  | Required when text contains `password`/`token`/`key`. |

```python
from nexus.computer_use import computer_type

# Safe text — no confirm needed
computer_type("https://forge.alexanderwhitestone.com")

# Sensitive text — confirm required
computer_type("hunter2", confirm=True)
```

#### `computer_scroll(x, y, amount)`

Scroll the mouse wheel at the given position.

| Parameter | Type  | Description                                   |
|-----------|-------|-----------------------------------------------|
| `x`, `y`  | `int` | Move mouse here before scrolling.             |
| `amount`  | `int` | Positive = scroll up; negative = scroll down. |

```python
from nexus.computer_use import computer_scroll

computer_scroll(640, 400, -5)   # scroll down 5 clicks
computer_scroll(640, 400, 3)    # scroll up 3 clicks
```

### Safety (Poka-Yoke)

| Situation                          | Behavior                                     |
|------------------------------------|----------------------------------------------|
| `right`/`middle` click w/o confirm | Refused; returns `ok=False` with explanation |
| Text with `password`/`token`/`key` | Refused unless `confirm=True`                |
| `FAILSAFE = True`                  | Move mouse to screen corner (0, 0) to abort  |
| pyautogui unavailable              | All tools return `ok=False` gracefully       |
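
The first two rows of the table can be sketched as pure functions that return a refusal reason or `None`. This is a guess at the shape of the checks, not the module's actual code: the function names, the case-insensitive substring match, and the messages are all assumptions.

```python
import re
from typing import Optional

# Keywords taken from the safety table; the matching strategy is an assumption.
SENSITIVE_KEYWORDS = ("password", "token", "key")
_SENSITIVE_RE = re.compile("|".join(SENSITIVE_KEYWORDS), re.IGNORECASE)


def click_refusal(button: str, confirm: bool) -> Optional[str]:
    """Return why a click would be refused, or None if it may proceed."""
    if button not in ("left", "right", "middle"):
        return f"unknown button {button!r}"
    if button != "left" and not confirm:
        return f"{button} click requires confirm=True"
    return None


def type_refusal(text: str, confirm: bool) -> Optional[str]:
    """Return why typing `text` would be refused, or None if it may proceed."""
    if _SENSITIVE_RE.search(text) and not confirm:
        return "text mentions password/token/key; pass confirm=True to type it"
    return None
```

A keyword match of this kind catches text that names a secret, such as `"API token: ..."`, but not an arbitrary secret value, and plain substring matching also flags harmless words such as `"monkey"`. Whether the real guard behaves the same way is not stated in this document.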

### Action Log

Every call is appended to `~/.nexus/computer_use_log.jsonl` (one JSON record per line):

```json
{"ok": true, "tool": "computer_click", "x": 960, "y": 540, "button": "left",
 "before_screenshot": "/home/user/.nexus/before_click_1712345.png",
 "screenshot": "/home/user/.nexus/after_click_1712345.png",
 "ts": "2026-04-08T10:30:00+00:00"}
```

Read recent entries from Python:

```python
from nexus.computer_use import read_action_log

for record in read_action_log(last_n=10):
    print(record)
```
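
For readers unfamiliar with JSONL, the append-and-tail pattern looks roughly like the sketch below. It is a stand-alone illustration that writes to a temporary directory; `append_record` and `tail_records` are hypothetical helpers and do not describe how `read_action_log` is implemented.

```python
import json
import tempfile
from collections import deque
from pathlib import Path
from typing import Any, Dict, List


def append_record(log_path: Path, record: Dict[str, Any]) -> None:
    """Append one record as a single JSON line."""
    log_path.parent.mkdir(parents=True, exist_ok=True)
    with log_path.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(record) + "\n")


def tail_records(log_path: Path, last_n: int = 10) -> List[Dict[str, Any]]:
    """Return the last `last_n` records, oldest first."""
    if not log_path.exists():
        return []
    with log_path.open(encoding="utf-8") as fh:
        # A bounded deque keeps only the newest lines while streaming the file.
        lines = deque((line for line in fh if line.strip()), maxlen=last_n)
    return [json.loads(line) for line in lines]


# Demo against a throwaway file rather than ~/.nexus:
demo_log = Path(tempfile.mkdtemp()) / "computer_use_log.jsonl"
for i in range(3):
    append_record(demo_log, {"ok": True, "tool": "computer_scroll", "amount": i})
recent = tail_records(demo_log, last_n=2)
```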

---

## Phase 3 — Use-Case Pilots

### Pilot 1: Visual Regression Test (Fleet Dashboard)

Open the fleet health dashboard, take a screenshot, and compare a hash of
the image file against a golden baseline:

```python
import hashlib
import time

from nexus.computer_use import computer_click, computer_screenshot, computer_type

def screenshot_hash(path: str) -> str:
    """Return the MD5 hex digest of the screenshot file's bytes."""
    with open(path, "rb") as fh:
        return hashlib.md5(fh.read()).hexdigest()

# Navigate to the dashboard
computer_click(960, 40)   # address bar
computer_type("http://localhost:7771/health\n")

time.sleep(2)  # give the page time to render

result = computer_screenshot()
current_hash = screenshot_hash(result["path"])

GOLDEN_HASH = "abc123..."   # established on first run
assert current_hash == GOLDEN_HASH, "Visual regression detected!"
```
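
The "established on first run" step could be handled by a record-or-compare helper like the one sketched here. The helper names and the idea of storing the baseline digest in a text file are assumptions for illustration; the demo uses dummy bytes in place of real screenshots.

```python
import hashlib
import tempfile
from pathlib import Path


def file_sha256(path: Path) -> str:
    """Return the SHA-256 hex digest of a file's bytes."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def matches_baseline(screenshot: Path, baseline_file: Path) -> bool:
    """Record the digest on the first run; compare against it afterwards."""
    current = file_sha256(screenshot)
    if not baseline_file.exists():
        baseline_file.write_text(current, encoding="utf-8")
        return True  # nothing to compare against yet
    return baseline_file.read_text(encoding="utf-8").strip() == current


# Demo with dummy bytes standing in for PNG screenshots:
workdir = Path(tempfile.mkdtemp())
shot = workdir / "dashboard.png"
baseline = workdir / "dashboard.sha256"

shot.write_bytes(b"pixels-v1")
first_run = matches_baseline(shot, baseline)   # records the baseline
unchanged = matches_baseline(shot, baseline)   # same bytes, so it matches
shot.write_bytes(b"pixels-v2")
changed = matches_baseline(shot, baseline)     # different bytes, so it differs
```

An exact file hash flags any byte-level difference, including a clock or cursor change, so a pilot like this is best pointed at a page whose rendering is stable.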

### Pilot 2: Screenshot-Based CI Diagnosis

When a CI workflow fails, agents can screenshot the Gitea workflow page and
use the image to triage:

```python
import time

from nexus.computer_use import computer_click, computer_screenshot, computer_type

def diagnose_failed_workflow(run_url: str) -> str:
    """
    Navigate to *run_url*, screenshot it, return the screenshot path
    for downstream LLM-based analysis.
    """
    computer_click(960, 40)   # address bar
    computer_type(run_url + "\n")

    time.sleep(3)  # wait for the page to load

    result = computer_screenshot()
    return result["path"]   # hand off to vision model or OCR
```

---

## MCP Server (External Callers)

The lower-level MCP server (`mcp_servers/desktop_control_server.py`) exposes
the same capabilities over JSON-RPC stdio for callers outside Python:

```bash
# List available tools
echo '{"jsonrpc":"2.0","id":1,"method":"tools/list","params":{}}' \
  | python mcp_servers/desktop_control_server.py

# Take a screenshot (request kept on one line so it arrives as a single message)
echo '{"jsonrpc":"2.0","id":2,"method":"tools/call","params":{"name":"take_screenshot","arguments":{"path":"/tmp/snap.png"}}}' \
  | python mcp_servers/desktop_control_server.py
```
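
When requests are generated by a script rather than typed by hand, building them with `json.dumps` avoids quoting mistakes. The sketch below only constructs the request strings shown above; `jsonrpc_request` is a hypothetical helper, and whether the server expects exactly one request per line is an assumption.

```python
import json
from typing import Any, Dict, Optional


def jsonrpc_request(
    request_id: int, method: str, params: Optional[Dict[str, Any]] = None
) -> str:
    """Serialize a JSON-RPC 2.0 request as one compact line."""
    payload = {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": method,
        "params": params or {},
    }
    return json.dumps(payload, separators=(",", ":"))


list_tools = jsonrpc_request(1, "tools/list")
take_shot = jsonrpc_request(
    2,
    "tools/call",
    {"name": "take_screenshot", "arguments": {"path": "/tmp/snap.png"}},
)
```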

---

## Docker / Sandboxed Environment

`docker-compose.desktop.yml` provides a safe container with:

- Xvfb virtual display (1280×800)
- noVNC for browser-based viewing
- Python + pyautogui pre-installed

```bash
docker-compose -f docker-compose.desktop.yml up
# noVNC → http://localhost:6080
# Run demo inside container:
docker exec -it nexus-desktop python nexus/computer_use_demo.py
```

---

## Development Notes

- `NEXUS_HOME` env var overrides the log/snapshot directory (default `~/.nexus`)
- `GITEA_URL` env var overrides the target in the demo script
- `BROWSER_OPEN_WAIT` controls how long the demo waits after opening the browser
- Tests in `tests/test_computer_use.py` run headless — pyautogui is fully mocked
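
The three environment variables above could be read in one place, as in this sketch. `DemoConfig` and `load_config` are hypothetical names. The `GITEA_URL` default is taken from the demo output shown earlier, while the 3-second default for `BROWSER_OPEN_WAIT` and its unit (seconds) are assumptions based on the "Waiting 3s" line.

```python
import os
from dataclasses import dataclass
from pathlib import Path
from typing import Mapping, Optional


@dataclass(frozen=True)
class DemoConfig:
    nexus_home: Path          # where logs and snapshots are written
    gitea_url: str            # page the demo opens
    browser_open_wait: float  # assumed to be seconds


def load_config(env: Optional[Mapping[str, str]] = None) -> DemoConfig:
    """Build a DemoConfig from environment variables, applying defaults."""
    env = os.environ if env is None else env
    return DemoConfig(
        nexus_home=Path(env.get("NEXUS_HOME", "~/.nexus")).expanduser(),
        gitea_url=env.get("GITEA_URL", "https://forge.alexanderwhitestone.com"),
        # Default of 3 is an assumption inferred from the demo output.
        browser_open_wait=float(env.get("BROWSER_OPEN_WAIT", "3")),
    )


cfg = load_config({"NEXUS_HOME": "/tmp/nexus", "BROWSER_OPEN_WAIT": "1.5"})
```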