Implements Phase 1 & 2 of the [COMPUTER_USE] epic:

- nexus/computer_use.py — four Hermes tools with safety guards and
  JSONL action logging:
    computer_screenshot(), computer_click(), computer_type(), computer_scroll()
  Poka-yoke: right/middle clicks require confirm=True; text containing
  password/token/key keywords is refused without confirm=True.
  pyautogui.FAILSAFE=True enabled globally (corner-abort).
- nexus/computer_use_demo.py — end-to-end Phase 1 demo: baseline
  screenshot → open browser → navigate to Gitea → evidence screenshot.
- tests/test_computer_use.py — 29 unit tests, fully headless (pyautogui
  mocked); all pass.
- docs/computer-use.md — full Phase 1–3 documentation including API
  reference, safety table, action-log format, and pilot recipes.
- docker-compose.desktop.yml — sandboxed Xvfb + noVNC container for
  safe headless desktop automation.

The existing mcp_servers/desktop_control_server.py is unchanged; it
remains available for external/MCP callers (Bannerlord harness etc).

Fixes #1125
311 lines
9.6 KiB
Markdown
# Computer Use — Desktop Automation Primitives for Hermes

**Issue:** #1125
**Status:** Phase 1 complete, Phase 2 in progress
**Owner:** Bezalel
**Epic:** #1120

---

## Overview

This document describes how Hermes agents can control a desktop environment
(screenshot, click, type, scroll) for automation and testing. The capability
unlocks:

- Visual regression testing of fleet dashboards
- Automated Gitea workflow verification
- Screenshot-based incident diagnosis
- Driving GUI-only tools from agent code

---

## Architecture

```
┌──────────────────────────────────────────────────────┐
│                     Hermes Agent                     │
│                                                      │
│  computer_screenshot()   computer_click(x, y)        │
│  computer_type(text)     computer_scroll(x, y, n)    │
│                        │                             │
│              nexus/computer_use.py                   │
│          (safety guards · action log)                │
└────────────────────────┬─────────────────────────────┘
                         │
              ┌──────────┴───────────┐
              │      pyautogui       │
              │  (FAILSAFE enabled)  │
              └──────────┬───────────┘
                         │
              ┌──────────┴───────────┐
              │ Desktop environment  │
              │  (Xvfb · noVNC ·     │
              │   bare metal)        │
              └──────────────────────┘
```

The MCP server layer (`mcp_servers/desktop_control_server.py`) is still
available for external callers (e.g. the Bannerlord harness). The
`nexus/computer_use.py` module calls pyautogui directly so that safety
guards, logging, and screenshot evidence are applied consistently for every
Hermes agent invocation.
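
The layering described above (guard first, then the backend call, then a log record) can be sketched roughly as follows. This is an illustrative outline, not the code in `nexus/computer_use.py`: the helper name `guarded_call`, the `allowed` flag, the in-memory `log` list, and the `error` field are all assumptions made for the example.

```python
import json
import time
from typing import Any, Callable


def guarded_call(tool: str, action: Callable[[], Any], *,
                 allowed: bool, log: list) -> dict:
    """Run `action` only if the guard allows it, then record one JSON line.

    Hypothetical helper: the real module may structure this differently.
    """
    if not allowed:
        # The guard refused: the backend is never touched.
        result = {"ok": False, "tool": tool, "error": "refused by safety guard"}
    else:
        try:
            action()
            result = {"ok": True, "tool": tool}
        except Exception as exc:  # the backend (e.g. pyautogui) raised
            result = {"ok": False, "tool": tool, "error": str(exc)}
    result["ts"] = time.time()
    log.append(json.dumps(result))  # one record per call, refused or not
    return result
```

Because refusals are logged as well, the log shows what an agent attempted, not only what actually ran.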

---

## Phase 1 — Environment & Primitives

### Sandboxed Desktop Setup

**Option A — Xvfb (lightweight, Linux/macOS)**

```bash
# Install
sudo apt-get install xvfb        # Linux (Debian/Ubuntu)
brew install --cask xquartz      # macOS (XQuartz ships Xvfb)

# Start a virtual display on :99
Xvfb :99 -screen 0 1280x800x24 &
export DISPLAY=:99

# Run the demo
python nexus/computer_use_demo.py
```
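
Before driving an X11 display such as the Xvfb instance above, a script can check that `DISPLAY` is actually exported. The helper below is a small sketch; the name `display_is_set` is hypothetical and not part of the repository, and the check only applies to X11 setups.

```python
import os
from typing import Mapping, Optional


def display_is_set(env: Optional[Mapping[str, str]] = None) -> bool:
    """Return True if the environment names an X11 display (e.g. ':99')."""
    env = os.environ if env is None else env
    return bool(env.get("DISPLAY", "").strip())


if __name__ == "__main__":
    if not display_is_set():
        raise SystemExit("DISPLAY is not set; start Xvfb and export DISPLAY first")
```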

**Option B — Docker with noVNC**

```bash
docker-compose -f docker-compose.desktop.yml up
# Open http://localhost:6080 to view the virtual desktop
```

See `docker-compose.desktop.yml` in the repo root.

### Running the Demo

The `nexus/computer_use_demo.py` script exercises the full Phase 1 loop:

```
[1/4] Capturing baseline screenshot
[2/4] Opening browser → https://forge.alexanderwhitestone.com
[3/4] Waiting 3s for page to load
[4/4] Capturing evidence screenshot
```

```bash
# Default target (Gitea forge)
python nexus/computer_use_demo.py

# Custom URL
GITEA_URL=http://localhost:3000 python nexus/computer_use_demo.py
```

---

## Phase 2 — Tool Integration

### API Reference

All four tools live in `nexus/computer_use.py` and follow the same contract:

```python
result = tool(...)
# result is always:
# {"ok": bool, "tool": str, ...fields..., "screenshot": path_or_None}
```
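
Since every tool reports failure through `ok` rather than by raising, callers that prefer exceptions can wrap the result. The sketch below assumes a failed result carries its explanation under an `error` key; that key name and the helper `require_ok` are assumptions, not part of the documented contract.

```python
def require_ok(result: dict) -> dict:
    """Return `result` unchanged if ok is truthy, otherwise raise RuntimeError."""
    if not result.get("ok"):
        tool = result.get("tool", "<unknown tool>")
        reason = result.get("error", "no reason given")  # assumed field name
        raise RuntimeError(f"{tool} failed: {reason}")
    return result
```

A call site would then read `require_ok(computer_click(960, 540))`, which turns a refused or failed action into an immediate error instead of a silently ignored dict.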

#### `computer_screenshot(output_path=None)`

Take a screenshot of the current desktop.

| Parameter     | Type            | Default              | Description                       |
|---------------|-----------------|----------------------|-----------------------------------|
| `output_path` | `str` or `None` | auto timestamped PNG | Where to save the captured image. |

```python
from nexus.computer_use import computer_screenshot

result = computer_screenshot()
# {"ok": True, "tool": "computer_screenshot", "path": "~/.nexus/nexus_snap_1712345678.png"}
```

#### `computer_click(x, y, *, button="left", confirm=False)`

Click at screen coordinates.

| Parameter | Type   | Default  | Description                              |
|-----------|--------|----------|------------------------------------------|
| `x`, `y`  | `int`  | required | Screen pixel coordinates.                |
| `button`  | `str`  | `"left"` | `"left"`, `"right"`, or `"middle"`.      |
| `confirm` | `bool` | `False`  | Required for `right`/`middle` clicks.    |

```python
from nexus.computer_use import computer_click

# Simple left click (no confirm needed)
result = computer_click(960, 540)

# Right-click requires explicit confirmation
result = computer_click(960, 540, button="right", confirm=True)
```

**Screenshot evidence:** before/after snapshots are captured and logged.
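
The confirmation rule for clicks can be expressed as a small pure function, which also makes it easy to unit-test without a display. This is a sketch of the rule as documented here, not the module's actual code; `click_guard` and its return convention (an error string, or `None` when the click may proceed) are hypothetical.

```python
from typing import Optional

VALID_BUTTONS = ("left", "right", "middle")
GUARDED_BUTTONS = ("right", "middle")  # these need confirm=True


def click_guard(button: str = "left", confirm: bool = False) -> Optional[str]:
    """Return a refusal message, or None if the click is allowed."""
    if button not in VALID_BUTTONS:
        return f"unknown button {button!r}"
    if button in GUARDED_BUTTONS and not confirm:
        return f"{button} click requires confirm=True"
    return None
```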

#### `computer_type(text, *, confirm=False)`

Type a string using keyboard simulation.

| Parameter | Type   | Default  | Description                                           |
|-----------|--------|----------|-------------------------------------------------------|
| `text`    | `str`  | required | Text to type.                                         |
| `confirm` | `bool` | `False`  | Required when text contains `password`/`token`/`key`. |

```python
from nexus.computer_use import computer_type

# Safe text — no confirm needed
computer_type("https://forge.alexanderwhitestone.com")

# Text containing a guarded keyword ("password" here) is refused unless confirm=True
computer_type("password: hunter2", confirm=True)
```
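
The keyword rule can be illustrated with a simple check. The sketch below assumes case-insensitive substring matching on the three documented keywords; the actual matching strategy in `nexus/computer_use.py` is not stated in this document, and `looks_sensitive` is a hypothetical name.

```python
SENSITIVE_KEYWORDS = ("password", "token", "key")


def looks_sensitive(text: str) -> bool:
    """Return True if `text` contains any guarded keyword (case-insensitive)."""
    lowered = text.lower()
    return any(keyword in lowered for keyword in SENSITIVE_KEYWORDS)
```

Under this assumption the guard is a tripwire rather than a secret detector: `looks_sensitive("keyboard")` is true because of the substring `key`, while a bare credential such as `"hunter2"` passes unflagged. Callers typing real credentials may want to pass `confirm=True` explicitly regardless.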

#### `computer_scroll(x, y, amount)`

Scroll the mouse wheel at the given position.

| Parameter | Type  | Description                                     |
|-----------|-------|-------------------------------------------------|
| `x`, `y`  | `int` | Move mouse here before scrolling.               |
| `amount`  | `int` | Positive = scroll up; negative = scroll down.   |

```python
from nexus.computer_use import computer_scroll

computer_scroll(640, 400, -5)   # scroll down 5 clicks
computer_scroll(640, 400, 3)    # scroll up 3 clicks
```

### Safety (Poka-Yoke)

| Situation                          | Behavior                                        |
|------------------------------------|-------------------------------------------------|
| `right`/`middle` click w/o confirm | Refused; returns `ok=False` with explanation    |
| Text with `password`/`token`/`key` | Refused unless `confirm=True`                   |
| `FAILSAFE = True`                  | Move mouse to screen corner (0, 0) to abort     |
| pyautogui unavailable              | All tools return `ok=False` gracefully          |
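
The last row of the table, degrading gracefully when pyautogui cannot be loaded, might look like the following sketch. The helper names `load_backend` and `scroll_or_refuse` and the `error` field are assumptions; only `pyautogui.scroll(clicks)` is a real pyautogui call. A broad `except` is used because importing pyautogui can fail for reasons other than a missing package, for example when no display is reachable.

```python
import importlib
from typing import Any, Optional


def load_backend(module_name: str = "pyautogui") -> Optional[Any]:
    """Import the GUI backend, returning None instead of raising on failure."""
    try:
        return importlib.import_module(module_name)
    except Exception:  # missing package, or no usable display
        return None


def scroll_or_refuse(backend: Optional[Any], amount: int) -> dict:
    """Scroll via the backend if present; otherwise report ok=False."""
    if backend is None:
        return {"ok": False, "tool": "computer_scroll",
                "error": "pyautogui unavailable"}
    backend.scroll(amount)  # positive scrolls up, negative scrolls down
    return {"ok": True, "tool": "computer_scroll", "amount": amount}
```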

### Action Log

Every call is appended to `~/.nexus/computer_use_log.jsonl` (one JSON record per line):

```json
{"ok": true, "tool": "computer_click", "x": 960, "y": 540, "button": "left",
 "before_screenshot": "/home/user/.nexus/before_click_1712345.png",
 "screenshot": "/home/user/.nexus/after_click_1712345.png",
 "ts": "2026-04-08T10:30:00+00:00"}
```

Read recent entries from Python:

```python
from nexus.computer_use import read_action_log

for record in read_action_log(last_n=10):
    print(record)
```
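
For tooling that cannot import `nexus.computer_use`, the JSONL file can be read directly. The function below is a stand-alone sketch of reading the last few records; `tail_jsonl` is a hypothetical name and is not claimed to match how `read_action_log` is implemented.

```python
import json
from collections import deque
from pathlib import Path


def tail_jsonl(path, last_n: int = 10) -> list:
    """Return the last `last_n` JSON records from a JSONL file (oldest first)."""
    log_path = Path(path)
    if not log_path.exists():
        return []
    with log_path.open(encoding="utf-8") as handle:
        # deque(maxlen=...) keeps only the newest lines while streaming the file
        lines = deque((line for line in handle if line.strip()), maxlen=last_n)
    return [json.loads(line) for line in lines]
```

Note that this expects each record on a single physical line, as the JSONL format requires; the record shown above is wrapped only for readability.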

---

## Phase 3 — Use-Case Pilots

### Pilot 1: Visual Regression Test (Fleet Dashboard)

Open the fleet health dashboard, take a screenshot, and compare a hash of the
PNG file against a golden baseline:

```python
import hashlib
import time

from nexus.computer_use import computer_click, computer_screenshot, computer_type


def screenshot_hash(path: str) -> str:
    """Return the MD5 hex digest of the screenshot file's bytes."""
    with open(path, "rb") as image_file:
        return hashlib.md5(image_file.read()).hexdigest()


# Navigate to the dashboard
computer_click(960, 40)          # address bar
computer_type("http://localhost:7771/health\n")
time.sleep(2)                    # give the page time to render

result = computer_screenshot()
current_hash = screenshot_hash(result["path"])

GOLDEN_HASH = "abc123..."        # established on first run
assert current_hash == GOLDEN_HASH, "Visual regression detected!"
```
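
The "established on first run" step can be automated by keeping the golden hash in a file next to the test. The sketch below is one possible arrangement under stated assumptions: the names `file_sha256` and `matches_golden` and the idea of a golden-hash text file are hypothetical, and SHA-256 is swapped in for MD5 purely as a preference.

```python
import hashlib
from pathlib import Path


def file_sha256(path) -> str:
    """Return the SHA-256 hex digest of a file's bytes."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def matches_golden(image_path, golden_path) -> bool:
    """Compare an image's hash with the stored baseline.

    If no baseline exists yet, record the current hash and report a match.
    """
    golden = Path(golden_path)
    current = file_sha256(image_path)
    if not golden.exists():
        golden.write_text(current + "\n", encoding="utf-8")
        return True
    return golden.read_text(encoding="utf-8").strip() == current
```

A byte-level hash changes on any pixel difference, including a clock, a blinking cursor, or a different PNG encoder, so this style of check suits pages with fully deterministic rendering.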

### Pilot 2: Screenshot-Based CI Diagnosis

When a CI workflow fails, agents can screenshot the Gitea workflow page and
use the image to triage:

```python
import time

from nexus.computer_use import computer_click, computer_screenshot, computer_type


def diagnose_failed_workflow(run_url: str) -> str:
    """
    Navigate to *run_url*, screenshot it, return the screenshot path
    for downstream LLM-based analysis.
    """
    computer_click(960, 40)          # address bar
    computer_type(run_url + "\n")
    time.sleep(3)                    # wait for the page to load

    result = computer_screenshot()
    return result["path"]            # hand off to vision model or OCR
```
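
Both pilots wait with a fixed `time.sleep`. Where a readiness condition can be checked programmatically, a polling helper with a timeout is usually less flaky. The following is a generic sketch; `wait_until` is a hypothetical helper, and what the predicate checks (for example, whether two consecutive screenshots are identical) is left to the caller.

```python
import time
from typing import Callable


def wait_until(predicate: Callable[[], bool],
               timeout: float = 10.0, interval: float = 0.25) -> bool:
    """Poll `predicate` until it returns True or `timeout` seconds elapse."""
    deadline = time.monotonic() + timeout
    while True:
        if predicate():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

The predicate is always evaluated at least once, so a condition that is already true returns immediately even with `timeout=0`.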

---

## MCP Server (External Callers)

The lower-level MCP server (`mcp_servers/desktop_control_server.py`) exposes
the same capabilities over JSON-RPC stdio for callers outside Python. Each
request is sent as a single line of JSON:

```bash
# List available tools
echo '{"jsonrpc":"2.0","id":1,"method":"tools/list","params":{}}' \
  | python mcp_servers/desktop_control_server.py

# Take a screenshot
echo '{"jsonrpc":"2.0","id":2,"method":"tools/call","params":{"name":"take_screenshot","arguments":{"path":"/tmp/snap.png"}}}' \
  | python mcp_servers/desktop_control_server.py
```
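
When the request is built programmatically rather than typed by hand, serialising it with `json.dumps` avoids quoting mistakes and guarantees a single-line payload. The helper below is a sketch; `tools_call_request` is a hypothetical name, and the tool name and arguments simply mirror the shell example above.

```python
import json


def tools_call_request(request_id: int, name: str, arguments: dict) -> str:
    """Build a one-line JSON-RPC 2.0 `tools/call` request."""
    message = {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": name, "arguments": arguments},
    }
    # Compact separators keep the payload on a single line with no padding.
    return json.dumps(message, separators=(",", ":"))


line = tools_call_request(2, "take_screenshot", {"path": "/tmp/snap.png"})
```

The resulting `line`, followed by a newline, could then be written to the server's stdin, for instance via `subprocess.run(..., input=line + "\n", text=True)`.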

---

## Docker / Sandboxed Environment

`docker-compose.desktop.yml` provides a safe container with:

- Xvfb virtual display (1280×800)
- noVNC for browser-based viewing
- Python + pyautogui pre-installed

```bash
docker-compose -f docker-compose.desktop.yml up
# noVNC → http://localhost:6080
# Run demo inside container:
docker exec -it nexus-desktop python nexus/computer_use_demo.py
```

---

## Development Notes

- `NEXUS_HOME` env var overrides the log/snapshot directory (default `~/.nexus`)
- `GITEA_URL` env var overrides the target in the demo script
- `BROWSER_OPEN_WAIT` controls how long the demo waits after opening the browser
- Tests in `tests/test_computer_use.py` run headless — pyautogui is fully mocked
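
As an illustration of the `NEXUS_HOME` override, the directory lookup could be written as below. This is a sketch of the documented behaviour (environment variable first, `~/.nexus` as the fallback); the function name `nexus_home` is hypothetical and the real module may resolve the path differently.

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def nexus_home(env: Optional[Mapping[str, str]] = None) -> Path:
    """Return the log/snapshot directory, honouring NEXUS_HOME if it is set."""
    env = os.environ if env is None else env
    # An empty NEXUS_HOME is treated the same as an unset one.
    return Path(env.get("NEXUS_HOME") or "~/.nexus").expanduser()
```

Passing the environment as a parameter keeps the function testable without mutating `os.environ`.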