the-nexus/docs/computer-use.md
Alexander Whitestone a3a28aa4c2
feat: add desktop automation primitives to Hermes (#1125)
Implements Phase 1 & 2 of the [COMPUTER_USE] epic:

- nexus/computer_use.py — four Hermes tools with safety guards and
  JSONL action logging:
    computer_screenshot(), computer_click(), computer_type(), computer_scroll()
  Poka-yoke: right/middle clicks require confirm=True; text containing
  password/token/key keywords is refused without confirm=True.
  pyautogui.FAILSAFE=True enabled globally (corner-abort).

- nexus/computer_use_demo.py — end-to-end Phase 1 demo: baseline
  screenshot → open browser → navigate to Gitea → evidence screenshot.

- tests/test_computer_use.py — 29 unit tests, fully headless (pyautogui
  mocked); all pass.

- docs/computer-use.md — full Phase 1–3 documentation including API
  reference, safety table, action-log format, and pilot recipes.

- docker-compose.desktop.yml — sandboxed Xvfb + noVNC container for
  safe headless desktop automation.

The existing mcp_servers/desktop_control_server.py is unchanged; it
remains available for external/MCP callers (e.g. the Bannerlord harness).

Fixes #1125
2026-04-08 06:29:27 -04:00

# Computer Use — Desktop Automation Primitives for Hermes
**Issue:** #1125
**Status:** Phase 1 complete, Phase 2 in progress
**Owner:** Bezalel
**Epic:** #1120
---
## Overview
This document describes how Hermes agents can control a desktop environment
(screenshot, click, type, scroll) for automation and testing. The capability
unlocks:
- Visual regression testing of fleet dashboards
- Automated Gitea workflow verification
- Screenshot-based incident diagnosis
- Driving GUI-only tools from agent code
---
## Architecture
```
┌──────────────────────────────────────────────────────┐
│                     Hermes Agent                     │
│                                                      │
│  computer_screenshot()      computer_click(x, y)     │
│  computer_type(text)        computer_scroll(x, y, n) │
│                        │                             │
│                nexus/computer_use.py                 │
│             (safety guards · action log)             │
└────────────────────────┬─────────────────────────────┘
              ┌──────────┴───────────┐
              │      pyautogui       │
              │  (FAILSAFE enabled)  │
              └──────────┬───────────┘
              ┌──────────┴───────────┐
              │ Desktop environment  │
              │   (Xvfb · noVNC ·    │
              │    bare metal)       │
              └──────────────────────┘
```
The MCP server layer (`mcp_servers/desktop_control_server.py`) is still
available for external callers (e.g. the Bannerlord harness). The
`nexus/computer_use.py` module calls pyautogui directly so that safety
guards, logging, and screenshot evidence are applied consistently for every
Hermes agent invocation.
---
## Phase 1 — Environment & Primitives
### Sandboxed Desktop Setup
**Option A — Xvfb (lightweight, Linux/macOS)**
```bash
# Install
sudo apt-get install xvfb # Linux
brew install --cask xquartz   # macOS (Xvfb ships with XQuartz)
# Start a virtual display on :99
Xvfb :99 -screen 0 1280x800x24 &
export DISPLAY=:99
# Run the demo
python nexus/computer_use_demo.py
```
**Option B — Docker with noVNC**
```bash
docker-compose -f docker-compose.desktop.yml up
# Open http://localhost:6080 to view the virtual desktop
```
See `docker-compose.desktop.yml` in the repo root.
### Running the Demo
The `nexus/computer_use_demo.py` script exercises the full Phase 1 loop:
```
[1/4] Capturing baseline screenshot
[2/4] Opening browser → https://forge.alexanderwhitestone.com
[3/4] Waiting 3s for page to load
[4/4] Capturing evidence screenshot
```
```bash
# Default target (Gitea forge)
python nexus/computer_use_demo.py
# Custom URL
GITEA_URL=http://localhost:3000 python nexus/computer_use_demo.py
```
---
## Phase 2 — Tool Integration
### API Reference
All four tools live in `nexus/computer_use.py` and follow the same contract:
```python
result = tool(...)
# result is always:
# {"ok": bool, "tool": str, ...fields..., "screenshot": path_or_None}
```
#### `computer_screenshot(output_path=None)`
Take a screenshot of the current desktop.
| Parameter | Type | Default | Description |
|---------------|-----------------|----------------------|-----------------------------------|
| `output_path` | `str` or `None` | auto timestamped PNG | Where to save the captured image. |
```python
from nexus.computer_use import computer_screenshot
result = computer_screenshot()
# {"ok": True, "tool": "computer_screenshot", "path": "~/.nexus/nexus_snap_1712345678.png"}
```
#### `computer_click(x, y, *, button="left", confirm=False)`
Click at screen coordinates.
| Parameter | Type | Default | Description |
|-----------|--------|----------|------------------------------------------|
| `x`, `y` | `int` | required | Screen pixel coordinates. |
| `button` | `str` | `"left"` | `"left"`, `"right"`, or `"middle"`. |
| `confirm` | `bool` | `False` | Required for `right`/`middle` clicks. |
```python
from nexus.computer_use import computer_click
# Simple left click (no confirm needed)
result = computer_click(960, 540)
# Right-click requires explicit confirmation
result = computer_click(960, 540, button="right", confirm=True)
```
**Screenshot evidence:** before/after snapshots are captured and logged.
#### `computer_type(text, *, confirm=False)`
Type a string using keyboard simulation.
| Parameter | Type | Default | Description |
|-----------|--------|----------|----------------------------------------------------|
| `text` | `str` | required | Text to type. |
| `confirm` | `bool` | `False` | Required when text contains `password`/`token`/`key`. |
```python
from nexus.computer_use import computer_type
# Safe text — no confirm needed
computer_type("https://forge.alexanderwhitestone.com")
# Sensitive text — confirm required
computer_type("hunter2", confirm=True)
```
#### `computer_scroll(x, y, amount)`
Scroll the mouse wheel at the given position.
| Parameter | Type | Description |
|-----------|-------|-------------------------------------------------|
| `x`, `y` | `int` | Move mouse here before scrolling. |
| `amount` | `int` | Positive = scroll up; negative = scroll down. |
```python
from nexus.computer_use import computer_scroll
computer_scroll(640, 400, -5) # scroll down 5 clicks
computer_scroll(640, 400, 3) # scroll up 3 clicks
```
### Safety (Poka-Yoke)
| Situation | Behavior |
|------------------------------------|-------------------------------------------------|
| `right`/`middle` click w/o confirm | Refused; returns `ok=False` with explanation |
| Text with `password`/`token`/`key` | Refused unless `confirm=True` |
| `FAILSAFE = True` | Move mouse to screen corner (0, 0) to abort |
| pyautogui unavailable | All tools return `ok=False` gracefully |
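
The keyword refusal in the second row amounts to a small pre-check along these lines (an illustrative sketch of the rule as documented, not the module's exact code; `check_type_guard` is a hypothetical name):

```python
# Keywords that mark text as sensitive, per the safety table above
SENSITIVE_KEYWORDS = ("password", "token", "key")


def check_type_guard(text: str, confirm: bool = False) -> dict:
    """Return the refusal envelope when *text* contains a sensitive
    keyword and confirm was not given; otherwise allow typing."""
    lowered = text.lower()
    if any(word in lowered for word in SENSITIVE_KEYWORDS) and not confirm:
        return {"ok": False, "tool": "computer_type",
                "error": "text contains a sensitive keyword; pass confirm=True"}
    return {"ok": True, "tool": "computer_type"}
```

Because the guard returns `ok=False` rather than raising, callers can detect the refusal and retry deliberately with `confirm=True`.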
### Action Log
Every call is appended to `~/.nexus/computer_use_log.jsonl` (one JSON record per line; the example below is wrapped for readability):
```json
{"ok": true, "tool": "computer_click", "x": 960, "y": 540, "button": "left",
"before_screenshot": "/home/user/.nexus/before_click_1712345.png",
"screenshot": "/home/user/.nexus/after_click_1712345.png",
"ts": "2026-04-08T10:30:00+00:00"}
```
Read recent entries from Python:
```python
from nexus.computer_use import read_action_log
for record in read_action_log(last_n=10):
    print(record)
```
---
## Phase 3 — Use-Case Pilots
### Pilot 1: Visual Regression Test (Fleet Dashboard)
Open the fleet health dashboard, take a screenshot, compare pixel-level
hashes against a golden baseline:
```python
from nexus.computer_use import computer_screenshot, computer_click, computer_type
import hashlib
import time

def screenshot_hash(path: str) -> str:
    with open(path, "rb") as fh:
        return hashlib.md5(fh.read()).hexdigest()

# Navigate to the dashboard
computer_click(960, 40)  # address bar
computer_type("http://localhost:7771/health\n")
time.sleep(2)
result = computer_screenshot()
current_hash = screenshot_hash(result["path"])
GOLDEN_HASH = "abc123..." # established on first run
assert current_hash == GOLDEN_HASH, "Visual regression detected!"
```
### Pilot 2: Screenshot-Based CI Diagnosis
When a CI workflow fails, agents can screenshot the Gitea workflow page and
use the image to triage:
```python
from nexus.computer_use import computer_screenshot, computer_click, computer_type
import time

def diagnose_failed_workflow(run_url: str) -> str:
    """
    Navigate to *run_url*, screenshot it, and return the screenshot path
    for downstream LLM-based analysis.
    """
    computer_click(960, 40)  # address bar
    computer_type(run_url + "\n")
    time.sleep(3)
    result = computer_screenshot()
    return result["path"]  # hand off to vision model or OCR
```
---
## MCP Server (External Callers)
The lower-level MCP server (`mcp_servers/desktop_control_server.py`) exposes
the same capabilities over JSON-RPC stdio for callers outside Python:
```bash
# List available tools
echo '{"jsonrpc":"2.0","id":1,"method":"tools/list","params":{}}' \
| python mcp_servers/desktop_control_server.py
# Take a screenshot
echo '{"jsonrpc":"2.0","id":2,"method":"tools/call","params":{"name":"take_screenshot","arguments":{"path":"/tmp/snap.png"}}}' \
  | python mcp_servers/desktop_control_server.py
```
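
The same round-trip can be driven from Python with `subprocess`. This is a sketch under the assumption (suggested by the shell examples above) that the server reads one JSON-RPC request per line on stdin and writes one reply per line; `build_request` and `mcp_call` are hypothetical helper names.

```python
import json
import subprocess


def build_request(method: str, params: dict, request_id: int = 1) -> str:
    """Serialize one JSON-RPC 2.0 request as a single line."""
    return json.dumps({"jsonrpc": "2.0", "id": request_id,
                       "method": method, "params": params})


def mcp_call(method: str, params: dict) -> dict:
    """Send one request to the server over stdio and parse the reply."""
    proc = subprocess.run(
        ["python", "mcp_servers/desktop_control_server.py"],
        input=build_request(method, params) + "\n",
        capture_output=True, text=True, check=True)
    return json.loads(proc.stdout.splitlines()[0])
```

For example, `mcp_call("tools/list", {})` would mirror the first shell invocation above.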
---
## Docker / Sandboxed Environment
`docker-compose.desktop.yml` provides a safe container with:
- Xvfb virtual display (1280×800)
- noVNC for browser-based viewing
- Python + pyautogui pre-installed
```bash
docker-compose -f docker-compose.desktop.yml up
# noVNC → http://localhost:6080
# Run demo inside container:
docker exec -it nexus-desktop python nexus/computer_use_demo.py
```
---
## Development Notes
- `NEXUS_HOME` env var overrides the log/snapshot directory (default `~/.nexus`)
- `GITEA_URL` env var overrides the target in the demo script
- `BROWSER_OPEN_WAIT` controls how long the demo waits after opening the browser
- Tests in `tests/test_computer_use.py` run headless — pyautogui is fully mocked
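
For instance, a test run might point both knobs at throwaway values before launching the demo (a sketch; only `NEXUS_HOME` and `GITEA_URL` are documented above):

```python
import os
import tempfile

# Route logs and screenshots to a throwaway directory and target a
# local Gitea. Set these before importing any module that reads them
# at import time.
scratch = tempfile.mkdtemp(prefix="nexus_test_")
os.environ["NEXUS_HOME"] = scratch
os.environ["GITEA_URL"] = "http://localhost:3000"
```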