# Computer Use — Desktop Automation Primitives for Hermes

**Issue:** #1125
**Status:** Phase 1 complete, Phase 2 in progress
**Owner:** Bezalel
**Epic:** #1120

---

## Overview

This document describes how Hermes agents can control a desktop environment
(screenshot, click, type, scroll) for automation and testing. The capability
unlocks:

- Visual regression testing of fleet dashboards
- Automated Gitea workflow verification
- Screenshot-based incident diagnosis
- Driving GUI-only tools from agent code

---

## Architecture

```
┌──────────────────────────────────────────────────────┐
│                     Hermes Agent                     │
│                                                      │
│  computer_screenshot()   computer_click(x, y)        │
│  computer_type(text)     computer_scroll(x, y, n)    │
│                        │                             │
│              nexus/computer_use.py                   │
│          (safety guards · action log)                │
└────────────────────────┬─────────────────────────────┘
                         │
              ┌──────────┴───────────┐
              │  pyautogui           │
              │  (FAILSAFE enabled)  │
              └──────────┬───────────┘
                         │
              ┌──────────┴───────────┐
              │  Desktop environment │
              │  (Xvfb · noVNC ·     │
              │   bare metal)        │
              └──────────────────────┘
```

The MCP server layer (`mcp_servers/desktop_control_server.py`) is still
available for external callers (e.g. the Bannerlord harness). The
`nexus/computer_use.py` module calls pyautogui directly so that safety
guards, logging, and screenshot evidence are applied consistently for every
Hermes agent invocation.

---

## Phase 1 — Environment & Primitives

### Sandboxed Desktop Setup

**Option A — Xvfb (lightweight, Linux/macOS)**

```bash
# Install
sudo apt-get install xvfb        # Linux
brew install --cask xquartz      # macOS (XQuartz bundles Xvfb)

# Start a virtual display on :99
Xvfb :99 -screen 0 1280x800x24 &
export DISPLAY=:99

# Run the demo
python nexus/computer_use_demo.py
```
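On Linux, X11 clients such as pyautogui find the display through the `DISPLAY` variable, so a pre-flight check can fail fast when the `export` step above was skipped. A minimal sketch; `display_is_configured` is a hypothetical helper and is not part of `nexus/computer_use.py`:

```python
import os
from typing import Mapping, Optional


def display_is_configured(env: Optional[Mapping[str, str]] = None) -> bool:
    """Return True when an X11 display name such as ':99' is set.

    Hypothetical helper: it only checks that the variable is non-empty,
    not that an X server is actually listening on that display.
    """
    env = os.environ if env is None else env
    return bool(env.get("DISPLAY", "").strip())
```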
**Option B — Docker with noVNC**

```bash
docker-compose -f docker-compose.desktop.yml up
# Open http://localhost:6080 to view the virtual desktop
```

See `docker-compose.desktop.yml` in the repo root.

### Running the Demo

The `nexus/computer_use_demo.py` script exercises the full Phase 1 loop:

```
[1/4] Capturing baseline screenshot
[2/4] Opening browser → https://forge.alexanderwhitestone.com
[3/4] Waiting 3s for page to load
[4/4] Capturing evidence screenshot
```

```bash
# Default target (Gitea forge)
python nexus/computer_use_demo.py

# Custom URL
GITEA_URL=http://localhost:3000 python nexus/computer_use_demo.py
```

---

## Phase 2 — Tool Integration

### API Reference

All four tools live in `nexus/computer_use.py` and follow the same contract:

```python
result = tool(...)
# result is always:
# {"ok": bool, "tool": str, ...fields..., "screenshot": path_or_None}
```
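Because every tool reports failure through the `ok` flag rather than an exception, callers may want one place that turns a failed result into an error. A minimal sketch of that idea; `require_ok`, `ToolError`, and the `"error"` field name are assumptions for illustration, not part of the documented API:

```python
from typing import Any, Dict


class ToolError(RuntimeError):
    """Raised when a tool result reports ok=False (hypothetical type)."""


def require_ok(result: Dict[str, Any]) -> Dict[str, Any]:
    """Return the result unchanged, or raise ToolError if the call failed."""
    if not result.get("ok", False):
        tool = result.get("tool", "<unknown tool>")
        # "error" is an assumed field name; fall back to the whole record.
        detail = result.get("error", result)
        raise ToolError(f"{tool} failed: {detail}")
    return result
```

With such a helper, a call site could read `require_ok(computer_click(960, 540))` and stop at the first refused or failed action.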

#### `computer_screenshot(output_path=None)`

Take a screenshot of the current desktop.

| Parameter     | Type            | Default | Description                                                          |
|---------------|-----------------|---------|----------------------------------------------------------------------|
| `output_path` | `str` or `None` | `None`  | Where to save the image. `None` saves an auto-named timestamped PNG. |

```python
from nexus.computer_use import computer_screenshot

result = computer_screenshot()
# {"ok": True, "tool": "computer_screenshot", "path": "~/.nexus/nexus_snap_1712345678.png"}
```

#### `computer_click(x, y, *, button="left", confirm=False)`

Click at screen coordinates.

| Parameter | Type   | Default  | Description                                |
|-----------|--------|----------|--------------------------------------------|
| `x`, `y`  | `int`  | required | Screen pixel coordinates.                  |
| `button`  | `str`  | `"left"` | `"left"`, `"right"`, or `"middle"`.        |
| `confirm` | `bool` | `False`  | Must be `True` for `right`/`middle` clicks. |

```python
from nexus.computer_use import computer_click

# Simple left click (no confirm needed)
result = computer_click(960, 540)

# Right-click requires explicit confirmation
result = computer_click(960, 540, button="right", confirm=True)
```

**Screenshot evidence:** a before and an after snapshot are captured and recorded in the action log.

#### `computer_type(text, *, confirm=False)`

Type a string using keyboard simulation.

| Parameter | Type   | Default  | Description                                                        |
|-----------|--------|----------|--------------------------------------------------------------------|
| `text`    | `str`  | required | Text to type.                                                      |
| `confirm` | `bool` | `False`  | Must be `True` when the text contains `password`, `token`, or `key`. |

```python
from nexus.computer_use import computer_type

# Safe text: no confirm needed
computer_type("https://forge.alexanderwhitestone.com")

# Sensitive text: pass confirm=True
computer_type("hunter2", confirm=True)
```

#### `computer_scroll(x, y, amount)`

Scroll the mouse wheel at the given position.

| Parameter | Type  | Description                                   |
|-----------|-------|-----------------------------------------------|
| `x`, `y`  | `int` | Move mouse here before scrolling.             |
| `amount`  | `int` | Positive = scroll up; negative = scroll down. |

```python
from nexus.computer_use import computer_scroll

computer_scroll(640, 400, -5)  # scroll down 5 clicks
computer_scroll(640, 400, 3)   # scroll up 3 clicks
```

### Safety (Poka-Yoke)

| Situation                                | Behavior                                                                             |
|------------------------------------------|--------------------------------------------------------------------------------------|
| `right`/`middle` click without `confirm` | Refused; returns `ok=False` with an explanation                                      |
| Text containing `password`/`token`/`key` | Refused unless `confirm=True`                                                        |
| `FAILSAFE = True`                        | Moving the mouse to screen corner (0, 0) aborts; pyautogui raises `FailSafeException` |
| pyautogui unavailable                    | All tools return `ok=False` instead of crashing                                      |
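The first two rows of the table can be sketched as pure functions that decide whether an action may proceed. This is an illustration under stated assumptions, not the code in `nexus/computer_use.py`: the function names are hypothetical, and both the case-insensitive substring match and the rejection of unknown button names are guesses at reasonable behavior.

```python
from typing import Optional

SENSITIVE_KEYWORDS = ("password", "token", "key")
CONFIRM_BUTTONS = ("right", "middle")


def click_refusal(button: str, confirm: bool) -> Optional[str]:
    """Return a refusal message, or None when the click may proceed."""
    if button not in ("left",) + CONFIRM_BUTTONS:
        return f"unknown button {button!r}"  # assumed behavior
    if button in CONFIRM_BUTTONS and not confirm:
        return f"{button}-click requires confirm=True"
    return None


def type_refusal(text: str, confirm: bool) -> Optional[str]:
    """Return a refusal message, or None when the text may be typed."""
    lowered = text.lower()  # assumption: matching ignores case
    hits = [word for word in SENSITIVE_KEYWORDS if word in lowered]
    if hits and not confirm:
        return f"text contains sensitive keyword(s) {hits}; pass confirm=True"
    return None
```

Under this sketch a string such as `"hunter2"` contains none of the keywords and would not be refused, so keyword matching is a guard against accidents rather than a secret detector.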

### Action Log

Every call is appended to `~/.nexus/computer_use_log.jsonl`, one JSON record per line (the example below is wrapped for readability):

```json
{"ok": true, "tool": "computer_click", "x": 960, "y": 540, "button": "left",
 "before_screenshot": "/home/user/.nexus/before_click_1712345.png",
 "screenshot": "/home/user/.nexus/after_click_1712345.png",
 "ts": "2026-04-08T10:30:00+00:00"}
```

Read recent entries from Python:

```python
from nexus.computer_use import read_action_log

for record in read_action_log(last_n=10):
    print(record)
```
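For readers unfamiliar with the JSONL layout, the append-and-tail pattern can be sketched with the standard library alone. `append_record` and `tail_records` are hypothetical names used for illustration; the real `read_action_log` may be implemented differently.

```python
import json
from datetime import datetime, timezone
from pathlib import Path
from typing import Any, Dict, List


def append_record(log_path: Path, record: Dict[str, Any]) -> None:
    """Append one record as a single JSON line, stamped with a UTC time."""
    stamped = {**record, "ts": datetime.now(timezone.utc).isoformat()}
    log_path.parent.mkdir(parents=True, exist_ok=True)
    with log_path.open("a", encoding="utf-8") as handle:
        handle.write(json.dumps(stamped) + "\n")


def tail_records(log_path: Path, last_n: int = 10) -> List[Dict[str, Any]]:
    """Return up to last_n of the most recent records, oldest first."""
    if last_n <= 0 or not log_path.exists():
        return []
    with log_path.open(encoding="utf-8") as handle:
        lines = [line for line in handle if line.strip()]
    return [json.loads(line) for line in lines[-last_n:]]
```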

---

## Phase 3 — Use-Case Pilots

### Pilot 1: Visual Regression Test (Fleet Dashboard)

Open the fleet health dashboard, take a screenshot, and compare a hash of
the image file against a golden baseline:

```python
import hashlib
import time

from nexus.computer_use import computer_click, computer_screenshot, computer_type

def screenshot_hash(path: str) -> str:
    """Return the MD5 hex digest of the image file's bytes."""
    with open(path, "rb") as f:
        return hashlib.md5(f.read()).hexdigest()

# Navigate to the dashboard
computer_click(960, 40)  # address bar
computer_type("http://localhost:7771/health\n")

time.sleep(2)  # give the page time to render

result = computer_screenshot()
current_hash = screenshot_hash(result["path"])

GOLDEN_HASH = "abc123..."  # established on first run
assert current_hash == GOLDEN_HASH, "Visual regression detected!"
```
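The golden hash has to come from somewhere on the first run. One way to handle that is sketched below, assuming the baseline is kept in a small text file next to the tests; `check_against_golden` and the file layout are hypothetical, and SHA-256 is swapped in for MD5.

```python
import hashlib
from pathlib import Path


def file_sha256(path: Path) -> str:
    """Return the SHA-256 hex digest of the file's bytes."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def check_against_golden(screenshot: Path, golden_file: Path) -> bool:
    """Record the hash on first run; afterwards require an exact match."""
    current = file_sha256(screenshot)
    if not golden_file.exists():
        # First run: store the baseline and treat the check as passed.
        golden_file.write_text(current + "\n", encoding="utf-8")
        return True
    return golden_file.read_text(encoding="utf-8").strip() == current
```

An exact file hash changes with any byte-level difference (a clock, a blinking cursor, font anti-aliasing), so this kind of check suits fully deterministic pages only.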

### Pilot 2: Screenshot-Based CI Diagnosis

When a CI workflow fails, agents can screenshot the Gitea workflow page and
use the image to triage:

```python
import time

from nexus.computer_use import computer_click, computer_screenshot, computer_type

def diagnose_failed_workflow(run_url: str) -> str:
    """
    Navigate to *run_url*, screenshot it, and return the screenshot path
    for downstream LLM-based analysis.
    """
    computer_click(960, 40)  # address bar
    computer_type(run_url + "\n")

    time.sleep(3)  # wait for the workflow page to load

    result = computer_screenshot()
    return result["path"]  # hand off to vision model or OCR
```

---

## MCP Server (External Callers)

The lower-level MCP server (`mcp_servers/desktop_control_server.py`) exposes
the same capabilities over JSON-RPC stdio for callers outside Python:

```bash
# List available tools
echo '{"jsonrpc":"2.0","id":1,"method":"tools/list","params":{}}' \
  | python mcp_servers/desktop_control_server.py

# Take a screenshot (request kept on one line so it arrives as a single message)
echo '{"jsonrpc":"2.0","id":2,"method":"tools/call","params":{"name":"take_screenshot","arguments":{"path":"/tmp/snap.png"}}}' \
  | python mcp_servers/desktop_control_server.py
```
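The same requests can be built programmatically instead of hand-writing JSON in the shell. A minimal sketch, assuming the server reads one JSON message per line on stdin as the shell examples suggest; `jsonrpc_request` is a hypothetical helper, and the resulting strings could be fed to the server process with `subprocess.run(..., input=..., text=True)`:

```python
import json
from typing import Any, Dict, Optional


def jsonrpc_request(
    request_id: int, method: str, params: Optional[Dict[str, Any]] = None
) -> str:
    """Serialize a JSON-RPC 2.0 request as one newline-terminated line."""
    message = {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": method,
        "params": params or {},
    }
    # Compact separators keep the message on a single line without spaces.
    return json.dumps(message, separators=(",", ":")) + "\n"


# The two requests from the shell examples above.
list_tools = jsonrpc_request(1, "tools/list")
take_shot = jsonrpc_request(
    2,
    "tools/call",
    {"name": "take_screenshot", "arguments": {"path": "/tmp/snap.png"}},
)
```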

---

## Docker / Sandboxed Environment

`docker-compose.desktop.yml` provides a safe container with:

- Xvfb virtual display (1280×800)
- noVNC for browser-based viewing
- Python + pyautogui pre-installed

```bash
docker-compose -f docker-compose.desktop.yml up
# noVNC → http://localhost:6080
# Run demo inside container:
docker exec -it nexus-desktop python nexus/computer_use_demo.py
```

---

## Development Notes

- `NEXUS_HOME` env var overrides the log/snapshot directory (default `~/.nexus`)
- `GITEA_URL` env var overrides the target in the demo script
- `BROWSER_OPEN_WAIT` controls how long the demo waits after opening the browser
- Tests in `tests/test_computer_use.py` run headless; pyautogui is fully mocked
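
How these overrides might be resolved is sketched below. The helper names are hypothetical, and two details are assumptions rather than documented behavior: that `BROWSER_OPEN_WAIT` is a number of seconds, and that its default is 3 (taken from the "Waiting 3s" line in the demo output).

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def nexus_home(env: Optional[Mapping[str, str]] = None) -> Path:
    """Directory for logs and snapshots: $NEXUS_HOME if set, else ~/.nexus."""
    env = os.environ if env is None else env
    override = env.get("NEXUS_HOME", "").strip()
    return Path(override).expanduser() if override else Path.home() / ".nexus"


def browser_open_wait(
    env: Optional[Mapping[str, str]] = None, default: float = 3.0
) -> float:
    """Seconds to wait after opening the browser (unit and default assumed)."""
    env = os.environ if env is None else env
    try:
        return float(env["BROWSER_OPEN_WAIT"])
    except (KeyError, ValueError):
        # Unset or non-numeric values fall back to the default.
        return default
```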