Implements Phase 1 & 2 of the [COMPUTER_USE] epic:
- nexus/computer_use.py — four Hermes tools with safety guards and
JSONL action logging:
computer_screenshot(), computer_click(), computer_type(), computer_scroll()
Poka-yoke: right/middle clicks require confirm=True; text containing
password/token/key keywords is refused without confirm=True.
pyautogui.FAILSAFE=True enabled globally (corner-abort).
- nexus/computer_use_demo.py — end-to-end Phase 1 demo: baseline
screenshot → open browser → navigate to Gitea → evidence screenshot.
- tests/test_computer_use.py — 29 unit tests, fully headless (pyautogui
mocked); all pass.
- docs/computer-use.md — full Phase 1–3 documentation including API
reference, safety table, action-log format, and pilot recipes.
- docker-compose.desktop.yml — sandboxed Xvfb + noVNC container for
safe headless desktop automation.
The existing mcp_servers/desktop_control_server.py is unchanged; it
remains available for external/MCP callers (Bannerlord harness etc).
Fixes #1125
Computer Use — Desktop Automation Primitives for Hermes
Issue: #1125
Status: Phase 1 complete, Phase 2 in progress
Owner: Bezalel
Epic: #1120
Overview
This document describes how Hermes agents can control a desktop environment (screenshot, click, type, scroll) for automation and testing. The capability unlocks:
- Visual regression testing of fleet dashboards
- Automated Gitea workflow verification
- Screenshot-based incident diagnosis
- Driving GUI-only tools from agent code
Architecture
┌──────────────────────────────────────────────────────┐
│                   Hermes Agent                       │
│                                                      │
│  computer_screenshot()   computer_click(x, y)        │
│  computer_type(text)     computer_scroll(x, y, n)    │
│                        │                             │
│              nexus/computer_use.py                   │
│          (safety guards · action log)                │
└────────────────────────┬─────────────────────────────┘
                         │
              ┌──────────┴───────────┐
              │   pyautogui          │
              │  (FAILSAFE enabled)  │
              └──────────┬───────────┘
                         │
              ┌──────────┴───────────┐
              │  Desktop environment │
              │  (Xvfb · noVNC ·     │
              │   bare metal)        │
              └──────────────────────┘
The MCP server layer (mcp_servers/desktop_control_server.py) is still
available for external callers (e.g. the Bannerlord harness). The
nexus/computer_use.py module calls pyautogui directly so that safety
guards, logging, and screenshot evidence are applied consistently for every
Hermes agent invocation.
Phase 1 — Environment & Primitives
Sandboxed Desktop Setup
Option A — Xvfb (lightweight, Linux/macOS)
# Install
sudo apt-get install xvfb # Linux
brew install --cask xquartz # macOS (XQuartz ships Xvfb)
# Start a virtual display on :99
Xvfb :99 -screen 0 1280x800x24 &
export DISPLAY=:99
# Run the demo
python nexus/computer_use_demo.py
Option B — Docker with noVNC
docker-compose -f docker-compose.desktop.yml up
# Open http://localhost:6080 to view the virtual desktop
See docker-compose.desktop.yml in the repo root.
Running the Demo
The nexus/computer_use_demo.py script exercises the full Phase 1 loop:
[1/4] Capturing baseline screenshot
[2/4] Opening browser → https://forge.alexanderwhitestone.com
[3/4] Waiting 3s for page to load
[4/4] Capturing evidence screenshot
# Default target (Gitea forge)
python nexus/computer_use_demo.py
# Custom URL
GITEA_URL=http://localhost:3000 python nexus/computer_use_demo.py
Phase 2 — Tool Integration
API Reference
All four tools live in nexus/computer_use.py and follow the same contract:
result = tool(...)
# result is always:
# {"ok": bool, "tool": str, ...fields..., "screenshot": path_or_None}
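Because every tool returns the same dictionary shape, callers can centralise the failure check in one place. The following is a minimal sketch, not part of nexus/computer_use.py: `ComputerUseError` and `require_ok` are hypothetical names, and the `error` key used for the explanation text is an assumption (the contract above only guarantees `ok`, `tool`, and `screenshot`).

```python
class ComputerUseError(RuntimeError):
    """Raised when a computer-use tool reports ok=False (hypothetical helper)."""


def require_ok(result: dict) -> dict:
    """Return *result* unchanged if it succeeded, otherwise raise."""
    if not result.get("ok", False):
        tool = result.get("tool", "<unknown tool>")
        # "error" is an assumed field name for the explanation text.
        reason = result.get("error", "no explanation provided")
        raise ComputerUseError(f"{tool} failed: {reason}")
    return result
```

Wrapping calls as `require_ok(computer_click(960, 540))` would turn a refused or failed action into an exception instead of a silently ignored `ok=False`.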
computer_screenshot(output_path=None)
Take a screenshot of the current desktop.
| Parameter | Type | Default | Description |
|---|---|---|---|
| output_path | str or None | auto-timestamped PNG | Where to save the captured image. |
from nexus.computer_use import computer_screenshot
result = computer_screenshot()
# {"ok": True, "tool": "computer_screenshot", "path": "~/.nexus/nexus_snap_1712345678.png"}
computer_click(x, y, *, button="left", confirm=False)
Click at screen coordinates.
| Parameter | Type | Default | Description |
|---|---|---|---|
| x, y | int | required | Screen pixel coordinates. |
| button | str | "left" | "left", "right", or "middle". |
| confirm | bool | False | Required for right/middle clicks. |
from nexus.computer_use import computer_click
# Simple left click (no confirm needed)
result = computer_click(960, 540)
# Right-click requires explicit confirmation
result = computer_click(960, 540, button="right", confirm=True)
Screenshot evidence: before/after snapshots are captured and logged.
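The before/after evidence pattern can be sketched independently of pyautogui. This is an illustration under assumptions, not the module's actual code: `with_evidence` is a hypothetical name, and both the action and the capture function are injected so the sketch runs without a display. In real use, `capture` could wrap computer_screenshot().

```python
import time
from typing import Callable


def with_evidence(action: Callable[[], None],
                  capture: Callable[[str], str],
                  label: str) -> dict:
    """Run *action* between two screenshots and return both paths.

    *capture* takes a filename stem and returns the path it saved to.
    """
    stamp = int(time.time())
    before = capture(f"before_{label}_{stamp}")
    action()
    after = capture(f"after_{label}_{stamp}")
    # Field names mirror the action-log record format in this document.
    return {"ok": True, "before_screenshot": before, "screenshot": after}
```

Taking the "before" capture first means the evidence pair brackets the action even if the action itself changes focus or window layout.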
computer_type(text, *, confirm=False)
Type a string using keyboard simulation.
| Parameter | Type | Default | Description |
|---|---|---|---|
| text | str | required | Text to type. |
| confirm | bool | False | Required when text contains password/token/key. |
from nexus.computer_use import computer_type
# Safe text — no confirm needed
computer_type("https://forge.alexanderwhitestone.com")
# Text containing a sensitive keyword (password/token/key) is refused without confirm=True
computer_type("password: hunter2", confirm=True)
computer_scroll(x, y, amount)
Scroll the mouse wheel at the given position.
| Parameter | Type | Description |
|---|---|---|
| x, y | int | Move mouse here before scrolling. |
| amount | int | Positive = scroll up; negative = scroll down. |
from nexus.computer_use import computer_scroll
computer_scroll(640, 400, -5) # scroll down 5 clicks
computer_scroll(640, 400, 3) # scroll up 3 clicks
Safety (Poka-Yoke)
| Situation | Behavior |
|---|---|
| right/middle click w/o confirm | Refused; returns ok=False with explanation |
| Text with password/token/key | Refused unless confirm=True |
| FAILSAFE = True | Move mouse to screen corner (0, 0) to abort |
| pyautogui unavailable | All tools return ok=False gracefully |
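The first two rows of the table amount to two small predicates. The sketch below shows one way such guards could look. It is an assumption about the logic, not the code in nexus/computer_use.py: the function names are hypothetical, and case-insensitive substring matching is a choice made for this example.

```python
SENSITIVE_KEYWORDS = ("password", "token", "key")  # keywords named in this document
CONFIRM_BUTTONS = ("right", "middle")


def click_allowed(button: str, confirm: bool = False) -> bool:
    """Left clicks always pass; right/middle clicks need confirm=True."""
    return button not in CONFIRM_BUTTONS or confirm


def type_allowed(text: str, confirm: bool = False) -> bool:
    """Text mentioning a sensitive keyword needs confirm=True.

    Case-insensitive substring matching is an assumption of this sketch.
    """
    lowered = text.lower()
    flagged = any(word in lowered for word in SENSITIVE_KEYWORDS)
    return not flagged or confirm
```

Plain substring matching errs on the side of refusal: a harmless word such as "monkey" contains "key" and would also require confirm=True under this sketch.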
Action Log
Every call is appended to ~/.nexus/computer_use_log.jsonl (one JSON record per line):
{"ok": true, "tool": "computer_click", "x": 960, "y": 540, "button": "left",
"before_screenshot": "/home/user/.nexus/before_click_1712345.png",
"screenshot": "/home/user/.nexus/after_click_1712345.png",
"ts": "2026-04-08T10:30:00+00:00"}
Read recent entries from Python:
from nexus.computer_use import read_action_log
for record in read_action_log(last_n=10):
    print(record)
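For callers that cannot import the nexus package, the JSONL format is simple enough to read directly. The following is a hedged sketch of a tail reader; `tail_jsonl` is a hypothetical helper and is not claimed to match how read_action_log is implemented.

```python
import json
from pathlib import Path


def tail_jsonl(path: Path, last_n: int = 10) -> list:
    """Return the last *last_n* records of a JSONL file, oldest first."""
    if not path.exists():
        return []
    records = []
    with path.open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            try:
                records.append(json.loads(line))
            except json.JSONDecodeError:
                continue  # skip a partially written or corrupt line
    return records[-last_n:] if last_n > 0 else []
```

Skipping undecodable lines keeps the reader usable while another process is still appending to the log.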
Phase 3 — Use-Case Pilots
Pilot 1: Visual Regression Test (Fleet Dashboard)
Open the fleet health dashboard, take a screenshot, and compare a hash of the saved PNG file against a golden baseline. The comparison is byte-exact, so any rendering difference counts as a regression:
import hashlib
import time
from nexus.computer_use import computer_click, computer_screenshot, computer_type
def screenshot_hash(path: str) -> str:
    """Return the MD5 hex digest of the file's bytes (change detection only)."""
    with open(path, "rb") as fh:
        return hashlib.md5(fh.read()).hexdigest()
# Navigate to the dashboard
computer_click(960, 40)  # address bar
computer_type("http://localhost:7771/health\n")
time.sleep(2)  # give the page time to render
result = computer_screenshot()
current_hash = screenshot_hash(result["path"])
GOLDEN_HASH = "abc123..."  # established on first run
assert current_hash == GOLDEN_HASH, "Visual regression detected!"
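The GOLDEN_HASH placeholder is described as "established on first run". One way to automate that step is to persist the digest in a small file next to the test. This is a sketch under assumptions: `matches_baseline` and the baseline file are hypothetical, and SHA-256 is swapped in for MD5 purely as a habit (either works for change detection).

```python
import hashlib
from pathlib import Path


def file_sha256(path: Path) -> str:
    """Hash the file's bytes; any byte-level change alters the digest."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def matches_baseline(screenshot: Path, baseline_file: Path) -> bool:
    """Compare *screenshot* to the stored digest, recording it on first run."""
    current = file_sha256(screenshot)
    if not baseline_file.exists():
        baseline_file.write_text(current, encoding="utf-8")
        return True  # first run establishes the golden baseline
    return baseline_file.read_text(encoding="utf-8").strip() == current
```

A byte-exact digest flags every difference, including a blinking cursor or a clock on the page, so a tolerance-based image comparison may be a better fit for dashboards with live content.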
Pilot 2: Screenshot-Based CI Diagnosis
When a CI workflow fails, agents can screenshot the Gitea workflow page and use the image to triage:
import time
from nexus.computer_use import computer_click, computer_screenshot, computer_type
def diagnose_failed_workflow(run_url: str) -> str:
    """
    Navigate to *run_url*, screenshot it, and return the screenshot path
    for downstream LLM-based analysis.
    """
    computer_click(960, 40)  # address bar
    computer_type(run_url + "\n")
    time.sleep(3)  # wait for the page to load
    result = computer_screenshot()
    return result["path"]  # hand off to vision model or OCR
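A fixed sleep either wastes time or fires before a slow page has rendered. An alternative is to poll until two consecutive screenshots look identical. The helper below is a hypothetical sketch: `wait_until_stable` is not part of nexus/computer_use.py, and the `fingerprint` callable is injected so the example does not depend on a display. In practice it could hash the file written by computer_screenshot().

```python
import time
from typing import Callable


def wait_until_stable(fingerprint: Callable[[], str],
                      timeout: float = 10.0,
                      interval: float = 0.5) -> bool:
    """Poll *fingerprint* until two consecutive values match.

    Returns True once the screen stops changing, False on timeout.
    """
    deadline = time.monotonic() + timeout
    previous = fingerprint()
    while time.monotonic() < deadline:
        time.sleep(interval)
        current = fingerprint()
        if current == previous:
            return True
        previous = current
    return False
```

Pages with animations or live counters may never stabilise, so the timeout result should be treated as "capture anyway" rather than as a failure.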
MCP Server (External Callers)
The lower-level MCP server (mcp_servers/desktop_control_server.py) exposes
the same capabilities over JSON-RPC stdio for callers outside Python:
# List available tools
echo '{"jsonrpc":"2.0","id":1,"method":"tools/list","params":{}}' \
| python mcp_servers/desktop_control_server.py
# Take a screenshot
echo '{"jsonrpc":"2.0","id":2,"method":"tools/call",
"params":{"name":"take_screenshot","arguments":{"path":"/tmp/snap.png"}}}' \
| python mcp_servers/desktop_control_server.py
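The same request can be built programmatically instead of hand-writing JSON in the shell. A minimal sketch follows; `build_tool_call` is a hypothetical helper, and emitting the request as a single line is an assumption about what a stdio server handles most reliably. The tool name and arguments are taken from the shell example above.

```python
import json


def build_tool_call(name: str, arguments: dict, request_id: int = 1) -> str:
    """Serialise a JSON-RPC 2.0 tools/call request as one line of JSON."""
    request = {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": name, "arguments": arguments},
    }
    # json.dumps emits no newlines by default, so the result is one line.
    return json.dumps(request)
```

Printing `build_tool_call("take_screenshot", {"path": "/tmp/snap.png"}, request_id=2)` and piping the output into the server reproduces the second shell command.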
Docker / Sandboxed Environment
docker-compose.desktop.yml provides a safe container with:
- Xvfb virtual display (1280×800)
- noVNC for browser-based viewing
- Python + pyautogui pre-installed
docker-compose -f docker-compose.desktop.yml up
# noVNC → http://localhost:6080
# Run demo inside container:
docker exec -it nexus-desktop python nexus/computer_use_demo.py
Development Notes
- NEXUS_HOME env var overrides the log/snapshot directory (default ~/.nexus)
- GITEA_URL env var overrides the target in the demo script
- BROWSER_OPEN_WAIT controls how long the demo waits after opening the browser
- Tests in tests/test_computer_use.py run headless; pyautogui is fully mocked
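The environment overrides can be resolved in one place. The sketch below is illustrative only: `DemoSettings` and `load_settings` are hypothetical names, and treating BROWSER_OPEN_WAIT as a number of seconds with a default of 3 is an assumption inferred from the "Waiting 3s" line in the demo output. The other two defaults come from this document.

```python
import os
from pathlib import Path
from typing import Mapping, NamedTuple


class DemoSettings(NamedTuple):
    nexus_home: Path
    gitea_url: str
    browser_open_wait: float


def load_settings(env: Mapping[str, str] = os.environ) -> DemoSettings:
    """Resolve the three overrides, falling back to defaults."""
    return DemoSettings(
        nexus_home=Path(env.get("NEXUS_HOME", "~/.nexus")).expanduser(),
        gitea_url=env.get("GITEA_URL", "https://forge.alexanderwhitestone.com"),
        # Assumed unit and default: seconds, 3 (from the demo output).
        browser_open_wait=float(env.get("BROWSER_OPEN_WAIT", "3")),
    )
```

Passing the environment as a mapping keeps the function easy to test, for example `load_settings({"GITEA_URL": "http://localhost:3000"})`.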