the-nexus/docs/computer-use.md
Alexander Whitestone a3a28aa4c2
feat: add desktop automation primitives to Hermes (#1125)
Implements Phase 1 & 2 of the [COMPUTER_USE] epic:

- nexus/computer_use.py — four Hermes tools with safety guards and
  JSONL action logging:
    computer_screenshot(), computer_click(), computer_type(), computer_scroll()
  Poka-yoke: right/middle clicks require confirm=True; text containing
  password/token/key keywords is refused without confirm=True.
  pyautogui.FAILSAFE=True enabled globally (corner-abort).

- nexus/computer_use_demo.py — end-to-end Phase 1 demo: baseline
  screenshot → open browser → navigate to Gitea → evidence screenshot.

- tests/test_computer_use.py — 29 unit tests, fully headless (pyautogui
  mocked); all pass.

- docs/computer-use.md — full Phase 1–3 documentation including API
  reference, safety table, action-log format, and pilot recipes.

- docker-compose.desktop.yml — sandboxed Xvfb + noVNC container for
  safe headless desktop automation.

The existing mcp_servers/desktop_control_server.py is unchanged; it
remains available for external/MCP callers (Bannerlord harness etc).

Fixes #1125
2026-04-08 06:29:27 -04:00


Computer Use — Desktop Automation Primitives for Hermes

Issue: #1125
Status: Phase 1 complete, Phase 2 in progress
Owner: Bezalel
Epic: #1120


Overview

This document describes how Hermes agents can control a desktop environment (screenshot, click, type, scroll) for automation and testing. The capability unlocks:

  • Visual regression testing of fleet dashboards
  • Automated Gitea workflow verification
  • Screenshot-based incident diagnosis
  • Driving GUI-only tools from agent code

Architecture

┌──────────────────────────────────────────────────────┐
│                    Hermes Agent                      │
│                                                      │
│   computer_screenshot()  computer_click(x, y)        │
│   computer_type(text)    computer_scroll(x, y, n)    │
│                      │                               │
│            nexus/computer_use.py                     │
│          (safety guards · action log)                │
└────────────────────────┬─────────────────────────────┘
                         │
              ┌──────────┴───────────┐
              │       pyautogui      │
              │  (FAILSAFE enabled)  │
              └──────────┬───────────┘
                         │
              ┌──────────┴───────────┐
              │  Desktop environment │
              │  (Xvfb · noVNC ·     │
              │   bare metal)        │
              └──────────────────────┘

The MCP server layer (mcp_servers/desktop_control_server.py) is still available for external callers (e.g. the Bannerlord harness). The nexus/computer_use.py module calls pyautogui directly so that safety guards, logging, and screenshot evidence are applied consistently for every Hermes agent invocation.


Phase 1 — Environment & Primitives

Sandboxed Desktop Setup

Option A — Xvfb (lightweight, Linux/macOS)

# Install
sudo apt-get install xvfb   # Linux
brew install --cask xquartz  # macOS (Xvfb ships with XQuartz)

# Start a virtual display on :99
Xvfb :99 -screen 0 1280x800x24 &
export DISPLAY=:99

# Run the demo
python nexus/computer_use_demo.py

Option B — Docker with noVNC

docker-compose -f docker-compose.desktop.yml up
# Open http://localhost:6080 to view the virtual desktop

See docker-compose.desktop.yml in the repo root.

Running the Demo

The nexus/computer_use_demo.py script exercises the full Phase 1 loop:

# Default target (Gitea forge)
python nexus/computer_use_demo.py

# Custom URL
GITEA_URL=http://localhost:3000 python nexus/computer_use_demo.py

Expected output:

[1/4] Capturing baseline screenshot
[2/4] Opening browser → https://forge.alexanderwhitestone.com
[3/4] Waiting 3s for page to load
[4/4] Capturing evidence screenshot

Phase 2 — Tool Integration

API Reference

All four tools live in nexus/computer_use.py and follow the same contract:

result = tool(...)
# result is always:
# {"ok": bool, "tool": str, ...fields..., "screenshot": path_or_None}
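
Because every tool returns the same envelope, callers can handle failures uniformly instead of checking each tool separately. A small helper like the following (hypothetical — not part of the module) turns a failed envelope into an exception:

```python
def require_ok(result: dict) -> dict:
    """Raise if a computer-use tool reported failure; otherwise pass it through."""
    if not result.get("ok"):
        raise RuntimeError(
            f"{result.get('tool', 'unknown tool')} failed: "
            f"{result.get('error', 'no detail provided')}"
        )
    return result

# Stubbed result dict in the documented envelope shape:
ok_result = {"ok": True, "tool": "computer_screenshot", "path": "/tmp/snap.png"}
path = require_ok(ok_result)["path"]   # "/tmp/snap.png"
```

This keeps agent code linear: chain `require_ok(...)` around each call and let one exception handler deal with refusals and missing pyautogui alike.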

computer_screenshot(output_path=None)

Take a screenshot of the current desktop.

| Parameter   | Type        | Default              | Description                       |
|-------------|-------------|----------------------|-----------------------------------|
| output_path | str or None | auto-timestamped PNG | Where to save the captured image. |

from nexus.computer_use import computer_screenshot

result = computer_screenshot()
# {"ok": True, "tool": "computer_screenshot", "path": "~/.nexus/nexus_snap_1712345678.png"}

computer_click(x, y, *, button="left", confirm=False)

Click at screen coordinates.

| Parameter | Type | Default  | Description                       |
|-----------|------|----------|-----------------------------------|
| x, y      | int  | required | Screen pixel coordinates.         |
| button    | str  | "left"   | "left", "right", or "middle".     |
| confirm   | bool | False    | Required for right/middle clicks. |

from nexus.computer_use import computer_click

# Simple left click (no confirm needed)
result = computer_click(960, 540)

# Right-click requires explicit confirmation
result = computer_click(960, 540, button="right", confirm=True)

Screenshot evidence: before/after snapshots are captured and logged.

computer_type(text, *, confirm=False)

Type a string using keyboard simulation.

| Parameter | Type | Default  | Description                                     |
|-----------|------|----------|-------------------------------------------------|
| text      | str  | required | Text to type.                                   |
| confirm   | bool | False    | Required when text contains password/token/key. |

from nexus.computer_use import computer_type

# Safe text — no confirm needed
computer_type("https://forge.alexanderwhitestone.com")

# Sensitive text — confirm required
computer_type("hunter2", confirm=True)

computer_scroll(x, y, amount)

Scroll the mouse wheel at the given position.

| Parameter | Type | Description                                   |
|-----------|------|-----------------------------------------------|
| x, y      | int  | Move mouse here before scrolling.             |
| amount    | int  | Positive = scroll up; negative = scroll down. |

from nexus.computer_use import computer_scroll

computer_scroll(640, 400, -5)   # scroll down 5 clicks
computer_scroll(640, 400, 3)    # scroll up 3 clicks

Safety (Poka-Yoke)

| Situation                      | Behavior                                    |
|--------------------------------|---------------------------------------------|
| right/middle click w/o confirm | Refused; returns ok=False with explanation  |
| Text with password/token/key   | Refused unless confirm=True                 |
| FAILSAFE = True                | Move mouse to screen corner (0, 0) to abort |
| pyautogui unavailable          | All tools return ok=False gracefully        |
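
The sensitive-text row describes only the observable behavior; the check itself is easy to picture. As a minimal sketch (the keyword list comes from the table above, the function name is hypothetical):

```python
SENSITIVE_KEYWORDS = ("password", "token", "key")   # per the safety table

def check_type_allowed(text: str, confirm: bool = False) -> dict:
    """Return a refusal envelope when sensitive text is typed without confirm=True."""
    lowered = text.lower()
    hits = [kw for kw in SENSITIVE_KEYWORDS if kw in lowered]
    if hits and not confirm:
        return {"ok": False, "tool": "computer_type",
                "error": f"text contains {hits!r}; pass confirm=True to proceed"}
    return {"ok": True, "tool": "computer_type"}
```

Note the check is substring-based, so it errs toward refusing (e.g. "keyboard" contains "key"); that is the poka-yoke intent — the agent can always retry with `confirm=True`.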

Action Log

Every call is appended to ~/.nexus/computer_use_log.jsonl (one JSON record per line):

{"ok": true, "tool": "computer_click", "x": 960, "y": 540, "button": "left",
 "before_screenshot": "/home/user/.nexus/before_click_1712345.png",
 "screenshot": "/home/user/.nexus/after_click_1712345.png",
 "ts": "2026-04-08T10:30:00+00:00"}

Read recent entries from Python:

from nexus.computer_use import read_action_log
for record in read_action_log(last_n=10):
    print(record)
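
Because the log is plain JSONL, it is also easy to post-process without the module at all, using only the standard library. For example, a quick per-tool failure summary over raw log lines (a sketch; the field names follow the record format shown above):

```python
import json
from collections import Counter

def summarize_failures(lines):
    """Count failed actions per tool from an iterable of JSONL records."""
    failures = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue                      # tolerate blank lines
        record = json.loads(line)
        if not record.get("ok"):
            failures[record.get("tool", "unknown")] += 1
    return failures

sample = [
    '{"ok": true, "tool": "computer_screenshot"}',
    '{"ok": false, "tool": "computer_click"}',
    '{"ok": false, "tool": "computer_click"}',
]
print(summarize_failures(sample))   # Counter({'computer_click': 2})
```

In practice the `lines` iterable would be `open(Path.home() / ".nexus" / "computer_use_log.jsonl")`.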

Phase 3 — Use-Case Pilots

Pilot 1: Visual Regression Test (Fleet Dashboard)

Open the fleet health dashboard, take a screenshot, compare pixel-level hashes against a golden baseline:

from nexus.computer_use import computer_screenshot, computer_click, computer_type
import hashlib
import time

def screenshot_hash(path: str) -> str:
    return hashlib.md5(open(path, "rb").read()).hexdigest()

# Navigate to the dashboard
computer_click(960, 40)   # address bar
computer_type("http://localhost:7771/health\n")

time.sleep(2)

result = computer_screenshot()
current_hash = screenshot_hash(result["path"])

GOLDEN_HASH = "abc123..."   # established on first run
assert current_hash == GOLDEN_HASH, "Visual regression detected!"
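
A single hard-coded GOLDEN_HASH gets unwieldy once several dashboards are under test. One stdlib-only approach (hypothetical helper, not part of the module) is a small JSON baseline store that records a screen's hash on first sight and flags changes afterwards:

```python
import json
from pathlib import Path

def check_baseline(name: str, current_hash: str, store_path: str) -> str:
    """Return 'new', 'pass', or 'regression' against a JSON baseline file."""
    store_file = Path(store_path)
    store = json.loads(store_file.read_text()) if store_file.exists() else {}
    if name not in store:
        store[name] = current_hash   # first run establishes the golden hash
        store_file.write_text(json.dumps(store, indent=2))
        return "new"
    return "pass" if store[name] == current_hash else "regression"
```

Committing the baseline file to the repo turns "established on first run" into a reviewable artifact. Keep in mind that an exact pixel hash is brittle (a clock widget or antialiasing difference flips it); a perceptual-diff step would be a natural follow-up.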

Pilot 2: Screenshot-Based CI Diagnosis

When a CI workflow fails, agents can screenshot the Gitea workflow page and use the image to triage:

from nexus.computer_use import computer_screenshot, computer_click, computer_type
import time

def diagnose_failed_workflow(run_url: str) -> str:
    """
    Navigate to *run_url*, screenshot it, return the screenshot path
    for downstream LLM-based analysis.
    """
    computer_click(960, 40)   # address bar
    computer_type(run_url + "\n")

    time.sleep(3)

    result = computer_screenshot()
    return result["path"]   # hand off to vision model or OCR

MCP Server (External Callers)

The lower-level MCP server (mcp_servers/desktop_control_server.py) exposes the same capabilities over JSON-RPC stdio for callers outside Python:

# List available tools
echo '{"jsonrpc":"2.0","id":1,"method":"tools/list","params":{}}' \
  | python mcp_servers/desktop_control_server.py

# Take a screenshot
echo '{"jsonrpc":"2.0","id":2,"method":"tools/call",
       "params":{"name":"take_screenshot","arguments":{"path":"/tmp/snap.png"}}}' \
  | python mcp_servers/desktop_control_server.py
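
The same requests can be issued from Python via subprocess. This is an illustrative sketch, not a tested client: the request shape and the take_screenshot tool name come from the shell examples above, while the helper names are made up here.

```python
import json
import subprocess

def build_rpc(method: str, params: dict, req_id: int = 1) -> str:
    """Serialize a JSON-RPC 2.0 request for the stdio MCP server."""
    return json.dumps({"jsonrpc": "2.0", "id": req_id,
                       "method": method, "params": params})

def call_server(request: str) -> dict:
    """Pipe one request into the MCP server and parse the first reply line."""
    proc = subprocess.run(
        ["python", "mcp_servers/desktop_control_server.py"],
        input=request, capture_output=True, text=True, timeout=30,
    )
    return json.loads(proc.stdout.splitlines()[0])

req = build_rpc("tools/call",
                {"name": "take_screenshot",
                 "arguments": {"path": "/tmp/snap.png"}}, req_id=2)
# call_server(req) would return the server's JSON-RPC response
```

Spawning one process per request is fine for occasional calls; a long-lived client would keep the server's stdin/stdout open instead.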

Docker / Sandboxed Environment

docker-compose.desktop.yml provides a safe container with:

  • Xvfb virtual display (1280×800)
  • noVNC for browser-based viewing
  • Python + pyautogui pre-installed

docker-compose -f docker-compose.desktop.yml up
# noVNC → http://localhost:6080
# Run demo inside container:
docker exec -it nexus-desktop python nexus/computer_use_demo.py

Development Notes

  • NEXUS_HOME env var overrides the log/snapshot directory (default ~/.nexus)
  • GITEA_URL env var overrides the target in the demo script
  • BROWSER_OPEN_WAIT controls how long the demo waits after opening the browser
  • Tests in tests/test_computer_use.py run headless — pyautogui is fully mocked
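
The environment variables above suggest a resolution pattern like the following (a sketch; the actual defaults inside nexus/computer_use.py and the demo script may differ):

```python
import os
from pathlib import Path

def resolve_config(env=None) -> dict:
    """Resolve computer-use settings from the environment with fallback defaults."""
    env = os.environ if env is None else env
    return {
        # NEXUS_HOME overrides the log/snapshot directory (default ~/.nexus)
        "nexus_home": Path(env.get("NEXUS_HOME", str(Path.home() / ".nexus"))),
        # GITEA_URL overrides the demo target (assumed default from the demo output)
        "gitea_url": env.get("GITEA_URL", "https://forge.alexanderwhitestone.com"),
        # BROWSER_OPEN_WAIT: seconds to wait after opening the browser
        "browser_wait": float(env.get("BROWSER_OPEN_WAIT", "3")),
    }

cfg = resolve_config({"GITEA_URL": "http://localhost:3000"})
print(cfg["gitea_url"])   # http://localhost:3000
```

Passing the mapping explicitly (rather than always reading os.environ) is what lets the headless tests exercise configuration without touching the real environment.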