# Computer Use — Desktop Automation Primitives for Hermes

**Issue:** #1125
**Status:** Phase 1 complete, Phase 2 in progress
**Owner:** Bezalel
**Epic:** #1120

---

## Overview

This document describes how Hermes agents can control a desktop environment (screenshot, click, type, scroll) for automation and testing. The capability unlocks:

- Visual regression testing of fleet dashboards
- Automated Gitea workflow verification
- Screenshot-based incident diagnosis
- Driving GUI-only tools from agent code

---

## Architecture

```
┌────────────────────────────────────────────────────┐
│                    Hermes Agent                    │
│                                                    │
│  computer_screenshot()    computer_click(x, y)     │
│  computer_type(text)      computer_scroll(x, y, n) │
│                                                    │
│               nexus/computer_use.py                │
│            (safety guards · action log)            │
└──────────────────────────┬─────────────────────────┘
                           │
                ┌──────────┴──────────┐
                │      pyautogui      │
                │ (FAILSAFE enabled)  │
                └──────────┬──────────┘
                           │
                ┌──────────┴──────────┐
                │ Desktop environment │
                │  (Xvfb · noVNC ·    │
                │     bare metal)     │
                └─────────────────────┘
```

The MCP server layer (`mcp_servers/desktop_control_server.py`) remains available for external callers (e.g. the Bannerlord harness). The `nexus/computer_use.py` module calls pyautogui directly so that safety guards, logging, and screenshot evidence are applied consistently to every Hermes agent invocation.

---

## Phase 1 — Environment & Primitives

### Sandboxed Desktop Setup

**Option A — Xvfb (lightweight, Linux/macOS)**

```bash
# Install
sudo apt-get install xvfb        # Linux
brew install --cask xquartz      # macOS (Xvfb ships with XQuartz)

# Start a virtual display on :99
Xvfb :99 -screen 0 1280x800x24 &
export DISPLAY=:99

# Run the demo
python nexus/computer_use_demo.py
```

**Option B — Docker with noVNC**

```bash
docker-compose -f docker-compose.desktop.yml up
# Open http://localhost:6080 to view the virtual desktop
```

See `docker-compose.desktop.yml` in the repo root.
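Before running the demo, it can be worth checking that the environment variables the Xvfb setup above exports are actually in place. A minimal sketch — the `preflight` helper is hypothetical and not part of the repo:

```python
import os

def preflight(env=None):
    """Return a list of setup problems; an empty list means ready to run.

    Hypothetical helper: it only checks what the Xvfb setup above exports.
    Pass a dict to test against something other than the real environment.
    """
    env = os.environ if env is None else env
    problems = []
    if not env.get("DISPLAY"):
        problems.append("DISPLAY is not set (e.g. `export DISPLAY=:99`)")
    return problems
```

Running `preflight()` with no arguments inspects the real environment, which is convenient as a guard at the top of automation scripts.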
### Running the Demo

The `nexus/computer_use_demo.py` script exercises the full Phase 1 loop:

```
[1/4] Capturing baseline screenshot
[2/4] Opening browser → https://forge.alexanderwhitestone.com
[3/4] Waiting 3s for page to load
[4/4] Capturing evidence screenshot
```

```bash
# Default target (Gitea forge)
python nexus/computer_use_demo.py

# Custom URL
GITEA_URL=http://localhost:3000 python nexus/computer_use_demo.py
```

---

## Phase 2 — Tool Integration

### API Reference

All four tools live in `nexus/computer_use.py` and follow the same contract:

```python
result = tool(...)
# result is always:
# {"ok": bool, "tool": str, ...fields..., "screenshot": path_or_None}
```

#### `computer_screenshot(output_path=None)`

Take a screenshot of the current desktop.

| Parameter     | Type            | Default              | Description                       |
|---------------|-----------------|----------------------|-----------------------------------|
| `output_path` | `str` or `None` | auto-timestamped PNG | Where to save the captured image. |

```python
from nexus.computer_use import computer_screenshot

result = computer_screenshot()
# {"ok": True, "tool": "computer_screenshot",
#  "path": "~/.nexus/nexus_snap_1712345678.png"}
```

#### `computer_click(x, y, *, button="left", confirm=False)`

Click at screen coordinates.

| Parameter | Type   | Default  | Description                           |
|-----------|--------|----------|---------------------------------------|
| `x`, `y`  | `int`  | required | Screen pixel coordinates.             |
| `button`  | `str`  | `"left"` | `"left"`, `"right"`, or `"middle"`.   |
| `confirm` | `bool` | `False`  | Required for `right`/`middle` clicks. |

```python
from nexus.computer_use import computer_click

# Simple left click (no confirm needed)
result = computer_click(960, 540)

# Right-click requires explicit confirmation
result = computer_click(960, 540, button="right", confirm=True)
```

**Screenshot evidence:** before/after snapshots are captured and logged.
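The shared result contract above can be checked mechanically when handling tool output. A sketch of a hypothetical validator — `is_valid_result` is illustrative and not part of `nexus/computer_use.py`:

```python
def is_valid_result(result: dict) -> bool:
    """Check the common fields every tool result carries: ok, tool, screenshot.

    Hypothetical validator based on the documented contract; the fields
    beyond these three vary per tool (e.g. x/y for clicks, path for
    screenshots).
    """
    return (
        isinstance(result.get("ok"), bool)
        and isinstance(result.get("tool"), str)
        and "screenshot" in result
    )

# Example shapes matching the documented contract
ok = {"ok": True, "tool": "computer_click", "x": 960, "y": 540, "screenshot": None}
assert is_valid_result(ok)
assert not is_valid_result({"tool": "computer_click"})  # missing ok/screenshot
```

A check like this is mostly useful in agent glue code that consumes results from several tools uniformly.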
#### `computer_type(text, *, confirm=False)`

Type a string using keyboard simulation.

| Parameter | Type   | Default  | Description                                           |
|-----------|--------|----------|-------------------------------------------------------|
| `text`    | `str`  | required | Text to type.                                         |
| `confirm` | `bool` | `False`  | Required when text contains `password`/`token`/`key`. |

```python
from nexus.computer_use import computer_type

# Safe text — no confirm needed
computer_type("https://forge.alexanderwhitestone.com")

# Sensitive text — confirm required
computer_type("hunter2", confirm=True)
```

#### `computer_scroll(x, y, amount)`

Scroll the mouse wheel at the given position.

| Parameter | Type  | Description                                   |
|-----------|-------|-----------------------------------------------|
| `x`, `y`  | `int` | Move mouse here before scrolling.             |
| `amount`  | `int` | Positive = scroll up; negative = scroll down. |

```python
from nexus.computer_use import computer_scroll

computer_scroll(640, 400, -5)  # scroll down 5 clicks
computer_scroll(640, 400, 3)   # scroll up 3 clicks
```

### Safety (Poka-Yoke)

| Situation                          | Behavior                                       |
|------------------------------------|------------------------------------------------|
| `right`/`middle` click w/o confirm | Refused; returns `ok=False` with explanation   |
| Text with `password`/`token`/`key` | Refused unless `confirm=True`                  |
| `FAILSAFE = True`                  | Move mouse to screen corner (0, 0) to abort    |
| pyautogui unavailable              | All tools return `ok=False` gracefully         |

### Action Log

Every call is appended to `~/.nexus/computer_use_log.jsonl` (one JSON record per line):

```json
{"ok": true, "tool": "computer_click", "x": 960, "y": 540, "button": "left", "before_screenshot": "/home/user/.nexus/before_click_1712345.png", "screenshot": "/home/user/.nexus/after_click_1712345.png", "ts": "2026-04-08T10:30:00+00:00"}
```

Read recent entries from Python:

```python
from nexus.computer_use import read_action_log

for record in read_action_log(last_n=10):
    print(record)
```

---

## Phase 3 — Use-Case Pilots

### Pilot 1: Visual Regression Test (Fleet Dashboard)

Open the fleet health dashboard, take a screenshot, and compare a pixel-level hash against a golden baseline:

```python
import hashlib
import time

from nexus.computer_use import computer_click, computer_screenshot, computer_type

def screenshot_hash(path: str) -> str:
    with open(path, "rb") as f:
        return hashlib.md5(f.read()).hexdigest()

# Navigate to the dashboard
computer_click(960, 40)  # address bar
computer_type("http://localhost:7771/health\n")
time.sleep(2)

result = computer_screenshot()
current_hash = screenshot_hash(result["path"])

GOLDEN_HASH = "abc123..."  # established on first run
assert current_hash == GOLDEN_HASH, "Visual regression detected!"
```

### Pilot 2: Screenshot-Based CI Diagnosis

When a CI workflow fails, agents can screenshot the Gitea workflow page and use the image to triage:

```python
import time

from nexus.computer_use import computer_click, computer_screenshot, computer_type

def diagnose_failed_workflow(run_url: str) -> str:
    """
    Navigate to *run_url*, screenshot it, and return the screenshot path
    for downstream LLM-based analysis.
    """
    computer_click(960, 40)  # address bar
    computer_type(run_url + "\n")
    time.sleep(3)
    result = computer_screenshot()
    return result["path"]  # hand off to vision model or OCR
```

---

## MCP Server (External Callers)

The lower-level MCP server (`mcp_servers/desktop_control_server.py`) exposes the same capabilities over JSON-RPC stdio for callers outside Python:

```bash
# List available tools
echo '{"jsonrpc":"2.0","id":1,"method":"tools/list","params":{}}' \
  | python mcp_servers/desktop_control_server.py

# Take a screenshot
echo '{"jsonrpc":"2.0","id":2,"method":"tools/call",
       "params":{"name":"take_screenshot","arguments":{"path":"/tmp/snap.png"}}}' \
  | python mcp_servers/desktop_control_server.py
```

---

## Docker / Sandboxed Environment

`docker-compose.desktop.yml` provides a safe container with:

- Xvfb virtual display (1280×800)
- noVNC for browser-based viewing
- Python + pyautogui pre-installed

```bash
docker-compose -f docker-compose.desktop.yml up
# noVNC → http://localhost:6080

# Run demo inside container:
docker exec -it nexus-desktop python nexus/computer_use_demo.py
```

---

## Development Notes

- `NEXUS_HOME` env var overrides the log/snapshot directory (default `~/.nexus`)
- `GITEA_URL` env var overrides the target in the demo script
- `BROWSER_OPEN_WAIT` controls how long the demo waits after opening the browser
- Tests in `tests/test_computer_use.py` run headless — pyautogui is fully mocked