docs/computer-use.md

# Computer Use — Desktop Automation Primitives for Hermes

**Issue:** #1125
**Status:** Phase 1 complete, Phase 2 in progress
**Owner:** Bezalel
**Epic:** #1120

---

## Overview

This document describes how Hermes agents can control a desktop environment
(screenshot, click, type, scroll) for automation and testing.  The capability
unlocks:

- Visual regression testing of fleet dashboards
- Automated Gitea workflow verification
- Screenshot-based incident diagnosis
- Driving GUI-only tools from agent code

---

## Architecture

```
┌──────────────────────────────────────────────────────┐
│                    Hermes Agent                      │
│                                                      │
│   computer_screenshot()  computer_click(x, y)        │
│   computer_type(text)    computer_scroll(x, y, n)    │
│                      │                               │
│            nexus/computer_use.py                     │
│          (safety guards · action log)                │
└────────────────────────┬─────────────────────────────┘
                         │
              ┌──────────┴───────────┐
              │       pyautogui      │
              │  (FAILSAFE enabled)  │
              └──────────┬───────────┘
                         │
              ┌──────────┴───────────┐
              │  Desktop environment │
              │  (Xvfb · noVNC ·    │
              │   bare metal)        │
              └──────────────────────┘
```

The MCP server layer (`mcp_servers/desktop_control_server.py`) is still
available for external callers (e.g. the Bannerlord harness).  The
`nexus/computer_use.py` module calls pyautogui directly so that safety
guards, logging, and screenshot evidence are applied consistently for every
Hermes agent invocation.

---

## Phase 1 — Environment & Primitives

### Sandboxed Desktop Setup

**Option A — Xvfb (lightweight, Linux/macOS)**

```bash
# Install
sudo apt-get install xvfb   # Linux
brew install --cask xquartz  # macOS (XQuartz provides Xvfb)

# Start a virtual display on :99
Xvfb :99 -screen 0 1280x800x24 &
export DISPLAY=:99

# Run the demo
python nexus/computer_use_demo.py
```

**Option B — Docker with noVNC**

```bash
docker-compose -f docker-compose.desktop.yml up
# Open http://localhost:6080 to view the virtual desktop
```

See `docker-compose.desktop.yml` in the repo root.

### Running the Demo

The `nexus/computer_use_demo.py` script exercises the full Phase 1 loop:

```
[1/4] Capturing baseline screenshot
[2/4] Opening browser → https://forge.alexanderwhitestone.com
[3/4] Waiting 3s for page to load
[4/4] Capturing evidence screenshot
```

```bash
# Default target (Gitea forge)
python nexus/computer_use_demo.py

# Custom URL
GITEA_URL=http://localhost:3000 python nexus/computer_use_demo.py
```

---

## Phase 2 — Tool Integration

### API Reference

All four tools live in `nexus/computer_use.py` and follow the same contract:

```python
result = tool(...)
# result is always:
# {"ok": bool, "tool": str, ...fields..., "screenshot": path_or_None}
```
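
Because every tool reports failure through `ok` instead of raising, callers may want a small guard around each call. The helper below is only a sketch and is not part of `nexus/computer_use.py`: the names `ToolError` and `require_ok` are hypothetical, and `error` is an assumed field name for the explanation text.

```python
from typing import Any, Dict


class ToolError(RuntimeError):
    """Raised when a computer-use tool reports ok=False (hypothetical helper)."""


def require_ok(result: Dict[str, Any]) -> Dict[str, Any]:
    """Return the result unchanged if it succeeded, otherwise raise ToolError."""
    if not result.get("ok", False):
        tool = result.get("tool", "<unknown tool>")
        # "error" is an assumed field name; the real explanation key may differ.
        reason = result.get("error", "no explanation provided")
        raise ToolError(f"{tool} failed: {reason}")
    return result


# Stand-in result dicts shaped like the contract above.
ok_result = {"ok": True, "tool": "computer_click", "screenshot": None}
bad_result = {"ok": False, "tool": "computer_click", "error": "confirm required"}

require_ok(ok_result)  # passes through untouched
try:
    require_ok(bad_result)
except ToolError as exc:
    message = str(exc)
```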

#### `computer_screenshot(output_path=None)`

Take a screenshot of the current desktop.

| Parameter     | Type            | Default              | Description                       |
|---------------|-----------------|----------------------|-----------------------------------|
| `output_path` | `str` or `None` | auto timestamped PNG | Where to save the captured image. |

```python
from nexus.computer_use import computer_screenshot

result = computer_screenshot()
# {"ok": True, "tool": "computer_screenshot", "path": "~/.nexus/nexus_snap_1712345678.png"}
```

#### `computer_click(x, y, *, button="left", confirm=False)`

Click at screen coordinates.

| Parameter | Type   | Default  | Description                              |
|-----------|--------|----------|------------------------------------------|
| `x`, `y`  | `int`  | required | Screen pixel coordinates.                |
| `button`  | `str`  | `"left"` | `"left"`, `"right"`, or `"middle"`.      |
| `confirm` | `bool` | `False`  | Required for `right`/`middle` clicks.    |

```python
from nexus.computer_use import computer_click

# Simple left click (no confirm needed)
result = computer_click(960, 540)

# Right-click requires explicit confirmation
result = computer_click(960, 540, button="right", confirm=True)
```

**Screenshot evidence:** before/after snapshots are captured and logged.

#### `computer_type(text, *, confirm=False)`

Type a string using keyboard simulation.

| Parameter | Type   | Default  | Description                                        |
|-----------|--------|----------|----------------------------------------------------|
| `text`    | `str`  | required | Text to type.                                      |
| `confirm` | `bool` | `False`  | Required when text contains `password`/`token`/`key`. |

```python
from nexus.computer_use import computer_type

# Safe text — no confirm needed
computer_type("https://forge.alexanderwhitestone.com")

# Text containing a sensitive keyword ("password"): confirm required
computer_type("password: hunter2", confirm=True)
```

#### `computer_scroll(x, y, amount)`

Scroll the mouse wheel at the given position.

| Parameter | Type  | Description                                     |
|-----------|-------|-------------------------------------------------|
| `x`, `y`  | `int` | Move mouse here before scrolling.               |
| `amount`  | `int` | Positive = scroll up; negative = scroll down.   |

```python
from nexus.computer_use import computer_scroll

computer_scroll(640, 400, -5)   # scroll down 5 clicks
computer_scroll(640, 400, 3)    # scroll up 3 clicks
```

### Safety (Poka-Yoke)

| Situation                          | Behavior                                        |
|------------------------------------|-------------------------------------------------|
| `right`/`middle` click w/o confirm | Refused; returns `ok=False` with explanation    |
| Text with `password`/`token`/`key` | Refused unless `confirm=True`                   |
| `FAILSAFE = True`                  | Move mouse to screen corner (0, 0) to abort     |
| pyautogui unavailable              | All tools return `ok=False` gracefully          |
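
The first two rows can be pictured as pre-checks that run before pyautogui is touched. The sketch below is an assumption about the shape of those checks, not the code in `nexus/computer_use.py`: the function names, the `error` field, and the case-insensitive substring match are all hypothetical.

```python
SENSITIVE_KEYWORDS = ("password", "token", "key")  # keywords from the table above
CONFIRM_BUTTONS = ("right", "middle")


def click_needs_confirm(button: str) -> bool:
    """Right and middle clicks are gated behind confirm=True."""
    return button in CONFIRM_BUTTONS


def text_needs_confirm(text: str) -> bool:
    """Flag text that contains any sensitive keyword, ignoring case."""
    lowered = text.lower()
    return any(keyword in lowered for keyword in SENSITIVE_KEYWORDS)


def guard(needs_confirm: bool, confirm: bool, tool: str) -> dict:
    """Refuse with ok=False when confirmation is needed but was not given."""
    if needs_confirm and not confirm:
        return {"ok": False, "tool": tool, "error": "refused: pass confirm=True"}
    return {"ok": True, "tool": tool}
```

A plain substring match is deliberately conservative: it also flags harmless words such as "monkey" (which contains `key`), so callers typing ordinary text may occasionally need `confirm=True`.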

### Action Log

Every call is appended to `~/.nexus/computer_use_log.jsonl` (one JSON record per line):

```json
{"ok": true, "tool": "computer_click", "x": 960, "y": 540, "button": "left",
 "before_screenshot": "/home/user/.nexus/before_click_1712345.png",
 "screenshot": "/home/user/.nexus/after_click_1712345.png",
 "ts": "2026-04-08T10:30:00+00:00"}
```

Read recent entries from Python:

```python
from nexus.computer_use import read_action_log
for record in read_action_log(last_n=10):
    print(record)
```
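
For consumers that cannot import `nexus`, the log can be read with the standard library alone. The following is a sketch of a JSONL tail reader. It is not the implementation of `read_action_log`, and the name `tail_jsonl` is hypothetical.

```python
import json
from collections import deque
from pathlib import Path
from typing import Any, Dict, List


def tail_jsonl(path: Path, last_n: int = 10) -> List[Dict[str, Any]]:
    """Return the last `last_n` JSON records from a JSONL file, oldest first."""
    if not path.exists():
        return []
    with path.open("r", encoding="utf-8") as handle:
        # deque(maxlen=...) keeps only the trailing lines while streaming,
        # so the whole log never has to sit in memory at once.
        lines = deque((line for line in handle if line.strip()), maxlen=last_n)
    return [json.loads(line) for line in lines]
```

With the default log location this might be called as `tail_jsonl(Path.home() / ".nexus" / "computer_use_log.jsonl", last_n=5)`.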

---

## Phase 3 — Use-Case Pilots

### Pilot 1: Visual Regression Test (Fleet Dashboard)

Open the fleet health dashboard, take a screenshot, and compare a hash of the
image file against a golden baseline:

```python
import hashlib
import time

from nexus.computer_use import computer_click, computer_screenshot, computer_type


def screenshot_hash(path: str) -> str:
    """Return the MD5 hex digest of the file at *path*."""
    with open(path, "rb") as fh:
        return hashlib.md5(fh.read()).hexdigest()


# Navigate to the dashboard
computer_click(960, 40)   # address bar
computer_type("http://localhost:7771/health\n")
time.sleep(2)             # give the page time to render

result = computer_screenshot()
current_hash = screenshot_hash(result["path"])

GOLDEN_HASH = "abc123..."   # established on first run
assert current_hash == GOLDEN_HASH, "Visual regression detected!"
```
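
The `GOLDEN_HASH` value above has to come from somewhere. One option, sketched here under the assumption that the baseline lives in a small text file next to the tests, is to record the hash on the first run and compare on later runs. The helper names and file layout are hypothetical, and SHA-256 is used in place of MD5.

```python
import hashlib
from pathlib import Path


def file_sha256(path: Path) -> str:
    """Return the SHA-256 hex digest of a file's bytes."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def check_against_golden(screenshot: Path, golden_file: Path) -> bool:
    """Return True if the screenshot matches the stored baseline.

    On the first run (no baseline file yet) the current hash is recorded
    and the check passes.
    """
    current = file_sha256(screenshot)
    if not golden_file.exists():
        golden_file.write_text(current + "\n", encoding="utf-8")
        return True
    return golden_file.read_text(encoding="utf-8").strip() == current
```

A byte-exact hash fails on any difference at all, including a blinking cursor or an on-screen clock, so this approach suits only pages that render identically on every run.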

### Pilot 2: Screenshot-Based CI Diagnosis

When a CI workflow fails, agents can screenshot the Gitea workflow page and
use the image to triage:

```python
import time

from nexus.computer_use import computer_click, computer_screenshot, computer_type


def diagnose_failed_workflow(run_url: str) -> str:
    """
    Navigate to *run_url*, screenshot it, return the screenshot path
    for downstream LLM-based analysis.
    """
    computer_click(960, 40)   # address bar
    computer_type(run_url + "\n")
    time.sleep(3)             # wait for the workflow page to load

    result = computer_screenshot()
    return result["path"]   # hand off to vision model or OCR
```
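
A fixed `time.sleep(3)` can be too short for a slow page and needlessly long for a fast one. A possible refinement, sketched here as an assumption and not as existing behavior, is to poll until two consecutive captures are identical. The `capture` callable is injected so the sketch runs without a display; in practice it might wrap `computer_screenshot()` and return the PNG bytes. The name `wait_until_stable` is hypothetical.

```python
import hashlib
import time
from typing import Callable


def wait_until_stable(
    capture: Callable[[], bytes],
    interval: float = 0.5,
    timeout: float = 10.0,
) -> bool:
    """Poll `capture` until two consecutive frames hash the same.

    Returns True once the screen is stable, False if `timeout` elapses first.
    """
    deadline = time.monotonic() + timeout
    previous = hashlib.sha256(capture()).hexdigest()
    while time.monotonic() < deadline:
        time.sleep(interval)
        current = hashlib.sha256(capture()).hexdigest()
        if current == previous:
            return True
        previous = current
    return False
```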

---

## MCP Server (External Callers)

The lower-level MCP server (`mcp_servers/desktop_control_server.py`) exposes
the same capabilities over JSON-RPC stdio for callers outside Python:

```bash
# List available tools
echo '{"jsonrpc":"2.0","id":1,"method":"tools/list","params":{}}' \
  | python mcp_servers/desktop_control_server.py

# Take a screenshot
# Take a screenshot (the request is kept on one line so it arrives as a single message)
echo '{"jsonrpc":"2.0","id":2,"method":"tools/call","params":{"name":"take_screenshot","arguments":{"path":"/tmp/snap.png"}}}' \
  | python mcp_servers/desktop_control_server.py
```
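
The same requests can be built from Python. The sketch below assumes the server reads one JSON-RPC message per line on stdin and writes its replies to stdout, which this document does not spell out. `build_tool_call` and `call_server` are hypothetical helper names.

```python
import json
import subprocess
import sys
from typing import Any, Dict


def build_tool_call(request_id: int, name: str, arguments: Dict[str, Any]) -> str:
    """Serialize a tools/call request as a single line of JSON."""
    request = {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": name, "arguments": arguments},
    }
    # json.dumps emits no raw newlines, so the message stays on one line.
    return json.dumps(request)


def call_server(line: str, server: str = "mcp_servers/desktop_control_server.py") -> str:
    """Pipe one request line into the server process and return its stdout."""
    completed = subprocess.run(
        [sys.executable, server],
        input=line + "\n",
        capture_output=True,
        text=True,
        check=True,
        timeout=30,
    )
    return completed.stdout


request_line = build_tool_call(2, "take_screenshot", {"path": "/tmp/snap.png"})
```

Passing `request_line` to `call_server` would then mirror the second shell command above.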

---

## Docker / Sandboxed Environment

`docker-compose.desktop.yml` provides a safe container with:

- Xvfb virtual display (1280×800)
- noVNC for browser-based viewing
- Python + pyautogui pre-installed

```bash
docker-compose -f docker-compose.desktop.yml up
# noVNC → http://localhost:6080
# Run demo inside container:
docker exec -it nexus-desktop python nexus/computer_use_demo.py
```

---

## Development Notes

- `NEXUS_HOME` env var overrides the log/snapshot directory (default `~/.nexus`)
- `GITEA_URL` env var overrides the target in the demo script
- `BROWSER_OPEN_WAIT` controls how long the demo waits after opening the browser
- Tests in `tests/test_computer_use.py` run headless — pyautogui is fully mocked
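
A sketch of how a script might resolve these variables follows. The fallback values for `GITEA_URL` and `BROWSER_OPEN_WAIT` are taken from the demo output shown earlier (the forge URL and the 3-second wait) and are assumptions about the real defaults. `DemoConfig` and `load_config` are hypothetical names.

```python
import os
from dataclasses import dataclass
from pathlib import Path
from typing import Mapping


@dataclass(frozen=True)
class DemoConfig:
    nexus_home: Path
    gitea_url: str
    browser_open_wait: float


def load_config(env: Mapping[str, str] = os.environ) -> DemoConfig:
    """Read the three variables, falling back to assumed defaults."""
    return DemoConfig(
        nexus_home=Path(env.get("NEXUS_HOME", "~/.nexus")).expanduser(),
        gitea_url=env.get("GITEA_URL", "https://forge.alexanderwhitestone.com"),
        browser_open_wait=float(env.get("BROWSER_OPEN_WAIT", "3")),
    )
```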

feat: add desktop automation primitives to Hermes (#1125)

Implements Phase 1 & 2 of the [COMPUTER_USE] epic:

- nexus/computer_use.py — four Hermes tools with safety guards and
  JSONL action logging:
    computer_screenshot(), computer_click(), computer_type(), computer_scroll()
  Poka-yoke: right/middle clicks require confirm=True; text containing
  password/token/key keywords is refused without confirm=True.
  pyautogui.FAILSAFE=True enabled globally (corner-abort).

- nexus/computer_use_demo.py — end-to-end Phase 1 demo: baseline
  screenshot → open browser → navigate to Gitea → evidence screenshot.

- tests/test_computer_use.py — 29 unit tests, fully headless (pyautogui
  mocked); all pass.

- docs/computer-use.md — full Phase 1–3 documentation including API
  reference, safety table, action-log format, and pilot recipes.

- docker-compose.desktop.yml — sandboxed Xvfb + noVNC container for
  safe headless desktop automation.

The existing mcp_servers/desktop_control_server.py is unchanged; it
remains available for external/MCP callers (Bannerlord harness etc).

Fixes #1125

2026-04-08 06:29:27 -04:00