the-nexus/docs/computer-use.md
Alexander Whitestone a3a28aa4c2
feat: add desktop automation primitives to Hermes (#1125)
Implements Phase 1 & 2 of the [COMPUTER_USE] epic:

- nexus/computer_use.py — four Hermes tools with safety guards and
  JSONL action logging:
    computer_screenshot(), computer_click(), computer_type(), computer_scroll()
  Poka-yoke: right/middle clicks require confirm=True; text containing
  password/token/key keywords is refused without confirm=True.
  pyautogui.FAILSAFE=True enabled globally (corner-abort).

- nexus/computer_use_demo.py — end-to-end Phase 1 demo: baseline
  screenshot → open browser → navigate to Gitea → evidence screenshot.

- tests/test_computer_use.py — 29 unit tests, fully headless (pyautogui
  mocked); all pass.

- docs/computer-use.md — full Phase 1–3 documentation including API
  reference, safety table, action-log format, and pilot recipes.

- docker-compose.desktop.yml — sandboxed Xvfb + noVNC container for
  safe headless desktop automation.

The existing mcp_servers/desktop_control_server.py is unchanged; it
remains available for external/MCP callers (e.g. the Bannerlord harness).

Fixes #1125
2026-04-08 06:29:27 -04:00

# Computer Use — Desktop Automation Primitives for Hermes
**Issue:** #1125
**Status:** Phase 1 complete, Phase 2 in progress
**Owner:** Bezalel
**Epic:** #1120
---
## Overview
This document describes how Hermes agents can control a desktop environment
(screenshot, click, type, scroll) for automation and testing. The capability
unlocks:
- Visual regression testing of fleet dashboards
- Automated Gitea workflow verification
- Screenshot-based incident diagnosis
- Driving GUI-only tools from agent code
---
## Architecture
```
┌──────────────────────────────────────────────────────┐
│                     Hermes Agent                     │
│                                                      │
│  computer_screenshot()      computer_click(x, y)     │
│  computer_type(text)        computer_scroll(x, y, n) │
│                        │                             │
│                nexus/computer_use.py                 │
│             (safety guards · action log)             │
└────────────────────────┬─────────────────────────────┘
              ┌──────────┴───────────┐
              │      pyautogui       │
              │  (FAILSAFE enabled)  │
              └──────────┬───────────┘
              ┌──────────┴───────────┐
              │ Desktop environment  │
              │   (Xvfb · noVNC ·    │
              │    bare metal)       │
              └──────────────────────┘
```
The MCP server layer (`mcp_servers/desktop_control_server.py`) is still
available for external callers (e.g. the Bannerlord harness). The
`nexus/computer_use.py` module calls pyautogui directly so that safety
guards, logging, and screenshot evidence are applied consistently for every
Hermes agent invocation.
---
## Phase 1 — Environment & Primitives
### Sandboxed Desktop Setup
**Option A — Xvfb (lightweight, Linux/macOS)**
```bash
# Install
sudo apt-get install xvfb # Linux
brew install --cask xquartz   # macOS (Xvfb ships with XQuartz)
# Start a virtual display on :99
Xvfb :99 -screen 0 1280x800x24 &
export DISPLAY=:99
# Run the demo
python nexus/computer_use_demo.py
```
**Option B — Docker with noVNC**
```bash
docker-compose -f docker-compose.desktop.yml up
# Open http://localhost:6080 to view the virtual desktop
```
See `docker-compose.desktop.yml` in the repo root.
### Running the Demo
The `nexus/computer_use_demo.py` script exercises the full Phase 1 loop:
```
[1/4] Capturing baseline screenshot
[2/4] Opening browser → https://forge.alexanderwhitestone.com
[3/4] Waiting 3s for page to load
[4/4] Capturing evidence screenshot
```
```bash
# Default target (Gitea forge)
python nexus/computer_use_demo.py
# Custom URL
GITEA_URL=http://localhost:3000 python nexus/computer_use_demo.py
```
---
## Phase 2 — Tool Integration
### API Reference
All four tools live in `nexus/computer_use.py` and follow the same contract:
```python
result = tool(...)
# result is always:
# {"ok": bool, "tool": str, ...fields..., "screenshot": path_or_None}
```
#### `computer_screenshot(output_path=None)`
Take a screenshot of the current desktop.
| Parameter | Type | Default | Description |
|---------------|-----------------|----------------------|-----------------------------------|
| `output_path` | `str` or `None` | auto timestamped PNG | Where to save the captured image. |
```python
from nexus.computer_use import computer_screenshot
result = computer_screenshot()
# {"ok": True, "tool": "computer_screenshot", "path": "~/.nexus/nexus_snap_1712345678.png"}
```
#### `computer_click(x, y, *, button="left", confirm=False)`
Click at screen coordinates.
| Parameter | Type | Default | Description |
|-----------|--------|----------|------------------------------------------|
| `x`, `y` | `int` | required | Screen pixel coordinates. |
| `button` | `str` | `"left"` | `"left"`, `"right"`, or `"middle"`. |
| `confirm` | `bool` | `False` | Required for `right`/`middle` clicks. |
```python
from nexus.computer_use import computer_click
# Simple left click (no confirm needed)
result = computer_click(960, 540)
# Right-click requires explicit confirmation
result = computer_click(960, 540, button="right", confirm=True)
```
**Screenshot evidence:** before/after snapshots are captured and logged.
#### `computer_type(text, *, confirm=False)`
Type a string using keyboard simulation.
| Parameter | Type | Default | Description |
|-----------|--------|----------|----------------------------------------------------|
| `text` | `str` | required | Text to type. |
| `confirm` | `bool` | `False` | Required when text contains `password`/`token`/`key`. |
```python
from nexus.computer_use import computer_type
# Safe text — no confirm needed
computer_type("https://forge.alexanderwhitestone.com")
# Sensitive text — confirm required
computer_type("hunter2", confirm=True)
```
#### `computer_scroll(x, y, amount)`
Scroll the mouse wheel at the given position.
| Parameter | Type | Description |
|-----------|-------|-------------------------------------------------|
| `x`, `y` | `int` | Move mouse here before scrolling. |
| `amount` | `int` | Positive = scroll up; negative = scroll down. |
```python
from nexus.computer_use import computer_scroll
computer_scroll(640, 400, -5) # scroll down 5 clicks
computer_scroll(640, 400, 3) # scroll up 3 clicks
```
### Safety (Poka-Yoke)
| Situation | Behavior |
|------------------------------------|-------------------------------------------------|
| `right`/`middle` click w/o confirm | Refused; returns `ok=False` with explanation |
| Text with `password`/`token`/`key` | Refused unless `confirm=True` |
| `FAILSAFE = True` | Move mouse to screen corner (0, 0) to abort |
| pyautogui unavailable | All tools return `ok=False` gracefully |
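
The keyword refusal in the second row amounts to a small pre-check along these lines (an illustrative sketch of the rule as documented, not the module's exact code; `check_type_guard` is a hypothetical name):

```python
# Keywords that mark text as sensitive, per the safety table above
SENSITIVE_KEYWORDS = ("password", "token", "key")


def check_type_guard(text: str, confirm: bool = False) -> dict:
    """Return the refusal envelope when *text* contains a sensitive
    keyword and confirm was not given; otherwise allow typing."""
    lowered = text.lower()
    if any(word in lowered for word in SENSITIVE_KEYWORDS) and not confirm:
        return {"ok": False, "tool": "computer_type",
                "error": "text contains a sensitive keyword; pass confirm=True"}
    return {"ok": True, "tool": "computer_type"}
```

Because the guard returns `ok=False` rather than raising, callers can detect the refusal and retry deliberately with `confirm=True`.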
### Action Log
Every call is appended to `~/.nexus/computer_use_log.jsonl` (one JSON record per line; the example below is wrapped for readability):
```json
{"ok": true, "tool": "computer_click", "x": 960, "y": 540, "button": "left",
"before_screenshot": "/home/user/.nexus/before_click_1712345.png",
"screenshot": "/home/user/.nexus/after_click_1712345.png",
"ts": "2026-04-08T10:30:00+00:00"}
```
Read recent entries from Python:
```python
from nexus.computer_use import read_action_log
for record in read_action_log(last_n=10):
    print(record)
```
---
## Phase 3 — Use-Case Pilots
### Pilot 1: Visual Regression Test (Fleet Dashboard)
Open the fleet health dashboard, take a screenshot, compare pixel-level
hashes against a golden baseline:
```python
from nexus.computer_use import computer_screenshot, computer_click, computer_type
import hashlib
import time

def screenshot_hash(path: str) -> str:
    with open(path, "rb") as fh:
        return hashlib.md5(fh.read()).hexdigest()

# Navigate to the dashboard
computer_click(960, 40)  # address bar
computer_type("http://localhost:7771/health\n")
time.sleep(2)
result = computer_screenshot()
current_hash = screenshot_hash(result["path"])
GOLDEN_HASH = "abc123..." # established on first run
assert current_hash == GOLDEN_HASH, "Visual regression detected!"
```
### Pilot 2: Screenshot-Based CI Diagnosis
When a CI workflow fails, agents can screenshot the Gitea workflow page and
use the image to triage:
```python
from nexus.computer_use import computer_screenshot, computer_click, computer_type
import time

def diagnose_failed_workflow(run_url: str) -> str:
    """
    Navigate to *run_url*, screenshot it, and return the screenshot path
    for downstream LLM-based analysis.
    """
    computer_click(960, 40)  # address bar
    computer_type(run_url + "\n")
    time.sleep(3)
    result = computer_screenshot()
    return result["path"]  # hand off to vision model or OCR
```
---
## MCP Server (External Callers)
The lower-level MCP server (`mcp_servers/desktop_control_server.py`) exposes
the same capabilities over JSON-RPC stdio for callers outside Python:
```bash
# List available tools
echo '{"jsonrpc":"2.0","id":1,"method":"tools/list","params":{}}' \
| python mcp_servers/desktop_control_server.py
# Take a screenshot
echo '{"jsonrpc":"2.0","id":2,"method":"tools/call","params":{"name":"take_screenshot","arguments":{"path":"/tmp/snap.png"}}}' \
  | python mcp_servers/desktop_control_server.py
```
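
The same round-trip can be driven from Python with `subprocess`. This is a sketch under the assumption (suggested by the shell examples above) that the server reads one JSON-RPC request per line on stdin and writes one reply per line; `build_request` and `mcp_call` are hypothetical helper names.

```python
import json
import subprocess


def build_request(method: str, params: dict, request_id: int = 1) -> str:
    """Serialize one JSON-RPC 2.0 request as a single line."""
    return json.dumps({"jsonrpc": "2.0", "id": request_id,
                       "method": method, "params": params})


def mcp_call(method: str, params: dict) -> dict:
    """Send one request to the server over stdio and parse the reply."""
    proc = subprocess.run(
        ["python", "mcp_servers/desktop_control_server.py"],
        input=build_request(method, params) + "\n",
        capture_output=True, text=True, check=True)
    return json.loads(proc.stdout.splitlines()[0])
```

For example, `mcp_call("tools/list", {})` would mirror the first shell invocation above.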
---
## Docker / Sandboxed Environment
`docker-compose.desktop.yml` provides a safe container with:
- Xvfb virtual display (1280×800)
- noVNC for browser-based viewing
- Python + pyautogui pre-installed
```bash
docker-compose -f docker-compose.desktop.yml up
# noVNC → http://localhost:6080
# Run demo inside container:
docker exec -it nexus-desktop python nexus/computer_use_demo.py
```
---
## Development Notes
- `NEXUS_HOME` env var overrides the log/snapshot directory (default `~/.nexus`)
- `GITEA_URL` env var overrides the target in the demo script
- `BROWSER_OPEN_WAIT` controls how long the demo waits after opening the browser
- Tests in `tests/test_computer_use.py` run headless — pyautogui is fully mocked
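
For instance, a test run might point both knobs at throwaway values before launching the demo (a sketch; only `NEXUS_HOME` and `GITEA_URL` are documented above):

```python
import os
import tempfile

# Route logs and screenshots to a throwaway directory and target a
# local Gitea. Set these before importing any module that reads them
# at import time.
scratch = tempfile.mkdtemp(prefix="nexus_test_")
os.environ["NEXUS_HOME"] = scratch
os.environ["GITEA_URL"] = "http://localhost:3000"
```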