Implements Phase 1 and Phase 2 tooling from issue #1125:

- nexus/computer_use.py: four Hermes tools with poka-yoke safety
  * computer_screenshot(): capture & base64-encode desktop snapshot
  * computer_click(x, y, button, confirm): right/middle require confirm=True
  * computer_type(text, confirm): sensitive keywords blocked without confirm=True; text value is never written to audit log
  * computer_scroll(x, y, amount): scroll wheel
  * read_action_log(): inspect recent JSONL audit entries
  * pyautogui.FAILSAFE=True; all tools degrade gracefully when headless
- nexus/computer_use_demo.py: Phase 1 demo (baseline screenshot → open browser → navigate to Gitea forge → evidence screenshot)
- tests/test_computer_use.py: 32 unit tests, fully headless (pyautogui mocked), all passing
- docs/computer-use.md: API reference, safety table, phase roadmap, pilot recipes
- docker-compose.desktop.yml: sandboxed Xvfb + noVNC container

Fixes #1125

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Computer Use — Desktop Automation Primitives for Hermes
Issue: #1125
Overview
nexus/computer_use.py adds desktop automation primitives to the Hermes fleet. Agents can take screenshots, click, type, and scroll — enough to drive a browser, validate a UI, or diagnose a failed workflow page visually.
All actions are logged to a JSONL audit trail at ~/.nexus/computer_use_actions.jsonl.
Quick Start
Local (requires a real display or Xvfb)
# Install dependencies
pip install pyautogui Pillow
# Run the Phase 1 demo
python -m nexus.computer_use_demo
Sandboxed (Docker + Xvfb + noVNC)
docker compose -f docker-compose.desktop.yml up -d
# Visit http://localhost:6080 in your browser to see the virtual desktop
docker compose -f docker-compose.desktop.yml run hermes-desktop \
python -m nexus.computer_use_demo
docker compose -f docker-compose.desktop.yml down
API Reference
computer_screenshot(save_path=None, log_path=...)
Capture the current desktop.
| Param | Type | Description |
|---|---|---|
| save_path | str \| None | Path to save PNG. If None, returns base64 string. |
| log_path | Path | Audit log file. |
Returns dict:
{
  "ok": true,
  "image_b64": "<base64 PNG or null>",
  "saved_to": "<path or null>",
  "error": null
}
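As a minimal sketch of consuming this return value, the helper below decodes `image_b64` and writes the PNG bytes to disk. The helper name `write_screenshot` and its arguments are hypothetical and not part of nexus/computer_use.py; only the dict keys come from the return shape documented above.

```python
import base64
from pathlib import Path


def write_screenshot(result: dict, dest: Path) -> bool:
    """Decode the base64 PNG from a computer_screenshot()-style result dict.

    Returns True if bytes were written. Returns False if the call failed
    or image_b64 is null (for example, when the tool saved to save_path).
    """
    if not result.get("ok") or not result.get("image_b64"):
        return False
    dest.write_bytes(base64.b64decode(result["image_b64"]))
    return True
```

Checking `ok` before touching `image_b64` keeps the caller safe in headless environments, where the tools return ok=False with an error message.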
computer_click(x, y, button="left", confirm=False, log_path=...)
Click the mouse at screen coordinates.
| Param | Type | Description |
|---|---|---|
| x | int | Horizontal coordinate |
| y | int | Vertical coordinate |
| button | str | "left" \| "right" \| "middle" |
| confirm | bool | Must be True for right / middle clicks (poka-yoke) |
Returns dict:
{"ok": true, "error": null}
computer_type(text, confirm=False, interval=0.02, log_path=...)
Type text using the keyboard.
| Param | Type | Description |
|---|---|---|
| text | str | Text to type |
| confirm | bool | Must be True when text contains a sensitive keyword |
| interval | float | Delay between keystrokes (seconds) |
Sensitive keywords (require confirm=True): password, passwd, secret, token, api_key, apikey, key, auth
Note: the actual text value is never written to the audit log; only its length and whether it was flagged as sensitive are recorded.
Returns dict:
{"ok": true, "error": null}
computer_scroll(x, y, amount=3, log_path=...)
Scroll the mouse wheel at screen coordinates.
| Param | Type | Description |
|---|---|---|
| x | int | Horizontal coordinate |
| y | int | Vertical coordinate |
| amount | int | Scroll units. Positive = up, negative = down. |
Returns dict:
{"ok": true, "error": null}
read_action_log(n=20, log_path=...)
Return the most recent n audit log entries, newest first.
from nexus.computer_use import read_action_log

# Print timestamp, action name, and success flag for the five newest entries
for entry in read_action_log(n=5):
    print(entry["ts"], entry["action"], entry["result"]["ok"])
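For readers who want to inspect the log without importing the module, the "most recent n, newest first" behavior can be approximated with a few lines of standard-library code. The function name `tail_jsonl` is hypothetical, and the sketch assumes one JSON object per line, appended in chronological order.

```python
import json
from pathlib import Path


def tail_jsonl(log_path: Path, n: int = 20) -> list[dict]:
    """Return up to n entries from a JSONL file, newest (last written) first."""
    if n <= 0 or not log_path.exists():
        return []
    text = log_path.read_text(encoding="utf-8")
    lines = [ln for ln in text.splitlines() if ln.strip()]  # skip blank lines
    return [json.loads(ln) for ln in reversed(lines[-n:])]
```

Reading the whole file is fine for a small audit log; a very large log would call for reading from the end of the file instead.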
Safety Model
| Action | Safety gate |
|---|---|
| computer_click(button="right") | Requires confirm=True |
| computer_click(button="middle") | Requires confirm=True |
| computer_type with sensitive text | Requires confirm=True |
| Mouse to top-left corner | pyautogui FAILSAFE — aborts immediately |
| All actions | Written to JSONL audit log with timestamp |
| Headless environment | All tools degrade gracefully — return ok=False with error message |
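The "degrade gracefully" row can be illustrated with a small wrapper that converts any failure of the underlying desktop call into the documented {ok, error} shape. `run_guarded` is a hypothetical name, and how nexus/computer_use.py actually handles a missing display is not shown in this document.

```python
from typing import Callable


def run_guarded(action: Callable[[], None]) -> dict:
    """Run a desktop action; report failures as data instead of raising."""
    try:
        action()
    except Exception as exc:  # e.g. no display available in a headless run
        return {"ok": False, "error": f"{type(exc).__name__}: {exc}"}
    return {"ok": True, "error": None}
```

A broad `except Exception` would also swallow pyautogui's `FailSafeException`, so an implementation that wants the top-left-corner abort to stop the agent would likely re-raise that exception rather than wrap it.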
Phase Roadmap
Phase 1 — Environment & Primitives ✅
- Sandboxed desktop via Xvfb + noVNC (docker-compose.desktop.yml)
- computer_screenshot, computer_click, computer_type, computer_scroll
- Poka-yoke safety checks on all destructive actions
- JSONL audit log for all actions
- Demo: baseline screenshot → open browser → navigate to Gitea → evidence screenshot
- 32 unit tests, fully headless (pyautogui mocked)
Phase 2 — Tool Integration (planned)
- Register tools in the Hermes tool registry
- LLM-based planner loop using screenshots as context
- Destructive action confirmation UI
Phase 3 — Use-Case Pilots (planned)
- Pilot 1: Automated visual regression test for fleet dashboard
- Pilot 2: Screenshot-based diagnosis of failed CI workflow page
File Locations
| File | Purpose |
|---|---|
| nexus/computer_use.py | Core tool primitives |
| nexus/computer_use_demo.py | Phase 1 end-to-end demo |
| tests/test_computer_use.py | 32 unit tests |
| docker-compose.desktop.yml | Sandboxed desktop container |
| ~/.nexus/computer_use_actions.jsonl | Runtime audit log |
| ~/.nexus/computer_use_evidence/ | Screenshot evidence (demo output) |