hermes-agent/website/docs/user-guide/features/browser.md
teknium1 d50e9bcef7 docs: add 11 new pages + expand 4 existing pages (26 → 37 total)
New pages (sourced from actual codebase):
- Security: command approval, DM pairing, container isolation, production checklist
- Session Management: resume, export, prune, search, per-platform tracking
- Context Files: AGENTS.md project context, discovery, size limits, security
- Personality: SOUL.md, 14 built-in personalities, custom definitions
- Browser Automation: Browserbase setup, 10 browser tools, stealth mode
- Image Generation: FLUX 2 Pro via FAL, aspect ratios, auto-upscaling
- Provider Routing: OpenRouter sort/only/ignore/order config
- Honcho: AI-native memory integration, setup, peer config
- Home Assistant: HASS setup, 4 HA tools, WebSocket gateway
- Batch Processing: trajectory generation, dataset format, checkpointing
- RL Training: Atropos/Tinker integration, environments, workflow

Expanded pages:
- code-execution: 51 → 195 lines (examples, limits, security, comparison table)
- delegation: 60 → 216 lines (context tips, batch mode, model override)
- cron: 88 → 273 lines (real-world examples, delivery options, expression cheat sheet)
- memory: 98 → 249 lines (best practices, capacity management, examples)
2026-03-05 07:28:41 -08:00


---
title: Browser Automation
description: Control cloud browsers with Browserbase integration for web interaction, form filling, scraping, and more.
sidebar_label: Browser
sidebar_position: 5
---
# Browser Automation
Hermes Agent includes a full browser automation toolset powered by [Browserbase](https://browserbase.com). The agent can navigate websites, interact with page elements, fill forms, and extract information, all in cloud-hosted browsers with built-in anti-bot stealth features.
## Overview
The browser tools use the `agent-browser` CLI with Browserbase cloud execution. Pages are represented as **accessibility trees** (text-based snapshots), making them ideal for LLM agents. Interactive elements get ref IDs (like `@e1`, `@e2`) that the agent uses for clicking and typing.
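To make the ref ID idea concrete, here is a minimal sketch that pulls `@eN` tokens out of a snapshot string. The sample snapshot text and the `extract_refs` helper are invented for illustration; the real `agent-browser` output format may differ.

```python
import re

# Hypothetical snapshot text; the real agent-browser output format may differ.
SAMPLE_SNAPSHOT = """\
heading "Sign in"
textbox "Email" @e1
textbox "Password" @e2
button "Sign In" @e3
"""

# Ref IDs look like @e1, @e2, ... according to this page.
REF_PATTERN = re.compile(r"@e\d+")


def extract_refs(snapshot: str) -> list[str]:
    """Return ref IDs in order of first appearance, without duplicates."""
    seen: dict[str, None] = {}
    for ref in REF_PATTERN.findall(snapshot):
        seen.setdefault(ref, None)
    return list(seen)


print(extract_refs(SAMPLE_SNAPSHOT))
```

The agent works with these refs directly, so a helper like this is only useful for debugging or for inspecting snapshots outside the agent loop.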
Key capabilities:
- **Cloud execution** — no local browser needed
- **Built-in stealth** — random fingerprints, CAPTCHA solving, residential proxies
- **Session isolation** — each task gets its own browser session
- **Automatic cleanup** — inactive sessions are closed after a timeout
- **Vision analysis** — screenshot + AI analysis for visual understanding
## Setup
### Required Environment Variables
```bash
# Add to ~/.hermes/.env
BROWSERBASE_API_KEY=your-api-key-here
BROWSERBASE_PROJECT_ID=your-project-id-here
```
Get your credentials at [browserbase.com](https://browserbase.com).
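A quick way to confirm that both required variables are visible to a process is sketched below. It only inspects the current process environment, so it assumes the values from `~/.hermes/.env` have already been exported or loaded; the `missing_credentials` helper is hypothetical and not part of Hermes.

```python
import os

# The two variables this page lists as required.
REQUIRED_VARS = ("BROWSERBASE_API_KEY", "BROWSERBASE_PROJECT_ID")


def missing_credentials(env=os.environ) -> list[str]:
    """Return the names of required variables that are unset or empty."""
    return [name for name in REQUIRED_VARS if not env.get(name)]


missing = missing_credentials()
if missing:
    print("Missing:", ", ".join(missing))
else:
    print("Browserbase credentials found")
```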
### Optional Environment Variables
```bash
# Residential proxies for better CAPTCHA solving (default: "true")
BROWSERBASE_PROXIES=true
# Advanced stealth with custom Chromium — requires Scale Plan (default: "false")
BROWSERBASE_ADVANCED_STEALTH=false
# Session reconnection after disconnects — requires paid plan (default: "true")
BROWSERBASE_KEEP_ALIVE=true
# Custom session timeout in milliseconds (default: project default)
# Examples: 600000 (10min), 1800000 (30min)
BROWSERBASE_SESSION_TIMEOUT=600000
# Inactivity timeout before auto-cleanup in seconds (default: 300)
BROWSER_INACTIVITY_TIMEOUT=300
```
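The sketch below shows one way these optional variables could be read with the defaults listed above. It is an illustration only: the function names are hypothetical, and treating only the string `true` (case-insensitive) as enabled is an assumption about parsing, not a description of how Hermes reads them.

```python
import os


def env_flag(name: str, default: bool, env=os.environ) -> bool:
    """Interpret an environment variable as a boolean flag.

    Assumption: only the string "true" (any case) counts as enabled.
    """
    raw = env.get(name)
    if raw is None:
        return default
    return raw.strip().lower() == "true"


def load_browser_settings(env=os.environ) -> dict:
    """Collect the optional settings, applying the documented defaults."""
    timeout_ms = env.get("BROWSERBASE_SESSION_TIMEOUT")
    return {
        "proxies": env_flag("BROWSERBASE_PROXIES", True, env),
        "advanced_stealth": env_flag("BROWSERBASE_ADVANCED_STEALTH", False, env),
        "keep_alive": env_flag("BROWSERBASE_KEEP_ALIVE", True, env),
        # None means "use the Browserbase project default".
        "session_timeout_ms": int(timeout_ms) if timeout_ms else None,
        "inactivity_timeout_s": int(env.get("BROWSER_INACTIVITY_TIMEOUT", "300")),
    }


# With no variables set, every value falls back to its documented default.
print(load_browser_settings({}))
```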
### Install agent-browser CLI
```bash
npm install -g agent-browser
# Or install locally in the repo:
npm install
```
:::info
The `browser` toolset must be included in your config's `toolsets` list or enabled via `hermes config set toolsets '["hermes-cli", "browser"]'`.
:::
## Available Tools
### `browser_navigate`
Navigate to a URL. This call initializes the Browserbase session, so it must come before any other browser tool.
```
Navigate to https://github.com/NousResearch
```
:::tip
For simple information retrieval, prefer `web_search` or `web_extract` — they are faster and cheaper. Use browser tools when you need to **interact** with a page (click buttons, fill forms, handle dynamic content).
:::
### `browser_snapshot`
Get a text-based snapshot of the current page's accessibility tree. Returns interactive elements with ref IDs like `@e1`, `@e2` for use with `browser_click` and `browser_type`.
- **`full=false`** (default): Compact view showing only interactive elements
- **`full=true`**: Complete page content
Snapshots over 8000 characters are automatically summarized by an LLM.
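The size rule can be pictured as a simple threshold check. The following is a sketch under stated assumptions: `prepare_snapshot` and `fake_summarize` are hypothetical names, and the stand-in summarizer replaces what would be an LLM call in practice.

```python
SNAPSHOT_CHAR_LIMIT = 8000  # threshold stated on this page


def prepare_snapshot(snapshot: str, summarize) -> str:
    """Return the snapshot unchanged, or a summary if it exceeds the limit."""
    if len(snapshot) <= SNAPSHOT_CHAR_LIMIT:
        return snapshot
    return summarize(snapshot)


def fake_summarize(text: str) -> str:
    """Stand-in for an LLM summarization call (hypothetical)."""
    return f"[summary of {len(text)} characters]"


small = prepare_snapshot("x" * 100, fake_summarize)
large = prepare_snapshot("x" * 9000, fake_summarize)
print(len(small))
print(large)
```

Because a summary can drop detail, the compact default view (`full=false`) is the safer choice when the goal is to find refs for clicking and typing.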
### `browser_click`
Click an element identified by its ref ID from the snapshot.
```
Click @e5 to press the "Sign In" button
```
### `browser_type`
Type text into an input field. Clears the field first, then types the new text.
```
Type "hermes agent" into the search field @e3
```
### `browser_scroll`
Scroll the page up or down to reveal more content.
```
Scroll down to see more results
```
### `browser_press`
Press a keyboard key. Useful for submitting forms or navigation.
```
Press Enter to submit the form
```
Supported keys include `Enter`, `Tab`, `Escape`, `ArrowDown`, and `ArrowUp`.
### `browser_back`
Navigate back to the previous page in browser history.
### `browser_get_images`
List all images on the current page with their URLs and alt text. Useful for finding images to analyze.
### `browser_vision`
Take a screenshot and analyze it with vision AI. Use this when text snapshots miss important visual information, for example CAPTCHAs, complex layouts, or visual verification challenges.
```
What does the chart on this page show?
```
### `browser_close`
Close the browser session and release resources. Call this when done to free up Browserbase session quota.
## Practical Examples
### Filling Out a Web Form
```
User: Sign up for an account on example.com with my email john@example.com and the password SecurePass123
Agent workflow:
1. browser_navigate("https://example.com/signup")
2. browser_snapshot() → sees form fields with refs
3. browser_type(ref="@e3", text="john@example.com")
4. browser_type(ref="@e5", text="SecurePass123")
5. browser_click(ref="@e8") → clicks "Create Account"
6. browser_snapshot() → confirms success
7. browser_close()
```
### Researching Dynamic Content
```
User: What are the top trending repos on GitHub right now?
Agent workflow:
1. browser_navigate("https://github.com/trending")
2. browser_snapshot(full=true) → reads trending repo list
3. browser_close()
4. Returns formatted results
```
## Stealth Features
Browserbase provides automatic stealth capabilities:
| Feature | Default | Notes |
|---------|---------|-------|
| Basic Stealth | Always on | Random fingerprints, viewport randomization, CAPTCHA solving |
| Residential Proxies | On | Routes through residential IPs for better access |
| Advanced Stealth | Off | Custom Chromium build, requires Scale Plan |
| Keep Alive | On | Session reconnection after disconnects, requires paid plan |
:::note
If paid features aren't available on your plan, Hermes automatically falls back — first disabling `keepAlive`, then proxies — so browsing still works on free plans.
:::
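The fallback order described in the note can be sketched as a retry over progressively smaller option sets. Everything here is hypothetical: `PlanFeatureUnavailable`, `create_with_fallback`, and `free_plan_create` are illustrative names, and the real code may detect plan limits differently.

```python
class PlanFeatureUnavailable(Exception):
    """Hypothetical error raised when the plan rejects a requested feature."""


def create_with_fallback(create_session, keep_alive=True, proxies=True):
    """Try progressively less demanding option sets until one is accepted."""
    candidates = [
        {"keep_alive": keep_alive, "proxies": proxies},  # as configured
        {"keep_alive": False, "proxies": proxies},       # drop keepAlive first
        {"keep_alive": False, "proxies": False},         # then drop proxies
    ]
    tried = []
    last_error = None
    for options in candidates:
        if options in tried:
            continue  # skip duplicates when a feature was already off
        tried.append(options)
        try:
            return create_session(**options), options
        except PlanFeatureUnavailable as exc:
            last_error = exc
    raise last_error


def free_plan_create(keep_alive, proxies):
    """Stand-in for a session request on a plan without paid features."""
    if keep_alive or proxies:
        raise PlanFeatureUnavailable("paid feature requested")
    return "session"


session, used = create_with_fallback(free_plan_create)
print(used)
```

On the simulated free plan the first two attempts are rejected and the third succeeds with both features disabled, which mirrors the order given in the note.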
## Session Management
- Each task gets an isolated browser session via Browserbase
- Sessions are automatically cleaned up after inactivity (default: 5 minutes)
- A background thread checks every 30 seconds for stale sessions
- Emergency cleanup runs on process exit to prevent orphaned sessions
- Sessions are released via the Browserbase API (`REQUEST_RELEASE` status)
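The inactivity cleanup described above can be sketched as a sweep over last-activity timestamps. This is a minimal illustration, not the actual implementation: `find_stale_sessions`, `sweep`, and the dictionary of timestamps are hypothetical, and in practice a background thread would call something like `sweep` every 30 seconds.

```python
INACTIVITY_TIMEOUT_S = 300  # documented default (5 minutes)
CHECK_INTERVAL_S = 30       # how often the background check runs


def find_stale_sessions(last_activity: dict, now: float,
                        timeout_s: float = INACTIVITY_TIMEOUT_S) -> list:
    """Return task IDs whose last activity is older than the timeout."""
    return [task_id for task_id, ts in last_activity.items()
            if now - ts > timeout_s]


def sweep(last_activity: dict, close_session, now: float,
          timeout_s: float = INACTIVITY_TIMEOUT_S) -> list:
    """Close every stale session and forget its timestamp."""
    stale = find_stale_sessions(last_activity, now, timeout_s)
    for task_id in stale:
        close_session(task_id)
        del last_activity[task_id]
    return stale


# Simulated clock: task-b has been idle for 400 seconds, task-a for 10.
sessions = {"task-a": 990.0, "task-b": 600.0}
closed = []
sweep(sessions, closed.append, now=1000.0)
print(closed)
print(sessions)
```

Passing `now` in explicitly keeps the logic easy to test; a real loop would read the clock itself and sleep for `CHECK_INTERVAL_S` between sweeps.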
## Limitations
- **Requires Browserbase account** — no local browser fallback
- **Requires `agent-browser` CLI** — must be installed via npm
- **Text-based interaction** — relies on accessibility tree, not pixel coordinates
- **Snapshot size** — large pages may be truncated or LLM-summarized at 8000 characters
- **Session timeout** — sessions expire based on your Browserbase plan settings
- **Cost** — each session consumes Browserbase credits; use `browser_close` when done
- **No file downloads** — cannot download files from the browser