Merge pull request #745 from NousResearch/hermes/hermes-f8d56335

feat: browser console tool, annotated screenshots, auto-recording, and dogfood QA skill
This commit is contained in:
Teknium
2026-03-08 21:29:52 -07:00
committed by GitHub
11 changed files with 835 additions and 9 deletions

View File

@@ -69,7 +69,7 @@ hermes-agent/
│ ├── file_tools.py # File read/write/search/patch tools
│ ├── file_operations.py # File operations helpers
│ ├── web_tools.py # Firecrawl search/extract
│ ├── browser_tool.py # Browserbase browser automation
│ ├── browser_tool.py # Browserbase browser automation (browser_console, session recording)
│ ├── vision_tools.py # Image analysis via auxiliary LLM
│ ├── image_generation_tool.py # FLUX image generation via fal.ai
│ ├── tts_tool.py # Text-to-speech
@@ -113,7 +113,7 @@ hermes-agent/
├── cron/ # Scheduler implementation
├── environments/ # RL training environments (Atropos integration)
├── honcho_integration/ # Honcho client & session management
├── skills/ # Bundled skill sources
├── skills/ # Bundled skill sources (includes dogfood QA testing)
├── optional-skills/ # Official optional skills (not activated by default)
├── scripts/ # Install scripts, utilities
├── tests/ # Full pytest suite (~2300+ tests)

1
cli.py
View File

@@ -161,6 +161,7 @@ def load_cli_config() -> Dict[str, Any]:
},
"browser": {
"inactivity_timeout": 120, # Auto-cleanup inactive browser sessions after 2 min
"record_sessions": False, # Auto-record browser sessions as WebM videos
},
"compression": {
"enabled": True, # Auto-compress when approaching context limit

View File

@@ -81,6 +81,7 @@ DEFAULT_CONFIG = {
"browser": {
"inactivity_timeout": 120,
"record_sessions": False, # Auto-record browser sessions as WebM videos
},
"compression": {

162
skills/dogfood/SKILL.md Normal file
View File

@@ -0,0 +1,162 @@
---
name: dogfood
description: Systematic exploratory QA testing of web applications — find bugs, capture evidence, and generate structured reports
version: 1.0.0
metadata:
hermes:
tags: [qa, testing, browser, web, dogfood]
related_skills: []
---
# Dogfood: Systematic Web Application QA Testing
## Overview
This skill guides you through systematic exploratory QA testing of web applications using the browser toolset. You will navigate the application, interact with elements, capture evidence of issues, and produce a structured bug report.
## Prerequisites
- Browser toolset must be available (`browser_navigate`, `browser_snapshot`, `browser_click`, `browser_type`, `browser_vision`, `browser_console`, `browser_scroll`, `browser_back`, `browser_press`, `browser_close`)
- A target URL and testing scope from the user
## Inputs
The user provides:
1. **Target URL** — the entry point for testing
2. **Scope** — what areas/features to focus on (or "full site" for comprehensive testing)
3. **Output directory** (optional) — where to save screenshots and the report (default: `./dogfood-output`)
## Workflow
Follow this 5-phase systematic workflow:
### Phase 1: Plan
1. Create the output directory structure:
```
{output_dir}/
├── screenshots/ # Evidence screenshots
└── report.md # Final report (generated in Phase 5)
```
2. Identify the testing scope based on user input.
3. Build a rough sitemap by planning which pages and features to test:
- Landing/home page
- Navigation links (header, footer, sidebar)
- Key user flows (sign up, login, search, checkout, etc.)
- Forms and interactive elements
- Edge cases (empty states, error pages, 404s)
### Phase 2: Explore
For each page or feature in your plan:
1. **Navigate** to the page:
```
browser_navigate(url="https://example.com/page")
```
2. **Take a snapshot** to understand the DOM structure:
```
browser_snapshot()
```
3. **Check the console** for JavaScript errors:
```
browser_console(clear=true)
```
Do this after every navigation and after every significant interaction. Silent JS errors are high-value findings.
4. **Take an annotated screenshot** to visually assess the page and identify interactive elements:
```
browser_vision(question="Describe the page layout, identify any visual issues, broken elements, or accessibility concerns", annotate=true)
```
The `annotate=true` flag overlays numbered `[N]` labels on interactive elements. Each `[N]` maps to ref `@eN` for subsequent browser commands.
5. **Test interactive elements** systematically:
- Click buttons and links: `browser_click(ref="@eN")`
- Fill forms: `browser_type(ref="@eN", text="test input")`
- Test keyboard navigation: `browser_press(key="Tab")`, `browser_press(key="Enter")`
- Scroll through content: `browser_scroll(direction="down")`
- Test form validation with invalid inputs
- Test empty submissions
6. **After each interaction**, check for:
- Console errors: `browser_console()`
- Visual changes: `browser_vision(question="What changed after the interaction?")`
- Expected vs actual behavior
### Phase 3: Collect Evidence
For every issue found:
1. **Take a screenshot** showing the issue:
```
browser_vision(question="Capture and describe the issue visible on this page", annotate=false)
```
Save the `screenshot_path` from the response — you will reference it in the report.
2. **Record the details**:
- URL where the issue occurs
- Steps to reproduce
- Expected behavior
- Actual behavior
- Console errors (if any)
- Screenshot path
3. **Classify the issue** using the issue taxonomy (see `references/issue-taxonomy.md`):
- Severity: Critical / High / Medium / Low
- Category: Functional / Visual / Accessibility / Console / UX / Content
### Phase 4: Categorize
1. Review all collected issues.
2. De-duplicate — merge issues that are the same bug manifesting in different places.
3. Assign final severity and category to each issue.
4. Sort by severity (Critical first, then High, Medium, Low).
5. Count issues by severity and category for the executive summary.
### Phase 5: Report
Generate the final report using the template at `templates/dogfood-report-template.md`.
The report must include:
1. **Executive summary** with total issue count, breakdown by severity, and testing scope
2. **Per-issue sections** with:
- Issue number and title
- Severity and category badges
- URL where observed
- Description of the issue
- Steps to reproduce
- Expected vs actual behavior
- Screenshot references (use `MEDIA:<screenshot_path>` for inline images)
- Console errors if relevant
3. **Summary table** of all issues
4. **Testing notes** — what was tested, what was not, any blockers
Save the report to `{output_dir}/report.md`.
## Tools Reference
| Tool | Purpose |
|------|---------|
| `browser_navigate` | Go to a URL |
| `browser_snapshot` | Get DOM text snapshot (accessibility tree) |
| `browser_click` | Click an element by ref (`@eN`) or text |
| `browser_type` | Type into an input field |
| `browser_scroll` | Scroll up/down on the page |
| `browser_back` | Go back in browser history |
| `browser_press` | Press a keyboard key |
| `browser_vision` | Screenshot + AI analysis; use `annotate=true` for element labels |
| `browser_console` | Get JS console output and errors |
| `browser_close` | Close the browser session |
## Tips
- **Always check `browser_console()` after navigating and after significant interactions.** Silent JS errors are among the most valuable findings.
- **Use `annotate=true` with `browser_vision`** when you need to reason about interactive element positions or when the snapshot refs are unclear.
- **Test with both valid and invalid inputs** — form validation bugs are common.
- **Scroll through long pages** — content below the fold may have rendering issues.
- **Test navigation flows** — click through multi-step processes end-to-end.
- **Check responsive behavior** by noting any layout issues visible in screenshots.
- **Don't forget edge cases**: empty states, very long text, special characters, rapid clicking.
- When reporting screenshots to the user, include `MEDIA:<screenshot_path>` so they can see the evidence inline.

View File

@@ -0,0 +1,109 @@
# Issue Taxonomy
Use this taxonomy to classify issues found during dogfood QA testing.
## Severity Levels
### Critical
The issue makes a core feature completely unusable or causes data loss.
**Examples:**
- Application crashes or shows a blank white page
- Form submission silently loses user data
- Authentication is completely broken (can't log in at all)
- Payment flow fails and charges the user without completing the order
- Security vulnerability (e.g., XSS, exposed credentials in console)
### High
The issue significantly impairs functionality but a workaround may exist.
**Examples:**
- A key button does nothing when clicked (but refreshing fixes it)
- Search returns no results for valid queries
- Form validation rejects valid input
- Page loads but critical content is missing or garbled
- Navigation link leads to a 404 or wrong page
- Uncaught JavaScript exceptions in the console on core pages
### Medium
The issue is noticeable and affects user experience but doesn't block core functionality.
**Examples:**
- Layout is misaligned or overlapping on certain screen sections
- Images fail to load (broken image icons)
- Slow performance (visible loading delays > 3 seconds)
- Form field lacks proper validation feedback (no error message on bad input)
- Console warnings that suggest deprecated or misconfigured features
- Inconsistent styling between similar pages
### Low
Minor polish issues that don't affect functionality.
**Examples:**
- Typos or grammatical errors in text content
- Minor spacing or alignment inconsistencies
- Placeholder text left in production ("Lorem ipsum")
- Favicon missing
- Console info/debug messages that shouldn't be in production
- Subtle color contrast issues that don't fail WCAG requirements
## Categories
### Functional
Issues where features don't work as expected.
- Buttons/links that don't respond
- Forms that don't submit or submit incorrectly
- Broken user flows (can't complete a multi-step process)
- Incorrect data displayed
- Features that work partially
### Visual
Issues with the visual presentation of the page.
- Layout problems (overlapping elements, broken grids)
- Broken images or missing media
- Styling inconsistencies
- Responsive design failures
- Z-index issues (elements hidden behind others)
- Text overflow or truncation
### Accessibility
Issues that prevent or hinder access for users with disabilities.
- Missing alt text on meaningful images
- Poor color contrast (fails WCAG AA)
- Elements not reachable via keyboard navigation
- Missing form labels or ARIA attributes
- Focus indicators missing or unclear
- Screen reader incompatible content
### Console
Issues detected through JavaScript console output.
- Uncaught exceptions and unhandled promise rejections
- Failed network requests (4xx, 5xx errors in console)
- Deprecation warnings
- CORS errors
- Mixed content warnings (HTTP resources on HTTPS page)
- Excessive console.log output left from development
### UX (User Experience)
Issues where functionality works but the experience is poor.
- Confusing navigation or information architecture
- Missing loading indicators (user doesn't know something is happening)
- No feedback after user actions (e.g., button click with no visible result)
- Inconsistent interaction patterns
- Missing confirmation dialogs for destructive actions
- Poor error messages that don't help the user recover
### Content
Issues with the text, media, or information on the page.
- Typos and grammatical errors
- Placeholder/dummy content in production
- Outdated information
- Missing content (empty sections)
- Broken or dead links to external resources
- Incorrect or misleading labels

View File

@@ -0,0 +1,86 @@
# Dogfood QA Report
**Target:** {target_url}
**Date:** {date}
**Scope:** {scope_description}
**Tester:** Hermes Agent (automated exploratory QA)
---
## Executive Summary
| Severity | Count |
|----------|-------|
| 🔴 Critical | {critical_count} |
| 🟠 High | {high_count} |
| 🟡 Medium | {medium_count} |
| 🔵 Low | {low_count} |
| **Total** | **{total_count}** |
**Overall Assessment:** {one_sentence_assessment}
---
## Issues
<!-- Repeat this section for each issue found, sorted by severity (Critical first) -->
### Issue #{issue_number}: {issue_title}
| Field | Value |
|-------|-------|
| **Severity** | {severity} |
| **Category** | {category} |
| **URL** | {url_where_found} |
**Description:**
{detailed_description_of_the_issue}
**Steps to Reproduce:**
1. {step_1}
2. {step_2}
3. {step_3}
**Expected Behavior:**
{what_should_happen}
**Actual Behavior:**
{what_actually_happens}
**Screenshot:**
MEDIA:{screenshot_path}
**Console Errors** (if applicable):
```
{console_error_output}
```
---
<!-- End of per-issue section -->
## Issues Summary Table
| # | Title | Severity | Category | URL |
|---|-------|----------|----------|-----|
| {n} | {title} | {severity} | {category} | {url} |
## Testing Coverage
### Pages Tested
- {list_of_pages_visited}
### Features Tested
- {list_of_features_exercised}
### Not Tested / Out of Scope
- {areas_not_covered_and_why}
### Blockers
- {any_issues_that_prevented_testing_certain_areas}
---
## Notes
{any_additional_observations_or_recommendations}

View File

@@ -0,0 +1,276 @@
"""Tests for browser_console tool and browser_vision annotate param."""
import json
import os
import sys
from unittest.mock import patch, MagicMock
import pytest
sys.path.insert(0, os.path.join(os.path.dirname(__file__), "..", ".."))
# ── browser_console ──────────────────────────────────────────────────
class TestBrowserConsole:
"""browser_console() returns console messages + JS errors in one call."""
def test_returns_console_messages_and_errors(self):
from tools.browser_tool import browser_console
console_response = {
"success": True,
"data": {
"messages": [
{"text": "hello", "type": "log", "timestamp": 1},
{"text": "oops", "type": "error", "timestamp": 2},
]
},
}
errors_response = {
"success": True,
"data": {
"errors": [
{"message": "Uncaught TypeError", "timestamp": 3},
]
},
}
with patch("tools.browser_tool._run_browser_command") as mock_cmd:
mock_cmd.side_effect = [console_response, errors_response]
result = json.loads(browser_console(task_id="test"))
assert result["success"] is True
assert result["total_messages"] == 2
assert result["total_errors"] == 1
assert result["console_messages"][0]["text"] == "hello"
assert result["console_messages"][1]["text"] == "oops"
assert result["js_errors"][0]["message"] == "Uncaught TypeError"
def test_passes_clear_flag(self):
from tools.browser_tool import browser_console
empty = {"success": True, "data": {"messages": [], "errors": []}}
with patch("tools.browser_tool._run_browser_command", return_value=empty) as mock_cmd:
browser_console(clear=True, task_id="test")
calls = mock_cmd.call_args_list
# Both console and errors should get --clear
assert calls[0][0] == ("test", "console", ["--clear"])
assert calls[1][0] == ("test", "errors", ["--clear"])
def test_no_clear_by_default(self):
from tools.browser_tool import browser_console
empty = {"success": True, "data": {"messages": [], "errors": []}}
with patch("tools.browser_tool._run_browser_command", return_value=empty) as mock_cmd:
browser_console(task_id="test")
calls = mock_cmd.call_args_list
assert calls[0][0] == ("test", "console", [])
assert calls[1][0] == ("test", "errors", [])
def test_empty_console_and_errors(self):
from tools.browser_tool import browser_console
empty = {"success": True, "data": {"messages": [], "errors": []}}
with patch("tools.browser_tool._run_browser_command", return_value=empty):
result = json.loads(browser_console(task_id="test"))
assert result["total_messages"] == 0
assert result["total_errors"] == 0
assert result["console_messages"] == []
assert result["js_errors"] == []
def test_handles_failed_commands(self):
from tools.browser_tool import browser_console
failed = {"success": False, "error": "No session"}
with patch("tools.browser_tool._run_browser_command", return_value=failed):
result = json.loads(browser_console(task_id="test"))
# Should still return success with empty data
assert result["success"] is True
assert result["total_messages"] == 0
assert result["total_errors"] == 0
# ── browser_console schema ───────────────────────────────────────────
class TestBrowserConsoleSchema:
"""browser_console is properly registered in the tool registry."""
def test_schema_in_browser_schemas(self):
from tools.browser_tool import BROWSER_TOOL_SCHEMAS
names = [s["name"] for s in BROWSER_TOOL_SCHEMAS]
assert "browser_console" in names
def test_schema_has_clear_param(self):
from tools.browser_tool import BROWSER_TOOL_SCHEMAS
schema = next(s for s in BROWSER_TOOL_SCHEMAS if s["name"] == "browser_console")
props = schema["parameters"]["properties"]
assert "clear" in props
assert props["clear"]["type"] == "boolean"
# ── browser_vision annotate ──────────────────────────────────────────
class TestBrowserVisionAnnotate:
"""browser_vision supports annotate parameter."""
def test_schema_has_annotate_param(self):
from tools.browser_tool import BROWSER_TOOL_SCHEMAS
schema = next(s for s in BROWSER_TOOL_SCHEMAS if s["name"] == "browser_vision")
props = schema["parameters"]["properties"]
assert "annotate" in props
assert props["annotate"]["type"] == "boolean"
def test_annotate_false_no_flag(self):
"""Without annotate, screenshot command has no --annotate flag."""
from tools.browser_tool import browser_vision
with (
patch("tools.browser_tool._run_browser_command") as mock_cmd,
patch("tools.browser_tool._aux_vision_client") as mock_client,
patch("tools.browser_tool._DEFAULT_VISION_MODEL", "test-model"),
patch("tools.browser_tool._get_vision_model", return_value="test-model"),
):
mock_cmd.return_value = {"success": True, "data": {}}
# Will fail at screenshot file read, but we can check the command
try:
browser_vision("test", annotate=False, task_id="test")
except Exception:
pass
if mock_cmd.called:
args = mock_cmd.call_args[0]
cmd_args = args[2] if len(args) > 2 else []
assert "--annotate" not in cmd_args
def test_annotate_true_adds_flag(self):
"""With annotate=True, screenshot command includes --annotate."""
from tools.browser_tool import browser_vision
with (
patch("tools.browser_tool._run_browser_command") as mock_cmd,
patch("tools.browser_tool._aux_vision_client") as mock_client,
patch("tools.browser_tool._DEFAULT_VISION_MODEL", "test-model"),
patch("tools.browser_tool._get_vision_model", return_value="test-model"),
):
mock_cmd.return_value = {"success": True, "data": {}}
try:
browser_vision("test", annotate=True, task_id="test")
except Exception:
pass
if mock_cmd.called:
args = mock_cmd.call_args[0]
cmd_args = args[2] if len(args) > 2 else []
assert "--annotate" in cmd_args
# ── auto-recording config ────────────────────────────────────────────
class TestRecordSessionsConfig:
"""browser.record_sessions config option."""
def test_default_config_has_record_sessions(self):
from hermes_cli.config import DEFAULT_CONFIG
browser_cfg = DEFAULT_CONFIG.get("browser", {})
assert "record_sessions" in browser_cfg
assert browser_cfg["record_sessions"] is False
def test_maybe_start_recording_disabled(self):
"""Recording doesn't start when config says record_sessions: false."""
from tools.browser_tool import _maybe_start_recording, _recording_sessions
with (
patch("tools.browser_tool._run_browser_command") as mock_cmd,
patch("builtins.open", side_effect=FileNotFoundError),
):
_maybe_start_recording("test-task")
mock_cmd.assert_not_called()
assert "test-task" not in _recording_sessions
def test_maybe_stop_recording_noop_when_not_recording(self):
"""Stopping when not recording is a no-op."""
from tools.browser_tool import _maybe_stop_recording, _recording_sessions
_recording_sessions.discard("test-task") # ensure not in set
with patch("tools.browser_tool._run_browser_command") as mock_cmd:
_maybe_stop_recording("test-task")
mock_cmd.assert_not_called()
# ── dogfood skill files ──────────────────────────────────────────────
class TestDogfoodSkill:
"""Dogfood skill files exist and have correct structure."""
@pytest.fixture(autouse=True)
def _skill_dir(self):
# Use the actual repo skills dir (not temp)
self.skill_dir = os.path.join(
os.path.dirname(__file__), "..", "..", "skills", "dogfood"
)
def test_skill_md_exists(self):
assert os.path.exists(os.path.join(self.skill_dir, "SKILL.md"))
def test_taxonomy_exists(self):
assert os.path.exists(
os.path.join(self.skill_dir, "references", "issue-taxonomy.md")
)
def test_report_template_exists(self):
assert os.path.exists(
os.path.join(self.skill_dir, "templates", "dogfood-report-template.md")
)
def test_skill_md_has_frontmatter(self):
with open(os.path.join(self.skill_dir, "SKILL.md")) as f:
content = f.read()
assert content.startswith("---")
assert "name: dogfood" in content
assert "description:" in content
def test_skill_references_browser_console(self):
with open(os.path.join(self.skill_dir, "SKILL.md")) as f:
content = f.read()
assert "browser_console" in content
def test_skill_references_annotate(self):
with open(os.path.join(self.skill_dir, "SKILL.md")) as f:
content = f.read()
assert "annotate" in content
def test_taxonomy_has_severity_levels(self):
with open(
os.path.join(self.skill_dir, "references", "issue-taxonomy.md")
) as f:
content = f.read()
assert "Critical" in content
assert "High" in content
assert "Medium" in content
assert "Low" in content
def test_taxonomy_has_categories(self):
with open(
os.path.join(self.skill_dir, "references", "issue-taxonomy.md")
) as f:
content = f.read()
assert "Functional" in content
assert "Visual" in content
assert "Accessibility" in content
assert "Console" in content

View File

@@ -144,6 +144,7 @@ def _socket_safe_tmpdir() -> str:
# Track active sessions per task
# Stores: session_name (always), bb_session_id + cdp_url (cloud mode only)
_active_sessions: Dict[str, Dict[str, str]] = {} # task_id -> {session_name, ...}
_recording_sessions: set = set() # task_ids with active recordings
# Flag to track if cleanup has been done
_cleanup_done = False
@@ -478,11 +479,31 @@ BROWSER_TOOL_SCHEMAS = [
"question": {
"type": "string",
"description": "What you want to know about the page visually. Be specific about what you're looking for."
},
"annotate": {
"type": "boolean",
"default": False,
"description": "If true, overlay numbered [N] labels on interactive elements. Each [N] maps to ref @eN for subsequent browser commands. Useful for QA and spatial reasoning about page layout."
}
},
"required": ["question"]
}
},
{
"name": "browser_console",
"description": "Get browser console output and JavaScript errors from the current page. Returns console.log/warn/error/info messages and uncaught JS exceptions. Use this to detect silent JavaScript errors, failed API calls, and application warnings. Requires browser_navigate to be called first.",
"parameters": {
"type": "object",
"properties": {
"clear": {
"type": "boolean",
"default": False,
"description": "If true, clear the message buffers after reading"
}
},
"required": []
}
},
]
@@ -998,9 +1019,10 @@ def browser_navigate(url: str, task_id: Optional[str] = None) -> str:
session_info = _get_session_info(effective_task_id)
is_first_nav = session_info.get("_first_nav", True)
# Mark that we've done at least one navigation
# Auto-start recording if configured and this is first navigation
if is_first_nav:
session_info["_first_nav"] = False
_maybe_start_recording(effective_task_id)
result = _run_browser_command(effective_task_id, "open", [url], timeout=60)
@@ -1264,6 +1286,10 @@ def browser_close(task_id: Optional[str] = None) -> str:
JSON string with close result
"""
effective_task_id = task_id or "default"
# Stop auto-recording before closing
_maybe_stop_recording(effective_task_id)
result = _run_browser_command(effective_task_id, "close", [])
# Close the backend session (Browserbase API in cloud mode, nothing extra in local mode)
@@ -1294,6 +1320,103 @@ def browser_close(task_id: Optional[str] = None) -> str:
}, ensure_ascii=False)
def browser_console(clear: bool = False, task_id: Optional[str] = None) -> str:
"""Get browser console messages and JavaScript errors.
Returns both console output (log/warn/error/info from the page's JS)
and uncaught exceptions (crashes, unhandled promise rejections).
Args:
clear: If True, clear the message/error buffers after reading
task_id: Task identifier for session isolation
Returns:
JSON string with console messages and JS errors
"""
effective_task_id = task_id or "default"
console_args = ["--clear"] if clear else []
error_args = ["--clear"] if clear else []
console_result = _run_browser_command(effective_task_id, "console", console_args)
errors_result = _run_browser_command(effective_task_id, "errors", error_args)
messages = []
if console_result.get("success"):
for msg in console_result.get("data", {}).get("messages", []):
messages.append({
"type": msg.get("type", "log"),
"text": msg.get("text", ""),
"source": "console",
})
errors = []
if errors_result.get("success"):
for err in errors_result.get("data", {}).get("errors", []):
errors.append({
"message": err.get("message", ""),
"source": "exception",
})
return json.dumps({
"success": True,
"console_messages": messages,
"js_errors": errors,
"total_messages": len(messages),
"total_errors": len(errors),
}, ensure_ascii=False)
def _maybe_start_recording(task_id: str):
"""Start recording if browser.record_sessions is enabled in config."""
if task_id in _recording_sessions:
return
try:
hermes_home = Path(os.environ.get("HERMES_HOME", Path.home() / ".hermes"))
config_path = hermes_home / "config.yaml"
record_enabled = False
if config_path.exists():
import yaml
with open(config_path) as f:
cfg = yaml.safe_load(f) or {}
record_enabled = cfg.get("browser", {}).get("record_sessions", False)
if not record_enabled:
return
recordings_dir = hermes_home / "browser_recordings"
recordings_dir.mkdir(parents=True, exist_ok=True)
_cleanup_old_recordings(max_age_hours=72)
import time
timestamp = time.strftime("%Y%m%d_%H%M%S")
recording_path = recordings_dir / f"session_{timestamp}_{task_id[:16]}.webm"
result = _run_browser_command(task_id, "record", ["start", str(recording_path)])
if result.get("success"):
_recording_sessions.add(task_id)
logger.info("Auto-recording browser session %s to %s", task_id, recording_path)
else:
logger.debug("Could not start auto-recording: %s", result.get("error"))
except Exception as e:
logger.debug("Auto-recording setup failed: %s", e)
def _maybe_stop_recording(task_id: str):
"""Stop recording if one is active for this session."""
if task_id not in _recording_sessions:
return
try:
result = _run_browser_command(task_id, "record", ["stop"])
if result.get("success"):
path = result.get("data", {}).get("path", "")
logger.info("Saved browser recording for session %s: %s", task_id, path)
except Exception as e:
logger.debug("Could not stop recording for %s: %s", task_id, e)
finally:
_recording_sessions.discard(task_id)
def browser_get_images(task_id: Optional[str] = None) -> str:
"""
Get all images on the current page.
@@ -1348,7 +1471,7 @@ def browser_get_images(task_id: Optional[str] = None) -> str:
}, ensure_ascii=False)
def browser_vision(question: str, task_id: Optional[str] = None) -> str:
def browser_vision(question: str, annotate: bool = False, task_id: Optional[str] = None) -> str:
"""
Take a screenshot of the current page and analyze it with vision AI.
@@ -1362,6 +1485,7 @@ def browser_vision(question: str, task_id: Optional[str] = None) -> str:
Args:
question: What you want to know about the page visually
annotate: If True, overlay numbered [N] labels on interactive elements
task_id: Task identifier for session isolation
Returns:
@@ -1393,10 +1517,13 @@ def browser_vision(question: str, task_id: Optional[str] = None) -> str:
_cleanup_old_screenshots(screenshots_dir, max_age_hours=24)
# Take screenshot using agent-browser
screenshot_args = [str(screenshot_path)]
if annotate:
screenshot_args.insert(0, "--annotate")
result = _run_browser_command(
effective_task_id,
"screenshot",
[str(screenshot_path)],
screenshot_args,
timeout=30
)
@@ -1456,11 +1583,15 @@ def browser_vision(question: str, task_id: Optional[str] = None) -> str:
)
analysis = response.choices[0].message.content
return json.dumps({
response_data = {
"success": True,
"analysis": analysis,
"screenshot_path": str(screenshot_path),
}, ensure_ascii=False)
}
# Include annotation data if annotated screenshot was taken
if annotate and result.get("data", {}).get("annotations"):
response_data["annotations"] = result["data"]["annotations"]
return json.dumps(response_data, ensure_ascii=False)
except Exception as e:
# Keep the screenshot if it was captured successfully — the failure is
@@ -1490,6 +1621,25 @@ def _cleanup_old_screenshots(screenshots_dir, max_age_hours=24):
pass # Non-critical — don't fail the screenshot operation
def _cleanup_old_recordings(max_age_hours=72):
"""Remove browser recordings older than max_age_hours to prevent disk bloat."""
import time
try:
hermes_home = Path(os.environ.get("HERMES_HOME", Path.home() / ".hermes"))
recordings_dir = hermes_home / "browser_recordings"
if not recordings_dir.exists():
return
cutoff = time.time() - (max_age_hours * 3600)
for f in recordings_dir.glob("session_*.webm"):
try:
if f.stat().st_mtime < cutoff:
f.unlink()
except Exception:
pass
except Exception:
pass
# ============================================================================
# Cleanup and Management Functions
# ============================================================================
@@ -1561,6 +1711,9 @@ def cleanup_browser(task_id: Optional[str] = None) -> None:
bb_session_id = session_info.get("bb_session_id", "unknown")
logger.debug("Found session for task %s: bb_session_id=%s", task_id, bb_session_id)
# Stop auto-recording before closing (saves the file)
_maybe_stop_recording(task_id)
# Try to close via agent-browser first (needs session in _active_sessions)
try:
_run_browser_command(task_id, "close", [], timeout=10)
@@ -1776,6 +1929,13 @@ registry.register(
name="browser_vision",
toolset="browser",
schema=_BROWSER_SCHEMA_MAP["browser_vision"],
handler=lambda args, **kw: browser_vision(question=args.get("question", ""), task_id=kw.get("task_id")),
handler=lambda args, **kw: browser_vision(question=args.get("question", ""), annotate=args.get("annotate", False), task_id=kw.get("task_id")),
check_fn=check_browser_requirements,
)
registry.register(
name="browser_console",
toolset="browser",
schema=_BROWSER_SCHEMA_MAP["browser_console"],
handler=lambda args, **kw: browser_console(clear=args.get("clear", False), task_id=kw.get("task_id")),
check_fn=check_browser_requirements,
)

View File

@@ -620,6 +620,16 @@ code_execution:
max_tool_calls: 50 # Max tool calls within code execution
```
## Browser
Configure browser automation behavior:
```yaml
browser:
inactivity_timeout: 120 # Seconds before auto-closing idle sessions
record_sessions: false # Auto-record browser sessions as WebM videos to ~/.hermes/browser_recordings/
```
## Delegation
Configure subagent behavior for the delegate tool:

View File

@@ -142,6 +142,16 @@ What does the chart on this page show?
Screenshots are stored in `~/.hermes/browser_screenshots/` and automatically cleaned up after 24 hours.
### `browser_console`
Get browser console output (log/warn/error messages) and uncaught JavaScript exceptions from the current page. Essential for detecting silent JS errors that don't appear in the accessibility tree.
```
Check the browser console for any JavaScript errors
```
Use `clear=True` to clear the console after reading, so subsequent calls only show new messages.
### `browser_close`
Close the browser session and release resources. Call this when done to free up Browserbase session quota.
@@ -175,6 +185,17 @@ Agent workflow:
4. browser_close()
```
## Session Recording
Automatically record browser sessions as WebM video files:
```yaml
browser:
record_sessions: true # default: false
```
When enabled, recording starts automatically on the first `browser_navigate` and saves to `~/.hermes/browser_recordings/` when the session closes. Works in both local and cloud (Browserbase) modes. Recordings older than 72 hours are automatically cleaned up.
## Stealth Features
Browserbase provides automatic stealth capabilities:

View File

@@ -15,7 +15,7 @@ Tools are functions that extend the agent's capabilities. They're organized into
| **Web** | `web_search`, `web_extract` | Search the web, extract page content |
| **Terminal** | `terminal`, `process` | Execute commands (local/docker/singularity/modal/daytona/ssh backends), manage background processes |
| **File** | `read_file`, `write_file`, `patch`, `search_files` | Read, write, edit, and search files |
| **Browser** | `browser_navigate`, `browser_click`, `browser_type`, etc. | Full browser automation via Browserbase |
| **Browser** | `browser_navigate`, `browser_click`, `browser_type`, `browser_console`, etc. | Full browser automation via Browserbase |
| **Vision** | `vision_analyze` | Image analysis via multimodal models |
| **Image Gen** | `image_generate` | Generate images (FLUX via FAL) |
| **TTS** | `text_to_speech` | Text-to-speech (Edge TTS / ElevenLabs / OpenAI) |