
Alexander Whitestone f85c07551a


feat: browser integration analysis + PoC tool (#262)

Add docs/browser-integration-analysis.md:
- Technical analysis of Browser Use, Graphify, and Multica for Hermes
- Integration paths, security considerations, performance characteristics
- Clear recommendations: Browser Use (integrate), Graphify (investigate),
  Multica (skip)
- Phased integration roadmap

Add tools/browser_use_tool.py:
- Wraps browser-use library as Hermes tool (toolset: browser_use)
- Three tools: browser_use_run, browser_use_extract, browser_use_compare
- Autonomous multi-step browser automation from natural language tasks
- Integrates with existing url_safety and website_policy security modules
- Supports both local Playwright and cloud execution modes
- Follows existing tool registration pattern (registry.register)

Refs: #262

2026-04-10 07:10:29 -04:00

14 KiB


Browser Integration Analysis: Browser Use + Graphify + Multica

Issue: #262 — Investigation: Browser Use + Graphify + Multica — Hermes Integration Analysis
Date: 2026-04-10
Author: Hermes Agent (burn branch)

Executive Summary

This document evaluates three browser-related projects for integration with hermes-agent. Each tool is assessed on capability, integration complexity, security posture, and strategic fit with Hermes's existing browser stack.

Tool	Recommendation	Integration Path
Browser Use	Integrate (PoC)	Tool + MCP server
Graphify	Investigate further	MCP server or tool
Multica	Skip (for now)	N/A — premature

1. Browser Use (`browser-use`)

What It Does

Browser Use is a Python library that wraps Playwright to provide LLM-driven browser automation. An agent describes a task in natural language, and browser-use autonomously navigates, clicks, types, and extracts data by feeding the page's accessibility tree to an LLM and executing the resulting actions in a loop.

Key capabilities:

Autonomous multi-step browser workflows from a single text instruction
Accessibility tree extraction (DOM + ARIA snapshot)
Screenshot and visual context for multimodal models
Form filling, navigation, data extraction, file downloads
Custom actions (register callable Python functions the LLM can invoke)
Parallel agent execution (multiple browser agents simultaneously)
Cloud execution via browser-use.com API (no local browser needed)

Integration with Hermes

Primary path: a custom Hermes tool that wraps browser-use as a high-level "automated browsing" capability, alongside the low-level, agent-controlled tools in the existing browser_tool.py.

Why a separate tool rather than replacing browser_tool.py:

Hermes's existing browser tools (navigate, snapshot, click, type) give the LLM fine-grained step-by-step control — this is valuable for interactive tasks and debugging.
browser-use gives coarse-grained "do this task for me" autonomy — better for multi-step extraction workflows where the LLM would otherwise need 10+ tool calls.
Both modes have legitimate use cases. Offer both.

Integration architecture:

hermes-agent
  tools/
    browser_tool.py          # Existing — low-level agent-controlled browsing
    browser_use_tool.py      # NEW — high-level autonomous browsing (PoC)
      |
      +-- browser_use.run()  # Wraps browser-use Agent class
      +-- browser_use.extract()  # Wraps browser-use for data extraction
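A minimal sketch of the `browser_use.run()` wrapper, assuming the tool receives an agent factory so the bounding logic can be exercised without a browser. With the real library the factory would construct a browser-use `Agent` and the wrapper would await its `run(max_steps=...)` coroutine, whose return value is a history object the tool would reduce to a final result. The names `run_browser_task`, `RunnableAgent` and `FakeAgent`, the result dict shape, and the default limits are hypothetical, not taken from the PoC.

```python
import asyncio
from typing import Any, Callable, Protocol


class RunnableAgent(Protocol):
    """Anything with an async run(max_steps=...) method, e.g. a browser-use Agent."""

    async def run(self, max_steps: int) -> Any: ...


# Hypothetical factory: builds an agent for one natural-language task. With the real
# library this might be `lambda task: Agent(task=task, llm=llm)`; check the installed
# browser-use version, since constructor arguments have changed between releases.
AgentFactory = Callable[[str], RunnableAgent]


async def run_browser_task(
    task: str,
    make_agent: AgentFactory,
    max_steps: int = 25,
    timeout_s: float = 120.0,
) -> dict:
    """Run one autonomous browsing task under a step cap and a wall-clock timeout."""
    if not task.strip():
        return {"success": False, "error": "empty task"}
    agent = make_agent(task)
    try:
        # max_steps bounds the agent's internal LLM loop; wait_for bounds total time.
        history = await asyncio.wait_for(agent.run(max_steps=max_steps), timeout=timeout_s)
    except asyncio.TimeoutError:
        return {"success": False, "error": f"timed out after {timeout_s}s"}
    return {"success": True, "result": str(history)}


class FakeAgent:
    """Offline stand-in used only to exercise the wrapper."""

    def __init__(self, task: str) -> None:
        self.task = task

    async def run(self, max_steps: int) -> str:
        return f"done: {self.task} in <= {max_steps} steps"


print(asyncio.run(run_browser_task("find the pricing page", FakeAgent)))
```

Injecting the factory keeps the timeout and step-cap policy in one place regardless of whether the agent runs locally or in cloud mode.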

The tool registers with tools/registry.py as toolset browser_use with a check_fn that verifies browser-use is installed.
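The check_fn idea needs only the standard library: `importlib.util.find_spec` reports whether a package is importable without importing it. The `ToolEntry` record, `REGISTRY` dict and `register()` helper below are hypothetical stand-ins for `registry.register` in tools/registry.py, whose real parameters are not shown in this document; the three tool names are the ones the PoC registers.

```python
import importlib.util
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class ToolEntry:
    name: str
    toolset: str
    check_fn: Callable[[], bool]


# Hypothetical stand-in for the registry in tools/registry.py.
REGISTRY: Dict[str, ToolEntry] = {}


def register(name: str, toolset: str, check_fn: Callable[[], bool]) -> None:
    REGISTRY[name] = ToolEntry(name, toolset, check_fn)


def check_browser_use_requirements() -> bool:
    """True if the browser-use package (import name: browser_use) is installed."""
    return importlib.util.find_spec("browser_use") is not None


for tool_name in ("browser_use_run", "browser_use_extract", "browser_use_compare"):
    register(tool_name, toolset="browser_use", check_fn=check_browser_use_requirements)


def available_tools(toolset: str) -> List[str]:
    """Names of tools in a toolset whose check_fn currently passes."""
    return [e.name for e in REGISTRY.values() if e.toolset == toolset and e.check_fn()]
```

Because the check runs at lookup time, the toolset simply disappears on installations without browser-use instead of failing when a tool is called.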

Alternative: MCP server — browser-use could also be exposed as an MCP server for multi-agent setups where subagents need independent browser access. This is a follow-up, not the initial integration.

Dependencies and Requirements

pip install browser-use          # Core library
playwright install chromium      # Playwright browser binary

Or use cloud mode with BROWSER_USE_API_KEY — no local browser needed.
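Choosing between the two modes could be a simple environment check. This sketch assumes cloud mode is preferred whenever `BROWSER_USE_API_KEY` is set, and otherwise falls back to local mode only if the `playwright` Python package is importable; the function name and the preference order are illustrative assumptions. Importability of `playwright` does not prove that `playwright install chromium` has been run, so a launch failure still has to be handled.

```python
import importlib.util
import os
from typing import Mapping, Optional


def select_browser_use_mode(env: Optional[Mapping[str, str]] = None) -> Optional[str]:
    """Return "cloud", "local", or None if neither mode looks usable."""
    env = os.environ if env is None else env
    if env.get("BROWSER_USE_API_KEY", "").strip():
        return "cloud"  # credentials present: no local browser needed
    if importlib.util.find_spec("playwright") is not None:
        return "local"  # Python package present; browser binary is checked at launch
    return None
```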

Requirements: Python 3.11+ and Playwright. There are no exotic system dependencies beyond what Hermes already requires for its existing browser tool.

Security Considerations

Concern	Mitigation
Arbitrary URL access	Reuse Hermes's `website_policy` and `url_safety` modules
Data exfiltration	Browser-use agents run in isolated Playwright contexts; no access to Hermes filesystem
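Prompt injection via page	browser-use feeds page content to its LLM, the same exposure as the existing browser_snapshot; browser-use drives that loop with its own prompts, so Hermes's prompt hardening applies to the result Hermes receives, and that result should be treated as untrusted page-derived content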
Credential leakage	Do not pass API keys to untrusted pages; cloud mode keeps credentials server-side
Resource exhaustion	Cap the browser-use agent loop with max_steps (passed to the Agent's run() call) to prevent infinite loops
Downloaded files	Playwright download path is sandboxed; tool should restrict to temp directory
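Two of these mitigations are easy to sketch. `is_url_allowed` is a hypothetical stand-in for Hermes's `url_safety` / `website_policy` checks (their real interfaces are not documented here): it rejects non-HTTP(S) schemes, blocklisted domains and literal private or loopback IP addresses. `confined_download_path` keeps page-supplied filenames inside a temp directory. Both are assumptions about how the PoC could apply the policy, not its actual code; a production check would also resolve hostnames before deciding.

```python
import ipaddress
import tempfile
from pathlib import Path
from typing import Iterable
from urllib.parse import urlparse

# Hypothetical download root under the system temp directory.
DOWNLOAD_ROOT = Path(tempfile.gettempdir()) / "hermes_browser_use"


def is_url_allowed(url: str, blocked_domains: Iterable[str] = ()) -> bool:
    """Reject non-web schemes, blocklisted domains, and literal private/loopback IPs."""
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https") or not parsed.hostname:
        return False
    host = parsed.hostname.lower()
    for domain in blocked_domains:
        domain = domain.lower()
        if host == domain or host.endswith("." + domain):
            return False
    try:
        ip = ipaddress.ip_address(host)
    except ValueError:
        return True  # a hostname, not an IP literal (DNS is not resolved here)
    return not (ip.is_private or ip.is_loopback or ip.is_link_local)


def confined_download_path(filename: str, root: Path = DOWNLOAD_ROOT) -> Path:
    """Map a page-supplied filename into root, refusing anything that escapes it."""
    # Keep only the final path component, so "../../etc/passwd" becomes "passwd".
    candidate = (root / Path(filename).name).resolve()
    if candidate.parent != root.resolve():
        raise ValueError(f"refusing download outside {root}")
    return candidate
```

Running the URL check before the task is handed to browser-use only covers the starting URL; pages the agent navigates to afterwards would need the same check applied inside the browsing loop.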

Key security property: browser-use executes within Playwright's sandboxed browser context. The LLM controlling browser-use is Hermes itself (or a configured auxiliary model), not the page content. This is equivalent to the existing browser tool's security model.

Performance Characteristics

Startup: ~2-3s for Playwright Chromium launch (same as existing local mode)
Per-step: ~1-3s per LLM call + browser action (comparable to manual browser_navigate + browser_snapshot loop)
Full task (5-10 steps): ~15-45s depending on page complexity
Token usage: Each step sends the accessibility tree to the LLM. Browser-use also supports a vision mode (screenshots), which is more token-heavy.
Parallelism: Supports multiple concurrent browser agents

Comparison to existing tools: For a 10-step browser task, the existing approach requires 10+ Hermes API calls (navigate, snapshot, click, type, snapshot, click, ...). Browser-use consolidates this into a single Hermes tool call that internally runs its own LLM loop. This reduces Hermes API round-trips but shifts the LLM cost to browser-use's internal model calls.
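The trade-off can be made concrete with a back-of-the-envelope model. It assumes one Hermes tool call per browser action with the existing tools, and roughly one internal LLM call per step inside browser-use (matching the per-step figure above); it ignores retries and the final Hermes turn that summarizes results.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CallProfile:
    hermes_tool_calls: int   # round-trips through Hermes's own agent loop
    internal_llm_calls: int  # LLM calls made inside browser-use's loop

    @property
    def total_llm_calls(self) -> int:
        # Each Hermes tool call is preceded by one Hermes LLM call that chooses it.
        return self.hermes_tool_calls + self.internal_llm_calls


def existing_browser_tools(steps: int) -> CallProfile:
    """Fine-grained tools: Hermes itself decides every navigate/click/type/snapshot."""
    return CallProfile(hermes_tool_calls=steps, internal_llm_calls=0)


def browser_use_tool(steps: int) -> CallProfile:
    """One browser_use_run call; browser-use spends about one LLM call per step."""
    return CallProfile(hermes_tool_calls=1, internal_llm_calls=steps)


for profile in (existing_browser_tools(10), browser_use_tool(10)):
    print(profile, "total LLM calls:", profile.total_llm_calls)
```

For a 10-step task this gives 10 Hermes tool calls versus 1, while total LLM calls stay roughly flat (10 versus 11): the work moves into browser-use rather than disappearing.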

Recommendation: INTEGRATE

Browser Use fills a clear gap — autonomous multi-step browser tasks — that complements Hermes's existing fine-grained browser tools. The integration is straightforward (Python library, same security model). A PoC tool is provided in tools/browser_use_tool.py.

2. Graphify

What It Does

Graphify is a knowledge graph extraction tool that processes unstructured text (including web content) and extracts entities, relationships, and structured knowledge into a graph format. It can:

Extract entities and relationships from text using NLP/LLM techniques
Build knowledge graphs from web-scraped content
Support incremental graph updates as new content is processed
Export graphs in standard formats (JSON-LD, RDF, etc.)

(Note: "Graphify" as a project name is used by several tools. The most relevant for browser integration is the concept of extracting structured knowledge graphs from web content during or after browsing.)

Integration with Hermes

Primary path: MCP server or Hermes tool that takes web content (from browser_tool or web_extract) and produces structured knowledge graphs.

Integration architecture:

hermes-agent
  tools/
    graphify_tool.py          # NEW — knowledge graph extraction from text
      |
      +-- graphify.extract()  # Extract entities/relations from text
      +-- graphify.merge()    # Merge into existing graph
      +-- graphify.query()    # Query the accumulated graph

Or via MCP:

hermes-agent --mcp-server graphify-mcp
  -> tools: graphify_extract, graphify_query, graphify_export

Synergy with browser tools:

browser_navigate + browser_snapshot to get page content
graphify_extract to pull entities and relationships
Repeat across multiple pages to build a domain knowledge graph
graphify_query to answer questions about accumulated knowledge
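The extract/merge/query loop can be sketched with plain Python data structures. `extract_triples` is a hypothetical placeholder for graphify_extract (it only parses `subject | relation | object` lines, so the example runs offline), and the graph is an in-memory adjacency map rather than NetworkX or Neo4j.

```python
from collections import defaultdict
from typing import DefaultDict, Iterable, List, Set, Tuple

Triple = Tuple[str, str, str]  # (subject, relation, object)


def extract_triples(page_text: str) -> List[Triple]:
    """Hypothetical placeholder for graphify_extract: parses 'subject | relation | object' lines.

    A real extractor would send page_text to an LLM or NLP pipeline instead.
    """
    triples: List[Triple] = []
    for line in page_text.splitlines():
        parts = [p.strip() for p in line.split("|")]
        if len(parts) == 3 and all(parts):
            triples.append((parts[0], parts[1], parts[2]))
    return triples


class KnowledgeGraph:
    """In-memory graph: subject -> {(relation, object)}; sets make merges idempotent."""

    def __init__(self) -> None:
        self.edges: DefaultDict[str, Set[Tuple[str, str]]] = defaultdict(set)

    def merge(self, triples: Iterable[Triple]) -> int:
        """Add triples (graphify_merge); return how many were new."""
        added = 0
        for subject, relation, obj in triples:
            if (relation, obj) not in self.edges[subject]:
                self.edges[subject].add((relation, obj))
                added += 1
        return added

    def query(self, subject: str, relation: str) -> List[str]:
        """Objects linked to subject by relation (graphify_query), sorted for stable output."""
        return sorted(o for r, o in self.edges.get(subject, set()) if r == relation)


graph = KnowledgeGraph()
pages = [  # e.g. text returned by browser_snapshot or web_extract for each page
    "browser-use | wraps | Playwright\nHermes | uses | Firecrawl",
    "Hermes | uses | Firecrawl\nHermes | uses | Browserbase",
]
for text in pages:
    graph.merge(extract_triples(text))
print(graph.query("Hermes", "uses"))  # ['Browserbase', 'Firecrawl']
```

Deduplicating on merge matters here because the same fact tends to reappear across pages visited during one research session.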

Dependencies and Requirements

Varies significantly depending on the specific Graphify implementation. Typical requirements:

Python 3.11+
spaCy or similar NLP library for entity extraction
Optional: Neo4j or NetworkX for graph storage
LLM access (can reuse Hermes's existing model configuration)

Security Considerations

Concern	Mitigation
Processing untrusted text	NLP extraction is read-only; no code execution
Graph data persistence	Store in Hermes's data directory with appropriate permissions
Information aggregation	Knowledge graphs could accumulate sensitive data; provide clear/delete commands
External graph DB access	If using Neo4j, require authentication and restrict to localhost
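For the small-graph case, the persistence and clear/delete mitigations could look like the following sketch using the standard library's `sqlite3`. The schema, the file name, tracking a `source_url` per triple, and the `0o600` permissions are assumptions; Hermes's actual data-directory layout is not described here, so the demo writes to a throwaway temp directory.

```python
import os
import sqlite3
import tempfile
from pathlib import Path
from typing import Optional


class TripleStore:
    """SQLite-backed triple store with owner-only permissions and a clear command."""

    def __init__(self, db_path: Path) -> None:
        self.path = db_path
        db_path.parent.mkdir(parents=True, exist_ok=True)
        self.conn = sqlite3.connect(db_path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS triples ("
            "subject TEXT NOT NULL, relation TEXT NOT NULL, object TEXT NOT NULL, "
            "source_url TEXT NOT NULL DEFAULT '', "
            "PRIMARY KEY (subject, relation, object))"
        )
        self.conn.commit()
        os.chmod(db_path, 0o600)  # owner read/write only (POSIX semantics)

    def add(self, subject: str, relation: str, obj: str, source_url: str = "") -> None:
        # INSERT OR IGNORE keeps repeated extraction of the same page idempotent.
        self.conn.execute(
            "INSERT OR IGNORE INTO triples VALUES (?, ?, ?, ?)",
            (subject, relation, obj, source_url),
        )
        self.conn.commit()

    def count(self) -> int:
        return self.conn.execute("SELECT COUNT(*) FROM triples").fetchone()[0]

    def clear(self, source_url: Optional[str] = None) -> int:
        """Delete all triples, or only those extracted from source_url; return rows removed."""
        if source_url is None:
            cur = self.conn.execute("DELETE FROM triples")
        else:
            cur = self.conn.execute("DELETE FROM triples WHERE source_url = ?", (source_url,))
        self.conn.commit()
        return cur.rowcount


# Demo in a throwaway directory; Hermes would use its own data directory instead.
store = TripleStore(Path(tempfile.mkdtemp()) / "graph.sqlite3")
store.add("Hermes", "uses", "Firecrawl", source_url="https://example.com/a")
store.add("Hermes", "uses", "Browserbase", source_url="https://example.com/b")
print(store.clear(source_url="https://example.com/a"), store.count())  # 1 1
```

Recording the source URL is what makes a targeted delete possible, so sensitive material picked up from one site can be removed without wiping the whole graph.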

Performance Characteristics

Extraction: ~0.5-2s per page depending on content length and NLP model
Graph operations: Sub-second for graphs under 100K nodes
Storage: Lightweight (JSON/SQLite) for small graphs, Neo4j for large-scale
Token usage: If using LLM-based extraction, ~500-2000 tokens per page

Recommendation: INVESTIGATE FURTHER

The concept is sound — knowledge graph extraction from web content is a natural complement to browser tools. However:

Multiple competing tools exist under this name; need to identify the best-maintained option
Value proposition unclear vs. Hermes's existing memory system and file-based knowledge storage
NLP dependency adds complexity (large spaCy models such as en_core_web_lg run to hundreds of MB)

Suggested next steps:

Evaluate specific Graphify implementations (graphify.ai, custom NLP pipelines)
Prototype with a lightweight approach: LLM-based entity extraction + NetworkX
Assess whether Hermes's existing memory/graph_store.py can serve this role

3. Multica

What It Does

Multica is a multi-agent browser coordination framework. It enables multiple AI agents to collaboratively browse the web, with features for:

Task decomposition: splitting complex web tasks across multiple agents
Shared browser state: agents see a common view of browsing progress
Coordination protocols: agents can communicate about what they've found
Parallel web research: multiple agents researching different aspects simultaneously

Integration with Hermes

Theoretical path: Multica would integrate as a higher-level orchestration layer on top of Hermes's existing browser tools, coordinating multiple Hermes subagents (via delegate_tool) each with browser access.

Integration architecture:

hermes-agent (orchestrator)
  delegate_tool -> subagent_1 (browser_navigate, browser_snapshot, ...)
  delegate_tool -> subagent_2 (browser_navigate, browser_snapshot, ...)
  delegate_tool -> subagent_3 (browser_navigate, browser_snapshot, ...)
                    |
                    +-- Multica coordination layer (shared state, task splitting)

Dependencies and Requirements

Complex multi-agent orchestration infrastructure
Shared state management between agents
Potentially a custom runtime for agent coordination
Likely requires significant architectural changes to Hermes's delegation model

Security Considerations

Concern	Mitigation
Multiple agents on same browser	Session isolation per agent (Hermes already does this)
Coordinated exfiltration	Same per-agent restrictions apply
Amplified prompt injection	Each agent processes its own pages independently
Resource multiplication	N agents = N browser instances = Nx resource usage

Performance Characteristics

Scaling: Near-linear improvement for embarrassingly parallel tasks (e.g., "research 10 companies simultaneously")
Overhead: Significant coordination overhead for tightly coupled tasks
Resource cost: Each agent needs its own LLM calls + browser instance
Complexity: Debugging multi-agent browser workflows is extremely difficult

Recommendation: SKIP (for now)

Multica addresses a real need (parallel web research) but is premature for Hermes for several reasons:

Hermes already has subagent delegation (delegate_tool) — agents can already do parallel browser work without Multica
No mature implementation — Multica is more of a concept than a production-ready tool
Complexity vs. benefit — the coordination overhead and debugging difficulty outweigh the benefits for most use cases
Better alternatives exist — for parallel research, simply delegating multiple subagents with browser tools is simpler and already works
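Plain parallel delegation, the simpler alternative, can be sketched with `asyncio`. The `delegate` callable stands in for a `delegate_tool` call (its real interface is not shown here), and `ConcurrencyProbe` is a hypothetical fake subagent that records how many run at once. The semaphore caps concurrent subagents, and therefore browser instances, which addresses the N-agents-means-N-browsers cost in the security table above.

```python
import asyncio
from typing import Awaitable, Callable, List

Delegate = Callable[[str], Awaitable[str]]


async def research_in_parallel(
    topics: List[str], delegate: Delegate, max_concurrent: int = 3
) -> List[str]:
    """Fan independent topics out to subagents, at most max_concurrent at a time."""
    gate = asyncio.Semaphore(max_concurrent)

    async def run_one(topic: str) -> str:
        async with gate:  # each held slot is roughly one subagent with its own browser
            return await delegate(topic)

    # gather preserves input order, so results line up with topics
    return list(await asyncio.gather(*(run_one(t) for t in topics)))


class ConcurrencyProbe:
    """Hypothetical fake subagent that records how many run at once."""

    def __init__(self) -> None:
        self.active = 0
        self.peak = 0

    async def delegate_research(self, topic: str) -> str:
        self.active += 1
        self.peak = max(self.peak, self.active)
        await asyncio.sleep(0.01)  # stands in for browsing time
        self.active -= 1
        return f"summary of {topic}"


probe = ConcurrencyProbe()
results = asyncio.run(
    research_in_parallel([f"company {i}" for i in range(10)], probe.delegate_research)
)
print(results[0], "| peak concurrency:", probe.peak)  # summary of company 0 | peak concurrency: 3
```

This covers the embarrassingly parallel case ("research 10 companies") without any shared state between subagents, which is exactly what Multica would add and what this pattern deliberately avoids.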

Revisit when: Hermes's delegation model supports shared state between subagents, or a mature Multica implementation emerges.

Integration Roadmap

Phase 1: Browser Use PoC (this PR)

Create tools/browser_use_tool.py wrapping browser-use as Hermes tool
Create docs/browser-integration-analysis.md (this document)
Test with real browser tasks
Add to toolset configuration

Phase 2: Browser Use Production (follow-up)

Add browser_use to toolsets.py toolset definitions
Add configuration options in config.yaml
Add tests in tests/test_browser_use_tool.py
Consider MCP server variant for subagent use
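For the config.yaml step, one option is to load a `browser_use` section into a typed settings object with validated defaults. The key names and default values below are hypothetical (the PoC's actual options are not listed in this document); the mapping passed in is what a YAML loader would return for that section.

```python
from dataclasses import dataclass, fields
from typing import Any, Mapping


@dataclass(frozen=True)
class BrowserUseSettings:
    mode: str = "local"       # "local" (Playwright) or "cloud" (BROWSER_USE_API_KEY)
    max_steps: int = 25       # cap on the agent's internal loop
    timeout_s: float = 120.0  # wall-clock cap per task
    headless: bool = True

    @classmethod
    def from_mapping(cls, raw: Mapping[str, Any]) -> "BrowserUseSettings":
        """Build settings from a parsed config section, rejecting typos and bad values."""
        known = {f.name for f in fields(cls)}
        unknown = set(raw) - known
        if unknown:
            raise ValueError(f"unknown browser_use options: {sorted(unknown)}")
        settings = cls(**raw)
        if settings.mode not in ("local", "cloud"):
            raise ValueError("mode must be 'local' or 'cloud'")
        if settings.max_steps < 1 or settings.timeout_s <= 0:
            raise ValueError("max_steps must be >= 1 and timeout_s > 0")
        return settings
```

Rejecting unknown keys surfaces misspelled options at startup instead of letting them silently fall back to defaults.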

Phase 3: Graphify Investigation (follow-up)

Evaluate specific Graphify implementations
Prototype lightweight LLM-based entity extraction tool
Assess integration with existing graph_store.py
Create PoC if investigation is positive

Phase 4: Multi-Agent Browser (future)

Monitor Multica ecosystem maturity
Evaluate when delegation model supports shared state
Consider simpler parallel delegation patterns first

Appendix: Existing Browser Stack

Hermes already has a comprehensive browser tool stack:

Component	Description
`browser_tool.py`	Low-level agent-controlled browser (navigate, click, type, snapshot)
`browser_camofox.py`	Anti-detection browser via Camofox REST API
`browser_providers/`	Cloud providers (Browserbase, Browser Use API, Firecrawl)
`web_tools.py`	Web search (Parallel) and extraction (Firecrawl)
`mcp_tool.py`	MCP client for connecting external tool servers

The existing stack covers:

Local browsing: Headless Chromium via agent-browser CLI
Cloud browsing: Browserbase, Browser Use cloud, Firecrawl
Anti-detection: Camofox (local) or Browserbase advanced stealth
Content extraction: Firecrawl for clean markdown extraction
Search: Parallel AI web search

New browser integrations should complement rather than replace these tools.
