hermes-agent/docs/browser-integration-analysis.md
Alexander Whitestone f85c07551a
feat: browser integration analysis + PoC tool (#262)
Add docs/browser-integration-analysis.md:
- Technical analysis of Browser Use, Graphify, and Multica for Hermes
- Integration paths, security considerations, performance characteristics
- Clear recommendations: Browser Use (integrate), Graphify (investigate),
  Multica (skip)
- Phased integration roadmap

Add tools/browser_use_tool.py:
- Wraps browser-use library as Hermes tool (toolset: browser_use)
- Three tools: browser_use_run, browser_use_extract, browser_use_compare
- Autonomous multi-step browser automation from natural language tasks
- Integrates with existing url_safety and website_policy security modules
- Supports both local Playwright and cloud execution modes
- Follows existing tool registration pattern (registry.register)

Refs: #262
2026-04-10 07:10:29 -04:00


Browser Integration Analysis: Browser Use + Graphify + Multica

Issue: #262 - Investigation: Browser Use + Graphify + Multica - Hermes Integration Analysis
Date: 2026-04-10
Author: Hermes Agent (burn branch)

Executive Summary

This document evaluates three browser-related projects for integration with hermes-agent. Each tool is assessed on capability, integration complexity, security posture, and strategic fit with Hermes's existing browser stack.

| Tool | Recommendation | Integration Path |
|------|----------------|------------------|
| Browser Use | Integrate (PoC) | Tool + MCP server |
| Graphify | Investigate further | MCP server or tool |
| Multica | Skip (for now) | N/A (premature) |

1. Browser Use (browser-use)

What It Does

Browser Use is a Python library that wraps Playwright to provide LLM-driven browser automation. An agent describes a task in natural language, and browser-use autonomously navigates, clicks, types, and extracts data by feeding the page's accessibility tree to an LLM and executing the resulting actions in a loop.

Key capabilities:

  • Autonomous multi-step browser workflows from a single text instruction
  • Accessibility tree extraction (DOM + ARIA snapshot)
  • Screenshot and visual context for multimodal models
  • Form filling, navigation, data extraction, file downloads
  • Custom actions (register callable Python functions the LLM can invoke)
  • Parallel agent execution (multiple browser agents simultaneously)
  • Cloud execution via browser-use.com API (no local browser needed)
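
The observe-decide-act loop described above can be sketched in a few lines. This is a minimal illustration, not browser-use's actual implementation: `Action`, `FakePage`, `run_task`, and `scripted_policy` are hypothetical names, the page is a stand-in that only records actions, and a deterministic function plays the role of the LLM.

```python
from dataclasses import dataclass, field


@dataclass
class Action:
    kind: str            # "click", "type", or "done"
    target: str = ""     # element the action applies to
    text: str = ""       # text to enter, for "type" actions


@dataclass
class FakePage:
    """Stand-in for a browser page; it only records the actions applied to it."""
    log: list = field(default_factory=list)

    def snapshot(self) -> str:
        # A real implementation would serialize the page's element tree here.
        return f"actions so far: {len(self.log)}"

    def apply(self, action: Action) -> None:
        self.log.append(f"{action.kind}:{action.target}")


def run_task(task, page, decide, max_steps=10):
    """Observe, decide, act; stop on "done" or after max_steps iterations."""
    for _ in range(max_steps):
        action = decide(task, page.snapshot())
        if action.kind == "done":
            break
        page.apply(action)
    return page.log


def scripted_policy(task: str, observation: str) -> Action:
    """Deterministic stub that plays the role of the LLM."""
    if observation == "actions so far: 0":
        return Action("click", "search-box")
    if observation == "actions so far: 1":
        return Action("type", "search-box", task)
    return Action("done")
```

Calling `run_task("find pricing", FakePage(), scripted_policy)` returns `["click:search-box", "type:search-box"]`. The `max_steps` bound is what keeps a policy that never says "done" from looping forever.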

Integration with Hermes

Primary path: a custom Hermes tool that wraps browser-use as a high-level "automated browsing" capability, alongside the existing low-level, agent-controlled tools in browser_tool.py.

Why a separate tool rather than replacing browser_tool.py:

  • Hermes's existing browser tools (navigate, snapshot, click, type) give the LLM fine-grained step-by-step control — this is valuable for interactive tasks and debugging.
  • browser-use gives coarse-grained "do this task for me" autonomy — better for multi-step extraction workflows where the LLM would otherwise need 10+ tool calls.
  • Both modes have legitimate use cases. Offer both.

Integration architecture:

hermes-agent
  tools/
    browser_tool.py          # Existing — low-level agent-controlled browsing
    browser_use_tool.py      # NEW — high-level autonomous browsing (PoC)
      |
      +-- browser_use.run()  # Wraps browser-use Agent class
      +-- browser_use.extract()  # Wraps browser-use for data extraction

The tool registers with tools/registry.py as toolset browser_use with a check_fn that verifies browser-use is installed.
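
A sketch of what such a `check_fn` and registration might look like follows. The `ToolRegistry` class here is a hypothetical stand-in written for illustration; the real `tools/registry.py` interface is not shown in this document and may take different arguments. The only library call relied on is `importlib.util.find_spec`, which locates a module without importing it.

```python
import importlib.util


def module_available(name: str) -> bool:
    """True if a top-level module can be located, without importing it."""
    return importlib.util.find_spec(name) is not None


def check_browser_use() -> bool:
    # browser-use is imported as `browser_use`.
    return module_available("browser_use")


class ToolRegistry:
    """Hypothetical registry; the real tools/registry.py may differ."""

    def __init__(self) -> None:
        self._tools = {}

    def register(self, name, toolset, handler, check_fn):
        self._tools[name] = (toolset, handler, check_fn)

    def available(self, toolset):
        # Only expose tools whose availability check passes right now.
        return sorted(
            name
            for name, (ts, _handler, check) in self._tools.items()
            if ts == toolset and check()
        )


registry = ToolRegistry()
registry.register(
    name="browser_use_run",
    toolset="browser_use",
    handler=lambda task: f"would run: {task}",  # placeholder handler
    check_fn=check_browser_use,
)
```

With this shape, `registry.available("browser_use")` is empty on a machine without browser-use installed, so the tool is never offered to the model there.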

Alternative: MCP server — browser-use could also be exposed as an MCP server for multi-agent setups where subagents need independent browser access. This is a follow-up, not the initial integration.

Dependencies and Requirements

pip install browser-use          # Core library
playwright install chromium      # Playwright browser binary

Or use cloud mode with BROWSER_USE_API_KEY — no local browser needed.

Requires Python 3.11+ and Playwright. There are no exotic system dependencies beyond what Hermes already requires for its existing browser tool.
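
One way the tool could choose between local and cloud execution is to look for the API key at call time. The sketch below assumes that a non-empty `BROWSER_USE_API_KEY` should select cloud mode; that precedence rule and the function name `select_execution_mode` are illustrative assumptions, not confirmed behavior of the PoC.

```python
import os
from typing import Mapping, Optional


def select_execution_mode(env: Optional[Mapping[str, str]] = None) -> str:
    """Return "cloud" when an API key is configured, otherwise "local"."""
    env = os.environ if env is None else env
    # Treat a blank or whitespace-only value as "not configured".
    has_key = bool(env.get("BROWSER_USE_API_KEY", "").strip())
    return "cloud" if has_key else "local"
```

Passing the environment as a parameter keeps the function easy to test without mutating `os.environ`.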

Security Considerations

| Concern | Mitigation |
|---------|------------|
| Arbitrary URL access | Reuse Hermes's website_policy and url_safety modules |
| Data exfiltration | Browser-use agents run in isolated Playwright contexts; no access to Hermes filesystem |
| Prompt injection via page | browser-use feeds page content to LLM; same risk as existing browser_snapshot; already handled by Hermes prompt hardening |
| Credential leakage | Do not pass API keys to untrusted pages; cloud mode keeps credentials server-side |
| Resource exhaustion | Set max_steps on browser-use Agent to prevent infinite loops |
| Downloaded files | Playwright download path is sandboxed; tool should restrict to temp directory |

Key security property: browser-use executes within Playwright's sandboxed browser context. The LLM controlling browser-use is Hermes itself (or a configured auxiliary model), not the page content. This is equivalent to the existing browser tool's security model.
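
Two of the mitigations above, URL gating and a step cap, are simple enough to illustrate. The functions below are a standalone sketch using only the standard library; they are not Hermes's `url_safety` or `website_policy` modules, whose interfaces are not shown here, and the default cap of 25 steps is an arbitrary illustrative value. The URL check only inspects the scheme and literal host, so it does not catch hostnames that resolve to private addresses.

```python
import ipaddress
from urllib.parse import urlsplit


def is_url_allowed(url: str, blocked_hosts: frozenset = frozenset({"localhost"})) -> bool:
    """Allow only http(s) URLs that do not point at local or private IP literals."""
    parts = urlsplit(url)
    if parts.scheme not in ("http", "https"):
        return False
    host = (parts.hostname or "").lower()
    if not host or host in blocked_hosts:
        return False
    try:
        ip = ipaddress.ip_address(host)
    except ValueError:
        # Not an IP literal; DNS-based checks are out of scope for this sketch.
        return True
    return not (ip.is_private or ip.is_loopback or ip.is_link_local)


def clamp_max_steps(requested: int, hard_cap: int = 25) -> int:
    """Keep a caller-supplied step budget within [1, hard_cap]."""
    return max(1, min(requested, hard_cap))
```

A tool handler would call `is_url_allowed` before starting a task and pass `clamp_max_steps(requested)` as the agent's step budget, so neither limit depends on the model behaving well.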

Performance Characteristics

  • Startup: ~2-3s for Playwright Chromium launch (same as existing local mode)
  • Per-step: ~1-3s per LLM call + browser action (comparable to manual browser_navigate + browser_snapshot loop)
  • Full task (5-10 steps): ~15-45s depending on page complexity
  • Token usage: Each step sends the accessibility tree to the LLM. Browser-use also supports a vision mode (screenshots), which is more token-heavy.
  • Parallelism: Supports multiple concurrent browser agents

Comparison to existing tools: For a 10-step browser task, the existing approach requires 10+ Hermes API calls (navigate, snapshot, click, type, snapshot, click, ...). Browser-use consolidates this into a single Hermes tool call that internally runs its own LLM loop. This reduces Hermes API round-trips but shifts the LLM cost to browser-use's internal model calls.
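
The trade-off can be made concrete with a back-of-envelope calculation. The sketch below combines the startup and per-step figures quoted in this section into rough latency bounds and counts Hermes-level round-trips for each approach. The function names are illustrative, and the model ignores page load time, retries, and network variance.

```python
def task_latency_bounds(steps_lo, steps_hi, startup=(2.0, 3.0), per_step=(1.0, 3.0)):
    """Rough (best, worst) wall-clock seconds: startup plus steps times per-step cost."""
    best = startup[0] + steps_lo * per_step[0]
    worst = startup[1] + steps_hi * per_step[1]
    return best, worst


def hermes_round_trips(steps: int, autonomous: bool) -> int:
    """Hermes-level tool calls: one per step manually, one in total when delegated."""
    return 1 if autonomous else steps


print(task_latency_bounds(5, 10))                 # (7.0, 33.0)
print(hermes_round_trips(10, autonomous=False))   # 10
print(hermes_round_trips(10, autonomous=True))    # 1
```

The 7-33 s range is a floor built from the per-step numbers alone; the 15-45 s estimate quoted above is higher, which would be consistent with page loads and retries adding time on top. The round-trip count drops from 10 to 1, but the ten LLM calls still happen inside browser-use.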

Recommendation: INTEGRATE

Browser Use fills a clear gap — autonomous multi-step browser tasks — that complements Hermes's existing fine-grained browser tools. The integration is straightforward (Python library, same security model). A PoC tool is provided in tools/browser_use_tool.py.


2. Graphify

What It Does

Graphify is a knowledge graph extraction tool that processes unstructured text (including web content) and extracts entities, relationships, and structured knowledge into a graph format. It can:

  • Extract entities and relationships from text using NLP/LLM techniques
  • Build knowledge graphs from web-scraped content
  • Support incremental graph updates as new content is processed
  • Export graphs in standard formats (JSON-LD, RDF, etc.)

(Note: "Graphify" as a project name is used by several tools. The most relevant for browser integration is the concept of extracting structured knowledge graphs from web content during or after browsing.)

Integration with Hermes

Primary path: MCP server or Hermes tool that takes web content (from browser_tool or web_extract) and produces structured knowledge graphs.

Integration architecture:

hermes-agent
  tools/
    graphify_tool.py          # NEW — knowledge graph extraction from text
      |
      +-- graphify.extract()  # Extract entities/relations from text
      +-- graphify.merge()    # Merge into existing graph
      +-- graphify.query()    # Query the accumulated graph

Or via MCP:

hermes-agent --mcp-server graphify-mcp
  -> tools: graphify_extract, graphify_query, graphify_export

Synergy with browser tools:

  1. browser_navigate + browser_snapshot to get page content
  2. graphify_extract to pull entities and relationships
  3. Repeat across multiple pages to build a domain knowledge graph
  4. graphify_query to answer questions about accumulated knowledge
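
The four steps above can be sketched with an in-memory graph. Everything here is illustrative: `KnowledgeGraph` and `extract_triples` are hypothetical names, and the extractor is a toy that parses `subject | relation | object` lines in place of a real NLP or LLM extraction step. The example data is fictional.

```python
from collections import defaultdict


def extract_triples(text: str) -> list:
    """Toy extractor: keep lines of the form 'subject | relation | object'."""
    triples = []
    for line in text.splitlines():
        parts = [p.strip() for p in line.split("|")]
        if len(parts) == 3 and all(parts):
            triples.append(tuple(parts))
    return triples


class KnowledgeGraph:
    """Minimal adjacency store: subject -> set of (relation, object)."""

    def __init__(self) -> None:
        self._edges = defaultdict(set)

    def merge(self, triples) -> None:
        # Sets make repeated facts from different pages idempotent.
        for subject, relation, obj in triples:
            self._edges[subject].add((relation, obj))

    def query(self, subject: str) -> list:
        return sorted(self._edges.get(subject, set()))


graph = KnowledgeGraph()
page_a = "Acme | acquired | Widgets Inc\nthis line is not a triple"
page_b = "Acme | acquired | Widgets Inc\nAcme | based_in | Springfield"
for page in (page_a, page_b):
    graph.merge(extract_triples(page))
```

After both pages are merged, `graph.query("Acme")` returns `[("acquired", "Widgets Inc"), ("based_in", "Springfield")]`: the fact seen on both pages is stored once.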

Dependencies and Requirements

Varies significantly depending on the specific Graphify implementation. Typical requirements:

  • Python 3.11+
  • spaCy or similar NLP library for entity extraction
  • Optional: Neo4j or NetworkX for graph storage
  • LLM access (can reuse Hermes's existing model configuration)

Security Considerations

| Concern | Mitigation |
|---------|------------|
| Processing untrusted text | NLP extraction is read-only; no code execution |
| Graph data persistence | Store in Hermes's data directory with appropriate permissions |
| Information aggregation | Knowledge graphs could accumulate sensitive data; provide clear/delete commands |
| External graph DB access | If using Neo4j, require authentication and restrict to localhost |

Performance Characteristics

  • Extraction: ~0.5-2s per page depending on content length and NLP model
  • Graph operations: Sub-second for graphs under 100K nodes
  • Storage: Lightweight (JSON/SQLite) for small graphs, Neo4j for large-scale
  • Token usage: If using LLM-based extraction, ~500-2000 tokens per page

Recommendation: INVESTIGATE FURTHER

The concept is sound — knowledge graph extraction from web content is a natural complement to browser tools. However:

  1. Multiple competing tools exist under this name; need to identify the best-maintained option
  2. Value proposition unclear vs. Hermes's existing memory system and file-based knowledge storage
  3. NLP dependency adds complexity (the larger spaCy models run to several hundred MB, although the small models are much lighter)

Suggested next steps:

  • Evaluate specific Graphify implementations (graphify.ai, custom NLP pipelines)
  • Prototype with a lightweight approach: LLM-based entity extraction + NetworkX
  • Assess whether Hermes's existing memory/graph_store.py can serve this role

3. Multica

What It Does

Multica is a multi-agent browser coordination framework. It enables multiple AI agents to collaboratively browse the web, with features for:

  • Task decomposition: splitting complex web tasks across multiple agents
  • Shared browser state: agents see a common view of browsing progress
  • Coordination protocols: agents can communicate about what they've found
  • Parallel web research: multiple agents researching different aspects simultaneously

Integration with Hermes

Theoretical path: Multica would integrate as a higher-level orchestration layer on top of Hermes's existing browser tools, coordinating multiple Hermes subagents (via delegate_tool) each with browser access.

Integration architecture:

hermes-agent (orchestrator)
  delegate_tool -> subagent_1 (browser_navigate, browser_snapshot, ...)
  delegate_tool -> subagent_2 (browser_navigate, browser_snapshot, ...)
  delegate_tool -> subagent_3 (browser_navigate, browser_snapshot, ...)
                    |
                    +-- Multica coordination layer (shared state, task splitting)

Dependencies and Requirements

  • Complex multi-agent orchestration infrastructure
  • Shared state management between agents
  • Potentially a custom runtime for agent coordination
  • Likely requires significant architectural changes to Hermes's delegation model

Security Considerations

| Concern | Mitigation |
|---------|------------|
| Multiple agents on same browser | Session isolation per agent (Hermes already does this) |
| Coordinated exfiltration | Same per-agent restrictions apply |
| Amplified prompt injection | Each agent processes its own pages independently |
| Resource multiplication | N agents = N browser instances = Nx resource usage |

Performance Characteristics

  • Scaling: Near-linear improvement for embarrassingly parallel tasks (e.g., "research 10 companies simultaneously")
  • Overhead: Significant coordination overhead for tightly coupled tasks
  • Resource cost: Each agent needs its own LLM calls + browser instance
  • Complexity: Debugging multi-agent browser workflows is extremely difficult

Recommendation: SKIP (for now)

Multica addresses a real need (parallel web research) but is premature for Hermes for several reasons:

  1. Hermes already has subagent delegation (delegate_tool) — agents can already do parallel browser work without Multica
  2. No mature implementation — Multica is more of a concept than a production-ready tool
  3. Complexity vs. benefit — the coordination overhead and debugging difficulty outweigh the benefits for most use cases
  4. Better alternatives exist — for parallel research, simply delegating multiple subagents with browser tools is simpler and already works

Revisit when: Hermes's delegation model supports shared state between subagents, or a mature Multica implementation emerges.
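
The "simply delegate multiple subagents" alternative amounts to a bounded fan-out. The sketch below uses `asyncio` to show the pattern; `research` is a hypothetical stand-in for a delegated subagent with browser access (it just sleeps briefly), and the semaphore is one way to cap the N-agents-means-N-browsers cost noted in the security table. It does not use Hermes's delegate_tool, whose interface is not shown here.

```python
import asyncio


async def research(company: str) -> dict:
    """Stand-in for one delegated subagent doing browser research."""
    await asyncio.sleep(0.01)  # placeholder for browser and LLM work
    return {"company": company, "summary": f"notes on {company}"}


async def research_all(companies, max_parallel: int = 3) -> list:
    """Run independent research tasks concurrently, at most max_parallel at a time."""
    limit = asyncio.Semaphore(max_parallel)

    async def bounded(company: str) -> dict:
        async with limit:
            return await research(company)

    # gather returns results in the same order as its inputs.
    return await asyncio.gather(*(bounded(c) for c in companies))


results = asyncio.run(research_all(["Acme", "Globex", "Initech", "Umbrella"]))
```

Because the tasks share no state, nothing beyond the fan-out and the concurrency cap is needed, which is the case where a coordination layer adds the least.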


Integration Roadmap

Phase 1: Browser Use PoC (this PR)

  • Create tools/browser_use_tool.py wrapping browser-use as Hermes tool
  • Create docs/browser-integration-analysis.md (this document)
  • Test with real browser tasks
  • Add to toolset configuration

Phase 2: Browser Use Production (follow-up)

  • Add browser_use to toolsets.py toolset definitions
  • Add configuration options in config.yaml
  • Add tests in tests/test_browser_use_tool.py
  • Consider MCP server variant for subagent use

Phase 3: Graphify Investigation (follow-up)

  • Evaluate specific Graphify implementations
  • Prototype lightweight LLM-based entity extraction tool
  • Assess integration with existing graph_store.py
  • Create PoC if investigation is positive

Phase 4: Multi-Agent Browser (future)

  • Monitor Multica ecosystem maturity
  • Evaluate when delegation model supports shared state
  • Consider simpler parallel delegation patterns first

Appendix: Existing Browser Stack

Hermes already has a comprehensive browser tool stack:

| Component | Description |
|-----------|-------------|
| browser_tool.py | Low-level agent-controlled browser (navigate, click, type, snapshot) |
| browser_camofox.py | Anti-detection browser via Camofox REST API |
| browser_providers/ | Cloud providers (Browserbase, Browser Use API, Firecrawl) |
| web_tools.py | Web search (Parallel) and extraction (Firecrawl) |
| mcp_tool.py | MCP client for connecting external tool servers |

The existing stack covers:

  • Local browsing: Headless Chromium via agent-browser CLI
  • Cloud browsing: Browserbase, Browser Use cloud, Firecrawl
  • Anti-detection: Camofox (local) or Browserbase advanced stealth
  • Content extraction: Firecrawl for clean markdown extraction
  • Search: Parallel AI web search

New browser integrations should complement rather than replace these tools.