# Browser Integration Analysis: Browser Use + Graphify + Multica

**Issue:** #262 - Investigation: Browser Use + Graphify + Multica - Hermes Integration Analysis
**Date:** 2026-04-10
**Author:** Hermes Agent (burn branch)

## Executive Summary

This document evaluates three browser-related projects for integration with
hermes-agent. Each tool is assessed on capability, integration complexity,
security posture, and strategic fit with Hermes's existing browser stack.

| Tool        | Recommendation      | Integration Path   |
|-------------|---------------------|--------------------|
| Browser Use | **Integrate** (PoC) | Tool + MCP server  |
| Graphify    | Investigate further | MCP server or tool |
| Multica     | Skip (for now)      | N/A (premature)    |

---

## 1. Browser Use (`browser-use`)

### What It Does

Browser Use is a Python library that wraps Playwright to provide LLM-driven
browser automation. An agent describes a task in natural language, and
browser-use autonomously navigates, clicks, types, and extracts data by
feeding the page's accessibility tree to an LLM and executing the resulting
actions in a loop.

Key capabilities:
- Autonomous multi-step browser workflows from a single text instruction
- Accessibility tree extraction (DOM + ARIA snapshot)
- Screenshot and visual context for multimodal models
- Form filling, navigation, data extraction, file downloads
- Custom actions (register callable Python functions the LLM can invoke)
- Parallel agent execution (multiple browser agents simultaneously)
- Cloud execution via browser-use.com API (no local browser needed)

### Integration with Hermes

**Primary path: Custom Hermes tool** wrapping `browser-use` as a high-level
"automated browsing" capability alongside the existing `browser_tool.py`
(low-level, agent-controlled) tools.

**Why a separate tool rather than replacing browser_tool.py:**
- Hermes's existing browser tools (navigate, snapshot, click, type) give the
  LLM fine-grained step-by-step control, which is valuable for interactive
  tasks and debugging.
- browser-use gives coarse-grained "do this task for me" autonomy, which is
  better for multi-step extraction workflows where the LLM would otherwise
  need 10+ tool calls.
- Both modes have legitimate use cases. Offer both.

**Integration architecture:**

```
hermes-agent
  tools/
    browser_tool.py          # Existing: low-level agent-controlled browsing
    browser_use_tool.py      # NEW: high-level autonomous browsing (PoC)
        |
        +-- browser_use.run()      # Wraps browser-use Agent class
        +-- browser_use.extract()  # Wraps browser-use for data extraction
```

The tool registers with `tools/registry.py` as toolset `browser_use` with
a `check_fn` that verifies `browser-use` is installed.
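
As a rough illustration of the availability check described above, the sketch below probes for the package without importing it. The `TOOLSETS` dictionary and the `register_toolset` and `enabled_toolsets` helpers are hypothetical stand-ins, since the real `tools/registry.py` interface is not shown in this document; only `importlib.util.find_spec` is a standard-library call.

```python
import importlib.util


def module_available(name: str) -> bool:
    """Return True if a top-level module can be found without importing it."""
    return importlib.util.find_spec(name) is not None


def browser_use_available() -> bool:
    """check_fn candidate: the pip package `browser-use` imports as `browser_use`."""
    return module_available("browser_use")


# Hypothetical registry; the real tools/registry.py API may look different.
TOOLSETS = {}


def register_toolset(name, check_fn):
    """Remember a toolset together with the function that gates it."""
    TOOLSETS[name] = check_fn


def enabled_toolsets():
    """List toolsets whose check function currently passes."""
    return sorted(name for name, check in TOOLSETS.items() if check())


register_toolset("browser_use", browser_use_available)
```

Because `find_spec` does not import the module, the check stays cheap and side-effect free, which matters if it runs at every startup.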

**Alternative: MCP server.** browser-use could also be exposed as an MCP
server for multi-agent setups where subagents need independent browser
access. This is a follow-up, not the initial integration.

### Dependencies and Requirements

```
pip install browser-use        # Core library
playwright install chromium    # Playwright browser binary
```

Alternatively, use cloud mode with `BROWSER_USE_API_KEY`, which needs no
local browser.

Requires Python 3.11+ and Playwright. There are no exotic system dependencies
beyond what Hermes already requires for its existing browser tool.

### Security Considerations

| Concern                   | Mitigation |
|---------------------------|------------|
| Arbitrary URL access      | Reuse Hermes's `website_policy` and `url_safety` modules |
| Data exfiltration         | Browser-use agents run in isolated Playwright contexts; no access to Hermes filesystem |
| Prompt injection via page | browser-use feeds page content to the LLM, the same risk as the existing `browser_snapshot`; already handled by Hermes prompt hardening |
| Credential leakage        | Do not pass API keys to untrusted pages; cloud mode keeps credentials server-side |
| Resource exhaustion       | Set `max_steps` on the browser-use Agent to prevent infinite loops |
| Downloaded files          | Playwright download path is sandboxed; tool should restrict to temp directory |
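
To make the "arbitrary URL access" row concrete, here is a minimal sketch of the kind of pre-flight check a wrapper tool could run before handing a start URL to an autonomous agent. It is not Hermes's `website_policy` or `url_safety` code, whose interfaces are not shown here; the function name `is_url_allowed` and the `blocked_hosts` parameter are assumptions made for illustration. It only inspects the URL string and does not resolve DNS names.

```python
import ipaddress
from urllib.parse import urlsplit

ALLOWED_SCHEMES = {"http", "https"}


def is_url_allowed(url: str, blocked_hosts: frozenset = frozenset()) -> bool:
    """Reject non-web schemes, blocked hosts, and local or private IP literals."""
    try:
        parts = urlsplit(url)
    except ValueError:
        return False  # malformed URL, for example a broken IPv6 literal
    if parts.scheme not in ALLOWED_SCHEMES:
        return False
    host = parts.hostname
    if not host or host == "localhost" or host in blocked_hosts:
        return False
    try:
        addr = ipaddress.ip_address(host)
    except ValueError:
        # Not an IP literal: a DNS name. Resolution is out of scope for this sketch.
        return True
    return not (addr.is_private or addr.is_loopback or addr.is_link_local)
```

A production check would also need to cover redirects and DNS names that resolve to internal addresses, which a string-level test like this cannot see.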

**Key security property:** browser-use executes within Playwright's sandboxed
browser context. The LLM controlling browser-use is Hermes itself (or a
configured auxiliary model), not the page content. This is equivalent to the
existing browser tool's security model.

### Performance Characteristics

- **Startup:** ~2-3s for Playwright Chromium launch (same as existing local mode)
- **Per-step:** ~1-3s per LLM call + browser action (comparable to manual
  browser_navigate + browser_snapshot loop)
- **Full task (5-10 steps):** ~15-45s depending on page complexity
- **Token usage:** Each step sends the accessibility tree to the LLM.
  Browser-use supports vision mode (screenshots) which is more token-heavy.
- **Parallelism:** Supports multiple concurrent browser agents

**Comparison to existing tools:**
For a 10-step browser task, the existing approach requires 10+ Hermes API
calls (navigate, snapshot, click, type, snapshot, click, ...). Browser-use
consolidates this into a single Hermes tool call that internally runs its
own LLM loop. This reduces Hermes API round-trips but shifts the LLM cost
to browser-use's internal model calls.
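
The single-tool-call pattern described above can be sketched as one coroutine that builds an agent, runs it with a step cap and a wall-clock timeout, and returns a JSON-friendly result. This is a sketch under assumptions, not the PoC in `tools/browser_use_tool.py`: `run_browser_task`, `agent_factory`, and `AgentLike` are hypothetical names, and the agent is injected so the wrapper does not depend on a particular browser-use version.

```python
import asyncio
from typing import Any, Callable, Protocol


class AgentLike(Protocol):
    """Anything with an async run(max_steps=...) method, such as an autonomous agent."""

    async def run(self, max_steps: int) -> Any: ...


async def run_browser_task(
    task: str,
    agent_factory: Callable[[str], AgentLike],
    max_steps: int = 20,
    timeout_s: float = 120.0,
) -> dict:
    """Run one autonomous browsing task as a single tool call."""
    agent = agent_factory(task)
    try:
        # max_steps bounds the inner LLM loop; the timeout bounds wall-clock time.
        result = await asyncio.wait_for(agent.run(max_steps=max_steps), timeout_s)
    except asyncio.TimeoutError:
        return {"ok": False, "error": f"timed out after {timeout_s}s"}
    except Exception as exc:  # report agent failures instead of crashing the tool
        return {"ok": False, "error": str(exc)}
    return {"ok": True, "result": result}
```

In a real tool the factory might be something like `lambda task: Agent(task=task, llm=llm)` using `browser_use.Agent`, assuming its `run` method accepts `max_steps` as it does in the versions I am aware of; that detail should be checked against the installed release.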

### Recommendation: INTEGRATE

Browser Use fills a clear gap, autonomous multi-step browser tasks, and
complements Hermes's existing fine-grained browser tools. The integration
is straightforward (Python library, same security model). A PoC tool is
provided in `tools/browser_use_tool.py`.

---

## 2. Graphify

### What It Does

Graphify is a knowledge graph extraction tool that processes unstructured
text (including web content) and extracts entities, relationships, and
structured knowledge into a graph format. It can:

- Extract entities and relationships from text using NLP/LLM techniques
- Build knowledge graphs from web-scraped content
- Support incremental graph updates as new content is processed
- Export graphs in standard formats (JSON-LD, RDF, etc.)

(Note: "Graphify" as a project name is used by several tools. The most
relevant for browser integration is the concept of extracting structured
knowledge graphs from web content during or after browsing.)

### Integration with Hermes

**Primary path: MCP server or Hermes tool** that takes web content (from
browser_tool or web_extract) and produces structured knowledge graphs.

**Integration architecture:**

```
hermes-agent
  tools/
    graphify_tool.py         # NEW: knowledge graph extraction from text
        |
        +-- graphify.extract()   # Extract entities/relations from text
        +-- graphify.merge()     # Merge into existing graph
        +-- graphify.query()     # Query the accumulated graph
```

Or via MCP:
```
hermes-agent --mcp-server graphify-mcp
  -> tools: graphify_extract, graphify_query, graphify_export
```

**Synergy with browser tools:**
1. `browser_navigate` + `browser_snapshot` to get page content
2. `graphify_extract` to pull entities and relationships
3. Repeat across multiple pages to build a domain knowledge graph
4. `graphify_query` to answer questions about accumulated knowledge

### Dependencies and Requirements

Requirements vary significantly depending on the specific Graphify
implementation. Typical requirements:
- Python 3.11+
- spaCy or similar NLP library for entity extraction
- Optional: Neo4j or NetworkX for graph storage
- LLM access (can reuse Hermes's existing model configuration)

### Security Considerations

| Concern                   | Mitigation |
|---------------------------|------------|
| Processing untrusted text | NLP extraction is read-only; no code execution |
| Graph data persistence    | Store in Hermes's data directory with appropriate permissions |
| Information aggregation   | Knowledge graphs could accumulate sensitive data; provide clear/delete commands |
| External graph DB access  | If using Neo4j, require authentication and restrict to localhost |

### Performance Characteristics

- **Extraction:** ~0.5-2s per page depending on content length and NLP model
- **Graph operations:** Sub-second for graphs under 100K nodes
- **Storage:** Lightweight (JSON/SQLite) for small graphs, Neo4j for large-scale
- **Token usage:** If using LLM-based extraction, ~500-2000 tokens per page

### Recommendation: INVESTIGATE FURTHER

The concept is sound: knowledge graph extraction from web content is a
natural complement to browser tools. However:

1. **Multiple competing tools** exist under this name; need to identify the
   best-maintained option
2. **Value proposition unclear** vs. Hermes's existing memory system and
   file-based knowledge storage
3. **NLP dependency** adds complexity (the larger spaCy models are roughly
   500 MB)

**Suggested next steps:**
- Evaluate specific Graphify implementations (graphify.ai, custom NLP pipelines)
- Prototype with a lightweight approach: LLM-based entity extraction + NetworkX
- Assess whether Hermes's existing memory/graph_store.py can serve this role

---

## 3. Multica

### What It Does

Multica is a multi-agent browser coordination framework. It enables multiple
AI agents to collaboratively browse the web, with features for:

- Task decomposition: splitting complex web tasks across multiple agents
- Shared browser state: agents see a common view of browsing progress
- Coordination protocols: agents can communicate about what they've found
- Parallel web research: multiple agents researching different aspects simultaneously

### Integration with Hermes

**Theoretical path:** Multica would integrate as a higher-level orchestration
layer on top of Hermes's existing browser tools, coordinating multiple
Hermes subagents (via `delegate_tool`) each with browser access.

**Integration architecture:**

```
hermes-agent (orchestrator)
  delegate_tool -> subagent_1 (browser_navigate, browser_snapshot, ...)
  delegate_tool -> subagent_2 (browser_navigate, browser_snapshot, ...)
  delegate_tool -> subagent_3 (browser_navigate, browser_snapshot, ...)
      |
      +-- Multica coordination layer (shared state, task splitting)
```

### Dependencies and Requirements

- Complex multi-agent orchestration infrastructure
- Shared state management between agents
- Potentially a custom runtime for agent coordination
- Likely requires significant architectural changes to Hermes's delegation model

### Security Considerations

| Concern                         | Mitigation |
|---------------------------------|------------|
| Multiple agents on same browser | Session isolation per agent (Hermes already does this) |
| Coordinated exfiltration        | Same per-agent restrictions apply |
| Amplified prompt injection      | Each agent processes its own pages independently |
| Resource multiplication         | N agents = N browser instances = Nx resource usage |

### Performance Characteristics

- **Scaling:** Near-linear improvement for embarrassingly parallel tasks
  (e.g., "research 10 companies simultaneously")
- **Overhead:** Significant coordination overhead for tightly coupled tasks
- **Resource cost:** Each agent needs its own LLM calls + browser instance
- **Complexity:** Debugging multi-agent browser workflows is extremely difficult

### Recommendation: SKIP (for now)

Multica addresses a real need (parallel web research) but is premature for
Hermes for several reasons:

1. **Hermes already has subagent delegation** (`delegate_tool`): agents can
   already do parallel browser work without Multica
2. **No mature implementation:** Multica is more of a concept than a
   production-ready tool
3. **Complexity vs. benefit:** the coordination overhead and debugging
   difficulty outweigh the benefits for most use cases
4. **Better alternatives exist:** for parallel research, simply delegating
   multiple subagents with browser tools is simpler and already works
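
The simpler alternative in point 4, fanning independent topics out to subagents with no shared state, might look like the sketch below. `research_in_parallel` and the `run_subagent` callable are hypothetical; how `delegate_tool` is actually invoked is not shown in this document. The semaphore caps concurrency because each agent costs its own browser instance and LLM calls.

```python
import asyncio
from typing import Awaitable, Callable


async def research_in_parallel(
    topics: list[str],
    run_subagent: Callable[[str], Awaitable[str]],
    max_concurrent: int = 3,
) -> dict[str, str]:
    """Run one subagent per topic, at most max_concurrent at a time."""
    limiter = asyncio.Semaphore(max_concurrent)

    async def bounded(topic: str) -> str:
        async with limiter:
            return await run_subagent(topic)

    # return_exceptions=True keeps one failing topic from cancelling the rest.
    results = await asyncio.gather(
        *(bounded(t) for t in topics), return_exceptions=True
    )
    return {
        topic: f"error: {res}" if isinstance(res, BaseException) else res
        for topic, res in zip(topics, results)
    }
```

This covers embarrassingly parallel work such as "research 10 companies simultaneously"; it deliberately offers no shared state or inter-agent messaging, which is the part a coordination layer like Multica would add.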

**Revisit when:** Hermes's delegation model supports shared state between
subagents, or a mature Multica implementation emerges.

---

## Integration Roadmap

### Phase 1: Browser Use PoC (this PR)
- [x] Create `tools/browser_use_tool.py` wrapping browser-use as Hermes tool
- [x] Create `docs/browser-integration-analysis.md` (this document)
- [ ] Test with real browser tasks
- [ ] Add to toolset configuration

### Phase 2: Browser Use Production (follow-up)
- [ ] Add `browser_use` to `toolsets.py` toolset definitions
- [ ] Add configuration options in `config.yaml`
- [ ] Add tests in `tests/test_browser_use_tool.py`
- [ ] Consider MCP server variant for subagent use

### Phase 3: Graphify Investigation (follow-up)
- [ ] Evaluate specific Graphify implementations
- [ ] Prototype lightweight LLM-based entity extraction tool
- [ ] Assess integration with existing `graph_store.py`
- [ ] Create PoC if investigation is positive

### Phase 4: Multi-Agent Browser (future)
- [ ] Monitor Multica ecosystem maturity
- [ ] Evaluate when delegation model supports shared state
- [ ] Consider simpler parallel delegation patterns first

---

## Appendix: Existing Browser Stack

Hermes already has a comprehensive browser tool stack:

| Component            | Description |
|----------------------|-------------|
| `browser_tool.py`    | Low-level agent-controlled browser (navigate, click, type, snapshot) |
| `browser_camofox.py` | Anti-detection browser via Camofox REST API |
| `browser_providers/` | Cloud providers (Browserbase, Browser Use API, Firecrawl) |
| `web_tools.py`       | Web search (Parallel) and extraction (Firecrawl) |
| `mcp_tool.py`        | MCP client for connecting external tool servers |

The existing stack covers:
- **Local browsing:** Headless Chromium via agent-browser CLI
- **Cloud browsing:** Browserbase, Browser Use cloud, Firecrawl
- **Anti-detection:** Camofox (local) or Browserbase advanced stealth
- **Content extraction:** Firecrawl for clean markdown extraction
- **Search:** Parallel AI web search

New browser integrations should complement rather than replace these tools.