Compare commits

1 Commit

Author SHA1 Message Date
Alexander Whitestone
f88e57bcfe feat: route image files through vision analysis (Gemma 4 multimodal)
All checks were successful
Lint / lint (pull_request) Successful in 8s
- tools/binary_extensions.py: add IMAGE_EXTENSIONS + has_image_extension()
- tools/file_tools.py: detect image files in read_file_tool and auto-route
to vision_analyze_tool instead of returning a binary-file error. Wraps
the vision result so callers know it came from image analysis.
- tools/browser_tool.py: update browser_vision docstring to document that
natively multimodal models (e.g. Gemma 4) are used directly when available.
- tests/tools/test_binary_extensions.py: new tests for image extension helpers
- tests/tools/test_file_tools.py: add TestReadFileImageRouting for PNG/JPEG/
WebP auto-routing and TestAnalyzeImageWithVision for fallback coverage.

Closes #800
2026-04-22 02:54:18 -04:00
6 changed files with 175 additions and 165 deletions

@@ -1,157 +0,0 @@
# AI Tools Evaluation Report (#842)
**Source:** [formatho/awesome-ai-tools](https://github.com/formatho/awesome-ai-tools)
**Date:** 2026-04-15
**Tools Analyzed:** 414 across 9 categories
**Scope:** Hermes-agent integration potential
---
## Executive Summary
Scanned 414 tools from awesome-ai-tools and evaluated each against the Hermes architecture across five categories: Memory/Context, Inference Optimization, Agent Orchestration, Workflow Automation, and Retrieval/RAG.
## Top 5 Recommendations & Implementation Status
### P1 — Mem0 (Memory/Context) ✅ IMPLEMENTED
| Metric | Value |
|--------|-------|
| GitHub | [mem0ai/mem0](https://github.com/mem0ai/mem0) |
| Stars | 53.1k ⭐ |
| Integration Effort | 3/5 |
| Impact | 5/5 |
**Status:** Both cloud (mem0ai) and local (ChromaDB) variants implemented.
**Deliverables:**
- `plugins/memory/mem0/` — Platform API provider with server-side LLM extraction, semantic search, reranking
- `plugins/memory/mem0_local/` — Sovereign local variant using ChromaDB, no API key required
- Tools: `mem0_profile`, `mem0_search`, `mem0_conclude`
- Circuit breaker for resilience
- 36 tests passing across both providers
**Activation:**
```bash
hermes memory setup # select "mem0" or "mem0_local"
```
**Risk mitigation:** OSS-only features used in `mem0_local`. Cloud version uses freemium API but has circuit-breaker fallback.
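The circuit-breaker fallback mentioned above can be sketched as follows. This is a minimal hypothetical version for illustration only; the class name, thresholds, and methods are assumptions, not the plugin's actual API:

```python
import time

class CircuitBreaker:
    """Minimal circuit breaker: trip open after N consecutive failures,
    then allow a probe call again once a cooldown has elapsed."""

    def __init__(self, max_failures: int = 3, cooldown: float = 30.0):
        self.max_failures = max_failures
        self.cooldown = cooldown
        self.failures = 0
        self.opened_at = 0.0

    def allow(self) -> bool:
        if self.failures < self.max_failures:
            return True
        # Tripped: only allow a probe once the cooldown has elapsed.
        return time.monotonic() - self.opened_at >= self.cooldown

    def record_success(self) -> None:
        self.failures = 0

    def record_failure(self) -> None:
        self.failures += 1
        if self.failures >= self.max_failures:
            self.opened_at = time.monotonic()
```

A provider wrapper would check `allow()` before calling the cloud API and fall back to the local variant when it returns `False`.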
---
### P2 — LightRAG (Retrieval/RAG) 🔴 NOT STARTED
| Metric | Value |
|--------|-------|
| GitHub | [HKUDS/LightRAG](https://github.com/HKUDS/LightRAG) |
| Stars | 33.1k ⭐ |
| Integration Effort | 3/5 |
| Impact | 4/5 |
**Proposed integration:**
- Local knowledge base for skill references and codebase understanding
- Index GENOME.md, README.md, and key architecture files
- Query via tool call when agent needs contextual understanding (not just keyword search)
- Complements `search_files` without replacing it
**Blocker:** Requires an OpenAI-compatible embedding endpoint. A local Ollama instance can provide one through its OpenAI compatibility layer.
**Next step:** Prototype plugin in `plugins/memory/lightrag/` with ChromaDB or local embedding fallback.
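To make the Ollama route concrete, here is the request body shape an OpenAI-compatible `/v1/embeddings` endpoint expects. The endpoint URL and model name below are assumptions for a default local Ollama setup, not tested configuration:

```python
import json

# Assumed local endpoint: Ollama serves an OpenAI-compatible API under /v1.
EMBED_URL = "http://localhost:11434/v1/embeddings"

def embedding_request_body(texts: list[str], model: str = "nomic-embed-text") -> str:
    """Build the JSON body for an OpenAI-compatible embeddings call."""
    return json.dumps({"model": model, "input": texts})

body = embedding_request_body(["GENOME.md overview", "tool registry architecture"])
```

A LightRAG embedding hook would POST this body to `EMBED_URL` and read the vectors from the response's `data[i].embedding` fields.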
---
### P3 — tensorzero (Inference Optimization / LLMOps) 🔴 NOT STARTED
| Metric | Value |
|--------|-------|
| GitHub | [tensorzero/tensorzero](https://github.com/tensorzero/tensorzero) |
| Stars | 11.2k ⭐ |
| Integration Effort | 3/5 |
| Impact | 4/5 |
**Proposed integration:**
- Replace custom provider routing, fallback chains, and token tracking
- Intelligent routing across providers with cost/quality optimization
- Automatic prompt optimization based on feedback
- Evaluation metrics for A/B testing model/provider combinations
**Blocker:** Rust-based infrastructure. Requires careful migration of existing provider logic. Best done as gradual opt-in, not replacement.
**Next step:** Evaluate tensorzero gateway as optional `providers.tensorzero` backend.
---
### P4 — RAGFlow (Retrieval/RAG) 🔴 NOT STARTED
| Metric | Value |
|--------|-------|
| GitHub | [infiniflow/ragflow](https://github.com/infiniflow/ragflow) |
| Stars | 77.9k ⭐ |
| Integration Effort | 4/5 |
| Impact | 4/5 |
**Proposed integration:**
- Deploy as local Docker service for document understanding
- Ingest technical docs, research papers, codebases
- Query via HTTP API when agents need deep document comprehension
**Blocker:** Heavy deployment (multi-service Docker). Best suited for always-on infrastructure, not per-session.
**Next step:** Add RAGFlow API client tool in `tools/ragflow_tool.py` for document querying.
---
### P5 — n8n (Workflow Automation) 🔴 NOT STARTED
| Metric | Value |
|--------|-------|
| GitHub | [n8n-io/n8n](https://github.com/n8n-io/n8n) |
| Stars | 183.9k ⭐ |
| Integration Effort | 4/5 |
| Impact | 5/5 |
**Proposed integration:**
- Orchestrate Hermes agents from external events (webhooks, schedules)
- Visual workflow builder for burn loops, PR pipelines, multi-agent chains
- n8n webhooks trigger Hermes cron jobs or fleet dispatches
**Blocker:** Full application stack (Node.js, PostgreSQL, Redis). Deploy as standalone Docker service.
**Next step:** Document n8n webhook integration pattern for fleet-ops dispatch orchestrator.
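As a sketch of the webhook pattern, the dispatch call an n8n Webhook node would receive can be built like this. The URL path, event name, and payload fields are hypothetical; real n8n deployments generate their own webhook paths per workflow:

```python
import json
import urllib.request

# Hypothetical webhook URL; n8n generates its own path per workflow.
WEBHOOK_URL = "http://localhost:5678/webhook/hermes-dispatch"

def build_dispatch(event: str, payload: dict) -> urllib.request.Request:
    """Build the POST request an n8n Webhook node would receive."""
    body = json.dumps({"event": event, **payload}).encode()
    return urllib.request.Request(
        WEBHOOK_URL,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = build_dispatch("pr_merged", {"repo": "hermes-agent", "pr": 800})
```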
---
## Honorable Mentions Already in Stack
| Tool | Status | Notes |
|------|--------|-------|
| llama.cpp | ✅ Integrated | Via Ollama local inference |
| mempalace | ✅ Integrated | Holographic memory system (44.8k ⭐) |
---
## Category Breakdown
### Memory/Context (9 tools evaluated)
- Mem0 → **IMPLEMENTED** (cloud + local)
- mempalace → **Already integrated**
- memvid, nocturne_memory, rowboat, byterover-cli, letta-code, hindsight, agentic-context-engine → Evaluated, no action
### Inference Optimization (5 tools evaluated)
- llama.cpp → **Already integrated**
- tensorzero → Recommended (P3), not started
- vllm, mistral.rs, pruna → Evaluated, no action
### Retrieval/RAG (5 tools evaluated)
- LightRAG → Recommended (P2), not started
- RAGFlow → Recommended (P4), not started
- PageIndex, WeKnora, RAG-Anything → Evaluated, no action
### Agent Orchestration (5 tools evaluated)
- n8n → Recommended (P5), not started
- Langflow, agent-framework, deepagents, multica → Evaluated, no action
---
## References
- Source repository: https://github.com/formatho/awesome-ai-tools
- Total tools: 414 across 9 categories
- Freshness distribution: 🟢 303 | 🟡 49 | 🟠 22 | 🔴 40
- Hermes issue: [#842](https://forge.alexanderwhitestone.com/Timmy_Foundation/hermes-agent/issues/842)

tests/tools/test_binary_extensions.py

@@ -0,0 +1,39 @@
"""Tests for binary_extensions helpers."""
from tools.binary_extensions import has_binary_extension, has_image_extension


def test_has_image_extension_png():
    assert has_image_extension("/tmp/test.png") is True
    assert has_image_extension("/tmp/test.PNG") is True


def test_has_image_extension_jpg_variants():
    assert has_image_extension("/tmp/test.jpg") is True
    assert has_image_extension("/tmp/test.jpeg") is True
    assert has_image_extension("/tmp/test.JPG") is True


def test_has_image_extension_webp():
    assert has_image_extension("/tmp/test.webp") is True


def test_has_image_extension_gif():
    assert has_image_extension("/tmp/test.gif") is True


def test_has_image_extension_no_ext():
    assert has_image_extension("/tmp/test") is False


def test_has_image_extension_non_image():
    assert has_image_extension("/tmp/test.txt") is False
    assert has_image_extension("/tmp/test.exe") is False
    assert has_image_extension("/tmp/test.pdf") is False


def test_has_binary_extension_includes_images():
    """All image extensions must also be in binary extensions."""
    assert has_binary_extension("/tmp/test.png") is True
    assert has_binary_extension("/tmp/test.jpg") is True
    assert has_binary_extension("/tmp/test.webp") is True

tests/tools/test_file_tools.py

@@ -294,3 +294,67 @@ class TestSearchHints:
class TestReadFileImageRouting:
    """Tests that image files are routed through vision analysis."""

    @patch("tools.file_tools._analyze_image_with_vision")
    def test_image_png_routes_to_vision(self, mock_analyze, tmp_path):
        mock_analyze.return_value = json.dumps({"analysis": "test image"})
        img = tmp_path / "test.png"
        img.write_bytes(b"fake png data")
        from tools.file_tools import read_file_tool
        result = read_file_tool(str(img))
        mock_analyze.assert_called_once()
        assert json.loads(result)["analysis"] == "test image"

    @patch("tools.file_tools._analyze_image_with_vision")
    def test_image_jpeg_routes_to_vision(self, mock_analyze, tmp_path):
        mock_analyze.return_value = json.dumps({"analysis": "test image"})
        img = tmp_path / "test.jpeg"
        img.write_bytes(b"fake jpeg data")
        from tools.file_tools import read_file_tool
        result = read_file_tool(str(img))
        mock_analyze.assert_called_once()
        assert json.loads(result)["analysis"] == "test image"

    @patch("tools.file_tools._analyze_image_with_vision")
    def test_image_webp_routes_to_vision(self, mock_analyze, tmp_path):
        mock_analyze.return_value = json.dumps({"analysis": "test image"})
        img = tmp_path / "test.webp"
        img.write_bytes(b"fake webp data")
        from tools.file_tools import read_file_tool
        result = read_file_tool(str(img))
        mock_analyze.assert_called_once()
        assert json.loads(result)["analysis"] == "test image"

    def test_non_image_binary_blocked(self, tmp_path):
        from tools.file_tools import read_file_tool
        exe = tmp_path / "test.exe"
        exe.write_bytes(b"fake exe data")
        result = json.loads(read_file_tool(str(exe)))
        assert "error" in result
        assert "Cannot read binary" in result["error"]


class TestAnalyzeImageWithVision:
    """Tests for the _analyze_image_with_vision helper."""

    def test_import_error_fallback(self):
        with patch.dict("sys.modules", {"tools.vision_tools": None}):
            from tools.file_tools import _analyze_image_with_vision
            result = json.loads(_analyze_image_with_vision("/tmp/test.png"))
        assert "error" in result
        assert "vision_analyze tool is not available" in result["error"]

tools/binary_extensions.py

@@ -34,9 +34,22 @@ BINARY_EXTENSIONS = frozenset({
 })
 
+IMAGE_EXTENSIONS = frozenset({
+    ".png", ".jpg", ".jpeg", ".gif", ".bmp", ".ico", ".webp", ".tiff", ".tif",
+})
+
 
 def has_binary_extension(path: str) -> bool:
     """Check if a file path has a binary extension. Pure string check, no I/O."""
     dot = path.rfind(".")
     if dot == -1:
         return False
     return path[dot:].lower() in BINARY_EXTENSIONS
+
+
+def has_image_extension(path: str) -> bool:
+    """Check if a file path has an image extension. Pure string check, no I/O."""
+    dot = path.rfind(".")
+    if dot == -1:
+        return False
+    return path[dot:].lower() in IMAGE_EXTENSIONS
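For reference, the `rfind`-based check above can also be written with `pathlib`. A standalone sketch, re-declaring the extension set locally so it runs on its own:

```python
from pathlib import Path

# Local copy of the extension set so this sketch is self-contained.
IMAGE_EXTENSIONS = frozenset({
    ".png", ".jpg", ".jpeg", ".gif", ".bmp", ".ico", ".webp", ".tiff", ".tif",
})

def has_image_extension(path: str) -> bool:
    # Path.suffix is "" for extension-less names and ".gz" for "a.tar.gz",
    # matching the rfind-based check for ordinary paths. (The two diverge
    # only on bare dotfiles like ".png", where suffix is empty.)
    return Path(path).suffix.lower() in IMAGE_EXTENSIONS
```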

tools/browser_tool.py

@@ -1893,11 +1893,13 @@ def browser_get_images(task_id: Optional[str] = None) -> str:
 def browser_vision(question: str, annotate: bool = False, task_id: Optional[str] = None) -> str:
     """
     Take a screenshot of the current page and analyze it with vision AI.
 
     This tool captures what's visually displayed in the browser and sends it
-    to Gemini for analysis. Useful for understanding visual content that the
-    text-based snapshot may not capture (CAPTCHAs, verification challenges,
-    images, complex layouts, etc.).
+    to the configured vision model for analysis. When the active model is
+    natively multimodal (e.g. Gemma 4) it is used directly; otherwise the
+    auxiliary vision backend is used. Useful for understanding visual content
+    that the text-based snapshot may not capture (CAPTCHAs, verification
+    challenges, images, complex layouts, etc.).
 
     The screenshot is saved persistently and its file path is returned alongside
     the analysis, so it can be shared with users via MEDIA:<path> in the response.

tools/file_tools.py

@@ -7,7 +7,7 @@ import logging
 import os
 import threading
 from pathlib import Path
-from tools.binary_extensions import has_binary_extension
+from tools.binary_extensions import has_binary_extension, has_image_extension
 from tools.file_operations import ShellFileOperations
 from agent.redact import redact_sensitive_text
@@ -279,6 +279,52 @@ def clear_file_ops_cache(task_id: str = None):
     _file_ops_cache.clear()
 
 
+def _analyze_image_with_vision(image_path: str, task_id: str = "default") -> str:
+    """Route an image file through the vision analysis pipeline.
+
+    Uses vision_analyze_tool with a default descriptive prompt. Falls back
+    to a manual error when no vision backend is available.
+    """
+    import asyncio
+
+    try:
+        from tools.vision_tools import vision_analyze_tool
+    except ImportError:
+        return json.dumps({
+            "error": (
+                f"Image file '{image_path}' detected but vision_analyze tool "
+                "is not available. Use vision_analyze directly if configured."
+            ),
+        })
+
+    prompt = (
+        "Describe this image in detail. If it contains text, transcribe "
+        "the text. If it is a diagram, chart, or UI screenshot, describe "
+        "the layout, colors, labels, and any visible data."
+    )
+    try:
+        result = asyncio.run(vision_analyze_tool(image_url=image_path, question=prompt))
+    except Exception as exc:
+        return json.dumps({
+            "error": (
+                f"Image file '{image_path}' detected but vision analysis failed: {exc}. "
+                "Use vision_analyze directly if configured."
+            ),
+        })
+
+    try:
+        parsed = json.loads(result)
+    except json.JSONDecodeError:
+        parsed = {"content": result}
+
+    # Wrap the vision result so the caller knows it came from image analysis
+    return json.dumps({
+        "image_path": image_path,
+        "analysis": parsed.get("content") or parsed.get("analysis") or result,
+        "source": "vision_analyze",
+    }, ensure_ascii=False)
+
+
 def read_file_tool(path: str, offset: int = 1, limit: int = 500, task_id: str = "default") -> str:
     """Read a file with pagination and line numbers."""
     try:
@@ -295,10 +341,13 @@ def read_file_tool(path: str, offset: int = 1, limit: int = 500, task_id: str =
     _resolved = Path(path).expanduser().resolve()
 
-    # ── Binary file guard ─────────────────────────────────────────
-    # Block binary files by extension (no I/O).
+    # ── Binary / image file guard ─────────────────────────────────
+    # Block binary files by extension (no I/O). Images are routed
+    # through the vision analysis pipeline when a backend is available.
     if has_binary_extension(str(_resolved)):
         _ext = _resolved.suffix.lower()
+        if has_image_extension(str(_resolved)):
+            return _analyze_image_with_vision(str(_resolved), task_id=task_id)
         return json.dumps({
             "error": (
                 f"Cannot read binary file '{path}' ({_ext}). "
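The wrapping step at the end of `_analyze_image_with_vision` can be exercised in isolation. `wrap_vision_result` is a hypothetical standalone name that mirrors that logic; it tolerates both JSON and plain-text vision backends:

```python
import json

def wrap_vision_result(image_path: str, result: str) -> str:
    # Mirror of the wrapping logic in _analyze_image_with_vision: tolerate
    # JSON or plain-text backend output, and tag the payload so callers can
    # tell it came from image analysis.
    try:
        parsed = json.loads(result)
    except json.JSONDecodeError:
        parsed = {"content": result}
    return json.dumps({
        "image_path": image_path,
        "analysis": parsed.get("content") or parsed.get("analysis") or result,
        "source": "vision_analyze",
    }, ensure_ascii=False)
```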
@@ -729,7 +778,7 @@ def _check_file_reqs():
 READ_FILE_SCHEMA = {
     "name": "read_file",
-    "description": "Read a text file with line numbers and pagination. Use this instead of cat/head/tail in terminal. Output format: 'LINE_NUM|CONTENT'. Suggests similar filenames if not found. Use offset and limit for large files. Reads exceeding ~100K characters are rejected; use offset and limit to read specific sections of large files. NOTE: Cannot read images or binary files — use vision_analyze for images.",
+    "description": "Read a text file with line numbers and pagination. Use this instead of cat/head/tail in terminal. Output format: 'LINE_NUM|CONTENT'. Suggests similar filenames if not found. Use offset and limit for large files. Reads exceeding ~100K characters are rejected; use offset and limit to read specific sections of large files. NOTE: Image files (PNG, JPEG, WebP, GIF, etc.) are automatically analyzed via vision_analyze. Other binary files cannot be read as text.",
     "parameters": {
         "type": "object",
         "properties": {