feat: route image files through vision analysis (Gemma 4 multimodal)

- tools/binary_extensions.py: add IMAGE_EXTENSIONS + has_image_extension()
- tools/file_tools.py: detect image files in read_file_tool and auto-route
  to vision_analyze_tool instead of returning a binary-file error. Wraps the
  vision result so callers know it came from image analysis.
- tools/browser_tool.py: update browser_vision docstring to document that
  natively multimodal models (e.g. Gemma 4) are used directly when available.
- tests/tools/test_binary_extensions.py: new tests for image extension helpers
- tests/tools/test_file_tools.py: add TestReadFileImageRouting for
  PNG/JPEG/WebP auto-routing and TestAnalyzeImageWithVision for fallback
  coverage.

Closes #800
2026-04-22 02:54:18 -04:00
6 changed files with 176 additions and 46 deletions
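
A minimal sketch of the routing order this change introduces, for readers skimming the diff: image extensions are checked first and handed to a vision callback, other binary extensions return an error, and everything else falls through to the normal text read. The names `route_read` and `analyze`, the two-entry non-image binary set, and the placeholder text branch are hypothetical stand-ins for illustration; the real logic lives in `read_file_tool` and `_analyze_image_with_vision` in the diff below.

```python
import json
from typing import Callable

# Same image extensions as IMAGE_EXTENSIONS in tools/binary_extensions.py.
IMAGE_EXTENSIONS = frozenset({
    ".png", ".jpg", ".jpeg", ".gif", ".bmp", ".ico", ".webp", ".tiff", ".tif",
})
# Illustrative subset only; the project's BINARY_EXTENSIONS set is larger.
BINARY_EXTENSIONS = IMAGE_EXTENSIONS | frozenset({".exe", ".zip"})


def _ext(path: str) -> str:
    """Return the lowercased extension, or '' if the path has no dot."""
    dot = path.rfind(".")
    return "" if dot == -1 else path[dot:].lower()


def route_read(path: str, analyze: Callable[[str], str]) -> str:
    """Hypothetical router: images go to `analyze`, other binaries are rejected."""
    ext = _ext(path)
    if ext in IMAGE_EXTENSIONS:
        # Wrap the result so the caller can tell it came from image analysis.
        return json.dumps({
            "image_path": path,
            "analysis": analyze(path),
            "source": "vision_analyze",
        })
    if ext in BINARY_EXTENSIONS:
        return json.dumps({"error": f"Cannot read binary file '{path}' ({ext})."})
    # Placeholder for the paginated text read performed by the real tool.
    return json.dumps({"content": "<text read would happen here>"})
```

Passing the analyzer in as a callback keeps the sketch free of any vision backend, which is also how the tests in this change isolate the routing: they patch `_analyze_image_with_vision` rather than calling a model.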
--- a/research_r5_vs_e2e_gap.md
+++ b/research_r5_vs_e2e_gap.md
@@ -284,44 +284,7 @@ The gap can be reduced from 81 points to ~25-45 points with proper interventions

 ---

-## 6. Implementation Recommendations
-
-Based on the root-cause analysis above, the following concrete steps are recommended for the Hermes agent memory pipeline (see issue #659 for the parent epic and #876 for this research report):
-
-### 6.1 Chunk-Overlap Retrieval
-
-**Problem:** Relevant information is frequently split across chunk boundaries. Retrieval finds one chunk but the answer spans two.
-
-**Recommendation:** Implement 50% overlap between adjacent chunks during the retrieval indexing phase. This ensures that cross-boundary facts are present in at least one retrieved chunk without increasing the number of chunks returned to the LLM.
-
-### 6.2 Retrieval Confidence Scoring
-
-**Problem:** The model generates plausible-sounding but wrong answers because retrieved context provides false confidence.
-
-**Recommendation:** Add a confidence score to each retrieved chunk (e.g., cosine-similarity threshold + source-reliability weight). Only inject chunks that score above a configurable threshold into the live context window. Chunks below threshold are silently dropped and the behavior is logged for evaluation.
-
-### 6.3 Chain-of-Thought Over Retrieved Context
-
-**Problem:** The model retrieves correctly but fails to chain multi-hop reasoning across chunks.
-
-**Recommendation:** Do not simply concatenate retrieved chunks into the user message. Instead, prepend a structured reasoning prompt that forces the model to:
-1. Quote the specific chunk that supports each step.
-2. Flag when two chunks must be combined to reach a conclusion.
-3. Stop and emit "I don't know" if no chunk supports a required inference step.
-
-### 6.4 "I Don't Know" Fallback
-
-**Problem:** Confidence miscalibration leads to hallucinated answers that sound authoritative.
-
-**Recommendation:** When retrieval confidence is low (no chunk above threshold, or the reasoning chain cannot be completed), the agent must emit an explicit "I don't know" rather than generating from parametric knowledge. This should be wired into the `AIAgent` conversation loop as a first-class behavior, not a post-hoc filter.
-
-### 6.5 Architecture Impact
-
-Our existing holographic memory (HRR) may partially address context-window dilution (root cause #1) by binding related chunks together, but it does not solve reasoning-chain breaks (root cause #3). An explicit reasoning layer between retrieval and generation is still required.
-
---
-
-## 7. Limitations of This Research
+## 6. Limitations of This Research

 1. **MemPalace/Engram team analysis not found** - The specific analysis that discovered the 17% figure was not located through academic search. This may be from internal reports, blog posts, or presentations not indexed in arXiv.

--- a/tests/tools/test_binary_extensions.py
+++ b/tests/tools/test_binary_extensions.py
@@ -0,0 +1,39 @@
+"""Tests for binary_extensions helpers."""
+
+from tools.binary_extensions import has_binary_extension, has_image_extension
+
+
+def test_has_image_extension_png():
+    assert has_image_extension("/tmp/test.png") is True
+    assert has_image_extension("/tmp/test.PNG") is True
+
+
+def test_has_image_extension_jpg_variants():
+    assert has_image_extension("/tmp/test.jpg") is True
+    assert has_image_extension("/tmp/test.jpeg") is True
+    assert has_image_extension("/tmp/test.JPG") is True
+
+
+def test_has_image_extension_webp():
+    assert has_image_extension("/tmp/test.webp") is True
+
+
+def test_has_image_extension_gif():
+    assert has_image_extension("/tmp/test.gif") is True
+
+
+def test_has_image_extension_no_ext():
+    assert has_image_extension("/tmp/test") is False
+
+
+def test_has_image_extension_non_image():
+    assert has_image_extension("/tmp/test.txt") is False
+    assert has_image_extension("/tmp/test.exe") is False
+    assert has_image_extension("/tmp/test.pdf") is False
+
+
+def test_has_binary_extension_includes_images():
+    """All image extensions must also be in binary extensions."""
+    assert has_binary_extension("/tmp/test.png") is True
+    assert has_binary_extension("/tmp/test.jpg") is True
+    assert has_binary_extension("/tmp/test.webp") is True
--- a/tests/tools/test_file_tools.py
+++ b/tests/tools/test_file_tools.py
@@ -294,3 +294,67 @@ class TestSearchHints:



+
+
+class TestReadFileImageRouting:
+    """Tests that image files are routed through vision analysis."""
+
+    @patch("tools.file_tools._analyze_image_with_vision")
+    def test_image_png_routes_to_vision(self, mock_analyze, tmp_path):
+        mock_analyze.return_value = json.dumps({"analysis": "test image"})
+        img = tmp_path / "test.png"
+        img.write_bytes(b"fake png data")
+
+        from tools.file_tools import read_file_tool
+        result = read_file_tool(str(img))
+        mock_analyze.assert_called_once()
+        assert json.loads(result)["analysis"] == "test image"
+
+    @patch("tools.file_tools._analyze_image_with_vision")
+    def test_image_jpeg_routes_to_vision(self, mock_analyze, tmp_path):
+        mock_analyze.return_value = json.dumps({"analysis": "test image"})
+        img = tmp_path / "test.jpeg"
+        img.write_bytes(b"fake jpeg data")
+
+        from tools.file_tools import read_file_tool
+        result = read_file_tool(str(img))
+        mock_analyze.assert_called_once()
+        assert json.loads(result)["analysis"] == "test image"
+
+    @patch("tools.file_tools._analyze_image_with_vision")
+    def test_image_webp_routes_to_vision(self, mock_analyze, tmp_path):
+        mock_analyze.return_value = json.dumps({"analysis": "test image"})
+        img = tmp_path / "test.webp"
+        img.write_bytes(b"fake webp data")
+
+        from tools.file_tools import read_file_tool
+        result = read_file_tool(str(img))
+        mock_analyze.assert_called_once()
+        assert json.loads(result)["analysis"] == "test image"
+
+    def test_non_image_binary_blocked(self, tmp_path):
+        from tools.file_tools import read_file_tool
+        exe = tmp_path / "test.exe"
+        exe.write_bytes(b"fake exe data")
+        result = json.loads(read_file_tool(str(exe)))
+        assert "error" in result
+        assert "Cannot read binary" in result["error"]
+
+
+
+
+
+
+
+
+
+
+class TestAnalyzeImageWithVision:
+    """Tests for the _analyze_image_with_vision helper."""
+
+    def test_import_error_fallback(self):
+        with patch.dict("sys.modules", {"tools.vision_tools": None}):
+            from tools.file_tools import _analyze_image_with_vision
+            result = json.loads(_analyze_image_with_vision("/tmp/test.png"))
+            assert "error" in result
+            assert "vision_analyze tool is not available" in result["error"]
--- a/tools/binary_extensions.py
+++ b/tools/binary_extensions.py
@@ -34,9 +34,22 @@ BINARY_EXTENSIONS = frozenset({
 })


+IMAGE_EXTENSIONS = frozenset({
+    ".png", ".jpg", ".jpeg", ".gif", ".bmp", ".ico", ".webp", ".tiff", ".tif",
+})
+
+
 def has_binary_extension(path: str) -> bool:
    """Check if a file path has a binary extension. Pure string check, no I/O."""
    dot = path.rfind(".")
    if dot == -1:
        return False
    return path[dot:].lower() in BINARY_EXTENSIONS
+
+
+def has_image_extension(path: str) -> bool:
+    """Check if a file path has an image extension. Pure string check, no I/O."""
+    dot = path.rfind(".")
+    if dot == -1:
+        return False
+    return path[dot:].lower() in IMAGE_EXTENSIONS
--- a/tools/browser_tool.py
+++ b/tools/browser_tool.py
@@ -1893,11 +1893,13 @@ def browser_get_images(task_id: Optional[str] = None) -> str:
 def browser_vision(question: str, annotate: bool = False, task_id: Optional[str] = None) -> str:
    """
    Take a screenshot of the current page and analyze it with vision AI.
-    
+
    This tool captures what's visually displayed in the browser and sends it
-    to Gemini for analysis. Useful for understanding visual content that the
-    text-based snapshot may not capture (CAPTCHAs, verification challenges,
-    images, complex layouts, etc.).
+    to the configured vision model for analysis.  When the active model is
+    natively multimodal (e.g. Gemma 4) it is used directly; otherwise the
+    auxiliary vision backend is used.  Useful for understanding visual content
+    that the text-based snapshot may not capture (CAPTCHAs, verification
+    challenges, images, complex layouts, etc.).
    
    The screenshot is saved persistently and its file path is returned alongside
    the analysis, so it can be shared with users via MEDIA:<path> in the response.
--- a/tools/file_tools.py
+++ b/tools/file_tools.py
@@ -7,7 +7,7 @@ import logging
 import os
 import threading
 from pathlib import Path
-from tools.binary_extensions import has_binary_extension
+from tools.binary_extensions import has_binary_extension, has_image_extension
 from tools.file_operations import ShellFileOperations
 from agent.redact import redact_sensitive_text

@@ -279,6 +279,52 @@ def clear_file_ops_cache(task_id: str = None):
            _file_ops_cache.clear()


+def _analyze_image_with_vision(image_path: str, task_id: str = "default") -> str:
+    """Route an image file through the vision analysis pipeline.
+
+    Uses vision_analyze_tool with a default descriptive prompt.  Returns a
+    JSON error payload when the vision tool is unavailable or the call fails.
+    """
+    import asyncio
+    try:
+        from tools.vision_tools import vision_analyze_tool
+    except ImportError:
+        return json.dumps({
+            "error": (
+                f"Image file '{image_path}' detected but vision_analyze tool "
+                "is not available. Use vision_analyze directly if configured."
+            ),
+        })
+
+    prompt = (
+        "Describe this image in detail. If it contains text, transcribe "
+        "the text. If it is a diagram, chart, or UI screenshot, describe "
+        "the layout, colors, labels, and any visible data."
+    )
+
+    try:
+        result = asyncio.run(vision_analyze_tool(image_url=image_path, question=prompt))
+    except Exception as exc:
+        return json.dumps({
+            "error": (
+                f"Image file '{image_path}' detected but vision analysis failed: {exc}. "
+                "Use vision_analyze directly if configured."
+            ),
+        })
+
+    try:
+        parsed = json.loads(result)
+    except (TypeError, json.JSONDecodeError):
+        parsed = None
+    fields = parsed if isinstance(parsed, dict) else {}  # guard non-dict JSON
+    # Wrap the vision result so the caller knows it came from image analysis
+    return json.dumps({
+        "image_path": image_path,
+        "analysis": fields.get("content") or fields.get("analysis") or result,
+        "source": "vision_analyze",
+    }, ensure_ascii=False)
+
+
 def read_file_tool(path: str, offset: int = 1, limit: int = 500, task_id: str = "default") -> str:
    """Read a file with pagination and line numbers."""
    try:
@@ -295,10 +341,13 @@ def read_file_tool(path: str, offset: int = 1, limit: int = 500, task_id: str =

        _resolved = Path(path).expanduser().resolve()

-        # ── Binary file guard ─────────────────────────────────────────
-        # Block binary files by extension (no I/O).
+        # ── Binary / image file guard ─────────────────────────────────
+        # Block binary files by extension (no I/O).  Images are routed
+        # through the vision analysis pipeline when a backend is available.
        if has_binary_extension(str(_resolved)):
            _ext = _resolved.suffix.lower()
+            if has_image_extension(str(_resolved)):
+                return _analyze_image_with_vision(str(_resolved), task_id=task_id)
            return json.dumps({
                "error": (
                    f"Cannot read binary file '{path}' ({_ext}). "
@@ -729,7 +778,7 @@ def _check_file_reqs():

 READ_FILE_SCHEMA = {
    "name": "read_file",
-    "description": "Read a text file with line numbers and pagination. Use this instead of cat/head/tail in terminal. Output format: 'LINE_NUM|CONTENT'. Suggests similar filenames if not found. Use offset and limit for large files. Reads exceeding ~100K characters are rejected; use offset and limit to read specific sections of large files. NOTE: Cannot read images or binary files — use vision_analyze for images.",
+    "description": "Read a text file with line numbers and pagination. Use this instead of cat/head/tail in terminal. Output format: 'LINE_NUM|CONTENT'. Suggests similar filenames if not found. Use offset and limit for large files. Reads exceeding ~100K characters are rejected; use offset and limit to read specific sections of large files. NOTE: Image files (PNG, JPEG, WebP, GIF, etc.) are automatically analyzed via vision_analyze. Other binary files cannot be read as text.",
    "parameters": {
        "type": "object",
        "properties": {