feat: Wire Gemma 4 vision into browser_tool for screenshot analysis

Default browser_vision screenshots to google/gemma-4-27b-it (Gemma 4 native multimodal) for reduced latency and a unified text+vision model.

Resolution order for _get_vision_model():
1. BROWSER_VISION_MODEL env var (new, browser-specific override)
2. auxiliary.browser_vision.model in config.yaml (new config key)
3. AUXILIARY_VISION_MODEL env var (existing global vision override)
4. Default: google/gemma-4-27b-it

Backward compatibility: existing AUXILIARY_VISION_MODEL users are unaffected; their override still flows through to browser_vision.

Also documents the new auxiliary.browser_vision config section in cli-config.yaml.example and adds 14 unit tests covering the full priority chain.

Fixes #816

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
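
The resolution order above can be sketched as a small standalone function. This is an illustration only, not the body of _get_vision_model() from this commit: the name resolve_browser_vision_model, the _clean helper, passing config.yaml in as an already-parsed mapping, and treating blank values as unset are all assumptions made for the example.

```python
import os
from collections.abc import Mapping
from typing import Any, Optional

DEFAULT_BROWSER_VISION_MODEL = "google/gemma-4-27b-it"


def _clean(value: Any) -> str:
    """Return a stripped string, or "" for None and non-string values."""
    return value.strip() if isinstance(value, str) else ""


def resolve_browser_vision_model(
    config: Optional[Mapping] = None,
    env: Optional[Mapping] = None,
) -> str:
    """Hypothetical resolver following the four-step priority chain."""
    env = os.environ if env is None else env
    config = config if isinstance(config, Mapping) else {}

    # 1. Browser-specific environment override.
    model = _clean(env.get("BROWSER_VISION_MODEL"))
    if model:
        return model

    # 2. auxiliary.browser_vision.model from the parsed config mapping.
    auxiliary = config.get("auxiliary")
    section = auxiliary.get("browser_vision") if isinstance(auxiliary, Mapping) else None
    if isinstance(section, Mapping):
        model = _clean(section.get("model"))
        if model:
            return model

    # 3. Existing global vision override, kept for backward compatibility.
    model = _clean(env.get("AUXILIARY_VISION_MODEL"))
    if model:
        return model

    # 4. Built-in default.
    return DEFAULT_BROWSER_VISION_MODEL


if __name__ == "__main__":
    cfg = {"auxiliary": {"browser_vision": {"model": "cfg-model"}}}
    # The config key outranks the global env var but not the browser-specific one.
    print(resolve_browser_vision_model(cfg, {"AUXILIARY_VISION_MODEL": "aux-model"}))
```

Taking env and config as parameters instead of reading global state keeps each step of the chain testable in isolation, which is one way a test suite could cover the full priority order.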
2026-04-21 17:12:58 -04:00
parent 12b5d9a7fd
commit 95bb842a21
3 changed files with 34 additions and 2 deletions
--- a/tools/browser_tool.py
+++ b/tools/browser_tool.py
@@ -806,7 +806,7 @@ BROWSER_TOOL_SCHEMAS = [
    },
    {
        "name": "browser_vision",
-        "description": "Take a screenshot of the current page and analyze it with vision AI. Use this when you need to visually understand what's on the page - especially useful for CAPTCHAs, visual verification challenges, complex layouts, or when the text snapshot doesn't capture important visual information. Returns both the AI analysis and a screenshot_path that you can share with the user by including MEDIA:<screenshot_path> in your response. Requires browser_navigate to be called first.",
+        "description": "Take a screenshot of the current page and analyze it with vision AI (default: Gemma 4 multimodal). Use this when you need to visually understand what's on the page - especially useful for CAPTCHAs, visual verification challenges, complex layouts, or when the text snapshot doesn't capture important visual information. Returns both the AI analysis and a screenshot_path that you can share with the user by including MEDIA:<screenshot_path> in your response. Requires browser_navigate to be called first. Vision model can be overridden via BROWSER_VISION_MODEL env var or auxiliary.browser_vision.model in config.yaml.",
        "parameters": {
            "type": "object",
            "properties": {