browser_vision now saves screenshots persistently to ~/.hermes/browser_screenshots/ and returns the screenshot_path in its JSON response. The model can include MEDIA:<path> in its response to share screenshots as native photos.

Changes:
- browser_tool.py: Save screenshots persistently, return screenshot_path, auto-cleanup files older than 24 hours, mkdir moved inside try/except
- telegram.py: Add send_image_file(), which sends local images via bot.send_photo()
- discord.py: Add send_image_file(), which sends local images via discord.File
- slack.py: Add send_image_file(), which sends local images via files_upload_v2() (WhatsApp already had send_image_file, so no changes were needed)
- prompt_builder.py: Updated Telegram hint to list image extensions, added Discord and Slack MEDIA: platform hints
- browser.md: Document screenshot sharing and 24h cleanup
- send_file_integration_map.md: Updated to reflect send_image_file is now implemented on Telegram/Discord/Slack
- test_send_image_file.py: 19 tests covering MEDIA: .png extraction, send_image_file on all platforms, and screenshot cleanup

Partially addresses #466 (Phase 0: platform adapter gaps for send_image_file).
send_file Integration Map — Hermes Agent Codebase Deep Dive
1. environments/tool_context.py — Base64 File Transfer Implementation
upload_file() (lines 153-205)
- Reads local file as raw bytes, base64-encodes to ASCII string
- Creates parent dirs in sandbox via `self.terminal(f"mkdir -p {parent}")`
- Chunk size: 60,000 chars (~60KB per shell command)
- Small files (<=60KB b64): Single `printf '%s' '{b64}' | base64 -d > {remote_path}`
- Large files: Writes chunks to `/tmp/_hermes_upload.b64` via `printf >>` append, then `base64 -d` to target
- Error handling: Checks local file exists; returns `{exit_code, output}`
- Size limits: No explicit limit, but shell arg limit ~2MB means chunking is necessary for files >~45KB raw
- No theoretical max — but very large files would be slow (many terminal round trips)
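The chunking scheme described above can be sketched as a pure function that only builds the shell commands. This is a minimal illustration, not the code in `tool_context.py`: the function name `build_upload_commands`, the truncate step, and the trailing `rm -f` are assumptions added for the example.

```python
import base64
import shlex
from typing import List

CHUNK_SIZE = 60_000  # base64 characters per shell command, as in the notes above
TMP_B64 = "/tmp/_hermes_upload.b64"  # staging file named in the notes above


def build_upload_commands(data: bytes, remote_path: str,
                          chunk_size: int = CHUNK_SIZE) -> List[str]:
    """Return the shell commands that would recreate `data` at `remote_path`."""
    b64 = base64.b64encode(data).decode("ascii")
    target = shlex.quote(remote_path)

    # Small payload: a single command decodes straight into the target file.
    # The base64 alphabet contains no single quotes, so quoting is safe.
    if len(b64) <= chunk_size:
        return [f"printf '%s' '{b64}' | base64 -d > {target}"]

    # Large payload: append chunks to a staging file, then decode once.
    commands = [f": > {TMP_B64}"]  # assumption: clear any stale staging file
    for start in range(0, len(b64), chunk_size):
        chunk = b64[start:start + chunk_size]
        commands.append(f"printf '%s' '{chunk}' >> {TMP_B64}")
    commands.append(f"base64 -d {TMP_B64} > {target} && rm -f {TMP_B64}")
    return commands
```

Each returned string would be passed to `self.terminal()` in turn, so the number of terminal round trips grows linearly with file size.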
download_file() (lines 234-278)
- Runs `base64 {remote_path}` inside sandbox, captures stdout
- Strips output, base64-decodes to raw bytes
- Writes to host filesystem with parent dir creation
- Error handling: Checks exit code, empty output, decode errors
- Returns `{success: bool, bytes: int}` or `{success: false, error: str}`
- Size limit: Bounded by terminal output buffer (practical limit ~few MB via base64 terminal output)
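The host-side half of that download flow might look like the sketch below. It assumes the terminal call has already produced an exit code and captured stdout; `save_download` is a hypothetical name, and the error strings are illustrative.

```python
import base64
import binascii
from pathlib import Path


def save_download(exit_code: int, output: str, local_path: str) -> dict:
    """Decode base64 terminal output and write it to `local_path` on the host."""
    if exit_code != 0:
        return {"success": False, "error": f"base64 exited with code {exit_code}"}

    payload = output.strip()
    if not payload:
        return {"success": False, "error": "empty output"}

    try:
        # Line breaks emitted by `base64` are ignored by the decoder.
        raw = base64.b64decode(payload)
    except (binascii.Error, ValueError) as exc:
        return {"success": False, "error": f"decode error: {exc}"}

    dest = Path(local_path)
    dest.parent.mkdir(parents=True, exist_ok=True)  # create parent dirs on the host
    dest.write_bytes(raw)
    return {"success": True, "bytes": len(raw)}
```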
Promotion potential:
- These methods work via `self.terminal()`, so they're environment-agnostic
- Could be directly lifted into a new tool that operates on the agent's current sandbox
- For send_file, this `download_file()` pattern is the key: it extracts files from sandbox → host
2. tools/environments/base.py — BaseEnvironment Interface
Current methods:
- `execute(command, cwd, timeout, stdin_data)` → `{output, returncode}`
- `cleanup()`: release resources
- `stop()`: alias for cleanup
- `_prepare_command()`: sudo transformation
- `_build_run_kwargs()`: subprocess kwargs
- `_timeout_result()`: standard timeout dict
What would need to be added for file transfer:
- Nothing required at this level. File transfer can be implemented via `execute()` (base64 over terminal, like ToolContext does) or via environment-specific methods.
- Optional: `upload_file(local_path, remote_path)` and `download_file(remote_path, local_path)` methods could be added to BaseEnvironment for optimized per-backend transfers, but the base64-over-terminal approach already works universally.
3. tools/environments/docker.py — Docker Container Details
Container ID tracking:
- `self._container_id` stored at init from `self._inner.container_id`
- Inner is `minisweagent.environments.docker.DockerEnvironment`
- Container ID is a standard Docker container hash
docker cp feasibility:
- YES, `docker cp` could be used for optimized file transfer:
  - `docker cp {container_id}:{remote_path} {local_path}` (download)
  - `docker cp {local_path} {container_id}:{remote_path}` (upload)
- Much faster than base64-over-terminal for large files
- Container ID is directly accessible via `env._container_id` or `env._inner.container_id`
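As a rough sketch, the two `docker cp` invocations can be built as argument lists and run without a shell, which avoids quoting problems with unusual paths. The helper names below are hypothetical and not part of `docker.py`.

```python
import subprocess
from typing import List


def docker_cp_download_argv(container_id: str, remote_path: str, local_path: str) -> List[str]:
    """Argument list for copying a file out of the container."""
    return ["docker", "cp", f"{container_id}:{remote_path}", local_path]


def docker_cp_upload_argv(container_id: str, local_path: str, remote_path: str) -> List[str]:
    """Argument list for copying a file into the container."""
    return ["docker", "cp", local_path, f"{container_id}:{remote_path}"]


def run_copy(argv: List[str], timeout: float = 60.0) -> dict:
    """Run a copy command and report success in the style used elsewhere in this note."""
    proc = subprocess.run(argv, capture_output=True, text=True, timeout=timeout)
    return {"success": proc.returncode == 0, "error": proc.stderr.strip()}
```

A caller would pass `env._container_id` as `container_id`, for example `run_copy(docker_cp_download_argv(env._container_id, "/workspace/out.pdf", "/tmp/out.pdf"))`.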
Volumes mounted:
- Persistent mode: Bind mounts at `~/.hermes/sandboxes/docker/{task_id}/workspace` → `/workspace` and `.../home` → `/root`
- Ephemeral mode: tmpfs at `/workspace` (10GB), `/home` (1GB), `/root` (1GB)
- User volumes: From `config.yaml` `docker_volumes` (arbitrary `-v` mounts)
- Security tmpfs: `/tmp` (512MB), `/var/tmp` (256MB), `/run` (64MB)
Direct host access for persistent mode:
- If persistent, files at `/workspace/foo.txt` are just `~/.hermes/sandboxes/docker/{task_id}/workspace/foo.txt` on host, so no transfer is needed!
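The persistent-mode shortcut amounts to a path translation. The sketch below assumes the two bind mounts listed above are the only ones that matter; `host_path_for` is a hypothetical helper, and the `..` check is an added precaution rather than documented behaviour.

```python
from pathlib import Path, PurePosixPath
from typing import Optional

SANDBOX_ROOT = Path.home() / ".hermes" / "sandboxes" / "docker"
# Container mount point -> directory name under the per-task sandbox root.
BIND_MOUNTS = {"/workspace": "workspace", "/root": "home"}


def host_path_for(task_id: str, container_path: str) -> Optional[Path]:
    """Map a container path to its host path, or None if it is not bind-mounted."""
    path = PurePosixPath(container_path)
    if ".." in path.parts:
        return None  # refuse paths that could climb out of the mount
    for mount_point, host_dir in BIND_MOUNTS.items():
        try:
            relative = path.relative_to(mount_point)
        except ValueError:
            continue  # not under this mount point
        return SANDBOX_ROOT / task_id / host_dir / relative
    return None
```

A `None` result signals that the file lives on tmpfs or an unrelated path, in which case `docker cp` or base64-over-terminal is still required.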
4. tools/environments/ssh.py — SSH Connection Management
Connection management:
- Uses SSH ControlMaster for persistent connection
- Control socket at `/tmp/hermes-ssh/{user}@{host}:{port}.sock`
- ControlPersist=300 (5 min keepalive)
- BatchMode=yes (non-interactive)
- Stores: `self.host`, `self.user`, `self.port`, `self.key_path`
SCP/SFTP feasibility:
- YES, SCP can piggyback on the ControlMaster socket:
  - `scp -o ControlPath={socket} {user}@{host}:{remote} {local}` (download)
  - `scp -o ControlPath={socket} {local} {user}@{host}:{remote}` (upload)
- Same SSH key and connection reuse — zero additional auth
- Would be much faster than base64-over-terminal for large files
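A hedged sketch of the SCP download command, built from the connection fields listed above. The helper names are hypothetical; `-P` is scp's port option, and `BatchMode=yes` mirrors the existing connection settings.

```python
from typing import List


def control_socket_path(user: str, host: str, port: int) -> str:
    """Control socket location as described in the notes above."""
    return f"/tmp/hermes-ssh/{user}@{host}:{port}.sock"


def scp_download_argv(user: str, host: str, port: int,
                      remote_path: str, local_path: str) -> List[str]:
    """Argument list for an scp download that reuses the ControlMaster socket."""
    return [
        "scp",
        "-o", f"ControlPath={control_socket_path(user, host, port)}",
        "-o", "BatchMode=yes",   # never prompt for a password
        "-P", str(port),         # scp uses an uppercase -P for the port
        f"{user}@{host}:{remote_path}",
        local_path,
    ]
```

Swapping the last two elements gives the upload form. The list can be handed to `subprocess.run()` directly, with no shell involved.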
5. tools/environments/modal.py — Modal Sandbox Filesystem
Filesystem API exposure:
- Not directly. The inner `SwerexModalEnvironment` wraps Modal's sandbox
- The sandbox object is accessible at: `env._inner.deployment._sandbox`
- Modal's Python SDK exposes `sandbox.open()` for file I/O, but only via async API
- Currently only used for `snapshot_filesystem()` during cleanup
- Could use: `sandbox.open(path, "rb")` to read files or `sandbox.open(path, "wb")` to write
- Alternative: Base64-over-terminal already works via `execute()`, which is simpler and needs no SDK dependency
6. gateway/platforms/base.py — MEDIA: Tag Flow (Complete)
extract_media() (lines 587-620):
- Pattern: `MEDIA:\S+`, which extracts file paths after the MEDIA: prefix
- Voice flag: `[[audio_as_voice]]` global directive sets `is_voice=True` for all media in message
- Returns `List[Tuple[str, bool]]` (path, is_voice) and cleaned content
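The extraction described above can be approximated in a few lines. This is a sketch of the behaviour as documented here, not a copy of `base.py`; the whitespace cleanup in particular is an assumption.

```python
import re
from typing import List, Tuple

MEDIA_PATTERN = re.compile(r"MEDIA:(\S+)")
VOICE_DIRECTIVE = "[[audio_as_voice]]"


def extract_media(content: str) -> Tuple[List[Tuple[str, bool]], str]:
    """Return ([(path, is_voice), ...], cleaned_content)."""
    # The directive is global: it applies to every media item in the message.
    is_voice = VOICE_DIRECTIVE in content
    cleaned = content.replace(VOICE_DIRECTIVE, "")

    media = [(match.group(1), is_voice) for match in MEDIA_PATTERN.finditer(cleaned)]

    # Remove the tags so the user never sees raw MEDIA: paths.
    cleaned = MEDIA_PATTERN.sub("", cleaned)
    cleaned = re.sub(r"[ \t]+\n", "\n", cleaned).strip()
    return media, cleaned
```

Because the pattern is `\S+`, a path containing whitespace would be cut at the first space, so any cache directory used for delivery should produce space-free file names.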
_process_message_background() media routing (lines 752-786):
- After extracting MEDIA tags, routes by file extension:
  - `.ogg .opus .mp3 .wav .m4a` → `send_voice()`
  - `.mp4 .mov .avi .mkv .3gp` → `send_video()`
  - `.jpg .jpeg .png .webp .gif` → `send_image_file()`
  - Everything else → `send_document()`
- This routing already supports arbitrary files!
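The routing table above reduces to a lookup on the lowercased file extension. In the sketch below, `route_media` is a hypothetical helper that returns the name of the adapter method to call; the real dispatch lives inside `_process_message_background()`.

```python
from pathlib import Path

AUDIO_EXTS = {".ogg", ".opus", ".mp3", ".wav", ".m4a"}
VIDEO_EXTS = {".mp4", ".mov", ".avi", ".mkv", ".3gp"}
IMAGE_EXTS = {".jpg", ".jpeg", ".png", ".webp", ".gif"}


def route_media(path: str) -> str:
    """Pick the adapter send method for a file, based on its extension."""
    ext = Path(path).suffix.lower()
    if ext in AUDIO_EXTS:
        return "send_voice"
    if ext in VIDEO_EXTS:
        return "send_video"
    if ext in IMAGE_EXTS:
        return "send_image_file"
    return "send_document"  # fallback for every other file type
```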
send_* method inventory (base class):
- `send(chat_id, content, reply_to, metadata)`: ABSTRACT, text
- `send_image(chat_id, image_url, caption, reply_to)`: URL-based images
- `send_animation(chat_id, animation_url, caption, reply_to)`: GIF animations
- `send_voice(chat_id, audio_path, caption, reply_to)`: voice messages
- `send_video(chat_id, video_path, caption, reply_to)`: video files
- `send_document(chat_id, file_path, caption, file_name, reply_to)`: generic files
- `send_image_file(chat_id, image_path, caption, reply_to)`: local image files
- `send_typing(chat_id)`: typing indicator
- `edit_message(chat_id, message_id, content)`: edit sent messages
What's missing:
- Telegram: No override for `send_document`, so it falls back to text! (`send_image_file` ✅ added)
- Discord: No override for `send_document`, so it falls back to text! (`send_image_file` ✅ added)
- Slack: No override for `send_document`, so it falls back to text! (`send_image_file` ✅ added)
- WhatsApp: Has `send_document` and `send_image_file` via bridge. COMPLETE.
- The base class defaults just send "📎 File: /path" as text, which is useless for actual file delivery.
7. gateway/platforms/telegram.py — Send Method Analysis
Implemented send methods:
- `send()`: MarkdownV2 text with fallback to plain
- `send_voice()`: `.ogg`/`.opus` as `send_voice()`, others as `send_audio()`
- `send_image()`: URL-based via `send_photo()`
- `send_image_file()`: local file via `send_photo(photo=open(path, 'rb'))` ✅
- `send_animation()`: GIF via `send_animation()`
- `send_typing()`: "typing" chat action
- `edit_message()`: edit text messages
MISSING:
- `send_document()` NOT overridden. Need to add `self._bot.send_document(chat_id, document=open(file_path, 'rb'), ...)`
- `send_video()` NOT overridden. Need to add `self._bot.send_video(...)`
8. gateway/platforms/discord.py — Send Method Analysis
Implemented send methods:
- `send()`: text messages with chunking
- `send_voice()`: discord.File attachment
- `send_image()`: downloads URL, creates discord.File attachment
- `send_image_file()`: local file via discord.File attachment ✅
- `send_typing()`: channel.typing()
- `edit_message()`: edit text messages
MISSING:
- `send_document()` NOT overridden. Need to add discord.File attachment
- `send_video()` NOT overridden. Need to add discord.File attachment
9. gateway/run.py — User File Attachment Handling
Current attachment flow:
- Telegram photos (line 509-529): Download via `photo.get_file()` → `cache_image_from_bytes()` → vision auto-analysis
- Telegram voice (line 532-541): Download → `cache_audio_from_bytes()` → STT transcription
- Telegram audio (line 542-551): Same pattern
- Telegram documents (line 553-617): Extension validation against `SUPPORTED_DOCUMENT_TYPES`, 20MB limit, content injection for text files
- Discord attachments (line 717-751): Content-type detection, image/audio caching, URL fallback for other types
- Gateway run.py (lines 818-883): Auto-analyzes images with vision, transcribes audio, enriches document messages with context notes
Key insight: Files are always cached to host filesystem first, then processed. The agent sees local file paths.
10. tools/terminal_tool.py — Terminal Tool & Environment Interaction
How it manages environments:
- Global dict `_active_environments: Dict[str, Any]` keyed by task_id
- Per-task creation locks prevent duplicate sandbox creation
- Auto-cleanup thread kills idle environments after `TERMINAL_LIFETIME_SECONDS`
- `_get_env_config()` reads all TERMINAL_* env vars for backend selection
- `_create_environment()` factory creates the right backend type
Could send_file piggyback?
- YES. send_file needs access to the same environment to extract files from sandboxes.
- It can reuse `_active_environments[task_id]` to get the environment, then:
  - Docker: Use `docker cp` via `env._container_id`
  - SSH: Use `scp` via `env.control_socket`
  - Local: Just read the file directly
  - Modal: Use base64-over-terminal via `env.execute()`
- The file_tools.py module already does this with `ShellFileOperations`: read_file/write_file/search/patch all share the same env instance.
11. tools/tts_tool.py — Working Example of File Delivery
Flow:
- Generate audio file to `~/.hermes/audio_cache/tts_TIMESTAMP.{ogg,mp3}`
- Return JSON with `media_tag: "MEDIA:/path/to/file"`
- For Telegram voice: prepend `[[audio_as_voice]]` directive
- The LLM includes the MEDIA tag in its response text
- `BasePlatformAdapter._process_message_background()` calls `extract_media()` to find the tag
- Routes by extension → `send_voice()` for audio files
- Platform adapter sends the file natively
Key pattern: Tool saves file to host → returns MEDIA: path → LLM echoes it → gateway extracts → platform delivers
12. tools/image_generation_tool.py — Working Example of Image Delivery
Flow:
- Call FAL.ai API → get image URL
- Return JSON with `image: "https://fal.media/..."` URL
- The LLM includes the URL in markdown: `![description](URL)`
- `BasePlatformAdapter.extract_images()` finds `![alt](url)` patterns
- Routes through `send_image()` (URL) or `send_animation()` (GIF)
- Platform downloads and sends natively
Key difference from TTS: Images are URL-based, not local files. The gateway downloads at send time.
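For comparison with the MEDIA: path, the markdown-image extraction can be sketched as follows. The regex, the return shape, and the `pick_image_sender` helper are assumptions for illustration; the actual `extract_images()` may differ.

```python
import re
from typing import List, Tuple

# Matches markdown images of the form ![alt](https://...)
IMAGE_PATTERN = re.compile(r"!\[([^\]]*)\]\((https?://[^\s)]+)\)")


def extract_images(content: str) -> Tuple[List[Tuple[str, str]], str]:
    """Return ([(url, alt_text), ...], content_without_image_markup)."""
    images = [(m.group(2), m.group(1)) for m in IMAGE_PATTERN.finditer(content)]
    cleaned = IMAGE_PATTERN.sub("", content).strip()
    return images, cleaned


def pick_image_sender(url: str) -> str:
    """GIFs go through send_animation(); every other image uses send_image()."""
    path = url.split("?", 1)[0].lower()  # ignore any query string
    return "send_animation" if path.endswith(".gif") else "send_image"
```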
INTEGRATION MAP: Where send_file Hooks In
Architecture Decision: MEDIA: Tag Protocol vs. New Tool
The MEDIA: tag protocol is already the established pattern for file delivery. Two options:
Option A: Pure MEDIA: Tag (Minimal Change)
- No new tool needed
- Agent downloads file from sandbox to host using terminal (base64)
- Saves to known location (e.g., `~/.hermes/file_cache/`)
- Includes `MEDIA:/path` in response text
- Existing routing in `_process_message_background()` handles delivery
- Problem: Agent has to manually do base64 dance + know about MEDIA: convention
Option B: Dedicated send_file Tool (Recommended)
- New tool that the agent calls with `(file_path, caption?)`
- Tool handles the sandbox → host extraction automatically
- Returns MEDIA: tag that gets routed through existing pipeline
- Much cleaner agent experience
Implementation Plan for Option B
Files to CREATE:
- `tools/send_file_tool.py`: The new tool
  - Accepts: `file_path` (path in sandbox), `caption` (optional)
  - Detects environment backend from `_active_environments`
  - Extracts file from sandbox:
    - local: `shutil.copy()` or direct path
    - docker: `docker cp {container_id}:{path} {local_cache}/`
    - ssh: `scp -o ControlPath=... {user}@{host}:{path} {local_cache}/`
    - modal: base64-over-terminal via `env.execute("base64 {path}")`
  - Saves to `~/.hermes/file_cache/{uuid}_{filename}`
  - Returns: `MEDIA:/cached/path` in response for gateway to pick up
  - Register with `registry.register(name="send_file", toolset="file", ...)`
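For the simplest backend (local), the cache-and-tag step of the planned tool might look like this. It is a sketch only: `cache_local_file` is a hypothetical name, the JSON shape follows the `media_tag` convention used by the TTS tool, and registry wiring is omitted.

```python
import json
import shutil
import uuid
from pathlib import Path


def cache_local_file(src: str, cache_dir: str = "~/.hermes/file_cache") -> str:
    """Copy `src` into the cache as {uuid}_{filename} and return a JSON tool result."""
    source = Path(src)
    if not source.is_file():
        return json.dumps({"success": False, "error": f"file not found: {src}"})

    cache = Path(cache_dir).expanduser()
    cache.mkdir(parents=True, exist_ok=True)

    # The uuid prefix keeps files with the same name from overwriting each other.
    dest = cache / f"{uuid.uuid4().hex}_{source.name}"
    shutil.copy(source, dest)
    return json.dumps({"success": True, "media_tag": f"MEDIA:{dest}"})
```

Since the gateway matches `MEDIA:\S+`, a file name containing spaces would need sanitizing before it is written into the tag.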
Files to MODIFY:
- `gateway/platforms/telegram.py`: Add missing send methods:

  ```python
  import os

  async def send_document(self, chat_id, file_path, caption=None, file_name=None, reply_to=None):
      with open(file_path, "rb") as f:
          msg = await self._bot.send_document(
              chat_id=int(chat_id), document=f, caption=caption,
              filename=file_name or os.path.basename(file_path))
      return SendResult(success=True, message_id=str(msg.message_id))

  async def send_image_file(self, chat_id, image_path, caption=None, reply_to=None):
      with open(image_path, "rb") as f:
          msg = await self._bot.send_photo(chat_id=int(chat_id), photo=f, caption=caption)
      return SendResult(success=True, message_id=str(msg.message_id))

  async def send_video(self, chat_id, video_path, caption=None, reply_to=None):
      with open(video_path, "rb") as f:
          msg = await self._bot.send_video(chat_id=int(chat_id), video=f, caption=caption)
      return SendResult(success=True, message_id=str(msg.message_id))
  ```

- `gateway/platforms/discord.py`: Add missing send methods:

  ```python
  import io
  import os

  import discord

  async def send_document(self, chat_id, file_path, caption=None, file_name=None, reply_to=None):
      channel = self._client.get_channel(int(chat_id)) or await self._client.fetch_channel(int(chat_id))
      with open(file_path, "rb") as f:
          file = discord.File(io.BytesIO(f.read()), filename=file_name or os.path.basename(file_path))
      msg = await channel.send(content=caption, file=file)
      return SendResult(success=True, message_id=str(msg.id))

  async def send_image_file(self, chat_id, image_path, caption=None, reply_to=None):
      # Same pattern as send_document, with the image filename
      return await self.send_document(chat_id, image_path, caption=caption, reply_to=reply_to)

  async def send_video(self, chat_id, video_path, caption=None, reply_to=None):
      # Same pattern; Discord renders video attachments inline
      return await self.send_document(chat_id, video_path, caption=caption, reply_to=reply_to)
  ```

- `toolsets.py`: Add `"send_file"` to `_HERMES_CORE_TOOLS` list
- `agent/prompt_builder.py`: Update platform hints to mention send_file tool
Code that can be REUSED (zero rewrite):
- `BasePlatformAdapter.extract_media()`: Already extracts MEDIA: tags
- `BasePlatformAdapter._process_message_background()`: Already routes by extension
- `ToolContext.download_file()`: Base64-over-terminal extraction pattern
- `tools/terminal_tool.py` `_active_environments` dict: Environment access
- `tools/registry.py`: Tool registration infrastructure
- `gateway/platforms/base.py` send_document/send_image_file/send_video signatures: Already defined
Code that needs to be WRITTEN from scratch:
- `tools/send_file_tool.py` (~150 lines):
  - File extraction from each environment backend type
  - Local file cache management
  - Registry registration
- Telegram `send_document` + `send_image_file` + `send_video` overrides (~40 lines)
- Discord `send_document` + `send_image_file` + `send_video` overrides (~50 lines)
Total effort: ~240 lines of new code, ~5 lines of config changes
Key Environment-Specific Extract Strategies
| Backend | Extract Method | Speed | Complexity |
|---|---|---|---|
| local | shutil.copy / direct path | Instant | None |
| docker | `docker cp container:path .` | Fast | Low |
| docker+vol | Direct host path access | Instant | None |
| ssh | `scp -o ControlPath=...` | Fast | Low |
| modal | base64-over-terminal | Moderate | Medium |
| singularity | Direct path (overlay mount) | Fast | Low |
Data Flow Summary
Agent calls send_file(file_path="/workspace/output.pdf", caption="Here's the report")
│
▼
send_file_tool.py:
1. Get environment from _active_environments[task_id]
2. Detect backend type (docker/ssh/modal/local)
3. Extract file to ~/.hermes/file_cache/{uuid}_{filename}
4. Return: '{"success": true, "media_tag": "MEDIA:/home/user/.hermes/file_cache/abc123_output.pdf"}'
│
▼
LLM includes MEDIA: tag in its response text
│
▼
BasePlatformAdapter._process_message_background():
1. extract_media(response) → finds MEDIA:/path
2. Checks extension: .pdf → send_document()
3. Calls platform-specific send_document(chat_id, file_path, caption)
│
▼
TelegramAdapter.send_document() / DiscordAdapter.send_document():
Opens file, sends via platform API as native document attachment
User receives downloadable file in chat