# Hermes Agent - Future Improvements
> Ideas for enhancing the agent's capabilities, generated from self-analysis of the codebase.

---

## 🚨 HIGH PRIORITY - Immediate Fixes

These items need to be addressed ASAP:
### 1. SUDO Breaking Terminal Tool 🔐 ✅ COMPLETE

- [x] **Problem:** SUDO commands break the terminal tool execution (hangs indefinitely)
- [x] **Fix:** Created custom environment wrappers in `tools/terminal_tool.py`
  - `stdin=subprocess.DEVNULL` prevents hanging on interactive prompts
  - Sudo fails gracefully with clear error if no password configured
  - Same UX as Claude Code - agent sees error, tells user to run it themselves
- [x] **All 5 environments now have consistent behavior:**
  - `_LocalEnvironment` - local execution
  - `_DockerEnvironment` - Docker containers
  - `_SingularityEnvironment` - Singularity/Apptainer containers
  - `_ModalEnvironment` - Modal cloud sandboxes
  - `_SSHEnvironment` - remote SSH execution
- [x] **Optional sudo support via `SUDO_PASSWORD` env var:**
  - Shared `_transform_sudo_command()` helper used by all environments
  - If set, auto-transforms `sudo cmd` → pipes password via `sudo -S`
  - Documented in `.env.example`, `cli-config.yaml`, and README
  - Works for chained commands: `cmd1 && sudo cmd2`
- [x] **Interactive sudo prompt in CLI mode:**
  - When sudo detected and no password configured, prompts user
  - 45-second timeout (auto-skips if no input)
  - Hidden password input via `getpass` (password not visible)
  - Password cached for session (don't ask repeatedly)
  - Spinner pauses during prompt for clean UX
  - Uses `HERMES_INTERACTIVE` env var to detect CLI mode
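The sudo transformation described above can be sketched in a few lines. This is an illustration only, not the actual helper in `tools/terminal_tool.py`; the function name and exact behavior are assumptions based on the notes above.

```python
import os
import shlex

def transform_sudo_command(cmd: str) -> str:
    """Sketch of the shared sudo helper (name and details are assumptions)."""
    password = os.environ.get("SUDO_PASSWORD")
    if not password or "sudo " not in cmd:
        # No password configured: leave the command alone so sudo fails
        # fast with a clear error instead of hanging on a prompt.
        return cmd
    quoted = shlex.quote(password)
    # `sudo -S` reads the password from stdin; a plain replace also covers
    # chained commands like `cmd1 && sudo cmd2`.
    return cmd.replace("sudo ", f"echo {quoted} | sudo -S ")
```

The same transform works per-environment because it rewrites the command string before it ever reaches the shell, whether local, Docker, or SSH.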

### 2. Fix `browser_get_images` Tool 🖼️ ✅ VERIFIED WORKING

- [x] **Tested:** Tool works correctly on multiple sites
- [x] **Results:** Successfully extracts image URLs, alt text, dimensions
- [x] **Note:** Some sites (Pixabay, etc.) have Cloudflare bot protection that blocks headless browsers - this is expected behavior, not a bug

### 3. Better Action Logging for Debugging 📝 ✅ COMPLETE

- [x] **Problem:** Need better logging of agent actions for debugging
- [x] **Implementation:**
  - Save full session trajectories to `logs/` directory as JSON
  - Each session gets a unique file: `session_YYYYMMDD_HHMMSS_UUID.json`
  - Logs all messages, tool calls with inputs/outputs, timestamps
  - Structured JSON format for easy parsing and replay
  - Automatic on CLI runs (configurable)
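The naming and write path above amount to something like the following sketch (the function name `save_trajectory` is invented here for illustration):

```python
import json
import uuid
from datetime import datetime
from pathlib import Path

def save_trajectory(messages, log_dir="logs"):
    """Sketch: persist one session's trajectory as structured JSON."""
    Path(log_dir).mkdir(parents=True, exist_ok=True)
    stamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    # session_YYYYMMDD_HHMMSS_UUID.json, per the naming scheme above
    path = Path(log_dir) / f"session_{stamp}_{uuid.uuid4().hex[:8]}.json"
    payload = {"created_at": datetime.now().isoformat(), "messages": messages}
    path.write_text(json.dumps(payload, indent=2))
    return path
```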

### 4. Automatic Context Compression 🗜️ ✅ COMPLETE

- [x] **Problem:** Long conversations exceed model context limits, causing errors
- [x] **Solution:** Auto-compress middle turns when approaching limit
- [x] **Implementation:**
  - Fetches model context lengths from OpenRouter `/api/v1/models` API (cached 1hr)
  - Tracks actual token usage from API responses (`usage.prompt_tokens`)
  - Triggers at 85% of model's context limit (configurable)
  - Protects first 3 turns (system, initial request, first response)
  - Protects last 4 turns (recent context most relevant)
  - Summarizes middle turns using fast model (Gemini Flash)
  - Inserts summary as user message, conversation continues seamlessly
  - If context error occurs, attempts compression before failing
- [x] **Configuration (cli-config.yaml / env vars):**
  - `CONTEXT_COMPRESSION_ENABLED` (default: true)
  - `CONTEXT_COMPRESSION_THRESHOLD` (default: 0.85 = 85%)
  - `CONTEXT_COMPRESSION_MODEL` (default: google/gemini-2.0-flash-001)
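The trigger and turn-selection logic above is simple to state precisely; this sketch uses invented function names, not the actual implementation:

```python
def should_compress(prompt_tokens: int, context_limit: int, threshold: float = 0.85) -> bool:
    # Trigger once reported usage crosses the configured fraction of the limit.
    return prompt_tokens >= threshold * context_limit

def compressible_slice(messages, protect_head: int = 3, protect_tail: int = 4):
    # Only the middle turns are summarized: the first 3 and last 4 stay verbatim.
    if len(messages) <= protect_head + protect_tail:
        return []
    return messages[protect_head:-protect_tail]
```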

### 5. Stream Thinking Summaries in Real-Time 💭 ⏸️ DEFERRED

- [ ] **Problem:** Thinking/reasoning summaries not shown while streaming
- [ ] **Complexity:** This is a significant refactor - leaving for later

**OpenRouter Streaming Info:**

- Uses `stream=True` with OpenAI SDK
- Reasoning comes in `choices[].delta.reasoning_details` chunks
- Types: `reasoning.summary`, `reasoning.text`, `reasoning.encrypted`
- Tool call arguments stream as partial JSON (need accumulation)
- Items paradigm: same ID emitted multiple times with updated content

**Key Challenges:**

- Tool call JSON accumulation (partial `{"query": "wea` → `{"query": "weather"}`)
- Multiple concurrent outputs (thinking + tool calls + text simultaneously)
- State management for partial responses
- Error handling if connection drops mid-stream
- Deciding when tool calls are "complete" enough to execute
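The JSON accumulation challenge has a simple core: buffer fragments and treat "parses cleanly" as the completeness signal. A minimal sketch:

```python
import json

class ToolCallAccumulator:
    """Sketch: buffer streamed argument fragments until they parse as JSON."""

    def __init__(self):
        self.buffer = ""

    def feed(self, fragment: str):
        self.buffer += fragment
        try:
            return json.loads(self.buffer)  # arguments are complete
        except json.JSONDecodeError:
            return None  # still partial; keep accumulating
```

One of these would be kept per tool-call ID, since the items paradigm re-emits the same ID with updated content.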

**UX Questions to Resolve:**

- Show raw thinking text or summarized?
- Live expanding text vs. spinner replacement?
- Markdown rendering while streaming?
- How to handle thinking + tool call display simultaneously?

**Implementation Options:**

- New `run_conversation_streaming()` method (keep non-streaming as fallback)
- Wrapper that handles streaming internally
- Big refactor of existing `run_conversation()`

**References:**

- https://openrouter.ai/docs/api/reference/streaming
- https://openrouter.ai/docs/guides/best-practices/reasoning-tokens#streaming-response

---

## 1. Subagent Architecture (Context Isolation) 🎯

**Problem:** Long-running tools (terminal commands, browser automation, complex file operations) consume massive context. A single `ls -la` can add hundreds of lines. Browser snapshots, debugging sessions, and iterative terminal work quickly bloat the main conversation, leaving less room for actual reasoning.

**Solution:** The main agent becomes an **orchestrator** that delegates context-heavy tasks to **subagents**.

**Architecture:**

```
┌─────────────────────────────────────────────────────────────────┐
│ ORCHESTRATOR (main agent)                                       │
│ - Receives user request                                         │
│ - Plans approach                                                │
│ - Delegates heavy tasks to subagents                            │
│ - Receives summarized results                                   │
│ - Maintains clean, focused context                              │
└─────────────────────────────────────────────────────────────────┘
         │                   │                   │
         ▼                   ▼                   ▼
┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐
│ TERMINAL AGENT  │ │ BROWSER AGENT   │ │ CODE AGENT      │
│ - terminal tool │ │ - browser tools │ │ - file tools    │
│ - file tools    │ │ - web_search    │ │ - terminal      │
│                 │ │ - web_extract   │ │                 │
│ Isolated context│ │ Isolated context│ │ Isolated context│
│ Returns summary │ │ Returns summary │ │ Returns summary │
└─────────────────┘ └─────────────────┘ └─────────────────┘
```

**How it works:**

1. User asks: "Set up a new Python project with FastAPI and tests"
2. Orchestrator plans: "I need to create files, install deps, write code"
3. Orchestrator calls: `terminal_task(goal="Create venv, install fastapi pytest", context="New project in ~/myapp")`
4. **Subagent spawns** with fresh context, only terminal/file tools
5. Subagent iterates (may take 10+ tool calls, lots of output)
6. Subagent completes → returns summary: "Created venv, installed fastapi==0.109.0, pytest==8.0.0"
7. Orchestrator receives **only the summary**, context stays clean
8. Orchestrator continues with next subtask
**Key tools to implement:**

- [ ] `terminal_task(goal, context, cwd?)` - Delegate terminal/shell work
- [ ] `browser_task(goal, context, start_url?)` - Delegate web research/automation
- [ ] `code_task(goal, context, files?)` - Delegate code writing/modification
- [ ] Generic `delegate_task(goal, context, toolsets=[])` - Flexible delegation

**Implementation details:**

- [ ] Subagent uses same `run_agent.py` but with:
  - Fresh/empty conversation history
  - Limited toolset (only what's needed)
  - Smaller max_iterations (focused task)
  - Task-specific system prompt
- [ ] Subagent returns structured result:

  ```python
  {
      "success": True,
      "summary": "Installed 3 packages, created 2 files",
      "details": "Optional longer explanation if needed",
      "artifacts": ["~/myapp/requirements.txt", "~/myapp/main.py"],  # Files created
      "errors": []  # Any issues encountered
  }
  ```

- [ ] Orchestrator sees only the summary in its context
- [ ] Full subagent transcript saved separately for debugging
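The spawn side of the implementation details above could look like this. Everything here is an assumption — `subagent_runner.py` does not exist yet, and all field names are invented for illustration:

```python
def make_subagent_request(goal, context, toolsets, max_iterations=15):
    """Sketch: build spawn parameters for a fresh, isolated subagent."""
    return {
        "system_prompt": (
            "You are a focused subagent. Complete the goal, then return "
            f"a short summary.\nGoal: {goal}\nContext: {context}"
        ),
        "toolsets": list(toolsets),        # limited to what's needed
        "max_iterations": max_iterations,  # focused task, small budget
        "conversation_history": [],        # fresh/empty context
    }
```

The orchestrator would pass this to the same `run_agent.py` entry point and keep only the returned summary in its own history.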

**Benefits:**

- 🧹 **Clean context** - Orchestrator stays focused, doesn't drown in tool output
- 📊 **Better token efficiency** - 50 terminal outputs → 1 summary paragraph
- 🎯 **Focused subagents** - Each agent has just the tools it needs
- 🔄 **Parallel potential** - Independent subtasks could run concurrently
- 🐛 **Easier debugging** - Each subtask has its own isolated transcript

**When to use subagents vs direct tools:**

- **Subagent**: Multi-step tasks, iteration likely, lots of output expected
- **Direct**: Quick one-off commands, simple file reads, user needs to see output

**Files to modify:** `run_agent.py` (add orchestration mode), new `tools/delegate_tools.py`, new `subagent_runner.py`

---

## 2. Planning & Task Management 📋

**Problem:** Agent handles tasks reactively without explicit planning. Complex multi-step tasks lack structure, progress tracking, and the ability to decompose work into manageable chunks.

**Ideas:**

- [ ] **Task decomposition tool** - Break complex requests into subtasks:

  ```
  User: "Set up a new Python project with FastAPI, tests, and Docker"

  Agent creates plan:
  ├── 1. Create project structure and requirements.txt
  ├── 2. Implement FastAPI app skeleton
  ├── 3. Add pytest configuration and initial tests
  ├── 4. Create Dockerfile and docker-compose.yml
  └── 5. Verify everything works together
  ```

  - Each subtask becomes a trackable unit
  - Agent can report progress: "Completed 3/5 tasks"

- [ ] **Progress checkpoints** - Periodic self-assessment:
  - After N tool calls or time elapsed, pause to evaluate
  - "What have I accomplished? What remains? Am I on track?"
  - Detect if stuck in loops or making no progress
  - Could trigger replanning if approach isn't working

- [ ] **Explicit plan storage** - Persist plan in conversation:
  - Store as structured data (not just in context)
  - Update status as tasks complete
  - User can ask "What's the plan?" or "What's left?"
  - Survives context compression (plans are protected)

- [ ] **Failure recovery with replanning** - When things go wrong:
  - Record what failed and why
  - Revise plan to work around the issue
  - "Step 3 failed because X, adjusting approach to Y"
  - Prevents repeating failed strategies
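The plan-storage idea above implies a small data structure; a sketch under the assumption that plans live as plain step/status records:

```python
class Plan:
    """Sketch of explicit plan storage (field names are assumptions)."""

    def __init__(self, tasks):
        self.tasks = [{"step": t, "done": False, "note": None} for t in tasks]

    def complete(self, index, note=None):
        self.tasks[index].update(done=True, note=note)

    def progress(self):
        done = sum(t["done"] for t in self.tasks)
        return f"Completed {done}/{len(self.tasks)} tasks"

    def remaining(self):
        return [t["step"] for t in self.tasks if not t["done"]]
```

Because the plan is structured data rather than raw conversation text, it can be exempted from context compression and re-rendered on demand when the user asks "What's left?".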

**Files to modify:** `run_agent.py` (add planning hooks), new `tools/planning_tool.py`

---

## 3. Tool Composition & Learning 🔧

**Problem:** Tools are atomic. Complex tasks require repeated manual orchestration of the same tool sequences.

**Ideas:**

- [ ] **Macro tools / Tool chains** - Define reusable tool sequences:

  ```yaml
  research_topic:
    description: "Deep research on a topic"
    steps:
      - web_search: {query: "$topic"}
      - web_extract: {urls: "$search_results.urls[:3]"}
      - summarize: {content: "$extracted"}
  ```

  - Could be defined in skills or a new `macros/` directory
  - Agent can invoke macro as single tool call
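A macro runner for the declaration style above could be quite small. This sketch resolves flat `$name` references against earlier step results; nested paths like `$search_results.urls[:3]` would need a real expression resolver and are out of scope here:

```python
def run_macro(steps, tools, variables):
    """Sketch: run a declared tool chain, resolving flat $name references
    against the initial variables and earlier step results."""
    results = dict(variables)
    for step in steps:
        (tool_name, args), = step.items()
        resolved = {
            key: results.get(val[1:], val)
            if isinstance(val, str) and val.startswith("$") else val
            for key, val in args.items()
        }
        results[tool_name] = tools[tool_name](**resolved)
    return results
```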

- [ ] **Tool failure patterns** - Learn from failures:
  - Track: tool, input pattern, error type, what worked instead
  - Before calling a tool, check: "Has this pattern failed before?"
  - Persistent across sessions (stored in skills or separate DB)

- [ ] **Parallel tool execution** - When tools are independent, run concurrently:
  - Detect independence (no data dependencies between calls)
  - Use `asyncio.gather()` for parallel execution
  - Already have async support in some tools, just need orchestration
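Once independence is established, the orchestration itself is a one-liner around `asyncio.gather()`:

```python
import asyncio

async def run_independent(calls):
    """Sketch: run independent async tool calls concurrently.

    `calls` is a list of (coroutine_function, kwargs) pairs with no
    data dependencies between them.
    """
    return await asyncio.gather(*(fn(**kwargs) for fn, kwargs in calls))
```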

**Files to modify:** `model_tools.py`, `toolsets.py`, new `tool_macros.py`

---

## 4. Dynamic Skills Expansion 📚

**Problem:** Skills system is elegant but static. Skills must be manually created and added.

**Ideas:**

- [ ] **Skill acquisition from successful tasks** - After completing a complex task:
  - "This approach worked well. Save as a skill?"
  - Extract: goal, steps taken, tools used, key decisions
  - Generate SKILL.md automatically
  - Store in user's skills directory

- [ ] **Skill templates** - Common patterns that can be parameterized:

  ```markdown
  # Debug {language} Error
  1. Reproduce the error
  2. Search for error message: `web_search("{error_message} {language}")`
  3. Check common causes: {common_causes}
  4. Apply fix and verify
  ```

- [ ] **Skill chaining** - Combine skills for complex workflows:
  - Skills can reference other skills as dependencies
  - "To do X, first apply skill Y, then skill Z"
  - Directed graph of skill dependencies

**Files to modify:** `tools/skills_tool.py`, `skills/` directory structure, new `skill_generator.py`

---
## 5. Interactive Clarifying Questions Tool ❓

**Problem:** Agent sometimes makes assumptions or guesses when it should ask the user. Currently can only ask via text, which gets lost in long outputs.

**Ideas:**

- [ ] **Multiple-choice prompt tool** - Let agent present structured choices to user:

  ```
  ask_user_choice(
      question="Should the language switcher enable only German or all languages?",
      choices=[
          "Only enable German - works immediately",
          "Enable all, mark untranslated - show fallback notice",
          "Let me specify something else"
      ]
  )
  ```

  - Renders as interactive terminal UI with arrow key / Tab navigation
  - User selects option, result returned to agent
  - Up to 4 choices + optional free-text option

- [ ] **Implementation:**
  - Use `inquirer` or `questionary` Python library for rich terminal prompts
  - Tool returns selected option text (or user's custom input)
  - **CLI-only** - only works when running via `cli.py` (not API/programmatic use)
  - Graceful fallback: if not in interactive mode, return error asking agent to rephrase as text
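The fallback behavior is the important contract; a minimal sketch using a plain numbered prompt (a real version might use `questionary` for arrow-key navigation), reusing the `HERMES_INTERACTIVE` convention from the sudo prompt work:

```python
import os

def ask_user_choice(question, choices):
    """Sketch: CLI-only structured prompt with a non-interactive fallback."""
    if os.environ.get("HERMES_INTERACTIVE") != "1":
        # Not in CLI mode: tell the agent to ask in plain text instead.
        return {"error": "Not running interactively - ask the question as plain text."}
    for i, choice in enumerate(choices, 1):
        print(f"  {i}. {choice}")
    picked = input(f"{question} [1-{len(choices)}]: ")
    return {"choice": choices[int(picked) - 1]}
```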

- [ ] **Use cases:**
  - Clarify ambiguous requirements before starting work
  - Confirm destructive operations with clear options
  - Let user choose between implementation approaches
  - Checkpoint complex multi-step workflows

**Files to modify:** New `tools/ask_user_tool.py`, `cli.py` (detect interactive mode), `model_tools.py`

---
## 6. Collaborative Problem Solving 🤝

**Problem:** Interaction is command/response. Complex problems benefit from dialogue.

**Ideas:**

- [ ] **Assumption surfacing** - Make implicit assumptions explicit:
  - "I'm assuming you want Python 3.11+. Correct?"
  - "This solution assumes you have sudo access..."
  - Let user correct before going down wrong path

- [ ] **Checkpoint & confirm** - For high-stakes operations:
  - "About to delete 47 files. Here's the list - proceed?"
  - "This will modify your database. Want a backup first?"
  - Configurable threshold for when to ask

**Files to modify:** `run_agent.py`, system prompt configuration

---

## 7. Project-Local Context 💾

**Problem:** Valuable context lost between sessions.

**Ideas:**

- [ ] **Project awareness** - Remember project-specific context:
  - Store `.hermes/context.md` in project directory
  - "This is a Django project using PostgreSQL"
  - Coding style preferences, deployment setup, etc.
  - Load automatically when working in that directory

- [ ] **Handoff notes** - Leave notes for future sessions:
  - Write to `.hermes/notes.md` in project
  - "TODO for next session: finish implementing X"
  - "Known issues: Y doesn't work on Windows"

**Files to modify:** New `project_context.py`, auto-load in `run_agent.py`

---
## 8. Graceful Degradation & Robustness 🛡️

**Problem:** When things go wrong, recovery is limited. Should fail gracefully.

**Ideas:**

- [ ] **Fallback chains** - When primary approach fails, have backups:
  - `web_extract` fails → try `browser_navigate` → try `web_search` for cached version
  - Define fallback order per tool type
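A fallback chain is just an ordered try-each loop; a sketch of what the hypothetical `fallback_manager.py` could centralize:

```python
def with_fallbacks(steps):
    """Sketch: try each (name, thunk) in declared order, return first success."""
    failures = []
    for name, thunk in steps:
        try:
            return {"via": name, "result": thunk()}
        except Exception as exc:  # any tool error triggers the next fallback
            failures.append(f"{name}: {exc}")
    return {"error": "all fallbacks failed", "attempts": failures}
```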

- [ ] **Partial progress preservation** - Don't lose work on failure:
  - Long task fails midway → save what we've got
  - "I completed 3/5 steps before the error. Here's what I have..."

- [ ] **Self-healing** - Detect and recover from bad states:
  - Browser stuck → close and retry
  - Terminal hung → timeout and reset

**Files to modify:** `model_tools.py`, tool implementations, new `fallback_manager.py`

---
## 9. Tools & Skills Wishlist 🧰
|
|
|
|
*Things that would need new tool implementations (can't do well with current tools):*
|
|
|
|
### High-Impact
|
|
|
|
- [ ] **Audio/Video Transcription** 🎬 *(See also: Section 16 for detailed spec)*
|
|
- Transcribe audio files, podcasts, YouTube videos
|
|
- Extract key moments from video
|
|
- Voice memo transcription for messaging integrations
|
|
- *Provider options: Whisper API, Deepgram, local Whisper*
|
|
|
|
- [ ] **Diagram Rendering** 📊
|
|
- Render Mermaid/PlantUML to actual images
|
|
- Can generate the code, but rendering requires external service or tool
|
|
- "Show me how these components connect" → actual visual diagram
|
|
|
|
### Medium-Impact
|
|
|
|
- [ ] **Canvas / Visual Workspace** 🖼️
|
|
- Agent-controlled visual panel for rendering interactive UI
|
|
- Inspired by OpenClaw's Canvas feature
|
|
- **Capabilities:**
|
|
- `present` / `hide` - Show/hide the canvas panel
|
|
- `navigate` - Load HTML files or URLs into the canvas
|
|
- `eval` - Execute JavaScript in the canvas context
|
|
- `snapshot` - Capture the rendered UI as an image
|
|
- **Use cases:**
|
|
- Display generated HTML/CSS/JS previews
|
|
- Show interactive data visualizations (charts, graphs)
|
|
- Render diagrams (Mermaid → rendered output)
|
|
- Present structured information in rich format
|
|
- A2UI-style component system for structured agent UI
|
|
- **Implementation options:**
|
|
- Electron-based panel for CLI
|
|
- WebSocket-connected web app
|
|
- VS Code webview extension
|
|
- *Would let agent "show" things rather than just describe them*
|
|
|
|
- [ ] **Document Generation** 📄
|
|
- Create styled PDFs, Word docs, presentations
|
|
- *Can do basic PDF via terminal tools, but limited*
|
|
|
|
- [ ] **Diff/Patch Tool** 📝
|
|
- Surgical code modifications with preview
|
|
- "Change line 45-50 to X" without rewriting whole file
|
|
- Show diffs before applying
|
|
- *Can use `diff`/`patch` but a native tool would be safer*
|
|
|
|
### Skills to Create
|
|
|
|
- [ ] **Domain-specific skill packs:**
|
|
- DevOps/Infrastructure (Terraform, K8s, AWS)
|
|
- Data Science workflows (EDA, model training)
|
|
- Security/pentesting procedures
|
|
|
|
- [ ] **Framework-specific skills:**
|
|
- React/Vue/Angular patterns
|
|
- Django/Rails/Express conventions
|
|
- Database optimization playbooks
|
|
|
|
- [ ] **Troubleshooting flowcharts:**
|
|
- "Docker container won't start" → decision tree
|
|
- "Production is slow" → systematic diagnosis
|
|
|
|
---
|
|
|
|
## 10. Messaging Platform Integrations 💬 ✅ COMPLETE

**Problem:** Agent currently only works via `cli.py` which requires direct terminal access. Users may want to interact via messaging apps from their phone or other devices.

**Architecture:**

- `run_agent.py` already accepts `conversation_history` parameter and returns updated messages ✅
- Need: persistent session storage, platform monitors, session key resolution

**Implementation approach:**

```
┌─────────────────────────────────────────────────────────────┐
│ Platform Monitor (e.g., telegram_monitor.py)                │
│ ├─ Long-running daemon connecting to messaging platform     │
│ ├─ On message: resolve session key → load history from disk │
│ ├─ Call run_agent.py with loaded history                    │
│ ├─ Save updated history back to disk (JSONL)                │
│ └─ Send response back to platform                           │
└─────────────────────────────────────────────────────────────┘
```

**Platform support (each user sets up their own credentials):**

- [x] **Telegram** - via `python-telegram-bot`
  - Bot token from @BotFather
  - Easiest to set up, good for personal use
- [x] **Discord** - via `discord.py`
  - Bot token from Discord Developer Portal
  - Can work in servers (group sessions) or DMs
- [x] **WhatsApp** - via Node.js bridge (whatsapp-web.js/baileys)
  - Requires Node.js bridge setup
  - More complex, but reaches most people

**Session management:**

- [x] **Session store** - JSONL persistence per session key
  - `~/.hermes/sessions/{session_id}.jsonl`
  - Session keys: `agent:main:telegram:dm`, `agent:main:discord:group:123`, etc.
- [x] **Session expiry** - Configurable reset policies
  - Daily reset (default 4am) OR idle timeout (default 2 hours)
  - Manual reset via `/reset` or `/new` command in chat
  - Per-platform and per-type overrides
- [x] **Session continuity** - Conversations persist across messages until reset
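The JSONL session store described above reduces to append/load on a per-key file. A sketch of the shape (the real `gateway/session.py` may differ):

```python
import json
from pathlib import Path

class SessionStore:
    """Sketch of the JSONL session store: one file per session key."""

    def __init__(self, root="~/.hermes/sessions"):
        self.root = Path(root).expanduser()
        self.root.mkdir(parents=True, exist_ok=True)

    def _path(self, session_key: str) -> Path:
        # Session keys contain ':'; flatten them to a safe filename.
        return self.root / (session_key.replace(":", "_") + ".jsonl")

    def append(self, session_key, message):
        with self._path(session_key).open("a") as f:
            f.write(json.dumps(message) + "\n")

    def load(self, session_key):
        path = self._path(session_key)
        if not path.exists():
            return []  # fresh session
        return [json.loads(line) for line in path.read_text().splitlines()]
```

Append-only JSONL keeps writes cheap per incoming message; a reset policy just deletes or rotates the file.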

**Files created:** `gateway/`, `gateway/platforms/`, `gateway/config.py`, `gateway/session.py`, `gateway/delivery.py`, `gateway/run.py`

**Configuration:**

- Environment variables: `TELEGRAM_BOT_TOKEN`, `DISCORD_BOT_TOKEN`, etc.
- Config file: `~/.hermes/gateway.json`
- CLI commands: `/platforms` to check status, `--gateway` to start

**Dynamic context injection:**

- Agent knows its source platform and chat
- Agent knows connected platforms and home channels
- Agent can deliver cron outputs to specific platforms

---

## 11. Scheduled Tasks / Cron Jobs ⏰ ✅ COMPLETE

**Problem:** Agent only runs on-demand. Some tasks benefit from scheduled execution (daily summaries, monitoring, reminders).

**Solution Implemented:**

- [x] **Cron-style scheduler** - Run agent turns on a schedule
  - Jobs stored in `~/.hermes/cron/jobs.json`
  - Each job: `{ id, name, prompt, schedule, repeat, enabled, next_run_at, ... }`
  - Built-in scheduler daemon or system cron integration

- [x] **Schedule formats:**
  - Duration: `30m`, `2h`, `1d` (one-shot delay)
  - Interval: `every 30m`, `every 2h` (recurring)
  - Cron expression: `0 9 * * *` (requires `croniter` package)
  - ISO timestamp: `2026-02-03T14:00:00` (one-shot at specific time)
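The duration and interval formats parse with a few lines; this sketch (function name invented here) covers only those two — cron expressions and ISO timestamps are delegated to `croniter` and `datetime.fromisoformat` in the real scheduler:

```python
from datetime import timedelta

def parse_schedule(spec: str):
    """Sketch: parse '30m'/'2h'/'1d' one-shot delays and 'every Nx' intervals."""
    units = {"m": "minutes", "h": "hours", "d": "days"}
    recurring = spec.startswith("every ")
    body = spec[len("every "):] if recurring else spec
    if body and body[-1] in units and body[:-1].isdigit():
        delta = timedelta(**{units[body[-1]]: int(body[:-1])})
        return ("interval" if recurring else "one-shot", delta)
    raise ValueError(f"unrecognized schedule: {spec!r}")
```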

- [x] **Repeat options:**
  - `repeat=None` (or omit): One-shot schedules run once; intervals/cron run forever
  - `repeat=1`: Run once then auto-delete
  - `repeat=N`: Run exactly N times then auto-delete

- [x] **CLI interface:**

  ```bash
  # List scheduled jobs
  /cron
  /cron list

  # Add a one-shot job (runs once in 30 minutes)
  /cron add 30m "Remind me to check the build status"

  # Add a recurring job (every 2 hours)
  /cron add "every 2h" "Check server status at 192.168.1.100"

  # Add a cron expression (daily at 9am)
  /cron add "0 9 * * *" "Generate morning briefing"

  # Remove a job
  /cron remove <job_id>
  ```

- [x] **Agent self-scheduling tools** (hermes-cli toolset):
  - `schedule_cronjob(prompt, schedule, name?, repeat?)` - Create a scheduled task
  - `list_cronjobs()` - View all scheduled jobs
  - `remove_cronjob(job_id)` - Cancel a job
  - Tool descriptions emphasize: **cronjobs run in isolated sessions with NO context**

- [x] **Daemon modes:**

  ```bash
  # Built-in daemon (checks every 60 seconds)
  python cli.py --cron-daemon

  # Single tick for system cron integration
  python cli.py --cron-tick-once
  ```

- [x] **Output storage:** `~/.hermes/cron/output/{job_id}/{timestamp}.md`

**Files created:** `cron/__init__.py`, `cron/jobs.py`, `cron/scheduler.py`, `tools/cronjob_tools.py`

**Toolset:** `hermes-cli` (default for CLI) includes cronjob tools; not in batch runner toolsets

---

## 12. Text-to-Speech (TTS) 🔊

**Problem:** Agent can only respond with text. Some users prefer audio responses (accessibility, hands-free use, podcasts).

**Ideas:**

- [ ] **TTS tool** - Generate audio files from text

  ```python
  tts_generate(text="Here's your summary...", voice="nova", output="summary.mp3")
  ```

  - Returns path to generated audio file
  - For messaging integrations: can send as voice message

- [ ] **Provider options:**
  - Edge TTS (free, good quality, many voices)
  - OpenAI TTS (paid, excellent quality)
  - ElevenLabs (paid, best quality, voice cloning)
  - Local options (Coqui TTS, Bark)

- [ ] **Modes:**
  - On-demand: User explicitly asks "read this to me"
  - Auto-TTS: Configurable to always generate audio for responses
  - Long-text handling: Summarize or chunk very long responses

- [ ] **Integration with messaging:**
  - When enabled, can send voice notes instead of/alongside text
  - User preference per channel

**Files to create:** `tools/tts_tool.py`, config in `cli-config.yaml`

---

## 13. Speech-to-Text / Audio Transcription 🎤

**Problem:** Users may want to send voice memos instead of typing. Agent is blind to audio content.

**Ideas:**

- [ ] **Voice memo transcription** - For messaging integrations
  - User sends voice message → transcribe → process as text
  - Seamless: user speaks, agent responds

- [ ] **Audio/video file transcription** - Existing idea, expanded:
  - Transcribe local audio files (mp3, wav, m4a)
  - Transcribe YouTube videos (download audio → transcribe)
  - Extract key moments with timestamps

- [ ] **Provider options:**
  - OpenAI Whisper API (good quality, cheap)
  - Deepgram (fast, good for real-time)
  - Local Whisper (free, runs on GPU)
  - Groq Whisper (fast, free tier available)

- [ ] **Tool interface:**

  ```python
  transcribe(source="audio.mp3")                  # Local file
  transcribe(source="https://youtube.com/...")    # YouTube
  transcribe(source="voice_message", data=bytes)  # Voice memo
  ```

**Files to create:** `tools/transcribe_tool.py`, integrate with messaging monitors

---

## Priority Order (Suggested)

1. **🎯 Subagent Architecture** - Critical for context management, enables everything else
2. **Memory & Context Management** - Complements subagents for remaining context
3. **Self-Reflection** - Improves reliability and reduces wasted tool calls
4. **Project-Local Context** - Practical win, keeps useful info across sessions
5. **Messaging Integrations** - Unlocks mobile access, new interaction patterns
6. **Scheduled Tasks / Cron Jobs** - Enables automation, reminders, monitoring
7. **Tool Composition** - Quality of life, builds on other improvements
8. **Dynamic Skills** - Force multiplier for repeated tasks
9. **Interactive Clarifying Questions** - Better UX for ambiguous tasks
10. **TTS / Audio Transcription** - Accessibility, hands-free use

---

## Removed Items (Unrealistic)

The following were removed because they're architecturally impossible:

- ~~Proactive suggestions / Prefetching~~ - Agent only runs on user request, can't interject
- ~~Clipboard integration~~ - No access to user's local system clipboard

The following **moved to active TODO** (now possible with new architecture):

- ~~Session save/restore~~ → See **Messaging Integrations** (session persistence)
- ~~Voice/TTS playback~~ → See **TTS** (can generate audio files, send via messaging)
- ~~Set reminders~~ → See **Scheduled Tasks / Cron Jobs**

The following were removed because they're **already possible**:

- ~~HTTP/API Client~~ → Use `curl` or Python `requests` in terminal
- ~~Structured Data Manipulation~~ → Use `pandas` in terminal
- ~~Git-Native Operations~~ → Use `git` CLI in terminal
- ~~Symbolic Math~~ → Use `SymPy` in terminal
- ~~Code Quality Tools~~ → Run linters (`eslint`, `black`, `mypy`) in terminal
- ~~Testing Framework~~ → Run `pytest`, `jest`, etc. in terminal
- ~~Translation~~ → LLM handles this fine, or use translation APIs

---

## 🧪 Brainstorm Ideas (Not Yet Fleshed Out)

*These are early-stage ideas that need more thinking before implementation. Captured here so they don't get lost.*

### Remote/Distributed Execution 🌐

**Concept:** Run agent on a powerful remote server while interacting from a thin client.

**Why interesting:**

- Run on beefy GPU server for local LLM inference
- Agent has access to remote machine's resources (files, tools, internet)
- User interacts via lightweight client (phone, low-power laptop)

**Open questions:**

- How does this differ from just SSH + running cli.py on remote?
- Would need secure communication channel (WebSocket? gRPC?)
- How to handle tool outputs that reference remote paths?
- Credential management for remote execution
- Latency considerations for interactive use

**Possible architecture:**

```
┌─────────────┐           ┌─────────────────────────┐
│ Thin Client │ ◄─────►   │ Remote Hermes Server    │
│ (phone/web) │  WS/API   │ - Full agent + tools    │
└─────────────┘           │ - GPU for local LLM     │
                          │ - Access to server files│
                          └─────────────────────────┘
```

**Related to:** Messaging integrations (could be the "server" that monitors receive from)

---

### Multi-Agent Parallel Execution 🤖🤖

**Concept:** Extension of Subagent Architecture (Section 1) - run multiple subagents in parallel.

**Why interesting:**

- Independent subtasks don't need to wait for each other
- "Research X while setting up Y" - both run simultaneously
- Faster completion for complex multi-part tasks

**Open questions:**

- How to detect which tasks are truly independent?
- Resource management (API rate limits, concurrent connections)
- How to merge results when parallel tasks have conflicts?
- Cost implications of multiple parallel LLM calls

*Note: Basic subagent delegation (Section 1) should be implemented first; parallel execution is an optimization on top.*

---

### Plugin/Extension System 🔌

**Concept:** Allow users to add custom tools/skills without modifying core code.

**Why interesting:**

- Community contributions
- Organization-specific tools
- Clean separation of core vs. extensions

**Open questions:**

- Security implications of loading arbitrary code
- Versioning and compatibility
- Discovery and installation UX

---

*Last updated: $(date +%Y-%m-%d)* 🤖