Revise TODO.md to introduce Subagent Architecture and Interactive Clarifying Questions Tool

- Updated the structure of the TODO list, renaming and expanding the "Context Management" section to "Subagent Architecture" with detailed problem and solution descriptions.
- Added a new section for "Interactive Clarifying Questions Tool," outlining the problem of agent assumptions and proposing a multiple-choice prompt tool for user interaction.
- Included implementation details and benefits for both features, enhancing clarity and direction for future development.
2026-02-01 02:02:32 -08:00
parent 9c8d707530
commit 3db83b6824
1 changed files with 362 additions and 19 deletions
--- a/TODO.md
+++ b/TODO.md
@@ -39,7 +39,85 @@ These items need to be addressed ASAP:

 ---

-## 1. Context Management
+## 1. Subagent Architecture (Context Isolation) 🎯
+
+**Problem:** Long-running tools (terminal commands, browser automation, complex file operations) consume massive context. A single `ls -la` can add hundreds of lines. Browser snapshots, debugging sessions, and iterative terminal work quickly bloat the main conversation, leaving less room for actual reasoning.
+
+**Solution:** The main agent becomes an **orchestrator** that delegates context-heavy tasks to **subagents**.
+
+**Architecture:**
+```
+┌─────────────────────────────────────────────────────────────────┐
+│  ORCHESTRATOR (main agent)                                      │
+│  - Receives user request                                        │
+│  - Plans approach                                               │
+│  - Delegates heavy tasks to subagents                           │
+│  - Receives summarized results                                  │
+│  - Maintains clean, focused context                             │
+└─────────────────────────────────────────────────────────────────┘
+         │                    │                    │
+         ▼                    ▼                    ▼
+┌─────────────────┐  ┌─────────────────┐  ┌─────────────────┐
+│ TERMINAL AGENT  │  │ BROWSER AGENT   │  │ CODE AGENT      │
+│ - terminal tool │  │ - browser tools │  │ - file tools    │
+│ - file tools    │  │ - web_search    │  │ - terminal      │
+│                 │  │ - web_extract   │  │                 │
+│ Isolated context│  │ Isolated context│  │ Isolated context│
+│ Returns summary │  │ Returns summary │  │ Returns summary │
+└─────────────────┘  └─────────────────┘  └─────────────────┘
+```
+
+**How it works:**
+1. User asks: "Set up a new Python project with FastAPI and tests"
+2. Orchestrator plans: "I need to create files, install deps, write code"
+3. Orchestrator calls: `terminal_task(goal="Create venv, install fastapi pytest", context="New project in ~/myapp")`
+4. **Subagent spawns** with fresh context, only terminal/file tools
+5. Subagent iterates (may take 10+ tool calls, lots of output)
+6. Subagent completes → returns summary: "Created venv, installed fastapi==0.109.0, pytest==8.0.0"
+7. Orchestrator receives **only the summary**, context stays clean
+8. Orchestrator continues with next subtask
+
+**Key tools to implement:**
+- [ ] `terminal_task(goal, context, cwd?)` - Delegate terminal/shell work
+- [ ] `browser_task(goal, context, start_url?)` - Delegate web research/automation  
+- [ ] `code_task(goal, context, files?)` - Delegate code writing/modification
+- [ ] Generic `delegate_task(goal, context, toolsets=[])` - Flexible delegation
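One way the specialised tools could relate to the generic one is as thin wrappers that preset a toolset. The sketch below is only an illustration under stated assumptions: the toolset names are guessed from the architecture diagram, and the injected `runner` callable (standing in for a fresh `run_agent.py` invocation) is hypothetical and not part of the proposed signatures.

```python
from typing import Callable, List, Optional

# Hypothetical toolset names, loosely following the architecture diagram.
TERMINAL_TOOLSETS = ["terminal", "file"]
BROWSER_TOOLSETS = ["browser", "web_search", "web_extract"]

# Assumed runner shape: (system_prompt, user_message, toolsets) -> summary text.
Runner = Callable[[str, str, List[str]], str]


def delegate_task(goal: str, context: str, toolsets: List[str], runner: Runner) -> str:
    """Run one subtask with a fresh prompt and return only the runner's summary."""
    system_prompt = (
        "You are a focused subagent. Allowed toolsets: " + ", ".join(toolsets) + "."
    )
    user_message = f"Goal: {goal}\nContext: {context}"
    # Pass a copy so the runner cannot mutate the shared preset list.
    return runner(system_prompt, user_message, list(toolsets))


def terminal_task(goal: str, context: str, runner: Runner, cwd: Optional[str] = None) -> str:
    """Delegate shell work; the optional cwd is folded into the context text."""
    if cwd:
        context = f"{context}\nWorking directory: {cwd}"
    return delegate_task(goal, context, TERMINAL_TOOLSETS, runner)


def browser_task(goal: str, context: str, runner: Runner, start_url: Optional[str] = None) -> str:
    """Delegate web work; the optional start_url is folded into the context text."""
    if start_url:
        context = f"{context}\nStart URL: {start_url}"
    return delegate_task(goal, context, BROWSER_TOOLSETS, runner)
```

Keeping the wrappers this thin means a new agent type only needs a new toolset preset, not a new code path.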
+
+**Implementation details:**
+- [ ] Subagent uses same `run_agent.py` but with:
+  - Fresh/empty conversation history
+  - Limited toolset (only what's needed)
+  - Smaller max_iterations (focused task)
+  - Task-specific system prompt
+- [ ] Subagent returns structured result:
+  ```python
+  {
+    "success": True,
+    "summary": "Installed 3 packages, created 2 files",
+    "details": "Optional longer explanation if needed",
+    "artifacts": ["~/myapp/requirements.txt", "~/myapp/main.py"],  # Files created
+    "errors": []  # Any issues encountered
+  }
+  ```
+- [ ] Orchestrator sees only the summary in its context
+- [ ] Full subagent transcript saved separately for debugging
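The structured result and the "summary only" rule above could be modelled roughly as follows. This is a minimal sketch, not the planned implementation: the class name `SubagentResult`, the method `to_orchestrator_message`, the status prefix, and the `save_transcript` helper are all hypothetical, and the transcript format (one JSON object per line) is an assumption.

```python
import json
from dataclasses import dataclass, field
from pathlib import Path
from typing import List


@dataclass
class SubagentResult:
    """Mirrors the fields of the structured result shown above."""
    success: bool
    summary: str
    details: str = ""
    artifacts: List[str] = field(default_factory=list)
    errors: List[str] = field(default_factory=list)

    def to_orchestrator_message(self) -> str:
        """Return the only text the orchestrator's context should receive."""
        status = "OK" if self.success else "FAILED"
        text = f"[subagent {status}] {self.summary}"
        if self.errors:
            text += " Errors: " + "; ".join(self.errors)
        return text


def save_transcript(messages: List[dict], path: Path) -> None:
    """Write the full subagent transcript to disk, one JSON message per line."""
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", encoding="utf-8") as handle:
        for message in messages:
            handle.write(json.dumps(message) + "\n")
```

The `details` and `artifacts` fields stay available on the object for debugging, but are deliberately left out of the message that re-enters the main conversation.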
+
+**Benefits:**
+- 🧹 **Clean context** - Orchestrator stays focused, doesn't drown in tool output
+- 📊 **Better token efficiency** - 50 terminal outputs → 1 summary paragraph
+- 🎯 **Focused subagents** - Each agent has just the tools it needs
+- 🔄 **Parallel potential** - Independent subtasks could run concurrently
+- 🐛 **Easier debugging** - Each subtask has its own isolated transcript
+
+**When to use subagents vs direct tools:**
+- **Subagent**: Multi-step tasks, iteration likely, lots of output expected
+- **Direct**: Quick one-off commands, simple file reads, user needs to see output
+
+**Files to modify:** `run_agent.py` (add orchestration mode), new `tools/delegate_tools.py`, new `subagent_runner.py`
+
+---
+
+## 2. Context Management (complements Subagents)

 **Problem:** Context grows unbounded during long conversations. Trajectory compression exists for training data post-hoc, but live conversations lack intelligent context management.

@@ -162,7 +240,43 @@ These items need to be addressed ASAP:

 ---

-## 6. Uncertainty & Honesty Calibration 🎚️
+## 6. Interactive Clarifying Questions Tool ❓
+
+**Problem:** The agent sometimes makes assumptions or guesses when it should ask the user. It can currently only ask in plain text, where the question gets lost in long outputs.
+
+**Ideas:**
+- [ ] **Multiple-choice prompt tool** - Let agent present structured choices to user:
+  ```python
+  ask_user_choice(
+    question="Should the language switcher enable only German or all languages?",
+    choices=[
+      "Only enable German - works immediately",
+      "Enable all, mark untranslated - show fallback notice",
+      "Let me specify something else"
+    ]
+  )
+  ```
+  - Renders as interactive terminal UI with arrow key / Tab navigation
+  - User selects option, result returned to agent
+  - Up to 4 choices + optional free-text option
+  
+- [ ] **Implementation:**
+  - Use `inquirer` or `questionary` Python library for rich terminal prompts
+  - Tool returns selected option text (or user's custom input)
+  - **CLI-only** - only works when running via `cli.py` (not API/programmatic use)
+  - Graceful fallback: if not in interactive mode, return error asking agent to rephrase as text
+  
+- [ ] **Use cases:**
+  - Clarify ambiguous requirements before starting work
+  - Confirm destructive operations with clear options
+  - Let user choose between implementation approaches
+  - Checkpoint complex multi-step workflows
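The graceful-fallback behaviour can be sketched without committing to a prompt library. The version below swaps `inquirer`/`questionary` for a plain numbered menu built on `input()`, purely so the example stays standard-library only; the return shape (`ok`, `answer`, `custom`, `error`) and the `interactive` / `input_fn` parameters are assumptions added for illustration and testability.

```python
import sys
from typing import Callable, List, Optional

MAX_CHOICES = 4  # "Up to 4 choices" from the idea above


def ask_user_choice(
    question: str,
    choices: List[str],
    interactive: Optional[bool] = None,
    input_fn: Callable[[str], str] = input,
    allow_free_text: bool = True,
) -> dict:
    """Present numbered choices; fall back to an error when not interactive."""
    if interactive is None:
        # Assumption: a TTY on stdin is a good enough signal for "running via cli.py".
        interactive = sys.stdin.isatty()
    if not interactive:
        return {"ok": False,
                "error": "Not in an interactive session; ask the question as plain text instead."}
    if not 1 <= len(choices) <= MAX_CHOICES:
        return {"ok": False, "error": f"Provide between 1 and {MAX_CHOICES} choices."}

    print(question)
    for number, choice in enumerate(choices, start=1):
        print(f"  {number}. {choice}")
    if allow_free_text:
        print("  Or type your own answer.")

    reply = input_fn("> ").strip()
    if reply.isdecimal() and 1 <= int(reply) <= len(choices):
        return {"ok": True, "answer": choices[int(reply) - 1], "custom": False}
    if allow_free_text and reply:
        return {"ok": True, "answer": reply, "custom": True}
    return {"ok": False, "error": "No valid selection was made."}
```

Returning an error dictionary rather than raising lets the agent read the failure as an ordinary tool result and rephrase the question as text.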
+
+**Files to modify:** New `tools/ask_user_tool.py`, `cli.py` (detect interactive mode), `model_tools.py`
+
+---
+
+## 7. Uncertainty & Honesty Calibration 🎚️

 **Problem:** Sometimes confidently wrong. Should be better calibrated about what I know vs. don't know.

@@ -179,7 +293,7 @@ These items need to be addressed ASAP:

 ---

-## 7. Resource Awareness & Efficiency 💰
+## 8. Resource Awareness & Efficiency 💰

 **Problem:** No awareness of costs, time, or resource usage. Could be smarter about efficiency.

@@ -197,7 +311,7 @@ These items need to be addressed ASAP:

 ---

-## 8. Collaborative Problem Solving 🤝
+## 9. Collaborative Problem Solving 🤝

 **Problem:** Interaction is command/response. Complex problems benefit from dialogue.

@@ -216,7 +330,7 @@ These items need to be addressed ASAP:

 ---

-## 9. Project-Local Context 💾
+## 10. Project-Local Context 💾

 **Problem:** Valuable context lost between sessions.

@@ -236,7 +350,7 @@ These items need to be addressed ASAP:

 ---

-## 10. Graceful Degradation & Robustness 🛡️
+## 11. Graceful Degradation & Robustness 🛡️

 **Problem:** When things go wrong, recovery is limited. Should fail gracefully.

@@ -257,17 +371,17 @@ These items need to be addressed ASAP:

 ---

-## 11. Tools & Skills Wishlist 🧰
+## 12. Tools & Skills Wishlist 🧰

 *Things that would need new tool implementations (can't do well with current tools):*

 ### High-Impact

- [ ] **Audio/Video Transcription** 🎬
+- [ ] **Audio/Video Transcription** 🎬 *(See also: Section 16 for detailed spec)*
  - Transcribe audio files, podcasts, YouTube videos
  - Extract key moments from video
-  - Currently blind to multimedia content
-  - *Could potentially use whisper via terminal, but native tool would be cleaner*
+  - Voice memo transcription for messaging integrations
+  - *Provider options: Whisper API, Deepgram, local Whisper*
  
 - [ ] **Diagram Rendering** 📊
  - Render Mermaid/PlantUML to actual images
@@ -324,13 +438,169 @@ These items need to be addressed ASAP:

 ---

+## 13. Messaging Platform Integrations 💬
+
+**Problem:** Agent currently only works via `cli.py`, which requires direct terminal access. Users may want to interact via messaging apps from their phone or other devices.
+
+**Architecture:**
+- `run_agent.py` already accepts `conversation_history` parameter and returns updated messages ✅
+- Need: persistent session storage, platform monitors, session key resolution
+
+**Implementation approach:**
+```
+┌─────────────────────────────────────────────────────────────┐
+│  Platform Monitor (e.g., telegram_monitor.py)               │
+│  ├─ Long-running daemon connecting to messaging platform    │
+│  ├─ On message: resolve session key → load history from disk│
+│  ├─ Call run_agent.py with loaded history                   │
+│  ├─ Save updated history back to disk (JSONL)               │
+│  └─ Send response back to platform                          │
+└─────────────────────────────────────────────────────────────┘
+```
+
+**Platform support (each user sets up their own credentials):**
+- [ ] **Telegram** - via `python-telegram-bot` (or a Python equivalent of the TypeScript framework `grammy`)
+  - Bot token from @BotFather
+  - Easiest to set up, good for personal use
+- [ ] **Discord** - via `discord.py`
+  - Bot token from Discord Developer Portal
+  - Can work in servers (group sessions) or DMs
+- [ ] **WhatsApp** - via `baileys` (WhatsApp Web protocol)
+  - QR code scan to authenticate
+  - More complex, but reaches most people
+
+**Session management:**
+- [ ] **Session store** - JSONL persistence per session key
+  - `~/.hermes/sessions/{session_key}.jsonl`
+  - Session keys: `telegram:dm:{user_id}`, `discord:channel:{id}`, etc.
+- [ ] **Session expiry** - Configurable reset policies
+  - Daily reset (default 4am) OR idle timeout (e.g., 2 hours)
+  - Manual reset via `/reset` or `/new` command in chat
+- [ ] **Session continuity** - Conversations persist across messages until reset
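A rough sketch of the session store and expiry check follows. Several details are assumptions rather than decisions from this list: the function names are hypothetical, `:` in session keys is replaced with `_` in filenames (it is not valid in Windows filenames), the whole history is rewritten on save, and the two reset policies are treated as independently switchable by passing `None`.

```python
import json
from datetime import datetime, timedelta
from pathlib import Path
from typing import List, Optional


def session_path(base_dir: Path, session_key: str) -> Path:
    """Map a key such as 'telegram:dm:42' to a JSONL file under base_dir."""
    safe_name = session_key.replace(":", "_").replace("/", "_")
    return base_dir / f"{safe_name}.jsonl"


def load_history(base_dir: Path, session_key: str) -> List[dict]:
    """Return stored messages, or an empty list for a new session."""
    path = session_path(base_dir, session_key)
    if not path.exists():
        return []
    with path.open(encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]


def save_history(base_dir: Path, session_key: str, messages: List[dict]) -> None:
    """Rewrite the session file with the updated message list."""
    path = session_path(base_dir, session_key)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", encoding="utf-8") as handle:
        for message in messages:
            handle.write(json.dumps(message) + "\n")


def is_expired(
    last_activity: datetime,
    now: datetime,
    reset_hour: Optional[int] = 4,
    idle_timeout: Optional[timedelta] = timedelta(hours=2),
) -> bool:
    """True if the idle timeout elapsed or a daily reset boundary was crossed."""
    if idle_timeout is not None and now - last_activity >= idle_timeout:
        return True
    if reset_hour is not None:
        # Most recent reset boundary at or before `now`.
        boundary = now.replace(hour=reset_hour, minute=0, second=0, microsecond=0)
        if boundary > now:
            boundary -= timedelta(days=1)
        return last_activity < boundary
    return False
```

A monitor would call `is_expired` before `load_history` and start from an empty history when it returns `True`; a `/reset` command could simply save an empty list.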
+
+**Files to create:** `monitors/telegram_monitor.py`, `monitors/discord_monitor.py`, `monitors/session_store.py`
+
+---
+
+## 14. Scheduled Tasks / Cron Jobs ⏰
+
+**Problem:** Agent only runs on-demand. Some tasks benefit from scheduled execution (daily summaries, monitoring, reminders).
+
+**Ideas:**
+- [ ] **Cron-style scheduler** - Run agent turns on a schedule
+  - Store jobs in `~/.hermes/cron/jobs.json`
+  - Each job: `{ id, schedule, prompt, session_mode, delivery }`
+  - Uses APScheduler or similar Python library
+  
+- [ ] **Session modes:**
+  - `isolated` - Fresh session each run (no history, clean context)
+  - `main` - Append to main session (agent remembers previous scheduled runs)
+  
+- [ ] **Delivery options:**
+  - Write output to file (`~/.hermes/cron/output/{job_id}/{timestamp}.md`)
+  - Send to messaging channel (if integrations enabled)
+  - Both
+  
+- [ ] **CLI interface:**
+  ```bash
+  # List scheduled jobs
+  python cli.py --cron list
+  
+  # Add a job (runs daily at 9am)
+  python cli.py --cron add "Summarize my email inbox" --schedule "0 9 * * *"
+  
+  # Quick syntax for simple intervals  
+  python cli.py --cron add "Check server status" --every 30m
+  
+  # Remove a job
+  python cli.py --cron remove <job_id>
+  ```
+
+- [ ] **Agent self-scheduling** - Let the agent create its own cron jobs
+  - New tool: `schedule_task(prompt, schedule, session_mode)`
+  - "Remind me to check the deployment tomorrow at 9am"
+  - Agent can set follow-up tasks for itself
+
+- [ ] **In-chat command:** `/cronjob {prompt} {frequency}` when using messaging integrations
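The job record and the `--every 30m` shorthand could look something like this. It is a sketch under assumptions: only `30m` appears above, so the `s`/`h`/`d` suffixes, the `CronJob` class name, the delivery values `file`/`channel`/`both`, and the short random id are all illustrative choices. Real cron-expression parsing is left to whichever scheduler library is chosen.

```python
import re
import uuid
from dataclasses import dataclass, field
from datetime import timedelta

# Assumed suffixes; only "m" (as in "30m") is shown in the CLI examples.
_UNITS = {"s": "seconds", "m": "minutes", "h": "hours", "d": "days"}


def parse_every(text: str) -> timedelta:
    """Turn a shorthand interval such as '30m' into a timedelta."""
    match = re.fullmatch(r"(\d+)([smhd])", text.strip().lower())
    if not match:
        raise ValueError(f"Unrecognized interval: {text!r}")
    amount, unit = int(match.group(1)), match.group(2)
    if amount == 0:
        raise ValueError("Interval must be greater than zero")
    return timedelta(**{_UNITS[unit]: amount})


@dataclass
class CronJob:
    """One entry in the jobs file: { id, schedule, prompt, session_mode, delivery }."""
    prompt: str
    schedule: str                   # cron expression, e.g. "0 9 * * *"
    session_mode: str = "isolated"  # "isolated" or "main"
    delivery: str = "file"          # assumed values: "file", "channel", "both"
    id: str = field(default_factory=lambda: uuid.uuid4().hex[:8])

    def __post_init__(self) -> None:
        if self.session_mode not in ("isolated", "main"):
            raise ValueError(f"Unknown session_mode: {self.session_mode!r}")
        if self.delivery not in ("file", "channel", "both"):
            raise ValueError(f"Unknown delivery: {self.delivery!r}")
```

`dataclasses.asdict(job)` yields a plain dictionary that can be appended to the JSON jobs file as-is.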
+
+**Files to create:** `cron/scheduler.py`, `cron/jobs.py`, `tools/schedule_tool.py`
+
+---
+
+## 15. Text-to-Speech (TTS) 🔊
+
+**Problem:** Agent can only respond with text. Some users prefer audio responses (accessibility, hands-free use, podcasts).
+
+**Ideas:**
+- [ ] **TTS tool** - Generate audio files from text
+  ```python
+  tts_generate(text="Here's your summary...", voice="nova", output="summary.mp3")
+  ```
+  - Returns path to generated audio file
+  - For messaging integrations: can send as voice message
+  
+- [ ] **Provider options:**
+  - Edge TTS (free, good quality, many voices)
+  - OpenAI TTS (paid, excellent quality)
+  - ElevenLabs (paid, best quality, voice cloning)
+  - Local options (Coqui TTS, Bark)
+  
+- [ ] **Modes:**
+  - On-demand: User explicitly asks "read this to me"
+  - Auto-TTS: Configurable to always generate audio for responses
+  - Long-text handling: Summarize or chunk very long responses
+  
+- [ ] **Integration with messaging:**
+  - When enabled, can send voice notes instead of/alongside text
+  - User preference per channel
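For the long-text handling mentioned under Modes, chunking at sentence boundaries is one simple option. The sketch below is illustrative only: the function name `chunk_text`, the 500-character default, and the sentence-splitting regex are assumptions, and real provider limits would need to be checked before picking a size.

```python
import re
from typing import List


def chunk_text(text: str, max_chars: int = 500) -> List[str]:
    """Split text into chunks of at most max_chars, preferring sentence boundaries."""
    if max_chars <= 0:
        raise ValueError("max_chars must be positive")
    # Split after ., ! or ? followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks: List[str] = []
    current = ""
    for sentence in sentences:
        if not sentence:
            continue
        # A single sentence longer than the limit is hard-split.
        while len(sentence) > max_chars:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(sentence[:max_chars])
            sentence = sentence[max_chars:]
        candidate = f"{current} {sentence}" if current else sentence
        if len(candidate) <= max_chars:
            current = candidate
        else:
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks
```

Each chunk could then be passed to the TTS tool separately and the resulting audio files sent in order.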
+
+**Files to create:** `tools/tts_tool.py`, config in `cli-config.yaml`
+
+---
+
+## 16. Speech-to-Text / Audio Transcription 🎤
+
+**Problem:** Users may want to send voice memos instead of typing. Agent is blind to audio content.
+
+**Ideas:**
+- [ ] **Voice memo transcription** - For messaging integrations
+  - User sends voice message → transcribe → process as text
+  - Seamless: user speaks, agent responds
+  
+- [ ] **Audio/video file transcription** - Existing idea, expanded:
+  - Transcribe local audio files (mp3, wav, m4a)
+  - Transcribe YouTube videos (download audio → transcribe)
+  - Extract key moments with timestamps
+  
+- [ ] **Provider options:**
+  - OpenAI Whisper API (good quality, cheap)
+  - Deepgram (fast, good for real-time)
+  - Local Whisper (free, runs on GPU)
+  - Groq Whisper (fast, free tier available)
+  
+- [ ] **Tool interface:**
+  ```python
+  transcribe(source="audio.mp3")  # Local file
+  transcribe(source="https://youtube.com/...")  # YouTube
+  transcribe(source="voice_message", data=bytes)  # Voice memo
+  ```
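Since a single `transcribe` entry point accepts three kinds of input, it would first need to work out which one it was given. A minimal sketch of that dispatch step follows; the function name `classify_source`, the returned labels, and the host and suffix sets are assumptions, and no actual transcription provider is called.

```python
from pathlib import Path
from typing import Optional
from urllib.parse import urlparse

# Assumed host list for recognising YouTube links.
YOUTUBE_HOSTS = {"youtube.com", "www.youtube.com", "m.youtube.com", "youtu.be"}
# Formats named in the list above.
AUDIO_SUFFIXES = {".mp3", ".wav", ".m4a"}


def classify_source(source: str, data: Optional[bytes] = None) -> str:
    """Return 'voice_memo', 'youtube', 'url', or 'local_file' for a transcribe() call."""
    if data is not None:
        # Raw bytes were passed alongside the label, as in the voice memo example.
        return "voice_memo"
    parsed = urlparse(source)
    if parsed.scheme in ("http", "https"):
        host = (parsed.hostname or "").lower()
        return "youtube" if host in YOUTUBE_HOSTS else "url"
    if Path(source).suffix.lower() in AUDIO_SUFFIXES:
        return "local_file"
    raise ValueError(f"Unsupported transcription source: {source!r}")
```

The returned label would then select the matching path: download audio first for `youtube`, read from disk for `local_file`, or pass the bytes straight to the provider for `voice_memo`.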
+
+**Files to create:** `tools/transcribe_tool.py`, integrate with messaging monitors
+
+---
+
 ## Priority Order (Suggested)

-1. **Memory & Context Management** - Biggest impact on complex tasks
-2. **Self-Reflection** - Improves reliability and reduces wasted tool calls  
-3. **Project-Local Context** - Practical win, keeps useful info across sessions
-4. **Tool Composition** - Quality of life, builds on other improvements
-5. **Dynamic Skills** - Force multiplier for repeated tasks
+1. **🎯 Subagent Architecture** - Critical for context management, enables everything else
+2. **Memory & Context Management** - Complements subagents for remaining context
+3. **Self-Reflection** - Improves reliability and reduces wasted tool calls  
+4. **Project-Local Context** - Practical win, keeps useful info across sessions
+5. **Messaging Integrations** - Unlocks mobile access, new interaction patterns
+6. **Scheduled Tasks / Cron Jobs** - Enables automation, reminders, monitoring
+7. **Tool Composition** - Quality of life, builds on other improvements
+8. **Dynamic Skills** - Force multiplier for repeated tasks
+9. **Interactive Clarifying Questions** - Better UX for ambiguous tasks
+10. **TTS / Audio Transcription** - Accessibility, hands-free use

 ---

@@ -339,11 +609,13 @@ These items need to be addressed ASAP:
 The following were removed because they're architecturally impossible:

 - ~~Proactive suggestions / Prefetching~~ - Agent only runs on user request, can't interject
- ~~Session save/restore across conversations~~ - Agent doesn't control session persistence
- ~~User preference learning across sessions~~ - Same issue
 - ~~Clipboard integration~~ - No access to user's local system clipboard
- ~~Voice/TTS playback~~ - Can generate audio but can't play it to user
- ~~Set reminders~~ - No persistent background execution
+
+The following were **moved to the active TODO** (now possible with the new architecture):
+
+- ~~Session save/restore~~ → See **Messaging Integrations** (session persistence)
+- ~~Voice/TTS playback~~ → See **TTS** (can generate audio files, send via messaging)
+- ~~Set reminders~~ → See **Scheduled Tasks / Cron Jobs**

 The following were removed because they're **already possible**:

@@ -357,4 +629,75 @@ The following were removed because they're **already possible**:

 ---

+---
+
+## 🧪 Brainstorm Ideas (Not Yet Fleshed Out)
+
+*These are early-stage ideas that need more thinking before implementation. Captured here so they don't get lost.*
+
+### Remote/Distributed Execution 🌐
+
+**Concept:** Run agent on a powerful remote server while interacting from a thin client.
+
+**Why interesting:**
+- Run on beefy GPU server for local LLM inference
+- Agent has access to remote machine's resources (files, tools, internet)
+- User interacts via lightweight client (phone, low-power laptop)
+
+**Open questions:**
+- How does this differ from just SSH + running cli.py on remote?
+- Would need secure communication channel (WebSocket? gRPC?)
+- How to handle tool outputs that reference remote paths?
+- Credential management for remote execution
+- Latency considerations for interactive use
+
+**Possible architecture:**
+```
+┌─────────────┐         ┌─────────────────────────┐
+│ Thin Client │ ◄─────► │ Remote Hermes Server    │
+│ (phone/web) │  WS/API │ - Full agent + tools    │
+└─────────────┘         │ - GPU for local LLM     │
+                        │ - Access to server files│
+                        └─────────────────────────┘
+```
+
+**Related to:** Messaging integrations (this could be the "server" that the platform monitors connect to)
+
+---
+
+### Multi-Agent Parallel Execution 🤖🤖
+
+**Concept:** Extension of Subagent Architecture (Section 1) - run multiple subagents in parallel.
+
+**Why interesting:**
+- Independent subtasks don't need to wait for each other
+- "Research X while setting up Y" - both run simultaneously
+- Faster completion for complex multi-part tasks
+
+**Open questions:**
+- How to detect which tasks are truly independent?
+- Resource management (API rate limits, concurrent connections)
+- How to merge results when parallel tasks have conflicts?
+- Cost implications of multiple parallel LLM calls
+
+*Note: Basic subagent delegation (Section 1) should be implemented first; parallel execution is an optimization on top.*
+
+---
+
+### Plugin/Extension System 🔌
+
+**Concept:** Allow users to add custom tools/skills without modifying core code.
+
+**Why interesting:**
+- Community contributions
+- Organization-specific tools
+- Clean separation of core vs. extensions
+
+**Open questions:**
+- Security implications of loading arbitrary code
+- Versioning and compatibility
+- Discovery and installation UX
+
+---
+
 *Last updated: $(date +%Y-%m-%d)* 🤖