Add Text-to-Speech (TTS) functionality with multiple providers

Add tool previews

Add AGENTS and SOUL.md support

Add Exec Approval
This commit is contained in:
teknium1
2026-02-12 10:05:08 -08:00
parent 89c6f24d48
commit f5be6177b2
18 changed files with 1200 additions and 21 deletions

View File

@@ -37,8 +37,9 @@ All your settings are stored in `~/.hermes/` for easy access:
```
~/.hermes/
├── config.yaml # Settings (model, terminal, compression, etc.)
├── config.yaml # Settings (model, terminal, TTS, compression, etc.)
├── .env # API keys and secrets
├── SOUL.md # Optional: global persona (agent embodies this personality)
├── cron/ # Scheduled jobs
├── sessions/ # Gateway sessions
└── logs/ # Logs
@@ -76,6 +77,8 @@ You need at least one LLM provider:
| Web scraping | [Firecrawl](https://firecrawl.dev/) | `FIRECRAWL_API_KEY` |
| Browser automation | [Browserbase](https://browserbase.com/) | `BROWSERBASE_API_KEY`, `BROWSERBASE_PROJECT_ID` |
| Image generation | [FAL](https://fal.ai/) | `FAL_KEY` |
| Premium TTS voices | [ElevenLabs](https://elevenlabs.io/) | `ELEVENLABS_API_KEY` |
| OpenAI TTS voices | [OpenAI](https://platform.openai.com/api-keys) | `OPENAI_API_KEY` |
| RL Training | [Tinker](https://tinker-console.thinkingmachines.ai/) + [WandB](https://wandb.ai/) | `TINKER_API_KEY`, `WANDB_API_KEY` |
| Messaging | Telegram, Discord | `TELEGRAM_BOT_TOKEN`, `DISCORD_BOT_TOKEN` |
@@ -128,7 +131,58 @@ hermes --toolsets "web,terminal"
hermes --list-tools
```
**Available toolsets:** `web`, `terminal`, `browser`, `vision`, `creative`, `reasoning`, `skills`, `cronjob`, and more.
**Available toolsets:** `web`, `terminal`, `browser`, `vision`, `creative`, `reasoning`, `skills`, `tts`, `cronjob`, and more.
### 🔊 Text-to-Speech
Convert text to speech with three providers:
| Provider | Quality | Cost | API Key |
|----------|---------|------|---------|
| **Edge TTS** (default) | Good | Free | None needed |
| **ElevenLabs** | Excellent | Paid | `ELEVENLABS_API_KEY` |
| **OpenAI TTS** | Good | Paid | `OPENAI_API_KEY` |
On Telegram, audio plays as native voice bubbles. On Discord/WhatsApp, sent as audio files. In CLI mode, saved to `~/voice-memos/`.
**Configure in `~/.hermes/config.yaml`:**
```yaml
tts:
provider: "edge" # "edge" | "elevenlabs" | "openai"
edge:
voice: "en-US-AriaNeural" # 322 voices, 74 languages
elevenlabs:
voice_id: "pNInz6obpgDQGcFmaJgB" # Adam
model_id: "eleven_multilingual_v2"
openai:
model: "gpt-4o-mini-tts"
voice: "alloy" # alloy, echo, fable, onyx, nova, shimmer
```
> **Note:** Telegram voice bubbles require `ffmpeg` for Opus conversion (Edge TTS only outputs MP3). Install with `apt install ffmpeg` or `brew install ffmpeg`. Without ffmpeg, audio is sent as a file instead of a voice bubble.
### 📄 Context Files (SOUL.md, AGENTS.md, .cursorrules)
Drop these files in your project directory and the agent automatically picks them up:
| File | Purpose |
|------|---------|
| `AGENTS.md` | Project-specific instructions, coding conventions, tool usage guidelines |
| `SOUL.md` | Persona definition -- the agent embodies this personality and tone |
| `.cursorrules` | Cursor IDE rules (also detected) |
| `.cursor/rules/*.mdc` | Cursor rule files (also detected) |
- **AGENTS.md** is hierarchical: if subdirectories also have `AGENTS.md`, all are combined (like Codex/Cline).
- **SOUL.md** checks cwd first, then `~/.hermes/SOUL.md` as a global fallback.
- All context files are capped at 20,000 characters with smart truncation.
### 🛡️ Exec Approval (Messaging Platforms)
When the agent tries to run a potentially dangerous command (rm -rf, chmod 777, etc.) on Telegram/Discord/WhatsApp, instead of blocking it silently, it asks the user for approval:
> ⚠️ This command is potentially dangerous (recursive delete). Reply "yes" to approve.
Reply "yes"/"y" to approve or "no"/"n" to deny. In CLI mode, the existing interactive approval prompt (once/session/always/deny) is preserved.
### 🖥️ Terminal Backend