diff --git a/website/docs/user-guide/features/voice-mode.md b/website/docs/user-guide/features/voice-mode.md new file mode 100644 index 000000000..8a0f74b58 --- /dev/null +++ b/website/docs/user-guide/features/voice-mode.md @@ -0,0 +1,400 @@ +--- +sidebar_position: 10 +title: "Voice Mode" +description: "Real-time voice conversations with Hermes Agent — CLI, Telegram, Discord text, and Discord voice channels" +--- + +# Voice Mode + +Hermes Agent supports full voice interaction across CLI and messaging platforms. Talk to the agent using your microphone, hear spoken replies, and have live voice conversations in Discord voice channels. + +## Overview + +| Feature | Platform | Description | +|---------|----------|-------------| +| **Interactive Voice** | CLI | Press Ctrl+B to record, agent auto-detects silence and responds | +| **Auto Voice Reply** | Telegram, Discord | Agent sends spoken audio alongside text responses | +| **Voice Channel** | Discord | Bot joins VC, listens to users speaking, speaks replies back | + +## Requirements + +### Python Packages + +```bash +# CLI voice mode (microphone + audio playback) +pip install hermes-agent[voice] + +# Discord + Telegram messaging (includes discord.py[voice] for VC support) +pip install hermes-agent[messaging] + +# Premium TTS (ElevenLabs) +pip install hermes-agent[tts-premium] + +# Everything at once +pip install hermes-agent[all] +``` + +| Extra | Packages | Required For | +|-------|----------|-------------| +| `voice` | `sounddevice`, `numpy` | CLI voice mode | +| `messaging` | `discord.py[voice]`, `python-telegram-bot`, `aiohttp` | Discord & Telegram bots | +| `tts-premium` | `elevenlabs` | ElevenLabs TTS provider | + +:::info +`discord.py[voice]` installs **PyNaCl** (for voice encryption) and **opus bindings** automatically. This is required for Discord voice channel support. +::: + +### System Dependencies + +```bash +# macOS +brew install portaudio ffmpeg opus + +# Ubuntu/Debian +sudo apt install portaudio19-dev ffmpeg libopus0 +``` + +| Dependency | Purpose | Required For | +|-----------|---------|-------------| +| **PortAudio** | Microphone input and audio playback | CLI voice mode | +| **ffmpeg** | Audio format conversion (MP3 → Opus, PCM → WAV) | All platforms | +| **Opus** | Discord voice codec | Discord voice channels | + +### API Keys + +Add to `~/.hermes/.env`: + +```bash +# Speech-to-Text (at least one required) +GROQ_API_KEY=your-key # Groq Whisper — fast, free tier available (recommended) +VOICE_TOOLS_OPENAI_KEY=your-key # OpenAI Whisper — alternative + +# Text-to-Speech (optional — Edge TTS works without any key) +ELEVENLABS_API_KEY=your-key # ElevenLabs — premium quality +``` + +--- + +## CLI Voice Mode + +### Quick Start + +``` +/voice Toggle voice mode on/off +/voice on Enable voice mode +/voice off Disable voice mode +/voice tts Toggle TTS output +/voice status Show current state +``` + +### How It Works + +1. Enable voice mode with `/voice on` +2. **Press Ctrl+B** — a beep plays (880Hz), recording starts +3. **Speak** — a live audio level bar shows your input: `● [▁▂▃▅▇▇▅▂] ❯` +4. **Stop speaking** — after 3 seconds of silence, recording auto-stops +5. **Two beeps** play (660Hz) confirming the recording ended +6. Audio is transcribed via Whisper and sent to the agent +7. If TTS is enabled, the agent's reply is spoken aloud +8. Recording **automatically restarts** — speak again without pressing any key + +This loop continues until you press **Ctrl+B** during recording (exits continuous mode) or 3 consecutive recordings detect no speech. + +:::tip +The record key is configurable via `voice.record_key` in `~/.hermes/config.yaml` (default: `ctrl+b`). +::: + +### Silence Detection + +Two-stage algorithm detects when you've finished speaking: + +1. **Speech confirmation** — waits for audio above the RMS threshold (200) for at least 0.3s, tolerating brief dips between syllables +2. **End detection** — once speech is confirmed, triggers after 3.0 seconds of continuous silence + +If no speech is detected at all for 15 seconds, recording stops automatically. + +Both `silence_threshold` and `silence_duration` are configurable in `config.yaml`. + +### Streaming TTS + +When TTS is enabled, the agent speaks its reply **sentence-by-sentence** as it generates text — you don't wait for the full response: + +1. Buffers text deltas into complete sentences (min 20 chars) +2. Strips markdown formatting and `` blocks +3. Generates and plays audio per sentence in real-time + +### Hallucination Filter + +Whisper sometimes generates phantom text from silence or background noise ("Thank you for watching", "Subscribe", etc.). The agent filters these out using a database of 498+ known hallucination phrases across multiple languages. + +--- + +## Gateway Voice Reply (Telegram & Discord) + +### Commands + +These work in both Telegram and Discord text channels: + +``` +/voice Toggle voice mode on/off +/voice on Voice replies only when you send a voice message +/voice tts Voice replies for ALL messages +/voice off Disable voice replies +/voice status Show current setting +``` + +### Modes + +| Mode | Command | Behavior | +|------|---------|----------| +| `off` | `/voice off` | Text only (default) | +| `voice_only` | `/voice on` | Speaks reply only when you send a voice message | +| `all` | `/voice tts` | Speaks reply to every message | + +Voice mode setting is persisted across gateway restarts. + +### Platform Delivery + +| Platform | Format | Notes | +|----------|--------|-------| +| **Telegram** | Voice bubble (Opus/OGG) | Plays inline in chat. ffmpeg converts MP3 → Opus if needed | +| **Discord** | Audio file attachment (MP3) | Sent alongside text response | + +--- + +## Discord Voice Channels + +The most immersive voice feature: the bot joins a Discord voice channel, listens to users speaking, transcribes their speech, processes through the agent, and speaks the reply back in the voice channel. + +### Setup + +#### 1. Discord Bot Permissions + +If you already have a Discord bot set up for text (see [Discord Setup Guide](../messaging/discord.md)), you need to add voice permissions. + +Go to the [Discord Developer Portal](https://discord.com/developers/applications) → your application → **Installation** → **Default Install Settings** → **Guild Install**: + +**Add these permissions to the existing text permissions:** + +| Permission | Purpose | Required | +|-----------|---------|----------| +| **Connect** | Join voice channels | Yes | +| **Speak** | Play TTS audio in voice channels | Yes | +| **Use Voice Activity** | Detect when users are speaking | Recommended | + +**Updated Permissions Integer:** + +| Level | Integer | What's Included | +|-------|---------|----------------| +| Text only | `274878286912` | View Channels, Send Messages, Read History, Embeds, Attachments, Threads, Reactions | +| Text + Voice | `274881432640` | All above + Connect, Speak | + +**Re-invite the bot** with the updated permissions URL: + +``` +https://discord.com/oauth2/authorize?client_id=YOUR_APP_ID&scope=bot+applications.commands&permissions=274881432640 +``` + +Replace `YOUR_APP_ID` with your Application ID from the Developer Portal. + +:::warning +Re-inviting the bot to a server it's already in will update its permissions without removing it. You won't lose any data or configuration. +::: + +#### 2. Privileged Gateway Intents + +In the [Developer Portal](https://discord.com/developers/applications) → your application → **Bot** → **Privileged Gateway Intents**, enable all three: + +| Intent | Purpose | +|--------|---------| +| **Presence Intent** | Detect user online/offline status | +| **Server Members Intent** | Map voice SSRC identifiers to Discord user IDs | +| **Message Content Intent** | Read text message content in channels | + +All three are required for full voice channel functionality. **Server Members Intent** is especially critical — without it, the bot cannot identify who is speaking in the voice channel. + +#### 3. Opus Codec + +The Opus codec library must be installed on the machine running the gateway: + +```bash +# macOS (Homebrew) +brew install opus + +# Ubuntu/Debian +sudo apt install libopus0 +``` + +The bot auto-loads the codec from: +- **macOS:** `/opt/homebrew/lib/libopus.dylib` +- **Linux:** `libopus.so.0` + +#### 4. Environment Variables + +```bash +# ~/.hermes/.env + +# Discord bot (already configured for text) +DISCORD_BOT_TOKEN=your-bot-token +DISCORD_ALLOWED_USERS=your-user-id + +# STT — at least one required for voice channel listening +GROQ_API_KEY=your-key # Recommended (fast, free tier) + +# TTS — optional, Edge TTS (free) is the default +# ELEVENLABS_API_KEY=your-key # Premium quality +``` + +### Commands + +``` +/voice join Bot joins your current voice channel +/voice channel Alias for /voice join +/voice leave Bot disconnects from voice channel +/voice status Show voice mode and connected channel +``` + +:::info +You must be in a voice channel before running `/voice join`. The bot joins the same VC you're in. +::: + +### How It Works + +When the bot joins a voice channel, it: + +1. **Captures audio** via Discord's UDP socket (RTP packets) +2. **Decrypts** using NaCl transport encryption (aead_xchacha20_poly1305_rtpsize) +3. **Decrypts** DAVE end-to-end encryption (Discord Audio/Video Encryption) +4. **Decodes** Opus audio to raw PCM (48kHz stereo, per-user decoder) +5. **Detects silence** — 1.5s of silence after at least 0.5s of speech triggers processing +6. **Converts** PCM to 16kHz mono WAV via ffmpeg +7. **Transcribes** via Whisper STT (Groq or OpenAI) +8. **Processes** through the full agent pipeline (session, tools, memory) +9. **Generates TTS** reply audio +10. **Plays** the reply in the voice channel + +### Text Channel Integration + +When the bot is in a voice channel: + +- Transcripts appear in the text channel: `[Voice] @user: what you said` +- Agent responses are sent as text in the channel AND spoken in the VC +- The text channel is the one where `/voice join` was issued + +### Echo Prevention + +The bot automatically pauses its audio listener while playing TTS replies, preventing it from hearing and re-processing its own output. + +### Access Control + +Only users listed in `DISCORD_ALLOWED_USERS` can interact via voice. Other users' audio is silently ignored. + +```bash +# ~/.hermes/.env +DISCORD_ALLOWED_USERS=284102345871466496 +``` + +--- + +## Configuration Reference + +### config.yaml + +```yaml +# Voice recording (CLI) +voice: + record_key: "ctrl+b" # Key to start/stop recording + max_recording_seconds: 120 # Maximum recording length + auto_tts: false # Auto-enable TTS when voice mode starts + silence_threshold: 200 # RMS level (0-32767) below which counts as silence + silence_duration: 3.0 # Seconds of silence before auto-stop + +# Speech-to-Text +stt: + enabled: true + model: "whisper-1" # Or: whisper-large-v3-turbo (Groq) + +# Text-to-Speech +tts: + provider: "edge" # "edge" (free) | "elevenlabs" | "openai" + edge: + voice: "en-US-AriaNeural" # 322 voices, 74 languages + elevenlabs: + voice_id: "pNInz6obpgDQGcFmaJgB" # Adam + model_id: "eleven_multilingual_v2" + openai: + model: "gpt-4o-mini-tts" + voice: "alloy" # alloy, echo, fable, onyx, nova, shimmer +``` + +### Environment Variables + +```bash +# Speech-to-Text providers +GROQ_API_KEY=... # Groq Whisper (recommended — fast, free tier) +VOICE_TOOLS_OPENAI_KEY=... # OpenAI Whisper (alternative) + +# Text-to-Speech providers (Edge TTS needs no key) +ELEVENLABS_API_KEY=... # ElevenLabs (premium quality) +# OpenAI TTS uses VOICE_TOOLS_OPENAI_KEY + +# Discord voice channel +DISCORD_BOT_TOKEN=... +DISCORD_ALLOWED_USERS=... +``` + +### STT Provider Comparison + +| Provider | Model | Speed | Quality | Cost | +|----------|-------|-------|---------|------| +| **Groq** | `whisper-large-v3-turbo` | Very fast (~0.5s) | Good | Free tier | +| **Groq** | `whisper-large-v3` | Fast (~1s) | Better | Free tier | +| **OpenAI** | `whisper-1` | Fast (~1s) | Good | Low | +| **OpenAI** | `gpt-4o-transcribe` | Medium (~2s) | Best | Higher | + +### TTS Provider Comparison + +| Provider | Quality | Cost | Latency | Key Required | +|----------|---------|------|---------|-------------| +| **Edge TTS** | Good | Free | ~1s | No | +| **ElevenLabs** | Excellent | Paid | ~2s | Yes | +| **OpenAI TTS** | Good | Paid | ~1.5s | Yes | + +--- + +## Troubleshooting + +### "No audio device found" (CLI) + +PortAudio is not installed: + +```bash +brew install portaudio # macOS +sudo apt install portaudio19-dev # Ubuntu +``` + +### Bot joins VC but doesn't hear me + +- Check your Discord user ID is in `DISCORD_ALLOWED_USERS` +- Make sure you're not muted in Discord +- The bot needs a SPEAKING event from Discord before it can map your audio — start speaking within a few seconds of joining + +### Bot hears me but doesn't respond + +- Verify STT key is set (`GROQ_API_KEY` or `VOICE_TOOLS_OPENAI_KEY`) +- Check the LLM model is configured and accessible +- Review gateway logs: `tail -f ~/.hermes/logs/gateway.log` + +### Bot responds in text but not in voice channel + +- TTS provider may be failing — check API key and quota +- Edge TTS (free, no key) is the default fallback +- Check logs for TTS errors + +### Whisper returns garbage text + +The hallucination filter catches most cases automatically. If you're still getting phantom transcripts: + +- Use a quieter environment +- Adjust `silence_threshold` in config (higher = less sensitive) +- Try a different STT model diff --git a/website/docs/user-guide/messaging/discord.md b/website/docs/user-guide/messaging/discord.md index 38fb9598a..0fc7f8cbc 100644 --- a/website/docs/user-guide/messaging/discord.md +++ b/website/docs/user-guide/messaging/discord.md @@ -210,8 +210,8 @@ Replace the ID with the actual channel ID (right-click → Copy Channel ID with Hermes Agent supports Discord voice messages: -- **Incoming voice messages** are automatically transcribed using Whisper (requires `VOICE_TOOLS_OPENAI_KEY` to be set in your environment). -- **Text-to-speech**: When TTS is enabled, the bot can send spoken responses as MP3 file attachments. +- **Incoming voice messages** are automatically transcribed using Whisper (requires `GROQ_API_KEY` or `VOICE_TOOLS_OPENAI_KEY` to be set in your environment). +- **Text-to-speech**: Use `/voice tts` to have the bot send spoken audio responses alongside text replies. ## Troubleshooting