docs(voice): add comprehensive voice mode guide

Add a hands-on guide for using voice mode with Hermes, fix and expand the main voice-mode docs, surface /voice in messaging docs, and improve discoverability from the homepage and learning path.
teknium1
2026-03-14 09:50:45 -07:00
parent 6c0bf2824e
commit f43c078f9e
7 changed files with 439 additions and 2 deletions


@@ -54,7 +54,9 @@ Deploy Hermes Agent as a bot on your favorite messaging platform.
3. [Messaging Overview](/docs/user-guide/messaging)
4. [Telegram Setup](/docs/user-guide/messaging/telegram)
5. [Discord Setup](/docs/user-guide/messaging/discord)
6. [Security](/docs/user-guide/security)
6. [Voice Mode](/docs/user-guide/features/voice-mode)
7. [Use Voice Mode with Hermes](/docs/guides/use-voice-mode-with-hermes)
8. [Security](/docs/user-guide/security)
For full project examples, see:
- [Daily Briefing Bot](/docs/guides/daily-briefing-bot)


@@ -0,0 +1,422 @@
---
sidebar_position: 7
title: "Use Voice Mode with Hermes"
description: "A practical guide to setting up and using Hermes voice mode across CLI, Telegram, Discord, and Discord voice channels"
---
# Use Voice Mode with Hermes
This guide is the practical companion to the [Voice Mode feature reference](/docs/user-guide/features/voice-mode).
The feature page explains what voice mode can do; this guide shows how to set it up and use it well.
## What voice mode is good for
Voice mode is especially useful when:
- you want a hands-free CLI workflow
- you want spoken responses in Telegram or Discord
- you want Hermes sitting in a Discord voice channel for live conversation
- you want quick idea capture, debugging, or back-and-forth while walking around instead of typing
## Choose your voice mode setup
Hermes offers three distinct voice experiences.
| Mode | Best for | Platform |
|---|---|---|
| Interactive microphone loop | Personal hands-free use while coding or researching | CLI |
| Voice replies in chat | Spoken responses alongside normal messaging | Telegram, Discord |
| Live voice channel bot | Group or personal live conversation in a VC | Discord voice channels |
A good path is:
1. get text working first
2. enable voice replies second
3. move to Discord voice channels last if you want the full experience
## Step 1: make sure normal Hermes works first
Before touching voice mode, verify that:
- Hermes starts
- your provider is configured
- the agent can answer text prompts normally
```bash
hermes
```
Ask something simple:
```text
What tools do you have available?
```
If that is not solid yet, fix text mode first.
## Step 2: install the right extras
### CLI microphone + playback
```bash
pip install "hermes-agent[voice]"
```
### Messaging platforms
```bash
pip install "hermes-agent[messaging]"
```
### Premium ElevenLabs TTS
```bash
pip install "hermes-agent[tts-premium]"
```
### Everything
```bash
pip install "hermes-agent[all]"
```
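If you need more than one extra, pip accepts a comma-separated list inside the brackets. The sketch below builds that requirement string; `extras_spec` is a hypothetical helper name, not part of Hermes. Quoting the result matters because shells such as zsh treat unquoted square brackets as a glob pattern.

```bash
# Hypothetical helper: join extras into one pip requirement string.
extras_spec() {
  local IFS=,
  # "$*" joins all arguments with the first character of IFS (a comma here).
  printf 'hermes-agent[%s]\n' "$*"
}

extras_spec voice messaging   # prints hermes-agent[voice,messaging]

# To install, pass the quoted result to pip:
#   pip install "$(extras_spec voice messaging)"
```

If you want everything anyway, the `all` extra shown above is simpler.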
## Step 3: install system dependencies
### macOS
```bash
brew install portaudio ffmpeg opus
```
### Ubuntu / Debian
```bash
sudo apt install portaudio19-dev ffmpeg libopus0
```
Why these matter:
- `portaudio` → microphone input / playback for CLI voice mode
- `ffmpeg` → audio conversion for TTS and messaging delivery
- `opus` → Discord voice codec support
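A quick way to confirm the command-line pieces are in place is to check your `PATH`. This is a minimal sketch: `missing_cmds` is a hypothetical helper, and it only covers executables (`hermes`, `ffmpeg`), not the `portaudio` or `opus` libraries, which your package manager has to report on.

```bash
# Hypothetical helper: print each given command that is NOT found on PATH.
missing_cmds() {
  local cmd
  for cmd in "$@"; do
    command -v "$cmd" >/dev/null 2>&1 || printf '%s\n' "$cmd"
  done
}

missing="$(missing_cmds hermes ffmpeg)"
if [ -n "$missing" ]; then
  echo "Missing from PATH:"
  echo "$missing"
else
  echo "hermes and ffmpeg are both on PATH"
fi
```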
## Step 4: choose STT and TTS providers
Hermes supports both local and cloud speech stacks.
### Easiest / cheapest setup
Use local STT and free Edge TTS:
- STT provider: `local`
- TTS provider: `edge`
This is usually the best place to start.
### Environment file example
Add to `~/.hermes/.env`:
```bash
# Cloud STT options (local needs no key)
GROQ_API_KEY=***
VOICE_TOOLS_OPENAI_KEY=***
# Premium TTS (optional)
ELEVENLABS_API_KEY=***
```
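To confirm which keys are actually set without echoing secrets to your terminal, something like the following can help. `set_keys` is a hypothetical helper, and it assumes plain `NAME=value` lines with no `export` prefix.

```bash
# Hypothetical helper: print the NAMES (never the values) of keys
# that have a non-empty value in the given env file.
set_keys() {
  # Comment lines, blank lines, and "NAME=" with nothing after it do not match.
  sed -n 's/^\([A-Za-z_][A-Za-z0-9_]*\)=..*$/\1/p' "$1"
}

env_file="$HOME/.hermes/.env"
if [ -f "$env_file" ]; then
  set_keys "$env_file"
fi

# The file holds API keys, so restricting it to your user is a sensible habit:
#   chmod 600 ~/.hermes/.env
```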
### Provider recommendations
#### Speech-to-text
- `local` → best default for privacy and zero-cost use
- `groq` → very fast cloud transcription
- `openai` → good paid fallback
#### Text-to-speech
- `edge` → free and good enough for most users
- `elevenlabs` → best quality
- `openai` → good middle ground
## Step 5: recommended config
```yaml
voice:
record_key: "ctrl+b"
max_recording_seconds: 120
auto_tts: false
silence_threshold: 200
silence_duration: 3.0
stt:
provider: "local"
local:
model: "base"
tts:
provider: "edge"
edge:
voice: "en-US-AriaNeural"
```
This is a good conservative default for most people.
## Use case 1: CLI voice mode
### Turn it on
Start Hermes:
```bash
hermes
```
Inside the CLI:
```text
/voice on
```
### Recording flow
Default key:
- `Ctrl+B`
Workflow:
1. press `Ctrl+B`
2. speak
3. wait for silence detection to stop recording automatically
4. Hermes transcribes and responds
5. if TTS is on, it speaks the answer
6. the loop can automatically restart for continuous use
### Useful commands
```text
/voice
/voice on
/voice off
/voice tts
/voice status
```
### Good CLI workflows
#### Walk-up debugging
Say:
```text
I keep getting a docker permission error. Help me debug it.
```
Then continue hands-free:
- "Read the last error again"
- "Explain the root cause in simpler terms"
- "Now give me the exact fix"
#### Research / brainstorming
Great for:
- walking around while thinking
- dictating half-formed ideas
- asking Hermes to structure your thoughts in real time
#### Accessibility / low-typing sessions
If typing is inconvenient, voice mode is one of the fastest ways to stay in the full Hermes loop.
## Tuning CLI behavior
### Silence threshold
If Hermes starts or stops recording too eagerly, adjust the threshold:
```yaml
voice:
silence_threshold: 250
```
Higher threshold = less sensitive.
### Silence duration
If you pause a lot between sentences, increase:
```yaml
voice:
silence_duration: 4.0
```
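To build intuition for how these two settings interact, here is a toy model of silence detection. It is an illustration only, not Hermes's actual implementation. It assumes audio arrives as fixed-length frames with one loudness value each, that a frame counts as silent when its level is below `silence_threshold`, and that recording stops once consecutive silent frames add up to `silence_duration`. The name `stop_frame` and the frame values are made up for the example.

```bash
# Hypothetical model: print the 1-based frame at which recording would stop,
# or "none" if the silence window is never reached.
# Usage: stop_frame THRESHOLD SILENCE_MS FRAME_MS LEVEL...
stop_frame() {
  local threshold="$1" silence_ms="$2" frame_ms="$3"
  shift 3
  printf '%s\n' "$@" | awk -v t="$threshold" -v need="$silence_ms" -v dt="$frame_ms" '
    {
      # A loud frame resets the running silence counter.
      if ($1 < t) quiet += dt; else quiet = 0
      if (quiet >= need) { print NR; found = 1; exit }
    }
    END { if (!found) print "none" }'
}

# threshold 200, 3.0 s of silence, 0.5 s frames: speech for two frames, then quiet.
stop_frame 200 3000 500 900 800 100 50 60 70 80 90 40   # prints 8
```

In this model, raising the threshold makes more frames count as silence, and raising the duration makes Hermes wait longer before it stops recording.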
### Record key
If `Ctrl+B` conflicts with your terminal or tmux habits:
```yaml
voice:
record_key: "ctrl+space"
```
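The tmux case is worth checking explicitly, because `Ctrl+B` is tmux's default prefix key and tmux sets the `TMUX` environment variable inside its sessions. The sketch below uses a hypothetical helper, `clashes_with_tmux`, and only considers the default prefix; if you have rebound it, `tmux show-options -g prefix` shows the current one.

```bash
# Hypothetical helper: succeed when the record key is likely to clash with
# tmux's default prefix.
# $1 = record_key value from the config, $2 = contents of $TMUX (empty outside tmux).
clashes_with_tmux() {
  [ "$1" = "ctrl+b" ] && [ -n "$2" ]
}

if clashes_with_tmux "ctrl+b" "${TMUX:-}"; then
  echo "Inside tmux: Ctrl+B is the default prefix, consider another record_key"
fi
```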
## Use case 2: voice replies in Telegram or Discord
This mode is simpler than full voice channels.
Hermes stays a normal chat bot, but can speak replies.
### Start the gateway
```bash
hermes gateway
```
### Turn on voice replies
Inside Telegram or Discord:
```text
/voice on
```
or
```text
/voice tts
```
### Modes
| Mode | Meaning |
|---|---|
| `off` | text only |
| `voice_only` | speak only when the user sent voice |
| `all` | speak every reply |
### When to use which mode
- `/voice on` if you want spoken replies only for voice-originating messages
- `/voice tts` if you want a full spoken assistant all the time
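The relationship between the commands and the modes can be summarized as a small decision table. This is a sketch of the behavior described above, not Hermes code: `reply_mode_for` and `should_speak` are hypothetical names, and the mapping of `/voice off` to `off` is inferred from the mode table.

```bash
# Hypothetical helper: map a chat command to the reply mode it selects.
reply_mode_for() {
  case "$1" in
    "/voice on")  echo "voice_only" ;;
    "/voice tts") echo "all" ;;
    "/voice off") echo "off" ;;
    *)            echo "unknown" ;;
  esac
}

# Hypothetical helper: decide whether a reply is spoken.
# $1 = mode, $2 = how the user's message arrived ("voice" or "text").
should_speak() {
  case "$1:$2" in
    all:*)            return 0 ;;
    voice_only:voice) return 0 ;;
    *)                return 1 ;;
  esac
}

mode="$(reply_mode_for "/voice on")"
if should_speak "$mode" "text"; then
  echo "spoken reply"
else
  echo "text-only reply"   # a typed message in voice_only mode stays text
fi
```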
### Good messaging workflows
#### Telegram assistant on your phone
Use when:
- you are away from your machine
- you want to send voice notes and get quick spoken replies
- you want Hermes to function like a portable research or ops assistant
#### Discord DMs with spoken output
Useful when you want private interaction without server-channel mention behavior.
## Use case 3: Discord voice channels
This is the most advanced mode.
Hermes joins a Discord VC, listens to user speech, transcribes it, runs the normal agent pipeline, and speaks replies back into the channel.
### Required Discord permissions
In addition to the normal text-bot setup, make sure the bot has:
- Connect
- Speak
- Use Voice Activity (recommended)
Also enable privileged intents in the Developer Portal:
- Presence Intent
- Server Members Intent
- Message Content Intent
### Join and leave
In a Discord text channel where the bot is present:
```text
/voice join
/voice leave
/voice status
```
### What happens when joined
- users speak in the VC
- Hermes detects speech boundaries
- transcripts are posted in the associated text channel
- Hermes responds in text and audio
- the text channel is the one where `/voice join` was issued
### Best practices for Discord VC use
- keep `DISCORD_ALLOWED_USERS` tight
- use a dedicated bot/testing channel at first
- verify STT and TTS work in ordinary text-chat voice mode before trying VC mode
## Voice quality recommendations
### Best quality setup
- STT: local `large-v3` or Groq `whisper-large-v3`
- TTS: ElevenLabs
### Best speed / convenience setup
- STT: local `base` or Groq
- TTS: Edge
### Best zero-cost setup
- STT: local
- TTS: Edge
## Common failure modes
### "No audio device found"
Install `portaudio` (see Step 3 for the macOS and Ubuntu / Debian commands).
### "Bot joins but hears nothing"
Check:
- your Discord user ID is in `DISCORD_ALLOWED_USERS`
- you are not muted
- privileged intents are enabled
- the bot has Connect/Speak permissions
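The first item is the easiest to get subtly wrong, for example through a truncated ID or a stray space. The check below is a sketch that assumes `DISCORD_ALLOWED_USERS` is a comma-separated list of user IDs; that format is an assumption here, so confirm it against the Discord setup docs. `is_allowed_user` is a hypothetical helper, and the IDs are placeholders.

```bash
# Hypothetical helper: succeed when user ID $1 appears as a whole entry
# in the comma-separated list $2 (spaces are ignored).
is_allowed_user() {
  case ",$(printf '%s' "$2" | tr -d ' ')," in
    *",$1,"*) return 0 ;;
    *)        return 1 ;;
  esac
}

# Placeholder ID; substitute your own Discord user ID.
my_id="123456789012345678"
if is_allowed_user "$my_id" "${DISCORD_ALLOWED_USERS:-}"; then
  echo "ID is in DISCORD_ALLOWED_USERS"
else
  echo "ID is NOT in DISCORD_ALLOWED_USERS"
fi
```

Matching whole entries rather than substrings avoids a false positive when one ID happens to be a prefix of another.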
### "It transcribes but does not speak"
Check:
- TTS provider config
- API key / quota for ElevenLabs or OpenAI
- `ffmpeg` is installed (it handles audio conversion for TTS output, including Edge)
### "Whisper outputs garbage"
Try:
- quieter environment
- higher `silence_threshold`
- different STT provider/model
- shorter, clearer utterances
### "It works in DMs but not in server channels"
This is usually caused by the mention policy.
By default, the bot needs an `@mention` in Discord server text channels unless configured otherwise.
## Suggested first-week setup
If you want the shortest path to success:
1. get text Hermes working
2. install `hermes-agent[voice]`
3. use CLI voice mode with local STT + Edge TTS
4. then enable `/voice on` in Telegram or Discord
5. only after that, try Discord VC mode
That progression keeps the debugging surface small.
## Where to read next
- [Voice Mode feature reference](/docs/user-guide/features/voice-mode)
- [Messaging Gateway](/docs/user-guide/messaging)
- [Discord setup](/docs/user-guide/messaging/discord)
- [Telegram setup](/docs/user-guide/messaging/telegram)
- [Configuration](/docs/user-guide/configuration)


@@ -33,6 +33,8 @@ It's not a coding copilot tethered to an IDE or a chatbot wrapper around a singl
| 📚 **[Skills System](/docs/user-guide/features/skills)** | Procedural memory the agent creates and reuses |
| 🔌 **[MCP Integration](/docs/user-guide/features/mcp)** | Connect to MCP servers, filter their tools, and extend Hermes safely |
| 🧭 **[Use MCP with Hermes](/docs/guides/use-mcp-with-hermes)** | Practical MCP setup patterns, examples, and tutorials |
| 🎙️ **[Voice Mode](/docs/user-guide/features/voice-mode)** | Real-time voice interaction in CLI, Telegram, Discord, and Discord VC |
| 🗣️ **[Use Voice Mode with Hermes](/docs/guides/use-voice-mode-with-hermes)** | Hands-on setup and usage patterns for Hermes voice workflows |
| 🎭 **[Personality & SOUL.md](/docs/user-guide/features/personality)** | Define Hermes' default voice with a global SOUL.md |
| 📄 **[Context Files](/docs/user-guide/features/context-files)** | Project context files that shape every conversation |
| 🔒 **[Security](/docs/user-guide/security)** | Command approval, authorization, container isolation |


@@ -8,11 +8,13 @@ description: "Real-time voice conversations with Hermes Agent — CLI, Telegram,
Hermes Agent supports full voice interaction across CLI and messaging platforms. Talk to the agent using your microphone, hear spoken replies, and have live voice conversations in Discord voice channels.
If you want a practical setup walkthrough with recommended configurations and real usage patterns, see [Use Voice Mode with Hermes](/docs/guides/use-voice-mode-with-hermes).
## Prerequisites
Before using voice features, make sure you have:
1. **Hermes Agent installed**: `pip install hermes-agent` (see [Getting Started](../../getting-started.md))
1. **Hermes Agent installed**: `pip install hermes-agent` (see [Installation](/docs/getting-started/installation))
2. **An LLM provider configured** — set `OPENAI_API_KEY`, `OPENAI_BASE_URL`, and `LLM_MODEL` in `~/.hermes/.env`
3. **A working base setup** — run `hermes` to verify the agent responds to text before enabling voice


@@ -212,6 +212,11 @@ Hermes Agent supports Discord voice messages:
- **Incoming voice messages** are automatically transcribed using Whisper (requires `GROQ_API_KEY` or `VOICE_TOOLS_OPENAI_KEY` to be set in your environment).
- **Text-to-speech**: Use `/voice tts` to have the bot send spoken audio responses alongside text replies.
- **Discord voice channels**: Hermes can also join a voice channel, listen to users speaking, and talk back in the channel.
For the full setup and operational guide, see:
- [Voice Mode](/docs/user-guide/features/voice-mode)
- [Use Voice Mode with Hermes](/docs/guides/use-voice-mode-with-hermes)
## Troubleshooting


@@ -8,6 +8,8 @@ description: "Chat with Hermes from Telegram, Discord, Slack, WhatsApp, Signal,
Chat with Hermes from Telegram, Discord, Slack, WhatsApp, Signal, Email, Home Assistant, or your browser. The gateway is a single background process that connects to all your configured platforms, handles sessions, runs cron jobs, and delivers voice messages.
For the full voice feature set — including CLI microphone mode, spoken replies in messaging, and Discord voice-channel conversations — see [Voice Mode](/docs/user-guide/features/voice-mode) and [Use Voice Mode with Hermes](/docs/guides/use-voice-mode-with-hermes).
## Architecture
```text
@@ -77,6 +79,7 @@ hermes gateway status # Check service status
| `/usage` | Show token usage for this session |
| `/insights [days]` | Show usage insights and analytics |
| `/reasoning [level\|show\|hide]` | Change reasoning effort or toggle reasoning display |
| `/voice [on\|off\|tts\|join\|leave\|status]` | Control messaging voice replies and Discord voice-channel behavior |
| `/rollback [number]` | List or restore filesystem checkpoints |
| `/background <prompt>` | Run a prompt in a separate background session |
| `/reload-mcp` | Reload MCP servers from config |


@@ -24,6 +24,7 @@ const sidebars: SidebarsConfig = {
'guides/python-library',
'guides/use-mcp-with-hermes',
'guides/use-soul-with-hermes',
'guides/use-voice-mode-with-hermes',
],
},
{