Restore local STT command fallback for voice transcription, detect whisper and ffmpeg in common local install paths, and avoid bogus no-provider messaging when only a backend-specific key is missing.
121 lines
4.3 KiB
Markdown
121 lines
4.3 KiB
Markdown
---
|
||
sidebar_position: 9
|
||
title: "Voice & TTS"
|
||
description: "Text-to-speech and voice message transcription across all platforms"
|
||
---
|
||
|
||
# Voice & TTS
|
||
|
||
Hermes Agent supports both text-to-speech output and voice message transcription across all messaging platforms.
|
||
|
||
## Text-to-Speech
|
||
|
||
Convert text to speech with three providers:
|
||
|
||
| Provider | Quality | Cost | API Key |
|
||
|----------|---------|------|---------|
|
||
| **Edge TTS** (default) | Good | Free | None needed |
|
||
| **ElevenLabs** | Excellent | Paid | `ELEVENLABS_API_KEY` |
|
||
| **OpenAI TTS** | Good | Paid | `VOICE_TOOLS_OPENAI_KEY` |
|
||
|
||
### Platform Delivery
|
||
|
||
| Platform | Delivery | Format |
|
||
|----------|----------|--------|
|
||
| Telegram | Voice bubble (plays inline) | Opus `.ogg` |
|
||
| Discord | Audio file attachment | MP3 |
|
||
| WhatsApp | Audio file attachment | MP3 |
|
||
| CLI | Saved to `~/.hermes/audio_cache/` | MP3 |
|
||
|
||
### Configuration
|
||
|
||
```yaml
|
||
# In ~/.hermes/config.yaml
|
||
tts:
|
||
provider: "edge" # "edge" | "elevenlabs" | "openai"
|
||
edge:
|
||
voice: "en-US-AriaNeural" # 322 voices, 74 languages
|
||
elevenlabs:
|
||
voice_id: "pNInz6obpgDQGcFmaJgB" # Adam
|
||
model_id: "eleven_multilingual_v2"
|
||
openai:
|
||
model: "gpt-4o-mini-tts"
|
||
voice: "alloy" # alloy, echo, fable, onyx, nova, shimmer
|
||
```
|
||
|
||
### Telegram Voice Bubbles & ffmpeg
|
||
|
||
Telegram voice bubbles require Opus/OGG audio format:
|
||
|
||
- **OpenAI and ElevenLabs** produce Opus natively — no extra setup
|
||
- **Edge TTS** (default) outputs MP3 and needs **ffmpeg** to convert:
|
||
|
||
```bash
|
||
# Ubuntu/Debian
|
||
sudo apt install ffmpeg
|
||
|
||
# macOS
|
||
brew install ffmpeg
|
||
|
||
# Fedora
|
||
sudo dnf install ffmpeg
|
||
```
|
||
|
||
Without ffmpeg, Edge TTS audio is sent as a regular audio file (playable, but shows as a rectangular player instead of a voice bubble).
|
||
|
||
:::tip
|
||
If you want voice bubbles without installing ffmpeg, switch to the OpenAI or ElevenLabs provider.
|
||
:::
|
||
|
||
## Voice Message Transcription (STT)
|
||
|
||
Voice messages sent on Telegram, Discord, WhatsApp, Slack, or Signal are automatically transcribed and injected as text into the conversation. The agent sees the transcript as normal text.
|
||
|
||
| Provider | Quality | Cost | API Key |
|
||
|----------|---------|------|---------|
|
||
| **Local Whisper** (default) | Good | Free | None needed |
|
||
| **Groq Whisper API** | Good–Best | Free tier | `GROQ_API_KEY` |
|
||
| **OpenAI Whisper API** | Good–Best | Paid | `VOICE_TOOLS_OPENAI_KEY` or `OPENAI_API_KEY` |
|
||
|
||
:::info Zero Config
|
||
Local transcription works out of the box when `faster-whisper` is installed. If that's unavailable, Hermes can also use a local `whisper` CLI from common install locations (like `/opt/homebrew/bin`) or a custom command via `HERMES_LOCAL_STT_COMMAND`.
|
||
:::
|
||
|
||
### Configuration
|
||
|
||
```yaml
|
||
# In ~/.hermes/config.yaml
|
||
stt:
|
||
provider: "local" # "local" | "groq" | "openai"
|
||
local:
|
||
model: "base" # tiny, base, small, medium, large-v3
|
||
openai:
|
||
model: "whisper-1" # whisper-1, gpt-4o-mini-transcribe, gpt-4o-transcribe
|
||
```
|
||
|
||
### Provider Details
|
||
|
||
**Local (faster-whisper)** — Runs Whisper locally via [faster-whisper](https://github.com/SYSTRAN/faster-whisper). Uses CPU by default, GPU if available. Model sizes:
|
||
|
||
| Model | Size | Speed | Quality |
|
||
|-------|------|-------|---------|
|
||
| `tiny` | ~75 MB | Fastest | Basic |
|
||
| `base` | ~150 MB | Fast | Good (default) |
|
||
| `small` | ~500 MB | Medium | Better |
|
||
| `medium` | ~1.5 GB | Slower | Great |
|
||
| `large-v3` | ~3 GB | Slowest | Best |
|
||
|
||
**Groq API** — Requires `GROQ_API_KEY`. Good cloud fallback when you want a free hosted STT option.
|
||
|
||
**OpenAI API** — Accepts `VOICE_TOOLS_OPENAI_KEY` first and falls back to `OPENAI_API_KEY`. Supports `whisper-1`, `gpt-4o-mini-transcribe`, and `gpt-4o-transcribe`.
|
||
|
||
**Custom local CLI fallback** — Set `HERMES_LOCAL_STT_COMMAND` if you want Hermes to call a local transcription command directly. The command template supports `{input_path}`, `{output_dir}`, `{language}`, and `{model}` placeholders.
|
||
|
||
### Fallback Behavior
|
||
|
||
If your configured provider isn't available, Hermes automatically falls back:
|
||
- **Local faster-whisper unavailable** → Tries a local `whisper` CLI or `HERMES_LOCAL_STT_COMMAND` before cloud providers
|
||
- **Groq key not set** → Falls back to local transcription, then OpenAI
|
||
- **OpenAI key not set** → Falls back to local transcription, then Groq
|
||
- **Nothing available** → Voice messages pass through with an accurate note to the user
|