* fix: Home Assistant event filtering now closed by default

  Previously, when no watch_domains or watch_entities were configured, ALL state_changed events passed through to the agent, causing users to be flooded with notifications for every HA entity change. Now events are dropped by default unless the user explicitly configures:

  - watch_domains: list of domains to monitor (e.g. climate, light)
  - watch_entities: list of specific entity IDs to monitor
  - watch_all: true (new option — opt-in to receive all events)

  A warning is logged at connect time if no filters are configured, guiding users to set up their HA platform config. All 49 gateway HA tests + 52 HA tool tests pass.

* docs: update Home Assistant integration documentation

  - homeassistant.md: fix event filtering docs to reflect closed-by-default behavior; add watch_all option; replace Python dict config example with YAML; fix defaults table (was incorrectly showing 'all'); add required-configuration warning admonition.
  - environment-variables.md: add HASS_TOKEN and HASS_URL to the Messaging section.
  - messaging/index.md: add Home Assistant to the description, architecture diagram, platform toolsets table, and Next Steps links.

* fix(terminal): strip provider env vars from background and PTY subprocesses

  Extends the env var blocklist from #1157 to also cover the two remaining leaky paths in process_registry.py:

  - spawn_local() PTY path (line 156)
  - spawn_local() background Popen path (line 197)

  Both were still using raw os.environ, leaking provider vars to background processes and interactive PTY sessions. Now uses the same dynamic _HERMES_PROVIDER_ENV_BLOCKLIST from local.py. Explicit env_vars passed to spawn_local() still override the blocklist, matching the existing behavior for callers that intentionally need these. Gap identified by PR #1004 (@PeterFile).
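  A minimal sketch of the opt-in filter config described above (the option names are from the commit; the exact nesting under the Home Assistant platform section, and the example entity ID, are assumptions):

  ```yaml
  # Hypothetical placement inside the HA platform config
  homeassistant:
    watch_domains:        # only state changes from these domains reach the agent
      - climate
      - light
    watch_entities:       # and/or specific entity IDs (example ID is illustrative)
      - sensor.front_door
    # watch_all: true     # opt-in: receive every state_changed event (old default)
  ```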
* feat(delegate): add observability metadata to subagent results

  Enrich delegate_task results with metadata from the child AIAgent:

  - model: which model the child used
  - exit_reason: completed | interrupted | max_iterations
  - tokens.input / tokens.output: token counts
  - tool_trace: per-tool-call trace with byte sizes and ok/error status

  Tool trace uses tool_call_id matching to correctly pair parallel tool calls with their results, with a fallback for messages without IDs.

  Cherry-picked from PR #872 by @omerkaz, with fixes:

  - Fixed parallel tool call trace pairing (was always updating the last entry)
  - Removed redundant 'iterations' field (identical to existing 'api_calls')
  - Added test for parallel tool call trace correctness

  Co-authored-by: omerkaz <omerkaz@users.noreply.github.com>

* feat(stt): add free local whisper transcription via faster-whisper

  Replace OpenAI-only STT with a dual-provider system mirroring the TTS architecture (Edge TTS free / ElevenLabs paid):

  STT: faster-whisper local (free, default) / OpenAI Whisper API (paid)

  Changes:

  - tools/transcription_tools.py: full rewrite with provider dispatch, config loading, local faster-whisper backend, and OpenAI API backend. Auto-downloads the model (~150 MB for 'base') on first voice message. Singleton model instance reused across calls.
  - pyproject.toml: add faster-whisper>=1.0.0 as a core dependency
  - hermes_cli/config.py: expand stt config to match the TTS pattern, with provider selection and per-provider model settings
  - agent/context_compressor.py: fix .strip() crash when the LLM returns non-string content (dict from llama.cpp, None). Partially fixes #1100.
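  The tool_call_id pairing fix described above can be sketched as follows. This is an illustrative reconstruction, not the actual implementation; names like `build_tool_trace` and the dict shapes are assumptions:

  ```python
  # Sketch: pair each tool result with its originating call by tool_call_id,
  # falling back to "first unresolved entry" for messages without IDs.
  def build_tool_trace(calls, results):
      """calls: list of {"id", "name"}; results: list of {"tool_call_id", "ok", "bytes"}."""
      trace = [{"id": c.get("id"), "name": c["name"], "ok": None, "bytes": 0}
               for c in calls]
      by_id = {e["id"]: e for e in trace if e["id"] is not None}
      for r in results:
          entry = by_id.get(r.get("tool_call_id"))
          if entry is None:
              # Fallback for results without IDs: attach to the first unpaired entry
              entry = next((e for e in trace if e["ok"] is None), None)
          if entry is not None:
              entry["ok"] = r["ok"]
              entry["bytes"] = r["bytes"]
      return trace
  ```

  Pairing by ID means parallel calls whose results arrive out of order are still attributed correctly; the original bug (always updating the last trace entry) only worked by accident for sequential calls.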
  - tests/: 23 new tests for STT providers + 2 for the compressor fix
  - docs/: updated Voice & TTS page with STT provider table, model sizes, config examples, and fallback behavior

  Fallback behavior:

  - Local not installed → OpenAI API (if key set)
  - OpenAI key not set → local whisper (if installed)
  - Neither → graceful error message to the user

  Co-authored-by: Jah-yee <Jah-yee@users.noreply.github.com>

---------

Co-authored-by: omerkaz <omerkaz@users.noreply.github.com>
Co-authored-by: Jah-yee <Jah-yee@users.noreply.github.com>
---
sidebar_position: 9
title: "Voice & TTS"
description: "Text-to-speech and voice message transcription across all platforms"
---

# Voice & TTS

Hermes Agent supports both text-to-speech output and voice message transcription across all messaging platforms.

## Text-to-Speech

Convert text to speech with three providers:

| Provider | Quality | Cost | API Key |
|----------|---------|------|---------|
| **Edge TTS** (default) | Good | Free | None needed |
| **ElevenLabs** | Excellent | Paid | `ELEVENLABS_API_KEY` |
| **OpenAI TTS** | Good | Paid | `VOICE_TOOLS_OPENAI_KEY` |

### Platform Delivery

| Platform | Delivery | Format |
|----------|----------|--------|
| Telegram | Voice bubble (plays inline) | Opus `.ogg` |
| Discord | Audio file attachment | MP3 |
| WhatsApp | Audio file attachment | MP3 |
| CLI | Saved to `~/.hermes/audio_cache/` | MP3 |

### Configuration

```yaml
# In ~/.hermes/config.yaml
tts:
  provider: "edge"  # "edge" | "elevenlabs" | "openai"
  edge:
    voice: "en-US-AriaNeural"  # 322 voices, 74 languages
  elevenlabs:
    voice_id: "pNInz6obpgDQGcFmaJgB"  # Adam
    model_id: "eleven_multilingual_v2"
  openai:
    model: "gpt-4o-mini-tts"
    voice: "alloy"  # alloy, echo, fable, onyx, nova, shimmer
```

### Telegram Voice Bubbles & ffmpeg

Telegram voice bubbles require Opus/OGG audio format:

- **OpenAI and ElevenLabs** produce Opus natively — no extra setup
- **Edge TTS** (default) outputs MP3 and needs **ffmpeg** to convert:

```bash
# Ubuntu/Debian
sudo apt install ffmpeg

# macOS
brew install ffmpeg

# Fedora
sudo dnf install ffmpeg
```

Without ffmpeg, Edge TTS audio is sent as a regular audio file (playable, but shows as a rectangular player instead of a voice bubble).

:::tip
If you want voice bubbles without installing ffmpeg, switch to the OpenAI or ElevenLabs provider.
:::

## Voice Message Transcription (STT)

Voice messages sent on Telegram, Discord, WhatsApp, Slack, or Signal are automatically transcribed and injected as text into the conversation. The agent sees the transcript as normal text.

| Provider | Quality | Cost | API Key |
|----------|---------|------|---------|
| **Local Whisper** (default) | Good | Free | None needed |
| **OpenAI Whisper API** | Good–Best | Paid | `VOICE_TOOLS_OPENAI_KEY` |

:::info Zero Config
Local transcription works out of the box — no API key needed. The `faster-whisper` model (~150 MB for `base`) is auto-downloaded on first voice message.
:::

### Configuration

```yaml
# In ~/.hermes/config.yaml
stt:
  provider: "local"  # "local" (free, faster-whisper) | "openai" (API)
  local:
    model: "base"  # tiny, base, small, medium, large-v3
  openai:
    model: "whisper-1"  # whisper-1, gpt-4o-mini-transcribe, gpt-4o-transcribe
```

### Provider Details

**Local (faster-whisper)** — Runs Whisper locally via [faster-whisper](https://github.com/SYSTRAN/faster-whisper). Uses CPU by default, GPU if available. Model sizes:

| Model | Size | Speed | Quality |
|-------|------|-------|---------|
| `tiny` | ~75 MB | Fastest | Basic |
| `base` | ~150 MB | Fast | Good (default) |
| `small` | ~500 MB | Medium | Better |
| `medium` | ~1.5 GB | Slower | Great |
| `large-v3` | ~3 GB | Slowest | Best |

**OpenAI API** — Requires `VOICE_TOOLS_OPENAI_KEY`. Supports `whisper-1`, `gpt-4o-mini-transcribe`, and `gpt-4o-transcribe`.

### Fallback Behavior

If your configured provider isn't available, Hermes automatically falls back:

- **Local not installed** → Falls back to OpenAI API (if key is set)
- **OpenAI key not set** → Falls back to local Whisper (if installed)
- **Neither available** → Voice messages pass through with a note to the user
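The fallback chain can be sketched as a simple selection function. This is an illustrative sketch, not Hermes internals; the function name and flags are assumptions:

```python
def pick_stt_provider(preferred, local_installed, openai_key_set):
    """Return 'local', 'openai', or None, following the documented fallback chain."""
    available = {
        "local": local_installed,     # faster-whisper importable
        "openai": openai_key_set,     # VOICE_TOOLS_OPENAI_KEY present
    }
    if available.get(preferred):
        return preferred
    # Configured provider unavailable: fall back to whichever one works
    for name, ok in available.items():
        if ok:
            return name
    return None  # neither available: caller surfaces a note to the user
```

Returning `None` rather than raising keeps voice messages flowing through the pipeline untranscribed, matching the "pass through with a note" behavior above.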