Teknium 4ca6668daf docs: comprehensive update for recent merged PRs (#9019)

Audit and update documentation across 12 files to match changes from
~50 recently merged PRs. Key updates:

Slash commands (slash-commands.md):
- Add 5 missing commands: /snapshot, /fast, /image, /debug, /restart
- Fix /status incorrectly labeled as messaging-only (available in both)
- Add --global flag to /model docs
- Add [focus topic] arg to /compress docs

CLI commands (cli-commands.md):
- Add hermes debug share section with options and examples
- Add hermes backup section with --quick and --label flags
- Add hermes import section

Feature docs:
- TTS: document global tts.speed and per-provider speed for Edge/OpenAI
- Web dashboard: add docs for 5 missing pages (Sessions, Logs,
  Analytics, Cron, Skills) and 15+ API endpoints
- WhatsApp: add streaming, 4K chunking, and markdown formatting docs
- Skills: add GitHub rate-limit/GITHUB_TOKEN troubleshooting tip
- Budget: document CLI notification on iteration budget exhaustion

Config migration (compression.summary_* → auxiliary.compression.*):
- Update configuration.md, environment-variables.md,
  fallback-providers.md, cli.md, and context-compression-and-caching.md
- Replace legacy compression.summary_model/provider/base_url references
  with auxiliary.compression.model/provider/base_url
- Add legacy migration info boxes explaining auto-migration

Minor fixes:
- wecom-callback.md: clarify 'text only' limitation (input only)
- Escape {session_id}/{job_id} in web-dashboard.md headings for MDX

2026-04-13 10:50:59 -07:00

sidebar_position	title	description
9	Voice & TTS	Text-to-speech and voice message transcription across all platforms

Voice & TTS

Hermes Agent supports both text-to-speech output and voice message transcription across all messaging platforms.

Text-to-Speech

Convert text to speech with six providers:

Provider	Quality	Cost	API Key
Edge TTS (default)	Good	Free	None needed
ElevenLabs	Excellent	Paid	`ELEVENLABS_API_KEY`
OpenAI TTS	Good	Paid	`VOICE_TOOLS_OPENAI_KEY`
MiniMax TTS	Excellent	Paid	`MINIMAX_API_KEY`
Mistral (Voxtral TTS)	Excellent	Paid	`MISTRAL_API_KEY`
NeuTTS	Good	Free	None needed

Platform Delivery

Platform	Delivery	Format
Telegram	Voice bubble (plays inline)	Opus `.ogg`
Discord	Voice bubble (Opus/OGG), falls back to file attachment	Opus/MP3
WhatsApp	Audio file attachment	MP3
CLI	Saved to `~/.hermes/audio_cache/`	MP3

Configuration

# In ~/.hermes/config.yaml
tts:
  provider: "edge"              # "edge" | "elevenlabs" | "openai" | "minimax" | "mistral" | "neutts"
  speed: 1.0                    # Global speed multiplier (provider-specific settings override this)
  edge:
    voice: "en-US-AriaNeural"   # 322 voices, 74 languages
    speed: 1.0                  # Converted to rate percentage (+/-%)
  elevenlabs:
    voice_id: "pNInz6obpgDQGcFmaJgB"  # Adam
    model_id: "eleven_multilingual_v2"
  openai:
    model: "gpt-4o-mini-tts"
    voice: "alloy"              # alloy, echo, fable, onyx, nova, shimmer
    base_url: "https://api.openai.com/v1"  # Override for OpenAI-compatible TTS endpoints
    speed: 1.0                  # 0.25 - 4.0
  minimax:
    model: "speech-2.8-hd"     # speech-2.8-hd (default), speech-2.8-turbo
    voice_id: "English_Graceful_Lady"  # See https://platform.minimax.io/faq/system-voice-id
    speed: 1                    # 0.5 - 2.0
    vol: 1                      # 0 - 10
    pitch: 0                    # -12 - 12
  mistral:
    model: "voxtral-mini-tts-2603"
    voice_id: "c69964a6-ab8b-4f8a-9465-ec0925096ec8"  # Paul - Neutral (default)
  neutts:
    ref_audio: ''
    ref_text: ''
    model: neuphonic/neutts-air-q4-gguf
    device: cpu

Speed control: The global tts.speed value applies to every provider unless that provider sets its own speed (e.g., tts.openai.speed: 1.5), in which case the provider-specific value takes precedence. The default is 1.0 (normal speed).
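The precedence rule can be sketched in a few lines of Python. This is only an illustration, not Hermes's actual code: the function names resolve_speed and speed_to_edge_rate are hypothetical, and the Edge conversion assumes the multiplier maps linearly onto a signed percentage string such as "+50%".

```python
from typing import Any, Mapping


def resolve_speed(tts_config: Mapping[str, Any], provider: str) -> float:
    """Return the provider-specific speed if set, else the global one, else 1.0."""
    provider_section = tts_config.get(provider) or {}
    if "speed" in provider_section:
        return float(provider_section["speed"])
    return float(tts_config.get("speed", 1.0))


def speed_to_edge_rate(speed: float) -> str:
    """Express a speed multiplier as a signed percentage string (assumed mapping)."""
    percent = round((speed - 1.0) * 100)
    return f"{percent:+d}%"


# Mirrors the shape of the tts: section in config.yaml.
config = {
    "speed": 1.25,
    "openai": {"speed": 1.5},
    "edge": {"voice": "en-US-AriaNeural"},
}

print(resolve_speed(config, "openai"))                    # 1.5 (provider override)
print(resolve_speed(config, "edge"))                      # 1.25 (global value)
print(speed_to_edge_rate(resolve_speed(config, "edge")))  # +25%
```

Here OpenAI uses its own 1.5, while Edge has no speed of its own and inherits the global 1.25.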

Telegram Voice Bubbles & ffmpeg

Telegram voice bubbles require Opus/OGG audio format:

OpenAI, ElevenLabs, and Mistral produce Opus natively, so no extra setup is needed
Edge TTS (default) outputs MP3 and needs ffmpeg to convert it for Telegram voice bubbles
MiniMax TTS outputs MP3 and needs ffmpeg to convert it for Telegram voice bubbles
NeuTTS outputs WAV and also needs ffmpeg to convert it for Telegram voice bubbles

Install ffmpeg with your system package manager:

# Ubuntu/Debian
sudo apt install ffmpeg

# macOS
brew install ffmpeg

# Fedora
sudo dnf install ffmpeg

Without ffmpeg, audio from Edge TTS, MiniMax TTS, and NeuTTS is sent as a regular audio file (playable, but shown as a rectangular player instead of a voice bubble).

:::tip
If you want voice bubbles without installing ffmpeg, switch to the OpenAI, ElevenLabs, or Mistral provider.
:::
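The delivery decision for Telegram can be sketched as follows. The provider-to-format table comes from this section. The helper names, the return labels, and the exact ffmpeg arguments are assumptions for illustration and may differ from what Hermes runs internally.

```python
import shutil

# Native output format per provider, as described in this section.
NATIVE_FORMAT = {
    "openai": "opus",
    "elevenlabs": "opus",
    "mistral": "opus",
    "edge": "mp3",
    "minimax": "mp3",
    "neutts": "wav",
}


def needs_conversion(provider: str) -> bool:
    """True when the provider's audio must be re-encoded to Opus."""
    return NATIVE_FORMAT[provider] != "opus"


def ffmpeg_opus_command(src: str, dst: str) -> list[str]:
    """Build an ffmpeg command that re-encodes src with the libopus encoder."""
    return ["ffmpeg", "-y", "-i", src, "-c:a", "libopus", dst]


def telegram_delivery(provider: str, ffmpeg_available: bool) -> str:
    """Return 'voice_bubble' when Opus is obtainable, else 'audio_file'."""
    if not needs_conversion(provider) or ffmpeg_available:
        return "voice_bubble"
    return "audio_file"


# Check whether an ffmpeg binary is on PATH.
ffmpeg_available = shutil.which("ffmpeg") is not None
print(telegram_delivery("edge", ffmpeg_available))
print(ffmpeg_opus_command("reply.mp3", "reply.ogg"))
```

The command is only built, not executed, so the sketch runs on machines without ffmpeg installed.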

Voice Message Transcription (STT)

Voice messages sent on Telegram, Discord, WhatsApp, Slack, or Signal are automatically transcribed and injected as text into the conversation. The agent sees the transcript as normal text.

Provider	Quality	Cost	API Key
Local Whisper (default)	Good	Free	None needed
Groq Whisper API	Good–Best	Free tier	`GROQ_API_KEY`
OpenAI Whisper API	Good–Best	Paid	`VOICE_TOOLS_OPENAI_KEY` or `OPENAI_API_KEY`

:::info Zero Config
Local transcription works out of the box when faster-whisper is installed. If that's unavailable, Hermes can also use a local whisper CLI from common install locations (like /opt/homebrew/bin) or a custom command via HERMES_LOCAL_STT_COMMAND.
:::

Configuration

# In ~/.hermes/config.yaml
stt:
  provider: "local"           # "local" | "groq" | "openai" | "mistral"
  local:
    model: "base"             # tiny, base, small, medium, large-v3
  openai:
    model: "whisper-1"        # whisper-1, gpt-4o-mini-transcribe, gpt-4o-transcribe
  mistral:
    model: "voxtral-mini-latest"  # voxtral-mini-latest, voxtral-mini-2602

Provider Details

Local (faster-whisper) — Runs Whisper locally via faster-whisper. Uses CPU by default, GPU if available. Model sizes:

Model	Size	Speed	Quality
`tiny`	~75 MB	Fastest	Basic
`base`	~150 MB	Fast	Good (default)
`small`	~500 MB	Medium	Better
`medium`	~1.5 GB	Slower	Great
`large-v3`	~3 GB	Slowest	Best

Groq API — Requires GROQ_API_KEY. Good cloud fallback when you want a free hosted STT option.

OpenAI API — Accepts VOICE_TOOLS_OPENAI_KEY first and falls back to OPENAI_API_KEY. Supports whisper-1, gpt-4o-mini-transcribe, and gpt-4o-transcribe.

Mistral API (Voxtral Transcribe) — Requires MISTRAL_API_KEY. Uses Mistral's Voxtral Transcribe models. Supports 13 languages, speaker diarization, and word-level timestamps. Install with pip install hermes-agent[mistral].

Custom local CLI fallback — Set HERMES_LOCAL_STT_COMMAND if you want Hermes to call a local transcription command directly. The command template supports {input_path}, {output_dir}, {language}, and {model} placeholders.
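As a rough sketch of how such a template could be expanded, the snippet below fills the four documented placeholders and splits the result into an argument list. The template string is a hypothetical example built on the openai-whisper CLI flags, and build_stt_command is an illustrative helper. How Hermes itself performs the substitution is not specified here.

```python
import shlex

# Hypothetical value for HERMES_LOCAL_STT_COMMAND.
template = (
    "whisper {input_path} --model {model} "
    "--language {language} --output_dir {output_dir}"
)


def build_stt_command(template: str, input_path: str, output_dir: str,
                      language: str, model: str) -> list[str]:
    """Fill the documented placeholders and return an argv-style list."""
    filled = template.format(
        # Quote each value so paths containing spaces stay a single argument.
        input_path=shlex.quote(input_path),
        output_dir=shlex.quote(output_dir),
        language=shlex.quote(language),
        model=shlex.quote(model),
    )
    return shlex.split(filled)


cmd = build_stt_command(template, "/tmp/voice note.ogg", "/tmp/out", "en", "base")
print(cmd)
```

Quoting before splitting keeps "/tmp/voice note.ogg" together as one argument instead of breaking it at the space.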

Fallback Behavior

If your configured provider isn't available, Hermes automatically falls back:

Local faster-whisper unavailable → Tries a local whisper CLI or HERMES_LOCAL_STT_COMMAND before cloud providers
Groq key not set → Falls back to local transcription, then OpenAI
OpenAI key not set → Falls back to local transcription, then Groq
Mistral key not set or SDK not installed → Skipped in auto-detect; falls through to next available provider
Nothing available → Voice messages pass through with an accurate note to the user
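The fallback rules above can be modeled as an ordered candidate list per configured provider. This is a minimal sketch under stated assumptions: the function names are hypothetical, the cloud order used when local transcription is configured is a guess, and so is the order after Mistral, since neither is spelled out above.

```python
from typing import Mapping, Optional

# Candidate order per configured provider. The "groq" and "openai" rows follow
# the rules above; the "local" and "mistral" rows are assumed orderings.
FALLBACK_ORDER = {
    "local": ["local", "groq", "openai", "mistral"],
    "groq": ["groq", "local", "openai"],
    "openai": ["openai", "local", "groq"],
    "mistral": ["mistral", "local", "groq", "openai"],
}


def openai_key(env: Mapping[str, str]) -> Optional[str]:
    """Prefer VOICE_TOOLS_OPENAI_KEY, then OPENAI_API_KEY."""
    return env.get("VOICE_TOOLS_OPENAI_KEY") or env.get("OPENAI_API_KEY") or None


def is_available(provider: str, env: Mapping[str, str],
                 local_available: bool, mistral_sdk: bool) -> bool:
    if provider == "local":
        # Covers faster-whisper, a whisper CLI, or HERMES_LOCAL_STT_COMMAND.
        return local_available
    if provider == "groq":
        return bool(env.get("GROQ_API_KEY"))
    if provider == "openai":
        return openai_key(env) is not None
    if provider == "mistral":
        return bool(env.get("MISTRAL_API_KEY")) and mistral_sdk
    return False


def pick_stt_provider(configured: str, env: Mapping[str, str],
                      local_available: bool,
                      mistral_sdk: bool = False) -> Optional[str]:
    """Return the first usable provider, or None when nothing is available."""
    for candidate in FALLBACK_ORDER[configured]:
        if is_available(candidate, env, local_available, mistral_sdk):
            return candidate
    return None


print(pick_stt_provider("groq", {}, local_available=True))    # local
print(pick_stt_provider("local", {}, local_available=False))  # None
```

A None result corresponds to the last case in the list: no transcription happens and the voice message passes through with a note.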
