# Voice Output System

## Overview

The Nexus voice output system converts text reports and briefings into spoken audio.
It supports multiple TTS providers with automatic fallback so that audio generation
degrades gracefully when a provider is unavailable.

Primary use cases:
- **Deep Dive** daily briefings (`bin/deepdive_tts.py`)
- **Night Watch** nightly reports (`bin/night_watch.py --voice-memo`)

---

## Available Providers

### edge-tts (recommended default)

- **Cost:** Zero; no API key, no account required
- **Package:** `pip install "edge-tts>=6.1.9"` (quoted so the shell does not treat `>` as a redirect)
- **Default voice:** `en-US-GuyNeural`
- **Output format:** MP3
- **How it works:** Streams audio from Microsoft Edge's neural TTS service over HTTPS.
  No local model download required.
- **Available locales:** 100+ languages and locales. Full list:
  https://learn.microsoft.com/en-us/azure/ai-services/speech-service/language-support

Notable English voices:

| Voice ID | Style |
|---|---|
| `en-US-GuyNeural` | Neutral male (default) |
| `en-US-JennyNeural` | Warm female |
| `en-US-AriaNeural` | Expressive female |
| `en-GB-RyanNeural` | British male |

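
For callers that want to bypass the CLI, the following is a minimal sketch of using the `edge-tts` package directly from Python. It assumes the package's `edge_tts.Communicate(text, voice)` class and its `save()` coroutine; `synthesize` and `synthesize_sync` are hypothetical helper names for illustration and are not taken from `deepdive_tts.py`. Running it requires network access, since audio is streamed from the Edge service.

```python
import asyncio

DEFAULT_VOICE = "en-US-GuyNeural"  # default voice named in the list above


async def synthesize(text: str, out_path: str, voice: str = DEFAULT_VOICE) -> str:
    """Stream speech for `text` from the Edge service and save it as MP3."""
    # Imported lazily so this module still loads when edge-tts is not installed.
    import edge_tts

    communicate = edge_tts.Communicate(text, voice)
    await communicate.save(out_path)
    return out_path


def synthesize_sync(text: str, out_path: str, voice: str = DEFAULT_VOICE) -> str:
    """Blocking wrapper for callers that are not already inside an event loop."""
    return asyncio.run(synthesize(text, out_path, voice))


# Example (needs network access and the edge-tts package):
# synthesize_sync("Good morning.", "/tmp/briefing.mp3", voice="en-US-JennyNeural")
```
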
### piper

- **Cost:** Free, fully offline
- **Package:** `pip install piper-tts` + model download (~65 MB)
- **Model location:** `~/.local/share/piper/en_US-lessac-medium.onnx`
- **Output format:** WAV → MP3 (requires `lame`)
- **Sovereignty:** Fully local; no network calls after model download

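
The WAV → MP3 step might look roughly like the sketch below. It assumes the `lame` binary is on `PATH` and uses its basic `lame <input.wav> <output.mp3>` invocation; `lame_command` and `wav_to_mp3` are hypothetical names, and this document does not show how the project's own scripts call `lame`.

```python
import shutil
import subprocess
from pathlib import Path
from typing import List


def lame_command(wav_path: str, mp3_path: str) -> List[str]:
    """Build the lame invocation: input WAV first, output MP3 second."""
    return ["lame", "--quiet", str(wav_path), str(mp3_path)]


def wav_to_mp3(wav_path: str) -> Path:
    """Encode a Piper WAV file to MP3 next to it and return the MP3 path."""
    if shutil.which("lame") is None:
        raise RuntimeError("lame not found on PATH; install it to get MP3 output")
    mp3_path = Path(wav_path).with_suffix(".mp3")
    # check=True raises CalledProcessError if the encoder exits non-zero.
    subprocess.run(lame_command(wav_path, str(mp3_path)), check=True)
    return mp3_path
```
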
### elevenlabs

- **Cost:** Usage-based (paid)
- **Requirement:** `ELEVENLABS_API_KEY` environment variable
- **Output format:** MP3
- **Quality:** Highest quality of the three providers in the fallback chain (edge-tts, piper, elevenlabs)

### openai

- **Cost:** Usage-based (paid)
- **Requirement:** `OPENAI_API_KEY` environment variable
- **Output format:** MP3
- **Default voice:** `alloy`

---

## Usage: deepdive_tts.py

```bash
# Use edge-tts (zero cost)
DEEPDIVE_TTS_PROVIDER=edge-tts python bin/deepdive_tts.py --text "Good morning."

# Specify a different Edge voice
python bin/deepdive_tts.py --provider edge-tts --voice en-US-JennyNeural --text "Hello world."

# Read from a file
python bin/deepdive_tts.py --provider edge-tts --input-file /tmp/briefing.txt --output /tmp/briefing

# Use OpenAI
OPENAI_API_KEY=sk-... python bin/deepdive_tts.py --provider openai --voice nova --text "Hello."

# Use ElevenLabs
ELEVENLABS_API_KEY=... python bin/deepdive_tts.py --provider elevenlabs --voice rachel --text "Hello."

# Use local Piper (offline)
python bin/deepdive_tts.py --provider piper --text "Hello."
```

Provider and voice can also be set via environment variables:

```bash
export DEEPDIVE_TTS_PROVIDER=edge-tts
export DEEPDIVE_TTS_VOICE=en-GB-RyanNeural
python bin/deepdive_tts.py --text "Good evening."
```

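
One common way to wire up this kind of flag-or-environment configuration is to use the environment variables as `argparse` defaults, as in the sketch below. This is an illustration, not the actual parser in `deepdive_tts.py`: `build_parser` is a hypothetical name, and both the precedence (an explicit flag overrides the environment) and the `edge-tts` fallback default are assumptions.

```python
import argparse
import os
from typing import Mapping, Optional


def build_parser(env: Optional[Mapping[str, str]] = None) -> argparse.ArgumentParser:
    """Build a parser whose defaults come from DEEPDIVE_TTS_* variables."""
    env = os.environ if env is None else env
    parser = argparse.ArgumentParser(description="Text-to-speech for briefings")
    # Assumption: a flag given on the command line wins over the environment.
    parser.add_argument(
        "--provider",
        default=env.get("DEEPDIVE_TTS_PROVIDER", "edge-tts"),
    )
    # None means "use the selected provider's own default voice".
    parser.add_argument("--voice", default=env.get("DEEPDIVE_TTS_VOICE"))
    parser.add_argument("--text")
    return parser


# Example: the environment selects piper, no flag overrides it.
args = build_parser({"DEEPDIVE_TTS_PROVIDER": "piper"}).parse_args(["--text", "Hi"])
print(args.provider)
```
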
---

## Usage: Night Watch --voice-memo

The `--voice-memo` flag causes Night Watch to generate an MP3 audio summary of the
nightly report immediately after writing the markdown file.

```bash
python bin/night_watch.py --voice-memo
```

Output location: `/tmp/bezalel/night-watch-<YYYY-MM-DD>.mp3`

The voice memo:
- Strips markdown formatting (`#`, `|`, `*`, `---`) for cleaner speech
- Uses `edge-tts` with the `en-US-GuyNeural` voice
- Is non-fatal: if TTS fails, the markdown report is still written normally

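
The stripping and non-fatal behaviour listed above can be sketched as follows. This is an illustrative approximation, not the code in `night_watch.py`: `strip_markdown` and `try_voice_memo` are hypothetical names, the regular expressions are one plausible way to remove `#`, `|`, `*`, and `---`, and `synthesize` stands in for whatever function actually produces the MP3.

```python
import logging
import re
from typing import Callable, Optional

log = logging.getLogger("night_watch")


def strip_markdown(text: str) -> str:
    """Remove heading markers, table pipes, asterisks, and rule lines."""
    lines = []
    for line in text.splitlines():
        stripped = line.strip()
        # Drop horizontal rules and table separator rows such as |---|---|
        if re.fullmatch(r"[-|:\s]+", stripped) and "-" in stripped:
            continue
        cleaned = re.sub(r"^#+\s*", "", stripped)  # leading heading markers
        cleaned = cleaned.replace("|", " ").replace("*", "")
        cleaned = re.sub(r"\s+", " ", cleaned).strip()
        if cleaned:
            lines.append(cleaned)
    return "\n".join(lines)


def try_voice_memo(report_md: str, synthesize: Callable[[str], str]) -> Optional[str]:
    """Return the audio path, or None if TTS fails; the report is unaffected."""
    try:
        return synthesize(strip_markdown(report_md))
    except Exception as exc:  # deliberately broad: any TTS error is non-fatal
        log.warning("voice memo skipped: %s", exc)
        return None
```
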
Example crontab with voice memo (kept on a single line, because crontab entries do not support backslash line continuation):

```cron
0 3 * * * cd /path/to/the-nexus && python bin/night_watch.py --voice-memo >> /var/log/bezalel/night-watch.log 2>&1
```

---

## Fallback Chain

`HybridTTS` (used by `tts_engine.py`) attempts providers in this order:

1. **edge-tts**: zero cost, no API key
2. **piper**: offline local model (if model file present)
3. **elevenlabs**: cloud fallback (if `ELEVENLABS_API_KEY` set)

If `prefer_cloud=True` is passed, the order becomes: elevenlabs → piper.

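
The ordering above can be sketched as a simple try-in-sequence loop. This is a sketch of the general pattern only, not the real `HybridTTS` interface: `provider_order`, `synthesize_with_fallback`, and the `(text, output_path) -> path` provider signature are all hypothetical. A provider that is not configured (no model file, no API key) is modelled here by simply leaving it out of the `providers` mapping.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical provider signature: (text, output_path) -> path, raises on failure.
Provider = Callable[[str, str], str]


def provider_order(prefer_cloud: bool = False) -> List[str]:
    """Return provider names in the order described above."""
    if prefer_cloud:
        return ["elevenlabs", "piper"]
    return ["edge-tts", "piper", "elevenlabs"]


def synthesize_with_fallback(
    text: str,
    output_path: str,
    providers: Dict[str, Provider],
    prefer_cloud: bool = False,
) -> Tuple[str, str]:
    """Try each available provider in order; return (provider name, path)."""
    errors = []
    for name in provider_order(prefer_cloud):
        provider = providers.get(name)
        if provider is None:  # not configured, so skip it
            continue
        try:
            return name, provider(text, output_path)
        except Exception as exc:  # record the failure and fall through
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all TTS providers failed: " + "; ".join(errors))
```
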
---

## Phase 3 TODO

Evaluate **fish-speech** and **F5-TTS** as fully offline, sovereign alternatives
with higher voice quality than Piper. These models run locally with no network
dependency whatsoever, providing complete independence from Microsoft's Edge service.

Tracking: to be filed as a follow-up to issue #830.