Salvaged PR #1052 onto current main with the contributor commit preserved plus a small follow-up for current-main conflict resolution and safe command quoting.
Follow up on salvaged PR #1052.
Restore current-main gateway lifecycle handling after conflict resolution and
adapt the update fallback to use shell-quoted argv parts safely.
Cancel any queued media-group flush tasks during Telegram adapter disconnect
and clear the buffered events map so shutdown can't leave a pending album
flush behind. Add a regression test covering disconnect before the debounce
window expires.
When shutil.which('hermes') returns None, _resolve_hermes_bin() now tries
sys.executable -m hermes_cli.main as a fallback. This handles setups where
Hermes is launched via a venv or module invocation and the hermes symlink is
not on PATH for the gateway process.
Fixes#1049
Telegram albums arrive as multiple updates with a shared media_group_id.
Previously each image triggered a separate MessageEvent, causing the agent
to interrupt itself when describing the first image.
- Add 0.8s debounce window for media group items
- Merge attachments into single MessageEvent
- Add regression test for photo album buffering
Salvages the two still-relevant fixes from PR #993 onto current main:
- use a 3-tuple LOCAL delivery key so explicit/local-origin targets are not duplicated
- shut down the previous agent-loop ThreadPoolExecutor when resizing the global pool
Adds regression tests for both behaviors.
Prevent gateway.platforms.discord from crashing at import time when discord.py is unavailable. Python 3.11 eagerly evaluates annotations, so using discord.Interaction and similar annotations caused an AttributeError after the optional import fallback set discord=None. Add postponed annotation evaluation and a regression test covering import without discord installed.
- store gateway PID metadata and validate the live process before trusting gateway.pid
- auto-refresh outdated systemd user units before start/restart so installs pick up --replace fixes
- sweep stray manual gateway processes after service stops
- add regression tests for PID validation and service drift recovery
Use getattr() when returning model metadata from GatewayRunner._run_agent so fake agents and minimal stubs without a model attribute do not break unrelated gateway flows while preserving the session-model backfill behavior.
Gateway sessions end up with model=NULL because the session row is
created before AIAgent is constructed. After the agent responds,
update_session() writes token counts but never fills in the model.
Thread agent.model through _run_agent()'s return dict into
update_session() → update_token_counts(). The SQL uses
COALESCE(model, ?) so it only fills NULL rows — never overwrites
a model already set at creation time (e.g. CLI sessions).
If the agent falls back to a different provider, agent.model is
updated in-place by _try_activate_fallback(), so the recorded value
reflects whichever model actually produced the response.
Fixes#987
- Use imap.uid() for search and fetch instead of imap.search/fetch.
Sequence numbers shift when messages are deleted, causing the adapter
to skip new messages or reprocess old ones. UIDs are stable.
- Pass ssl.create_default_context() to starttls() so the server
certificate is actually verified. Without it smtplib uses
ssl._create_stdlib_context() which skips verification.
- keep CLI voice prefixes API-local while storing the original user text
- persist explicit gateway off state and restore adapter auto-TTS suppression on restart
- add regression coverage for both behaviors
1. Anthropic + ElevenLabs TTS silence: forward full response to TTS
callback for non-streaming providers (choices first, then native
content blocks fallback).
2. Subprocess timeout kill: play_audio_file now kills the process on
TimeoutExpired instead of leaving zombie processes.
3. Discord disconnect cleanup: leave all voice channels before closing
the client to prevent leaked state.
4. Audio stream leak: close InputStream if stream.start() fails.
5. Race condition: read/write _on_silence_stop under lock in audio
callback thread.
6. _vprint force=True: show API error, retry, and truncation messages
even during streaming TTS.
7. _refresh_level lock: read _voice_recording under _voice_lock.
1. Gate _streaming_api_call to chat_completions mode only — Anthropic and
Codex fall back to _interruptible_api_call. Preserve Anthropic base_url
across all client rebuild paths (interrupt, fallback, 401 refresh).
2. Discord VC synthetic events now use chat_type="channel" instead of
defaulting to "dm" — prevents session bleed into DM context.
Authorization runs before echoing transcript. Sanitize @everyone/@here
in voice transcripts.
3. CLI voice prefix ("[Voice input...]") is now API-call-local only —
stripped from returned history so it never persists to session DB or
resumed sessions.
4. /voice off now disables base adapter auto-TTS via _auto_tts_disabled_chats
set — voice input no longer triggers TTS when voice mode is off.
Remove web UI gateway (web.py, tests, docs, toolset, env vars, Platform.WEB
enum) per maintainer request — Nous is building their own official chat UI.
Fix 1: Replace sd.wait() with polling pattern in play_audio_file() to prevent
indefinite hang when audio device stalls (consistent with play_beep()).
Fix 2: Use importlib.util.find_spec() for faster_whisper/openai availability
checks instead of module-level imports that trigger heavy native library
loading (CUDA/cuDNN) at import time.
Fix 3: Remove inspect.signature() hack in _send_voice_reply() — add **kwargs
to Telegram send_voice() so all adapters accept metadata uniformly.
Fix 4: Make session loading resilient to removed platform enum values — skip
entries with unknown platforms instead of crashing the entire gateway.
- web.py: pass stt_model from config like discord.py and run.py do
- run.py: match new error messages (No STT provider / not set)
- _transcribe_local: add missing "provider": "local" to return dict
- Change RTP packet logging from INFO to DEBUG level to reduce noise
(SPEAKING events remain at INFO as they are important lifecycle events)
- Use per-session chat_id (web_{session_id}) instead of shared "web"
to isolate conversation context between simultaneous web users
When bound to 127.0.0.1, only show localhost URL instead of listing
unreachable network interfaces. Add hint about WEB_UI_HOST=0.0.0.0
for phone/tablet access. Add VPN/multi-interface and token exposure
tests (11 new tests).
Only print the access token when auto-generated (user needs it to
log in). When set via WEB_UI_TOKEN env var, just confirm it is set
without exposing the value in console output.
- Use hmac.compare_digest for timing-safe token comparison (3 endpoints)
- Default bind to 127.0.0.1 instead of 0.0.0.0
- Sanitize upload filenames with Path.name to prevent path traversal
- Add DOMPurify to sanitize marked.parse() output against XSS
- Replace add_static with authenticated media handler
- Hide token in group chats for /remote-control command
- Use ctypes.util.find_library for Opus instead of hardcoded paths
- Add force=True to 5 interrupt _vprint calls for visibility
- Log Opus decode errors and voice restart failures instead of swallowing
Duplicated YAML config parsing for stt.model existed in gateway/run.py
and gateway/platforms/discord.py. Moved to a single helper in
transcription_tools.py and added 5 tests covering all edge cases.
Code fixes:
- STT model, Groq base URL, and OpenAI STT base URL are now
configurable via env vars (STT_GROQ_MODEL, STT_OPENAI_MODEL,
GROQ_BASE_URL, STT_OPENAI_BASE_URL) instead of hardcoded
- Gateway and Discord VC now read stt.model from config.yaml
(previously only CLI did this — gateway always used defaults)
Doc fixes:
- voice-mode.md: move Web UI troubleshooting to web.md (was duplicated)
- voice-mode.md: simplify "How It Works" for end users (remove NaCl,
DAVE, RTP internals)
- voice-mode.md: clarify STT priority (OpenAI used first if both keys
set, Groq recommended for free tier)
- voice-mode.md: document new STT env overrides in config reference
- web.md: remove duplicate Quick Start / Step 1-3 sections
- web.md: add mobile HTTPS mic workarounds (moved from voice-mode.md)
- web.md: clarify STT fallback order
1. VoiceReceiver.stop() now acquires _lock before clearing shared state
to prevent race with _on_packet on the socket reader thread
2. _packet_debug_count moved from class-level to instance-level to avoid
cross-instance race condition in multi-guild setups
3. play_in_voice_channel uses asyncio.get_running_loop() instead of
deprecated asyncio.get_event_loop()
4. _send_voice_reply uses uuid for filenames instead of time-based names
that can collide when two replies happen in the same second
5. Voice timeout now notifies runner via _on_voice_disconnect callback
so runner cleans up _voice_mode state (prevents orphaned TTS replies)
6. play_in_voice_channel adds PLAYBACK_TIMEOUT (120s) to prevent
infinite blocking when FFmpeg callback is never called
7. _send_voice_reply moves temp file cleanup to finally block so files
are always cleaned up even when send_voice/play raises
8. Base adapter auto-TTS wraps play_tts in try/finally with os.remove
to clean up generated audio files after playback
18 new tests (120 total voice tests)
- Add lock protection around VoiceReceiver buffer writes in _on_packet
to prevent race condition with check_silence on different threads
- Wire _voice_input_callback BEFORE join_voice_channel to avoid
losing voice input during the join window
- Add try/except around leave_voice_channel to ensure state cleanup
(voice_mode, callback) even if leave raises an exception
- Guard against empty text after markdown stripping in base.py auto-TTS
- Add 11 tests proving each bug and verifying the fix
Mobile browsers require HTTPS for navigator.mediaDevices API.
Instead of hiding the mic button (confusing UX), show it as dimmed
and display an informative message when tapped explaining the HTTPS
requirement.
When bot is in a Discord voice channel, both base auto-TTS and Discord
play_tts override skip audio. The skip_double guard was also blocking
the runner's _send_voice_reply, resulting in zero audio output in VC.
Now skip_double is overridden when the bot is actively connected to a
voice channel, allowing play_in_voice_channel to handle TTS.
Add comprehensive test matrix covering all platform x input x mode
combinations with full decision table documentation.
Base adapter auto-TTS already generates and sends audio for voice
messages in _process_message_background. The gateway runner's
_send_voice_reply was causing double audio on all platforms (not
just Web). Now skip_double applies to any voice input regardless
of platform.
Override play_tts in DiscordAdapter to no-op when connected to a voice
channel for the same guild. The gateway runner already plays TTS audio
in the VC via play_in_voice_channel, so the base adapter's fallback
to send_voice (file attachment) was causing double audio output.
When voice mode is enabled and user sends a voice message on Web UI,
both the base adapter auto-TTS (play_audio) and the gateway voice reply
(send_voice) would fire, causing duplicate audio playback. Skip the
gateway voice reply for Web platform voice input since base adapter
already handles it.
play_tts base class forwards metadata via **kwargs to send_voice,
but Discord and Slack adapters did not accept extra keyword arguments,
causing TypeError and silent message handling failure.
Also fix test_web_defaults to patch correct env var (WEB_UI_TOKEN).
- Voice mode: press mic once to enter, press again to exit
- VAD (Voice Activity Detection) auto-stops recording after 1.5s silence
- Continuous loop: speak → transcribe → agent responds → TTS plays → auto-listen
- Voice mode UI: input bar hides, large mic button centered
- Auto-restart listening when TTS playback finishes
- Fallback: restart listening on text response if no TTS arrives
- Auto-TTS: voice messages get spoken response (audio first, then text)
- STT: Groq Whisper fallback when VOICE_TOOLS_OPENAI_KEY not set
- Futuristic UI: glassmorphism, centered container, purple theme, glow effects
- Voice bubble: custom waveform player with seek and progress
- Invisible TTS playback via play_tts() method (no audio file in chat)
- Add hermes-web toolset with full tool access
- Register Platform.WEB in toolset/config maps
- Update docs for voice conversation feature
Detect all network interfaces instead of relying on UDP trick which
returns VPN IP. Prefers 192.168.x.x/10.x.x.x over VPN ranges.
Shows all available IPs in console output.
Type /remote-control from any platform (Telegram, Discord, etc.) to
instantly start the web UI without restarting the gateway.
- Auto-generates access token if not provided
- Shows URL + token in response
- Optional: /remote-control [port] [token]
- Reports status if already running
- Added to /help command list
New platform adapter that serves a full-featured chat interface via HTTP.
Enables access from any device on the network (phone, tablet, desktop).
Features:
- aiohttp server with WebSocket real-time messaging
- Token-based authentication
- Markdown rendering (marked.js) + code highlighting (highlight.js)
- Voice recording via MediaRecorder API + STT transcription
- Image, voice, and document display
- Typing indicator + message editing (streaming support)
- Mobile responsive dark theme
- Auto-reconnect on disconnect
- Media file cleanup (24h TTL)
Config: WEB_UI_ENABLED=true, WEB_UI_PORT=8765, WEB_UI_TOKEN=<token>
No new dependencies — uses aiohttp already in [messaging] extra.
Phase 2 of voice channel support: bot listens to users speaking in VC,
transcribes speech via Groq Whisper, and processes through the agent pipeline.
- Add VoiceReceiver class for RTP packet capture, NaCl/DAVE decryption, Opus decode
- Add silence detection and per-user PCM buffering
- Wire voice input callback from adapter to GatewayRunner
- Fix adapter dict key: use Platform.DISCORD enum instead of string
- Fix guild_id extraction for synthetic voice events via SimpleNamespace raw_message
- Pause/resume receiver during TTS playback to prevent echo
- Send Discord voice messages with flags=8192 and waveform metadata
so they render as native voice bubbles instead of file attachments
- Use .mp3 output path for TTS so edge-tts opus conversion works
correctly (edge always outputs mp3, convert was skipped for .ogg)
- Use actual file_path from TTS result after potential opus conversion
- Register /voice as Discord slash command with mode choices
- Fix _send_voice_reply to handle adapters that don't accept metadata
parameter (Discord) by inspecting the method signature at runtime
- /voice on: reply with voice when user sends voice messages
- /voice tts: reply with voice to all messages
- /voice off: disable, text-only replies
- /voice status: show current mode
- Per-chat state persisted to gateway_voice_mode.json
- Dedup: skips auto-reply if agent already called text_to_speech tool
- drop_pending_updates=True to ignore stale Telegram messages on restart
- 25 tests covering command handler, reply logic, and edge cases