Allow cron runs to keep using send_message for additional destinations, but
skip same-target sends when the scheduler will already auto-deliver the final
response there. Add prompt/tool guidance, docs, and regression coverage for
origin/home-channel resolution and thread-aware comparisons.
When vision_analyze_tool fails, the except block was returning a
generic 'could not be analyzed' message that gave the agent no
actionable information about the failure cause.
Replace the generic message with the actual exception string so the
agent can distinguish between backend errors, missing dependencies,
network failures, and unsupported image paths.
Also add an 'error' field to the failure response for structured
error handling by callers.
Fixes#1034
Remove diary-style memory framing from the system prompt and memory tool
schema, explicitly steer task/session logs to session_search, and clarify
that session_search is for cross-session recall after checking the current
conversation first. Add regression tests for the updated guidance text.
Resolve the cherry-pick against current browser_tool structure without carrying unrelated formatting churn, while preserving the intended cleanup, PATH, and screenshot recovery changes from PR #1001.
Unify browser session teardown so manual close, inactivity cleanup, and emergency shutdown all follow the same cleanup path instead of partially duplicating logic.
This changes browser_close() to delegate to cleanup_browser(), which means recording shutdown, Browserbase release, activity bookkeeping cleanup, and local socket-directory removal now happen consistently. It also updates emergency cleanup to route through cleanup_all_browsers() and explicitly clear in-memory tracking state after teardown so stale active-session, last-activity, and recording entries are not left behind on exit.
The screenshot fallback path has also been fixed. _extract_screenshot_path_from_text() now matches real absolute PNG paths, including quoted output, so browser_vision() can recover screenshots when agent-browser emits human-readable text instead of JSON.
Regression coverage was added in tests/tools/test_browser_cleanup.py for screenshot path extraction, cleanup_browser() state removal, browser_close() delegation, and emergency cleanup state clearing.
Verified with:
- python -m pytest tests/tools/test_browser_cleanup.py -q
- python -m pytest tests/tools/test_browser_console.py tests/gateway/test_send_image_file.py -q
Salvaged from PR #1007 by stablegenius49.
- let INSTALL_POLICY decide dangerous verdict handling for builtin skills
- allow --force to override blocked dangerous decisions for trusted and community sources
- accept --yes / -y as aliases for --force in /skills install
- update regression tests to match the intended policy precedence
Round out the skills hub integration with:
- richer skills.sh metadata and security surfacing during inspect/install
- generic check/update flows for hub-installed skills
- support for well-known Agent Skills endpoints via /.well-known/skills/index.json
Also persist upstream bundle metadata in the lock file and add
regression coverage plus live-compatible path handling for both
skills.sh aliases and well-known endpoints.
* improve: add exc_info to MoA error logging
* refactor: tighten MoA traceback logging scope
Follow up on salvaged PR #998 by limiting exc_info logging to terminal
failure paths, avoiding duplicate aggregator errors, and refreshing the
MoA default OpenRouter model lineup to current frontier options.
---------
Co-authored-by: aydnOktay <xaydinoktay@gmail.com>
Harden the skills.sh hub adapter by parsing skill detail pages when
search slugs do not map cleanly onto GitHub skill folder names.
This adds detail-page resolution for alias-style skills, improves
inspect metadata from the page itself, and covers the behavior with
regression tests plus live smoke validation for json-render-react.
Add regression coverage for the standalone email send path and pass an explicit default SSL context to STARTTLS for certificate verification, matching the gateway email adapter hardening salvaged from PR #994.
Add a skills.sh-backed source adapter for the Hermes Skills Hub.
The new adapter uses skills.sh search results for discovery, falls back to
featured homepage links for browse-style queries, and resolves installs /
inspects through the underlying GitHub repo using common Agent Skills
layout conventions. Also expose skills-sh in CLI source filters and add
regression coverage for search, alias resolution, and source routing.
Extend the salvaged MCP filtering work so utility tools are also governed by policy and server capabilities. Store the registered tool subset per server so rediscovery and status reporting stay accurate after filtering.
Add optional config keys under each mcp_servers entry:
- tools.include: whitelist, only listed tools are registered
- tools.exclude: blacklist, all tools except listed are registered
- enabled: false: skip server entirely, no connection attempt
Backward-compatible: no config keys = all tools registered as before.
Tests: TestMCPSelectiveToolLoading (4 tests), 134 passed total.
Salvaged from PR #979 onto current main.
Preserve the current terminal backend checks while surfacing actionable
preflight errors for unknown TERMINAL_ENV values, missing SSH host/user
configuration, and missing Modal credentials/config. Tighten the modal
regression test so it deterministically exercises the config-missing
path.
Salvaged from PR #977 onto current main.
Preserves the MCP stdio command resolution and improved error diagnostics,
with deterministic regression tests for the npx/node PATH cases.
Co-authored-by: kshitij <82637225+kshitijk4poor@users.noreply.github.com>
1. Anthropic + ElevenLabs TTS silence: forward full response to TTS
callback for non-streaming providers (choices first, then native
content blocks fallback).
2. Subprocess timeout kill: play_audio_file now kills the process on
TimeoutExpired instead of leaving zombie processes.
3. Discord disconnect cleanup: leave all voice channels before closing
the client to prevent leaked state.
4. Audio stream leak: close InputStream if stream.start() fails.
5. Race condition: read/write _on_silence_stop under lock in audio
callback thread.
6. _vprint force=True: show API error, retry, and truncation messages
even during streaming TTS.
7. _refresh_level lock: read _voice_recording under _voice_lock.
Remove web UI gateway (web.py, tests, docs, toolset, env vars, Platform.WEB
enum) per maintainer request — Nous is building their own official chat UI.
Fix 1: Replace sd.wait() with polling pattern in play_audio_file() to prevent
indefinite hang when audio device stalls (consistent with play_beep()).
Fix 2: Use importlib.util.find_spec() for faster_whisper/openai availability
checks instead of module-level imports that trigger heavy native library
loading (CUDA/cuDNN) at import time.
Fix 3: Remove inspect.signature() hack in _send_voice_reply() — add **kwargs
to Telegram send_voice() so all adapters accept metadata uniformly.
Fix 4: Make session loading resilient to removed platform enum values — skip
entries with unknown platforms instead of crashing the entire gateway.
- web.py: pass stt_model from config like discord.py and run.py do
- run.py: match new error messages (No STT provider / not set)
- _transcribe_local: add missing "provider": "local" to return dict
Move stream close outside the lock in shutdown() to prevent deadlock
when audio callback tries to acquire the same lock. Replace single
t.join(timeout) with a polling loop (0.1s intervals) so KeyboardInterrupt
is not blocked during stream cleanup.
Voice status was hardcoded to check API keys only. Now uses the actual
provider resolution (local/groq/openai) so it correctly shows
"local faster-whisper" when installed instead of "Groq" or "MISSING".
Merge main's faster-whisper (local, free) with our Groq support into a
unified three-provider STT pipeline: local > groq > openai.
Provider priority ensures free options are tried first. Each provider
has its own transcriber function with model auto-correction, env-
overridable endpoints, and proper error handling.
74 tests cover the full provider matrix, fallback chains, model
correction, config loading, validation edge cases, and dispatch.
Duplicated YAML config parsing for stt.model existed in gateway/run.py
and gateway/platforms/discord.py. Moved to a single helper in
transcription_tools.py and added 5 tests covering all edge cases.
Code fixes:
- STT model, Groq base URL, and OpenAI STT base URL are now
configurable via env vars (STT_GROQ_MODEL, STT_OPENAI_MODEL,
GROQ_BASE_URL, STT_OPENAI_BASE_URL) instead of hardcoded
- Gateway and Discord VC now read stt.model from config.yaml
(previously only CLI did this — gateway always used defaults)
Doc fixes:
- voice-mode.md: move Web UI troubleshooting to web.md (was duplicated)
- voice-mode.md: simplify "How It Works" for end users (remove NaCl,
DAVE, RTP internals)
- voice-mode.md: clarify STT priority (OpenAI used first if both keys
set, Groq recommended for free tier)
- voice-mode.md: document new STT env overrides in config reference
- web.md: remove duplicate Quick Start / Step 1-3 sections
- web.md: add mobile HTTPS mic workarounds (moved from voice-mode.md)
- web.md: clarify STT fallback order
The counter was incremented in start/stop/cancel but never read
anywhere in the codebase. The race condition it was meant to guard
against is practically impossible with the persistent stream design.
- Keep InputStream alive across recordings to avoid CoreAudio hang on
repeated open/close cycles on macOS. New _ensure_stream() creates the
stream once; start()/stop()/cancel() only toggle frame collection.
- Add _close_stream_with_timeout() with daemon thread to prevent
stream.stop()/close() from blocking indefinitely.
- Add generation counter to detect stale stream-open completions after
cancel or restart.
- Run recorder.cancel() in background thread from Ctrl+C handler to
keep the event loop responsive.
- Add shutdown() method called on /voice off to release audio resources.
- Fix silence timer reset during active speech: use dip tolerance for
_resume_start tracker so natural speech pauses (< 0.3s) don't prevent
the silence timer from being reset.
- Update tests to match persistent stream behavior.
AudioRecorder now auto-stops after 15 seconds if no speech is detected
(_has_spoken remains False). In quiet environments where ambient RMS
never exceeds the silence threshold (200), the recording would wait
indefinitely. The new _max_wait parameter fires the silence callback
after the timeout, triggering the normal "No speech detected" flow.
- Set max_retries=0 on the STT OpenAI client. The SDK default (2) honors
Groq's retry-after header (often 53s), blocking the thread for up to
~106s on rate limits. Voice STT should fail fast, not retry silently.
- Stop continuous recording mode after 3 consecutive no-speech cycles to
prevent infinite restart loops when nobody is talking.
- Set OpenAI client timeout=30s in transcribe_audio() — default 600s
blocks _voice_processing for 10 min if Groq/OpenAI stalls
- Move _voice_start_recording in _voice_stop_and_transcribe finally
block to a daemon thread (same pattern as Ctrl+B handler and
process_loop)
- Add _should_exit guard at top of _voice_start_recording so all 4
call sites respect shutdown without individual checks
- Replace sd.wait() with a poll loop + sd.stop() in play_beep().
sd.wait() calls Event.wait() without timeout — hangs forever if the
audio device stalls. Poll with a 2s ceiling and force-stop instead.
- Wrap _on_silence callback in try-except so exceptions are logged
instead of silently lost in the daemon thread. Prevents recording
state from becoming inconsistent on unexpected errors.
browser_tool.py registered SIGINT/SIGTERM handlers that called sys.exit()
at module import time. When a signal arrived during a lock acquisition
(e.g. AudioRecorder._lock in voice mode), SystemExit was raised inside
prompt_toolkit's async event loop, corrupting coroutine state and making
the process unkillable (required SIGKILL).
atexit handler already ensures browser sessions are cleaned up on any
normal exit path, so the signal handlers were redundant and harmful.
- edge_tts NameError: _generate_edge_tts now calls _import_edge_tts()
instead of referencing bare module name (tts_tool.py)
- TTS thread leak: chat() finally block sends sentinel to text_queue,
sets stop_event, and joins tts_thread on exception paths (cli.py)
- output_stream leak: moved close() into finally block so audio device
is released even on exception (tts_tool.py)
- Ctrl+C continuous mode: cancel handler now resets _voice_continuous
to prevent auto-restart after user cancels recording (cli.py)
- _disable_voice_mode: now calls stop_playback() and sets
_voice_tts_done so TTS stops when voice mode is turned off (cli.py)
- _show_voice_status: reads record key from config instead of
hardcoding Ctrl+B (cli.py)
- AudioRecorder.start() now catches InputStream errors gracefully
with a clear error message about microphone availability
- Fix config key mismatch: cli.py was reading "push_to_talk_key"
but config.py defines "record_key" -- now consistent
- Add format conversion from config format ("ctrl+b") to
prompt_toolkit format ("c-b")
1. Fully lazy imports: sounddevice, numpy, elevenlabs, edge_tts, and
openai are never imported at module level. Each is imported only when
the feature is explicitly activated, preventing crashes in headless
environments (SSH, Docker, WSL, no PortAudio).
2. No core agent loop changes: streaming TTS path extracted from
_interruptible_api_call() into separate _streaming_api_call() method.
The original method is restored to its upstream form.
3. Configurable key binding: push-to-talk key changed from Ctrl+R
(conflicts with readline reverse-search) to Ctrl+B by default.
Configurable via voice.push_to_talk_key in config.yaml.
4. Environment detection: new detect_audio_environment() function checks
for SSH, Docker, WSL, and missing audio devices before enabling voice
mode. Auto-disables with clear warnings in incompatible environments.
5. Graceful degradation: every audio touchpoint (sd.play, sd.InputStream,
sd.OutputStream) wrapped in try/except with ImportError/OSError
handling. Failures produce warnings, not crashes.
- Fix Gemini streaming tool call merge bug: multiple tool calls with same
index but different IDs are now parsed as separate calls instead of
concatenating names (e.g. ha_call_serviceha_call_service)
- Handle partial results in voice mode: show error and stop continuous
mode when agent returns partial/failed results with empty response
- Fix error display during streaming TTS: error messages are shown in
full response box even when streaming box was already opened
- Add duplicate sentence filter in TTS: skip near-duplicate sentences
from LLM repetition
- Fix fake HA server state mutation: turn_on/turn_off/set_temperature
correctly update entity states; temperature sensor simulates change
when thermostat is adjusted
- Add _vprint() helper to suppress log output when stream_callback is active
- Expand Whisper hallucination filter with multi-language phrases and regex pattern for repetitive text
- Stop continuous voice mode when agent returns a failed result (e.g. 429 rate limit)
- Atomic check-and-set for _voice_recording flag with _voice_lock
- Guard _voice_stop_and_transcribe against concurrent invocation
- Remove premature flag clearing from Ctrl+R handler
- Clean up temp WAV files in finally block (_play_via_tempfile)
- Use buffer-level regex for <think> block filtering (handles chunked tags)
- Prevent /voice on prompt accumulation on repeated calls
- Include Groq in STT key error message
Move screen output from stream_callback to display_callback called by
TTS consumer thread. Text now appears sentence-by-sentence in sync with
audio instead of streaming ahead at LLM speed. Removes quiet_mode hack.
sounddevice raises OSError (not ImportError) when the PortAudio C
library is missing. This broke test collection on CI runners that
have the Python package installed but lack the native library.
Stream audio to speaker as the agent generates tokens instead of
waiting for the full response. First sentence plays within ~1-2s
of agent starting to respond.
- run_agent: add stream_callback to run_conversation/chat, streaming
path in _interruptible_api_call accumulates chunks into mock
ChatCompletion while forwarding content deltas to callback
- tts_tool: add stream_tts_to_speaker() with sentence buffering,
think block filtering, markdown stripping, ElevenLabs pcm_24000
streaming to sounddevice OutputStream
- cli: wire up streaming TTS pipeline in chat(), detect elevenlabs
provider + sounddevice availability, skip batch TTS when streaming
is active, signal stop on interrupt
Falls back to batch TTS for Edge/OpenAI providers or when
elevenlabs/sounddevice are not available. Zero impact on non-voice
mode (callback defaults to None).
- Audio cues: beep on record start (880Hz), double beep on stop (660Hz)
- Silence detection: auto-stop recording after 3s of silence (RMS-based)
- Continuous mode: auto-restart recording after agent responds
- Ctrl+R starts continuous mode, Ctrl+R during recording exits it
- Waits for TTS to finish before restarting to avoid recording speaker
- Tests: 7 new tests for beep generation and silence detection
- Add multi-provider STT support (OpenAI > Groq fallback) in transcription_tools
- Auto-correct model selection when provider doesn't support the configured model
- Change voice record key from Ctrl+Space to Ctrl+R (macOS compatibility)
- Fix duplicate transcript echo in voice pipeline
- Add GROQ_API_KEY to .env.example