Fix memento_cards.py and telephony.py to use HERMES_HOME env var
with Path.home() fallback instead of hardcoded "~/.hermes".
Leaves migration script as-is (intentionally references old paths).
Closes #479
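A minimal sketch of the lookup pattern described above. The helper name get_hermes_home() is hypothetical; only the HERMES_HOME variable and the ~/.hermes fallback come from this change.

```python
import os
from pathlib import Path


def get_hermes_home() -> Path:
    """Return the Hermes data directory (hypothetical helper name).

    Uses HERMES_HOME when it is set and non-empty, otherwise falls back
    to ".hermes" under the current user's home directory.
    """
    override = os.environ.get("HERMES_HOME", "").strip()
    if override:
        return Path(override).expanduser()
    return Path.home() / ".hermes"
```

Resolving the path at call time rather than at import time lets tests and profiles override HERMES_HOME without reloading modules.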
Remove eager create_session() call from AIAgent.__init__(). Sessions
are now created lazily on first _flush_messages_to_session_db() call
via ensure_session() which uses INSERT OR IGNORE.
Impact: eliminates 32.4% of sessions (3,564 of 10,985) that were
created at agent init but never received any messages.
The existing ensure_session() fallback in _flush_messages_to_session_db()
already handles this pattern — it was originally designed for recovery
after transient SQLite lock failures. Now it's the primary creation path.
Compression-initiated sessions still use create_session() directly
(line ~5995) since they have messages to write immediately.
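The lazy-creation pattern can be sketched roughly as follows. The SessionStore class, its schema, and flush_messages() are illustrative stand-ins, not the real session DB code; only ensure_session() and INSERT OR IGNORE are taken from this change.

```python
import sqlite3
import time


class SessionStore:
    """Illustrative stand-in for the session DB; the schema is assumed."""

    def __init__(self, path: str = ":memory:") -> None:
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS sessions ("
            "id TEXT PRIMARY KEY, started_at REAL)"
        )
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS messages ("
            "session_id TEXT, role TEXT, content TEXT)"
        )

    def ensure_session(self, session_id: str) -> None:
        # INSERT OR IGNORE makes this idempotent: repeat calls are no-ops.
        self.conn.execute(
            "INSERT OR IGNORE INTO sessions (id, started_at) VALUES (?, ?)",
            (session_id, time.time()),
        )

    def flush_messages(self, session_id: str, messages: list) -> None:
        if not messages:
            return  # nothing to write, so no session row is created
        self.ensure_session(session_id)
        self.conn.executemany(
            "INSERT INTO messages (session_id, role, content) VALUES (?, ?, ?)",
            [(session_id, role, content) for role, content in messages],
        )
        self.conn.commit()

    def session_count(self) -> int:
        return self.conn.execute("SELECT COUNT(*) FROM sessions").fetchone()[0]
```

With this shape, an agent that is constructed but never flushes a message leaves no row behind.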
Cron jobs like nightwatch-health-monitor SSH into remote VPSes.
When the runtime provider is cloud (Nous, OpenRouter, Anthropic),
SSH keys don't exist on the inference server — causing silent
failures and wasted iterations.
Changes:
- cron/scheduler.py: Import is_local_endpoint from model_metadata.
Build disabled_toolsets dynamically: append 'terminal' when the
runtime base_url is NOT a local endpoint. Log when terminal is
disabled for observability. Also warn when a job declares
requires_local_infra=true but runs on cloud.
- tests/test_cron_cloud_terminal.py: 14 tests verifying
is_local_endpoint classification and disabled_toolsets logic.
Behavior:
- Local (localhost/127/RFC-1918): terminal enabled, SSH works.
- Cloud (openrouter/nous/anthropic): terminal disabled, agent
  reports SSH unavailable instead of wasting iterations.
Closes #379
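A rough sketch of the classification and toolset logic, under stated assumptions. looks_local() is a hypothetical approximation and not the real is_local_endpoint() from model_metadata; build_disabled_toolsets() is likewise an illustrative name.

```python
import ipaddress
from typing import Optional
from urllib.parse import urlparse


def looks_local(base_url: str) -> bool:
    """Approximate local-endpoint check (assumed rules, not the real one)."""
    host = urlparse(base_url).hostname or ""
    if host == "localhost":
        return True
    try:
        addr = ipaddress.ip_address(host)
    except ValueError:
        return False  # any other DNS name is treated as cloud
    # is_private covers the RFC 1918 ranges plus some other reserved ranges.
    return addr.is_loopback or addr.is_private


def build_disabled_toolsets(base_url: str, base: Optional[list] = None) -> list:
    """Copy the base list and append 'terminal' for non-local endpoints."""
    disabled = list(base or [])
    if not looks_local(base_url) and "terminal" not in disabled:
        disabled.append("terminal")
    return disabled
```

Treating unknown hostnames as cloud is the conservative choice here: a misclassified local host loses the terminal, while a misclassified cloud host would silently fail on SSH.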
Empirical audit: cron error rate peaks at 18:00 (9.4%) vs 4.0% at 09:00.
During configured high-error windows, automatically route cron jobs to
more capable models when the user is not present to correct errors.
- agent/smart_model_routing.py: resolve_cron_model() + _hour_in_window()
- cron/scheduler.py: wired into run_job() after base model resolution
- tests/test_cron_model_routing.py: 16 tests
Config:
  cron_model_routing:
    enabled: true
    fallback_model: "anthropic/claude-sonnet-4"
    fallback_provider: "openrouter"
    windows:
      - {start_hour: 17, end_hour: 22, reason: evening_error_peak}
      - {start_hour: 2, end_hour: 5, reason: overnight_api_instability}
Features: midnight-wrap, per-window overrides, first-match-wins,
graceful degradation on malformed config.
Closes #317
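The window matching can be sketched as below. This assumes half-open [start_hour, end_hour) windows and treats start == end as empty; the real _hour_in_window() may differ, and pick_window() is a hypothetical helper standing in for the matching step inside resolve_cron_model().

```python
from typing import Optional


def hour_in_window(hour: int, start_hour: int, end_hour: int) -> bool:
    """Half-open [start, end) check; end < start wraps past midnight."""
    if start_hour == end_hour:
        return False  # assumption: an empty window never matches
    if start_hour < end_hour:
        return start_hour <= hour < end_hour
    return hour >= start_hour or hour < end_hour


def pick_window(hour: int, windows: list) -> Optional[dict]:
    """Return the first window containing `hour` (first-match-wins).

    Malformed entries are skipped rather than raising, so a bad config
    degrades to "no routing" instead of failing the job.
    """
    for window in windows:
        try:
            start, end = int(window["start_hour"]), int(window["end_hour"])
        except (KeyError, TypeError, ValueError):
            continue
        if hour_in_window(hour, start, end):
            return window
    return None
```

For example, a window of {start_hour: 22, end_hour: 3} matches 23:00 and 01:00 but not 12:00.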
When users blank fallback_model fields or set enabled: false, the validation
and gateway now treat this as intentionally disabling fallback instead of
showing warnings.
Changes:
- hermes_cli/config.py: Skip warnings when both provider and model are blank
or when enabled: false is set
- gateway/run.py: Return None for disabled fallback configs
- tests: Added 8 new tests for blank/disabled fallback scenarios
Behavior:
- Both fields blank: no warnings (intentional disable)
- enabled: false: no warnings (explicit disable)
- One field blank: warning shown (likely misconfiguration)
- Valid config: no warnings
Fixes #373
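The behavior table above can be expressed as a small decision function. This is a sketch only: fallback_warnings() is a hypothetical name, and the warning text is made up for illustration.

```python
def fallback_warnings(cfg: dict) -> list:
    """Return warnings for a fallback_model config block (illustrative)."""
    if cfg.get("enabled") is False:
        return []  # explicit disable
    provider = (cfg.get("provider") or "").strip()
    model = (cfg.get("model") or "").strip()
    if not provider and not model:
        return []  # both blank: intentional disable
    if not provider or not model:
        missing = "provider" if not provider else "model"
        return [f"fallback_model.{missing} is blank; fallback is incomplete"]
    return []  # valid config
```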
- mark_job_run: track last_error_at, last_success_at, error_resolved_at
- trigger_job: clear stale error state when re-triggering
- clear_job_error: manual clearing of stale errors
Closes #349
Add profile column to sessions table for data-level profile isolation.
All session queries now accept an optional profile filter.
Changes:
- Schema v7: new 'profile' TEXT column + idx_sessions_profile index
- Migration v7: ALTER TABLE + CREATE INDEX on existing DBs
- create_session(): new profile parameter
- ensure_session(): new profile parameter
- list_sessions_rich(): profile filter (WHERE s.profile = ?)
- search_sessions(): profile filter
- session_count(): profile filter
Sessions without a profile (None) remain visible to all queries for
backward compatibility. When a profile is passed, only that profile's
sessions are returned.
Profile agents can no longer see each other's sessions when filtered.
No breaking changes to existing callers.
- Expand validate_config_structure() to catch:
  - fallback_providers format errors (non-list, missing provider/model)
  - session_reset.idle_minutes <= 0 (causes immediate resets)
  - session_reset.at_hour out of 0-23 range
  - API_SERVER enabled without API_SERVER_KEY
  - Unknown root-level keys that look like misplaced custom_providers fields
- Add _validate_fallback_providers() in gateway/config.py to validate
  fallback chain at gateway startup (logs warnings for malformed entries)
- Add API_SERVER_KEY check in gateway config loader (warns on unauthenticated endpoint)
- Expand _KNOWN_ROOT_KEYS to include all valid top-level config sections
  (session_reset, browser, checkpoints, voice, stt, tts, etc.)
- Add 13 new tests for fallback_providers and session_reset validation
- All existing tests pass (47/47)
Closes #328
Fixes #297
Problem: Tool handlers that return dict/list/None instead of a
JSON string crash the agent loop with cryptic errors. Nothing
checks the return type at the dispatch boundary.
Fix: In handle_function_call(), after dispatch returns:
1. If result is not str → wrap in JSON with _type_warning
2. If result is str but not valid JSON → wrap in {"output": ...}
3. Log type violations for analysis
4. Valid JSON strings pass through unchanged
Tests: 4 new tests (dict, None, non-JSON string, valid JSON).
All 16 tests in test_model_tools.py pass.
Fixes #313
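The four rules can be sketched as a standalone function. normalize_tool_result() is a hypothetical name, and the exact shape of the wrapper for non-string results (an "output" field next to "_type_warning") is an assumption; the source only states that _type_warning is included.

```python
import json
import logging
from typing import Any

logger = logging.getLogger(__name__)


def normalize_tool_result(tool_name: str, result: Any) -> str:
    """Coerce a tool handler's return value into a JSON string."""
    if not isinstance(result, str):
        kind = type(result).__name__
        logger.warning("tool %s returned %s, expected str", tool_name, kind)
        # default=str keeps non-serializable values from raising here.
        return json.dumps(
            {"output": result, "_type_warning": f"expected str, got {kind}"},
            default=str,
        )
    try:
        json.loads(result)
    except json.JSONDecodeError:
        return json.dumps({"output": result})  # plain text gets wrapped
    return result  # already valid JSON: pass through unchanged
```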
Problem: MemoryStore.replace() and .remove() return
{"success": false, "error": "No entry matched..."} when the
search substring is not found. This is a valid outcome, not
an error. The empirical audit showed 58.4% error rate on the
memory tool, but 98.4% of those were just empty search results.
Fix: Return {"success": true, "result": "no_match", "message": ...}
instead. This drops the memory tool error rate from ~58% to ~1%
(58.4% x 1.6% remaining real errors = ~0.9%).
Tests updated: test_replace_no_match and test_remove_no_match
now assert success=True with result="no_match".
All 33 memory tool tests pass.
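A simplified sketch of the new return shape. replace_entry() is an illustrative free function rather than the real MemoryStore.replace(), and both the "replaced" result value and the whole-entry replacement are assumptions; only the no_match payload is taken from this change.

```python
def replace_entry(entries: list, old_text: str, new_text: str) -> dict:
    """Replace the first entry containing old_text (illustrative only)."""
    for i, entry in enumerate(entries):
        if old_text in entry:
            entries[i] = new_text
            return {"success": True, "result": "replaced"}
    # An empty search is a valid outcome, so it is not reported as an error.
    return {
        "success": True,
        "result": "no_match",
        "message": f"No entry matched {old_text!r}",
    }
```

Callers that need to distinguish the two outcomes check the result field instead of success.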
Detect when the same tool is called 5+ times consecutively and inject
a nudge advising the agent to diversify its approach.
Evidence from empirical audit:
- Top marathon session (qwen, 1643 msgs): execute_code streak of 20
- Opus session (1472 msgs): terminal streak of 10
The nudge fires every 5 consecutive calls (5, 10, 15...) so it
persists without being spammy. Tracks independently in both
sequential and concurrent execution paths.
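One way to track the streak, as a sketch. ToolStreakTracker and the nudge wording are hypothetical; the threshold of 5 and the every-5-calls cadence come from the description above.

```python
from typing import Optional

STREAK_THRESHOLD = 5


class ToolStreakTracker:
    """Counts consecutive calls to the same tool (illustrative)."""

    def __init__(self, threshold: int = STREAK_THRESHOLD) -> None:
        self.threshold = threshold
        self.last_tool: Optional[str] = None
        self.streak = 0

    def record(self, tool_name: str) -> Optional[str]:
        """Record one call; return a nudge on every threshold-th repeat."""
        if tool_name == self.last_tool:
            self.streak += 1
        else:
            self.last_tool = tool_name
            self.streak = 1
        if self.streak % self.threshold == 0:
            return (
                f"{tool_name} has been called {self.streak} times in a row; "
                "consider a different approach."
            )
        return None
```

Using a modulo check rather than a one-shot flag is what makes the nudge repeat at 5, 10, 15 and so on.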
After 3 consecutive tool errors, inject a warning into the tool result
advising the agent to switch strategies. Escalates at 6 and 9+ errors.
Empirical data from audit:
- P(error | prev error) = 58.6% vs P(error | prev success) = 25.2%
- 2.33x cascade amplification factor
- Max observed streak: 31 consecutive errors
Intervention tiers:
- 3 errors: advisory warning (try different tool, use terminal, simplify)
- 6 errors: urgent stop (halt retries, investigate or switch)
- 9+ errors: terminal-only recovery path
Tracks errors in both sequential and concurrent execution paths.
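The tiers can be sketched as a pure function of the streak length. error_streak_warning() and the message text are hypothetical, and the sketch assumes the warning is attached on every error from the third onward, which the description does not spell out.

```python
from typing import Optional


def error_streak_warning(consecutive_errors: int) -> Optional[str]:
    """Map a consecutive-error count to an intervention tier (illustrative)."""
    if consecutive_errors >= 9:
        return "RECOVERY: use the terminal only until a call succeeds."
    if consecutive_errors >= 6:
        return "URGENT: stop retrying; investigate the failure or switch tools."
    if consecutive_errors >= 3:
        return "WARNING: try a different tool, use the terminal, or simplify."
    return None
```

The caller would reset the counter to zero on any successful tool call, so a single success clears the tier.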
Problem: 'hermes cron run JOBID' only queues for next scheduler tick.
Stale error state (like tool_choice TypeError residue) persists forever
because there's no way to execute a job immediately and get fresh results.
Solution: Three-layer synchronous execution path:
- cron/jobs.py: run_job_now() calls scheduler.run_job() then mark_job_run()
- gateway: POST /api/jobs/{id}/run-now endpoint (runs in thread executor)
- CLI: hermes cron run JOBID --now executes and prints result immediately
- tools/cronjob_tools.py: 'run_now' action routes to new function
Also fixes #346, #349 (same stale error pattern).
Fixes #351
Root cause: cron jobs with a per-job model override (e.g. `gemma4:latest`,
8K context) were only discovered to be incompatible at agent runtime,
causing a hard ValueError on every tick with no automatic recovery.
Changes:
- Add `CRON_MIN_CONTEXT_TOKENS = 64_000` constant to scheduler.py
- Add `ModelContextError(ValueError)` exception class for typed identification
- Add `_check_model_context_compat()` preflight function that calls
`get_model_context_length()` and raises `ModelContextError` if the
resolved model's context is below the minimum
- Call preflight check in `run_job()` after model resolution, before
`AIAgent()` is instantiated
- In `_process_single_job()` inside `tick()`, catch `ModelContextError`
and call `pause_job()` to auto-pause the offending job — it will no
longer fire on every tick until the operator fixes the config
- Honour `model.context_length` in config.yaml as an explicit override
that bypasses the check (operator accepts responsibility)
- If context detection itself fails (network/import error), log a warning
and allow the job to proceed (fail-open) so detection gaps don't block
otherwise-working jobs
- Fix pre-existing IndentationError in `tick()` result loop (missing
`try:` block introduced in #353 parallel-execution refactor)
- Export `ModelContextError` and `CRON_MIN_CONTEXT_TOKENS` from `cron/__init__.py`
- Add 8 new tests covering all branches of `_check_model_context_compat`
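The preflight logic can be sketched as below. The signature is an assumption: the lookup is passed in as a callable so the example is self-contained, whereas the real `_check_model_context_compat()` calls `get_model_context_length()` directly. The constant and exception names match this change.

```python
import logging
from typing import Callable, Optional

logger = logging.getLogger(__name__)

CRON_MIN_CONTEXT_TOKENS = 64_000


class ModelContextError(ValueError):
    """Raised when a model's context window is below the cron minimum."""


def check_model_context_compat(
    model: str,
    get_context_length: Callable[[str], int],
    override: Optional[int] = None,
) -> None:
    """Raise ModelContextError if `model` is too small for cron jobs."""
    if override is not None:
        return  # explicit model.context_length: operator accepts responsibility
    try:
        context = get_context_length(model)
    except Exception as exc:
        # Fail-open: a detection gap should not block a working job.
        logger.warning("context detection failed for %s: %s", model, exc)
        return
    if context < CRON_MIN_CONTEXT_TOKENS:
        raise ModelContextError(
            f"{model} has {context} context tokens; "
            f"cron requires at least {CRON_MIN_CONTEXT_TOKENS}"
        )
```

Because `ModelContextError` subclasses `ValueError`, existing handlers keep working while `tick()` can catch the narrower type to auto-pause the job.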
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>