Empirical audit: cron error rate peaks at 18:00 (9.4%) vs 4.0% at 09:00.
During configured high-error windows, automatically route cron jobs to
more capable models when the user is not present to correct errors.
- agent/smart_model_routing.py: resolve_cron_model() + _hour_in_window()
- cron/scheduler.py: wired into run_job() after base model resolution
- tests/test_cron_model_routing.py: 16 tests
Config:
  cron_model_routing:
    enabled: true
    fallback_model: "anthropic/claude-sonnet-4"
    fallback_provider: "openrouter"
    windows:
      - {start_hour: 17, end_hour: 22, reason: evening_error_peak}
      - {start_hour: 2, end_hour: 5, reason: overnight_api_instability}
Features: midnight-wrap, per-window overrides, first-match-wins,
graceful degradation on malformed config.
Closes #317
Fixes #297
Problem: Tool handlers that return dict/list/None instead of a
JSON string crash the agent loop with cryptic errors. Nothing
validates return types at the tool/agent boundary.
Fix: In handle_function_call(), after dispatch returns:
1. If result is not str → wrap in JSON with _type_warning
2. If result is str but not valid JSON → wrap in {"output": ...}
3. Log type violations for analysis
4. Valid JSON strings pass through unchanged
Tests: 4 new tests (dict, None, non-JSON string, valid JSON).
All 16 tests in test_model_tools.py pass.
Fixes #313
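The four boundary rules could be sketched as a standalone normalizer. `normalize_tool_result` is a hypothetical name for illustration; in the commit, this logic lives inline in `handle_function_call()` after dispatch returns.

```python
import json
import logging

logger = logging.getLogger(__name__)


def normalize_tool_result(result):
    """Coerce whatever a tool handler returned into a valid JSON string.

    1. Non-strings are wrapped in JSON with a _type_warning.
    2. Strings that aren't valid JSON are wrapped in {"output": ...}.
    3. Type violations are logged for later analysis.
    4. Valid JSON strings pass through unchanged.
    """
    if not isinstance(result, str):
        logger.warning("tool returned %s, expected JSON string",
                       type(result).__name__)
        return json.dumps({
            "result": result,
            "_type_warning": f"handler returned {type(result).__name__}, not str",
        }, default=str)  # default=str keeps odd types from crashing the wrap
    try:
        json.loads(result)
        return result  # already valid JSON: pass through unchanged
    except ValueError:
        return json.dumps({"output": result})
```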
Problem: MemoryStore.replace() and .remove() return
{"success": false, "error": "No entry matched..."} when the
search substring is not found. This is a valid outcome, not
an error. The empirical audit showed 58.4% error rate on the
memory tool, but 98.4% of those were just empty search results.
Fix: Return {"success": true, "result": "no_match", "message": ...}
instead. This drops the memory tool error rate from ~58% to ~1%.
Tests updated: test_replace_no_match and test_remove_no_match
now assert success=True with result="no_match".
All 33 memory tool tests pass.
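For illustration, the response shape after the fix might look like the helper below. The helper name and exact message wording are assumptions; only the `success`/`result`/`message` keys come from the commit.

```python
def no_match_result(needle: str) -> dict:
    """Shape returned by replace()/remove() when nothing matches.

    An empty search is a successful outcome ("no_match"), not an error,
    so it no longer inflates the memory tool's error rate.
    """
    return {
        "success": True,
        "result": "no_match",
        "message": f"No entry matched {needle!r}; nothing was changed.",
    }
```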
Problem: 'hermes cron run JOBID' only queues the job for the next scheduler tick.
Stale error state (like tool_choice TypeError residue) persists forever
because there's no way to execute a job immediately and get fresh results.
Solution: Three-layer synchronous execution path:
- cron/jobs.py: run_job_now() calls scheduler.run_job() then mark_job_run()
- gateway: POST /api/jobs/{id}/run-now endpoint (runs in thread executor)
- CLI: hermes cron run JOBID --now executes and prints result immediately
- tools/cronjob_tools.py: 'run_now' action routes to new function
Also fixes #346 and #349 (same stale error pattern).
Fixes #351
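A sketch of the synchronous layer, with the scheduler and job store injected so it is testable in isolation. `run_job_now()` is the name from the commit; the parameter list is an assumption, since the real function presumably resolves these collaborators itself.

```python
def run_job_now(job_id: str, scheduler, jobs_store) -> dict:
    """Execute a job immediately and record the run.

    Bypasses the next-tick queue so stale error state is replaced
    with fresh results right away.
    """
    result = scheduler.run_job(job_id)   # synchronous execution
    jobs_store.mark_job_run(job_id)      # persist last-run bookkeeping
    return result

# In the gateway handler this blocking call would run off the event loop,
# e.g.: await loop.run_in_executor(None, run_job_now, job_id, sched, store)
```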
Root cause: cron jobs with a per-job model override (e.g. `gemma4:latest`,
8K context) were only discovered to be incompatible at agent runtime,
causing a hard ValueError on every tick with no automatic recovery.
Changes:
- Add `CRON_MIN_CONTEXT_TOKENS = 64_000` constant to scheduler.py
- Add `ModelContextError(ValueError)` exception class for typed identification
- Add `_check_model_context_compat()` preflight function that calls
`get_model_context_length()` and raises `ModelContextError` if the
resolved model's context is below the minimum
- Call preflight check in `run_job()` after model resolution, before
`AIAgent()` is instantiated
- In `_process_single_job()` inside `tick()`, catch `ModelContextError`
  and call `pause_job()` to auto-pause the offending job, so it no
  longer fires on every tick until the operator fixes the config
- Honour `model.context_length` in config.yaml as an explicit override
that bypasses the check (operator accepts responsibility)
- If context detection itself fails (network/import error), log a warning
and allow the job to proceed (fail-open) so detection gaps don't block
otherwise-working jobs
- Fix pre-existing IndentationError in `tick()` result loop (missing
`try:` block introduced in #353 parallel-execution refactor)
- Export `ModelContextError` and `CRON_MIN_CONTEXT_TOKENS` from `cron/__init__.py`
- Add 8 new tests covering all branches of `_check_model_context_compat`
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
When the installed run_agent.py diverges from what scheduler.py expects,
every cron job fails with a TypeError from AIAgent.__init__(): a silent
total outage that cascades into gateway restarts, asyncio shutdown errors,
and auth token expiry.
This commit adds a _validate_agent_interface() guard that:
- Inspects AIAgent.__init__ at runtime via inspect.signature
- Verifies every kwarg the scheduler passes exists in the constructor
- Fails fast with a clear remediation message on mismatch
- Runs once per gateway process (cached, zero per-job overhead)
The guard is called at the top of run_job() before any work begins.
It would have caught the tool_choice TypeError that caused 1,199 failures
across 55 jobs (meta-issue #343).
Includes 3 tests: pass, fail, and cache verification.
Fixes #352
Problem: When the gateway restarts, Python's interpreter enters
shutdown phase while the last cron tick is still processing jobs.
ThreadPoolExecutor.submit() raises RuntimeError("cannot schedule
new futures after interpreter shutdown") for every remaining job.
This cascades through the entire tick queue.
Fix (two-part):
1. run_job(): Wrap ThreadPoolExecutor creation + submit in try/except.
On RuntimeError, fall back to synchronous execution (same thread)
so the job at least attempts instead of dying silently.
2. tick(): Check sys.is_finalizing() before each job. If the
interpreter is shutting down, stop processing immediately
instead of wasting time on doomed ThreadPoolExecutor.submit() calls.
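The two-part fix could be sketched as below. These are hypothetical standalone helpers, not the real `run_job()`/`tick()` bodies, which carry far more state; only the RuntimeError fallback and the `sys.is_finalizing()` early exit are from the commit.

```python
import sys
from concurrent.futures import ThreadPoolExecutor


def run_with_shutdown_fallback(fn, *args):
    """Part 1: submit to an executor; if the interpreter is finalizing,
    ThreadPoolExecutor raises RuntimeError("cannot schedule new futures
    after interpreter shutdown"), so fall back to running synchronously
    on the current thread instead of dying silently."""
    try:
        with ThreadPoolExecutor(max_workers=1) as pool:
            return pool.submit(fn, *args).result()
    except RuntimeError:
        return fn(*args)  # same-thread fallback during shutdown


def tick(jobs, run):
    """Part 2: stop the tick queue early once shutdown begins, since
    every remaining submit() call is doomed anyway."""
    results = []
    for job in jobs:
        if sys.is_finalizing():
            break  # interpreter is shutting down: abandon the queue
        results.append(run_with_shutdown_fallback(run, job))
    return results
```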
Issue #342: Cron ticker thread not starting in gateway
Root cause: asyncio.get_running_loop() can raise RuntimeError in edge cases,
and ticker thread can die silently without restart.
Fix:
1. Wrap get_running_loop() in try/except with fallback
2. Add explicit logger.info when ticker starts
3. Add async monitor that restarts ticker if it dies
4. Log PID and thread name for debugging