* fix(session-db): survive CLI/gateway concurrent write contention
Closes #3139
Four layered fixes for the scenario where the CLI and gateway write to
state.db concurrently, causing create_session() to fail with
'database is locked' and permanently disabling session_search on the
gateway side. A combined sketch of all four changes follows the list.
1. Increase SQLite connection timeout: 10s -> 30s
hermes_state.py: longer window for the WAL writer to finish a batch
flush before the other process gives up entirely.
2. INSERT OR IGNORE in create_session
hermes_state.py: prevents an IntegrityError on duplicate session IDs
(e.g. the gateway restarts while a CLI session is still alive).
3. Don't null out _session_db on create_session failure (main fix)
run_agent.py: a transient lock at agent startup must not permanently
disable session_search for the lifetime of that agent instance.
_session_db now stays alive so subsequent flushes and searches work
once the lock clears.
4. New ensure_session() helper + call it during flush
hermes_state.py: INSERT OR IGNORE for a minimal session row.
run_agent.py _flush_messages_to_session_db: calls ensure_session()
before appending messages, so the FK constraint is satisfied even
when create_session() failed at startup. No-op when the row exists.
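A minimal sketch of the four changes together. The module layout,
schema, and helper names (sessions/messages tables, _connect, the
executemany append) are illustrative stand-ins, not the actual
hermes_state.py API:

    import logging
    import sqlite3

    log = logging.getLogger(__name__)
    DB_PATH = "state.db"  # illustrative path

    # --- hermes_state.py side -------------------------------------
    def _connect() -> sqlite3.Connection:
        # (1) 30s busy timeout: wait out a WAL batch flush in the
        # other process instead of failing fast with 'database is
        # locked'
        conn = sqlite3.connect(DB_PATH, timeout=30.0)
        conn.execute("PRAGMA journal_mode=WAL")
        return conn

    def create_session(session_id: str) -> None:
        # (2) OR IGNORE: a duplicate id (gateway restarting while a
        # CLI session is alive) becomes a no-op, not an IntegrityError
        with _connect() as conn:
            conn.execute(
                "INSERT OR IGNORE INTO sessions (session_id) VALUES (?)",
                (session_id,),
            )

    def ensure_session(session_id: str) -> None:
        # (4) minimal session row so the messages FK is satisfied even
        # when create_session() failed at startup; no-op if it exists
        with _connect() as conn:
            conn.execute(
                "INSERT OR IGNORE INTO sessions (session_id) VALUES (?)",
                (session_id,),
            )

    # --- run_agent.py side ----------------------------------------
    class Agent:
        def __init__(self, session_id: str):
            self.session_id = session_id
            self._session_db = _connect()  # stand-in for the handle
            try:
                create_session(session_id)
            except sqlite3.OperationalError:
                # (3) transient lock at startup: keep _session_db
                # alive (it used to be nulled here, permanently
                # disabling session_search for this agent instance)
                log.warning("create_session hit a lock; retry on flush")

        def _flush_messages_to_session_db(self, new_messages) -> None:
            # (4) guarantee the session row before appending messages
            ensure_session(self.session_id)
            with self._session_db as conn:
                conn.executemany(
                    "INSERT INTO messages (session_id, body) VALUES (?, ?)",
                    [(self.session_id, m) for m in new_messages],
                )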
* fix(state): release lock between context queries in search_messages
The context-window queries (one per FTS5 match) were running inside
the same lock acquisition as the primary FTS5 query, holding the lock
for O(N) sequential SQLite round-trips. Move per-match context fetches
outside the outer lock block so each acquires the lock independently,
keeping critical sections short and allowing other threads to interleave.
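A sketch of the restructuring, assuming a module-level threading.Lock
around a shared connection; the FTS table name and context query are
illustrative:

    import sqlite3
    import threading

    _lock = threading.Lock()

    def search_messages(conn: sqlite3.Connection, query: str,
                        window: int = 2):
        # Short critical section: only the primary FTS5 query runs
        # under the lock.
        with _lock:
            matches = conn.execute(
                "SELECT rowid, session_id FROM messages_fts "
                "WHERE messages_fts MATCH ?",
                (query,),
            ).fetchall()

        # Each per-match context fetch takes the lock independently,
        # so other threads can interleave between the O(N) sequential
        # round-trips instead of waiting for all of them.
        results = []
        for rowid, session_id in matches:
            with _lock:
                context = conn.execute(
                    "SELECT body FROM messages WHERE session_id = ? "
                    "AND rowid BETWEEN ? AND ?",
                    (session_id, rowid - window, rowid + window),
                ).fetchall()
            results.append({"match": rowid, "context": context})
        return results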
* fix(session): prefer longer source in load_transcript to prevent legacy truncation
When a long-lived session pre-dates SQLite storage (e.g. sessions
created before the DB layer was introduced, or after a clean
deployment that reset the DB), _flush_messages_to_session_db only
writes the *new* messages from the current turn to SQLite — it skips
messages already present in conversation_history, assuming they are
already persisted.
That assumption fails for legacy JSONL-only sessions:
Turn N (first after DB migration):
  load_transcript(id) → SQLite: 0 → falls back to JSONL: 994 ✓
  _flush_messages_to_session_db: skip first 994, write 2 new → SQLite: 2
Turn N+1:
  load_transcript(id) → SQLite: 2 → returns immediately ✗
  Agent sees 2 messages of history instead of 996
The same pattern causes the reported symptom: session JSON truncated
to 4 messages (_save_session_log writes agent.messages, which only
has 2 history + 2 new = 4).
Fix: always load both sources and return whichever is longer. For a
fully-migrated session SQLite will always be ≥ JSONL, so there is no
regression. For a legacy session that hasn't been bootstrapped yet,
JSONL wins and the full history is restored.
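The rule as a sketch; _load_from_sqlite and _load_from_jsonl are
placeholders for the real loaders:

    def _load_from_sqlite(session_id: str) -> list[dict]:
        ...  # placeholder: real loader queries the messages table

    def _load_from_jsonl(session_id: str) -> list[dict]:
        ...  # placeholder: real loader reads the session .jsonl file

    def load_transcript(session_id: str) -> list[dict]:
        sqlite_msgs = _load_from_sqlite(session_id)  # [] for legacy
        jsonl_msgs = _load_from_jsonl(session_id)
        # A fully-migrated session always has SQLite >= JSONL, so
        # this is a no-op for it; a legacy JSONL-only session gets
        # its full history back. Ties prefer SQLite (richer
        # reasoning fields).
        if len(jsonl_msgs) > len(sqlite_msgs):
            return jsonl_msgs
        return sqlite_msgs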
Closes #3212
* test: add load_transcript source preference tests for #3212
Covers: JSONL longer returns JSONL, SQLite longer returns SQLite,
SQLite empty falls back to JSONL, both empty returns empty, equal
length prefers SQLite (richer reasoning fields).
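One case as a pytest sketch; session_mod and the patched loader names
are hypothetical stand-ins for whatever load_transcript actually calls:

    import session_mod  # hypothetical module under test

    def test_legacy_jsonl_wins_when_longer(monkeypatch):
        # legacy session: SQLite has only post-migration messages,
        # JSONL still holds the full history
        monkeypatch.setattr(
            session_mod, "_load_from_sqlite",
            lambda sid: [{"role": "user"}] * 2,
        )
        monkeypatch.setattr(
            session_mod, "_load_from_jsonl",
            lambda sid: [{"role": "user"}] * 996,
        )
        assert len(session_mod.load_transcript("legacy-id")) == 996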
---------
Co-authored-by: Mibayy <mibayy@hermes.ai>
Co-authored-by: kewe63 <kewe.3217@gmail.com>
Co-authored-by: Mibayy <mibayy@users.noreply.github.com>