fix(state): SQLite concurrency hardening + session transcript integrity (#3249)
* fix(session-db): survive CLI/gateway concurrent write contention Closes #3139 Three layered fixes for the scenario where CLI and gateway write to state.db concurrently, causing create_session() to fail with 'database is locked' and permanently disabling session_search on the gateway side. 1. Increase SQLite connection timeout: 10s -> 30s hermes_state.py: longer window for the WAL writer to finish a batch flush before the other process gives up entirely. 2. INSERT OR IGNORE in create_session hermes_state.py: prevents IntegrityError on duplicate session IDs (e.g. gateway restarts while CLI session is still alive). 3. Don't null out _session_db on create_session failure (main fix) run_agent.py: a transient lock at agent startup must not permanently disable session_search for the lifetime of that agent instance. _session_db now stays alive so subsequent flushes and searches work once the lock clears. 4. New ensure_session() helper + call it during flush hermes_state.py: INSERT OR IGNORE for a minimal session row. run_agent.py _flush_messages_to_session_db: calls ensure_session() before appending messages, so the FK constraint is satisfied even when create_session() failed at startup. No-op when the row exists. * fix(state): release lock between context queries in search_messages The context-window queries (one per FTS5 match) were running inside the same lock acquisition as the primary FTS5 query, holding the lock for O(N) sequential SQLite round-trips. Move per-match context fetches outside the outer lock block so each acquires the lock independently, keeping critical sections short and allowing other threads to interleave. * fix(session): prefer longer source in load_transcript to prevent legacy truncation When a long-lived session pre-dates SQLite storage (e.g. sessions created before the DB layer was introduced, or after a clean deployment that reset the DB), _flush_messages_to_session_db only writes the *new* messages from the current turn to SQLite — it skips messages already present in conversation_history, assuming they are already persisted. That assumption fails for legacy JSONL-only sessions: Turn N (first after DB migration): load_transcript(id) → SQLite: 0 → falls back to JSONL: 994 ✓ _flush_messages_to_session_db: skip first 994, write 2 new → SQLite: 2 Turn N+1: load_transcript(id) → SQLite: 2 → returns immediately ✗ Agent sees 2 messages of history instead of 996 The same pattern causes the reported symptom: session JSON truncated to 4 messages (_save_session_log writes agent.messages which only has 2 history + 2 new = 4). Fix: always load both sources and return whichever is longer. For a fully-migrated session SQLite will always be ≥ JSONL, so there is no regression. For a legacy session that hasn't been bootstrapped yet, JSONL wins and the full history is restored. Closes #3212 * test: add load_transcript source preference tests for #3212 Covers: JSONL longer returns JSONL, SQLite longer returns SQLite, SQLite empty falls back to JSONL, both empty returns empty, equal length prefers SQLite (richer reasoning fields). --------- Co-authored-by: Mibayy <mibayy@hermes.ai> Co-authored-by: kewe63 <kewe.3217@gmail.com> Co-authored-by: Mibayy <mibayy@users.noreply.github.com>
This commit is contained in:
19
run_agent.py
19
run_agent.py
@@ -893,8 +893,15 @@ class AIAgent:
|
||||
user_id=None,
|
||||
)
|
||||
except Exception as e:
|
||||
logger.warning("Session DB create_session failed — messages will NOT be indexed: %s", e)
|
||||
self._session_db = None # prevent silent data loss on every subsequent flush
|
||||
# Transient SQLite lock contention (e.g. CLI and gateway writing
|
||||
# concurrently) must NOT permanently disable session_search for
|
||||
# this agent. Keep _session_db alive — subsequent message
|
||||
# flushes and session_search calls will still work once the
|
||||
# lock clears. The session row may be missing from the index
|
||||
# for this run, but that is recoverable (flushes upsert rows).
|
||||
logger.warning(
|
||||
"Session DB create_session failed (session_search still available): %s", e
|
||||
)
|
||||
|
||||
# In-memory todo list for task planning (one per agent/session)
|
||||
from tools.todo_tool import TodoStore
|
||||
@@ -1578,6 +1585,14 @@ class AIAgent:
|
||||
return
|
||||
self._apply_persist_user_message_override(messages)
|
||||
try:
|
||||
# If create_session() failed at startup (e.g. transient lock), the
|
||||
# session row may not exist yet. ensure_session() uses INSERT OR
|
||||
# IGNORE so it is a no-op when the row is already there.
|
||||
self._session_db.ensure_session(
|
||||
self.session_id,
|
||||
source=self.platform or "cli",
|
||||
model=self.model,
|
||||
)
|
||||
start_idx = len(conversation_history) if conversation_history else 0
|
||||
flush_from = max(start_idx, self._last_flushed_db_idx)
|
||||
for msg in messages[flush_from:]:
|
||||
|
||||
Reference in New Issue
Block a user