Files
timmy-config/docs/PERFORMANCE_OPTIMIZATIONS.md
STEP35 Burn Agent 380012c791
Some checks failed
Architecture Lint / Linter Tests (pull_request) Successful in 22s
Smoke Test / smoke (pull_request) Failing after 22s
Validate Config / YAML Lint (pull_request) Failing after 16s
Validate Config / JSON Validate (pull_request) Successful in 23s
Validate Config / Python Syntax & Import Check (pull_request) Failing after 56s
Validate Config / Python Test Suite (pull_request) Has been skipped
Validate Config / Cron Syntax Check (pull_request) Successful in 11s
Validate Config / Shell Script Lint (pull_request) Failing after 55s
Validate Config / Deploy Script Dry Run (pull_request) Successful in 8s
Validate Config / Playbook Schema Validation (pull_request) Successful in 17s
PR Checklist / pr-checklist (pull_request) Successful in 5m6s
Architecture Lint / Lint Repository (pull_request) Failing after 24s
docs/security/wizard-bootstrap: extract from hermes-sovereign to top-level (#337)
Extract organizational artifacts from hermes-sovereign/ subdirectory to
timmy-config top-level directories for clearer separation of concerns.

Moved:
- docs/: 19 markdown files from hermes-sovereign/docs/ (DEPLOY.md,
  SECURITY_AUDIT_REPORT.md, SECURE_CODING_GUIDELINES.md,
  PERFORMANCE_ANALYSIS_REPORT.md, PERFORMANCE_HOTSPOTS_QUICKREF.md,
  PERFORMANCE_OPTIMIZATIONS.md, SECURITY_FIXES_CHECKLIST.md,
  SECURITY_MITIGATION_ROADMAP.md, TEST_ANALYSIS_REPORT.md,
  TEST_OPTIMIZATION_GUIDE.md, V-006_FIX_SUMMARY.md, and more)
- security/: validate_security.py from hermes-sovereign/security/
  (creates top-level security/ directory)
- wizard-bootstrap/: 7 files from hermes-sovereign/wizard-bootstrap/
  (FORGE_OPERATIONS_GUIDE.md, WIZARD_ENVIRONMENT_CONTRACT.md,
  dependency_checker.py, monthly_audit.py, skills_audit.py,
  wizard_bootstrap.py, __init__.py)
- docs/notebooks/: 2 notebook files from hermes-sovereign/notebooks/
  (agent_task_system_health.ipynb, agent_task_system_health.py)

Empty source directories (hermes-sovereign/docs/, security/,
wizard-bootstrap/, notebooks/) removed.

This reorganization establishes timmy-config as the canonical home
for operational documentation, security tooling, and wizard bootstrap
infrastructure — extracted from the hermes-agent sidecar subtree.

Closes #337
2026-04-30 01:40:46 -04:00

5.2 KiB

Performance Optimizations for run_agent.py

Summary of Changes

This document describes the async I/O and performance optimizations applied to run_agent.py to fix blocking operations and improve overall responsiveness.


1. Session Log Batching (PROBLEM 1: Lines 2158-2222)

Problem

_save_session_log() performed blocking file I/O on every conversation turn, causing:

  • UI freezing during rapid message exchanges
  • Unnecessary disk writes (JSON file was overwritten every turn)
  • Synchronous json.dump() and fsync() blocking the main thread

Solution

Implemented async batching with the following components:

New Methods:

  • _init_session_log_batcher() - Initialize batching infrastructure
  • _save_session_log() - Updated to use non-blocking batching
  • _flush_session_log_async() - Flush writes in background thread
  • _write_session_log_sync() - Actual blocking I/O (runs in thread pool)
  • _deferred_session_log_flush() - Delayed flush for batching
  • _shutdown_session_log_batcher() - Cleanup and flush on exit

Key Features:

  • Time-based batching: Minimum 500ms between writes
  • Deferred flushing: Rapid successive calls are batched
  • Thread pool: Single-worker executor prevents concurrent write conflicts
  • Atexit cleanup: Ensures pending logs are flushed on exit
  • Backward compatible: Same method signature, no breaking changes

Performance Impact:

  • Before: Every turn blocks on disk I/O (~5-20ms per write)
  • After: Updates cached in memory, flushed every 500ms or on exit
  • 10 rapid calls now result in ~1-2 writes instead of 10

2. Todo Store Hydration Caching (PROBLEM 2: Lines 2269-2297)

Problem

_hydrate_todo_store() performed O(n) history scan on every message:

  • Scanned entire conversation history backwards
  • No caching between calls
  • Re-parsed JSON for every message check
  • Gateway mode creates fresh AIAgent per message, making this worse

Solution

Implemented result caching with scan limiting:

Key Changes:

# Added caching flags
self._todo_store_hydrated  # Marks if hydration already done
self._todo_cache_key        # Caches history object id

# Added scan limit for very long histories
scan_limit = 100  # Only scan last 100 messages

Performance Impact:

  • Before: O(n) scan every call, parsing JSON for each tool message
  • After: O(1) cached check, skips redundant work
  • First call: Scans up to 100 messages (limited)
  • Subsequent calls: <1μs cached check

3. API Call Timeouts (PROBLEM 3: Lines 3759-3826)

Problem

_anthropic_messages_create() and _interruptible_api_call() had:

  • No timeout handling - could block indefinitely
  • 300ms polling interval for interrupt detection (sluggish)
  • No timeout for OpenAI-compatible endpoints

Solution

Added comprehensive timeout handling:

Changes to _anthropic_messages_create():

  • Added timeout: float = 300.0 parameter (5 minutes default)
  • Passes timeout to Anthropic SDK

Changes to _interruptible_api_call():

  • Added timeout: float = 300.0 parameter
  • Reduced polling interval from 300ms to 50ms (6x faster interrupt response)
  • Added elapsed time tracking
  • Raises TimeoutError if API call exceeds timeout
  • Force-closes clients on timeout to prevent resource leaks
  • Passes timeout to OpenAI-compatible endpoints

Performance Impact:

  • Before: Could hang forever on stuck connections
  • After: Guaranteed timeout after 5 minutes (configurable)
  • Interrupt response: 300ms → 50ms (6x faster)

Backward Compatibility

All changes maintain 100% backward compatibility:

  1. Session logging: Same method signature, behavior is additive
  2. Todo hydration: Same signature, caching is transparent
  3. API calls: New timeout parameter has sensible default (300s)

No existing code needs modification to benefit from these optimizations.


Testing

Run the verification script:

python3 -c "
import ast
with open('run_agent.py') as f:
    source = f.read()
tree = ast.parse(source)

methods = ['_init_session_log_batcher', '_write_session_log_sync', 
           '_shutdown_session_log_batcher', '_hydrate_todo_store',
           '_interruptible_api_call']

for node in ast.walk(tree):
    if isinstance(node, ast.FunctionDef) and node.name in methods:
        print(f'✓ Found {node.name}')
print('\nAll optimizations verified!')
"

Lines Modified

Function Line Range Change Type
_init_session_log_batcher ~2168-2178 NEW
_save_session_log ~2178-2230 MODIFIED
_flush_session_log_async ~2230-2240 NEW
_write_session_log_sync ~2240-2300 NEW
_deferred_session_log_flush ~2300-2305 NEW
_shutdown_session_log_batcher ~2305-2315 NEW
_hydrate_todo_store ~2320-2360 MODIFIED
_anthropic_messages_create ~3870-3890 MODIFIED
_interruptible_api_call ~3895-3970 MODIFIED

Future Improvements

Potential additional optimizations:

  1. Use aiofiles for true async file I/O (requires aiofiles dependency)
  2. Batch SQLite writes in _flush_messages_to_session_db
  3. Add compression for large session logs
  4. Implement write-behind caching for checkpoint manager

Optimizations implemented: 2026-03-31