# Deep Analysis: Agent Core (run_agent.py + agent/*.py)

## Executive Summary

The AIAgent class is a sophisticated conversation orchestrator (~8500 lines) with multi-provider support, parallel tool execution, context compression, and robust error handling. This analysis covers the state machine, retry logic, context management, optimization opportunities, and potential issues.

---

## 1. State Machine Diagram of Conversation Flow

```
AIAgent Conversation State Machine
──────────────────────────────────

START ──▶ INIT ──▶ BUILD_SYSTEM_PROMPT ──▶ USER_INPUT
         (config)                              │
                                               ▼
API_CALL ◀── PREPARE_MESSAGES ◀── HONCHO_PREFETCH ◀── COMPRESS?
(stream)                           (context)         (threshold)
   │
   ▼
API Response Handler
   ├─ STOP   (finish)   ──▶ RETURN_RESPONSE
   ├─ TOOL_CALLS        ──▶ EXECUTE_TOOLS (parallel/sequential)
   │                         ──▶ APPEND_RESULTS ──▶ loop back to API_CALL
   ├─ LENGTH (truncate) ──▶ CONTINUATION_REQUEST
   └─ ERROR  (retry)    ──▶ FALLBACK / COMPRESS

Key States:
───────────
1. INIT: Agent initialization, client setup, tool loading
2. BUILD_SYSTEM_PROMPT: Cached system prompt assembly with skills/memory
3. USER_INPUT: Message injection with Honcho turn context
4. COMPRESS?: Context threshold check (50% default)
5. API_CALL: Streaming/non-streaming LLM request
6. TOOL_EXECUTION: Parallel (safe) or sequential (interactive) tool calls
7. FALLBACK: Provider failover on errors
8. RETURN: Final response with metadata

Transitions:
────────────
- INTERRUPT: Any state → immediate cleanup → RETURN
- MAX_ITERATIONS: API_CALL → RETURN (budget exhausted)
- 413/CONTEXT_ERROR: API_CALL → COMPRESS → retry
- 401/429: API_CALL → FALLBACK → retry
```

### Sub-State: Tool Execution

```
Tool Execution Flow
───────────────────

RECEIVE_BATCH
     │
 Parallel?
 YES ─┴─ NO
  │       │
  ▼       ▼
CONCURRENT        SEQUENTIAL
(ThreadPool,      (for loop,
 max=8 workers)    per tool)
  │       │
  └─▶ _invoke_tool() ◀─┘
          │
          ▼
    CHECKPOINT?      (write_file/patch/terminal)
          │
          ▼
    BUDGET_WARNING   (inject if >70% iterations)
          │
          ▼
    APPEND_TO_MSGS
```

---

## 2. All Retry/Fallback Logic Identified

### 2.1 API Call Retry Loop (lines 6420-7351)

```python
# Primary retry configuration
max_retries = 3
retry_count = 0

# Retryable errors (with backoff):
# - Timeout errors (httpx.ReadTimeout, ConnectTimeout, PoolTimeout)
# - Connection errors (ConnectError, RemoteProtocolError, ConnectionError)
# - SSE connection drops ("connection lost", "network error")
# - Rate limits (429), respecting the Retry-After header

# Backoff strategy:
wait_time = min(2 ** retry_count, 60)  # 2s, 4s, 8s..., capped at 60s
# Rate limits: use the Retry-After header (capped at 120s)
```

### 2.2 Streaming Retry Logic (lines 4157-4268)

```python
_max_stream_retries = int(os.getenv("HERMES_STREAM_RETRIES", 2))

# Streaming-specific fallbacks:
# 1. Streaming fails AFTER partial delivery → NO retry (partial content shown)
# 2. Streaming fails BEFORE delivery → fall back to non-streaming
# 3. Stale stream detection (>180s, scaled to 300s for >100K tokens) → kill connection
```
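As a minimal sketch of the timing rules above (the helper name `compute_wait` and the header plumbing are hypothetical; only the cap constants mirror the ones listed):

```python
def compute_wait(retry_count, retry_after_header=None):
    """Backoff for a failed API call, per the strategy in 2.1.

    Exponential backoff capped at 60s; rate-limited responses honour
    the Retry-After header, capped at 120s.
    """
    if retry_after_header is not None:
        try:
            return min(float(retry_after_header), 120.0)
        except ValueError:
            pass  # Malformed header: fall back to exponential backoff
    return min(2.0 ** retry_count, 60.0)
```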
### 2.3 Provider Fallback Chain (lines 4334-4443)

```python
# Fallback chain from config (fallback_model / fallback_providers)
self._fallback_chain = [...]  # List of {provider, model} dicts
self._fallback_index = 0      # Current position in chain

# Trigger conditions:
# - max_retries exhausted
# - Rate limit (429) with a fallback available
# - Non-retryable 4xx error (401, 403, 404, 422)
# - Empty/malformed response after retries

# Fallback activation:
# _try_activate_fallback() → swaps client, model, base_url in place
```

### 2.4 Context Length Error Handling (lines 6998-7164)

```python
# 413 Payload Too Large:
max_compression_attempts = 3  # Compress context and retry

# Context length exceeded:
CONTEXT_PROBE_TIERS = [128_000, 64_000, 32_000, 16_000, 8_000]
# Step down through the tiers on error
```

### 2.5 Authentication Refresh Retry (lines 6904-6950)

```python
# Codex OAuth (401):
codex_auth_retry_attempted = False  # Once per request
_try_refresh_codex_client_credentials()

# Nous Portal (401):
nous_auth_retry_attempted = False
_try_refresh_nous_client_credentials()

# Anthropic (401):
anthropic_auth_retry_attempted = False
_try_refresh_anthropic_client_credentials()
```

### 2.6 Length Continuation Retry (lines 6639-6765)

```python
# Response truncated (finish_reason='length'):
length_continue_retries = 0
max_continuation_retries = 3

# Request continuation with the prompt:
# "[System: Your previous response was truncated... Continue exactly where you left off]"
```

### 2.7 Tool Call Validation Retries (lines 7400-7500)

```python
# Invalid tool name: 3 repair attempts
# 1. Lowercase
# 2. Normalize (hyphens/spaces to underscores)
# 3. Fuzzy match (difflib, cutoff=0.7)

# Invalid JSON arguments: 3 retries
# Empty content after think blocks: 3 retries
# Incomplete scratchpad: 3 retries
```
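The three name-repair passes in 2.7 can be sketched with stdlib `difflib` (the function name `repair_tool_name` and the registry shape are hypothetical; the pass order and `cutoff=0.7` mirror the description above):

```python
import difflib

def repair_tool_name(name, known_tools):
    """Three repair passes for an invalid tool name, per 2.7."""
    # Pass 1: lowercase
    candidate = name.lower()
    if candidate in known_tools:
        return candidate
    # Pass 2: normalize hyphens/spaces to underscores
    candidate = candidate.replace("-", "_").replace(" ", "_")
    if candidate in known_tools:
        return candidate
    # Pass 3: fuzzy match against the registry (difflib, cutoff=0.7)
    matches = difflib.get_close_matches(candidate, known_tools, n=1, cutoff=0.7)
    return matches[0] if matches else None
```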
---

## 3. Context Window Management Analysis

### 3.1 Multi-Layer Context System

```
Context Architecture
────────────────────
Layer 1: System Prompt (cached per session)
  - SOUL.md or DEFAULT_AGENT_IDENTITY
  - Memory blocks (MEMORY.md, USER.md)
  - Skills index
  - Context files (AGENTS.md, .cursorrules)
  - Timestamp, platform hints
  - ~2K-10K tokens typical

Layer 2: Conversation History
  - User/assistant/tool messages
  - Protected head (first 3 messages)
  - Protected tail (last N messages by token budget)
  - Compressible middle section

Layer 3: Tool Definitions
  - ~20-30K tokens with many tools
  - Filtered by enabled/disabled toolsets

Layer 4: Ephemeral Context (API call only)
  - Prefill messages
  - Honcho turn context
  - Plugin context
  - Ephemeral system prompt
```

### 3.2 ContextCompressor Algorithm (agent/context_compressor.py)

```python
# Configuration:
threshold_percent = 0.50     # Compress at 50% of context length
protect_first_n = 3          # Head protection
protect_last_n = 20          # Tail protection (message-count fallback)
tail_token_budget = 20_000   # Tail protection (token budget)
summary_target_ratio = 0.20  # 20% of compressed content for summary

# Compression phases:
# 1. Prune old tool results (cheap pre-pass)
# 2. Determine boundaries (head + tail protection)
# 3. Generate structured summary via LLM
# 4. Sanitize tool_call/tool_result pairs
# 5. Assemble compressed message list

# Iterative summary updates:
_previous_summary = None  # Stored for next compression
```
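Phase 2 (boundary determination) might look roughly like the following; a sketch under the configuration above, with hypothetical names (`compression_window`, `count_tokens`) and a guessed reading of the message-count fallback (protect at least the last N messages):

```python
def compression_window(messages, count_tokens,
                       protect_first_n=3, protect_last_n=20,
                       tail_token_budget=20_000):
    """Return (start, end) indices of the compressible middle slice,
    or None if head/tail protection leaves nothing to compress."""
    # Tail protection: walk backwards accumulating tokens until the budget is spent
    tail_start = len(messages)
    spent = 0
    for i in range(len(messages) - 1, -1, -1):
        spent += count_tokens(messages[i])
        if spent > tail_token_budget:
            break
        tail_start = i
    # Message-count fallback: protect at least the last `protect_last_n` messages
    tail_start = min(tail_start, max(len(messages) - protect_last_n, 0))
    # Head protection: the first `protect_first_n` messages are never compressed
    start = protect_first_n
    return (start, tail_start) if tail_start > start else None
```

With 50 messages of ~1,000 tokens each, the token budget protects the last 20 and the head protects the first 3, leaving messages 3-29 compressible.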
### 3.3 Context Length Detection Hierarchy

```python
# Detection priority (model_metadata.py):
# 1. Config override (config.yaml model.context_length)
# 2. Custom provider config (custom_providers[].models[].context_length)
# 3. models.dev registry lookup
# 4. OpenRouter API metadata
# 5. Endpoint /models probe (local servers)
# 6. Hardcoded DEFAULT_CONTEXT_LENGTHS
# 7. Context probing (trial-and-error tiers)
# 8. DEFAULT_FALLBACK_CONTEXT (128K)
```

### 3.4 Prompt Caching (Anthropic)

```python
# System-and-3 strategy:
# - 4 cache_control breakpoints max
# - System prompt (stable)
# - Last 3 non-system messages (rolling window)
# - 5m or 1h TTL

# Activation conditions:
# _is_openrouter_url() and "claude" in model.lower()
# OR a native Anthropic endpoint
```

### 3.5 Context Pressure Monitoring

```python
# User-facing warnings (not injected into the LLM conversation as messages):
_context_pressure_warned = False

# Thresholds:
_budget_caution_threshold = 0.7  # 70% - nudge to wrap up
_budget_warning_threshold = 0.9  # 90% - urgent

# Injection method:
# Added to the last tool result JSON as a _budget_warning field
```
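A sketch of the injection method described in 3.5 (the function name `inject_budget_warning` and the warning strings are illustrative, not the real implementation; the 0.7/0.9 thresholds mirror the constants above):

```python
import json

def inject_budget_warning(tool_result_json, used, budget):
    """Append a _budget_warning field to the last tool-result payload
    once iteration usage crosses the caution/warning thresholds."""
    frac = used / budget
    if frac < 0.7:
        return tool_result_json  # No pressure yet: payload unchanged
    payload = json.loads(tool_result_json)
    if frac >= 0.9:
        payload["_budget_warning"] = "URGENT: 90% of iteration budget used"
    else:
        payload["_budget_warning"] = "Caution: 70% of iteration budget used; wrap up"
    return json.dumps(payload)
```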
---

## 4. Ten Performance Optimization Opportunities

### 4.1 Tool Call Deduplication (Missing)

- **Current**: No deduplication of identical tool calls within a batch
- **Impact**: Redundant API calls, wasted tokens
- **Fix**: Add `_deduplicate_tool_calls()` before execution (already implemented, but only for delegate_task)

### 4.2 Context Compression Frequency

- **Current**: Compresses only when the threshold is crossed
- **Impact**: Sudden latency spike during compression
- **Fix**: Background compression prediction + prefetch

### 4.3 Skills Prompt Cache Invalidation

- **Current**: LRU cache keyed by (skills_dir, tools, toolsets)
- **Issue**: External skill file changes may not invalidate the cache
- **Fix**: Add a file watcher or an mtime check before a cache hit

### 4.4 Streaming Response Buffering

- **Current**: Accumulates all deltas in memory
- **Impact**: Memory bloat for long responses
- **Fix**: Stream directly to output with minimal buffering

### 4.5 Tool Result Truncation Timing

- **Current**: Truncates after tool execution completes
- **Impact**: Wasted time on tools returning huge outputs
- **Fix**: Streaming truncation during tool execution

### 4.6 Concurrent Tool Execution Limits

- **Current**: Fixed `_MAX_TOOL_WORKERS = 8`
- **Issue**: Not tuned to available CPU/memory
- **Fix**: Dynamic worker count based on system resources

### 4.7 API Client Connection Pooling

- **Current**: Creates a new client per interruptible request
- **Issue**: Connection overhead
- **Fix**: Connection pool with proper cleanup

### 4.8 Model Metadata Cache TTL

- **Current**: Fixed 1-hour TTL for OpenRouter metadata
- **Issue**: Stale pricing/context data
- **Fix**: Adaptive TTL based on error rates

### 4.9 Honcho Context Prefetch

- **Current**: Prefetch queued at turn end, consumed next turn
- **Issue**: The first turn has no prefetch
- **Fix**: Pre-warm the cache on session creation

### 4.10 Session DB Write Batching

- **Current**: Per-message writes to SQLite
- **Impact**: I/O overhead
- **Fix**: Batch writes with a periodic flush
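The `_deduplicate_tool_calls()` pass proposed in 4.1 could be as simple as the following sketch (the flat `{name, arguments}` call shape is an assumption; real tool-call payloads may nest differently, e.g. under a `function` key):

```python
def deduplicate_tool_calls(tool_calls):
    """Drop exact duplicates (same name + same argument string) within
    one batch, keeping the first occurrence and preserving order."""
    seen = set()
    unique = []
    for call in tool_calls:
        key = (call["name"], call["arguments"])  # arguments as canonical string
        if key not in seen:
            seen.add(key)
            unique.append(call)
    return unique
```

Keying on the raw argument string keeps the pass cheap, at the cost of missing semantically identical calls whose JSON is serialized differently.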
---

## 5. Five Potential Race Conditions or Bugs

### 5.1 Interrupt Propagation Race (HIGH SEVERITY)

**Location**: run_agent.py lines 2253-2259

```python
with self._active_children_lock:
    children_copy = list(self._active_children)
for child in children_copy:
    child.interrupt(message)  # Child may be gone
```

- **Issue**: A child agent may be removed from `_active_children` between the copy and the iteration
- **Fix**: Check that the child is still active before calling interrupt

### 5.2 Concurrent Tool Execution Order

**Location**: run_agent.py lines 5308-5478

```python
# Results are collected in order, but execution is concurrent
results = [None] * num_tools

def _run_tool(index, ...):
    results[index] = (function_name, ..., result, ...)
```

- **Issue**: If tool A depends on tool B's side effects, concurrent execution may fail
- **Fix**: Document that parallel tools must be independent; add dependency tracking

### 5.3 Session DB Concurrent Access

**Location**: run_agent.py lines 1716-1755

```python
if not self._session_db:
    return
# ... multiple DB operations without a transaction
```

- **Issue**: The gateway creates multiple AIAgent instances; SQLite may lock
- **Fix**: Add proper transaction wrapping and retry logic

### 5.4 Context Compressor State Mutation

**Location**: agent/context_compressor.py lines 545-677

```python
messages, pruned_count = self._prune_old_tool_results(messages, ...)
# messages is a modified copy, but the original may be referenced elsewhere
```

- **Issue**: The copy is shallow for nested structures; tool_calls may be shared
- **Fix**: Ensure a deep copy of the entire message structure
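The aliasing hazard in 5.4, and the `copy.deepcopy` fix, can be demonstrated in isolation (illustrative data, not the real message schema):

```python
import copy

# A shallow copy shares nested tool_calls with the original (the 5.4 hazard):
original = [{"role": "assistant", "tool_calls": [{"id": "call_1"}]}]
shallow = list(original)                      # copies the list, not the dicts
shallow[0]["tool_calls"].append({"id": "call_2"})
assert len(original[0]["tool_calls"]) == 2    # original was mutated

# copy.deepcopy severs the aliasing all the way down:
deep = copy.deepcopy(original)
deep[0]["tool_calls"].append({"id": "call_3"})
assert len(original[0]["tool_calls"]) == 2    # original untouched
```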
### 5.5 Tool Call ID Collision

**Location**: run_agent.py lines 2910-2954

```python
def _derive_responses_function_call_id(self, call_id, response_item_id):
    # Multiple derivations may collide
    return f"fc_{sanitized[:48]}"
```

- **Issue**: Truncated IDs may collide in long conversations
- **Fix**: Use full UUIDs or ensure uniqueness with a counter

---

## Appendix: Key Files and Responsibilities

| File | Lines | Responsibility |
|------|-------|----------------|
| run_agent.py | ~8500 | Main AIAgent class, conversation loop |
| agent/prompt_builder.py | ~816 | System prompt assembly, skills indexing |
| agent/context_compressor.py | ~676 | Context compression, summarization |
| agent/auxiliary_client.py | ~1822 | Side-task LLM client routing |
| agent/model_metadata.py | ~930 | Context length detection, pricing |
| agent/display.py | ~771 | CLI feedback, spinners |
| agent/prompt_caching.py | ~72 | Anthropic cache control |
| agent/trajectory.py | ~56 | Trajectory format conversion |
| agent/models_dev.py | ~172 | models.dev registry integration |

---

## Summary Statistics

- **Total Core Code**: ~13,000 lines
- **State Machine States**: 8 primary, 4 sub-states
- **Retry Mechanisms**: 7 distinct types
- **Context Layers**: 4 layers with compression
- **Potential Issues**: 5 identified (1 high severity)
- **Optimization Opportunities**: 10 identified