hermes-agent/agent_core_analysis.md

# Deep Analysis: Agent Core (run_agent.py + agent/*.py)

## Executive Summary

The AIAgent class is a sophisticated conversation orchestrator (~8500 lines) with multi-provider support, parallel tool execution, context compression, and robust error handling. This analysis covers the state machine, retry logic, context management, optimizations, and potential issues.

---

## 1. State Machine Diagram of Conversation Flow

```
┌─────────────────────────────────────────────────────────────────────────────────┐
│                         AIAgent Conversation State Machine                       │
└─────────────────────────────────────────────────────────────────────────────────┘

┌─────────────┐     ┌─────────────┐     ┌─────────────────┐     ┌─────────────┐
│   START     │────▶│  INIT       │────▶│  BUILD_SYSTEM   │────▶│   USER      │
│             │     │  (config)   │     │  _PROMPT        │     │   INPUT     │
└─────────────┘     └─────────────┘     └─────────────────┘     └──────┬──────┘
                                                                       │
    ┌──────────────────────────────────────────────────────────────────┘
    │
    ▼
┌─────────────┐     ┌─────────────┐     ┌─────────────────┐     ┌─────────────┐
│   API_CALL  │◄────│  PREPARE    │◄────│  HONCHO_PREFETCH│◄────│  COMPRESS?  │
│   (stream)  │     │  _MESSAGES  │     │  (context)      │     │  (threshold)│
└──────┬──────┘     └─────────────┘     └─────────────────┘     └─────────────┘
       │
       ▼
┌─────────────────────────────────────────────────────────────────────────────────┐
│                              API Response Handler                                │
├─────────────────────────────────────────────────────────────────────────────────┤
│  ┌─────────────┐    ┌─────────────┐    ┌─────────────┐    ┌─────────────┐      │
│  │   STOP      │    │  TOOL_CALLS │    │   LENGTH    │    │   ERROR     │      │
│  │  (finish)   │    │  (execute)  │    │ (truncate)  │    │  (retry)    │      │
│  └──────┬──────┘    └──────┬──────┘    └──────┬──────┘    └──────┬──────┘      │
│         │                  │                  │                  │             │
│         ▼                  ▼                  ▼                  ▼             │
│  ┌─────────────┐    ┌─────────────┐    ┌─────────────┐    ┌─────────────┐      │
│  │   RETURN    │    │  EXECUTE    │    │ CONTINUATION│    │  FALLBACK/  │      │
│  │  RESPONSE   │    │  TOOLS      │    │   REQUEST   │    │  COMPRESS   │      │
│  │             │    │  (parallel/ │    │             │    │             │      │
│  │             │    │ sequential) │    │             │    │             │      │
│  └─────────────┘    └──────┬──────┘    └─────────────┘    └─────────────┘      │
│                            │                                                   │
│                            └─────────────────────────────────┐                 │
│                                                              ▼                 │
│                                                   ┌─────────────────┐          │
│                                                   │  APPEND_RESULTS │──────────┘
│                                                   │  (loop back)    │
│                                                   └─────────────────┘
└─────────────────────────────────────────────────────────────────────────────────┘

Key States:
───────────
1. INIT: Agent initialization, client setup, tool loading
2. BUILD_SYSTEM_PROMPT: Cached system prompt assembly with skills/memory
3. USER_INPUT: Message injection with Honcho turn context
4. COMPRESS?: Context threshold check (50% default)
5. API_CALL: Streaming/non-streaming LLM request
6. TOOL_EXECUTION: Parallel (safe) or sequential (interactive) tool calls
7. FALLBACK: Provider failover on errors
8. RETURN: Final response with metadata

Transitions:
────────────
- INTERRUPT: Any state → immediate cleanup → RETURN
- MAX_ITERATIONS: API_CALL → RETURN (budget exhausted)
- 413/CONTEXT_ERROR: API_CALL → COMPRESS → retry
- 401/429: API_CALL → FALLBACK → retry
```

### Sub-State: Tool Execution

```
┌─────────────────────────────────────────────────────────────┐
│                    Tool Execution Flow                       │
└─────────────────────────────────────────────────────────────┘

┌─────────────────┐
│  RECEIVE_BATCH  │
└────────┬────────┘
         │
    ┌────┴────┐
    │ Parallel?│
    └────┬────┘
   YES /  \ NO
      /    \
     ▼      ▼
┌─────────┐  ┌─────────┐
│CONCURRENT│  │SEQUENTIAL│
│(ThreadPool│  │(for loop)│
│  max=8)  │  │         │
└────┬────┘  └────┬────┘
     │            │
     ▼            ▼
┌─────────┐  ┌─────────┐
│ _invoke_│  │ _invoke_│
│ _tool() │  │ _tool() │ (per tool)
│ (workers)│  │         │
└────┬────┘  └────┬────┘
     │            │
     └────────────┘
            │
            ▼
    ┌───────────────┐
    │ CHECKPOINT?   │ (write_file/patch/terminal)
    └───────┬───────┘
            │
            ▼
    ┌───────────────┐
    │ BUDGET_WARNING│ (inject if >70% iterations)
    └───────┬───────┘
            │
            ▼
    ┌───────────────┐
    │ APPEND_TO_MSGS│
    └───────────────┘
```

---

## 2. All Retry/Fallback Logic Identified

### 2.1 API Call Retry Loop (lines 6420-7351)

```python
# Primary retry configuration
max_retries = 3
retry_count = 0

# Retryable errors (with backoff):
- Timeout errors (httpx.ReadTimeout, ConnectTimeout, PoolTimeout)
- Connection errors (ConnectError, RemoteProtocolError, ConnectionError)
- SSE connection drops ("connection lost", "network error")
- Rate limits (429) - with Retry-After header respect

# Backoff strategy:
wait_time = min(2 ** retry_count, 60)  # 2s, 4s, 8s max 60s
# Rate limits: use Retry-After header (capped at 120s)
```

### 2.2 Streaming Retry Logic (lines 4157-4268)

```python
_max_stream_retries = int(os.getenv("HERMES_STREAM_RETRIES", 2))

# Streaming-specific fallbacks:
1. Streaming fails after partial delivery → NO retry (partial content shown)
2. Streaming fails BEFORE delivery → fallback to non-streaming
3. Stale stream detection (>180s, scaled to 300s for >100K tokens) → kill connection
```

### 2.3 Provider Fallback Chain (lines 4334-4443)

```python
# Fallback chain from config (fallback_model / fallback_providers)
self._fallback_chain = [...]  # List of {provider, model} dicts
self._fallback_index = 0      # Current position in chain

# Trigger conditions:
- max_retries exhausted
- Rate limit (429) with fallback available
- Non-retryable 4xx error (401, 403, 404, 422)
- Empty/malformed response after retries

# Fallback activation:
_try_activate_fallback() → swaps client, model, base_url in-place
```

### 2.4 Context Length Error Handling (lines 6998-7164)

```python
# 413 Payload Too Large:
max_compression_attempts = 3
# Compress context and retry

# Context length exceeded:
CONTEXT_PROBE_TIERS = [128_000, 64_000, 32_000, 16_000, 8_000]
# Step down through tiers on error
```

### 2.5 Authentication Refresh Retry (lines 6904-6950)

```python
# Codex OAuth (401):
codex_auth_retry_attempted = False  # Once per request
_try_refresh_codex_client_credentials()

# Nous Portal (401):
nous_auth_retry_attempted = False
_try_refresh_nous_client_credentials()

# Anthropic (401):
anthropic_auth_retry_attempted = False
_try_refresh_anthropic_client_credentials()
```

### 2.6 Length Continuation Retry (lines 6639-6765)

```python
# Response truncated (finish_reason='length'):
length_continue_retries = 0
max_continuation_retries = 3

# Request continuation with prompt:
"[System: Your previous response was truncated... Continue exactly where you left off]"
```

### 2.7 Tool Call Validation Retries (lines 7400-7500)

```python
# Invalid tool name: 3 repair attempts
# 1. Lowercase
# 2. Normalize (hyphens/spaces to underscores)
# 3. Fuzzy match (difflib, cutoff=0.7)

# Invalid JSON arguments: 3 retries
# Empty content after think blocks: 3 retries
# Incomplete scratchpad: 3 retries
```

---

## 3. Context Window Management Analysis

### 3.1 Multi-Layer Context System

```
┌────────────────────────────────────────────────────────────────────────┐
│                        Context Architecture                             │
├────────────────────────────────────────────────────────────────────────┤
│ Layer 1: System Prompt (cached per session)                            │
│   - SOUL.md or DEFAULT_AGENT_IDENTITY                                  │
│   - Memory blocks (MEMORY.md, USER.md)                                 │
│   - Skills index                                                       │
│   - Context files (AGENTS.md, .cursorrules)                            │
│   - Timestamp, platform hints                                          │
│   - ~2K-10K tokens typical                                            │
├────────────────────────────────────────────────────────────────────────┤
│ Layer 2: Conversation History                                          │
│   - User/assistant/tool messages                                       │
│   - Protected head (first 3 messages)                                  │
│   - Protected tail (last N messages by token budget)                   │
│   - Compressible middle section                                        │
├────────────────────────────────────────────────────────────────────────┤
│ Layer 3: Tool Definitions                                              │
│   - ~20-30K tokens with many tools                                     │
│   - Filtered by enabled/disabled toolsets                              │
├────────────────────────────────────────────────────────────────────────┤
│ Layer 4: Ephemeral Context (API call only)                             │
│   - Prefill messages                                                   │
│   - Honcho turn context                                                │
│   - Plugin context                                                     │
│   - Ephemeral system prompt                                            │
└────────────────────────────────────────────────────────────────────────┘
```

### 3.2 ContextCompressor Algorithm (agent/context_compressor.py)

```python
# Configuration:
threshold_percent = 0.50        # Compress at 50% of context length
protect_first_n = 3             # Head protection
protect_last_n = 20             # Tail protection (message count fallback)
tail_token_budget = 20_000      # Tail protection (token budget)
summary_target_ratio = 0.20     # 20% of compressed content for summary

# Compression phases:
1. Prune old tool results (cheap pre-pass)
2. Determine boundaries (head + tail protection)
3. Generate structured summary via LLM
4. Sanitize tool_call/tool_result pairs
5. Assemble compressed message list

# Iterative summary updates:
_previous_summary = None  # Stored for next compression
```

### 3.3 Context Length Detection Hierarchy

```python
# Detection priority (model_metadata.py):
1. Config override (config.yaml model.context_length)
2. Custom provider config (custom_providers[].models[].context_length)
3. models.dev registry lookup
4. OpenRouter API metadata
5. Endpoint /models probe (local servers)
6. Hardcoded DEFAULT_CONTEXT_LENGTHS
7. Context probing (trial-and-error tiers)
8. DEFAULT_FALLBACK_CONTEXT (128K)
```

### 3.4 Prompt Caching (Anthropic)

```python
# System-and-3 strategy:
# - 4 cache_control breakpoints max
# - System prompt (stable)
# - Last 3 non-system messages (rolling window)
# - 5m or 1h TTL

# Activation conditions:
_is_openrouter_url() and "claude" in model.lower()
# OR native Anthropic endpoint
```

### 3.5 Context Pressure Monitoring

```python
# User-facing warnings (not injected to LLM):
_context_pressure_warned = False

# Thresholds:
_budget_caution_threshold = 0.7   # 70% - nudge to wrap up
_budget_warning_threshold = 0.9   # 90% - urgent

# Injection method:
# Added to last tool result JSON as _budget_warning field
```

---

## 4. Ten Performance Optimization Opportunities

### 4.1 Tool Call Deduplication (Missing)
**Current**: No deduplication of identical tool calls within a batch
**Impact**: Redundant API calls, wasted tokens
**Fix**: Add `_deduplicate_tool_calls()` before execution (already implemented but only for delegate_task)

### 4.2 Context Compression Frequency
**Current**: Compress only at threshold crossing
**Impact**: Sudden latency spike during compression
**Fix**: Background compression prediction + prefetch

### 4.3 Skills Prompt Cache Invalidation
**Current**: LRU cache keyed by (skills_dir, tools, toolsets)
**Issue**: External skill file changes may not invalidate cache
**Fix**: Add file watcher or mtime check before cache hit

### 4.4 Streaming Response Buffering
**Current**: Accumulates all deltas in memory
**Impact**: Memory bloat for long responses
**Fix**: Stream directly to output with minimal buffering

### 4.5 Tool Result Truncation Timing
**Current**: Truncates after tool execution completes
**Impact**: Wasted time on tools returning huge outputs
**Fix**: Streaming truncation during tool execution

### 4.6 Concurrent Tool Execution Limits
**Current**: Fixed _MAX_TOOL_WORKERS = 8
**Issue**: Not tuned by available CPU/memory
**Fix**: Dynamic worker count based on system resources

### 4.7 API Client Connection Pooling
**Current**: Creates new client per interruptible request
**Issue**: Connection overhead
**Fix**: Connection pool with proper cleanup

### 4.8 Model Metadata Cache TTL
**Current**: 1 hour fixed TTL for OpenRouter metadata
**Issue**: Stale pricing/context data
**Fix**: Adaptive TTL based on error rates

### 4.9 Honcho Context Prefetch
**Current**: Prefetch queued at turn end, consumed next turn
**Issue**: First turn has no prefetch
**Fix**: Pre-warm cache on session creation

### 4.10 Session DB Write Batching
**Current**: Per-message writes to SQLite
**Impact**: I/O overhead
**Fix**: Batch writes with periodic flush

---

## 5. Five Potential Race Conditions or Bugs

### 5.1 Interrupt Propagation Race (HIGH SEVERITY)
**Location**: run_agent.py lines 2253-2259

```python
with self._active_children_lock:
    children_copy = list(self._active_children)
for child in children_copy:
    child.interrupt(message)  # Child may be gone
```

**Issue**: Child agent may be removed from `_active_children` between copy and iteration
**Fix**: Check if child still exists in list before calling interrupt

### 5.2 Concurrent Tool Execution Order
**Location**: run_agent.py lines 5308-5478

```python
# Results collected in order, but execution is concurrent
results = [None] * num_tools
def _run_tool(index, ...):
    results[index] = (function_name, ..., result, ...)
```

**Issue**: If tool A depends on tool B's side effects, concurrent execution may fail
**Fix**: Document that parallel tools must be independent; add dependency tracking

### 5.3 Session DB Concurrent Access
**Location**: run_agent.py lines 1716-1755

```python
if not self._session_db:
    return
# ... multiple DB operations without transaction
```

**Issue**: Gateway creates multiple AIAgent instances; SQLite may lock
**Fix**: Add proper transaction wrapping and retry logic

### 5.4 Context Compressor State Mutation
**Location**: agent/context_compressor.py lines 545-677

```python
messages, pruned_count = self._prune_old_tool_results(messages, ...)
# messages is modified copy, but original may be referenced elsewhere
```

**Issue**: Deep copy is shallow for nested structures; tool_calls may be shared
**Fix**: Ensure deep copy of entire message structure

### 5.5 Tool Call ID Collision
**Location**: run_agent.py lines 2910-2954

```python
def _derive_responses_function_call_id(self, call_id, response_item_id):
    # Multiple derivations may collide
    return f"fc_{sanitized[:48]}"
```

**Issue**: Truncated IDs may collide in long conversations
**Fix**: Use full UUIDs or ensure uniqueness with counter

---

## Appendix: Key Files and Responsibilities

| File | Lines | Responsibility |
|------|-------|----------------|
| run_agent.py | ~8500 | Main AIAgent class, conversation loop |
| agent/prompt_builder.py | ~816 | System prompt assembly, skills indexing |
| agent/context_compressor.py | ~676 | Context compression, summarization |
| agent/auxiliary_client.py | ~1822 | Side-task LLM client routing |
| agent/model_metadata.py | ~930 | Context length detection, pricing |
| agent/display.py | ~771 | CLI feedback, spinners |
| agent/prompt_caching.py | ~72 | Anthropic cache control |
| agent/trajectory.py | ~56 | Trajectory format conversion |
| agent/models_dev.py | ~172 | models.dev registry integration |

---

## Summary Statistics

- **Total Core Code**: ~13,000 lines
- **State Machine States**: 8 primary, 4 sub-states
- **Retry Mechanisms**: 7 distinct types
- **Context Layers**: 4 layers with compression
- **Potential Issues**: 5 identified (1 high severity)
- **Optimization Opportunities**: 10 identified