Replace shell=True with list-based subprocess execution to prevent command injection via malicious user input. Changes: - tools/transcription_tools.py: Use shlex.split() + shell=False - tools/environments/docker.py: List-based commands with container ID validation Fixes CVE-level vulnerability where malicious file paths or container IDs could inject arbitrary commands. CVSS: 9.8 (Critical) Refs: V-001 in SECURITY_AUDIT_REPORT.md
21 KiB
Deep Analysis: Agent Core (run_agent.py + agent/*.py)
Executive Summary
The AIAgent class is a sophisticated conversation orchestrator (~8500 lines) with multi-provider support, parallel tool execution, context compression, and robust error handling. This analysis covers the state machine, retry logic, context management, optimizations, and potential issues.
1. State Machine Diagram of Conversation Flow
┌─────────────────────────────────────────────────────────────────────────────────┐
│ AIAgent Conversation State Machine │
└─────────────────────────────────────────────────────────────────────────────────┘
┌─────────────┐ ┌─────────────┐ ┌─────────────────┐ ┌─────────────┐
│ START │────▶│ INIT │────▶│ BUILD_SYSTEM │────▶│ USER │
│ │ │ (config) │ │ _PROMPT │ │ INPUT │
└─────────────┘ └─────────────┘ └─────────────────┘ └──────┬──────┘
│
┌──────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────┐ ┌─────────────┐ ┌─────────────────┐ ┌─────────────┐
│ API_CALL │◄────│ PREPARE │◄────│ HONCHO_PREFETCH│◄────│ COMPRESS? │
│ (stream) │ │ _MESSAGES │ │ (context) │ │ (threshold)│
└──────┬──────┘ └─────────────┘ └─────────────────┘ └─────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────────────────┐
│ API Response Handler │
├─────────────────────────────────────────────────────────────────────────────────┤
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │
│ │ STOP │ │ TOOL_CALLS │ │ LENGTH │ │ ERROR │ │
│ │ (finish) │ │ (execute) │ │ (truncate) │ │ (retry) │ │
│ └──────┬──────┘ └──────┬──────┘ └──────┬──────┘ └──────┬──────┘ │
│ │ │ │ │ │
│ ▼ ▼ ▼ ▼ │
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │
│ │ RETURN │ │ EXECUTE │ │ CONTINUATION│ │ FALLBACK/ │ │
│ │ RESPONSE │ │ TOOLS │ │ REQUEST │ │ COMPRESS │ │
│ │ │ │ (parallel/ │ │ │ │ │ │
│ │ │ │ sequential) │ │ │ │ │ │
│ └─────────────┘ └──────┬──────┘ └─────────────┘ └─────────────┘ │
│ │ │
│ └─────────────────────────────────┐ │
│ ▼ │
│ ┌─────────────────┐ │
│ │ APPEND_RESULTS │──────────┘
│ │ (loop back) │
│ └─────────────────┘
└─────────────────────────────────────────────────────────────────────────────────┘
Key States:
───────────
1. INIT: Agent initialization, client setup, tool loading
2. BUILD_SYSTEM_PROMPT: Cached system prompt assembly with skills/memory
3. USER_INPUT: Message injection with Honcho turn context
4. COMPRESS?: Context threshold check (50% default)
5. API_CALL: Streaming/non-streaming LLM request
6. TOOL_EXECUTION: Parallel (safe) or sequential (interactive) tool calls
7. FALLBACK: Provider failover on errors
8. RETURN: Final response with metadata
Transitions:
────────────
- INTERRUPT: Any state → immediate cleanup → RETURN
- MAX_ITERATIONS: API_CALL → RETURN (budget exhausted)
- 413/CONTEXT_ERROR: API_CALL → COMPRESS → retry
- 401/429: API_CALL → FALLBACK → retry
Sub-State: Tool Execution
┌─────────────────────────────────────────────────────────────┐
│ Tool Execution Flow │
└─────────────────────────────────────────────────────────────┘
┌─────────────────┐
│ RECEIVE_BATCH │
└────────┬────────┘
│
┌────┴────┐
│ Parallel?│
└────┬────┘
YES / \ NO
/ \
▼ ▼
┌─────────┐ ┌─────────┐
│CONCURRENT│ │SEQUENTIAL│
│(ThreadPool│ │(for loop)│
│ max=8) │ │ │
└────┬────┘ └────┬────┘
│ │
▼ ▼
┌─────────┐ ┌─────────┐
│ _invoke_│ │ _invoke_│
│ _tool() │ │ _tool() │ (per tool)
│ (workers)│ │ │
└────┬────┘ └────┬────┘
│ │
└────────────┘
│
▼
┌───────────────┐
│ CHECKPOINT? │ (write_file/patch/terminal)
└───────┬───────┘
│
▼
┌───────────────┐
│ BUDGET_WARNING│ (inject if >70% iterations)
└───────┬───────┘
│
▼
┌───────────────┐
│ APPEND_TO_MSGS│
└───────────────┘
2. All Retry/Fallback Logic Identified
2.1 API Call Retry Loop (lines 6420-7351)
# Primary retry configuration
max_retries = 3
retry_count = 0
# Retryable errors (with backoff):
- Timeout errors (httpx.ReadTimeout, ConnectTimeout, PoolTimeout)
- Connection errors (ConnectError, RemoteProtocolError, ConnectionError)
- SSE connection drops ("connection lost", "network error")
- Rate limits (429) - with Retry-After header respect
# Backoff strategy:
wait_time = min(2 ** retry_count, 60) # 2s, 4s, 8s max 60s
# Rate limits: use Retry-After header (capped at 120s)
2.2 Streaming Retry Logic (lines 4157-4268)
_max_stream_retries = int(os.getenv("HERMES_STREAM_RETRIES", 2))
# Streaming-specific fallbacks:
1. Streaming fails after partial delivery → NO retry (partial content shown)
2. Streaming fails BEFORE delivery → fallback to non-streaming
3. Stale stream detection (>180s, scaled to 300s for >100K tokens) → kill connection
2.3 Provider Fallback Chain (lines 4334-4443)
# Fallback chain from config (fallback_model / fallback_providers)
self._fallback_chain = [...] # List of {provider, model} dicts
self._fallback_index = 0 # Current position in chain
# Trigger conditions:
- max_retries exhausted
- Rate limit (429) with fallback available
- Non-retryable 4xx error (401, 403, 404, 422)
- Empty/malformed response after retries
# Fallback activation:
_try_activate_fallback() → swaps client, model, base_url in-place
2.4 Context Length Error Handling (lines 6998-7164)
# 413 Payload Too Large:
max_compression_attempts = 3
# Compress context and retry
# Context length exceeded:
CONTEXT_PROBE_TIERS = [128_000, 64_000, 32_000, 16_000, 8_000]
# Step down through tiers on error
2.5 Authentication Refresh Retry (lines 6904-6950)
# Codex OAuth (401):
codex_auth_retry_attempted = False # Once per request
_try_refresh_codex_client_credentials()
# Nous Portal (401):
nous_auth_retry_attempted = False
_try_refresh_nous_client_credentials()
# Anthropic (401):
anthropic_auth_retry_attempted = False
_try_refresh_anthropic_client_credentials()
2.6 Length Continuation Retry (lines 6639-6765)
# Response truncated (finish_reason='length'):
length_continue_retries = 0
max_continuation_retries = 3
# Request continuation with prompt:
"[System: Your previous response was truncated... Continue exactly where you left off]"
2.7 Tool Call Validation Retries (lines 7400-7500)
# Invalid tool name: 3 repair attempts
# 1. Lowercase
# 2. Normalize (hyphens/spaces to underscores)
# 3. Fuzzy match (difflib, cutoff=0.7)
# Invalid JSON arguments: 3 retries
# Empty content after think blocks: 3 retries
# Incomplete scratchpad: 3 retries
3. Context Window Management Analysis
3.1 Multi-Layer Context System
┌────────────────────────────────────────────────────────────────────────┐
│ Context Architecture │
├────────────────────────────────────────────────────────────────────────┤
│ Layer 1: System Prompt (cached per session) │
│ - SOUL.md or DEFAULT_AGENT_IDENTITY │
│ - Memory blocks (MEMORY.md, USER.md) │
│ - Skills index │
│ - Context files (AGENTS.md, .cursorrules) │
│ - Timestamp, platform hints │
│ - ~2K-10K tokens typical │
├────────────────────────────────────────────────────────────────────────┤
│ Layer 2: Conversation History │
│ - User/assistant/tool messages │
│ - Protected head (first 3 messages) │
│ - Protected tail (last N messages by token budget) │
│ - Compressible middle section │
├────────────────────────────────────────────────────────────────────────┤
│ Layer 3: Tool Definitions │
│ - ~20-30K tokens with many tools │
│ - Filtered by enabled/disabled toolsets │
├────────────────────────────────────────────────────────────────────────┤
│ Layer 4: Ephemeral Context (API call only) │
│ - Prefill messages │
│ - Honcho turn context │
│ - Plugin context │
│ - Ephemeral system prompt │
└────────────────────────────────────────────────────────────────────────┘
3.2 ContextCompressor Algorithm (agent/context_compressor.py)
# Configuration:
threshold_percent = 0.50 # Compress at 50% of context length
protect_first_n = 3 # Head protection
protect_last_n = 20 # Tail protection (message count fallback)
tail_token_budget = 20_000 # Tail protection (token budget)
summary_target_ratio = 0.20 # 20% of compressed content for summary
# Compression phases:
1. Prune old tool results (cheap pre-pass)
2. Determine boundaries (head + tail protection)
3. Generate structured summary via LLM
4. Sanitize tool_call/tool_result pairs
5. Assemble compressed message list
# Iterative summary updates:
_previous_summary = None # Stored for next compression
3.3 Context Length Detection Hierarchy
# Detection priority (model_metadata.py):
1. Config override (config.yaml model.context_length)
2. Custom provider config (custom_providers[].models[].context_length)
3. models.dev registry lookup
4. OpenRouter API metadata
5. Endpoint /models probe (local servers)
6. Hardcoded DEFAULT_CONTEXT_LENGTHS
7. Context probing (trial-and-error tiers)
8. DEFAULT_FALLBACK_CONTEXT (128K)
3.4 Prompt Caching (Anthropic)
# System-and-3 strategy:
# - 4 cache_control breakpoints max
# - System prompt (stable)
# - Last 3 non-system messages (rolling window)
# - 5m or 1h TTL
# Activation conditions:
_is_openrouter_url() and "claude" in model.lower()
# OR native Anthropic endpoint
3.5 Context Pressure Monitoring
# User-facing warnings (not injected to LLM):
_context_pressure_warned = False
# Thresholds:
_budget_caution_threshold = 0.7 # 70% - nudge to wrap up
_budget_warning_threshold = 0.9 # 90% - urgent
# Injection method:
# Added to last tool result JSON as _budget_warning field
4. Ten Performance Optimization Opportunities
4.1 Tool Call Deduplication (Missing)
Current: No deduplication of identical tool calls within a batch
Impact: Redundant API calls, wasted tokens
Fix: Add _deduplicate_tool_calls() before execution (already implemented but only for delegate_task)
4.2 Context Compression Frequency
Current: Compress only at threshold crossing Impact: Sudden latency spike during compression Fix: Background compression prediction + prefetch
4.3 Skills Prompt Cache Invalidation
Current: LRU cache keyed by (skills_dir, tools, toolsets) Issue: External skill file changes may not invalidate cache Fix: Add file watcher or mtime check before cache hit
4.4 Streaming Response Buffering
Current: Accumulates all deltas in memory Impact: Memory bloat for long responses Fix: Stream directly to output with minimal buffering
4.5 Tool Result Truncation Timing
Current: Truncates after tool execution completes Impact: Wasted time on tools returning huge outputs Fix: Streaming truncation during tool execution
4.6 Concurrent Tool Execution Limits
Current: Fixed _MAX_TOOL_WORKERS = 8 Issue: Not tuned by available CPU/memory Fix: Dynamic worker count based on system resources
4.7 API Client Connection Pooling
Current: Creates new client per interruptible request Issue: Connection overhead Fix: Connection pool with proper cleanup
4.8 Model Metadata Cache TTL
Current: 1 hour fixed TTL for OpenRouter metadata Issue: Stale pricing/context data Fix: Adaptive TTL based on error rates
4.9 Honcho Context Prefetch
Current: Prefetch queued at turn end, consumed next turn Issue: First turn has no prefetch Fix: Pre-warm cache on session creation
4.10 Session DB Write Batching
Current: Per-message writes to SQLite Impact: I/O overhead Fix: Batch writes with periodic flush
5. Five Potential Race Conditions or Bugs
5.1 Interrupt Propagation Race (HIGH SEVERITY)
Location: run_agent.py lines 2253-2259
with self._active_children_lock:
children_copy = list(self._active_children)
for child in children_copy:
child.interrupt(message) # Child may be gone
Issue: Child agent may be removed from _active_children between copy and iteration
Fix: Check if child still exists in list before calling interrupt
5.2 Concurrent Tool Execution Order
Location: run_agent.py lines 5308-5478
# Results collected in order, but execution is concurrent
results = [None] * num_tools
def _run_tool(index, ...):
results[index] = (function_name, ..., result, ...)
Issue: If tool A depends on tool B's side effects, concurrent execution may fail Fix: Document that parallel tools must be independent; add dependency tracking
5.3 Session DB Concurrent Access
Location: run_agent.py lines 1716-1755
if not self._session_db:
return
# ... multiple DB operations without transaction
Issue: Gateway creates multiple AIAgent instances; SQLite may lock Fix: Add proper transaction wrapping and retry logic
5.4 Context Compressor State Mutation
Location: agent/context_compressor.py lines 545-677
messages, pruned_count = self._prune_old_tool_results(messages, ...)
# messages is modified copy, but original may be referenced elsewhere
Issue: Deep copy is shallow for nested structures; tool_calls may be shared Fix: Ensure deep copy of entire message structure
5.5 Tool Call ID Collision
Location: run_agent.py lines 2910-2954
def _derive_responses_function_call_id(self, call_id, response_item_id):
# Multiple derivations may collide
return f"fc_{sanitized[:48]}"
Issue: Truncated IDs may collide in long conversations Fix: Use full UUIDs or ensure uniqueness with counter
Appendix: Key Files and Responsibilities
| File | Lines | Responsibility |
|---|---|---|
| run_agent.py | ~8500 | Main AIAgent class, conversation loop |
| agent/prompt_builder.py | ~816 | System prompt assembly, skills indexing |
| agent/context_compressor.py | ~676 | Context compression, summarization |
| agent/auxiliary_client.py | ~1822 | Side-task LLM client routing |
| agent/model_metadata.py | ~930 | Context length detection, pricing |
| agent/display.py | ~771 | CLI feedback, spinners |
| agent/prompt_caching.py | ~72 | Anthropic cache control |
| agent/trajectory.py | ~56 | Trajectory format conversion |
| agent/models_dev.py | ~172 | models.dev registry integration |
Summary Statistics
- Total Core Code: ~13,000 lines
- State Machine States: 8 primary, 4 sub-states
- Retry Mechanisms: 7 distinct types
- Context Layers: 4 layers with compression
- Potential Issues: 5 identified (1 high severity)
- Optimization Opportunities: 10 identified