Add context compression feature for long conversations

- Implemented automatic context compression to manage long conversations that approach the model's context limit.
- Configured the feature to summarize middle turns while protecting the first three and last four turns, ensuring important context is retained.
- Added configuration options in `cli-config.yaml` and environment variables for enabling/disabling compression and setting thresholds.
- Updated documentation in `README.md`, `cli.md`, and `.env.example` to explain the context compression functionality and its configuration.
- Updated `cli.py` to load compression settings from `cli-config.yaml` into environment variables so the agent picks them up automatically.
- Marked the context compression item in `TODO.md` as complete.
teknium1
2026-02-01 18:01:31 -08:00
parent bbeed5b5d1
commit 9b4d9452ba
7 changed files with 614 additions and 12 deletions

.env.example

@@ -154,3 +154,13 @@ WEB_TOOLS_DEBUG=false
VISION_TOOLS_DEBUG=false
MOA_TOOLS_DEBUG=false
IMAGE_TOOLS_DEBUG=false
# =============================================================================
# CONTEXT COMPRESSION (Auto-shrinks long conversations)
# =============================================================================
# When conversation approaches model's context limit, middle turns are
# automatically summarized to free up space.
#
# CONTEXT_COMPRESSION_ENABLED=true # Enable auto-compression (default: true)
# CONTEXT_COMPRESSION_THRESHOLD=0.85 # Compress at 85% of context limit
# CONTEXT_COMPRESSION_MODEL=google/gemini-2.0-flash-001 # Fast model for summaries

README.md

@@ -290,6 +290,41 @@ logs/
- **Trajectory Format**: Uses the same format as batch processing for consistency
- **Git Ignored**: `logs/` is in `.gitignore` so logs aren't committed
## Context Compression
Long conversations can exceed the model's context limit. Hermes Agent automatically compresses context when approaching the limit:
**How it works:**
1. Tracks actual token usage from API responses (`usage.prompt_tokens`)
2. When tokens reach 85% of model's context limit, triggers compression
3. Protects first 3 turns (system prompt, initial request, first response)
4. Protects last 4 turns (recent context is most relevant)
5. Summarizes middle turns using a fast/cheap model (Gemini Flash)
6. Inserts summary as a user message, conversation continues seamlessly
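A minimal sketch of the slicing in steps 3-5 above (illustrative only; `split_for_compression` and the constants are not names from the codebase):

```python
PROTECT_FIRST = 3  # system prompt, initial request, first response
PROTECT_LAST = 4   # most recent turns

def split_for_compression(messages: list) -> tuple:
    """Return (head, middle, tail); only `middle` is summarized."""
    head = messages[:PROTECT_FIRST]
    middle = messages[PROTECT_FIRST:len(messages) - PROTECT_LAST]
    tail = messages[len(messages) - PROTECT_LAST:]
    return head, middle, tail
```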
**Configuration (`cli-config.yaml`):**
```yaml
compression:
  enabled: true    # Enable auto-compression (default)
  threshold: 0.85  # Compress at 85% of context limit
  summary_model: "google/gemini-2.0-flash-001"
```
**Or via environment variables:**
```bash
CONTEXT_COMPRESSION_ENABLED=true
CONTEXT_COMPRESSION_THRESHOLD=0.85
CONTEXT_COMPRESSION_MODEL=google/gemini-2.0-flash-001
```
**When compression triggers, you'll see:**
```
📦 Context compression triggered (170,000 tokens ≥ 170,000 threshold)
📊 Model context limit: 200,000 tokens (85% = 170,000)
🗜️ Summarizing turns 4-15 (12 turns)
✅ Compressed: 20 → 9 messages (~45,000 tokens saved)
```
## Interactive CLI
The CLI provides a rich interactive experience for working with the agent.
@@ -579,6 +614,11 @@ All environment variables can be configured in the `.env` file (copy from `.env.example`)
- `TERMINAL_SSH_PORT`: SSH port (default: `22`)
- `TERMINAL_SSH_KEY`: Path to SSH private key (optional, uses ssh-agent if not set)
**Context Compression (auto-shrinks long conversations):**
- `CONTEXT_COMPRESSION_ENABLED`: Enable auto-compression (default: `true`)
- `CONTEXT_COMPRESSION_THRESHOLD`: Compress at this % of context limit (default: `0.85`)
- `CONTEXT_COMPRESSION_MODEL`: Model for generating summaries (default: `google/gemini-2.0-flash-001`)
**Browser Tool Configuration (agent-browser + Browserbase):**
- `BROWSERBASE_API_KEY`: Browserbase API key for cloud browser execution
- `BROWSERBASE_PROJECT_ID`: Browserbase project ID

TODO.md

@@ -47,7 +47,24 @@ These items need to be addressed ASAP:
- Structured JSON format for easy parsing and replay
- Automatic on CLI runs (configurable)
### 4. Automatic Context Compression 🗜️ ✅ COMPLETE
- [x] **Problem:** Long conversations exceed model context limits, causing errors
- [x] **Solution:** Auto-compress middle turns when approaching limit
- [x] **Implementation:**
- Fetches model context lengths from OpenRouter `/api/v1/models` API (cached 1hr)
- Tracks actual token usage from API responses (`usage.prompt_tokens`)
- Triggers at 85% of model's context limit (configurable)
- Protects first 3 turns (system, initial request, first response)
- Protects last 4 turns (recent context most relevant)
- Summarizes middle turns using fast model (Gemini Flash)
- Inserts summary as user message, conversation continues seamlessly
- If context error occurs, attempts compression before failing
- [x] **Configuration (cli-config.yaml / env vars):**
- `CONTEXT_COMPRESSION_ENABLED` (default: true)
- `CONTEXT_COMPRESSION_THRESHOLD` (default: 0.85 = 85%)
- `CONTEXT_COMPRESSION_MODEL` (default: google/gemini-2.0-flash-001)
### 5. Stream Thinking Summaries in Real-Time 💭 ⏸️ DEFERRED
- [ ] **Problem:** Thinking/reasoning summaries not shown while streaming
- [ ] **Complexity:** This is a significant refactor - leaving for later

cli-config.yaml

@@ -112,6 +112,33 @@ browser:
  # after this period of no activity between agent loops (default: 120 = 2 minutes)
  inactivity_timeout: 120
# =============================================================================
# Context Compression (Auto-shrinks long conversations)
# =============================================================================
# When conversation approaches model's context limit, middle turns are
# automatically summarized to free up space while preserving important context.
#
# HOW IT WORKS:
# 1. Tracks actual token usage from API responses (not estimates)
# 2. When prompt_tokens >= threshold% of model's context_length, triggers compression
# 3. Protects first 3 turns (system prompt, initial request, first response)
# 4. Protects last 4 turns (recent context is most relevant)
# 5. Summarizes middle turns using a fast/cheap model
# 6. Inserts summary as a user message, continues conversation seamlessly
#
compression:
  # Enable automatic context compression (default: true)
  # Set to false if you prefer to manage context manually or want errors on overflow
  enabled: true

  # Trigger compression at this % of model's context limit (default: 0.85 = 85%)
  # Lower values = more aggressive compression, higher values = compress later
  threshold: 0.85

  # Model to use for generating summaries (fast/cheap recommended)
  # This model compresses the middle turns into a concise summary
  summary_model: "google/gemini-2.0-flash-001"
# =============================================================================
# Agent Behavior
# =============================================================================

cli.py

@@ -71,6 +71,11 @@ def load_cli_config() -> Dict[str, Any]:
"browser": { "browser": {
"inactivity_timeout": 120, # Auto-cleanup inactive browser sessions after 2 min "inactivity_timeout": 120, # Auto-cleanup inactive browser sessions after 2 min
}, },
"compression": {
"enabled": True, # Auto-compress when approaching context limit
"threshold": 0.85, # Compress at 85% of model's context limit
"summary_model": "google/gemini-2.0-flash-001", # Fast/cheap model for summaries
},
"agent": { "agent": {
"max_turns": 20, "max_turns": 20,
"verbose": False, "verbose": False,
@@ -154,6 +159,18 @@ def load_cli_config() -> Dict[str, Any]:
        if config_key in browser_config:
            os.environ[env_var] = str(browser_config[config_key])

    # Apply compression config to environment variables
    compression_config = defaults.get("compression", {})
    compression_env_mappings = {
        "enabled": "CONTEXT_COMPRESSION_ENABLED",
        "threshold": "CONTEXT_COMPRESSION_THRESHOLD",
        "summary_model": "CONTEXT_COMPRESSION_MODEL",
    }
    for config_key, env_var in compression_env_mappings.items():
        if config_key in compression_config:
            os.environ[env_var] = str(compression_config[config_key])

    return defaults
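# Illustrative check (not part of the file): with the defaults above,
# load_cli_config() leaves os.environ["CONTEXT_COMPRESSION_THRESHOLD"] == "0.85"
# and os.environ["CONTEXT_COMPRESSION_ENABLED"] == "True" (str() of the bool),
# which the agent then parses case-insensitively.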
# Load configuration at module startup

cli.md

@@ -250,6 +250,38 @@ This is useful for:
- Replaying conversations
- Training data inspection
### Context Compression
Long conversations can exceed model context limits. The CLI automatically compresses context when approaching the limit:
```yaml
# In cli-config.yaml
compression:
  enabled: true    # Enable auto-compression
  threshold: 0.85  # Compress at 85% of context limit
  summary_model: "google/gemini-2.0-flash-001"
```
**How it works:**
1. Tracks actual token usage from each API response
2. When tokens reach threshold, middle turns are summarized
3. First 3 and last 4 turns are always protected
4. Conversation continues seamlessly after compression
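The trigger point itself is simple arithmetic (hypothetical numbers; the real context length comes from OpenRouter's model metadata):

```python
context_length = 200_000                         # model's context window
threshold = 0.85
trigger_at = int(context_length * threshold)     # 170,000 tokens

prompt_tokens = 171_304                          # usage.prompt_tokens from the last response
needs_compression = prompt_tokens >= trigger_at  # True -> compress before next call
```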
**When compression triggers:**
```
📦 Context compression triggered (170,000 tokens ≥ 170,000 threshold)
📊 Model context limit: 200,000 tokens (85% = 170,000)
🗜️ Summarizing turns 4-15 (12 turns)
✅ Compressed: 20 → 9 messages (~45,000 tokens saved)
```
To disable compression:
```yaml
compression:
  enabled: false
```
## Quiet Mode
The CLI runs in "quiet mode" (`HERMES_QUIET=1`), which:


@@ -51,6 +51,410 @@ from model_tools import get_tool_definitions, handle_function_call, check_toolse
from tools.terminal_tool import cleanup_vm
from tools.browser_tool import cleanup_browser
import requests
# =============================================================================
# Model Context Management
# =============================================================================
# Cache for model metadata from OpenRouter
_model_metadata_cache: Dict[str, Dict[str, Any]] = {}
_model_metadata_cache_time: float = 0
_MODEL_CACHE_TTL = 3600 # 1 hour cache TTL
# Default context lengths for common models (fallback if API fails)
DEFAULT_CONTEXT_LENGTHS = {
"anthropic/claude-opus-4": 200000,
"anthropic/claude-opus-4.5": 200000,
"anthropic/claude-sonnet-4": 200000,
"anthropic/claude-sonnet-4-20250514": 200000,
"anthropic/claude-haiku-4.5": 200000,
"openai/gpt-4o": 128000,
"openai/gpt-4-turbo": 128000,
"openai/gpt-4o-mini": 128000,
"google/gemini-2.0-flash": 1048576,
"google/gemini-2.5-pro": 1048576,
"meta-llama/llama-3.3-70b-instruct": 131072,
"deepseek/deepseek-chat-v3": 65536,
"qwen/qwen-2.5-72b-instruct": 32768,
}
def fetch_model_metadata(force_refresh: bool = False) -> Dict[str, Dict[str, Any]]:
    """
    Fetch model metadata from OpenRouter's /api/v1/models endpoint.
    Results are cached for 1 hour to minimize API calls.

    Returns:
        Dict mapping model_id to metadata (context_length, max_completion_tokens, etc.)
    """
    global _model_metadata_cache, _model_metadata_cache_time

    # Return cached data if fresh
    if not force_refresh and _model_metadata_cache and (time.time() - _model_metadata_cache_time) < _MODEL_CACHE_TTL:
        return _model_metadata_cache

    try:
        response = requests.get(
            "https://openrouter.ai/api/v1/models",
            timeout=10
        )
        response.raise_for_status()
        data = response.json()

        # Build cache mapping model_id to relevant metadata
        cache = {}
        for model in data.get("data", []):
            model_id = model.get("id", "")
            cache[model_id] = {
                "context_length": model.get("context_length", 128000),
                "max_completion_tokens": model.get("top_provider", {}).get("max_completion_tokens", 4096),
                "name": model.get("name", model_id),
                "pricing": model.get("pricing", {}),
            }
            # Also cache by canonical slug if different
            canonical = model.get("canonical_slug", "")
            if canonical and canonical != model_id:
                cache[canonical] = cache[model_id]

        _model_metadata_cache = cache
        _model_metadata_cache_time = time.time()
        if not os.getenv("HERMES_QUIET"):
            logging.debug(f"Fetched metadata for {len(cache)} models from OpenRouter")
        return cache
    except Exception as e:
        logging.warning(f"Failed to fetch model metadata from OpenRouter: {e}")
        # Return cached data even if stale, or empty dict
        return _model_metadata_cache or {}
def get_model_context_length(model: str) -> int:
    """
    Get the context length for a specific model.

    Args:
        model: Model identifier (e.g., "anthropic/claude-sonnet-4")

    Returns:
        Context length in tokens (defaults to 128000 if unknown)
    """
    # Try to get from OpenRouter API
    metadata = fetch_model_metadata()
    if model in metadata:
        return metadata[model].get("context_length", 128000)

    # Check default fallbacks (handles partial matches)
    for default_model, length in DEFAULT_CONTEXT_LENGTHS.items():
        if default_model in model or model in default_model:
            return length

    # Conservative default
    return 128000
def estimate_tokens_rough(text: str) -> int:
    """
    Rough token estimate for pre-flight checks (before API call).
    Uses ~4 chars per token heuristic.

    For accurate counts, use the `usage.prompt_tokens` from API responses.

    Args:
        text: Text to estimate tokens for

    Returns:
        Rough estimated token count
    """
    if not text:
        return 0
    return len(text) // 4


def estimate_messages_tokens_rough(messages: List[Dict[str, Any]]) -> int:
    """
    Rough token estimate for messages (pre-flight check only).
    For accurate counts, use the `usage.prompt_tokens` from API responses.

    Args:
        messages: List of message dicts

    Returns:
        Rough estimated token count
    """
    total_chars = sum(len(str(msg)) for msg in messages)
    return total_chars // 4
class ContextCompressor:
    """
    Compresses conversation context when approaching model's context limit.

    Uses similar logic to trajectory_compressor but operates in real-time:
    1. Protects first few turns (system, initial user, first assistant response)
    2. Protects last N turns (recent context is most relevant)
    3. Summarizes middle turns when threshold is reached

    Token tracking uses actual counts from API responses (usage.prompt_tokens)
    rather than estimates for accuracy.
    """

    def __init__(
        self,
        model: str,
        threshold_percent: float = 0.85,
        summary_model: str = "google/gemini-2.0-flash-001",
        protect_first_n: int = 3,
        protect_last_n: int = 4,
        summary_target_tokens: int = 500,
        quiet_mode: bool = False,
    ):
        """
        Initialize the context compressor.

        Args:
            model: The main model being used (to determine context limit)
            threshold_percent: Trigger compression at this % of context (default 85%)
            summary_model: Model to use for generating summaries (cheap/fast)
            protect_first_n: Number of initial turns to always keep
            protect_last_n: Number of recent turns to always keep
            summary_target_tokens: Target token count for summaries
            quiet_mode: Suppress compression notifications
        """
        self.model = model
        self.threshold_percent = threshold_percent
        self.summary_model = summary_model
        self.protect_first_n = protect_first_n
        self.protect_last_n = protect_last_n
        self.summary_target_tokens = summary_target_tokens
        self.quiet_mode = quiet_mode

        self.context_length = get_model_context_length(model)
        self.threshold_tokens = int(self.context_length * threshold_percent)
        self.compression_count = 0

        # Track actual token usage from API responses
        self.last_prompt_tokens = 0
        self.last_completion_tokens = 0
        self.last_total_tokens = 0

        # Initialize OpenRouter client for summarization
        api_key = os.getenv("OPENROUTER_API_KEY", "")
        self.client = OpenAI(
            api_key=api_key,
            base_url="https://openrouter.ai/api/v1"
        ) if api_key else None
    def update_from_response(self, usage: Dict[str, Any]):
        """
        Update tracked token usage from API response.

        Args:
            usage: The usage dict from response (contains prompt_tokens, completion_tokens, total_tokens)
        """
        self.last_prompt_tokens = usage.get("prompt_tokens", 0)
        self.last_completion_tokens = usage.get("completion_tokens", 0)
        self.last_total_tokens = usage.get("total_tokens", 0)

    def should_compress(self, prompt_tokens: int = None) -> bool:
        """
        Check if context exceeds the compression threshold.
        Uses actual token count from API response for accuracy.

        Args:
            prompt_tokens: Actual prompt tokens from last API response.
                If None, uses last tracked value.

        Returns:
            True if compression should be triggered
        """
        tokens = prompt_tokens if prompt_tokens is not None else self.last_prompt_tokens
        return tokens >= self.threshold_tokens

    def should_compress_preflight(self, messages: List[Dict[str, Any]]) -> bool:
        """
        Quick pre-flight check using rough estimate (before API call).

        Use this to avoid making an API call that would fail due to context overflow.
        For post-response compression decisions, use should_compress() with actual tokens.

        Args:
            messages: Current conversation messages

        Returns:
            True if compression is likely needed
        """
        rough_estimate = estimate_messages_tokens_rough(messages)
        return rough_estimate >= self.threshold_tokens

    def get_status(self) -> Dict[str, Any]:
        """
        Get current compression status for display/logging.

        Returns:
            Dict with token usage and threshold info
        """
        return {
            "last_prompt_tokens": self.last_prompt_tokens,
            "threshold_tokens": self.threshold_tokens,
            "context_length": self.context_length,
            "usage_percent": (self.last_prompt_tokens / self.context_length * 100) if self.context_length else 0,
            "compression_count": self.compression_count,
        }
    def _generate_summary(self, turns_to_summarize: List[Dict[str, Any]]) -> str:
        """
        Generate a concise summary of conversation turns using a fast model.

        Args:
            turns_to_summarize: List of message dicts to summarize

        Returns:
            Summary string
        """
        if not self.client:
            # Fallback if no API key
            return "[CONTEXT SUMMARY]: Previous conversation turns have been compressed to save space. The assistant performed various actions and received responses."

        # Format turns for summarization
        parts = []
        for msg in turns_to_summarize:
            role = msg.get("role", "unknown")
            content = msg.get("content", "")

            # Truncate very long content
            if len(content) > 2000:
                content = content[:1000] + "\n...[truncated]...\n" + content[-500:]

            # Include tool call info if present
            tool_calls = msg.get("tool_calls", [])
            if tool_calls:
                tool_names = [tc.get("function", {}).get("name", "?") for tc in tool_calls if isinstance(tc, dict)]
                content += f"\n[Tool calls: {', '.join(tool_names)}]"

            parts.append(f"[{role.upper()}]: {content}")

        content_to_summarize = "\n\n".join(parts)

        prompt = f"""Summarize these conversation turns concisely. This summary will replace these turns in the conversation history.

Write from a neutral perspective describing:
1. What actions were taken (tool calls, searches, file operations)
2. Key information or results obtained
3. Important decisions or findings
4. Relevant data, file names, or outputs

Keep factual and informative. Target ~{self.summary_target_tokens} tokens.

---
TURNS TO SUMMARIZE:
{content_to_summarize}
---

Write only the summary, starting with "[CONTEXT SUMMARY]:" prefix."""

        try:
            response = self.client.chat.completions.create(
                model=self.summary_model,
                messages=[{"role": "user", "content": prompt}],
                temperature=0.3,
                max_tokens=self.summary_target_tokens * 2,
                timeout=30.0,
            )
            summary = response.choices[0].message.content.strip()
            if not summary.startswith("[CONTEXT SUMMARY]:"):
                summary = "[CONTEXT SUMMARY]: " + summary
            return summary
        except Exception as e:
            logging.warning(f"Failed to generate context summary: {e}")
            return "[CONTEXT SUMMARY]: Previous conversation turns have been compressed. The assistant performed tool calls and received responses."
    def compress(self, messages: List[Dict[str, Any]], current_tokens: int = None) -> List[Dict[str, Any]]:
        """
        Compress conversation messages by summarizing middle turns.

        Algorithm:
        1. Keep first N turns (system prompt, initial context)
        2. Keep last N turns (recent/relevant context)
        3. Summarize everything in between
        4. Insert summary as a user message

        Args:
            messages: Current conversation messages
            current_tokens: Actual token count from API (for logging). If None, uses estimate.

        Returns:
            Compressed message list
        """
        n_messages = len(messages)

        # Not enough messages to compress
        if n_messages <= self.protect_first_n + self.protect_last_n + 1:
            if not self.quiet_mode:
                print(f"⚠️ Cannot compress: only {n_messages} messages (need > {self.protect_first_n + self.protect_last_n + 1})")
            return messages

        # Determine compression boundaries
        compress_start = self.protect_first_n
        compress_end = n_messages - self.protect_last_n

        # Nothing to compress
        if compress_start >= compress_end:
            return messages

        # Extract turns to summarize
        turns_to_summarize = messages[compress_start:compress_end]

        # Use actual token count if provided, otherwise estimate
        display_tokens = current_tokens if current_tokens else self.last_prompt_tokens or estimate_messages_tokens_rough(messages)

        if not self.quiet_mode:
            print(f"\n📦 Context compression triggered ({display_tokens:,} tokens ≥ {self.threshold_tokens:,} threshold)")
            print(f" 📊 Model context limit: {self.context_length:,} tokens ({self.threshold_percent*100:.0f}% = {self.threshold_tokens:,})")
            print(f" 🗜️ Summarizing turns {compress_start+1}-{compress_end} ({len(turns_to_summarize)} turns)")

        # Generate summary
        summary = self._generate_summary(turns_to_summarize)

        # Build compressed messages
        compressed = []

        # Keep protected head turns
        for i in range(compress_start):
            msg = messages[i].copy()
            # Add notice to system message on first compression
            if i == 0 and msg.get("role") == "system" and self.compression_count == 0:
                msg["content"] = msg.get("content", "") + "\n\n[Note: Some earlier conversation turns may be summarized to preserve context space.]"
            compressed.append(msg)

        # Add summary as user message
        compressed.append({
            "role": "user",
            "content": summary
        })

        # Keep protected tail turns
        for i in range(compress_end, n_messages):
            compressed.append(messages[i].copy())

        self.compression_count += 1

        if not self.quiet_mode:
            # Estimate new size (actual will be known after next API call)
            new_estimate = estimate_messages_tokens_rough(compressed)
            saved_estimate = display_tokens - new_estimate
            print(f" ✅ Compressed: {n_messages} → {len(compressed)} messages (~{saved_estimate:,} tokens saved)")
            print(f" 💡 Compression #{self.compression_count} complete")

        return compressed
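# Illustrative wiring (a sketch, not code from this commit): how the pieces
# above are intended to fit together in an agent loop. `call_model` is a
# hypothetical stand-in for the real OpenRouter chat-completion call.
#
#     compressor = ContextCompressor(model="anthropic/claude-sonnet-4")
#     while True:
#         response = call_model(messages)          # OpenAI-style response object
#         if getattr(response, "usage", None):
#             compressor.update_from_response({
#                 "prompt_tokens": response.usage.prompt_tokens,
#                 "completion_tokens": response.usage.completion_tokens,
#                 "total_tokens": response.usage.total_tokens,
#             })
#         # ...handle the response, run tools, append messages...
#         if compressor.should_compress():         # actual tokens >= threshold
#             messages = compressor.compress(messages)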
# =============================================================================
# Default System Prompt Components
@@ -364,6 +768,30 @@ class AIAgent:
        # Track conversation messages for session logging
        self._session_messages: List[Dict[str, Any]] = []

        # Initialize context compressor for automatic context management
        # Compresses conversation when approaching model's context limit
        # Configuration via environment variables (can be set in .env or cli-config.yaml)
        compression_threshold = float(os.getenv("CONTEXT_COMPRESSION_THRESHOLD", "0.85"))
        compression_model = os.getenv("CONTEXT_COMPRESSION_MODEL", "google/gemini-2.0-flash-001")
        compression_enabled = os.getenv("CONTEXT_COMPRESSION_ENABLED", "true").lower() in ("true", "1", "yes")

        self.context_compressor = ContextCompressor(
            model=self.model,
            threshold_percent=compression_threshold,
            summary_model=compression_model,
            protect_first_n=3,  # Keep system, first user, first assistant
            protect_last_n=4,   # Keep recent context
            summary_target_tokens=500,
            quiet_mode=self.quiet_mode,
        )
        self.compression_enabled = compression_enabled

        if not self.quiet_mode:
            if compression_enabled:
                print(f"📊 Context limit: {self.context_compressor.context_length:,} tokens (compress at {int(compression_threshold*100)}% = {self.context_compressor.threshold_tokens:,})")
            else:
                print(f"📊 Context limit: {self.context_compressor.context_length:,} tokens (auto-compression disabled)")
    # Pools of kawaii faces for random selection
    KAWAII_SEARCH = [
@@ -1105,6 +1533,18 @@ class AIAgent:
"error": "First response truncated due to output length limit" "error": "First response truncated due to output length limit"
} }
# Track actual token usage from response for context management
if hasattr(response, 'usage') and response.usage:
usage_dict = {
"prompt_tokens": getattr(response.usage, 'prompt_tokens', 0),
"completion_tokens": getattr(response.usage, 'completion_tokens', 0),
"total_tokens": getattr(response.usage, 'total_tokens', 0),
}
self.context_compressor.update_from_response(usage_dict)
if self.verbose_logging:
logging.debug(f"Token usage: prompt={usage_dict['prompt_tokens']:,}, completion={usage_dict['completion_tokens']:,}, total={usage_dict['total_tokens']:,}")
break # Success, exit retry loop break # Success, exit retry loop
except Exception as api_error: except Exception as api_error:
@@ -1132,17 +1572,28 @@ class AIAgent:
                ])

                if is_context_length_error:
                    print(f"{self.log_prefix}⚠️ Context length exceeded - attempting compression...")

                    # Try to compress and retry
                    original_len = len(messages)
                    messages = self.context_compressor.compress(messages, current_tokens=approx_tokens)

                    if len(messages) < original_len:
                        # Compression was possible, retry
                        print(f"{self.log_prefix} 🗜️ Compressed {original_len} → {len(messages)} messages, retrying...")
                        continue  # Retry with compressed messages
                    else:
                        # Can't compress further
                        print(f"{self.log_prefix}❌ Context length exceeded and cannot compress further.")
                        print(f"{self.log_prefix} 💡 The conversation has accumulated too much content.")
                        logging.error(f"{self.log_prefix}Context length exceeded: {approx_tokens:,} tokens. Cannot compress further.")
                        return {
                            "messages": messages,
                            "completed": False,
                            "api_calls": api_call_count,
                            "error": f"Context length exceeded ({approx_tokens:,} tokens). Cannot compress further.",
                            "partial": True
                        }
            if retry_count > max_retries:
                print(f"{self.log_prefix}❌ Max retries ({max_retries}) exceeded. Giving up.")
@@ -1351,6 +1802,14 @@ class AIAgent:
                    if self.tool_delay > 0 and i < len(assistant_message.tool_calls):
                        time.sleep(self.tool_delay)

                # Check if context compression is needed before next API call
                # Uses actual token count from last API response
                if self.compression_enabled and self.context_compressor.should_compress():
                    messages = self.context_compressor.compress(
                        messages,
                        current_tokens=self.context_compressor.last_prompt_tokens
                    )

                # Continue loop for next response
                continue