# Context Compression and Caching Hermes Agent uses a dual compression system and Anthropic prompt caching to manage context window usage efficiently across long conversations. Source files: `agent/context_compressor.py`, `agent/prompt_caching.py`, `gateway/run.py` (session hygiene), `run_agent.py` (search for `_compress_context`) ## Dual Compression System Hermes has two separate compression layers that operate independently: ``` ┌──────────────────────────┐ Incoming message │ Gateway Session Hygiene │ Fires at 85% of context ─────────────────► │ (pre-agent, rough est.) │ Safety net for large sessions └─────────────┬────────────┘ │ ▼ ┌──────────────────────────┐ │ Agent ContextCompressor │ Fires at 50% of context (default) │ (in-loop, real tokens) │ Normal context management └──────────────────────────┘ ``` ### 1. Gateway Session Hygiene (85% threshold) Located in `gateway/run.py` (search for `_maybe_compress_session`). This is a **safety net** that runs before the agent processes a message. It prevents API failures when sessions grow too large between turns (e.g., overnight accumulation in Telegram/Discord). - **Threshold**: Fixed at 85% of model context length - **Token source**: Prefers actual API-reported tokens from last turn; falls back to rough character-based estimate (`estimate_messages_tokens_rough`) - **Fires**: Only when `len(history) >= 4` and compression is enabled - **Purpose**: Catch sessions that escaped the agent's own compressor The gateway hygiene threshold is intentionally higher than the agent's compressor. Setting it at 50% (same as the agent) caused premature compression on every turn in long gateway sessions. ### 2. Agent ContextCompressor (50% threshold, configurable) Located in `agent/context_compressor.py`. This is the **primary compression system** that runs inside the agent's tool loop with access to accurate, API-reported token counts. ## Configuration All compression settings are read from `config.yaml` under the `compression` key: ```yaml compression: enabled: true # Enable/disable compression (default: true) threshold: 0.50 # Fraction of context window (default: 0.50 = 50%) target_ratio: 0.20 # How much of threshold to keep as tail (default: 0.20) protect_last_n: 20 # Minimum protected tail messages (default: 20) summary_model: null # Override model for summaries (default: uses auxiliary) ``` ### Parameter Details | Parameter | Default | Range | Description | |-----------|---------|-------|-------------| | `threshold` | `0.50` | 0.0-1.0 | Compression triggers when prompt tokens ≥ `threshold × context_length` | | `target_ratio` | `0.20` | 0.10-0.80 | Controls tail protection token budget: `threshold_tokens × target_ratio` | | `protect_last_n` | `20` | ≥1 | Minimum number of recent messages always preserved | | `protect_first_n` | `3` | (hardcoded) | System prompt + first exchange always preserved | ### Computed Values (for a 200K context model at defaults) ``` context_length = 200,000 threshold_tokens = 200,000 × 0.50 = 100,000 tail_token_budget = 100,000 × 0.20 = 20,000 max_summary_tokens = min(200,000 × 0.05, 12,000) = 10,000 ``` ## Compression Algorithm The `ContextCompressor.compress()` method follows a 4-phase algorithm: ### Phase 1: Prune Old Tool Results (cheap, no LLM call) Old tool results (>200 chars) outside the protected tail are replaced with: ``` [Old tool output cleared to save context space] ``` This is a cheap pre-pass that saves significant tokens from verbose tool outputs (file contents, terminal output, search results). ### Phase 2: Determine Boundaries ``` ┌─────────────────────────────────────────────────────────────┐ │ Message list │ │ │ │ [0..2] ← protect_first_n (system + first exchange) │ │ [3..N] ← middle turns → SUMMARIZED │ │ [N..end] ← tail (by token budget OR protect_last_n) │ │ │ └─────────────────────────────────────────────────────────────┘ ``` Tail protection is **token-budget based**: walks backward from the end, accumulating tokens until the budget is exhausted. Falls back to the fixed `protect_last_n` count if the budget would protect fewer messages. Boundaries are aligned to avoid splitting tool_call/tool_result groups. The `_align_boundary_backward()` method walks past consecutive tool results to find the parent assistant message, keeping groups intact. ### Phase 3: Generate Structured Summary The middle turns are summarized using the auxiliary LLM with a structured template: ``` ## Goal [What the user is trying to accomplish] ## Constraints & Preferences [User preferences, coding style, constraints, important decisions] ## Progress ### Done [Completed work — specific file paths, commands run, results] ### In Progress [Work currently underway] ### Blocked [Any blockers or issues encountered] ## Key Decisions [Important technical decisions and why] ## Relevant Files [Files read, modified, or created — with brief note on each] ## Next Steps [What needs to happen next] ## Critical Context [Specific values, error messages, configuration details] ``` Summary budget scales with the amount of content being compressed: - Formula: `content_tokens × 0.20` (the `_SUMMARY_RATIO` constant) - Minimum: 2,000 tokens - Maximum: `min(context_length × 0.05, 12,000)` tokens ### Phase 4: Assemble Compressed Messages The compressed message list is: 1. Head messages (with a note appended to system prompt on first compression) 2. Summary message (role chosen to avoid consecutive same-role violations) 3. Tail messages (unmodified) Orphaned tool_call/tool_result pairs are cleaned up by `_sanitize_tool_pairs()`: - Tool results referencing removed calls → removed - Tool calls whose results were removed → stub result injected ### Iterative Re-compression On subsequent compressions, the previous summary is passed to the LLM with instructions to **update** it rather than summarize from scratch. This preserves information across multiple compactions — items move from "In Progress" to "Done", new progress is added, and obsolete information is removed. The `_previous_summary` field on the compressor instance stores the last summary text for this purpose. ## Before/After Example ### Before Compression (45 messages, ~95K tokens) ``` [0] system: "You are a helpful assistant..." (system prompt) [1] user: "Help me set up a FastAPI project" [2] assistant: terminal: mkdir project [3] tool: "directory created" [4] assistant: write_file: main.py [5] tool: "file written (2.3KB)" ... 30 more turns of file editing, testing, debugging ... [38] assistant: terminal: pytest [39] tool: "8 passed, 2 failed\n..." (5KB output) [40] user: "Fix the failing tests" [41] assistant: read_file: tests/test_api.py [42] tool: "import pytest\n..." (3KB) [43] assistant: "I see the issue with the test fixtures..." [44] user: "Great, also add error handling" ``` ### After Compression (25 messages, ~45K tokens) ``` [0] system: "You are a helpful assistant... [Note: Some earlier conversation turns have been compacted...]" [1] user: "Help me set up a FastAPI project" [2] assistant: "[CONTEXT COMPACTION] Earlier turns were compacted... ## Goal Set up a FastAPI project with tests and error handling ## Progress ### Done - Created project structure: main.py, tests/, requirements.txt - Implemented 5 API endpoints in main.py - Wrote 10 test cases in tests/test_api.py - 8/10 tests passing ### In Progress - Fixing 2 failing tests (test_create_user, test_delete_user) ## Relevant Files - main.py — FastAPI app with 5 endpoints - tests/test_api.py — 10 test cases - requirements.txt — fastapi, pytest, httpx ## Next Steps - Fix failing test fixtures - Add error handling" [3] user: "Fix the failing tests" [4] assistant: read_file: tests/test_api.py [5] tool: "import pytest\n..." [6] assistant: "I see the issue with the test fixtures..." [7] user: "Great, also add error handling" ``` ## Prompt Caching (Anthropic) Source: `agent/prompt_caching.py` Reduces input token costs by ~75% on multi-turn conversations by caching the conversation prefix. Uses Anthropic's `cache_control` breakpoints. ### Strategy: system_and_3 Anthropic allows a maximum of 4 `cache_control` breakpoints per request. Hermes uses the "system_and_3" strategy: ``` Breakpoint 1: System prompt (stable across all turns) Breakpoint 2: 3rd-to-last non-system message ─┐ Breakpoint 3: 2nd-to-last non-system message ├─ Rolling window Breakpoint 4: Last non-system message ─┘ ``` ### How It Works `apply_anthropic_cache_control()` deep-copies the messages and injects `cache_control` markers: ```python # Cache marker format marker = {"type": "ephemeral"} # Or for 1-hour TTL: marker = {"type": "ephemeral", "ttl": "1h"} ``` The marker is applied differently based on content type: | Content Type | Where Marker Goes | |-------------|-------------------| | String content | Converted to `[{"type": "text", "text": ..., "cache_control": ...}]` | | List content | Added to the last element's dict | | None/empty | Added as `msg["cache_control"]` | | Tool messages | Added as `msg["cache_control"]` (native Anthropic only) | ### Cache-Aware Design Patterns 1. **Stable system prompt**: The system prompt is breakpoint 1 and cached across all turns. Avoid mutating it mid-conversation (compression appends a note only on the first compaction). 2. **Message ordering matters**: Cache hits require prefix matching. Adding or removing messages in the middle invalidates the cache for everything after. 3. **Compression cache interaction**: After compression, the cache is invalidated for the compressed region but the system prompt cache survives. The rolling 3-message window re-establishes caching within 1-2 turns. 4. **TTL selection**: Default is `5m` (5 minutes). Use `1h` for long-running sessions where the user takes breaks between turns. ### Enabling Prompt Caching Prompt caching is automatically enabled when: - The model is an Anthropic Claude model (detected by model name) - The provider supports `cache_control` (native Anthropic API or OpenRouter) ```yaml # config.yaml — TTL is configurable model: cache_ttl: "5m" # "5m" or "1h" ``` The CLI shows caching status at startup: ``` 💾 Prompt caching: ENABLED (Claude via OpenRouter, 5m TTL) ``` ## Context Pressure Warnings The agent emits context pressure warnings at 85% of the compression threshold (not 85% of context — 85% of the threshold which is itself 50% of context): ``` ⚠️ Context is 85% to compaction threshold (42,500/50,000 tokens) ``` After compression, if usage drops below 85% of threshold, the warning state is cleared. If compression fails to reduce below the warning level (the conversation is too dense), the warning persists but compression won't re-trigger until the threshold is exceeded again.