Context Compression and Caching

Hermes Agent uses a dual compression system and Anthropic prompt caching to manage context window usage efficiently across long conversations.

Source files: agent/context_compressor.py, agent/prompt_caching.py, gateway/run.py (session hygiene), run_agent.py (lines 1146-1204)

Dual Compression System

Hermes has two separate compression layers that operate independently:

                     ┌──────────────────────────┐
  Incoming message   │   Gateway Session Hygiene │  Fires at 85% of context
  ─────────────────► │   (pre-agent, rough est.) │  Safety net for large sessions
                     └─────────────┬────────────┘
                                   │
                                   ▼
                     ┌──────────────────────────┐
                     │   Agent ContextCompressor │  Fires at 50% of context (default)
                     │   (in-loop, real tokens)  │  Normal context management
                     └──────────────────────────┘

1. Gateway Session Hygiene (85% threshold)

Located in gateway/run.py (around line 2220). This is a safety net that runs before the agent processes a message. It prevents API failures when sessions grow too large between turns (e.g., overnight accumulation in Telegram/Discord).

  • Threshold: Fixed at 85% of model context length
  • Token source: Prefers actual API-reported tokens from last turn; falls back to rough character-based estimate (estimate_messages_tokens_rough)
  • Fires: Only when len(history) >= 4 and compression is enabled
  • Purpose: Catch sessions that escaped the agent's own compressor

The gateway hygiene threshold is intentionally higher than the agent compressor's. Setting it to 50% (the same as the agent) caused premature compression on every turn in long gateway sessions.
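
A minimal sketch of that safety-net check, with illustrative names (needs_hygiene_compression and estimate_tokens stand in for the real gateway code and estimate_messages_tokens_rough):

# Sketch of the gateway hygiene check; the real logic lives in gateway/run.py.
def estimate_tokens(history):
    # Rough character-based estimate (~4 characters per token), standing in
    # for estimate_messages_tokens_rough.
    return sum(len(str(m.get("content", ""))) for m in history) // 4

def needs_hygiene_compression(history, context_length, last_turn_tokens=None, enabled=True):
    if not enabled or len(history) < 4:
        return False
    # Prefer the API-reported token count from the last turn; fall back to the rough estimate.
    tokens = last_turn_tokens or estimate_tokens(history)
    return tokens >= 0.85 * context_length  # fixed 85% safety-net threshold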

2. Agent ContextCompressor (50% threshold, configurable)

Located in agent/context_compressor.py. This is the primary compression system that runs inside the agent's tool loop with access to accurate, API-reported token counts.

Configuration

All compression settings are read from config.yaml under the compression key:

compression:
  enabled: true              # Enable/disable compression (default: true)
  threshold: 0.50            # Fraction of context window (default: 0.50 = 50%)
  target_ratio: 0.20         # How much of threshold to keep as tail (default: 0.20)
  protect_last_n: 20         # Minimum protected tail messages (default: 20)
  summary_model: null        # Override model for summaries (default: uses auxiliary)

Parameter Details

  • threshold (default 0.50, range 0.0-1.0): Compression triggers when prompt tokens ≥ threshold × context_length
  • target_ratio (default 0.20, range 0.10-0.80): Controls tail protection token budget: threshold_tokens × target_ratio
  • protect_last_n (default 20, minimum 1): Minimum number of recent messages always preserved
  • protect_first_n (default 3, hardcoded): System prompt + first exchange always preserved

Computed Values (for a 200K context model at defaults)

context_length       = 200,000
threshold_tokens     = 200,000 × 0.50 = 100,000
tail_token_budget    = 100,000 × 0.20 = 20,000
max_summary_tokens   = min(200,000 × 0.05, 12,000) = 10,000
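
The same derivation as a runnable sketch; the names are illustrative rather than the compressor's actual attributes:

# Derived budgets for a 200K-context model at default settings
context_length = 200_000
threshold      = 0.50
target_ratio   = 0.20

threshold_tokens   = int(context_length * threshold)          # 100,000: compression trigger
tail_token_budget  = int(threshold_tokens * target_ratio)     # 20,000: protected tail budget
max_summary_tokens = min(int(context_length * 0.05), 12_000)  # 10,000: summary ceiling

def should_compress(prompt_tokens: int) -> bool:
    # Compression fires once API-reported prompt tokens reach the threshold.
    return prompt_tokens >= threshold_tokens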

Compression Algorithm

The ContextCompressor.compress() method follows a 4-phase algorithm:

Phase 1: Prune Old Tool Results (cheap, no LLM call)

Old tool results (>200 chars) outside the protected tail are replaced with:

[Old tool output cleared to save context space]

This is a cheap pre-pass that saves significant tokens from verbose tool outputs (file contents, terminal output, search results).
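
A minimal sketch of this pre-pass, assuming OpenAI-style message dicts; the function name is illustrative, not the compressor's actual method:

PLACEHOLDER = "[Old tool output cleared to save context space]"

def prune_old_tool_results(messages, tail_start):
    # Replace long tool results that sit before the protected tail.
    for msg in messages[:tail_start]:
        if msg.get("role") == "tool" and len(str(msg.get("content", ""))) > 200:
            msg["content"] = PLACEHOLDER
    return messages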

Phase 2: Determine Boundaries

┌─────────────────────────────────────────────────────────────┐
│  Message list                                               │
│                                                             │
│  [0..2]  ← protect_first_n (system + first exchange)        │
│  [3..N]  ← middle turns → SUMMARIZED                        │
│  [N..end] ← tail (by token budget OR protect_last_n)        │
│                                                             │
└─────────────────────────────────────────────────────────────┘

Tail protection is token-budget based: the compressor walks backward from the end, accumulating tokens until the budget is exhausted. It falls back to the fixed protect_last_n count if the budget would protect fewer messages.

Boundaries are aligned to avoid splitting tool_call/tool_result groups. The _align_boundary_backward() method walks past consecutive tool results to find the parent assistant message, keeping groups intact.
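
A sketch of the tail-boundary selection under those rules; count_tokens is an assumed per-message token counter, and the function name is illustrative:

def find_tail_start(messages, tail_token_budget, protect_last_n, count_tokens):
    used = 0
    tail_start = len(messages)
    # Walk backward from the end, accumulating tokens until the tail budget is exhausted.
    for i in range(len(messages) - 1, -1, -1):
        used += count_tokens(messages[i])
        if used > tail_token_budget:
            break
        tail_start = i
    # Fall back to the fixed message count if the budget would protect fewer messages.
    tail_start = min(tail_start, len(messages) - protect_last_n)
    # The real compressor then applies _align_boundary_backward() so tool_call/tool_result
    # groups are never split across the boundary.
    return max(tail_start, 0)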

Phase 3: Generate Structured Summary

The middle turns are summarized using the auxiliary LLM with a structured template:

## Goal
[What the user is trying to accomplish]

## Constraints & Preferences
[User preferences, coding style, constraints, important decisions]

## Progress
### Done
[Completed work — specific file paths, commands run, results]
### In Progress
[Work currently underway]
### Blocked
[Any blockers or issues encountered]

## Key Decisions
[Important technical decisions and why]

## Relevant Files
[Files read, modified, or created — with brief note on each]

## Next Steps
[What needs to happen next]

## Critical Context
[Specific values, error messages, configuration details]

Summary budget scales with the amount of content being compressed:

  • Formula: content_tokens × 0.20 (the _SUMMARY_RATIO constant)
  • Minimum: 2,000 tokens
  • Maximum: min(context_length × 0.05, 12,000) tokens
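
Putting those bounds together, a minimal sketch (constants as described above, not the exact implementation):

_SUMMARY_RATIO = 0.20

def summary_token_budget(content_tokens, context_length):
    max_summary_tokens = min(int(context_length * 0.05), 12_000)
    scaled = int(content_tokens * _SUMMARY_RATIO)
    return max(2_000, min(scaled, max_summary_tokens))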

Phase 4: Assemble Compressed Messages

The compressed message list is:

  1. Head messages (with a note appended to system prompt on first compression)
  2. Summary message (role chosen to avoid consecutive same-role violations)
  3. Tail messages (unmodified)

Orphaned tool_call/tool_result pairs are cleaned up by _sanitize_tool_pairs():

  • Tool results referencing removed calls → removed
  • Tool calls whose results were removed → stub result injected
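
A simplified sketch of that cleanup, assuming OpenAI-style tool_calls/tool_call_id message shapes; the stub text is illustrative rather than the exact string injected:

def sanitize_tool_pairs(messages):
    call_ids = {tc["id"] for m in messages for tc in (m.get("tool_calls") or [])}
    result_ids = {m.get("tool_call_id") for m in messages if m.get("role") == "tool"}

    cleaned = []
    for msg in messages:
        # Drop tool results whose originating call was compacted away.
        if msg.get("role") == "tool" and msg.get("tool_call_id") not in call_ids:
            continue
        cleaned.append(msg)
        # Inject a stub result for any call whose real result was removed.
        for tc in (msg.get("tool_calls") or []):
            if tc["id"] not in result_ids:
                cleaned.append({
                    "role": "tool",
                    "tool_call_id": tc["id"],
                    "content": "[Tool output removed during compaction]",  # illustrative stub
                })
    return cleaned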

Iterative Re-compression

On subsequent compressions, the previous summary is passed to the LLM with instructions to update it rather than summarize from scratch. This preserves information across multiple compactions — items move from "In Progress" to "Done", new progress is added, and obsolete information is removed.

The _previous_summary field on the compressor instance stores the last summary text for this purpose.
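
A rough sketch of how that works; the prompt wording and helper names here are illustrative, not the actual summarization prompt:

def summarize_middle(middle_messages, previous_summary, call_auxiliary_llm):
    instructions = "Summarize the conversation turns below using the structured template."
    if previous_summary:
        instructions = (
            "An earlier summary already exists:\n\n"
            f"{previous_summary}\n\n"
            "Update it with the new turns below (move finished items to Done, add new "
            "progress, drop obsolete details) rather than summarizing from scratch."
        )
    return call_auxiliary_llm(instructions, middle_messages)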

Before/After Example

Before Compression (45 messages, ~95K tokens)

[0] system:    "You are a helpful assistant..." (system prompt)
[1] user:      "Help me set up a FastAPI project"
[2] assistant: <tool_call> terminal: mkdir project </tool_call>
[3] tool:      "directory created"
[4] assistant: <tool_call> write_file: main.py </tool_call>
[5] tool:      "file written (2.3KB)"
    ... 30 more turns of file editing, testing, debugging ...
[38] assistant: <tool_call> terminal: pytest </tool_call>
[39] tool:      "8 passed, 2 failed\n..."  (5KB output)
[40] user:      "Fix the failing tests"
[41] assistant: <tool_call> read_file: tests/test_api.py </tool_call>
[42] tool:      "import pytest\n..."  (3KB)
[43] assistant: "I see the issue with the test fixtures..."
[44] user:      "Great, also add error handling"

After Compression (25 messages, ~45K tokens)

[0] system:    "You are a helpful assistant...
               [Note: Some earlier conversation turns have been compacted...]"
[1] user:      "Help me set up a FastAPI project"
[2] assistant: "[CONTEXT COMPACTION] Earlier turns were compacted...

               ## Goal
               Set up a FastAPI project with tests and error handling

               ## Progress
               ### Done
               - Created project structure: main.py, tests/, requirements.txt
               - Implemented 5 API endpoints in main.py
               - Wrote 10 test cases in tests/test_api.py
               - 8/10 tests passing

               ### In Progress
               - Fixing 2 failing tests (test_create_user, test_delete_user)

               ## Relevant Files
               - main.py — FastAPI app with 5 endpoints
               - tests/test_api.py — 10 test cases
               - requirements.txt — fastapi, pytest, httpx

               ## Next Steps
               - Fix failing test fixtures
               - Add error handling"
[3] user:      "Fix the failing tests"
[4] assistant: <tool_call> read_file: tests/test_api.py </tool_call>
[5] tool:      "import pytest\n..."
[6] assistant: "I see the issue with the test fixtures..."
[7] user:      "Great, also add error handling"

Prompt Caching (Anthropic)

Source: agent/prompt_caching.py

Prompt caching reduces input token costs by ~75% on multi-turn conversations by caching the conversation prefix. It uses Anthropic's cache_control breakpoints.

Strategy: system_and_3

Anthropic allows a maximum of 4 cache_control breakpoints per request. Hermes uses the "system_and_3" strategy:

Breakpoint 1: System prompt           (stable across all turns)
Breakpoint 2: 3rd-to-last non-system message  ─┐
Breakpoint 3: 2nd-to-last non-system message   ├─ Rolling window
Breakpoint 4: Last non-system message          ─┘
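
A minimal sketch of how those four breakpoints could be selected; the function name is illustrative, and the real placement happens inside apply_anthropic_cache_control():

def select_breakpoints(messages):
    system_idx = [i for i, m in enumerate(messages) if m.get("role") == "system"]
    non_system = [i for i, m in enumerate(messages) if m.get("role") != "system"]
    # Breakpoint 1 on the system prompt, plus a rolling window over the last 3 other messages.
    return system_idx[:1] + non_system[-3:]   # at most 4 cache_control markers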

How It Works

apply_anthropic_cache_control() deep-copies the messages and injects cache_control markers:

# Cache marker format
marker = {"type": "ephemeral"}
# Or for 1-hour TTL:
marker = {"type": "ephemeral", "ttl": "1h"}

The marker is applied differently based on content type:

  • String content: converted to [{"type": "text", "text": ..., "cache_control": ...}]
  • List content: marker added to the last element's dict
  • None/empty content: marker added as msg["cache_control"]
  • Tool messages: marker added as msg["cache_control"] (native Anthropic only)
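
A simplified sketch of attaching a marker to one message under those rules; add_cache_marker is an illustrative name, and the real apply_anthropic_cache_control() places breakpoints across the whole message list:

import copy

def add_cache_marker(msg, ttl=None):
    msg = copy.deepcopy(msg)
    marker = {"type": "ephemeral", **({"ttl": ttl} if ttl else {})}
    content = msg.get("content")
    if isinstance(content, str) and content:
        # String content becomes a one-element content block carrying the marker.
        msg["content"] = [{"type": "text", "text": content, "cache_control": marker}]
    elif isinstance(content, list) and content:
        # List content: the marker goes on the last element's dict.
        msg["content"][-1]["cache_control"] = marker
    else:
        # Empty content (and tool messages on native Anthropic): marker on the message itself.
        msg["cache_control"] = marker
    return msg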

Cache-Aware Design Patterns

  1. Stable system prompt: The system prompt is breakpoint 1 and cached across all turns. Avoid mutating it mid-conversation (compression appends a note only on the first compaction).

  2. Message ordering matters: Cache hits require prefix matching. Adding or removing messages in the middle invalidates the cache for everything after.

  3. Compression cache interaction: After compression, the cache is invalidated for the compressed region but the system prompt cache survives. The rolling 3-message window re-establishes caching within 1-2 turns.

  4. TTL selection: Default is 5m (5 minutes). Use 1h for long-running sessions where the user takes breaks between turns.

Enabling Prompt Caching

Prompt caching is automatically enabled when:

  • The model is an Anthropic Claude model (detected by model name)
  • The provider supports cache_control (native Anthropic API or OpenRouter)

# config.yaml — TTL is configurable
model:
  cache_ttl: "5m"   # "5m" or "1h"
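
A rough sketch of the enablement check; the detection heuristics and function name are assumptions, not the exact logic in agent/prompt_caching.py:

def prompt_caching_enabled(model_name: str, provider: str) -> bool:
    is_claude = "claude" in model_name.lower()          # detected by model name
    supports_cache_control = provider in ("anthropic", "openrouter")
    return is_claude and supports_cache_control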

The CLI shows caching status at startup:

💾 Prompt caching: ENABLED (Claude via OpenRouter, 5m TTL)

Context Pressure Warnings

The agent emits context pressure warnings at 85% of the compression threshold (not 85% of the context window; 85% of the threshold, which is itself 50% of context by default):

⚠️  Context is 85% to compaction threshold (42,500/50,000 tokens)

After compression, if usage drops below 85% of threshold, the warning state is cleared. If compression fails to reduce below the warning level (the conversation is too dense), the warning persists but compression won't re-trigger until the threshold is exceeded again.
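
A minimal sketch of that warning check, with illustrative names:

WARN_RATIO = 0.85  # 85% of the compaction threshold, not of the context window

def context_pressure_warning(prompt_tokens, threshold_tokens):
    if prompt_tokens >= WARN_RATIO * threshold_tokens:
        return (f"⚠️  Context is 85% to compaction threshold "
                f"({prompt_tokens:,}/{threshold_tokens:,} tokens)")
    return None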