fix(compression): replace dead summary_target_tokens with ratio-based scaling (#2554)
The summary_target_tokens parameter was accepted in the constructor,
stored on the instance, and never used — the summary budget was always
computed from hardcoded module constants (_SUMMARY_RATIO=0.20,
_MAX_SUMMARY_TOKENS=8000). This caused two compounding problems:
1. The config value was silently ignored, giving users no control
over post-compression size.
2. Fixed budgets (20K tail, 8K summary cap) didn't scale with
context window size. Switching from a 1M-context model to a
200K model would trigger compression that nuked 350K tokens
of conversation history down to ~30K.
Changes:
- Replace summary_target_tokens with summary_target_ratio (default 0.40)
which sets the post-compression target as a fraction of context_length.
Tail token budget and summary cap now scale proportionally:
MiniMax 200K → ~80K post-compression
GPT-5 1M → ~400K post-compression
- Change threshold_percent default: 0.50 → 0.80 (don't fire until
80% of context is consumed)
- Change protect_last_n default: 4 → 20 (preserve ~10 full turns)
- Summary token cap scales to 5% of context (was fixed 8K), capped
at 32K ceiling
- Read target_ratio and protect_last_n from config.yaml compression
section (both are now configurable)
- Remove hardcoded summary_target_tokens=500 from run_agent.py
- Add 5 new tests for ratio scaling, clamping, and new defaults
This commit is contained in:
@@ -232,19 +232,33 @@ browser:
|
||||
# 1. Tracks actual token usage from API responses (not estimates)
|
||||
# 2. When prompt_tokens >= threshold% of model's context_length, triggers compression
|
||||
# 3. Protects first 3 turns (system prompt, initial request, first response)
|
||||
# 4. Protects last 4 turns (recent context is most relevant)
|
||||
# 4. Protects last N turns (default 20 messages = ~10 full turns of recent context)
|
||||
# 5. Summarizes middle turns using a fast/cheap model
|
||||
# 6. Inserts summary as a user message, continues conversation seamlessly
|
||||
#
|
||||
# Post-compression size scales with the model's context window via target_ratio:
|
||||
# MiniMax 200K context → ~80K post-compression (at 0.40 ratio)
|
||||
# GPT-5 1M context → ~400K post-compression (at 0.40 ratio)
|
||||
#
|
||||
compression:
|
||||
# Enable automatic context compression (default: true)
|
||||
# Set to false if you prefer to manage context manually or want errors on overflow
|
||||
enabled: true
|
||||
|
||||
# Trigger compression at this % of model's context limit (default: 0.85 = 85%)
|
||||
# Trigger compression at this % of model's context limit (default: 0.80 = 80%)
|
||||
# Lower values = more aggressive compression, higher values = compress later
|
||||
threshold: 0.85
|
||||
threshold: 0.80
|
||||
|
||||
# Target post-compression size as a fraction of context window (default: 0.40 = 40%)
|
||||
# Controls how much context survives compression. Tail token budget and summary
|
||||
# cap scale with this value. Range: 0.10 - 0.80
|
||||
target_ratio: 0.40
|
||||
|
||||
# Number of most-recent messages to always preserve (default: 20 ≈ 10 full turns)
|
||||
# Higher values keep more recent conversation intact at the cost of more aggressive
|
||||
# compression of older turns.
|
||||
protect_last_n: 20
|
||||
|
||||
# Model to use for generating summaries (fast/cheap recommended)
|
||||
# This model compresses the middle turns into a concise summary.
|
||||
# IMPORTANT: it receives the full middle section of the conversation, so it
|
||||
|
||||
Reference in New Issue
Block a user