# Long Context vs RAG Decision Framework

**Research Backlog Item #4.3** | Impact: 4 | Effort: 1 | Ratio: 4.0

**Date**: 2026-04-15

**Status**: RESEARCHED

## Executive Summary

Modern LLMs have 128K-200K+ context windows, but we still treat them like 4K models by default. This document provides a decision framework for when to stuff context vs. use RAG, based on empirical findings and our stack constraints.

## The Core Insight

**Long context ≠ better answers.** Research shows:

- "Lost in the Middle" effect: models attend poorly to information in the middle of long contexts (Liu et al., 2023)
- RAG with reranking outperforms full-context stuffing for document QA when docs exceed ~50K tokens
- Compute cost scales quadratically with context length (self-attention)
- Latency increases roughly linearly with input length

**RAG ≠ always better.** Retrieval introduces:

- Recall errors (missing relevant chunks)
- Precision errors (retrieving irrelevant chunks)
- Chunking artifacts (splitting mid-sentence)
- Additional latency for embedding + search

## Decision Matrix

| Scenario | Context Size | Recommendation | Why |
|----------|--------------|----------------|-----|
| Single conversation (< 32K) | Small | **Stuff everything** | No retrieval overhead, full context available |
| 5-20 documents, focused query | 32K-128K | **Hybrid** | Key docs in context, rest via RAG |
| Large corpus search | > 128K | **Pure RAG + reranking** | Full context impossible, must retrieve |
| Code review (< 5 files) | < 32K | **Stuff everything** | Code needs full context for understanding |
| Code review (repo-wide) | > 128K | **RAG with file-level chunks** | Files are natural chunk boundaries |
| Multi-turn conversation | Growing | **Hybrid + compression** | Keep recent turns in full, compress older |
| Fact retrieval | Any | **RAG** | Always faster to search than read everything |
| Complex reasoning across docs | 32K-128K | **Stuff + chain-of-thought** | Models need all context for cross-doc reasoning |

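The matrix above can be encoded directly. Here is a minimal sketch of the ContextRouter proposed under Recommendations below, assuming per-session token estimates are already available; the `Strategy` names and thresholds are illustrative, not part of hermes today:

```python
from dataclasses import dataclass
from enum import Enum


class Strategy(Enum):
    STUFF = "stuff_everything"
    HYBRID = "hybrid"
    RAG = "rag_with_rerank"


@dataclass
class SessionInfo:
    total_tokens: int     # estimated size if we stuffed everything
    model_context: int    # context window of the routed model
    is_fact_lookup: bool  # single-fact queries always go to RAG


def route(session: SessionInfo) -> Strategy:
    """Apply the decision matrix above, per session."""
    if session.is_fact_lookup:
        return Strategy.RAG    # searching beats reading everything
    if session.total_tokens < min(32_000, session.model_context):
        return Strategy.STUFF  # small enough: no retrieval overhead
    if session.total_tokens <= session.model_context:
        return Strategy.HYBRID # key docs stuffed, the rest via RAG
    return Strategy.RAG        # corpus exceeds the window: must retrieve
```

`route()` is deliberately dumb: the point is that the decision lives in one place that hermes can call before building a prompt.
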
## Our Stack Constraints

### What We Have

- **Cloud models**: 128K-200K context (OpenRouter providers)
- **Local Ollama**: 8K-32K context (Gemma-4 default 8192)
- **Hermes fact_store**: SQLite FTS5 full-text search (see the sketch after this list)
- **Memory**: MemPalace holographic embeddings
- **Session context**: growing conversation history
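
As an illustration of the fact_store path, a sketch of an FTS5 lookup; the `facts` table name and single-column schema are assumptions, not Hermes' actual layout:

```python
import sqlite3

conn = sqlite3.connect("hermes.db")
# Hypothetical schema: an FTS5 virtual table over stored facts.
conn.execute("CREATE VIRTUAL TABLE IF NOT EXISTS facts USING fts5(content)")


def search_facts(query: str, k: int = 5) -> list[str]:
    """Return the top-k facts, best-first by FTS5's built-in BM25 rank."""
    rows = conn.execute(
        "SELECT content FROM facts WHERE facts MATCH ? ORDER BY rank LIMIT ?",
        (query, k),
    ).fetchall()
    return [content for (content,) in rows]
```
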
### What This Means

1. **Cloud sessions**: we CAN stuff up to 128K, but SHOULD we? Cost and latency matter.
2. **Local sessions**: MUST use RAG for anything beyond 8K; long context isn't available.
3. **Mixed fleet**: we need a routing layer that decides per-session (see the ContextRouter sketch under the decision matrix).

## Advanced Patterns

### 1. Progressive Context Loading

Don't load everything at once. Start with RAG, then stuff additional docs as needed:

```
Turn 1: RAG search → top 3 chunks
Turn 2: Model asks "I need more context about X" → stuff X
Turn 3: Model has enough → continue
```
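
A minimal sketch of this loop; `rag_search`, `fetch_doc`, and `ask_model` are hypothetical helpers, and the `NEED_CONTEXT:` sentinel is an illustrative convention rather than anything our stack defines:

```python
def answer_progressively(question: str, max_rounds: int = 3) -> str:
    """Start with top-k RAG chunks; stuff full docs only on request."""
    context = rag_search(question, k=3)           # Turn 1: retrieval only
    for _ in range(max_rounds):
        reply = ask_model(question, context)
        if not reply.startswith("NEED_CONTEXT:"):
            return reply                          # model had enough
        doc_id = reply.removeprefix("NEED_CONTEXT:").strip()
        context.append(fetch_doc(doc_id))         # Turn 2+: stuff doc X
    return ask_model(question, context)           # best effort within budget
```
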
### 2. Context Budgeting

Allocate the context budget across components:

```
System prompt:    2,000 tokens (always)
Recent messages: 10,000 tokens (last 5 turns)
RAG results:      8,000 tokens (top chunks)
Stuffed docs:    12,000 tokens (key docs)
---------------------------
Total:           32,000 tokens (fits a 32K model)
```
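
A sketch of enforcing that budget; `count_tokens` is a hypothetical helper (any tokenizer works), and the crude truncation loop is just one possible policy:

```python
# Fixed allocations per component, in tokens (mirrors the layout above).
BUDGET = {
    "system": 2_000,
    "recent_messages": 10_000,
    "rag_results": 8_000,
    "stuffed_docs": 12_000,
}  # total: 32,000 (fits a 32K model)


def fit_to_budget(components: dict[str, str]) -> str:
    """Truncate each component to its allocation and assemble the prompt."""
    parts = []
    for name, limit in BUDGET.items():
        text = components.get(name, "")
        while count_tokens(text) > limit:        # count_tokens: hypothetical
            text = text[: int(len(text) * 0.9)]  # crude 10% trims until it fits
        parts.append(text)
    return "\n\n".join(parts)
```

Per-component caps mean one oversized piece (a huge stuffed doc, say) can never crowd out the recent turns.
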
### 3. Smart Compression

Before stuffing, compress older context (a sketch follows this list):

- Summarize turns older than 10
- Remove tool call results (keep only final outputs)
- Deduplicate repeated information
- Use structured representations (JSON) instead of prose
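
A sketch of the first two rules; `summarize` is a hypothetical helper (e.g., a cheap local-model call), and the message dict shape is assumed:

```python
KEEP_FULL = 10  # turns newer than this stay verbatim


def compress_history(messages: list[dict]) -> list[dict]:
    """Summarize old turns and drop raw tool payloads, keeping final outputs."""
    old, recent = messages[:-KEEP_FULL], messages[-KEEP_FULL:]
    compressed = []
    if old:
        digest = summarize(old)  # summarize: hypothetical cheap-model helper
        compressed.append(
            {"role": "system", "content": f"Earlier turns, summarized: {digest}"}
        )
    for msg in recent:
        if msg["role"] == "tool":
            continue  # drop raw tool results; assistant turns carry the outcomes
        compressed.append(msg)
    return compressed
```
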
## Empirical Benchmarks Needed

1. **Stuffing vs RAG accuracy** on our fact_store queries
2. **Latency comparison** at 32K, 64K, and 128K context
3. **Cost per query** for cloud models at various context sizes
4. **Local model behavior** when pushed beyond its rated context

## Recommendations

1. **Audit current context usage**: how many sessions exceed 32K? (Low effort, high value.)
2. **Implement ContextRouter**: ~50 LOC that adds routing decisions to hermes; the sketch under the decision matrix is a starting point
3. **Add context-size logging**: track input tokens per session to gather data (a logging sketch follows this list)
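
For recommendation 3, a minimal logging sketch; the table layout and the 4-chars-per-token estimate are assumptions, not existing hermes code:

```python
import sqlite3
import time

log = sqlite3.connect("context_log.db")
log.execute(
    "CREATE TABLE IF NOT EXISTS context_log "
    "(session_id TEXT, ts REAL, input_tokens INTEGER)"
)


def log_context_size(session_id: str, prompt: str) -> None:
    """Record input size per session; ~4 chars/token is a rough heuristic."""
    est_tokens = len(prompt) // 4  # swap in a real tokenizer count if available
    log.execute(
        "INSERT INTO context_log VALUES (?, ?, ?)",
        (session_id, time.time(), est_tokens),
    )
    log.commit()
```
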
## References

- Liu et al., "Lost in the Middle: How Language Models Use Long Contexts" (2023) — https://arxiv.org/abs/2307.03172
- Shi et al., "Large Language Models Can Be Easily Distracted by Irrelevant Context" (2023)
- Xu et al., "Retrieval Meets Long Context Large Language Models" (2023) — hybrid approaches outperform either alone
- Anthropic's prompt caching (Claude 3.5) — built-in prefix caching reduces cost for repeated system prompts

---

*Sovereignty and service always.*