
Long Context vs RAG Decision Framework

Research Backlog Item #4.3 | Impact: 4 | Effort: 1 | Ratio: 4.0
Date: 2026-04-15
Status: RESEARCHED

Executive Summary

Modern LLMs have 128K-200K+ context windows, but we still treat them like 4K models by default. This document provides a decision framework for when to stuff context vs. use RAG, based on empirical findings and our stack constraints.

The Core Insight

Long context ≠ better answers. Research shows:

  • "Lost in the Middle" effect: Models attend poorly to information in the middle of long contexts (Liu et al., 2023)
  • RAG with reranking outperforms full-context stuffing for document QA when docs > 50K tokens
  • Compute cost scales quadratically with context length (self-attention), while API billing grows linearly with input tokens
  • Latency increases linearly with input length

RAG ≠ always better. Retrieval introduces:

  • Recall errors (miss relevant chunks)
  • Precision errors (retrieve irrelevant chunks)
  • Chunking artifacts (splitting mid-sentence)
  • Additional latency for embedding + search

Decision Matrix

| Scenario | Context Size | Recommendation | Why |
| --- | --- | --- | --- |
| Single conversation | Small (< 32K) | Stuff everything | No retrieval overhead, full context available |
| 5-20 documents, focused query | 32K-128K | Hybrid | Key docs in context, rest via RAG |
| Large corpus search | > 128K | Pure RAG + reranking | Full context impossible, must retrieve |
| Code review (< 5 files) | < 32K | Stuff everything | Code needs full context for understanding |
| Code review (repo-wide) | > 128K | RAG with file-level chunks | Files are natural chunk boundaries |
| Multi-turn conversation | Growing | Hybrid + compression | Keep recent turns in full, compress older |
| Fact retrieval | Any | RAG | Always faster to search than read everything |
| Complex reasoning across docs | 32K-128K | Stuff + chain-of-thought | Models need all context for cross-doc reasoning |
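
Encoded as code, the matrix collapses to a small lookup. A minimal sketch in Python; the strategy names and thresholds are illustrative, not an existing hermes module:

```python
def choose_strategy(scenario: str, context_tokens: int) -> str:
    """Map a (scenario, estimated context size) pair to a context strategy.

    Thresholds mirror the decision matrix above; tune them per model.
    """
    if scenario == "fact_retrieval":
        return "rag"                      # always cheaper to search than to read
    if scenario == "large_corpus_search" or context_tokens > 128_000:
        return "rag_with_reranking"       # full context impossible, must retrieve
    if scenario == "multi_turn":
        return "hybrid_with_compression"  # recent turns verbatim, older turns compressed
    if context_tokens <= 32_000:
        return "stuff_everything"         # no retrieval overhead, full context available
    return "hybrid"                       # key docs stuffed, the rest via RAG
```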

Our Stack Constraints

What We Have

  • Cloud models: 128K-200K context (OpenRouter providers)
  • Local Ollama: 8K-32K context (Gemma-4 default 8192)
  • Hermes fact_store: SQLite FTS5 full-text search
  • Memory: MemPalace holographic embeddings
  • Session context: Growing conversation history

What This Means

  1. Cloud sessions: We CAN stuff up to 128K, but SHOULD we? Cost and latency matter.
  2. Local sessions: MUST use RAG for anything beyond 8K. Long context not available.
  3. Mixed fleet: Need a routing layer that decides per-session.
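
Recommendation 2 below calls for a ContextRouter. A minimal sketch of what that routing layer could look like, assuming hypothetical backend names and window sizes (nothing here exists in hermes yet):

```python
from dataclasses import dataclass

@dataclass
class Backend:
    name: str
    max_context: int   # tokens the model will actually attend to

# Illustrative fleet; real window sizes come from the provider / Ollama config.
CLOUD = Backend("openrouter-cloud", 128_000)
LOCAL = Backend("ollama-local", 8_192)

def route_session(estimated_tokens: int, prefer_local: bool = True) -> tuple[Backend, str]:
    """Pick a backend and a context strategy for one session."""
    if prefer_local and estimated_tokens <= LOCAL.max_context:
        return LOCAL, "stuff_everything"
    if estimated_tokens <= 32_000:
        return CLOUD, "stuff_everything"     # fits comfortably, skip retrieval
    if estimated_tokens <= CLOUD.max_context:
        return CLOUD, "hybrid"               # stuff key docs, RAG the rest
    return CLOUD, "rag_with_reranking"       # beyond any window, must retrieve
```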

Advanced Patterns

1. Progressive Context Loading

Don't load everything at once. Start with RAG, then stuff additional docs as needed:

Turn 1: RAG search → top 3 chunks
Turn 2: Model asks "I need more context about X" → stuff X
Turn 3: Model has enough → continue
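
A sketch of that loop, with the retrieval, model, and document-fetch calls passed in as placeholders; the NEED_CONTEXT marker is an assumed convention, not an existing protocol:

```python
from typing import Callable

def progressive_answer(
    question: str,
    rag_search: Callable[[str, int], str],   # (query, top_k) -> concatenated chunks
    ask_model: Callable[[str, str], str],    # (question, context) -> reply
    fetch_doc: Callable[[str], str],         # (topic) -> full document text
    max_rounds: int = 3,
) -> str:
    """Start with a small RAG context; widen it only when the model asks for more."""
    context = rag_search(question, 3)                 # Turn 1: top 3 chunks
    for _ in range(max_rounds):
        reply = ask_model(question, context)
        if not reply.startswith("NEED_CONTEXT:"):     # model signals it has enough
            return reply
        topic = reply.removeprefix("NEED_CONTEXT:").strip()
        context += "\n\n" + fetch_doc(topic)          # Turn 2+: stuff the requested doc
    return ask_model(question, context)               # stop widening, answer anyway
```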

2. Context Budgeting

Allocate context budget across components:

System prompt:     2,000 tokens  (always)
Recent messages:  10,000 tokens  (last 5 turns)
RAG results:       8,000 tokens  (top chunks)
Stuffed docs:     12,000 tokens  (key docs)
---------------------------
Total:            32,000 tokens  (fits 32K model)
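
A sketch of a budget allocator using those numbers; count_tokens stands in for whatever tokenizer the session already uses, and the trimming strategy (drop trailing lines, then a rough character cut) is an assumption:

```python
from typing import Callable

BUDGET = {                       # tokens, matching the split above (total 32,000)
    "system_prompt": 2_000,
    "recent_messages": 10_000,
    "rag_results": 8_000,
    "stuffed_docs": 12_000,
}

def fit_to_budget(parts: dict[str, str], count_tokens: Callable[[str], int]) -> dict[str, str]:
    """Trim each component to its allocation so the combined prompt fits a 32K model."""
    fitted = {}
    for name, limit in BUDGET.items():
        text = parts.get(name, "")
        while count_tokens(text) > limit and "\n" in text:
            text = text.rsplit("\n", 1)[0]       # drop trailing lines until it fits
        if count_tokens(text) > limit:
            text = text[: limit * 4]             # fallback: rough 4-chars-per-token cut
        fitted[name] = text
    return fitted
```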

3. Smart Compression

Before stuffing, compress older context:

  • Summarize turns older than the 10 most recent
  • Remove tool call results (keep only final outputs)
  • Deduplicate repeated information
  • Use structured representations (JSON) instead of prose
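
A sketch of that compression pass; Turn and summarize are hypothetical stand-ins for the real message record and a cheap summarization call:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Turn:
    role: str                          # "user", "assistant", or "tool"
    content: str
    final_output: str | None = None    # for tool turns, the distilled result

def compress_history(turns: list[Turn], summarize: Callable[[str], str],
                     keep_recent: int = 10) -> list[Turn]:
    """Keep the newest turns verbatim; shrink everything older."""
    old, recent = turns[:-keep_recent], turns[-keep_recent:]

    compressed, seen = [], set()
    for t in old:
        # Keep only final outputs for tool calls, drop the raw results.
        text = t.final_output if t.role == "tool" and t.final_output else t.content
        if text in seen:                      # deduplicate repeated information
            continue
        seen.add(text)
        compressed.append(Turn(t.role, text))

    if compressed:                            # collapse the older span into one summary turn
        blob = "\n".join(f"{t.role}: {t.content}" for t in compressed)
        compressed = [Turn("assistant", summarize(blob))]

    return compressed + recent
```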

Empirical Benchmarks Needed

  1. Stuffing vs RAG accuracy on our fact_store queries
  2. Latency comparison at 32K, 64K, 128K context
  3. Cost per query for cloud models at various context sizes
  4. Local model behavior when pushed beyond rated context
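
A minimal harness for benchmarks 2 and 3 might look like this; run_query and prompt_builder are placeholders for the fleet's actual client, and the default price is a stand-in, not a real rate:

```python
import time

def benchmark(run_query, prompt_builder, context_sizes=(32_000, 64_000, 128_000),
              price_per_1k_input: float = 0.003):
    """Time one query per context size and estimate input-token cost."""
    rows = []
    for size in context_sizes:
        prompt = prompt_builder(size)             # build a prompt of roughly `size` tokens
        start = time.perf_counter()
        run_query(prompt)
        elapsed = time.perf_counter() - start
        rows.append((size, elapsed, size / 1000 * price_per_1k_input))
    for size, secs, cost in rows:
        print(f"{size:>8} tokens  {secs:6.2f}s  ~${cost:.4f}")
    return rows
```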

Recommendations

  1. Audit current context usage: How many sessions hit > 32K? (Low effort, high value)
  2. Implement ContextRouter: ~50 LOC, adds routing decisions to hermes
  3. Add context-size logging: Track input tokens per session for data gathering
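
For recommendation 3, the logging hook can be tiny. A sketch against SQLite; the context_log table is illustrative, not part of the existing fact_store schema:

```python
import sqlite3
import time

def log_context_size(db_path: str, session_id: str, input_tokens: int) -> None:
    """Append one row per request so we can later see how many sessions exceed 32K."""
    con = sqlite3.connect(db_path)
    con.execute(
        "CREATE TABLE IF NOT EXISTS context_log "
        "(ts REAL, session_id TEXT, input_tokens INTEGER)"
    )
    con.execute(
        "INSERT INTO context_log VALUES (?, ?, ?)",
        (time.time(), session_id, input_tokens),
    )
    con.commit()
    con.close()
```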

References

  • Liu et al. "Lost in the Middle: How Language Models Use Long Contexts" (2023) — https://arxiv.org/abs/2307.03172
  • Shi et al. "Large Language Models Can Be Easily Distracted by Irrelevant Context" (2023)
  • Xu et al. "Retrieval meets Long Context Large Language Models" (2023) — hybrid approaches outperform both alone
  • Anthropic's Claude 3.5 prompt caching — built-in prefix caching reduces cost for repeated system prompts

Sovereignty and service always.