Long Context vs RAG Decision Framework
Research Backlog Item #4.3 | Impact: 4 | Effort: 1 | Ratio: 4.0
Date: 2026-04-15
Status: RESEARCHED
Executive Summary
Modern LLMs have 128K-200K+ context windows, but we still treat them like 4K models by default. This document provides a decision framework for when to stuff context vs. use RAG, based on empirical findings and our stack constraints.
The Core Insight
Long context ≠ better answers. Research shows:
- "Lost in the Middle" effect: Models attend poorly to information in the middle of long contexts (Liu et al., 2023)
- RAG with reranking outperforms full-context stuffing for document QA when docs > 50K tokens
- Attention compute scales quadratically with context length, and per-token pricing makes long prompts proportionally more expensive
- Prefill latency grows roughly linearly with input length
RAG ≠ always better. Retrieval introduces:
- Recall errors (miss relevant chunks)
- Precision errors (retrieve irrelevant chunks)
- Chunking artifacts (splitting mid-sentence)
- Additional latency for embedding + search
Decision Matrix
| Scenario | Context Size | Recommendation | Why |
|---|---|---|---|
| Single conversation (< 32K) | Small | Stuff everything | No retrieval overhead, full context available |
| 5-20 documents, focused query | 32K-128K | Hybrid | Key docs in context, rest via RAG |
| Large corpus search | > 128K | Pure RAG + reranking | Full context impossible, must retrieve |
| Code review (< 5 files) | < 32K | Stuff everything | Code needs full context for understanding |
| Code review (repo-wide) | > 128K | RAG with file-level chunks | Files are natural chunk boundaries |
| Multi-turn conversation | Growing | Hybrid + compression | Keep recent turns in full, compress older |
| Fact retrieval | Any | RAG | Always faster to search than read everything |
| Complex reasoning across docs | 32K-128K | Stuff + chain-of-thought | Models need all context for cross-doc reasoning |
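A minimal sketch of the matrix as code, assuming token counts are estimated upstream; the `choose_strategy()` helper and `Strategy` enum are illustrative names, not existing hermes code:

```python
from enum import Enum

class Strategy(Enum):
    STUFF = "stuff_everything"
    HYBRID = "hybrid"
    RAG = "rag_with_rerank"

def choose_strategy(total_tokens: int, is_fact_lookup: bool = False,
                    needs_cross_doc_reasoning: bool = False) -> Strategy:
    """Map a request onto the decision-matrix rows above."""
    if is_fact_lookup:
        return Strategy.RAG        # always faster to search than to read everything
    if total_tokens < 32_000:
        return Strategy.STUFF      # small enough: no retrieval overhead
    if total_tokens <= 128_000:
        # cross-doc reasoning wants everything in context; otherwise key docs + RAG
        return Strategy.STUFF if needs_cross_doc_reasoning else Strategy.HYBRID
    return Strategy.RAG            # larger than any window: must retrieve
```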
Our Stack Constraints
What We Have
- Cloud models: 128K-200K context (OpenRouter providers)
- Local Ollama: 8K-32K context (Gemma-4 default 8192)
- Hermes fact_store: SQLite FTS5 full-text search
- Memory: MemPalace holographic embeddings
- Session context: Growing conversation history
What This Means
- Cloud sessions: We CAN stuff up to 128K but SHOULD we? Cost and latency matter.
- Local sessions: MUST use RAG for anything beyond 8K. Long context not available.
- Mixed fleet: Need a routing layer that decides per-session.
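One possible shape for that routing layer, sketched under the assumption that the backend's context limit is known up front; the `ContextRouter` name and interface are assumptions, not existing hermes code:

```python
# Context limits reflect the stack described above (cloud ~128K, local Ollama ~8K).
CONTEXT_LIMITS = {"cloud": 128_000, "local": 8_192}

class ContextRouter:
    def __init__(self, backend: str):
        self.limit = CONTEXT_LIMITS[backend]

    def route(self, estimated_prompt_tokens: int) -> str:
        usable = int(self.limit * 0.75)   # leave headroom for the model's answer
        if estimated_prompt_tokens <= usable:
            return "stuff"
        if self.limit >= 32_000:
            return "hybrid"               # cloud: key docs in context, rest via RAG
        return "rag"                      # local 8K: retrieval is the only option
```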
Advanced Patterns
1. Progressive Context Loading
Don't load everything at once. Start with RAG, then stuff additional docs as needed:
Turn 1: RAG search → top 3 chunks
Turn 2: Model asks "I need more context about X" → stuff X
Turn 3: Model has enough → continue
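A rough sketch of that loop, assuming hypothetical `rag_search()`, `load_document()`, and `ask_model()` helpers and a simple text convention for the model to request more context:

```python
def answer_progressively(query: str, max_rounds: int = 3) -> str:
    context = rag_search(query, top_k=3)           # Turn 1: start cheap with top chunks
    reply = ""
    for _ in range(max_rounds):
        reply = ask_model(query, context)
        # Convention (assumed): model signals missing context with "NEED_CONTEXT: <topic>"
        if reply.startswith("NEED_CONTEXT:"):
            topic = reply.split(":", 1)[1].strip()
            context += load_document(topic)        # Turn 2: stuff exactly what was asked for
            continue
        return reply                               # Turn 3: enough context, done
    return reply
```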
2. Context Budgeting
Allocate context budget across components:
System prompt: 2,000 tokens (always)
Recent messages: 10,000 tokens (last 5 turns)
RAG results: 8,000 tokens (top chunks)
Stuffed docs: 12,000 tokens (key docs)
---------------------------
Total: 32,000 tokens (fits 32K model)
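A sketch of enforcing that allocation before prompt assembly; the split mirrors the example budget above, and `count_tokens()` / `truncate_to_tokens()` are assumed tokenizer helpers, not existing APIs:

```python
BUDGET = {
    "system_prompt": 2_000,
    "recent_messages": 10_000,
    "rag_results": 8_000,
    "stuffed_docs": 12_000,
}  # total: 32,000 tokens, fits a 32K model

def fit_to_budget(parts: dict[str, str]) -> dict[str, str]:
    """Trim each component to its token allowance before assembling the prompt."""
    fitted = {}
    for name, text in parts.items():
        allowance = BUDGET[name]
        if count_tokens(text) <= allowance:
            fitted[name] = text
        else:
            fitted[name] = truncate_to_tokens(text, allowance)
    return fitted
```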
3. Smart Compression
Before stuffing, compress older context:
- Summarize turns older than 10
- Remove tool call results (keep only final outputs)
- Deduplicate repeated information
- Use structured representations (JSON) instead of prose
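A compression pass might look roughly like this, assuming turns are stored as role/content dicts and that `summarize()` and `dedupe()` are backed by a cheap model call and simple deduplication respectively (both assumed helpers):

```python
def compress_history(turns: list[dict], keep_recent: int = 10) -> list[dict]:
    recent, older = turns[-keep_recent:], turns[:-keep_recent]
    compressed = []
    for turn in older:
        if turn.get("tool_result"):
            continue                              # drop raw tool output, keep final answers
        compressed.append({"role": turn["role"],
                           "content": summarize(turn["content"])})  # shorten older turns
    return dedupe(compressed) + recent            # dedupe() removes repeated information
```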
Empirical Benchmarks Needed
- Stuffing vs RAG accuracy on our fact_store queries
- Latency comparison at 32K, 64K, 128K context
- Cost per query for cloud models at various context sizes
- Local model behavior when pushed beyond rated context
Recommendations
- Audit current context usage: How many sessions hit > 32K? (Low effort, high value)
- Implement ContextRouter: ~50 LOC, adds routing decisions to hermes
- Add context-size logging: Track input tokens per session for data gathering
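For the logging recommendation, a minimal sketch against SQLite; the table name and columns are assumptions, not an existing hermes schema:

```python
import sqlite3, time

def log_context_size(db: sqlite3.Connection, session_id: str, input_tokens: int) -> None:
    db.execute("CREATE TABLE IF NOT EXISTS context_log "
               "(ts REAL, session_id TEXT, input_tokens INTEGER)")
    db.execute("INSERT INTO context_log VALUES (?, ?, ?)",
               (time.time(), session_id, input_tokens))
    db.commit()

# The audit then becomes a single query:
#   SELECT COUNT(DISTINCT session_id) FROM context_log WHERE input_tokens > 32000;
```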
References
- Liu et al., "Lost in the Middle: How Language Models Use Long Contexts" (2023). https://arxiv.org/abs/2307.03172
- Shi et al., "Large Language Models Can Be Easily Distracted by Irrelevant Context" (2023)
- Xu et al., "Retrieval meets Long Context Large Language Models" (2023). Finds hybrid retrieval plus long-context approaches outperform either alone.
- Anthropic prompt caching: built-in prefix caching reduces cost for repeated system prompts.
Sovereignty and service always.