Compare commits: burn/715-1 ... fix/691 (5 commits)

Commits: 5b0438f2f5, 601c5fe267, 6222b18a38, 10fd467b28, ba2d365669

research/long-context-vs-rag-decision-framework.md (new file, 102 lines)
@@ -0,0 +1,102 @@
# Long Context vs RAG Decision Framework

**Research Backlog Item #4.3** | Impact: 4 | Effort: 1 | Ratio: 4.0

**Date**: 2026-04-15

**Status**: RESEARCHED

## Executive Summary

Modern LLMs have 128K-200K+ context windows, but we still treat them like 4K models by default. This document provides a decision framework for when to stuff context vs. use RAG, based on empirical findings and our stack constraints.

## The Core Insight

**Long context ≠ better answers.** Research shows:

- "Lost in the Middle" effect: models attend poorly to information in the middle of long contexts (Liu et al., 2023)
- RAG with reranking outperforms full-context stuffing for document QA when docs exceed 50K tokens
- Attention computation scales quadratically with context length, and API cost scales linearly with input tokens
- Latency grows roughly linearly with input length

**RAG ≠ always better.** Retrieval introduces:

- Recall errors (missing relevant chunks)
- Precision errors (retrieving irrelevant chunks)
- Chunking artifacts (splitting mid-sentence)
- Additional latency for embedding + search

## Decision Matrix

| Scenario | Context Size | Recommendation | Why |
|----------|--------------|----------------|-----|
| Single conversation (< 32K) | Small | **Stuff everything** | No retrieval overhead, full context available |
| 5-20 documents, focused query | 32K-128K | **Hybrid** | Key docs in context, rest via RAG |
| Large corpus search | > 128K | **Pure RAG + reranking** | Full context impossible, must retrieve |
| Code review (< 5 files) | < 32K | **Stuff everything** | Code needs full context for understanding |
| Code review (repo-wide) | > 128K | **RAG with file-level chunks** | Files are natural chunk boundaries |
| Multi-turn conversation | Growing | **Hybrid + compression** | Keep recent turns in full, compress older |
| Fact retrieval | Any | **RAG** | Always faster to search than read everything |
| Complex reasoning across docs | 32K-128K | **Stuff + chain-of-thought** | Models need all context for cross-doc reasoning |

## Our Stack Constraints

### What We Have

- **Cloud models**: 128K-200K context (OpenRouter providers)
- **Local Ollama**: 8K-32K context (Gemma-4 default 8192)
- **Hermes fact_store**: SQLite FTS5 full-text search
- **Memory**: MemPalace holographic embeddings
- **Session context**: growing conversation history

### What This Means

1. **Cloud sessions**: we CAN stuff up to 128K, but SHOULD we? Cost and latency matter.
2. **Local sessions**: MUST use RAG for anything beyond 8K; long context is not available.
3. **Mixed fleet**: we need a routing layer that decides per session.
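A minimal sketch of such a routing layer, following the decision matrix. Names and thresholds here are illustrative, not an existing hermes API:

```python
from dataclasses import dataclass

@dataclass
class RoutingDecision:
    strategy: str  # "stuff" | "hybrid" | "rag"
    reason: str

def route(total_tokens: int, model_context: int,
          reserved_for_output: int = 2048) -> RoutingDecision:
    """Pick a context strategy per session based on the model's window."""
    budget = model_context - reserved_for_output
    if total_tokens <= budget // 2:
        # Everything fits with room to spare: no retrieval overhead.
        return RoutingDecision("stuff", "fits comfortably in context")
    if total_tokens <= budget:
        # Fits, but tight: keep key docs in context, fetch the rest via RAG.
        return RoutingDecision("hybrid", "fits, but leave headroom")
    # Exceeds the window (e.g. an 8K local model): retrieval is mandatory.
    return RoutingDecision("rag", "exceeds context window, must retrieve")
```

The same session routes differently per model: `route(20_000, 8_192)` forces RAG on a local Ollama model, while `route(20_000, 131_072)` can still stuff on a cloud model.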

## Advanced Patterns

### 1. Progressive Context Loading

Don't load everything at once. Start with RAG, then stuff additional docs as needed:

```
Turn 1: RAG search → top 3 chunks
Turn 2: Model asks "I need more context about X" → stuff X
Turn 3: Model has enough → continue
```
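The flow above can be sketched as a loop. The `NEED:` sentinel protocol and the callables are assumptions for illustration, not an existing interface:

```python
def progressive_answer(query, search, fetch_doc, ask_model, max_rounds=3):
    """Start with top-k RAG chunks; stuff more docs only on request."""
    context = search(query, k=3)           # Turn 1: RAG search
    for _ in range(max_rounds):
        reply = ask_model(query, context)
        if not reply.startswith("NEED:"):
            return reply                   # model had enough context
        # Model asked for more context about some doc; stuff it.
        doc_id = reply.removeprefix("NEED:").strip()
        context.append(fetch_doc(doc_id))
    return ask_model(query, context)       # stop requesting, answer anyway
```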

### 2. Context Budgeting

Allocate the context budget across components:

```
System prompt:     2,000 tokens (always)
Recent messages:  10,000 tokens (last 5 turns)
RAG results:       8,000 tokens (top chunks)
Stuffed docs:     12,000 tokens (key docs)
---------------------------
Total:            32,000 tokens (fits a 32K model)
```
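A sketch of that allocation as a greedy packer. The chars/4 token estimator is a rough stand-in for a real tokenizer:

```python
def fit_budget(model_context: int, parts: list[str],
               est=lambda s: len(s) // 4) -> list[str]:
    """Greedily pack parts (highest priority first) until the budget is spent.

    `parts` is ordered: system prompt, recent messages, RAG chunks, stuffed docs,
    so the lowest-priority components are the first to be dropped.
    """
    budget = model_context
    packed = []
    for part in parts:
        cost = est(part)
        if cost > budget:
            break  # stop at the first part that no longer fits
        packed.append(part)
        budget -= cost
    return packed
```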

### 3. Smart Compression

Before stuffing, compress older context:

- Summarize turns older than 10
- Remove tool-call results (keep only final outputs)
- Deduplicate repeated information
- Use structured representations (JSON) instead of prose
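A sketch of the first two rules. `summarize` stands in for an actual summarization call (here just truncation), and the message shape is an assumption:

```python
def compress_history(turns: list[dict], keep_recent: int = 10,
                     summarize=lambda text: text[:200]) -> list[dict]:
    """Summarize turns older than keep_recent and drop raw tool-call results."""
    old, recent = turns[:-keep_recent], turns[-keep_recent:]
    compressed = []
    if old:
        # Fold everything older than the window into one summary message.
        blob = "\n".join(t["content"] for t in old if t.get("role") != "tool")
        compressed.append({"role": "system",
                           "content": "Earlier context (summary): " + summarize(blob)})
    # Keep recent turns verbatim, minus raw tool outputs.
    compressed.extend(t for t in recent if t.get("role") != "tool")
    return compressed
```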

## Empirical Benchmarks Needed

1. **Stuffing vs RAG accuracy** on our fact_store queries
2. **Latency comparison** at 32K, 64K, and 128K context
3. **Cost per query** for cloud models at various context sizes
4. **Local model behavior** when pushed beyond rated context

## Recommendations

1. **Audit current context usage**: how many sessions exceed 32K? (Low effort, high value)
2. **Implement ContextRouter**: ~50 LOC; adds routing decisions to hermes
3. **Add context-size logging**: track input tokens per session to gather data

## References

- Liu et al., "Lost in the Middle: How Language Models Use Long Contexts" (2023) — https://arxiv.org/abs/2307.03172
- Shi et al., "Large Language Models Can Be Easily Distracted by Irrelevant Context" (2023)
- Xu et al., "Retrieval Meets Long Context Large Language Models" (2023) — hybrid approaches outperform either alone
- Anthropic's prompt caching (Claude 3.5) — built-in prefix caching reduces cost for repeated system prompts

---

*Sovereignty and service always.*

scripts/cross-repo-qa.py (new file, 248 lines)
@@ -0,0 +1,248 @@
#!/usr/bin/env python3
"""
cross-repo-qa.py — Foundation-wide QA checks across all repos.

Runs automated checks that would have caught the issues in #691:
- Duplicate PR detection across repos
- Port drift detection in fleet configs
- PR count per repo vs capacity limits
- Health endpoint reachability

Usage:
    python3 scripts/cross-repo-qa.py --report      # Full QA report
    python3 scripts/cross-repo-qa.py --duplicates  # Find duplicate PRs
    python3 scripts/cross-repo-qa.py --capacity    # Check PR capacity
    python3 scripts/cross-repo-qa.py --port-drift  # Check fleet config consistency
    python3 scripts/cross-repo-qa.py --json        # Machine-readable output
"""

import argparse
import json
import os
import re
import sys
import urllib.request
from collections import defaultdict
from datetime import datetime, timezone
from pathlib import Path

GITEA_URL = "https://forge.alexanderwhitestone.com"
GITEA_TOKEN_PATH = Path.home() / ".config" / "gitea" / "token"
ORG = "Timmy_Foundation"

REPOS = [
    "hermes-agent", "timmy-home", "timmy-config", "the-nexus", "fleet-ops",
    "the-playground", "the-beacon", "wolf", "turboquant", "timmy-academy",
    "compounding-intelligence", "the-testament", "second-son-of-timmy",
    "ai-safety-review", "the-echo-pattern", "burn-fleet", "timmy-dispatch",
    "the-door",
]


def load_token() -> str:
    if GITEA_TOKEN_PATH.exists():
        return GITEA_TOKEN_PATH.read_text().strip()
    return os.environ.get("GITEA_TOKEN", "")


def api_get(path: str, token: str) -> list | dict:
    req = urllib.request.Request(
        f"{GITEA_URL}/api/v1{path}",
        headers={"Authorization": f"token {token}"},
    )
    try:
        return json.loads(urllib.request.urlopen(req, timeout=20).read())
    except Exception:
        return []


def extract_issue_refs(text: str) -> set[int]:
    return set(int(m) for m in re.findall(r'#(\d{2,5})', text or ""))


def check_duplicate_prs(token: str) -> dict:
    """Find issues referenced by more than one open PR across all repos."""
    issue_to_prs = defaultdict(list)

    for repo in REPOS:
        prs = api_get(f"/repos/{ORG}/{repo}/pulls?state=open&limit=100", token)
        if not isinstance(prs, list):
            continue
        for pr in prs:
            refs = extract_issue_refs(f"{pr['title']} {pr.get('body', '')}")
            for ref in refs:
                issue_to_prs[ref].append({
                    "repo": repo,
                    "number": pr["number"],
                    "title": pr["title"][:70],
                    "branch": pr.get("head", {}).get("ref", ""),
                })

    return {k: v for k, v in issue_to_prs.items() if len(v) > 1}


def check_pr_capacity(token: str) -> list[dict]:
    """Check open PR counts against per-repo limits."""
    capacity_path = Path(__file__).parent / "pr-capacity.json"
    if capacity_path.exists():
        config = json.loads(capacity_path.read_text())
        limits = {k: v.get("limit", 10) for k, v in config.get("repos", {}).items()}
        default_limit = config.get("default_limit", 10)
    else:
        limits = {}
        default_limit = 10

    results = []
    for repo in REPOS:
        prs = api_get(f"/repos/{ORG}/{repo}/pulls?state=open&limit=100", token)
        count = len(prs) if isinstance(prs, list) else 0
        limit = limits.get(repo, default_limit)
        if count > limit:
            results.append({"repo": repo, "count": count, "limit": limit, "over": count - limit})

    return sorted(results, key=lambda x: -x["over"])


def check_wrong_repo_prs(token: str) -> list[dict]:
    """Find PRs filed in the wrong repo (title mentions a different repo)."""
    wrong = []
    for repo in REPOS:
        prs = api_get(f"/repos/{ORG}/{repo}/pulls?state=open&limit=100", token)
        if not isinstance(prs, list):
            continue
        for pr in prs:
            title = pr["title"].lower()
            # Check whether the title references a different repo.
            for other_repo in REPOS:
                if other_repo == repo:
                    continue
                # Repo name followed by common separators or keywords.
                patterns = [
                    f"{other_repo} ",
                    f"{other_repo}:",
                    f"{other_repo} backlog",
                    f"{other_repo} report",
                    f"{other_repo} triage",
                ]
                if any(p in title for p in patterns):
                    wrong.append({
                        "pr_repo": repo,
                        "pr_number": pr["number"],
                        "pr_title": pr["title"][:70],
                        "should_be_in": other_repo,
                    })
    return wrong


def cmd_report(token: str, as_json: bool = False):
    """Full QA report."""
    report = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "repos_scanned": len(REPOS),
    }

    print("Checking duplicate PRs...", file=sys.stderr)
    dupes = check_duplicate_prs(token)
    report["duplicate_prs"] = {
        "issues_with_duplicates": len(dupes),
        "total_duplicate_prs": sum(len(v) - 1 for v in dupes.values()),
        "details": {str(k): v for k, v in sorted(dupes.items())},
    }

    print("Checking PR capacity...", file=sys.stderr)
    over_capacity = check_pr_capacity(token)
    report["over_capacity"] = over_capacity

    print("Checking wrong-repo PRs...", file=sys.stderr)
    wrong_repo = check_wrong_repo_prs(token)
    report["wrong_repo_prs"] = wrong_repo

    if as_json:
        print(json.dumps(report, indent=2))
        return

    # Human-readable
    print(f"\n{'='*60}")
    print(f"CROSS-REPO QA REPORT — {report['timestamp'][:19]}")
    print(f"{'='*60}")

    print(f"\nDuplicate PRs: {report['duplicate_prs']['issues_with_duplicates']} issues, "
          f"{report['duplicate_prs']['total_duplicate_prs']} duplicates")
    for issue_num, pr_list in sorted(dupes.items(), key=lambda x: -len(x[1]))[:10]:
        print(f"  Issue #{issue_num}: {len(pr_list)} PRs")
        for pr in pr_list:
            print(f"    {pr['repo']}#{pr['number']}: {pr['title'][:60]}")

    print(f"\nOver Capacity: {len(over_capacity)} repos")
    for r in over_capacity:
        print(f"  {r['repo']}: {r['count']}/{r['limit']} ({r['over']} over)")

    if wrong_repo:
        print(f"\nWrong Repo PRs: {len(wrong_repo)}")
        for r in wrong_repo:
            print(f"  {r['pr_repo']}#{r['pr_number']}: should be in {r['should_be_in']}")
            print(f"    {r['pr_title']}")

    # Severity
    p0 = len(over_capacity)
    p1 = report['duplicate_prs']['total_duplicate_prs']
    print(f"\n{'='*60}")
    print(f"Severity: {p0} capacity violations, {p1} duplicate PRs")
    if p0 > 3 or p1 > 10:
        print("Status: NEEDS ATTENTION")
    else:
        print("Status: OK")


def cmd_duplicates(token: str):
    dupes = check_duplicate_prs(token)
    if not dupes:
        print("No duplicate PRs found.")
        return
    print(f"Found {len(dupes)} issues with duplicate PRs:\n")
    for issue_num, pr_list in sorted(dupes.items(), key=lambda x: -len(x[1])):
        print(f"Issue #{issue_num}: {len(pr_list)} PRs")
        for pr in pr_list:
            print(f"  {pr['repo']}#{pr['number']}: {pr['title'][:60]}")


def cmd_capacity(token: str):
    over = check_pr_capacity(token)
    if not over:
        print("All repos within capacity.")
        return
    print(f"{len(over)} repos over capacity:\n")
    for r in over:
        print(f"  {r['repo']}: {r['count']}/{r['limit']} ({r['over']} over)")


def main():
    parser = argparse.ArgumentParser(description="Cross-repo QA automation")
    parser.add_argument("--report", action="store_true")
    parser.add_argument("--duplicates", action="store_true")
    parser.add_argument("--capacity", action="store_true")
    parser.add_argument("--port-drift", action="store_true")
    parser.add_argument("--json", action="store_true", dest="as_json")
    args = parser.parse_args()

    token = load_token()
    if not token:
        print("No Gitea token found", file=sys.stderr)
        sys.exit(1)

    if args.duplicates:
        cmd_duplicates(token)
    elif args.capacity:
        cmd_capacity(token)
    elif args.port_drift:
        print("Port drift check: see fleet-ops registry.yaml comparison")
    else:
        cmd_report(token, args.as_json)


if __name__ == "__main__":
    main()

@@ -17,8 +17,24 @@ from typing import Dict, Any, Optional, List
from pathlib import Path
from dataclasses import dataclass
from enum import Enum
import importlib.util

from harness import UniWizardHarness, House, ExecutionResult


def _load_local(module_name: str, filename: str):
    """Import a module from an explicit file path, bypassing sys.path resolution."""
    spec = importlib.util.spec_from_file_location(
        module_name,
        str(Path(__file__).parent / filename),
    )
    mod = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(mod)
    return mod


_harness = _load_local("v2_harness", "harness.py")
UniWizardHarness = _harness.UniWizardHarness
House = _harness.House
ExecutionResult = _harness.ExecutionResult


class TaskType(Enum):

@@ -8,13 +8,30 @@ import time
import sys
import argparse
import os
import importlib.util
from pathlib import Path
from datetime import datetime
from typing import Dict, List, Optional

sys.path.insert(0, str(Path(__file__).parent))


def _load_local(module_name: str, filename: str):
    """Import a module from an explicit file path, bypassing sys.path resolution.

    Prevents namespace collisions when multiple directories contain modules
    with the same name (e.g. uni-wizard/harness.py vs uni-wizard/v2/harness.py).
    """
    spec = importlib.util.spec_from_file_location(
        module_name,
        str(Path(__file__).parent / filename),
    )
    mod = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(mod)
    return mod


_harness = _load_local("v2_harness", "harness.py")
UniWizardHarness = _harness.UniWizardHarness
House = _harness.House
ExecutionResult = _harness.ExecutionResult

from harness import UniWizardHarness, House, ExecutionResult
from router import HouseRouter, TaskType
from author_whitelist import AuthorWhitelist