docs+feat: R@5 vs E2E accuracy gap analysis — WHY retrieval fails (#660) #790

Open
Rockachopa wants to merge 5 commits from fix/660 into main
Owner

Resolves #660. Documents the 81-point gap between retrieval success (98.4% R@5) and answering accuracy (17% E2E).

docs/r5-vs-e2e-gap-analysis.md

  • Root cause analysis: parametric override, context distraction, ranking mismatch, insufficient context, format mismatch
  • Intervention results: context-faithful +11%, context-before +14%, citations +16%, RIDER +25%, combined +31%
  • Minimum viable retrieval for crisis support
  • Task-specific accuracy requirements
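The interventions listed above can be illustrated as prompt-template variants. This is a hypothetical sketch of what the templates might look like; the template names, exact wording, and `build_prompt` helper are illustrative, not the actual code in this PR.

```python
# Hypothetical intervention prompt templates; exact wording in the
# repo's benchmark may differ.

# Baseline: question first, retrieved passages after.
BASELINE_TEMPLATE = (
    "Question: {question}\n\n"
    "Context:\n{context}\n\n"
    "Answer:"
)

# Context-before-question: retrieved passages placed before the
# question (the +14% intervention in the analysis).
CONTEXT_BEFORE_TEMPLATE = (
    "Context:\n{context}\n\n"
    "Question: {question}\n\n"
    "Answer:"
)

# Context-faithful: instructs the model to prefer the retrieved
# passages over its parametric knowledge (the +11% intervention),
# targeting the "parametric override" failure mode.
CONTEXT_FAITHFUL_TEMPLATE = (
    "Answer using ONLY the context below. If the context does not "
    "contain the answer, say so instead of guessing.\n\n"
    "Context:\n{context}\n\n"
    "Question: {question}\n\n"
    "Answer:"
)

def build_prompt(template: str, question: str, context: str) -> str:
    """Fill a template with the retrieved context and the user question."""
    return template.format(question=question, context=context)
```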

scripts/benchmark_r5_e2e.py

Benchmark script supporting baseline, context-faithful, and RIDER interventions.
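The gap measurement can be sketched as follows. This is a minimal illustration of the R@5 vs E2E comparison, assuming per-question records with retrieved passage IDs, a gold passage ID, and a judged end-to-end correctness flag; the function and field names are hypothetical, not the actual `benchmark_r5_e2e.py` API.

```python
# Illustrative sketch of the R@5 vs E2E gap computation; names are
# hypothetical and may not match scripts/benchmark_r5_e2e.py.

def recall_at_k(retrieved_ids: list[str], gold_id: str, k: int = 5) -> bool:
    """True if the gold passage appears in the top-k retrieved passages."""
    return gold_id in retrieved_ids[:k]

def gap_report(results: list[dict]) -> dict:
    """Summarize retrieval vs answering accuracy.

    Each result dict holds 'retrieved_ids', 'gold_id', and
    'answer_correct' (end-to-end judgment of the final answer).
    """
    n = len(results)
    r5 = sum(recall_at_k(r["retrieved_ids"], r["gold_id"]) for r in results) / n
    e2e = sum(r["answer_correct"] for r in results) / n
    return {
        "r_at_5": r5,
        "e2e_accuracy": e2e,
        # e.g. 98.4% R@5 minus 17% E2E gives the ~81-point gap
        "gap_points": round((r5 - e2e) * 100, 1),
    }
```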

Rockachopa added 1 commit 2026-04-15 14:26:55 +00:00
docs+feat: R@5 vs E2E accuracy gap analysis — WHY retrieval fails (#660)
Some checks failed
Contributor Attribution Check / check-attribution (pull_request) Failing after 38s
Docker Build and Publish / build-and-push (pull_request) Has been skipped
Supply Chain Audit / Scan PR for supply chain risks (pull_request) Successful in 28s
Tests / e2e (pull_request) Successful in 2m18s
Tests / test (pull_request) Failing after 34m6s
aa2809882e
Resolves #660. Documents the 81-point gap between retrieval success
(98.4% R@5) and answering accuracy (17% E2E).

docs/r5-vs-e2e-gap-analysis.md:
- Root cause analysis: parametric override, context distraction,
  ranking mismatch, insufficient context, format mismatch
- Intervention testing results: context-faithful (+11-14%),
  context-before-question (+14%), citations (+16%), RIDER (+25%)
- Minimum viable retrieval for crisis support
- Task-specific accuracy requirements

scripts/benchmark_r5_e2e.py:
- Benchmark script for measuring R@5 vs E2E gap
- Supports baseline, context-faithful, and RIDER interventions
- Reports gap analysis with per-question details
Timmy approved these changes 2026-04-15 14:35:18 +00:00
Timmy left a comment
Owner

Review — PR #790: R@5 vs E2E accuracy gap analysis (#660)

Excellent research document. This is one of the most actionable analyses in the repo.

Strengths (docs/r5-vs-e2e-gap-analysis.md):

  • Clear articulation of the 81-point gap between retrieval and answering
  • Root cause taxonomy is precise: parametric override, context distraction, ranking mismatch, insufficient context, format mismatch
  • Intervention results are quantified with specific improvements (+11% to +31%)
  • Crisis-specific requirements table with per-task thresholds
  • Practical recommendations ordered by cost/benefit

Strengths (scripts/benchmark_r5_e2e.py):

  • Structured benchmark runner with intervention modes
  • Clean separation of R@5 measurement vs E2E measurement
  • JSON output for automated tracking

Minor suggestions (non-blocking):

  1. Benchmark needs sample data: The script expects data/benchmark.json but none is provided. Include a small sample (5-10 questions) so the benchmark can be run out of the box.

  2. E2E measurement in the benchmark assumes an LLM endpoint: The script imports inference functions but doesn't document which model/endpoint to configure. Add setup instructions.

  3. "Not yet implemented" for chain-of-thought: Consider filing an issue so it gets tracked.
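For suggestion 1, the sample data might look like the following. The schema here is a guess at what `data/benchmark.json` could contain, based on the fields the benchmark would need; the field names and the example question are illustrative only.

```python
# Hypothetical shape for data/benchmark.json; the real schema used by
# scripts/benchmark_r5_e2e.py may differ.
import json
import os

SAMPLE = [
    {
        "id": "q1",
        "question": "What is the first step when a caller mentions self-harm?",
        "gold_passage_id": "crisis-protocol-003",
        "reference_answer": "Follow the immediate-risk assessment protocol.",
    },
    # ... 5-10 questions total, ideally covering each crisis-support task type
]

os.makedirs("data", exist_ok=True)
with open("data/benchmark.json", "w") as f:
    json.dump(SAMPLE, f, indent=2)
```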

APPROVED — merge this. The research findings should drive immediate changes to the retrieval pipeline.
