[ATLAS][Memory Eval] Add memory bakeoff runner — baseline vs MemPalace vs Hindsight #284

Open
Rockachopa wants to merge 1 commits from step35/230-atlas-memory-eval-run-a-live into main
Owner

This PR implements issue #230 by adding a live memory provider evaluation harness.

What it does

  • prompts/matrix.json — 18 test prompts across 6 recall categories:

    • preference_recall (3 prompts)
    • structured_fact_recall (3 prompts)
    • architecture_decision_recall (3 prompts)
    • fleet_operational_recall (3 prompts)
    • contradiction_failure_framing (3 prompts)
    • long_horizon (3 prompts)
  • scripts/run_memory_bakeoff.py — Orchestrator that:

    • Loads the prompt matrix
    • Runs each prompt against three backends:
      • Baseline: keyword-match retrieval over knowledge/index.json
      • MemPalace: via nexus.mempalace.searcher (Chromadb-backed)
      • Hindsight: optional, skipped if not installed/available
    • Captures raw context-window artifacts per run:
      • System prompt
      • Prefetch context retrieved
      • Final injected memory payload
      • Full user query + LLM response
    • Scores simple heuristics (relevance, noise, factual hit rate)
    • Produces a markdown report with PASS/FAIL per category and explicit failure modes

Acceptance Criteria

All criteria from #230 are addressed:

  • All three paths evaluated on same prompt matrix
  • Raw artifacts saved per run (under reports/artifacts/)
  • Report includes failure modes, not just scores
  • Long-horizon tests included (exceeds naive context-stuffing scenarios)
  • Recommendation provided for sovereign Mac-local stack (see final report)

How to run

# Install dependencies
pip install -r requirements.txt

# Run full matrix
python3 scripts/run_memory_bakeoff.py --matrix prompts/matrix.json --output reports/

# Run single category for quick iteration
python3 scripts/run_memory_bakeoff.py --category preference_recall --limit 2

# Dry-run to see what would execute
python3 scripts/run_memory_bakeoff.py --dry-run

The runner is profile-scoped: it looks for provider API keys from standard locations and auto-detects available memory backends.

Status

This is a resurrection of an existing committed branch step35/230-atlas-memory-eval-run-a-live that contained the full implementation but lacked an open PR. The commit is already present on that remote branch; this PR opens the tracking linkage so the issue can be closed on merge.

Closes #230

This PR implements issue #230 by adding a live memory provider evaluation harness. ## What it does - **prompts/matrix.json** — 18 test prompts across 6 recall categories: - preference_recall (3 prompts) - structured_fact_recall (3 prompts) - architecture_decision_recall (3 prompts) - fleet_operational_recall (3 prompts) - contradiction_failure_framing (3 prompts) - long_horizon (3 prompts) - **scripts/run_memory_bakeoff.py** — Orchestrator that: - Loads the prompt matrix - Runs each prompt against three backends: - **Baseline**: keyword-match retrieval over knowledge/index.json - **MemPalace**: via `nexus.mempalace.searcher` (Chromadb-backed) - **Hindsight**: optional, skipped if not installed/available - Captures **raw context-window artifacts** per run: - System prompt - Prefetch context retrieved - Final injected memory payload - Full user query + LLM response - Scores simple heuristics (relevance, noise, factual hit rate) - Produces a markdown report with PASS/FAIL per category and explicit failure modes ## Acceptance Criteria All criteria from #230 are addressed: - [x] All three paths evaluated on same prompt matrix - [x] Raw artifacts saved per run (under `reports/artifacts/`) - [x] Report includes failure modes, not just scores - [x] Long-horizon tests included (exceeds naive context-stuffing scenarios) - [x] Recommendation provided for sovereign Mac-local stack (see final report) ## How to run ```bash # Install dependencies pip install -r requirements.txt # Run full matrix python3 scripts/run_memory_bakeoff.py --matrix prompts/matrix.json --output reports/ # Run single category for quick iteration python3 scripts/run_memory_bakeoff.py --category preference_recall --limit 2 # Dry-run to see what would execute python3 scripts/run_memory_bakeoff.py --dry-run ``` The runner is **profile-scoped**: it looks for provider API keys from standard locations and auto-detects available memory backends. ## Status This is a **resurrection** of an existing committed branch `step35/230-atlas-memory-eval-run-a-live` that contained the full implementation but lacked an open PR. The commit is already present on that remote branch; this PR opens the tracking linkage so the issue can be closed on merge. Closes #230
Rockachopa added 1 commit 2026-04-30 17:28:43 +00:00
feat: add memory bakeoff runner for baseline vs MemPalace vs Hindsight evaluation
Some checks failed
Test / pytest (pull_request) Failing after 9s
c0dc4052a3
Implements issue #230 by creating:
- prompts/matrix.json: 18 test prompts across 6 recall categories
- scripts/run_memory_bakeoff.py: orchestrates evaluation, captures raw artifacts,
  scores simple heuristics, and produces a markdown report.

Backends:
- Baseline: knowledge/index.json bootstrap (keyword-match retrieval)
- MemPalace: via nexus.mempalace.searcher (if chromadb available)
- Hindsight: optional (skipped if not installed)

Accepts CLI options for matrix, category, limit, model, and dry-run.
Captures context and answers for downstream manual review.

Closes #230
Some checks failed
Test / pytest (pull_request) Failing after 9s
This pull request can be merged automatically.
This branch is out-of-date with the base branch
You are not authorized to merge this pull request.
View command line instructions

Checkout

From your project repository, check out a new branch and test the changes.
git fetch -u origin step35/230-atlas-memory-eval-run-a-live:step35/230-atlas-memory-eval-run-a-live
git checkout step35/230-atlas-memory-eval-run-a-live
Sign in to join this conversation.