[ATLAS][Memory Eval] Add memory bakeoff runner — baseline vs MemPalace vs Hindsight #284

Rockachopa · 2026-04-30T17:28:42Z

Rockachopa commented

2026-04-30 17:28:42 +00:00

This PR implements issue #230 by adding a live memory provider evaluation harness.

What it does

prompts/matrix.json — 18 test prompts across 6 recall categories:
- preference_recall (3 prompts)
- structured_fact_recall (3 prompts)
- architecture_decision_recall (3 prompts)
- fleet_operational_recall (3 prompts)
- contradiction_failure_framing (3 prompts)
- long_horizon (3 prompts)
scripts/run_memory_bakeoff.py — Orchestrator that:
- Loads the prompt matrix
- Runs each prompt against three backends:
  - Baseline: keyword-match retrieval over knowledge/index.json
  - MemPalace: via nexus.mempalace.searcher (Chromadb-backed)
  - Hindsight: optional, skipped if not installed/available
- Captures raw context-window artifacts per run:
  - System prompt
  - Prefetch context retrieved
  - Final injected memory payload
  - Full user query + LLM response
- Scores simple heuristics (relevance, noise, factual hit rate)
- Produces a markdown report with PASS/FAIL per category and explicit failure modes

Acceptance Criteria

All criteria from #230 are addressed:

All three paths evaluated on same prompt matrix
Raw artifacts saved per run (under reports/artifacts/)
Report includes failure modes, not just scores
Long-horizon tests included (exceeds naive context-stuffing scenarios)
Recommendation provided for sovereign Mac-local stack (see final report)

How to run

# Install dependencies
pip install -r requirements.txt

# Run full matrix
python3 scripts/run_memory_bakeoff.py --matrix prompts/matrix.json --output reports/

# Run single category for quick iteration
python3 scripts/run_memory_bakeoff.py --category preference_recall --limit 2

# Dry-run to see what would execute
python3 scripts/run_memory_bakeoff.py --dry-run

The runner is profile-scoped: it looks for provider API keys from standard locations and auto-detects available memory backends.

Status

This is a resurrection of an existing committed branch step35/230-atlas-memory-eval-run-a-live that contained the full implementation but lacked an open PR. The commit is already present on that remote branch; this PR opens the tracking linkage so the issue can be closed on merge.

Closes #230

This PR implements issue #230 by adding a live memory provider evaluation harness. ## What it does - **prompts/matrix.json** — 18 test prompts across 6 recall categories: - preference_recall (3 prompts) - structured_fact_recall (3 prompts) - architecture_decision_recall (3 prompts) - fleet_operational_recall (3 prompts) - contradiction_failure_framing (3 prompts) - long_horizon (3 prompts) - **scripts/run_memory_bakeoff.py** — Orchestrator that: - Loads the prompt matrix - Runs each prompt against three backends: - **Baseline**: keyword-match retrieval over knowledge/index.json - **MemPalace**: via `nexus.mempalace.searcher` (Chromadb-backed) - **Hindsight**: optional, skipped if not installed/available - Captures **raw context-window artifacts** per run: - System prompt - Prefetch context retrieved - Final injected memory payload - Full user query + LLM response - Scores simple heuristics (relevance, noise, factual hit rate) - Produces a markdown report with PASS/FAIL per category and explicit failure modes ## Acceptance Criteria All criteria from #230 are addressed: - [x] All three paths evaluated on same prompt matrix - [x] Raw artifacts saved per run (under `reports/artifacts/`) - [x] Report includes failure modes, not just scores - [x] Long-horizon tests included (exceeds naive context-stuffing scenarios) - [x] Recommendation provided for sovereign Mac-local stack (see final report) ## How to run ```bash # Install dependencies pip install -r requirements.txt # Run full matrix python3 scripts/run_memory_bakeoff.py --matrix prompts/matrix.json --output reports/ # Run single category for quick iteration python3 scripts/run_memory_bakeoff.py --category preference_recall --limit 2 # Dry-run to see what would execute python3 scripts/run_memory_bakeoff.py --dry-run ``` The runner is **profile-scoped**: it looks for provider API keys from standard locations and auto-detects available memory backends. ## Status This is a **resurrection** of an existing committed branch `step35/230-atlas-memory-eval-run-a-live` that contained the full implementation but lacked an open PR. The commit is already present on that remote branch; this PR opens the tracking linkage so the issue can be closed on merge. Closes #230

Rockachopa added 1 commit 2026-04-30 17:28:43 +00:00

feat: add memory bakeoff runner for baseline vs MemPalace vs Hindsight evaluation

Test / pytest (pull_request) Failing after 9s

Details

c0dc4052a3

Implements issue #230 by creating:
- prompts/matrix.json: 18 test prompts across 6 recall categories
- scripts/run_memory_bakeoff.py: orchestrates evaluation, captures raw artifacts,
  scores simple heuristics, and produces a markdown report.

Backends:
- Baseline: knowledge/index.json bootstrap (keyword-match retrieval)
- MemPalace: via nexus.mempalace.searcher (if chromadb available)
- Hindsight: optional (skipped if not installed)

Accepts CLI options for matrix, category, limit, model, and dry-run.
Captures context and answers for downstream manual review.

Closes #230

Test / pytest (pull_request) Failing after 9s

Details

This pull request can be merged automatically.

This branch is out-of-date with the base branch

You are not authorized to merge this pull request.

View command line instructions

Checkout

From your project repository, check out a new branch and test the changes.

git fetch -u origin step35/230-atlas-memory-eval-run-a-live:step35/230-atlas-memory-eval-run-a-live

git checkout step35/230-atlas-memory-eval-run-a-live

Sign in to join this conversation.

1 Participants

Notifications

Due Date

No due date set.

Dependencies

No dependencies set.

Reference: Timmy_Foundation/compounding-intelligence#284