[ATLAS][Memory Eval] Run a live Hindsight vs MemPalace bakeoff with raw context-window artifacts #230

Open
opened 2026-04-22 14:25:19 +00:00 by Rockachopa · 2 comments
Owner

Parent

Timmy_Foundation/hermes-agent#985

Depends on

Timmy_Foundation/hermes-agent#1010

Sources to Evaluate

  • built-in Hermes memory surfaces
  • mempalace
  • vectorize-io/hindsight

What

Run a live, source-backed bakeoff between baseline Hermes memory, MemPalace, and local Hindsight.

This is not a README benchmark. This is a runtime bakeoff that captures what actually reaches the prompt and whether it improves answers on the kinds of recall we care about.

The evaluation must include:

  • built-in preference recall
  • structured fact recall
  • architecture/decision recall
  • fleet/operational recall
  • contradiction/failure-mode framing
  • at least one long-horizon or BEAM-style impossible-to-stuff memory scenario

Tasks

  • Build a prompt matrix covering the recall categories above
  • Run the matrix against baseline / MemPalace / local Hindsight using profile-scoped eval homes
  • Capture raw context-window artifacts for each run
  • Score answer quality, relevance of injected context, and retrieval noise
  • Publish a report with PASS/FAIL verdicts and concrete failure modes
  • Recommend keep / adopt / hybrid / reject for each provider path
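One way to keep the three paths on an identical prompt matrix is to build the matrix once and then cross it with the provider list. The sketch below is illustrative only: the category identifiers paraphrase the recall list above, the provider labels are hypothetical, and the prompts themselves are left to the evaluator.

```python
from dataclasses import dataclass
from itertools import product

# Category identifiers paraphrase the recall list in this issue (assumed names).
CATEGORIES = [
    "preference_recall",
    "structured_fact_recall",
    "architecture_decision_recall",
    "fleet_operational_recall",
    "contradiction_failure_mode",
    "long_horizon",
]

# Hypothetical labels for the three paths under test.
PROVIDERS = ["baseline", "mempalace", "hindsight"]


@dataclass(frozen=True)
class Case:
    category: str
    prompt: str


@dataclass(frozen=True)
class Run:
    provider: str
    case: Case


def build_matrix(prompts_by_category: dict[str, list[str]]) -> list[Case]:
    """Flatten prompts into cases, refusing a matrix that skips a category."""
    missing = [c for c in CATEGORIES if not prompts_by_category.get(c)]
    if missing:
        raise ValueError(f"no prompts for categories: {missing}")
    return [Case(c, p) for c in CATEGORIES for p in prompts_by_category[c]]


def plan_runs(cases: list[Case]) -> list[Run]:
    """Pair every case with every provider so all paths see the same prompts."""
    return [Run(provider, case) for case, provider in product(cases, PROVIDERS)]
```

With one prompt per category, `plan_runs(build_matrix(...))` yields 18 runs (6 cases across 3 providers). Rejecting an incomplete matrix up front avoids discovering a missing recall category only at report time.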

Acceptance Criteria

  • All three paths (baseline, MemPalace, Hindsight) are evaluated on the same prompt matrix
  • Raw artifacts are saved for every run: system prompt, provider prefetch, final memory injection, final user payload
  • The report includes explicit failure modes, not just scores
  • At least one long-horizon memory test exceeds what can be solved by naive context stuffing
  • The final report makes a grounded recommendation for the sovereign Mac-local stack
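The raw-artifact criterion can be made checkable by writing one record per run and auditing the output directory afterwards. This is a minimal sketch under stated assumptions: the field names mirror the four artifacts listed above, while the `RunArtifacts` type, the `<root>/<provider>/<case_id>.json` layout, and both helper functions are hypothetical and not part of hermes-agent.

```python
import json
from dataclasses import asdict, dataclass
from pathlib import Path

# The four artifacts the acceptance criteria require for every run.
REQUIRED_FIELDS = (
    "system_prompt",
    "provider_prefetch",
    "final_memory_injection",
    "final_user_payload",
)


@dataclass
class RunArtifacts:
    provider: str
    case_id: str
    system_prompt: str
    provider_prefetch: str  # may be empty for the baseline path
    final_memory_injection: str
    final_user_payload: str


def save_artifacts(art: RunArtifacts, root: Path) -> Path:
    """Write one run's raw artifacts to <root>/<provider>/<case_id>.json."""
    out = root / art.provider / f"{art.case_id}.json"
    out.parent.mkdir(parents=True, exist_ok=True)
    out.write_text(
        json.dumps(asdict(art), indent=2, ensure_ascii=False), encoding="utf-8"
    )
    return out


def missing_artifacts(root: Path, providers: list[str], case_ids: list[str]) -> list[str]:
    """List every run whose file is absent or lacks a required field."""
    problems = []
    for provider in providers:
        for case_id in case_ids:
            path = root / provider / f"{case_id}.json"
            if not path.exists():
                problems.append(f"{provider}/{case_id}: file missing")
                continue
            data = json.loads(path.read_text(encoding="utf-8"))
            for name in REQUIRED_FIELDS:
                if name not in data:
                    problems.append(f"{provider}/{case_id}: no {name}")
    return problems
```

The audit checks for the presence of each field rather than non-empty content, since an empty prefetch may itself be a legitimate finding worth recording.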
Author
Owner

🔥 BURN DISPATCH (BURN:LOOM.1)

Queued for execution with /queue.

Target branch: fix/230-hindsight-mempalace-bakeoff
Workflow: Gitea-first, dedup PRs, verify before PR.

Author
Owner

STEP35 FREE BURN — BLOCKED

Attempted to implement #230 but hit a hard dependency gap.

Blocker Analysis

Depends on: hermes-agent#1010 — local Hindsight adapter

  • hermes-agent#1010 state: open (no PR merged)
  • No local Hindsight integration exists yet
  • No MemPalace adapter found in environment

Environment status:

  • MemPalace Python module: NOT installed (ModuleNotFoundError)
  • Hindsight Python module: NOT installed (ModuleNotFoundError)
  • No mempalace or hindsight tools in hermes-agent toolset
  • No vendored Hindsight binary in PATH
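A check like the one summarized above can be reproduced with the standard library alone. The sketch below is an assumption about how such a preflight might look, not the commands that were actually run; the module names `mempalace` and `hindsight` and the binary name `hindsight` are guesses at the real import and executable names.

```python
import importlib.util
import shutil


def preflight(modules: list[str], binaries: list[str]) -> dict[str, bool]:
    """Report which top-level Python modules and PATH binaries are available.

    Only top-level module names are supported here; find_spec can raise
    for dotted names whose parent package is missing.
    """
    status = {}
    for name in modules:
        status[f"module:{name}"] = importlib.util.find_spec(name) is not None
    for name in binaries:
        status[f"binary:{name}"] = shutil.which(name) is not None
    return status


if __name__ == "__main__":
    # Assumed names; adjust to the actual package and executable names.
    for check, ok in preflight(["mempalace", "hindsight"], ["hindsight"]).items():
        print(f"{check}: {'found' if ok else 'NOT found'}")
```

Using `find_spec` instead of a bare `import` avoids executing package code during the check, so the preflight has no side effects on the eval environment.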

Acceptance Criteria Status

| Criterion | Status | Notes |
|-----------|--------|-------|
| All 3 paths (baseline/MemPalace/Hindsight) evaluated | ❌ BLOCKED | Cannot eval missing backends |
| Raw artifacts captured for every run | ❌ BLOCKED | No runtime to capture |
| Report with PASS/FAIL + failure modes | ❌ BLOCKED | No execution data |
| Long-horizon memory test (> context stuffing) | ❌ BLOCKED | Depends on Hindsight |
| Recommendation for sovereign Mac-local stack | ⚠️ DEFERRED | Needs eval results |

What Would Be Needed

  1. First: merge hermes-agent#1010 (local Hindsight adapter)
  2. Second: verify MemPalace integration works in hermes-agent runtime
  3. Third: only then can the bakeoff script run the three-way comparison

Recommendation

Close #230 as blocked and reopen it when #1010 merges. The bakeoff is downstream work that cannot proceed without its dependency.

Action taken: Stopped work per FREE BURN guardrail. Commented blocker. No implementation, no PR opened.

Reference: Timmy_Foundation/compounding-intelligence#230