[ATLAS][Memory Eval] Run a live Hindsight vs MemPalace bakeoff with raw context-window artifacts #230


Rockachopa · 2026-04-22T14:25:19Z

Rockachopa commented

2026-04-22 14:25:19 +00:00

## Parent

Timmy_Foundation/hermes-agent#985

## Depends on

Timmy_Foundation/hermes-agent#1010

## Sources to Evaluate

- built-in Hermes memory surfaces
- `mempalace`
- `vectorize-io/hindsight`

## What

Run a **live, source-backed bakeoff** between baseline Hermes memory, MemPalace, and local Hindsight.

This is not a README benchmark. This is a runtime bakeoff that captures what actually reaches the prompt and whether it improves answers on the kinds of recall we care about.

The evaluation must include:

- built-in preference recall
- structured fact recall
- architecture/decision recall
- fleet/operational recall
- contradiction/failure-mode framing
- at least one long-horizon or BEAM-style impossible-to-stuff memory scenario

## Tasks

- [ ] Build a prompt matrix covering the recall categories above
- [ ] Run the matrix against baseline / MemPalace / local Hindsight using profile-scoped eval homes
- [ ] Capture raw context-window artifacts for each run
- [ ] Score answer quality, relevance of injected context, and retrieval noise
- [ ] Publish a report with PASS/FAIL verdicts and concrete failure modes
- [ ] Recommend keep / adopt / hybrid / reject for each provider path
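
One way the prompt matrix from the first two tasks might be laid out is sketched below. It is only an illustration: the category identifiers, the `EvalCase` type, and `build_matrix` are hypothetical names, not anything that exists in hermes-agent. The point it shows is that crossing every prompt with every provider path keeps the three paths on identical inputs.

```python
from dataclasses import dataclass
from itertools import product

# Recall categories taken from the issue text; the identifiers are illustrative.
CATEGORIES = [
    "preference_recall",
    "structured_fact_recall",
    "architecture_decision_recall",
    "fleet_operational_recall",
    "contradiction_failure_mode",
    "long_horizon",
]

# The three paths under evaluation (labels are assumptions, not real config keys).
PROVIDERS = ["baseline", "mempalace", "hindsight"]


@dataclass(frozen=True)
class EvalCase:
    category: str
    provider: str
    prompt: str


def build_matrix(prompts: dict[str, list[str]]) -> list[EvalCase]:
    """Cross every prompt with every provider so all paths see the same inputs."""
    # Fail early if a recall category has no prompts at all.
    missing = [c for c in CATEGORIES if not prompts.get(c)]
    if missing:
        raise ValueError(f"no prompts for categories: {missing}")
    return [
        EvalCase(category=cat, provider=prov, prompt=text)
        for cat, prov in product(CATEGORIES, PROVIDERS)
        for text in prompts[cat]
    ]
```

With one prompt per category this yields 6 x 3 = 18 cases; a missing category fails loudly instead of silently shrinking the matrix.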

## Acceptance Criteria

- [ ] All three paths (baseline, MemPalace, Hindsight) are evaluated on the same prompt matrix
- [ ] Raw artifacts are saved for every run: system prompt, provider prefetch, final memory injection, final user payload
- [ ] The report includes explicit failure modes, not just scores
- [ ] At least one long-horizon memory test exceeds what can be solved by naive context stuffing
- [ ] The final report makes a grounded recommendation for the sovereign Mac-local stack
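
The raw-artifact criterion could be enforced at write time, roughly as in the sketch below. The directory layout, file names, and the `save_run_artifacts` helper are assumptions made for illustration; only the four artifact kinds come from the criteria above.

```python
import json
from pathlib import Path

# The four artifacts named in the acceptance criteria; file names are illustrative.
ARTIFACT_KEYS = (
    "system_prompt",
    "provider_prefetch",
    "final_memory_injection",
    "final_user_payload",
)


def save_run_artifacts(root: Path, provider: str, case_id: str, artifacts: dict[str, str]) -> Path:
    """Write one file per artifact, refusing to save an incomplete run."""
    missing = [k for k in ARTIFACT_KEYS if k not in artifacts]
    if missing:
        raise ValueError(f"missing artifacts: {missing}")
    run_dir = root / provider / case_id
    run_dir.mkdir(parents=True, exist_ok=True)
    for key in ARTIFACT_KEYS:
        (run_dir / f"{key}.txt").write_text(artifacts[key], encoding="utf-8")
    # A small manifest makes it easy to audit that every run saved all four files.
    manifest = {
        "provider": provider,
        "case_id": case_id,
        "files": [f"{k}.txt" for k in ARTIFACT_KEYS],
    }
    (run_dir / "manifest.json").write_text(json.dumps(manifest, indent=2), encoding="utf-8")
    return run_dir
```

Rejecting incomplete runs up front means a run with no captured memory injection surfaces as an error rather than as a gap in the report.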

Rockachopa referenced this issue from Timmy_Foundation/hermes-agent

2026-04-22 14:25:20 +00:00

[ATLAS][Memory] Port Hindsight-style multi-path recall (semantic + BM25 + graph + temporal + RRF) #1011

Rockachopa referenced this issue from Timmy_Foundation/hermes-agent

2026-04-22 14:25:20 +00:00

[ATLAS][Memory] Improve durable fact extraction and normalization using Hindsight-inspired patterns #1012

Rockachopa referenced this issue from Timmy_Foundation/hermes-agent

2026-04-22 14:25:21 +00:00

[ATLAS][Memory] Build a lossless context + memory subsystem from hermes-lcm and gbrain #985

Rockachopa commented

2026-04-23 01:37:20 +00:00

🔥 BURN DISPATCH — BURN:LOOM.1

Queued for execution with /queue.

Target branch: fix/230-hindsight-mempalace-bakeoff
Workflow: Gitea-first, dedup PRs, verify before PR.

Rockachopa commented

2026-04-29 03:12:42 +00:00

STEP35 FREE BURN — BLOCKED

Attempted to implement #230 but hit a hard dependency gap.

## Blocker Analysis

**Depends on:** `hermes-agent#1010` (local Hindsight adapter)

- `hermes-agent#1010` state: **open** (no PR merged)
- No local Hindsight integration exists yet
- No MemPalace adapter found in environment

**Environment status:**

- MemPalace Python module: NOT installed (`ModuleNotFoundError`)
- Hindsight Python module: NOT installed (`ModuleNotFoundError`)
- No `mempalace` or `hindsight` tools in hermes-agent toolset
- No vendored Hindsight binary in PATH
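
A probe along the following lines could reproduce the checks listed above before a bakeoff run starts. It is a sketch under assumptions: the import names `mempalace` and `hindsight` and the binary name `hindsight` are guesses based on the tool names in this comment, and `probe_backends` is a hypothetical helper.

```python
import importlib.util
import shutil


def probe_backends(modules=("mempalace", "hindsight"), binaries=("hindsight",)) -> dict[str, bool]:
    """Report which modules are importable and which binaries are on PATH."""
    status = {}
    for name in modules:
        # find_spec returns None for a missing top-level module instead of raising.
        status[f"module:{name}"] = importlib.util.find_spec(name) is not None
    for name in binaries:
        # shutil.which returns None when the executable is not found on PATH.
        status[f"binary:{name}"] = shutil.which(name) is not None
    return status
```

Checking with `find_spec` avoids importing a backend just to learn that it is absent, so the probe has no side effects on the eval environment.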

## Acceptance Criteria Status

| Criterion | Status | Notes |
|-----------|--------|-------|
| All 3 paths (baseline/MemPalace/Hindsight) evaluated | ❌ BLOCKED | Cannot eval missing backends |
| Raw artifacts captured for every run | ❌ BLOCKED | No runtime to capture |
| Report with PASS/FAIL + failure modes | ❌ BLOCKED | No execution data |
| Long-horizon memory test (> context stuffing) | ❌ BLOCKED | Depends on Hindsight |
| Recommendation for sovereign Mac-local stack | ⚠️ DEFERRED | Needs eval results |

## What Would Be Needed

1. **First:** merge `hermes-agent#1010` (local Hindsight adapter)
2. **Second:** verify that the MemPalace integration works in the hermes-agent runtime
3. **Third:** only then can the bakeoff script execute the 3-way comparison

## Recommendation

Close #230 as blocked and reopen it when #1010 merges. The bakeoff work is downstream infrastructure that cannot proceed without its dependency.

**Action taken:** Stopped work per the FREE BURN guardrail and commented the blocker. No implementation, no PR opened.

Rockachopa referenced this issue from a commit

2026-04-29 22:00:19 +00:00

feat: add memory bakeoff runner for baseline vs MemPalace vs Hindsight evaluation

Rockachopa referenced a pull request that will close this issue

2026-04-30 17:28:42 +00:00

[ATLAS][Memory Eval] Add memory bakeoff runner — baseline vs MemPalace vs Hindsight #284

Reference: Timmy_Foundation/compounding-intelligence#230