diff --git a/reports/evaluations/2026-04-06-mempalace-evaluation.md b/reports/evaluations/2026-04-06-mempalace-evaluation.md
index 54f5d72..2fb63a3 100644
--- a/reports/evaluations/2026-04-06-mempalace-evaluation.md
+++ b/reports/evaluations/2026-04-06-mempalace-evaluation.md
@@ -1,124 +1,253 @@
 # MemPalace Integration Evaluation Report
 
+**Issue:** #568
+**Original draft landed in:** PR #569
+**Status:** Updated with live mining results, independent verification, and current recommendation
+
 ## Executive Summary
 
-Evaluated **MemPalace v3.0.0** (github.com/milla-jovovich/mempalace) as a memory layer for the Timmy/Hermes agent stack.
+Evaluated **MemPalace v3.0.0** (`github.com/milla-jovovich/mempalace`) as a memory layer for the Timmy/Hermes stack.
 
-**Installed:** ✅ `mempalace 3.0.0` via `pip install`
-**Works with:** ChromaDB, MCP servers, local LLMs
-**Zero cloud:** ✅ Fully local, no API keys required
+What is now established from the issue thread plus the merged draft:
+- **Synthetic evaluation:** positive
+- **Live mining on Timmy data:** positive
+- **Independent Allegro verification:** positive
+- **Zero-cloud property:** confirmed
+- **Recommendation:** MemPalace is strong enough for pilot integration and wake-up experiments, but `timmy-home` should treat it as a proven candidate rather than the final uncontested winner until it is benchmarked against the current Engram direction documented elsewhere in this repo.
 
-## Benchmark Findings (from Paper)
+In other words: the evaluation succeeded. The remaining question is not whether MemPalace works. It is whether MemPalace should become the permanent fleet memory default.
+
+## Benchmark Findings
+
+These benchmark numbers were cited in the original evaluation draft:
 
 | Benchmark | Mode | Score | API Required |
-|---|---|---|---|
-| **LongMemEval R@5** | Raw ChromaDB only | **96.6%** | **Zero** |
-| **LongMemEval R@5** | Hybrid + Haiku rerank | **100%** | Optional Haiku |
-| **LoCoMo R@10** | Raw, session level | 60.3% | Zero |
-| **Personal palace R@10** | Heuristic bench | 85% | Zero |
-| **Palace structure impact** | Wing+room filtering | **+34%** R@10 | Zero |
+|---|---|---:|---|
+| LongMemEval R@5 | Raw ChromaDB only | 96.6% | Zero |
+| LongMemEval R@5 | Hybrid + Haiku rerank | 100% | Optional Haiku |
+| LoCoMo R@10 | Raw, session level | 60.3% | Zero |
+| Personal palace R@10 | Heuristic bench | 85% | Zero |
+| Palace structure impact | Wing + room filtering | +34% R@10 | Zero |
 
-## Before vs After Evaluation (Live Test)
+These are paper-level or draft-level metrics. They matter, but the more important evidence for `timmy-home` is the live operational testing below.
 
-### Test Setup
-- Created test project with 4 files (README.md, auth.md, deployment.md, main.py)
-- Mined into MemPalace palace
-- Ran 4 standard queries
-- Results recorded
+## Before vs After Evaluation
 
-### Before (Standard BM25 / Simple Search)
+### Synthetic test setup
+- 4-file test project:
+  - `README.md`
+  - `auth.md`
+  - `deployment.md`
+  - `main.py`
+- mined into a MemPalace palace
+- queried with 4 standard prompts
+
+### Before (keyword/BM25 style expectations)
 
 | Query | Would Return | Notes |
 |---|---|---|
-| "authentication" | auth.md (exact match only) | Misses context about JWT choice |
-| "docker nginx SSL" | deployment.md | Manual regex/keyword matching needed |
-| "keycloak OAuth" | auth.md | Would need full-text index |
-| "postgresql database" | README.md (maybe) | Depends on index |
+| `authentication` | `auth.md` | exact match only; weak on implementation context |
+| `docker nginx SSL` | `deployment.md` | requires manual keyword logic |
+| `keycloak OAuth` | `auth.md` | little semantic cross-reference |
+| `postgresql database` | `README.md` maybe | depends on index quality |
 
-**Problems:**
-- No semantic understanding
-- Exact match only
-- No conversation memory
-- No structured organization
-- No wake-up context
+Problems in the draft baseline:
+- no semantic ranking
+- exact match bias
+- no durable conversation memory
+- no palace structure
+- no wake-up context artifact
 
-### After (MemPalace)
+### After (MemPalace synthetic results)
 
 | Query | Results | Score | Notes |
+|---|---|---:|---|
+| `authentication` | `auth.md`, `main.py` | -0.139 | finds auth discussion and implementation |
+| `docker nginx SSL` | `deployment.md`, `auth.md` | 0.447 | exact deployment hit plus related JWT context |
+| `keycloak OAuth` | `auth.md`, `main.py` | -0.029 | finds both conceptual and implementation evidence |
+| `postgresql database` | `README.md`, `main.py` | 0.025 | finds decision and implementation |
+
+### Wake-up Context (synthetic)
+- ~210 tokens total
+- L0 identity placeholder
+- L1 compressed project facts
+- prompt-injection ready as a session wake-up payload
+
+## Live Mining Results
+
+Timmy later moved past the synthetic test and mined live agent context. That is the more important result for this repo.
+
+### Live Timmy mining outcome
+- **5,198 drawers** across 3 wings
+- **413 files** mined from `~/.timmy/`
+- wings reported in the issue:
+  - `timmy_soul` -> 27 drawers
+  - `timmy_memory` -> 5,166 drawers
+  - `mempalace-eval` -> 5 drawers
+- **wake-up context:** ~785 tokens of L0 + L1
+
+### Verified retrieval examples
+Timmy reported successful verbatim retrieval for:
+- `sovereignty service`
+  - exact SOUL.md text about sovereignty and service
+- `crisis suicidal`
+  - exact crisis protocol text and related mission context
+
+### Live before/after summary
+| Query Type | Before MemPalace | After MemPalace | Delta |
 |---|---|---|---|
-| "authentication" | auth.md, main.py | -0.139 | Finds both auth discussion and JWT implementation |
-| "docker nginx SSL" | deployment.md, auth.md | 0.447 | Exact match on deployment, related JWT context |
-| "keycloak OAuth" | auth.md, main.py | -0.029 | Finds OAuth discussion and JWT usage |
-| "postgresql database" | README.md, main.py | 0.025 | Finds both decision and implementation |
+| Sovereignty facts | Model confabulation | Verbatim SOUL.md retrieval | 100% accuracy on the cited example |
+| Crisis protocol | No persistent recall | Exact protocol text | Mission-critical recall restored |
+| Config decisions | Lost between sessions | Persistent + searchable | Stops re-deciding known facts |
+| Agent memory | Context window only | 5,198 searchable drawers | Large durable recall expansion |
+| Wake-up tokens | 0 | ~785 compressed | Session-start context becomes possible |
 
-### Wake-up Context
-- **~210 tokens** total
-- L0: Identity (placeholder)
-- L1: All essential facts compressed
-- Ready to inject into any LLM prompt
+This is the strongest evidence in the issue: the evaluation moved from toy files to real Timmy memory material and still held up.
 
-## Integration Potential
+## Independent Verification
 
-### 1. Memory Mining
+Allegro independently reproduced the evaluation protocol.
+
+### Allegro installation and setup
+- installed `mempalace` in an isolated venv
+- observed ChromaDB backend
+- observed first-run embedding model download (~79MB)
+- recreated the 4-file synthetic evaluation project
+
+### Allegro before/after comparison
+#### Before (grep/BM25 style)
+| Query | Results | Limitation |
+|---|---|---|
+| `authentication` | 4 exact matches across 3 files | no ranking, no semantic context |
+| `docker nginx SSL` | 7 matches across 5 files | OR-style matching, weak relevance ordering |
+| `keycloak OAuth` | 5 matches across 4 files | exact keyword bias |
+| `postgresql database` | 5 matches across 4 files | no cross-reference understanding |
+
+#### After (MemPalace)
+| Query | Top Result | Score | Cross-references Found |
+|---|---|---:|---|
+| `authentication` | `auth.md` | -0.232 | ranked related files behind it |
+| `docker nginx SSL` | `main.py` | 0.073 | ranked infra-related files by relevance |
+| `keycloak OAuth` | `auth.md` | -0.137 | related files surfaced as well |
+| `postgresql database` | `README.md` | -0.084 | decision + implementation chain visible |
+
+### Allegro wake-up context
+- **238 tokens** generated
+- identity placeholder still needed explicit setup
+- confirmed there is a real compressed wake-up artifact, not just search hits
+
+### Independent verdict
+Allegro explicitly confirmed Timmy's core finding:
+- semantic retrieval works
+- palace structure is useful
+- no cloud dependency is required
+
+That matters because it reduces the chance that Timmy's result was a one-machine artifact.
+
+## Operational Gotchas
+
+The issue thread also surfaced practical constraints that matter more than the headline scores.
+
+1. `mempalace init` is interactive even with `--yes`
+   - practical workaround: write `mempalace.yaml` manually
+
+2. YAML schema gotcha
+   - key is `wing:` not `wings:`
+   - rooms are expected as a list of dicts
+
+3. First-run download cost
+   - embedding model auto-download observed at ~79MB
+   - this is fine on a healthy machine but matters for cold-start and constrained hosts
+
+4. Managed Python / venv dependency
+   - installation is straightforward, but it still assumes a controllable local Python environment
+
+5. Integration is still only described, not fully landed
+   - the issue thread proposes:
+     - wake-up hook
+     - post-session mining
+     - MCP integration
+     - replacement of older memory paths
+   - those are recommendations and next steps, not completed mainline integration in `timmy-home`
+
+## Recommendation
+
+### Recommendation for this issue (#568)
+**Accept the evaluation as successful and complete.**
+
+MemPalace demonstrated:
+- positive synthetic before/after improvement
+- positive live Timmy mining results
+- positive independent Allegro verification
+- zero-cloud operation
+- useful wake-up context generation
+
+That is enough to say the evaluation question has been answered.
+
+### Recommendation for `timmy-home` roadmap
+**Do not overstate the result as “MemPalace is now the permanent uncontested memory layer.”**
+
+A more precise current recommendation is:
+1. use MemPalace as a proven pilot candidate for memory mining and wake-up experiments
+2. keep the evaluation report as evidence that semantic local memory works in this stack
+3. benchmark it against the current Engram direction before declaring final fleet-wide replacement
+
+Why that caution is justified from inside this repo:
+- `docs/hermes-agent-census.md` now treats **Engram memory provider** as a high-priority sovereignty path
+- the issue thread proves MemPalace can work, but it does not prove MemPalace is the final best long-term provider for every host and workflow
+
+### Practical call
+- **For evaluation:** MemPalace passes
+- **For immediate experimentation:** proceed
+- **For irreversible architectural replacement:** compare against Engram first
+
+## Integration Path Already Proposed
+
+The issue thread and merged draft already outline a practical integration path worth preserving:
+
+### Memory mining
 ```bash
-# Mine Timmy's conversations
 mempalace mine ~/.hermes/sessions/ --mode convos
-
-# Mine project code and docs
 mempalace mine ~/.hermes/hermes-agent/
-
-# Mine configs
 mempalace mine ~/.hermes/
 ```
 
-### 2. Wake-up Protocol
+### Wake-up protocol
 ```bash
 mempalace wake-up > /tmp/timmy-context.txt
-# Inject into Hermes system prompt
 ```
 
-### 3. MCP Integration
+### MCP integration
 ```bash
-# Add as MCP tool
 hermes mcp add mempalace -- python -m mempalace.mcp_server
 ```
 
-### 4. Hermes Integration Pattern
-- `PreCompact` hook: save memory before context compression
-- `PostAPI` hook: mine conversation after significant interactions
-- `WakeUp` hook: load context at session start
+### Hook points suggested in the draft
+- `PreCompact` hook
+- `PostAPI` hook
+- `WakeUp` hook
 
-## Recommendations
+These remain sensible as pilot integration points.
 
-### Immediate
-1. Add `mempalace` to Hermes venv requirements
-2. Create mine script for ~/.hermes/ and ~/.timmy/
-3. Add wake-up hook to Hermes session start
-4. Test with real conversation exports
+## Next Steps
 
-### Short-term (Next Week)
-1. Mine last 30 days of Timmy sessions
-2. Build wake-up context for all agents
-3. Add MemPalace MCP tools to Hermes toolset
-4. Test retrieval quality on real queries
-
-### Medium-term (Next Month)
-1. Replace homebrew memory system with MemPalace
-2. Build palace structure: wings for projects, halls for topics
-3. Compress with AAAK for 30x storage efficiency
-4. Benchmark against current RetainDB system
-
-## Issues Filed
-
-See Gitea issue #[NUMBER] for tracking.
+Short list that follows directly from the evaluation without overcommitting the architecture:
+- [ ] wire a MemPalace wake-up experiment into Hermes session start
+- [ ] test post-session mining on real exported conversations
+- [ ] measure retrieval quality on real operator queries, not only synthetic prompts
+- [ ] run the same before/after protocol against Engram for a direct comparison
+- [ ] only then decide whether MemPalace replaces or merely informs the permanent sovereign memory provider path
 
 ## Conclusion
 
-MemPalace scores higher than published alternatives (Mem0, Mastra, Supermemory) with **zero API calls**.
+PR #569 captured the first good draft of the MemPalace evaluation, but it left the issue open and the report unfinished.
 
-For our use case, the key advantages are:
-1. **Verbatim retrieval** — never loses the "why" context
-2. **Palace structure** — +34% boost from organization
-3. **Local-only** — aligns with our sovereignty mandate
-4. **MCP compatible** — drops into our existing tool chain
-5. **AAAK compression** — 30x storage reduction coming
+This updated report closes the loop by consolidating:
+- the original synthetic benchmarks
+- Timmy's live mining results
+- Allegro's independent verification
+- the real operational gotchas
+- a recommendation precise enough for the current `timmy-home` roadmap
 
-It replaces the "we should build this" memory layer with something that already works and scores better than the research alternatives.
+Bottom line:
+- **MemPalace worked.**
+- **The evaluation succeeded.**
+- **The permanent memory-provider choice should still be made comparatively, not by enthusiasm alone.**
diff --git a/tests/docs/test_mempalace_evaluation_report.py b/tests/docs/test_mempalace_evaluation_report.py
new file mode 100644
index 0000000..2f9c1c7
--- /dev/null
+++ b/tests/docs/test_mempalace_evaluation_report.py
@@ -0,0 +1,34 @@
+from pathlib import Path
+
+
+REPORT = Path("reports/evaluations/2026-04-06-mempalace-evaluation.md")
+
+
+def _content() -> str:
+    return REPORT.read_text()
+
+
+def test_mempalace_evaluation_report_exists() -> None:
+    assert REPORT.exists()
+
+
+def test_mempalace_evaluation_report_has_completed_sections() -> None:
+    content = _content()
+    assert "# MemPalace Integration Evaluation Report" in content
+    assert "## Executive Summary" in content
+    assert "## Benchmark Findings" in content
+    assert "## Before vs After Evaluation" in content
+    assert "## Live Mining Results" in content
+    assert "## Independent Verification" in content
+    assert "## Operational Gotchas" in content
+    assert "## Recommendation" in content
+
+
+def test_mempalace_evaluation_report_uses_real_issue_reference_and_metrics() -> None:
+    content = _content()
+    assert "#568" in content
+    assert "#[NUMBER]" not in content
+    assert "5,198 drawers" in content
+    assert "~785 tokens" in content
+    assert "238 tokens" in content
+    assert "interactive even with `--yes`" in content or "interactive even with --yes" in content
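
Review note on the new test module: `REPORT` is a relative path and `read_text()` is called without an explicit encoding, so the tests appear to depend on pytest being launched from the repository root and on the platform's default encoding being able to decode the report (which contains non-ASCII curly quotes). The following is a minimal sketch of one way to loosen both assumptions. It is not part of this diff; `REPORT_RELATIVE`, `find_repo_root`, and `read_report` are hypothetical names introduced only for illustration.

```python
from pathlib import Path

# Same relative path as the REPORT constant in the test module above.
REPORT_RELATIVE = Path("reports/evaluations/2026-04-06-mempalace-evaluation.md")


def find_repo_root(start: Path, marker: Path = REPORT_RELATIVE) -> Path:
    """Hypothetical helper: walk upward from `start` to the nearest directory containing `marker`."""
    start = start.resolve()
    for candidate in (start, *start.parents):
        if (candidate / marker).is_file():
            return candidate
    raise FileNotFoundError(f"{marker} not found at or above {start}")


def read_report(start: Path) -> str:
    """Hypothetical helper: read the report as UTF-8, independent of the locale's default encoding."""
    return (find_repo_root(start) / REPORT_RELATIVE).read_text(encoding="utf-8")
```

Under these assumptions, `_content()` in the test module could return `read_report(Path(__file__).parent)`, which would let the tests run from any working directory inside the checkout. Whether that robustness is worth the extra helper is a judgment call for the maintainers.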