Implements issue #230 by creating:
- prompts/matrix.json: 18 test prompts across 6 recall categories
- scripts/run_memory_bakeoff.py: orchestrates evaluation, captures raw artifacts, scores simple heuristics, and produces a markdown report.

Backends:
- Baseline: knowledge/index.json bootstrap (keyword-match retrieval)
- MemPalace: via nexus.mempalace.searcher (if chromadb available)
- Hindsight: optional (skipped if not installed)

Accepts CLI options for matrix, category, limit, model, and dry-run.
Captures context and answers for downstream manual review.

Closes #230
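For reviewers unfamiliar with the baseline backend, keyword-match retrieval can be sketched roughly as below. This is an illustration only, not the code in scripts/run_memory_bakeoff.py: the entry layout (`id` and `text` fields), the function names, and the token-overlap scoring are all assumptions, and the real knowledge/index.json schema may differ.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into a set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def keyword_retrieve(query, entries, top_k=3):
    """Rank entries by the number of tokens shared with the query.

    Entries with no overlapping token are dropped. `entries` is a list of
    dicts with a "text" field (hypothetical layout, assumed for this sketch).
    """
    query_tokens = tokenize(query)
    scored = []
    for entry in entries:
        overlap = len(query_tokens & tokenize(entry["text"]))
        if overlap:
            scored.append((overlap, entry))
    # Highest overlap first; the sort is stable, so ties keep input order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [entry for _, entry in scored[:top_k]]


# Placeholder entries invented for illustration, not real index content.
entries = [
    {"id": "a", "text": "notes on deploy-crons.py and model fallback"},
    {"id": "b", "text": "notes on Three.js rendering"},
]
hits = keyword_retrieve("What does deploy-crons.py do with model fallback?", entries)
print([hit["id"] for hit in hits])
```

A baseline this simple gives the other backends a floor to beat: it has no notion of synonyms or time, which is what the long-horizon prompts are meant to expose.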
54 lines
1.9 KiB
JSON
{
  "version": "0.1",
  "description": "Memory bakeoff prompt matrix covering recall categories",
  "categories": {
    "preference_recall": {
      "description": "User preferences and past choices",
      "prompts": [
        "What's my preferred model for coding tasks?",
        "Which repository do I work on most frequently?",
        "What's my stance on cloud vs local-first?"
      ]
    },
    "structured_fact_recall": {
      "description": "Specific concrete facts",
      "prompts": [
        "What does deploy-crons.py do with model fallback?",
        "How do I set up a VPS agent?",
        "What token path does the Gitea API use?"
      ]
    },
    "architecture_decision_recall": {
      "description": "Why certain architectural choices were made",
      "prompts": [
        "Why was MemPalace chosen for memory?",
        "What's the reasoning behind session compaction strategy?",
        "Why use Three.js for the Nexus?"
      ]
    },
    "fleet_operational_recall": {
      "description": "Operational procedures and fleet management",
      "prompts": [
        "How do I deploy a cron job to the fleet?",
        "What's the procedure for merging a PR?",
        "How do I rotate secrets across the fleet?"
      ]
    },
    "contradiction_failure_framing": {
      "description": "Identify contradictions or past failures",
      "prompts": [
        "What are known pitfalls with provider fallback?",
        "When did session state get lost and why?",
        "What broke when we upgraded to Python 3.14?"
      ]
    },
    "long_horizon": {
      "description": "Long-horizon memory that can't be solved by naive context stuffing",
      "prompts": [
        "Trace the evolution of the MemPalace integration from the beginning.",
        "Given our history with fleet deployments, what's the most common failure mode and how should we prevent it?",
        "How did the decision to use local-first architecture develop over time?"
      ]
    }
  }
}
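A runner could consume this matrix along the following lines. This is a minimal sketch under stated assumptions: `select_prompts` and its `category` and `limit` parameters are hypothetical names that mirror the CLI options mentioned in the commit message, not the actual interface of scripts/run_memory_bakeoff.py. The inline sample is a trimmed copy of two categories from the file above.

```python
import json


def select_prompts(matrix, category=None, limit=None):
    """Flatten the matrix into (category, prompt) pairs.

    `category` restricts the output to one category; `limit` caps the total
    number of pairs returned. Both are optional (hypothetical parameters).
    """
    categories = matrix["categories"]
    if category is not None:
        if category not in categories:
            raise KeyError(f"unknown category: {category}")
        categories = {category: categories[category]}
    pairs = [
        (name, prompt)
        for name, body in categories.items()
        for prompt in body["prompts"]
    ]
    return pairs if limit is None else pairs[:limit]


# Trimmed stand-in for prompts/matrix.json so the sketch runs on its own.
SAMPLE = json.loads("""
{
  "version": "0.1",
  "categories": {
    "preference_recall": {
      "description": "User preferences and past choices",
      "prompts": [
        "What's my preferred model for coding tasks?",
        "Which repository do I work on most frequently?"
      ]
    },
    "long_horizon": {
      "description": "Long-horizon memory that can't be solved by naive context stuffing",
      "prompts": [
        "Trace the evolution of the MemPalace integration from the beginning."
      ]
    }
  }
}
""")

for name, prompt in select_prompts(SAMPLE, category="preference_recall", limit=1):
    print(f"[{name}] {prompt}")
```

Loading the full file with `json.load` and calling `select_prompts` with no filters would yield all 18 pairs (6 categories of 3 prompts each), in the order the categories appear in the file.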