compounding-intelligence/docs/swarm-memory-design.md
step35 759abffd00
docs(swarm): add design note for swarm-memory architecture
Creates docs/swarm-memory-design.md — a comprehensive design note
that:
- Distinguishes session memory (private, ephemeral) from swarm memory
  (shared, coordinated)
- Evaluates two candidate designs: append-only event log + synthesis
  vs shared board + evidence links with CAS
- Documents trade-offs, failure modes, and proposed experimental
  prototype using 3 concurrent subagents
- Sets acceptance baseline for issue #232

This is the smallest concrete step toward solving the swarm-memory gap:
a shared frame of reference that concurrent subagents can use to
coordinate without corrupting each other.

Closes #232
2026-04-26 12:37:10 -04:00

Swarm Memory Architecture — Design Note

Issue: #232 — [ATLAS][Research] Solve the swarm-memory gap for concurrent subagents
Repo: Timmy_Foundation/compounding-intelligence
Status: Research — Design Draft
Author: step35 (burn)
Date: 2026-04-26


1. Problem Statement

The compounding-intelligence pipelines assume a session-bounded memory model: each agent session starts with injected bootstrap context, runs, produces a transcript, then ends. Knowledge is harvested after the session and injected before the next.

But concurrent subagents (multiple agents working on parallel tasks at the same time) break this model:

  • No shared scratch space: Each subagent operates in isolation; discoveries in sibling sessions aren't visible until the next harvest cycle.
  • Race conditions on promotion: Two subagents may discover the same fact; both write it, causing duplication or conflicts.
  • Lost correlation: Without a shared event log, you cannot reconstruct what happened across the swarm.
  • Stale shared state: If a fact is promoted to global memory while subagents are still running, they may act on outdated assumptions.

Core question: What memory semantics should exist across concurrent subagents so they can cooperate without corrupting each other or losing important results?


2. Session Memory vs Swarm Memory

Session Memory (Current)

| Property | Description |
| --- | --- |
| Scope | Single agent process lifetime |
| Storage | In-memory context window + transient tool state |
| Visibility | Private to that session |
| Lifetime | Ephemeral (disappears on exit) |
| Promotion | Post-session harvester extracts durable facts |
| Example | "I read the config file and saw port 8080" |

Swarm Memory (What's Missing)

| Property | Desired |
| --- | --- |
| Scope | All concurrent subagents in a task group |
| Storage | Shared, durable, versioned |
| Visibility | Readable by all siblings; write semantics TBD |
| Lifetime | Persists for duration of the coordinated task |
| Promotion | Real-time or near-real-time synchronization |
| Example | "Agent A found that the API returns 405 on main; all agents should know this now" |

Key insight: Session memory is private and accumulated; swarm memory is shared and coordinated. The harvester/bootstrapper loop is too slow for real-time coordination.


3. Candidate Designs

Design A — Append-Only Event Log + Synthesis

Overview: All subagents write to a shared, append-only event log. A background synthesis process reads the log and extracts high-level facts into the knowledge store. Subagents also read the log to stay current.

Data model:

swarm-memory/
  event-log.jsonl           # Immutable, ordered, concurrent-safe append
  event-index/              # By agent, by type, by timestamp
  synthesized-facts/        # Periodic distillation into durable facts
  checkpoints/              # Snapshot every N events for fast replay

Write path:

  1. Subagent observes something → event_log.append({agent, type, content, timestamp, session_id})
  2. Other subagents can tail the log (like a changelog)

Read path:

  1. Before each action, subagent queries recent events (last N minutes or last M entries)
  2. Background job periodically runs synthesis LLM to convert raw events → distilled facts
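
The write and read paths above can be sketched as a JSONL append plus an offset-based tail. This is a minimal illustration, not the planned scripts/swarm_event_log.py: the function names `append_event` and `read_events` are hypothetical, and it assumes a local filesystem where one small `O_APPEND` write lands as one whole line. That assumption may not hold on network filesystems or for very large records.

```python
import json
import os
import tempfile
import time


def append_event(log_path, agent, event_type, content, session_id):
    """Append one event as a single JSON line and return the record."""
    record = {
        "agent": agent,
        "type": event_type,
        "content": content,
        "timestamp": time.time(),
        "session_id": session_id,
    }
    line = json.dumps(record, sort_keys=True) + "\n"
    # O_APPEND asks the OS to place every write at the current end of file.
    fd = os.open(log_path, os.O_WRONLY | os.O_APPEND | os.O_CREAT, 0o644)
    try:
        os.write(fd, line.encode("utf-8"))
    finally:
        os.close(fd)
    return record


def read_events(log_path, offset=0):
    """Return (events, new_offset) for complete lines at or after `offset`."""
    events = []
    if not os.path.exists(log_path):
        return events, offset
    with open(log_path, "rb") as f:
        f.seek(offset)
        while True:
            line = f.readline()
            if not line.endswith(b"\n"):
                break  # end of file, or a line still being written
            events.append(json.loads(line))
            offset = f.tell()  # advance only past complete lines
    return events, offset


# Usage: agent B tails only what it has not seen yet.
log_path = os.path.join(tempfile.mkdtemp(), "event-log.jsonl")
append_event(log_path, "agent-a", "discovery", "API returns 405 on main", "s1")
events, offset = read_events(log_path)
append_event(log_path, "agent-b", "discovery", "config sets port 8080", "s2")
new_events, offset = read_events(log_path, offset)
```

Returning the byte offset lets each subagent poll cheaply without re-reading the whole log; a reader that finds a partial last line simply picks it up on the next poll.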

Pros:

  • Lossless: Nothing is ever overwritten; full audit trail
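  • Concurrent-safe: Append-only, so writers never modify existing records and need no read-modify-write locking (each append must still land as one whole record)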
  • Causality preserved: Order of discoveries is visible
  • Replayable: Any subagent can reconstruct state from checkpoint + tail
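
The replay property can be illustrated with a small fold over events. This is a sketch under assumptions: the shape of the derived state (discovery content mapped to the agents that reported it), the checkpoint layout, and the function names are hypothetical and not specified by this note.

```python
def apply_event(state, event):
    """Return a state with one event folded in (the input is not mutated)."""
    agents = state.get(event["content"], [])
    if event["agent"] in agents:
        return state
    new_state = dict(state)
    new_state[event["content"]] = agents + [event["agent"]]
    return new_state


def make_checkpoint(events, upto):
    """Snapshot the state after the first `upto` events."""
    state = {}
    for event in events[:upto]:
        state = apply_event(state, event)
    return {"event_count": upto, "state": state}


def replay(checkpoint, events):
    """Rebuild current state from a checkpoint plus the log tail after it."""
    state = checkpoint["state"]
    for event in events[checkpoint["event_count"]:]:
        state = apply_event(state, event)
    return state


events = [
    {"agent": "agent-a", "content": "API returns 405 on main"},
    {"agent": "agent-b", "content": "config sets port 8080"},
    {"agent": "agent-c", "content": "API returns 405 on main"},
]
from_scratch = replay(make_checkpoint(events, 0), events)
from_snapshot = replay(make_checkpoint(events, 2), events)
```

Replaying from the snapshot touches only the tail of the log, which is the point of keeping checkpoints every N events.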

Cons:

  • Signal/noise: Raw events are noisy; synthesis latency means swarm facts lag
  • Storage growth: Event log grows unbounded without pruning policy
  • Query performance: Finding "all facts about X" requires synthesis or full scan
  • Coordination latency: Subagents only learn of discoveries after they're written and tailed

Failure modes:

  • Duplication: Multiple agents write the same observation → synthesis must deduplicate
  • Contradiction: Two agents report conflicting facts → synthesis must reconcile
  • Stale state: Agent reads log at T0, then new events arrive before it acts
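
The first two failure modes suggest what a non-LLM pre-pass in the synthesizer might do: merge identical claims and flag conflicting ones for reconciliation. The sketch below assumes, purely for illustration, that each event's `content` is a structured claim with hypothetical `subject` and `value` keys; the note itself does not define the shape of `content`.

```python
from collections import defaultdict


def synthesize(events):
    """Group claims by subject: merge duplicates, flag contradictions."""
    by_subject = defaultdict(lambda: defaultdict(set))
    for event in events:
        claim = event["content"]  # assumed shape: {"subject": ..., "value": ...}
        by_subject[claim["subject"]][claim["value"]].add(event["agent"])

    facts, conflicts = [], []
    for subject, values in by_subject.items():
        if len(values) == 1:
            # Every reporting agent agrees: one fact, with all reporters listed.
            (value, agents), = values.items()
            facts.append({"subject": subject, "value": value,
                          "agents": sorted(agents)})
        else:
            # Disagreement: leave the decision to a reconciliation step.
            conflicts.append({"subject": subject,
                              "candidates": {v: sorted(a) for v, a in values.items()}})
    return facts, conflicts


events = [
    {"agent": "agent-a", "content": {"subject": "port", "value": 8080}},
    {"agent": "agent-b", "content": {"subject": "port", "value": 8080}},
    {"agent": "agent-a", "content": {"subject": "api_status", "value": 405}},
    {"agent": "agent-c", "content": {"subject": "api_status", "value": 200}},
]
facts, conflicts = synthesize(events)
```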

Design B: Shared Board + Evidence Links

Overview: A shared, mutable board stores distilled facts. Each fact includes provenance links to the agent sessions that discovered it. Agents read-before-write and update via compare-and-swap.

Data model:

swarm-memory/
  board.yaml                # Current set of facts with version stamps
  evidence-links/           # Mapping: fact_id → [session_id, turn_range]
  fact-history/             # append-only log of fact revisions (for audit)

Write path (compare-and-swap):

  1. Agent reads current fact version
  2. Agent proposes update with new evidence
  3. System accepts if version unchanged since read; rejects with retry if conflict
  4. On accept → append to fact-history, increment board version
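
The four steps above can be sketched with an in-memory board. This is an assumption-laden illustration: the `Board` class and its method names are hypothetical, a `threading.Lock` stands in for whatever mechanism would make the check-and-write atomic on board.yaml, and versions are tracked per fact rather than for the whole board.

```python
import threading


class Board:
    """In-memory stand-in for board.yaml plus fact-history."""

    def __init__(self):
        self._lock = threading.Lock()
        self._facts = {}   # fact_id -> {"value", "version", "evidence"}
        self.history = []  # append-only list of accepted revisions

    def read(self, fact_id):
        """Step 1: return a copy of the fact, or an empty fact at version 0."""
        with self._lock:
            fact = self._facts.get(fact_id)
            if fact is None:
                return {"value": None, "version": 0, "evidence": []}
            return {"value": fact["value"], "version": fact["version"],
                    "evidence": list(fact["evidence"])}

    def compare_and_swap(self, fact_id, expected_version, value, evidence):
        """Steps 2-4: accept only if the version is unchanged since the read."""
        with self._lock:  # version check and write happen as one atomic step
            current = self._facts.get(fact_id, {"version": 0})
            if current["version"] != expected_version:
                return False  # conflict: caller must re-read and retry
            fact = {"value": value, "version": expected_version + 1,
                    "evidence": list(evidence)}
            self._facts[fact_id] = fact
            self.history.append({"fact_id": fact_id, **fact})
            return True


board = Board()
seen = board.read("port")
first = board.compare_and_swap("port", seen["version"], 8080, ["session-1:3-5"])
# A second writer that read the same old version is rejected.
second = board.compare_and_swap("port", seen["version"], 3000, ["session-2:1-2"])
```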

Read path:

  1. Agent reads board.yaml (small, distilled)
  2. If deeper verification needed, follow evidence-links to source sessions

Pros:

  • Low-latency reads: Board is small and current
  • Explicit provenance: Every fact knows which sessions contributed
  • Conflict detection: CAS catches concurrent updates
  • Intentional updates: Agents must justify changes with evidence

Cons:

  • Write contention: Multiple agents writing same fact cause retry storms
  • Central point: board.yaml is a single point of contention and failure, even though it is versioned
  • Merge complexity: CAS needs retry-with-backoff logic, and an agent that keeps losing the race could stall
  • Staleness window: Between read and act, board may change

Failure modes:

  • Thundering herd: Many agents CAS-fail on same hot fact → exponential backoff needed
  • Missing promotions: A fact discovered but never written because agent crashed pre-write
  • Board corruption: If CAS not atomic, two writes could interleave
  • Evidence loss: If evidence-links point to deleted session transcripts, verification fails
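
For the thundering-herd case, one common mitigation is exponential backoff with random jitter, so that agents that failed together do not retry together. A minimal sketch follows; the function names, default delays, and attempt limit are hypothetical choices, and `try_once` stands for any CAS attempt that returns True on success.

```python
import random
import time


def backoff_delay(attempt, base=0.05, cap=2.0, rng=random.random):
    """Random delay in [0, min(cap, base * 2**attempt)) seconds."""
    return rng() * min(cap, base * (2 ** attempt))


def retry_with_backoff(try_once, max_attempts=5, sleep=time.sleep):
    """Call try_once() until it returns True.

    Returns the number of attempts used, or None if every attempt failed.
    """
    for attempt in range(max_attempts):
        if try_once():
            return attempt + 1
        if attempt < max_attempts - 1:
            sleep(backoff_delay(attempt))  # wait before the next attempt
    return None


# Usage: a CAS that loses twice, then wins. Sleeps are recorded, not slept.
outcomes = iter([False, False, True])
slept = []
attempts = retry_with_backoff(lambda: next(outcomes), sleep=slept.append)
```

Bounding the number of attempts addresses the stall risk noted under Cons: an agent that exhausts its retries can give up and fall back to re-reading the board.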

4. Trade-off Matrix

| Dimension | Event Log | Shared Board |
| --- | --- | --- |
| Write concurrency | Unbounded (append-only) | Contention on hot keys |
| Read latency | Must scan/synthesize | Direct read (constant-time) |
| Storage efficiency | Redundant raw events | Condensed facts |
| Auditability | Full reconstruction | Requires fact-history |
| Coordination speed | Lag between event → synthesis | Near-real-time (CAS cycle) |
| Complexity | Log management + synthesis worker | CAS protocol + retry logic |

Verdict: Start with Event Log (simpler, safer, no coordination overhead), then layer Board as a view over synthesized facts if read latency becomes a bottleneck.


5. Proposed Experimental Prototype

Scope: Minimal viable swarm-memory path for a controlled parallel task.

Task: Have 3 concurrent subagents process a set of GitHub issues. Each agent:

  1. Reads issue details
  2. Searches codebase for relevant files
  3. Drafts a fix
  4. Writes discovery events to swarm event log
  5. Reads peer discoveries before next step

Metrics to collect:

  • Duplication rate: how many agents found the same root cause independently?
  • Correlation lift: did reading peer discoveries change agent behavior?
  • Latency: time from discovery to visibility across swarm
  • Synthesis quality: can an LLM summarize raw events into coherent facts?
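
Two of these metrics can be computed directly from the log. The sketch below is one possible definition, not a settled one: it treats events with identical `content` as the same discovery, and it assumes hypothetical read records (`agent`, `content`, `timestamp`) that the prototype would have to emit whenever a subagent sees a peer's discovery.

```python
def duplication_rate(events):
    """Share of distinct discoveries reported by two or more agents."""
    agents_by_content = {}
    for e in events:
        agents_by_content.setdefault(e["content"], set()).add(e["agent"])
    if not agents_by_content:
        return 0.0
    duplicated = sum(1 for a in agents_by_content.values() if len(a) > 1)
    return duplicated / len(agents_by_content)


def visibility_latencies(events, reads):
    """Seconds from the first write of a discovery to its first read by a peer."""
    first_write = {}
    for e in events:
        prev = first_write.get(e["content"])
        if prev is None or e["timestamp"] < prev["timestamp"]:
            first_write[e["content"]] = e
    latencies = {}
    for r in reads:
        w = first_write.get(r["content"])
        if w is None or r["agent"] == w["agent"]:
            continue  # unknown discovery, or the author reading its own event
        delta = r["timestamp"] - w["timestamp"]
        if r["content"] not in latencies or delta < latencies[r["content"]]:
            latencies[r["content"]] = delta
    return latencies


events = [
    {"agent": "agent-a", "content": "root cause X", "timestamp": 10.0},
    {"agent": "agent-b", "content": "root cause X", "timestamp": 12.0},
    {"agent": "agent-c", "content": "flaky test Y", "timestamp": 15.0},
]
reads = [
    {"agent": "agent-a", "content": "root cause X", "timestamp": 10.2},
    {"agent": "agent-b", "content": "root cause X", "timestamp": 11.5},
    {"agent": "agent-c", "content": "root cause X", "timestamp": 13.0},
]
rate = duplication_rate(events)
latencies = visibility_latencies(events, reads)
```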

Implementation plan:

  1. scripts/swarm_event_log.py — thread-safe JSONL append + tail API
  2. scripts/swarm_synthesizer.py — periodic batch that consumes event log, emits distilled facts
  3. Patch hermes-agent burn worker to emit events at key milestones
  4. Simple dashboard: metrics/swarm_memory_dashboard.md

Success criteria: Prototype runs end-to-end with 3 agents; event log captures discoveries; synthesizer produces at least one cross-agent insight.


6. Failure Modes to Watch

| Mode | Symptom | Mitigation |
| --- | --- | --- |
| Duplication | Same fact appears from 3 agents | Synthesis dedup; evidence links count |
| Contradiction | Agent A says "port 8080", Agent B says "port 3000" | Evidence-weighted majority; timestamp priority |
| Stale shared state | Agent reads board, acts, board changed under it | Version vectors; read-modify-write CAS with retry |
| Missing promotion | Discovery lost on agent crash | Event log is durable before action; recovery from last checkpoint |
| Race on hot fact | Two agents try to write same fact simultaneously | CAS backoff; random jitter |
| Log unbounded | Event log grows 10GB/day | Checkpoint + prune: keep summary + recent window |
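
The contradiction mitigation ("evidence-weighted majority; timestamp priority") could be read in more than one way. The sketch below picks one interpretation and labels it as an assumption: the weight of a value is the total number of evidence links behind it, and ties go to the value with the most recent claim. The `resolve` function and the claim fields are hypothetical.

```python
def resolve(claims):
    """Pick the value with the most evidence; break ties by latest timestamp."""
    if not claims:
        return None
    totals = {}  # value -> (evidence link count, latest timestamp)
    for c in claims:
        weight, latest = totals.get(c["value"], (0, float("-inf")))
        totals[c["value"]] = (weight + len(c["evidence"]),
                              max(latest, c["timestamp"]))
    # Tuples compare by evidence count first, then by timestamp.
    return max(totals, key=lambda value: totals[value])


majority = resolve([
    {"value": "port 8080", "evidence": ["s1:3-5", "s2:7-9"], "timestamp": 1.0},
    {"value": "port 3000", "evidence": ["s3:1-2"], "timestamp": 5.0},
])
tie_break = resolve([
    {"value": "port 8080", "evidence": ["s1:3-5"], "timestamp": 1.0},
    {"value": "port 3000", "evidence": ["s3:1-2"], "timestamp": 5.0},
])
```

Here `majority` resolves to the better-evidenced value even though it is older, while `tie_break` falls back to recency because the evidence counts are equal.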

7. Next Steps (Out of Scope for This Note)

  • Build the event log implementation (Design A, phase 1)
  • Wire hermes-agent to emit events
  • Run the 3-agent parallel experiment
  • Measure and compare Board vs Log read patterns
  • Decide: ship to prod or iterate

8. References

  • Parent: Timmy_Foundation/hermes-agent#984 — [ATLAS] Steal highest-leverage ecosystem patterns
  • Related: compounding-intelligence#229 — Telemetry ingestion (Tokscale)
  • Related: hermes-agent#985 — Lossless context + memory subsystem (LCM/GBrain)