Timmy_Foundation/compounding-intelligence

docs(swarm): add design note for swarm-memory architecture

Creates docs/swarm-memory-design.md — a comprehensive design note
that:
- Distinguishes session memory (private, ephemeral) from swarm memory
  (shared, coordinated)
- Evaluates two candidate designs: append-only event log + synthesis
  vs shared board + evidence links with CAS
- Documents trade-offs, failure modes, and proposed experimental
  prototype using 3 concurrent subagents
- Sets acceptance baseline for issue #232

This is the smallest concrete step toward solving the swarm-memory gap:
a shared frame of reference that concurrent subagents can use to
coordinate without corrupting each other.

Closes #232

2026-04-26 12:37:10 -04:00

Swarm Memory Architecture — Design Note

Issue: #232 — [ATLAS][Research] Solve the swarm-memory gap for concurrent subagents
Repo: Timmy_Foundation/compounding-intelligence
Status: Research — Design Draft
Author: step35 (burn)
Date: 2026-04-26

1. Problem Statement

The compounding-intelligence pipelines assume a session-bounded memory model: each agent session starts with injected bootstrap context, runs, produces a transcript, then ends. Knowledge is harvested after the session and injected before the next.

But concurrent subagents (multiple agents working on parallel tasks at the same time) break this model:

No shared scratch space: Each subagent operates in isolation; discoveries in sibling sessions aren't visible until the next harvest cycle.
Race conditions on promotion: Two subagents may discover the same fact; both write it, causing duplication or conflicts.
Lost correlation: Without a shared event log, you cannot reconstruct what happened across the swarm.
Stale shared state: If a fact is promoted to global memory while subagents are still running, they may act on outdated assumptions.

Core question: What memory semantics should exist across concurrent subagents so they can cooperate without corrupting each other or losing important results?

2. Session Memory vs Swarm Memory

Session Memory (Current)

Property	Description
Scope	Single agent process lifetime
Storage	In-memory context window + transient tool state
Visibility	Private to that session
Lifetime	Ephemeral — disappears on exit
Promotion	Post-session harvester extracts durable facts
Example	"I read the config file and saw port 8080"

Swarm Memory (What's Missing)

Property	Desired
Scope	All concurrent subagents in a task group
Storage	Shared, durable, versioned
Visibility	Readable by all siblings; write semantics TBD
Lifetime	Persists for duration of the coordinated task
Promotion	Real-time or near-real-time synchronization
Example	"Agent A found that the API returns 405 on main; all agents should know this now"

Key insight: Session memory is private and accumulated; swarm memory is shared and coordinated. The harvester/bootstrapper loop is too slow for real-time coordination.

3. Candidate Designs

Design A — Append-Only Event Log + Synthesis

Overview: All subagents write to a shared, append-only event log. A background synthesis process reads the log and extracts high-level facts into the knowledge store. Subagents also read the log to stay current.

Data model:

swarm-memory/
  event-log.jsonl           # Immutable, ordered, concurrent-safe append
  event-index/              # By agent, by type, by timestamp
  synthesized-facts/        # Periodic distillation into durable facts
  checkpoints/              # Snapshot every N events for fast replay

Write path:

Subagent observes something → event_log.append({agent, type, content, timestamp, session_id})
Other subagents can tail the log (like a changelog)
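
The write path above can be sketched as follows. This is a minimal sketch, not the planned scripts/swarm_event_log.py: the class name SwarmEventLog, the event_type parameter name, and the use of a per-process threading.Lock are assumptions for illustration. A lock of this kind only serializes threads inside one process; subagents running as separate processes would need an additional mechanism such as file locking.

```python
import json
import os
import threading
import time


class SwarmEventLog:
    """Hypothetical append-only JSONL event log (names are assumptions)."""

    def __init__(self, path):
        self.path = path
        self._lock = threading.Lock()  # serializes writers within this process only

    def append(self, agent, event_type, content, session_id):
        # Field names follow the write path in the note.
        event = {
            "agent": agent,
            "type": event_type,
            "content": content,
            "timestamp": time.time(),
            "session_id": session_id,
        }
        line = json.dumps(event) + "\n"
        with self._lock:
            # Mode "a" places every write at the current end of the file.
            with open(self.path, "a", encoding="utf-8") as f:
                f.write(line)
                f.flush()
                os.fsync(f.fileno())  # push the event to disk before returning
        return event
```

Writing each event as one complete line keeps the file readable by a simple line-by-line tailer.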

Read path:

Before each action, subagent queries recent events (last N minutes or last M entries)
Background job periodically runs synthesis LLM to convert raw events → distilled facts
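
The "last N minutes or last M entries" query could look like the sketch below. The function name recent_events and its parameters are hypothetical, and it assumes each line of the log is a JSON object with a numeric timestamp field in seconds. It scans the whole file, which is only reasonable for a small prototype log.

```python
import json
import time


def recent_events(path, max_age_seconds=None, max_entries=None, now=None):
    """Return events from a JSONL log, filtered by age and/or count."""
    now = time.time() if now is None else now
    events = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            try:
                event = json.loads(line)
            except json.JSONDecodeError:
                continue  # ignore a line that a writer has not finished yet
            if max_age_seconds is not None and now - event["timestamp"] > max_age_seconds:
                continue  # older than the requested window
            events.append(event)
    if max_entries is not None:
        # Keep only the newest entries; guard against the [-0:] slice pitfall.
        events = events[-max_entries:] if max_entries > 0 else []
    return events
```

A subagent would call this before each action, for example recent_events(path, max_age_seconds=300) for the last five minutes.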

Pros:

Lossless: Nothing is ever overwritten; full audit trail
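Concurrent-safe: Writers only append, so they never overwrite each other and need no read-modify-write coordination; each append must still be written atomically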
Causality preserved: Order of discoveries is visible
Replayable: Any subagent can reconstruct state from checkpoint + tail

Cons:

Signal/noise: Raw events are noisy; synthesis latency means swarm facts lag
Storage growth: Event log grows unbounded without pruning policy
Query performance: Finding "all facts about X" requires synthesis or full scan
Coordination latency: Subagents only learn of discoveries after they're written and tailed

Failure modes:

Duplication: Multiple agents write the same observation → synthesis dedups
Contradiction: Two agents report conflicting facts → synthesis must reconcile
Stale state: Agent reads log at T0, then new events arrive before it acts
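
To make the duplication case concrete, here is a deliberately simple stand-in for the dedup step. The note describes synthesis as LLM-based; this sketch instead merges events whose type and whitespace/case-normalized content match exactly, which is an assumption made purely for illustration. The function name dedupe_events and the output fields are hypothetical.

```python
def dedupe_events(events):
    """Collapse events with the same type and normalized content into one fact."""
    facts = {}
    for event in events:
        # Normalize case and whitespace so trivially different wordings match.
        key = (event["type"], " ".join(event["content"].lower().split()))
        fact = facts.setdefault(key, {
            "type": event["type"],
            "content": event["content"],
            "agents": [],
            "first_seen": event["timestamp"],
        })
        if event["agent"] not in fact["agents"]:
            fact["agents"].append(event["agent"])
        fact["first_seen"] = min(fact["first_seen"], event["timestamp"])
    return list(facts.values())
```

Exact matching will miss paraphrased duplicates and cannot detect contradictions; those cases still need the reconciliation step described above.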

Design B — Shared Board + Evidence Links

Overview: A shared, mutable board stores distilled facts. Each fact includes provenance links to the agent sessions that discovered it. Agents read-before-write and update via compare-and-swap.

Data model:

swarm-memory/
  board.yaml                # Current set of facts with version stamps
  evidence-links/           # Mapping: fact_id → [session_id, turn_range]
  fact-history/             # append-only log of fact revisions (for audit)

Write path (compare-and-swap):

Agent reads current fact version
Agent proposes update with new evidence
System accepts the update if the fact's version is unchanged since the read; otherwise it rejects the update and the agent retries
On accept → append to fact-history, increment board version
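
The four steps above can be sketched with an in-memory board. This is an illustrative sketch only: the class names Board and VersionConflict, the per-fact version numbers starting at 0 for a missing fact, and the in-process lock are assumptions, and persistence to board.yaml is left out.

```python
import threading


class VersionConflict(Exception):
    """Raised when a fact changed between an agent's read and its write."""


class Board:
    """Hypothetical in-memory board with per-fact compare-and-swap."""

    def __init__(self):
        self._lock = threading.Lock()
        self._facts = {}        # fact_id -> latest revision
        self.fact_history = []  # append-only list of accepted revisions
        self.version = 0        # board-wide version, bumped on every accept

    def read(self, fact_id):
        with self._lock:
            fact = self._facts.get(fact_id)
            return (fact["value"], fact["version"]) if fact else (None, 0)

    def compare_and_swap(self, fact_id, expected_version, value, evidence):
        # Check and update happen under one lock so they cannot interleave.
        with self._lock:
            current = self._facts.get(fact_id)
            current_version = current["version"] if current else 0
            if current_version != expected_version:
                raise VersionConflict(
                    f"{fact_id}: expected v{expected_version}, found v{current_version}"
                )
            revision = {
                "fact_id": fact_id,
                "value": value,
                "version": current_version + 1,
                "evidence": evidence,
            }
            self._facts[fact_id] = revision
            self.fact_history.append(revision)
            self.version += 1
            return revision["version"]
```

An agent would call read, build its update, then call compare_and_swap with the version it read; a VersionConflict tells it to re-read and try again.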

Read path:

Agent reads board.yaml (small, distilled)
If deeper verification is needed, follow evidence-links to source sessions

Pros:

Low-latency reads: Board is small and current
Explicit provenance: Every fact knows which sessions contributed
Conflict detection: CAS catches concurrent updates
Intentional updates: Agents must justify changes with evidence

Cons:

Write contention: Multiple agents writing same fact cause retry storms
Central point: board.yaml is the single source of truth, so it is also a single point of contention and failure (although it is versioned)
Merge complexity: CAS retries need backoff logic, and an agent that keeps losing the race could stall
Staleness window: Between read and act, board may change

Failure modes:

Thundering herd: Many agents CAS-fail on same hot fact → exponential backoff needed
Missing promotions: A fact discovered but never written because agent crashed pre-write
Board corruption: If CAS is not atomic, two writes could interleave
Evidence loss: If evidence-links point to deleted session transcripts, verification fails
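
For the thundering-herd case, one common approach is exponential backoff with random jitter. The sketch below is a generic retry helper, not part of the proposed design: ConflictError, retry_with_backoff, and the default delays are all hypothetical. The sleep and rng parameters exist only so the behavior can be tested deterministically.

```python
import random
import time


class ConflictError(Exception):
    """Raised by an operation when its compare-and-swap loses a race."""


def retry_with_backoff(operation, max_attempts=5, base_delay=0.05, max_delay=2.0,
                       sleep=time.sleep, rng=random.random):
    """Call operation() until it succeeds or max_attempts is reached."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except ConflictError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the conflict to the caller
            # The delay ceiling doubles each attempt, capped at max_delay.
            cap = min(max_delay, base_delay * (2 ** attempt))
            # Sleep a random fraction of the ceiling so agents spread out.
            sleep(rng() * cap)
```

The random factor matters as much as the doubling: without it, agents that failed together would retry together and collide again.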

4. Trade-off Matrix

Dimension	Event Log	Shared Board
Write concurrency	Unbounded (append-only)	Contention on hot keys
Read latency	Must scan/synthesize	Direct read (constant-time)
Storage efficiency	Redundant raw events	Condensed facts
Auditability	Full reconstruction	Requires fact-history
Coordination speed	Lag between event → synthesis	Near-real-time (CAS cycle)
Complexity	Log management + synthesis worker	CAS protocol + retry logic

Verdict: Start with Event Log (simpler, safer, no coordination overhead), then layer Board as a view over synthesized facts if read latency becomes a bottleneck.

5. Proposed Experimental Prototype

Scope: Minimal viable swarm-memory path for a controlled parallel task.

Task: Have 3 concurrent subagents process a set of GitHub issues. Each agent:

Reads issue details
Searches codebase for relevant files
Drafts a fix
Writes discovery events to swarm event log
Reads peer discoveries before next step

Metrics to collect:

Duplication rate: how many agents found the same root cause independently?
Correlation lift: did reading peer discoveries change agent behavior?
Latency: time from discovery to visibility across swarm
Synthesis quality: can an LLM summarize raw events into coherent facts?
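
The note does not define the duplication rate precisely. One possible reading, sketched below as an assumption, is the fraction of distinct discoveries that more than one agent reported. The function name duplication_rate is hypothetical, and matching discoveries by identical content strings is a simplification; a real run would likely need fuzzier matching.

```python
from collections import defaultdict


def duplication_rate(events):
    """Fraction of distinct discoveries reported by more than one agent."""
    agents_by_content = defaultdict(set)
    for event in events:
        agents_by_content[event["content"]].add(event["agent"])
    if not agents_by_content:
        return 0.0  # no discoveries, so nothing was duplicated
    duplicated = sum(1 for agents in agents_by_content.values() if len(agents) > 1)
    return duplicated / len(agents_by_content)
```

With this definition, a rate of 0.0 means every discovery was unique to one agent and 1.0 means every discovery was found independently by at least two agents.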

Implementation plan:

scripts/swarm_event_log.py — thread-safe JSONL append + tail API
scripts/swarm_synthesizer.py — periodic batch that consumes event log, emits distilled facts
Patch hermes-agent burn worker to emit events at key milestones
Simple dashboard: metrics/swarm_memory_dashboard.md

Success criteria: Prototype runs end-to-end with 3 agents; event log captures discoveries; synthesizer produces at least one cross-agent insight.

6. Failure Modes to Watch

Mode	Symptom	Mitigation
Duplication	Same fact appears from 3 agents	Synthesis dedup; evidence links count
Contradiction	Agent A says "port 8080", Agent B says "port 3000"	Evidence-weighted majority; timestamp priority
Stale shared state	Agent reads board, acts, board changed under it	Version vectors; read-modify-write CAS with retry
Missing promotion	Discovery lost on agent crash	Event log is durable before action; recovery from last checkpoint
Race on hot fact	Two agents try to write same fact simultaneously	CAS backoff; random jitter
Log unbounded	Event log grows 10GB/day	Checkpoint + prune: keep summary + recent window
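
The contradiction mitigation in the table names two rules without defining them. The sketch below shows one possible combination, stated as an assumption: pick the value backed by the most distinct agents, and break ties in favor of the most recently reported value. The function name resolve_contradiction and the claim fields are hypothetical.

```python
def resolve_contradiction(claims):
    """Pick a value by distinct-agent support, then by most recent timestamp."""
    if not claims:
        raise ValueError("no claims to resolve")
    support = {}
    for claim in claims:
        entry = support.setdefault(
            claim["value"], {"agents": set(), "latest": claim["timestamp"]}
        )
        entry["agents"].add(claim["agent"])
        entry["latest"] = max(entry["latest"], claim["timestamp"])
    # Tuples compare element by element: agent count first, recency second.
    return max(support, key=lambda v: (len(support[v]["agents"]), support[v]["latest"]))
```

Counting distinct agents rather than raw events keeps one agent from outvoting the others by repeating itself.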

7. Next Steps (Out of Scope for This Note)

Build the event log implementation (Design A, phase 1)
Wire hermes-agent to emit events
Run the 3-agent parallel experiment
Measure and compare Board vs Log read patterns
Decide: ship to prod or iterate

8. References

Parent: Timmy_Foundation/hermes-agent#984 — [ATLAS] Steal highest-leverage ecosystem patterns
Related: compounding-intelligence#229 — Telemetry ingestion (Tokscale)
Related: hermes-agent#985 — Lossless context + memory subsystem (LCM/GBrain)
