[ATLAS][Connectors] Sovereign personal archive connector pack — Discord/Slack/WhatsApp/Notion/iMessage/X/Google #233

Open
opened 2026-04-25 22:33:01 +00:00 by Rockachopa · 0 comments

Parent

Timmy_Foundation/compounding-intelligence#194

Source

Telegram image triage from Alexander: a CLI-style list proposing:

- discrawl: Discord archive/search.
- slacrawl: Slack local/API mirror.
- wacrawl: WhatsApp Desktop archive.
- notcrawl: Notion SQLite/Markdown mirror.
- beeper: Beeper/iMessage local history.
- birdclaw: X/Twitter archive/inbox.
- gog: Google services CLI.

Take What Is Good

Do not copy the names blindly. Take the pattern: a local-first connector suite that turns user-owned app archives into searchable, cited, provenance-preserving Timmy knowledge.

The valuable part for us is one shared connector contract, not seven bespoke toys:

  1. Mirror user-owned data from local exports, local databases, or explicit API tokens.
  2. Normalize into one source-event schema: source, account, thread/channel, author, timestamp, content, attachments, raw_ref, hash, consent_scope.
  3. Redact and classify secrets/private data before memory, wiki, or training-data ingestion.
  4. Index locally for keyword + semantic retrieval with citations back to raw source.
  5. Sync incrementally with a checkpoint database so re-runs are cheap and auditable.
  6. Keep provenance so retrieved memories can say where they came from and when.
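The shared contract above can be sketched as one normalized record type. This is a minimal sketch, not a settled design: the field names come from the schema list in step 2, while the class shape, field types, and hash recipe are assumptions.

```python
import hashlib
from dataclasses import dataclass

@dataclass(frozen=True)
class SourceEvent:
    """One normalized event from any connector (illustrative sketch)."""
    source: str            # e.g. "discord", "notion"
    account: str           # owning account identifier
    thread: str            # thread/channel identifier
    author: str
    timestamp: str         # ISO 8601
    content: str
    attachments: tuple = ()
    raw_ref: str = ""      # pointer back to the raw export record
    consent_scope: str = "local-only"

    @property
    def hash(self) -> str:
        # Stable hash over identity-bearing fields, so re-runs dedup cheaply.
        key = "\x1f".join([self.source, self.account, self.thread,
                           self.author, self.timestamp, self.content])
        return hashlib.sha256(key.encode("utf-8")).hexdigest()
```

Because the hash is derived only from identity-bearing fields, two ingests of the same export produce identical hashes, which is what makes incremental sync (step 5) auditable.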

Initial Connectors To Prototype

Prioritize connectors that are local/exportable and high-value for Alexander/Timmy:

  • X/Twitter archive/inbox: build on existing Twitter archive work; extract posts, media refs, replies, DMs if export contains them.
  • Notion markdown/SQLite mirror: ingest exported markdown/CSV first; only add API later if explicit token is configured.
  • Discord archive/search: start with official data package / exported JSON where available; API mirror only with explicit bot/user authorization.
  • iMessage/Beeper local history: macOS-local only; read-only; consent-gated; no cloud dependency.
  • Google services CLI adapter: wrap existing Google Workspace skill patterns behind the same connector schema; Gmail/Drive/Calendar are opt-in.
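For the X/Twitter connector, a reader-side sketch might look like the following. It assumes the common data-export layout (a `data/tweets.js` file holding a JavaScript assignment like `window.YTD.tweets.part0 = [...]`); the helper name is hypothetical, and older exports name the file `tweet.js` instead.

```python
import json
from pathlib import Path

def load_x_archive_tweets(archive_dir: str) -> list:
    """Parse tweets from an X/Twitter data export (hypothetical helper).

    Assumes the common export layout: data/tweets.js containing a JS
    assignment like `window.YTD.tweets.part0 = [...]`. Older exports
    use data/tweet.js; a real connector should try both.
    """
    raw = Path(archive_dir, "data", "tweets.js").read_text(encoding="utf-8")
    # Strip the JS assignment prefix, leaving the plain JSON array.
    payload = raw[raw.index("["):]
    return [entry["tweet"] for entry in json.loads(payload)]
```

A real connector would then map each tweet dict onto the shared source-event schema rather than exposing the raw export shape.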

Defer or research carefully:

  • WhatsApp Desktop archive: local storage format and encryption vary by platform and version.
  • Slack local/API mirror: workspace authorization and retention policies need careful handling.

Deliverables

  • Define SourceConnector interface and normalized SourceEvent schema.
  • Build a local checkpoint store for incremental sync and dedup by stable hash.
  • Implement at least two prototype connectors from the priority list.
  • Add a privacy/consent gate: explicit configured source paths/tokens only; never ambient scraping.
  • Add secret/PII redaction before writing to MemPalace, LLM Wiki, fact_store, or training data.
  • Add provenance-aware query output that cites source connector + raw reference.
  • Document connector setup, threat model, and supported export/API modes.
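The first two deliverables could be expressed roughly as below. This is a sketch under assumptions: the `SourceConnector` protocol, the SQLite table layout, and the dict-shaped events are illustrative, not the final interface.

```python
import sqlite3
from typing import Iterable, Optional, Protocol

class SourceConnector(Protocol):
    """Hypothetical shared contract every connector implements."""
    name: str

    def events(self, since: Optional[str]) -> Iterable[dict]:
        """Yield normalized events (dicts with 'hash' and 'timestamp' keys)."""
        ...

class CheckpointStore:
    """Local store that makes re-runs idempotent: dedup by stable hash."""

    def __init__(self, path: str = ":memory:"):
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS seen "
            "(hash TEXT PRIMARY KEY, source TEXT, ts TEXT)")

    def ingest(self, connector_name: str, events: Iterable[dict]) -> int:
        """Insert unseen events; return how many were actually new."""
        before = self.db.total_changes
        for ev in events:
            # PRIMARY KEY on hash makes duplicates no-ops.
            self.db.execute(
                "INSERT OR IGNORE INTO seen VALUES (?, ?, ?)",
                (ev["hash"], connector_name, ev["timestamp"]))
        self.db.commit()
        return self.db.total_changes - before
```

Returning the count of genuinely new rows gives re-runs an audit trail for free: a second pass over the same export should report zero.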

Acceptance Criteria

  • A test fixture can ingest at least two platform exports into a shared local index.
  • Duplicate messages across re-runs are not re-ingested.
  • Every indexed event has source, timestamp, hash, and raw reference metadata.
  • Retrieval returns cited snippets and does not expose redacted secrets.
  • The pipeline can feed the LLM Wiki / knowledge pipeline without losing provenance.
  • Connectors fail closed when paths/tokens are missing or permissions are ambiguous.
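The "does not expose redacted secrets" criterion implies a redaction pass before anything reaches an index or store. A minimal sketch follows; the patterns are illustrative only, and a real gate would use a dedicated secret scanner plus PII classifiers rather than a handful of regexes.

```python
import re

# Illustrative patterns only; a real gate needs a proper secret scanner.
SECRET_PATTERNS = [
    re.compile(r"(?i)\b(api[_-]?key|token|password)\s*[:=]\s*\S+"),
    re.compile(r"\bAKIA[0-9A-Z]{16}\b"),   # AWS access key id shape
    re.compile(r"\b[0-9]{13,19}\b"),       # long digit runs (crude card-like check)
]

def redact(text: str, marker: str = "[REDACTED]") -> str:
    """Replace secret-shaped spans before anything is indexed or stored."""
    for pat in SECRET_PATTERNS:
        text = pat.sub(marker, text)
    return text
```

Running this at ingestion time (rather than at query time) keeps redacted material out of the index entirely, so retrieval cannot leak it even by accident.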

Non-Goals

  • No credential harvesting.
  • No bypassing platform access controls.
  • No cloud dependency for local archive indexing.
  • No turning private messages into training data without an explicit separate consent gate.

Reference: Timmy_Foundation/compounding-intelligence#233