[ATLAS][Connectors] Sovereign personal archive connector pack — Discord/Slack/WhatsApp/Notion/iMessage/X/Google #233

Open
opened 2026-04-25 22:33:01 +00:00 by Rockachopa · 0 comments

Parent

Timmy_Foundation/compounding-intelligence#194

Source

Telegram image triage from Alexander: a CLI-style list proposing:

- discrawl: Discord archive/search.
- slacrawl: Slack local/API mirror.
- wacrawl: WhatsApp Desktop archive.
- notcrawl: Notion SQLite/Markdown mirror.
- beeper: Beeper/iMessage local history.
- birdclaw: X/Twitter archive/inbox.
- gog: Google services CLI.

Take What Is Good

Do not copy the names blindly. Take the pattern: a local-first connector suite that turns user-owned app archives into searchable, cited, provenance-preserving Timmy knowledge.

The valuable part for us is one shared connector contract, not seven bespoke toys:

  1. Mirror user-owned data from local exports, local databases, or explicit API tokens.
  2. Normalize into one source-event schema: source, account, thread/channel, author, timestamp, content, attachments, raw_ref, hash, consent_scope.
  3. Redact and classify secrets/private data before memory, wiki, or training-data ingestion.
  4. Index locally for keyword + semantic retrieval with citations back to raw source.
  5. Sync incrementally with a checkpoint database so re-runs are cheap and auditable.
  6. Keep provenance so retrieved memories can say where they came from and when.
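The shared contract above can be sketched as one normalized record type. This is a minimal sketch, not a settled design: the field names come from the schema list in step 2, while the class shape, field types, and hash recipe are assumptions.

```python
import hashlib
from dataclasses import dataclass

@dataclass(frozen=True)
class SourceEvent:
    """One normalized event from any connector (illustrative sketch)."""
    source: str            # e.g. "discord", "notion"
    account: str           # owning account identifier
    thread: str            # thread/channel identifier
    author: str
    timestamp: str         # ISO 8601
    content: str
    attachments: tuple = ()
    raw_ref: str = ""      # pointer back to the raw export record
    consent_scope: str = "local-only"

    @property
    def hash(self) -> str:
        # Stable hash over identity-bearing fields, so re-runs dedup cheaply.
        key = "\x1f".join([self.source, self.account, self.thread,
                           self.author, self.timestamp, self.content])
        return hashlib.sha256(key.encode("utf-8")).hexdigest()
```

Because the hash is derived only from identity-bearing fields, two ingests of the same export produce identical hashes, which is what makes incremental sync (step 5) auditable.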

Initial Connectors To Prototype

Prioritize connectors that are local/exportable and high-value for Alexander/Timmy:

  • X/Twitter archive/inbox: build on existing Twitter archive work; extract posts, media refs, replies, DMs if export contains them.
  • Notion markdown/SQLite mirror: ingest exported markdown/CSV first; only add API later if explicit token is configured.
  • Discord archive/search: start with official data package / exported JSON where available; API mirror only with explicit bot/user authorization.
  • iMessage/Beeper local history: macOS-local only; read-only; consent-gated; no cloud dependency.
  • Google services CLI adapter: wrap existing Google Workspace skill patterns behind the same connector schema; Gmail/Drive/Calendar are opt-in.
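For the X/Twitter connector, a reader-side sketch might look like the following. It assumes the common data-export layout (a `data/tweets.js` file holding a JavaScript assignment like `window.YTD.tweets.part0 = [...]`); the helper name is hypothetical, and older exports name the file `tweet.js` instead.

```python
import json
from pathlib import Path

def load_x_archive_tweets(archive_dir: str) -> list:
    """Parse tweets from an X/Twitter data export (hypothetical helper).

    Assumes the common export layout: data/tweets.js containing a JS
    assignment like `window.YTD.tweets.part0 = [...]`. Older exports
    use data/tweet.js; a real connector should try both.
    """
    raw = Path(archive_dir, "data", "tweets.js").read_text(encoding="utf-8")
    # Strip the JS assignment prefix, leaving the plain JSON array.
    payload = raw[raw.index("["):]
    return [entry["tweet"] for entry in json.loads(payload)]
```

A real connector would then map each tweet dict onto the shared source-event schema rather than exposing the raw export shape.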

Defer or research carefully:

  • WhatsApp Desktop archive: local storage format and encryption vary by platform and version.
  • Slack local/API mirror: workspace authorization and retention policies need careful handling.

Deliverables

  • Define SourceConnector interface and normalized SourceEvent schema.
  • Build a local checkpoint store for incremental sync and dedup by stable hash.
  • Implement at least two prototype connectors from the priority list.
  • Add a privacy/consent gate: explicit configured source paths/tokens only; never ambient scraping.
  • Add secret/PII redaction before writing to MemPalace, LLM Wiki, fact_store, or training data.
  • Add provenance-aware query output that cites source connector + raw reference.
  • Document connector setup, threat model, and supported export/API modes.
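The first two deliverables could be expressed roughly as below. This is a sketch under assumptions: the `SourceConnector` protocol, the SQLite table layout, and the dict-shaped events are illustrative, not the final interface.

```python
import sqlite3
from typing import Iterable, Optional, Protocol

class SourceConnector(Protocol):
    """Hypothetical shared contract every connector implements."""
    name: str

    def events(self, since: Optional[str]) -> Iterable[dict]:
        """Yield normalized events (dicts with 'hash' and 'timestamp' keys)."""
        ...

class CheckpointStore:
    """Local store that makes re-runs idempotent: dedup by stable hash."""

    def __init__(self, path: str = ":memory:"):
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS seen "
            "(hash TEXT PRIMARY KEY, source TEXT, ts TEXT)")

    def ingest(self, connector_name: str, events: Iterable[dict]) -> int:
        """Insert unseen events; return how many were actually new."""
        before = self.db.total_changes
        for ev in events:
            # PRIMARY KEY on hash makes duplicates no-ops.
            self.db.execute(
                "INSERT OR IGNORE INTO seen VALUES (?, ?, ?)",
                (ev["hash"], connector_name, ev["timestamp"]))
        self.db.commit()
        return self.db.total_changes - before
```

Returning the count of genuinely new rows gives re-runs an audit trail for free: a second pass over the same export should report zero.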

Acceptance Criteria

  • A test fixture can ingest at least two platform exports into a shared local index.
  • Duplicate messages across re-runs are not re-ingested.
  • Every indexed event has source, timestamp, hash, and raw reference metadata.
  • Retrieval returns cited snippets and does not expose redacted secrets.
  • The pipeline can feed the LLM Wiki / knowledge pipeline without losing provenance.
  • Connectors fail closed when paths/tokens are missing or permissions are ambiguous.
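The "does not expose redacted secrets" criterion implies a redaction pass before anything reaches an index or store. A minimal sketch follows; the patterns are illustrative only, and a real gate would use a dedicated secret scanner plus PII classifiers rather than a handful of regexes.

```python
import re

# Illustrative patterns only; a real gate needs a proper secret scanner.
SECRET_PATTERNS = [
    re.compile(r"(?i)\b(api[_-]?key|token|password)\s*[:=]\s*\S+"),
    re.compile(r"\bAKIA[0-9A-Z]{16}\b"),   # AWS access key id shape
    re.compile(r"\b[0-9]{13,19}\b"),       # long digit runs (crude card-like check)
]

def redact(text: str, marker: str = "[REDACTED]") -> str:
    """Replace secret-shaped spans before anything is indexed or stored."""
    for pat in SECRET_PATTERNS:
        text = pat.sub(marker, text)
    return text
```

Running this at ingestion time (rather than at query time) keeps redacted material out of the index entirely, so retrieval cannot leak it even by accident.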

Non-Goals

  • No credential harvesting.
  • No bypassing platform access controls.
  • No cloud dependency for local archive indexing.
  • No turning private messages into training data without an explicit separate consent gate.

Reference: Timmy_Foundation/compounding-intelligence#233