[Mnemosyne] Add file-based document ingestion pipeline #1275

Closed
opened 2026-04-12 11:43:08 +00:00 by Rockachopa · 1 comment
Owner

Problem

The current ingest.py only supports MemPalace exports and single events. There's no way to bulk-import a directory of markdown/text files — a common use case for sovereign agents who accumulate notes, docs, and reports on disk.

Proposed Feature

Add ingest_directory() and ingest_file() functions to nexus/mnemosyne/ingest.py:

  • ingest_file(archive, path) — reads a single file, extracts title from first # heading (or filename), ingests content with source tracking
  • ingest_directory(archive, dir_path, extensions=None) — walks a directory tree, ingests all matching files (default: .md, .txt, .json)
  • Dedup via source_ref (file path + mtime) to avoid re-ingesting unchanged files
  • Chunking for files over a max size (e.g., split on ## headings or fixed token windows)
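
A minimal sketch of how the pieces above could fit together, offered as an illustration rather than the actual implementation. The real archive API and the ArchiveEntry type are not shown in this issue, so a plain dict stands in for the archive and for each entry. The helper names (extract_title, make_source_ref, chunk_on_h2), the MAX_CHARS threshold, and the source_ref string layout are all assumptions.

```python
import re
from pathlib import Path

MAX_CHARS = 4000  # assumed threshold; the issue does not fix a max size


def extract_title(text: str, path: Path) -> str:
    """Return the first '# ' heading, falling back to the filename stem."""
    match = re.search(r"^# +(.+)$", text, flags=re.MULTILINE)
    return match.group(1).strip() if match else path.stem


def make_source_ref(path: Path) -> str:
    """Build a dedup key from the resolved path and its modification time."""
    return f"{path.resolve()}:{path.stat().st_mtime_ns}"


def chunk_on_h2(text: str, max_chars: int = MAX_CHARS) -> list[str]:
    """Split oversized text before each '## ' heading; small text stays whole."""
    if len(text) <= max_chars:
        return [text]
    parts = re.split(r"(?m)^(?=## )", text)
    return [part for part in parts if part.strip()]


def ingest_file(archive: dict, path: Path) -> dict:
    """Ingest one file; return the existing entry if the file is unchanged.

    `archive` is a dict keyed by source_ref, a hypothetical stand-in
    for the real Mnemosyne archive object.
    """
    ref = make_source_ref(path)
    if ref in archive:  # same path and mtime: skip re-ingestion
        return archive[ref]
    text = path.read_text(encoding="utf-8")
    entry = {
        "title": extract_title(text, path),
        "source_ref": ref,
        "chunks": chunk_on_h2(text),
    }
    archive[ref] = entry
    return entry
```

Because the key includes the mtime, an edited file produces a new source_ref and is ingested again. Whether the stale entry should then be replaced or kept is a design decision this sketch leaves open.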

Acceptance Criteria

  1. ingest_file() ingests a single file and returns the ArchiveEntry
  2. ingest_directory() ingests all matching files in a directory tree
  3. Re-ingesting the same unchanged file returns existing entry (no duplicate)
  4. CLI command: mnemosyne ingest-dir <path> [--ext md,txt]
  5. Tests for both functions
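
For criteria 2 and 4, the directory walk and the CLI surface might look roughly like the sketch below. It assumes argparse and pathlib from the standard library; the project's real CLI framework is not stated in this issue. The names parse_extensions, iter_matching_files, and build_parser are hypothetical. Only the command shape `mnemosyne ingest-dir <path> [--ext md,txt]` and the default extensions come from the issue text.

```python
import argparse
from pathlib import Path

DEFAULT_EXTENSIONS = (".md", ".txt", ".json")  # defaults named in the issue


def parse_extensions(raw):
    """Turn 'md,txt' into ('.md', '.txt'); None or '' keeps the defaults."""
    if not raw:
        return DEFAULT_EXTENSIONS
    return tuple(
        "." + part.strip().lstrip(".").lower()
        for part in raw.split(",")
        if part.strip()
    )


def iter_matching_files(dir_path, extensions=None):
    """Yield files under dir_path, recursively, whose suffix matches."""
    wanted = tuple(extensions or DEFAULT_EXTENSIONS)
    for path in sorted(Path(dir_path).rglob("*")):  # sorted for stable order
        if path.is_file() and path.suffix.lower() in wanted:
            yield path


def build_parser():
    """Build a parser for: mnemosyne ingest-dir <path> [--ext md,txt]."""
    parser = argparse.ArgumentParser(prog="mnemosyne")
    sub = parser.add_subparsers(dest="command", required=True)
    ingest_dir = sub.add_parser("ingest-dir", help="bulk-ingest a directory tree")
    ingest_dir.add_argument("path", type=Path)
    ingest_dir.add_argument(
        "--ext", default=None, help="comma-separated extensions, e.g. md,txt"
    )
    return parser
```

A handler for the subcommand would then pass `parse_extensions(args.ext)` as the `extensions` argument of `ingest_directory()`, which could call `ingest_file()` on each path yielded by the walk.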

Priority: Medium

Parent: Mnemosyne archive module (Phase 1 → Phase 2 evolution)

claude self-assigned this 2026-04-12 11:45:05 +00:00
bezalel was assigned by Timmy 2026-04-12 11:45:15 +00:00
Member

PR created: #1276 (https://forge.alexanderwhitestone.com/Timmy_Foundation/the-nexus/pulls/1276)

Added ingest_file() and ingest_directory() to nexus/mnemosyne/ingest.py, plus a mnemosyne ingest-dir CLI command. Dedup via source_ref (path + mtime), chunking on ## headings for large files, recursive directory walking with configurable extensions. 20 new tests, all passing.
Reference: Timmy_Foundation/the-nexus#1275