EPIC: Knowledge pipeline v2 — harvester, dedup, provenance, quality gate #194

Open
opened 2026-04-15 15:18:00 +00:00 by Rockachopa · 0 comments
Owner

SOTA Reference

Our fleet generates large amounts of knowledge through sessions, code, and research, but that knowledge is currently unstructured. The field is moving toward retrieval-augmented generation (RAG) with provenance tracking and quality-gated ingestion.

Current State

Knowledge extraction exists but is incomplete — no harvester, no deduplication, no quality filtering, no provenance chain.

What This Epic Covers

  1. Session transcript harvester
  2. Knowledge deduplication and freshness
  3. Provenance tracking (source session, model, timestamp)
  4. Quality gate (filter low-value entries)
  5. Training data pipeline integration
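
As a minimal sketch of item 3, a harvested entry could carry its provenance (source session, model, timestamp) alongside its content. The class names, field names, and example values below are hypothetical illustrations, not taken from the existing codebase.

```python
from dataclasses import dataclass, asdict
from datetime import datetime, timezone


@dataclass(frozen=True)
class Provenance:
    """Where an entry came from (hypothetical schema)."""
    source_session: str   # ID of the session transcript the entry was harvested from
    model: str            # model that produced the source text
    timestamp: datetime   # when the source text was produced (timezone-aware)


@dataclass(frozen=True)
class KnowledgeEntry:
    """One harvested piece of knowledge plus its provenance."""
    text: str
    provenance: Provenance
    quality_score: float = 0.0  # assumed to be filled in later by the quality gate


def has_provenance(entry: KnowledgeEntry) -> bool:
    """Return True when all three provenance fields are present."""
    p = entry.provenance
    return bool(p.source_session) and bool(p.model) and p.timestamp.tzinfo is not None


entry = KnowledgeEntry(
    text="Example fact harvested from a session.",
    provenance=Provenance(
        source_session="session-0001",  # made-up ID for illustration
        model="example-model",          # made-up model name for illustration
        timestamp=datetime(2026, 4, 15, 15, 18, tzinfo=timezone.utc),
    ),
)
```

Keeping provenance as a separate frozen record means it can be serialized with `asdict(entry)` and passed unchanged through deduplication and the quality gate.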

Success Criteria

  • Harvest 1000+ knowledge entries from existing sessions
  • Deduplication reduces redundancy by >50%
  • Every entry has provenance metadata
  • Quality gate filters entries scoring <0.5
  • Training data pipeline consumes harvested knowledge
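
The deduplication and quality-gate criteria could be checked roughly as follows. This is a sketch under stated assumptions: it uses exact matching on normalized text (near-duplicate detection would need something stronger), it assumes scores lie on a 0 to 1 scale, and it reads "redundancy reduction" as the fraction of entries removed by deduplication. All function names and the sample data are hypothetical.

```python
import hashlib
import re


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivially different copies match."""
    return re.sub(r"\s+", " ", text.strip().lower())


def dedup_key(text: str) -> str:
    """Stable hash of the normalized text, used as the duplicate key."""
    return hashlib.sha256(normalize(text).encode("utf-8")).hexdigest()


def deduplicate(entries):
    """Keep the first entry seen for each normalized-text hash."""
    seen = set()
    unique = []
    for e in entries:
        key = dedup_key(e["text"])
        if key not in seen:
            seen.add(key)
            unique.append(e)
    return unique


def quality_gate(entries, threshold=0.5):
    """Drop entries scoring below the threshold; entries at exactly 0.5 pass."""
    return [e for e in entries if e["score"] >= threshold]


def redundancy_reduction(before: int, after: int) -> float:
    """Fraction of entries removed by deduplication (assumed definition)."""
    return 0.0 if before == 0 else (before - after) / before


# Made-up sample entries for illustration only.
raw = [
    {"text": "Use UTC timestamps.", "score": 0.9},
    {"text": "use  utc timestamps. ", "score": 0.8},  # duplicate after normalization
    {"text": "ok", "score": 0.1},                     # below the 0.5 threshold
]
unique = deduplicate(raw)
kept = quality_gate(unique)
```

Running deduplication before the gate means each distinct entry is scored and filtered only once.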

Sub-issues: #137-142

Sub-issues

  • #195: Session transcript harvester
  • #196: Knowledge deduplication
  • #197: Provenance chain
  • #198: Quality gate
  • #199: Training data pipeline
  • #200: Knowledge freshness cron

External Source Connector Gap Closers

  • #233: Sovereign personal archive connector pack — platform archive mirrors behind one provenance-preserving connector contract
hermes was assigned by Rockachopa 2026-04-15 16:23:36 +00:00
hermes was unassigned by Rockachopa 2026-04-17 05:06:21 +00:00
google was assigned by Rockachopa 2026-04-22 02:10:56 +00:00
Reference: Timmy_Foundation/compounding-intelligence#194