feat: Knowledge deduplication — content hash + semantic similarity #196

Closed
opened 2026-04-15 15:18:03 +00:00 by Rockachopa · 0 comments

Epic: #136 (Knowledge Pipeline v2)

Task

Deduplicate harvested knowledge to avoid training on duplicates.

Dedup Strategy

  1. Exact dedup: content hash (SHA-256 of the normalized text); identical hashes are duplicates
  2. Semantic dedup: embedding similarity above 0.95 counts as a duplicate
  3. Near-dup: same topic, different phrasing → merge into a single entry
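
The first two stages above could be sketched roughly as follows. This is only an illustration, not the contents of `dedup.py`: the normalization rules (NFKC, lowercasing, whitespace collapsing), the use of cosine similarity as the "embedding similarity" metric, and every function name here are assumptions that the issue does not specify. The embeddings are assumed to come from whichever local model the pipeline ends up using.

```python
import hashlib
import math
import re
import unicodedata

# Threshold taken from the issue text; the metric itself (cosine) is an assumption.
SEMANTIC_THRESHOLD = 0.95


def normalize(text: str) -> str:
    """Hypothetical normalization: Unicode NFKC, lowercase, collapsed whitespace."""
    text = unicodedata.normalize("NFKC", text).lower()
    return re.sub(r"\s+", " ", text).strip()


def content_hash(text: str) -> str:
    """SHA-256 hex digest of the normalized text, used for exact dedup."""
    return hashlib.sha256(normalize(text).encode("utf-8")).hexdigest()


def cosine_similarity(a, b) -> float:
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def is_semantic_duplicate(a, b, threshold: float = SEMANTIC_THRESHOLD) -> bool:
    """Treat two embeddings as duplicates when similarity exceeds the threshold."""
    return cosine_similarity(a, b) > threshold
```

Running the cheap hash check before any embedding comparison would let exact copies be dropped without computing embeddings for them, which may matter if the local embedding model is slow.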

Deliverables

  • Dedup module: compounding-intelligence/dedup.py
  • Content hashing for exact duplicates
  • Semantic similarity using local embeddings (no cloud)
  • Merge strategy: keep highest quality version
  • Test: inject 20 duplicates, verify they're removed
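
As a rough sketch of how the "keep highest quality version" merge and the 20-duplicate test might fit together: the `Entry` type, its `quality` field, and `dedup_keep_best` are all hypothetical names, and the issue does not say how quality is scored. Only the exact-hash stage is covered here; a semantic stage would need real embeddings.

```python
import hashlib
from dataclasses import dataclass


@dataclass
class Entry:
    text: str
    quality: float  # hypothetical score; the issue does not define how it is computed


def _key(text: str) -> str:
    """SHA-256 of lowercased, whitespace-collapsed text (assumed normalization)."""
    normalized = " ".join(text.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def dedup_keep_best(entries):
    """Group entries by content hash and keep the highest-quality one per group."""
    best = {}
    for entry in entries:
        key = _key(entry.text)
        if key not in best or entry.quality > best[key].quality:
            best[key] = entry
    return list(best.values())


# Inject 20 duplicates of 5 unique entries, then check that only 5 remain.
unique = [Entry(f"fact number {i}", 0.5) for i in range(5)]
duplicates = [Entry(f"Fact  number {i % 5}", 0.9 if i == 0 else 0.1) for i in range(20)]
result = dedup_keep_best(unique + duplicates)
assert len(result) == 5
```

Ties in quality keep the first entry seen, which is one possible choice; the issue leaves tie-breaking open.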

Labels: dedup, quality, priority:high

hermes was assigned by Rockachopa 2026-04-15 16:23:34 +00:00
hermes was unassigned by Rockachopa 2026-04-17 05:06:19 +00:00
Reference: Timmy_Foundation/compounding-intelligence#196