Commit Graph

1 Commit

Author SHA1 Message Date
cc215e3ed7 feat: knowledge deduplication — content hash + token similarity (#196)
Some checks failed
Test / pytest (pull_request) Failing after 21s
Dedup module for knowledge entries with:
- SHA256 content hashing for exact duplicates
- Token Jaccard similarity for near-duplicates (default threshold 0.95)
- Quality-based merge: keeps the entry with higher confidence/source_count
- Metadata merging: tags, related, source_count
- Dry-run mode
- 30 tests passing
- Built-in --test mode with generated duplicates
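
The two detection steps listed above (exact match by SHA256, near match by token Jaccard) can be sketched as follows. This is a minimal illustration, not the code in scripts/dedup.py: the function names, the whitespace/case normalization before hashing, and the `\w+` tokenizer are all assumptions; only the SHA256 hash and the 0.95 default threshold come from the commit message.

```python
from __future__ import annotations

import hashlib
import re


def content_hash(text: str) -> str:
    """SHA256 hex digest of the text.

    Assumption: text is lowercased and whitespace-collapsed first, so
    trivially reformatted copies hash identically.
    """
    normalized = " ".join(text.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def tokenize(text: str) -> set[str]:
    """Hypothetical tokenizer: set of lowercase word tokens."""
    return set(re.findall(r"\w+", text.lower()))


def jaccard(a: str, b: str) -> float:
    """Token Jaccard similarity: |A & B| / |A | B|."""
    ta, tb = tokenize(a), tokenize(b)
    if not ta and not tb:
        return 1.0  # two empty texts are treated as identical
    return len(ta & tb) / len(ta | tb)


def is_duplicate(a: str, b: str, threshold: float = 0.95) -> bool:
    """Exact duplicate by hash, else near-duplicate by Jaccard >= threshold."""
    if content_hash(a) == content_hash(b):
        return True
    return jaccard(a, b) >= threshold
```

Checking the hash first keeps the common exact-duplicate case cheap; the set-based Jaccard comparison only runs when the hashes differ.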

Usage:
  python scripts/dedup.py --input knowledge/index.json
  python scripts/dedup.py --input knowledge/index.json --dry-run
  python scripts/dedup.py --test
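
The quality-based merge and metadata merging described in the commit message might look like the sketch below. The field names (confidence, source_count, tags, related) come from the message; representing an entry as a dict, ranking by confidence before source_count, and summing source_count across the pair are assumptions made for illustration.

```python
def merge_entries(a: dict, b: dict) -> dict:
    """Merge two duplicate entries, keeping the higher-quality one as the base.

    Assumptions: entries are plain dicts; quality is ordered by
    (confidence, source_count); source_count is summed on merge.
    """

    def quality(entry: dict) -> tuple:
        return (entry.get("confidence", 0.0), entry.get("source_count", 1))

    keep, drop = (a, b) if quality(a) >= quality(b) else (b, a)
    merged = dict(keep)  # copy so neither input is mutated

    # Union the list-valued metadata from both entries.
    for field in ("tags", "related"):
        merged[field] = sorted(set(keep.get(field, [])) | set(drop.get(field, [])))

    merged["source_count"] = keep.get("source_count", 1) + drop.get("source_count", 1)
    return merged


# Hypothetical entries for illustration only.
first = {"id": "a", "confidence": 0.9, "source_count": 2, "tags": ["x"], "related": []}
second = {"id": "b", "confidence": 0.7, "source_count": 3, "tags": ["y", "x"], "related": ["k1"]}
result = merge_entries(first, second)
```

In a dry run, a function like this would be called only to report what would be merged, without writing the result back to the index.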

Closes #196.
2026-04-21 07:58:09 -04:00