cc215e3ed7
feat: knowledge deduplication — content hash + token similarity (#196)
Test / pytest (pull_request) Failing after 21s
Dedup module for knowledge entries with:
- SHA256 content hashing for exact duplicates
- Token Jaccard similarity for near-duplicates (default threshold 0.95)
- Quality-based merge: keeps the entry with the higher confidence/source_count
- Metadata merging: tags, related, source_count
- Dry-run mode
- 30 tests passing
- Built-in --test mode with generated duplicates
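
The approach listed above could look roughly like the following. This is a minimal sketch, not the code in scripts/dedup.py: the function names, the "content" field, the whitespace/lowercase normalization, and the choice to sum source_count on merge are all assumptions made for illustration.

```python
import hashlib
import re


def content_hash(text: str) -> str:
    """SHA256 of the text after lowercasing and collapsing whitespace (assumed normalization)."""
    normalized = " ".join(text.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def tokens(text: str) -> set[str]:
    """Lowercased word tokens; the tokenizer is an assumption."""
    return set(re.findall(r"\w+", text.lower()))


def jaccard(a: str, b: str) -> float:
    """Token Jaccard similarity: |A & B| / |A | B|."""
    ta, tb = tokens(a), tokens(b)
    if not ta and not tb:
        return 1.0  # two empty texts are treated as identical
    return len(ta & tb) / len(ta | tb)


def merge(a: dict, b: dict) -> dict:
    """Keep the higher-quality entry, union tags/related, sum source_count (assumed policy)."""
    def quality(entry: dict) -> tuple:
        return (entry.get("confidence", 0.0), entry.get("source_count", 1))

    keep, drop = (a, b) if quality(a) >= quality(b) else (b, a)
    merged = dict(keep)
    for key in ("tags", "related"):
        merged[key] = sorted(set(keep.get(key, [])) | set(drop.get(key, [])))
    merged["source_count"] = keep.get("source_count", 1) + drop.get("source_count", 1)
    return merged


def dedup(entries: list[dict], threshold: float = 0.95) -> list[dict]:
    """Fold each entry into the first kept entry it duplicates, else keep it as new."""
    kept: list[dict] = []
    for entry in entries:
        for i, existing in enumerate(kept):
            exact = content_hash(entry["content"]) == content_hash(existing["content"])
            if exact or jaccard(entry["content"], existing["content"]) >= threshold:
                kept[i] = merge(existing, entry)
                break
        else:
            kept.append(entry)
    return kept
```

The pairwise loop is quadratic in the number of entries, which is fine for a sketch; a real index of any size would likely group by hash first and only run the Jaccard comparison on the remainder.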
Usage:
    python scripts/dedup.py --input knowledge/index.json
    python scripts/dedup.py --input knowledge/index.json --dry-run
    python scripts/dedup.py --test
Closes #196.
2026-04-21 07:58:09 -04:00