feat: knowledge deduplication — content hash + token similarity (#196) #228

Merged
Rockachopa merged 1 commits from burn/196-1776306000 into main 2026-04-21 15:28:51 +00:00
Owner

Dedup module for knowledge entries.

Features:

  • SHA256 content hashing for exact duplicates
  • Token Jaccard similarity for near-duplicates (configurable threshold, default 0.95)
  • Quality-based merge: keeps higher confidence/source_count fact
  • Metadata merging: tags, related, source_count combined
  • Dry-run mode for safe inspection
  • 30 tests passing
  • Built-in --test mode with generated duplicate set

Usage:

# Dedup index.json (in-place)
python scripts/dedup.py --input knowledge/index.json

# Dry run
python scripts/dedup.py --input knowledge/index.json --dry-run

# JSON output
python scripts/dedup.py --input knowledge/index.json --json

# Run test suite
python scripts/dedup.py --test

Closes #196.

Dedup module for knowledge entries. **Features:** - SHA256 content hashing for exact duplicates - Token Jaccard similarity for near-duplicates (configurable threshold, default 0.95) - Quality-based merge: keeps higher confidence/source_count fact - Metadata merging: tags, related, source_count combined - Dry-run mode for safe inspection - 30 tests passing - Built-in `--test` mode with generated duplicate set **Usage:** ```bash # Dedup index.json (in-place) python scripts/dedup.py --input knowledge/index.json # Dry run python scripts/dedup.py --input knowledge/index.json --dry-run # JSON output python scripts/dedup.py --input knowledge/index.json --json # Run test suite python scripts/dedup.py --test ``` Closes #196.
Rockachopa added 1 commit 2026-04-21 11:59:04 +00:00
feat: knowledge deduplication — content hash + token similarity (#196)
Some checks failed
Test / pytest (pull_request) Failing after 21s
cc215e3ed7
Dedup module for knowledge entries with:
- SHA256 content hashing for exact duplicates
- Token Jaccard similarity for near-duplicates (default 0.95)
- Quality-based merge: keeps higher confidence/source_count
- Metadata merging: tags, related, source_count
- Dry-run mode
- 30 tests passing
- Built-in --test mode with generated duplicates

Usage:
  python scripts/dedup.py --input knowledge/index.json
  python scripts/dedup.py --input knowledge/index.json --dry-run
  python scripts/dedup.py --test

Closes #196.
Rockachopa merged commit 345d2451d0 into main 2026-04-21 15:28:51 +00:00
Sign in to join this conversation.