# Deduplication Guide
Complete guide to exact, fuzzy, and semantic deduplication.
## Exact deduplication
Remove documents with identical content.
```python
from nemo_curator.modules import ExactDuplicates

# Exact deduplication: fingerprint each document and drop repeats
exact_dedup = ExactDuplicates(
    id_field="id",
    text_field="text",
    hash_method="md5",  # hash function used to fingerprint documents
)
deduped = exact_dedup(dataset)
```
Performance: ~16× faster on GPU vs CPU
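Conceptually, exact deduplication just hashes each document's text and keeps the first document per hash. A minimal pure-Python sketch of the idea (toy `docs` list; not the NeMo Curator implementation):

```python
import hashlib

# Hypothetical toy corpus; each record has an id and text
docs = [
    {"id": 1, "text": "the quick brown fox"},
    {"id": 2, "text": "the quick brown fox"},  # exact duplicate of 1
    {"id": 3, "text": "a completely different document"},
]

seen = set()
deduped = []
for doc in docs:
    digest = hashlib.md5(doc["text"].encode("utf-8")).hexdigest()
    if digest not in seen:  # keep the first document per fingerprint
        seen.add(digest)
        deduped.append(doc)

print([d["id"] for d in deduped])  # [1, 3]
```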
## Fuzzy deduplication
Remove near-duplicate documents using MinHash + LSH.
```python
from nemo_curator.modules import FuzzyDuplicates

fuzzy_dedup = FuzzyDuplicates(
    id_field="id",
    text_field="text",
    num_hashes=260,         # MinHash permutations (more = more accurate, slower)
    num_buckets=20,         # LSH buckets (more = higher recall, slower)
    hash_method="md5",
    jaccard_threshold=0.8,  # minimum Jaccard similarity to count as a duplicate
)
deduped = fuzzy_dedup(dataset)
```
Parameters:
- `num_hashes`: 128-512 (default 260)
- `num_buckets`: 10-50 (default 20)
- `jaccard_threshold`: 0.7-0.9 (default 0.8)
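These settings trade recall for speed through the standard LSH banding estimate: with `b = num_buckets` bands of `r = num_hashes / num_buckets` MinHash rows each, two documents with Jaccard similarity `s` become a candidate pair with probability `1 - (1 - s^r)^b`. A quick sketch of where the defaults place that curve (the helper function is illustrative, not part of the library):

```python
def collision_probability(s: float, num_hashes: int = 260, num_buckets: int = 20) -> float:
    """LSH banding estimate: chance two docs with Jaccard similarity s share a bucket."""
    rows = num_hashes // num_buckets  # MinHash rows per bucket (13 with the defaults)
    return 1 - (1 - s**rows) ** num_buckets

for s in (0.6, 0.7, 0.8, 0.9):
    print(f"s={s}: {collision_probability(s):.2f}")
# Roughly: s=0.6 -> 0.03, s=0.7 -> 0.18, s=0.8 -> 0.68, s=0.9 -> 1.00
```

The curve is steep around the 0.8 threshold: clear duplicates (s ≥ 0.9) are almost always caught, while dissimilar pairs (s ≤ 0.6) are almost never compared.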
Performance: 16× faster on an 8 TB dataset (120 h → 7.5 h)
## Semantic deduplication
Remove semantically similar documents using embeddings.
```python
from nemo_curator.modules import SemanticDuplicates

semantic_dedup = SemanticDuplicates(
    id_field="id",
    text_field="text",
    embedding_model="sentence-transformers/all-MiniLM-L6-v2",
    embedding_batch_size=256,
    threshold=0.85,  # cosine similarity threshold
    device="cuda",
)
deduped = semantic_dedup(dataset)
```
Models:
- `all-MiniLM-L6-v2`: fast, 384 dims
- `all-mpnet-base-v2`: better quality, 768 dims
- Custom models supported
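For intuition about the `threshold=0.85` setting: documents are embedded, and pairs whose embeddings have cosine similarity above the threshold are treated as duplicates. A minimal sketch using sentence-transformers directly, with made-up example sentences:

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

# Hypothetical paraphrase pair
a, b = model.encode([
    "The cat sat on the mat.",
    "A cat was sitting on the mat.",
])

cosine = float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
print(cosine)  # if above 0.85, the pair counts as a semantic duplicate
```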
## Comparison
| Method | Speed | Recall | Use Case |
|---|---|---|---|
| Exact | Fastest | 100% | Exact matches only |
| Fuzzy | Fast | ~95% | Near-duplicates (recommended) |
| Semantic | Slow | ~90% | Paraphrases, rewrites |
## Best practices
- Start with exact dedup to remove obvious duplicates cheaply (a staged pipeline combining all three is sketched after this list)
- Use fuzzy dedup for large datasets: the best speed/quality trade-off
- Reserve semantic dedup for high-value data: expensive but thorough
- Use GPU acceleration where available: 10-16× speedup over CPU
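Putting those recommendations together: run the stages from cheapest to most expensive, so each later, slower stage sees a smaller dataset. A sketch reusing the three modules constructed above (and inheriting their assumptions about the module API):

```python
# Cheapest filter first: each stage shrinks the dataset for the next
for stage in (exact_dedup, fuzzy_dedup, semantic_dedup):
    dataset = stage(dataset)

deduped = dataset
```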