# Deduplication Guide

Complete guide to exact, fuzzy, and semantic deduplication.

## Exact deduplication

Remove documents with identical content.

```python
from nemo_curator.modules import ExactDuplicates

# Exact deduplication: keep one copy of each byte-identical document
exact_dedup = ExactDuplicates(
    id_field="id",
    text_field="text",
    hash_method="md5"  # or "sha256"
)

deduped = exact_dedup(dataset)
```

**Performance**: ~16× faster on GPU vs. CPU.
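
Under the hood, exact deduplication boils down to hashing each document's text and keeping one representative per hash. A minimal pandas sketch of the idea (illustrative only; not NeMo Curator's actual implementation):

```python
import hashlib

import pandas as pd

# Toy corpus with one exact duplicate.
df = pd.DataFrame({
    "id": [0, 1, 2],
    "text": ["hello world", "hello world", "goodbye world"],
})

# Hash each document, then keep the first document seen for each hash.
df["_hash"] = df["text"].map(lambda t: hashlib.md5(t.encode("utf-8")).hexdigest())
deduped = df.drop_duplicates(subset="_hash").drop(columns="_hash")
print(deduped)  # rows 0 and 2 survive
```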

## Fuzzy deduplication

Remove near-duplicate documents using MinHash signatures and locality-sensitive hashing (LSH).

```python
from nemo_curator.modules import FuzzyDuplicates

fuzzy_dedup = FuzzyDuplicates(
    id_field="id",
    text_field="text",
    num_hashes=260,        # MinHash permutations (more = more accurate, slower)
    num_buckets=20,        # LSH bands (more buckets = higher recall, slower)
    hash_method="md5",
    jaccard_threshold=0.8  # Similarity threshold
)

deduped = fuzzy_dedup(dataset)
```
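
For intuition, MinHash turns each document into a fixed-length signature whose slot-wise agreement rate estimates the Jaccard similarity of the documents' shingle sets. A toy, self-contained version (illustrative only; the library's implementation is GPU-accelerated and far more efficient):

```python
import hashlib

def minhash_signature(text: str, num_hashes: int = 260, shingle_size: int = 5) -> list[int]:
    """Toy MinHash: slot i holds the minimum of hash function i over all shingles."""
    shingles = {text[i : i + shingle_size] for i in range(len(text) - shingle_size + 1)}
    return [
        min(
            int.from_bytes(hashlib.md5(f"{seed}:{s}".encode()).digest()[:8], "big")
            for s in shingles
        )
        for seed in range(num_hashes)
    ]

a = minhash_signature("the quick brown fox jumps over the lazy dog")
b = minhash_signature("the quick brown fox leaps over the lazy dog")

# The fraction of matching slots is an unbiased estimate of the Jaccard
# similarity between the two documents' shingle sets.
estimate = sum(x == y for x, y in zip(a, b)) / len(a)
print(f"estimated Jaccard similarity: {estimate:.2f}")
```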

**Parameters**:
- `num_hashes`: 128-512 (default 260); more permutations give a more accurate similarity estimate
- `num_buckets`: 10-50 (default 20); governs the recall/speed trade-off (see the sketch below)
- `jaccard_threshold`: 0.7-0.9 (default 0.8)
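
With `b = num_buckets` bands of `r = num_hashes / num_buckets` MinHash values each (260 / 20 = 13 here), a pair with Jaccard similarity `s` becomes a duplicate candidate with probability `1 - (1 - s^r)^b`. A quick sketch to explore the trade-off (this assumes `num_buckets` maps to LSH bands, which the 260 = 20 × 13 split suggests):

```python
def candidate_probability(s: float, num_hashes: int = 260, num_buckets: int = 20) -> float:
    """Probability that a pair with Jaccard similarity s shares at least one LSH bucket."""
    rows_per_band = num_hashes // num_buckets  # 13 MinHash values per band here
    return 1.0 - (1.0 - s**rows_per_band) ** num_buckets

for s in (0.6, 0.7, 0.8, 0.9):
    print(f"s = {s:.1f} -> P(candidate) = {candidate_probability(s):.3f}")
```

Raising `rows_per_band` sharpens the cutoff around the threshold; adding bands shifts it toward higher recall at higher cost.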

**Performance**: 16× faster on an 8 TB dataset (120 h on CPU vs. 7.5 h on GPU).

## Semantic deduplication

Remove semantically similar documents, such as paraphrases and rewrites, by comparing text embeddings.

```python
from nemo_curator.modules import SemanticDuplicates

semantic_dedup = SemanticDuplicates(
    id_field="id",
    text_field="text",
    embedding_model="sentence-transformers/all-MiniLM-L6-v2",
    embedding_batch_size=256,
    threshold=0.85,  # Cosine similarity threshold
    device="cuda"
)

deduped = semantic_dedup(dataset)
```

**Models**:
- `all-MiniLM-L6-v2`: fast, 384-dimensional embeddings
- `all-mpnet-base-v2`: better quality, 768-dimensional embeddings
- Custom models supported
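
Conceptually, semantic dedup embeds every document and drops one of any pair whose cosine similarity exceeds the threshold. A small illustrative sketch with `sentence-transformers` (a naive O(n²) pairwise version; real pipelines cluster the embeddings instead of comparing every pair):

```python
import numpy as np
from sentence_transformers import SentenceTransformer

docs = [
    "The cat sat on the mat.",
    "A cat was sitting on the mat.",  # paraphrase of the first doc
    "Stock prices fell sharply today.",
]

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
emb = model.encode(docs, normalize_embeddings=True)  # unit vectors, so dot = cosine

keep: list[int] = []
for i in range(len(docs)):
    # Keep doc i only if it is not too similar to any already-kept doc.
    if all(np.dot(emb[i], emb[j]) < 0.85 for j in keep):
        keep.append(i)

deduped = [docs[i] for i in keep]  # the paraphrase is dropped
```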

## Comparison

| Method   | Speed   | Recall | Use Case                      |
|----------|---------|--------|-------------------------------|
| Exact    | Fastest | 100%   | Exact matches only            |
| Fuzzy    | Fast    | ~95%   | Near-duplicates (recommended) |
| Semantic | Slow    | ~90%   | Paraphrases, rewrites         |

## Best practices

1. **Start with exact dedup** - Remove obvious duplicates cheaply before the costlier stages (full pipeline sketched below)
2. **Use fuzzy for large datasets** - Best speed/quality trade-off
3. **Semantic for high-value data** - Expensive but thorough
4. **Use GPU acceleration** - 10-16× speedup over CPU
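
Putting these together, a typical pipeline applies the stages in order of cost, reusing the modules configured above (this assumes each module returns a dataset the next one accepts, as the per-section examples imply):

```python
# Run cheap stages first so each costlier stage sees a smaller dataset.
deduped = exact_dedup(dataset)     # 1. drop byte-identical documents
deduped = fuzzy_dedup(deduped)     # 2. drop near-duplicates (MinHash + LSH)
deduped = semantic_dedup(deduped)  # 3. drop paraphrases (optional, most expensive)
```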