augment_pairs.py: generates paraphrases and translations for any
JSONL training file.
Features:
- Auto-detects text field (rich, terse, text, content, lyric_line, etc.)
- N paraphrases per entry (template-based, or LLM with --llm-endpoint)
- Translations to ES, FR, DE (template dictionary, or LLM)
- Outputs augmented JSONL alongside originals
- Marks each augmented entry with _augmentation, _original, _language
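
As a rough illustration of the field detection and marker fields listed above, here is a minimal sketch. It is not the script's actual code: the helper names `detect_text_field` and `make_augmented`, the candidate order, and the exact values stored in `_augmentation`, `_original`, and `_language` are all assumptions made for this example.

```python
import json

# Assumed candidate order, built from the field names listed above.
CANDIDATE_FIELDS = ("rich", "terse", "text", "content", "lyric_line")


def detect_text_field(entry):
    """Return the first candidate key holding a non-empty string, else None."""
    for key in CANDIDATE_FIELDS:
        value = entry.get(key)
        if isinstance(value, str) and value.strip():
            return key
    return None


def make_augmented(entry, field, new_text, kind, language):
    """Copy the entry, swap in the new text, and attach the marker fields."""
    augmented = dict(entry)                # shallow copy; the original is untouched
    augmented[field] = new_text
    augmented["_augmentation"] = kind      # assumed values: "paraphrase" or "translation"
    augmented["_original"] = entry[field]  # assumed: the source text before augmentation
    augmented["_language"] = language      # assumed: lowercase language code such as "es"
    return augmented


entry = json.loads('{"id": 1, "lyric_line": "The night is long"}')
field = detect_text_field(entry)
augmented = make_augmented(entry, field, "La noche es larga", "translation", "es")
print(json.dumps(augmented, ensure_ascii=False))
```

Copying the entry keeps every other key (ids, labels, metadata) intact, so an augmented line can be written next to its original without losing context.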
Usage:
python3 augment_pairs.py --input data.jsonl
python3 augment_pairs.py --input data.jsonl --paraphrases 5 --langs es,fr
python3 augment_pairs.py --input data.jsonl --llm-endpoint http://localhost:11434/v1

Closes #695