Files
timmy-config/training/scripts
Alexander Whitestone 79d148ddd8
Some checks failed
Architecture Lint / Linter Tests (pull_request) Successful in 31s
Smoke Test / smoke (pull_request) Failing after 21s
Validate Config / YAML Lint (pull_request) Failing after 14s
Validate Config / JSON Validate (pull_request) Successful in 15s
Validate Config / Python Syntax & Import Check (pull_request) Failing after 1m12s
PR Checklist / pr-checklist (pull_request) Failing after 5m45s
Validate Config / Shell Script Lint (pull_request) Failing after 46s
Validate Config / Cron Syntax Check (pull_request) Successful in 10s
Validate Config / Deploy Script Dry Run (pull_request) Successful in 9s
Validate Training Data / validate (pull_request) Successful in 15s
Validate Config / Playbook Schema Validation (pull_request) Successful in 19s
Architecture Lint / Lint Repository (pull_request) Has been cancelled
Validate Config / Python Test Suite (pull_request) Has been cancelled
feat: training data quality filter (#687)
Scores training pairs and removes low-quality entries.

Scoring criteria:
- Response length (too short = low quality)
- Prompt/response ratio (response should be substantive)
- Filler detection (sure, okay, i dont know)
- Placeholder detection (TODO, FIXME, PLACEHOLDER)
- Prompt=response detection (duplicates)
- Repetition detection (repeated bigrams)
- Prompt minimum length

Usage:
  python3 training/scripts/quality_filter.py --input data.jsonl --dry-run
  python3 training/scripts/quality_filter.py --input data.jsonl --threshold 0.5

Closes #687
2026-04-16 00:45:50 -04:00
..