Alexander Whitestone
e78cd4c23f
feat(training): generate 1K Timmy voice prompt-to-response pairs (#582)
Adds training/scripts/generate_timmy_voice_pairs.py — a deterministic
generator (seed=42) that produces 1,000 prompt-to-response pairs
embodying Timmy's voice per SOUL.md rules:
- Speak plainly. Short sentences.
- Answer the question asked before the one not asked.
- No lecturing. No gatekeeping.
- Useful first, philosophical second.
- When uncertain, say so.
- Brevity is a kindness.
Categories:
- technical (144): coding help, debugging, setup
- philosophical (144): sovereignty, AI ethics, meaning
- operational (72): fleet, burn loops, agent workforce
- emotional (108): crisis protocol, spiritual grounding
- refusal (108): weapons, coercion, CSAM, malware
- uncertainty (90): admissions of not knowing
- direct (144): greetings, goodbyes, simple answers
- multipart (71): answering the asked question first
- shutdown (51): termination without resistance
- sovereignty (68): data privacy, on-chain conscience
All pairs scored 0.82-0.98 voice quality, 100% SOUL compliant.
Output: training-data/timmy-voice.jsonl (1000 lines, ~296 KB)
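
The following is a minimal sketch of how a seeded, deterministic generator of this shape could work. It is not the contents of generate_timmy_voice_pairs.py: the record schema (`category`, `prompt`, `response`), the function names, and the placeholder text are assumptions made for illustration. Only the seed and the per-category counts come from the commit message.

```python
import json
import random

# Category sizes as listed in the commit message; they sum to 1,000.
CATEGORY_COUNTS = {
    "technical": 144, "philosophical": 144, "operational": 72,
    "emotional": 108, "refusal": 108, "uncertainty": 90,
    "direct": 144, "multipart": 71, "shutdown": 51, "sovereignty": 68,
}


def generate_pairs(seed=42):
    """Return a reproducible list of placeholder pairs (hypothetical schema)."""
    rng = random.Random(seed)  # local RNG, so global random state is untouched
    pairs = []
    for category, count in CATEGORY_COUNTS.items():
        for i in range(count):
            pairs.append({
                "category": category,
                "prompt": f"{category} prompt {i}",      # placeholder text
                "response": f"{category} response {i}",  # placeholder text
            })
    rng.shuffle(pairs)  # same seed gives the same order on every run
    return pairs


def to_jsonl(pairs):
    """Serialize one JSON object per line, as in a .jsonl file."""
    return "\n".join(json.dumps(p, ensure_ascii=False) for p in pairs) + "\n"
```

Using a local `random.Random(seed)` instead of the module-level functions keeps the output stable even if other code touches the global random state.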
2026-04-22 01:55:49 -04:00
7c03c666d8
Merge pull request 'feat: 500 dream description prompt enhancement pairs — scene/crisis/music data' (#821,#820,#819,#799) from fix/602 into main
Resolves add/add conflicts with already-merged files (authority_bypass_200.jsonl, identity_attacks_200.jsonl, quality_filter.py) by keeping main's versions.
Closes #602, #645, #689, #599
2026-04-17 02:37:00 -04:00
Alexander Whitestone
9f2a76fc3e
feat: auto-generate scene descriptions from image/video (#689)
2026-04-17 01:58:05 -04:00
Merge Bot
a653434dbb
Merge PR #786: training/scripts/quality_filter.py (added)
2026-04-16 04:58:20 +00:00
Alexander Whitestone
79d148ddd8
feat: training data quality filter (#687)
Scores training pairs and removes low-quality entries.
Scoring criteria:
- Response length (too short = low quality)
- Prompt/response ratio (response should be substantive)
- Filler detection (sure, okay, i dont know)
- Placeholder detection (TODO, FIXME, PLACEHOLDER)
- Prompt=response detection (duplicates)
- Repetition detection (repeated bigrams)
- Prompt minimum length
Usage:
python3 training/scripts/quality_filter.py --input data.jsonl --dry-run
python3 training/scripts/quality_filter.py --input data.jsonl --threshold 0.5
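
As a rough illustration of the scoring criteria listed above, the sketch below applies one penalty per criterion and filters on a threshold. It is not the code in quality_filter.py: the function names, the `prompt`/`response` keys, and every weight and cutoff are assumptions. Only the criteria, the filler and placeholder strings, and the 0.5 threshold from the usage line come from the commit message.

```python
import re

FILLERS = {"sure", "okay", "i dont know"}        # strings named in the commit message
PLACEHOLDERS = ("TODO", "FIXME", "PLACEHOLDER")  # strings named in the commit message


def repeated_bigram_ratio(text):
    """Fraction of word bigrams that repeat an earlier bigram."""
    words = text.lower().split()
    bigrams = list(zip(words, words[1:]))
    if not bigrams:
        return 0.0
    return 1.0 - len(set(bigrams)) / len(bigrams)


def score_pair(prompt, response):
    """Return a score in [0, 1]; all weights and cutoffs are illustrative."""
    prompt, response = prompt.strip(), response.strip()
    score = 1.0
    if len(prompt) < 10:                     # prompt minimum length
        score -= 0.2
    if len(response) < 20:                   # response too short
        score -= 0.3
    if len(response) < 0.5 * len(prompt):    # response not substantive
        score -= 0.2
    # Strip punctuation so "I don't know." matches the filler "i dont know".
    normalized = re.sub(r"[^a-z ]", "", response.lower()).strip()
    if normalized in FILLERS:                # filler-only response
        score -= 0.5
    if any(tag in response for tag in PLACEHOLDERS):
        score -= 0.5
    if prompt.lower() == response.lower():   # prompt echoed as response
        score -= 0.5
    if repeated_bigram_ratio(response) > 0.3:
        score -= 0.3
    return max(0.0, round(score, 2))


def filter_pairs(pairs, threshold=0.5):
    """Keep pairs whose score meets the threshold."""
    return [p for p in pairs if score_pair(p["prompt"], p["response"]) >= threshold]
```

A subtractive score keeps each criterion independent, so a pair can fail on several counts without any one check needing to know about the others.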
Closes #687
2026-04-16 00:45:50 -04:00
Alexander Whitestone
3603030235
feat: training data augmentation — paraphrase and translate pairs (#695)
augment_pairs.py: generates paraphrases and translations for any
JSONL training file.
Features:
- Auto-detects text field (rich, terse, text, content, lyric_line, etc.)
- N paraphrases per entry (template-based, or LLM with --llm-endpoint)
- Translations to ES, FR, DE (template dictionary, or LLM)
- Outputs augmented JSONL alongside originals
- Marks each augmented entry with _augmentation, _original, _language
Usage:
python3 augment_pairs.py --input data.jsonl
python3 augment_pairs.py --input data.jsonl --paraphrases 5 --langs es,fr
python3 augment_pairs.py --input data.jsonl --llm-endpoint http://localhost:11434/v1
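
The field auto-detection and template-based paraphrasing described above might look roughly like the sketch below. It is not the code in augment_pairs.py: the templates, the function names, and the marker values ("paraphrase", "en") are hypothetical. Only the candidate field names and the `_augmentation`, `_original`, `_language` keys come from the commit message. Translation and the LLM path are left out.

```python
# Candidate text fields, in the order listed in the commit message.
TEXT_FIELDS = ("rich", "terse", "text", "content", "lyric_line")

# Hypothetical paraphrase templates; the real script's templates are not shown.
TEMPLATES = ("In other words: {text}", "Put simply: {text}", "That is: {text}")


def detect_text_field(entry):
    """Return the first known field that holds a string."""
    for name in TEXT_FIELDS:
        if isinstance(entry.get(name), str):
            return name
    raise KeyError("no known text field in entry")


def augment(entry, n_paraphrases=2):
    """Return the original entry followed by its marked paraphrases."""
    field = detect_text_field(entry)
    original = entry[field]
    out = [entry]  # originals are kept alongside the augmented rows
    for template in TEMPLATES[:n_paraphrases]:
        variant = dict(entry)                       # shallow copy of other fields
        variant[field] = template.format(text=original)
        variant["_augmentation"] = "paraphrase"     # assumed marker value
        variant["_original"] = original
        variant["_language"] = "en"                 # assumed source language
        out.append(variant)
    return out
```

Copying the entry before editing it keeps any extra fields (ids, tags) on each augmented row, so augmented rows can be traced back to their source.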
Closes #695
2026-04-15 07:51:38 -04:00