feat(training): add Timmy voice batch 09 dataset (#589)

Generate a deterministic batch09 Timmy voice corpus with 1,000 ShareGPT prompt-response pairs, 50 approved source sessions, a source manifest, and focused validation tests.
2026-04-22 11:05:38 -04:00
parent ae8c1d46ae
commit 30d7a084e1
5 changed files with 2666 additions and 0 deletions
--- a/training-data/README-batch09.md
+++ b/training-data/README-batch09.md
@@ -0,0 +1,45 @@
+# Timmy Voice: Batch 09 — 1K Prompt→Response Pairs
+
+Training Factory — Timmy Voice Worker 9/10 (#589)
+
+## Files
+
+| File | Description |
+|------|-------------|
+| `timmy-voice-batch09.jsonl` | 1,000 ShareGPT-format prompt→response pairs |
+| `timmy-voice-batch09.sources.json` | 50 source sessions with approved-model provenance |
+| `generate_timmy_voice_batch09.py` | Deterministic generator for the batch |
+
+## Generation Contract
+
+- 50 source sessions
+- 20 prompt variations per session
+- approved-model provenance filter
+- Knowledge Mine-style ranking using local session metadata + pair quality
+- ShareGPT format (`system` / `human` / `gpt`)
+
+## Stats
+
+- Total pairs: 1000
+- Source sessions: 50
+- Average quality score: 0.90
+- Minimum quality score: 0.84
+- Maximum quality score: 0.92
+
+## Category Breakdown
+- technical: 1 source session
+- operations: 36 source sessions
+- sovereignty: 10 source sessions
+- pastoral: 0 source sessions
+- crisis: 3 source sessions
+- general: 0 source sessions
+
+## Source Models
+- xiaomi/mimo-v2-pro: 47 sessions
+- qwen/qwen3.6-plus:free: 2 sessions
+- qwen3:30b: 1 session
+
+
+## Notes
+
+This batch uses approved local session sources only. Sessions from banned providers or model families (Claude/GPT/Gemini/OpenAI/Anthropic) are excluded at selection time. The generator writes the source manifest to disk so the batch can be inspected and regenerated without guessing where the voice came from.
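
Review note: the Generation Contract and Notes in this README can be spot-checked with a small script. The following is only a sketch, not the generator or the tests shipped in this commit. It assumes each JSONL line holds a `conversations` list of `{"from": ..., "value": ...}` turns (the common ShareGPT layout; the batch's actual schema is not shown in this diff) and that banned sources can be detected by case-insensitive substring match on the model name. The names `is_approved_model`, `is_valid_sharegpt`, and `check_batch` are hypothetical.

```python
import json

# Roles named in the README's Generation Contract.
EXPECTED_ROLES = ["system", "human", "gpt"]

# Names listed as banned in the README's Notes.
# ASSUMPTION: substring matching on the lowercased model name.
BANNED_MARKERS = ("claude", "gpt", "gemini", "openai", "anthropic")


def is_approved_model(model_name: str) -> bool:
    """Return True if no banned marker appears in the model name."""
    lowered = model_name.lower()
    return not any(marker in lowered for marker in BANNED_MARKERS)


def is_valid_sharegpt(record: dict) -> bool:
    """Check for one system/human/gpt exchange with non-empty string values.

    ASSUMPTION: turns live under a "conversations" key with "from"/"value"
    fields; adjust if the batch uses different field names.
    """
    turns = record.get("conversations")
    if not isinstance(turns, list):
        return False
    if not all(isinstance(turn, dict) for turn in turns):
        return False
    if [turn.get("from") for turn in turns] != EXPECTED_ROLES:
        return False
    return all(
        isinstance(turn.get("value"), str) and turn["value"].strip()
        for turn in turns
    )


def check_batch(lines, sessions: int = 50, variations: int = 20) -> bool:
    """Verify the pair count (sessions x variations) and every record's shape."""
    records = [json.loads(line) for line in lines if line.strip()]
    if len(records) != sessions * variations:
        return False
    return all(is_valid_sharegpt(record) for record in records)
```

With the defaults, `check_batch` expects 50 × 20 = 1,000 records, matching the Stats section. Note that the `gpt` role label in ShareGPT records is unrelated to the banned-model filter, which should be applied to model names from the source manifest only.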