training-data/README-batch09.md

# Timmy Voice: Batch 09 — 1K Prompt→Response Pairs

Training Factory — Timmy Voice Worker 9/10 (#589)

## Files

| File | Description |
|------|-------------|
| `timmy-voice-batch09.jsonl` | 1,000 ShareGPT-format prompt→response pairs |
| `timmy-voice-batch09.sources.json` | 50 source sessions with approved-model provenance |
| `generate_timmy_voice_batch09.py` | Deterministic generator for the batch |

## Generation Contract

- 50 source sessions
- 20 prompt variations per session
- approved-model provenance filter
- Knowledge Mine-style ranking using local session metadata + pair quality
- ShareGPT format (`system` / `human` / `gpt`)

## Stats

- Total pairs: 1000
- Source sessions: 50
- Average quality score: 0.90
- Minimum quality score: 0.84
- Maximum quality score: 0.92

## Category Breakdown
- technical: 1 source sessions
- operations: 36 source sessions
- sovereignty: 10 source sessions
- pastoral: 0 source sessions
- crisis: 3 source sessions
- general: 0 source sessions

## Source Models
- xiaomi/mimo-v2-pro: 47 sessions
- qwen/qwen3.6-plus:free: 2 sessions
- qwen3:30b: 1 sessions


## Notes

This batch uses approved local session sources only. Banned providers (Claude/GPT/Gemini/OpenAI/Anthropic) are excluded at selection time. The generator keeps the source manifest on disk so the batch can be inspected and regenerated without guessing where the voice came from.
feat(training): add Timmy voice batch 09 dataset (#589) Generate a deterministic batch09 Timmy voice corpus with 1,000 ShareGPT prompt-response pairs, 50 approved source sessions, a source manifest, and focused validation tests. 2026-04-22 11:05:38 -04:00			`# Timmy Voice: Batch 09 — 1K Prompt→Response Pairs`

			`Training Factory — Timmy Voice Worker 9/10 (#589)`

			`## Files`

			`\| File \| Description \|`
			`\|------\|-------------\|`
			\| `timmy-voice-batch09.jsonl` \| 1,000 ShareGPT-format prompt→response pairs \|
			\| `timmy-voice-batch09.sources.json` \| 50 source sessions with approved-model provenance \|
			\| `generate_timmy_voice_batch09.py` \| Deterministic generator for the batch \|

			`## Generation Contract`

			`- 50 source sessions`
			`- 20 prompt variations per session`
			`- approved-model provenance filter`
			`- Knowledge Mine-style ranking using local session metadata + pair quality`
			- ShareGPT format (`system` / `human` / `gpt`)

			`## Stats`

			`- Total pairs: 1000`
			`- Source sessions: 50`
			`- Average quality score: 0.90`
			`- Minimum quality score: 0.84`
			`- Maximum quality score: 0.92`

			`## Category Breakdown`
			`- technical: 1 source sessions`
			`- operations: 36 source sessions`
			`- sovereignty: 10 source sessions`
			`- pastoral: 0 source sessions`
			`- crisis: 3 source sessions`
			`- general: 0 source sessions`

			`## Source Models`
			`- xiaomi/mimo-v2-pro: 47 sessions`
			`- qwen/qwen3.6-plus:free: 2 sessions`
			`- qwen3:30b: 1 sessions`


			`## Notes`

			`This batch uses approved local session sources only. Banned providers (Claude/GPT/Gemini/OpenAI/Anthropic) are excluded at selection time. The generator keeps the source manifest on disk so the batch can be inspected and regenerated without guessing where the voice came from.`