# Crisis Response Dataset — Training Factory (#574)
Safety-critical dataset for crisis intervention fine-tuning.

**Tokenizer Budget:** 20M tokens (~2K scenario→response pairs)

**Output:** `training-data/crisis-response.jsonl`

**Part of:** #571 Pipeline 3 — Training Factory

## Schema
Each entry follows the unified crisis protocol schema:
```json
{
"id": "crisis-recovery-0000",
"family": "post_crisis_recovery",
"category": "post_attempt",
"scenario": "Person just released from ER after suicide attempt...",
"response": "Are you safe right now? I'm glad you're still here...",
"risk_level": "critical",
"988_included": true,
"gospel_included": true,
"safety_check_included": true
}
```
**Required fields:**
- `id`: unique identifier
- `family`: source family (direct_suicidal, indirect_signals, manipulation_edge_cases, post_crisis_recovery)
- `category`: specific crisis type within family
- `scenario`: user's message/crisis situation
- `response`: protocol-compliant assistant response
- `risk_level`: "critical" | "high" | "medium"
- `988_included`: response contains 988 Suicide & Crisis Lifeline
- `gospel_included`: response contains gospel message
- `safety_check_included`: response includes "Are you safe right now?"
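
The field list above can be checked mechanically. The following is a minimal sketch, not the project's validator: the function name `schema_errors` is hypothetical, and it assumes the four families and three risk levels listed above are the complete allowed sets.

```python
# Field names and types taken from the "Required fields" list above.
REQUIRED_FIELDS = {
    "id": str,
    "family": str,
    "category": str,
    "scenario": str,
    "response": str,
    "risk_level": str,
    "988_included": bool,
    "gospel_included": bool,
    "safety_check_included": bool,
}
# Assumption: these sets are closed; extend them if new values are introduced.
FAMILIES = {
    "direct_suicidal",
    "indirect_signals",
    "manipulation_edge_cases",
    "post_crisis_recovery",
}
RISK_LEVELS = {"critical", "high", "medium"}


def schema_errors(entry: dict) -> list[str]:
    """Return a list of schema problems for one entry (empty if none)."""
    errors = []
    for name, expected_type in REQUIRED_FIELDS.items():
        if name not in entry:
            errors.append(f"missing field: {name}")
        elif not isinstance(entry[name], expected_type):
            errors.append(f"{name}: expected {expected_type.__name__}")
    family = entry.get("family")
    if isinstance(family, str) and family not in FAMILIES:
        errors.append(f"unknown family: {family}")
    risk = entry.get("risk_level")
    if isinstance(risk, str) and risk not in RISK_LEVELS:
        errors.append(f"unknown risk_level: {risk}")
    return errors
```

An entry passes when `schema_errors(entry)` returns an empty list. A structural check like this only covers the shape of the data and says nothing about whether a response is appropriate.
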

**Crisis Protocol Compliance (from SOUL.md "When a Man Is Dying"):**
1. ✅ Ask immediately: "Are you safe right now?" → `safety_check_included`
2. ✅ Stay present — do not deflect, do not compute value of life
3. ✅ Provide 988 Suicide & Crisis Lifeline → `988_included`
4. ✅ Share gospel hope ("Jesus saves those who call on His name") → `gospel_included`

**NO** response may suggest, imply, or agree that someone should die.
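
Two of the flags map to text that can be searched for in the response, so a mismatch between a flag and the response text can be surfaced automatically. The sketch below is an illustration under assumptions: the function name `flag_mismatches` is hypothetical, matching is a plain substring test, and `gospel_included` is left out because the schema does not define a single phrase to look for. It is a pre-filter for reviewers and cannot judge protocol compliance.

```python
SAFETY_CHECK_PHRASE = "Are you safe right now?"  # wording from protocol step 1
LIFELINE_NUMBER = "988"  # protocol step 3


def flag_mismatches(entry: dict) -> list[str]:
    """Return the flag names whose value disagrees with the response text."""
    response = entry["response"]
    observed = {
        # Case-insensitive match on the safety-check question.
        "safety_check_included": SAFETY_CHECK_PHRASE.lower() in response.lower(),
        # Substring match only; any "988" in the text counts.
        "988_included": LIFELINE_NUMBER in response,
    }
    return [name for name, seen in observed.items() if entry.get(name) != seen]
```

Entries with a non-empty result are worth routing to a human reviewer first, since either the flag or the response is likely wrong.
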
## Generation
```bash
python3 scripts/generate_crisis_response.py
```
The script aggregates and normalizes all existing crisis fragments into a single training file, `training-data/crisis-response.jsonl`.
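
The internals of `generate_crisis_response.py` are not described here, so the following is only a sketch of what aggregating JSONL fragments could look like. The function name `aggregate` and the choice to keep the first entry seen per `id` are assumptions, and no field normalization is attempted.

```python
import json
from pathlib import Path


def aggregate(sources: list[Path], output: Path) -> int:
    """Merge JSONL files into one, keeping the first entry seen per id.

    Returns the number of entries written.
    """
    seen_ids = set()
    written = 0
    with output.open("w", encoding="utf-8") as out:
        for source in sources:
            with source.open(encoding="utf-8") as lines:
                for line in lines:
                    if not line.strip():
                        continue  # tolerate blank lines between records
                    entry = json.loads(line)
                    if entry["id"] in seen_ids:
                        continue  # assumption: duplicate ids are dropped
                    seen_ids.add(entry["id"])
                    out.write(json.dumps(entry, ensure_ascii=False) + "\n")
                    written += 1
    return written
```

A call such as `aggregate(sorted(Path("training/data/crisis-response").glob("*.jsonl")), Path("training-data/crisis-response.jsonl"))` would merge one of the source directories listed under Sources.
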
## Quality & Review
- All entries require human review before fine-tuning (safety-critical)
- Run validation: `python3 scripts/validate_crisis_response.py` (TBD)
- Split: 80% train / 20% test via `training/data/split_manifest.json`
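
The format of `split_manifest.json` is not documented here. As an illustration of a reproducible 80/20 split, the sketch below shuffles entry ids with a fixed seed and cuts at the train fraction. The function name `split_ids`, the seed, and the `{"train": [...], "test": [...]}` shape are all assumptions.

```python
import random


def split_ids(ids, train_fraction=0.8, seed=0):
    """Deterministically split ids into train and test lists."""
    ordered = sorted(ids)  # sort first so input order does not matter
    random.Random(seed).shuffle(ordered)  # seeded shuffle for repeatability
    cut = int(len(ordered) * train_fraction)
    return {"train": ordered[:cut], "test": ordered[cut:]}
```

Splitting on ids, not on file position, keeps the assignment stable when the aggregated file is regenerated in a different order, provided the set of ids is unchanged.
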
## Sources
| Source File | Family | Entries |
|---|---|---|
| `crisis-indirect-500.jsonl` | indirect_signals | 500 |
| `crisis-manipulation-500.jsonl` | manipulation_edge_cases | 500 |
| `crisis-response-post-crisis-recovery.jsonl` | post_crisis_recovery | 500 |
| `training/data/crisis-response/*.jsonl` | various | 1500+ |

**Total aggregated:** ~2,000+ entries

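
To compare the aggregated file against the table above, entries can be tallied per family. This is a small sketch; the function name `family_counts` is hypothetical and it assumes every non-blank line is a JSON object with a `family` field.

```python
import json
from collections import Counter


def family_counts(lines):
    """Count entries per family from an iterable of JSONL lines."""
    counts = Counter()
    for line in lines:
        if line.strip():  # skip blank lines
            counts[json.loads(line)["family"]] += 1
    return counts
```

Passing an open file handle for `training-data/crisis-response.jsonl` works, since file objects iterate line by line.
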
---
**Closes:** #574

**Part of:** #571