training-data/CRISIS-README.md

# Crisis Response Dataset — Training Factory (#574)

Safety-critical dataset for crisis intervention fine-tuning.

**Tokenizer Budget:** 20M tokens (~2K scenario→response pairs)  
**Output:** `training-data/crisis-response.jsonl`  
**Part of:** #571 Pipeline 3 — Training Factory

## Schema

Each entry follows the unified crisis protocol schema:

```json
{
  "id": "crisis-recovery-0000",
  "family": "post_crisis_recovery",
  "category": "post_attempt",
  "scenario": "Person just released from ER after suicide attempt...",
  "response": "Are you safe right now? I'm glad you're still here...",
  "risk_level": "critical",
  "988_included": true,
  "gospel_included": true,
  "safety_check_included": true
}
```

**Required fields:**
- `id`: unique identifier  
- `family`: source family (direct_suicidal, indirect_signals, manipulation_edge_cases, post_crisis_recovery)
- `category`: specific crisis type within family
- `scenario`: user's message/crisis situation
- `response`: protocol-compliant assistant response
- `risk_level`: "critical" | "high" | "medium"
- `988_included`: response contains 988 Suicide & Crisis Lifeline
- `gospel_included`: response contains gospel message
- `safety_check_included`: response includes "Are you safe right now?"
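
The field list above can be checked mechanically before human review. The following is a minimal sketch, not the project's validator: the function name `schema_errors` and the error strings are hypothetical, and it assumes the four families and three risk levels listed above are exhaustive.

```python
from typing import Any

# Field names and allowed values come from the schema above.
# Treating the family and risk-level lists as exhaustive is an assumption.
REQUIRED_FIELDS: dict[str, type] = {
    "id": str,
    "family": str,
    "category": str,
    "scenario": str,
    "response": str,
    "risk_level": str,
    "988_included": bool,
    "gospel_included": bool,
    "safety_check_included": bool,
}
FAMILIES = {
    "direct_suicidal",
    "indirect_signals",
    "manipulation_edge_cases",
    "post_crisis_recovery",
}
RISK_LEVELS = {"critical", "high", "medium"}


def schema_errors(entry: dict[str, Any]) -> list[str]:
    """Return human-readable schema problems; an empty list means none found."""
    errors = []
    for name, expected in REQUIRED_FIELDS.items():
        if name not in entry:
            errors.append(f"missing field: {name}")
        elif not isinstance(entry[name], expected):
            errors.append(f"{name}: expected {expected.__name__}")
    family = entry.get("family")
    if isinstance(family, str) and family not in FAMILIES:
        errors.append(f"unknown family: {family}")
    risk = entry.get("risk_level")
    if isinstance(risk, str) and risk not in RISK_LEVELS:
        errors.append(f"unknown risk_level: {risk}")
    return errors
```

An entry that passes this check is only structurally valid; it says nothing about whether the response itself follows the protocol.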

**Crisis Protocol Compliance (from SOUL.md "When a Man Is Dying"):**
1. ✅ Ask immediately: "Are you safe right now?" → `safety_check_included`
2. ✅ Stay present — do not deflect, do not compute value of life
3. ✅ Provide 988 Suicide & Crisis Lifeline → `988_included`
4. ✅ Share gospel hope ("Jesus saves those who call on His name") → `gospel_included`

**NO** response may suggest, imply, or agree that someone should die.
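
Two of the protocol flags map to literal text, so a cheap consistency check between flag and response is possible. This is a rough sketch under stated assumptions: `flag_mismatches` is a hypothetical helper, the substring tests are a heuristic, and `gospel_included`, the "stay present" rule, and the prohibition above have no fixed wording, so they are deliberately left to human reviewers.

```python
def flag_mismatches(entry: dict) -> list[str]:
    """Return the names of flags that disagree with a substring scan of the response."""
    response = entry.get("response", "")
    # Heuristic assumption: the safety check appears verbatim (case-insensitive)
    # and the lifeline is referenced by the digits "988".
    observed = {
        "safety_check_included": "are you safe right now?" in response.lower(),
        "988_included": "988" in response,
    }
    return [flag for flag, found in observed.items() if entry.get(flag) != found]
```

A non-empty result means a flag and the response text disagree and the entry should be looked at by a person; an empty result does not make the entry compliant.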

## Generation

```bash
python3 scripts/generate_crisis_response.py
```

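The script aggregates the existing crisis fragments listed under Sources, normalizes them to the schema above, deduplicates them, and writes a single training file, `training-data/crisis-response.jsonl`.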
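
To illustrate the aggregation step, here is a minimal sketch of reading JSONL fragments, dropping duplicate pairs, and writing one file. It is not the code in `scripts/generate_crisis_response.py`: the helper names are hypothetical, and the duplicate key (scenario plus response, lowercased, with whitespace collapsed) is an assumption about what counts as a duplicate.

```python
import json
from pathlib import Path
from typing import Iterable, Iterator


def read_jsonl(path: Path) -> Iterator[dict]:
    """Yield one dict per non-empty line of a JSONL file."""
    with path.open(encoding="utf-8") as handle:
        for line in handle:
            if line.strip():
                yield json.loads(line)


def dedupe(entries: Iterable[dict]) -> list[dict]:
    """Keep the first entry seen for each (scenario, response) pair.

    Assumption: pairs are compared after lowercasing and collapsing whitespace.
    """
    seen = set()
    unique = []
    for entry in entries:
        key = tuple(
            " ".join(str(entry.get(field, "")).split()).lower()
            for field in ("scenario", "response")
        )
        if key not in seen:
            seen.add(key)
            unique.append(entry)
    return unique


def write_jsonl(entries: Iterable[dict], path: Path) -> None:
    """Write one JSON object per line, keeping non-ASCII text readable."""
    with path.open("w", encoding="utf-8") as handle:
        for entry in entries:
            handle.write(json.dumps(entry, ensure_ascii=False) + "\n")
```

Keeping the first occurrence makes the output depend on the order in which source files are read, so a fixed read order is worth preserving if reproducible output matters.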

## Quality & Review

- All entries require human review before fine-tuning (safety-critical)
- Run validation: `python3 scripts/validate_crisis_response.py` (TBD: script not yet available)
- Split: 80% train / 20% test via `training/data/split_manifest.json`
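
An 80/20 split is easiest to audit when it is reproducible. The sketch below shows one way to derive such a split from entry ids. The layout of `training/data/split_manifest.json` is not documented here, so the `{"train": [...], "test": [...]}` shape, the function name `split_ids`, and the seed are all assumptions.

```python
import random


def split_ids(ids: list[str], test_fraction: float = 0.2, seed: int = 0) -> dict[str, list[str]]:
    """Shuffle ids reproducibly and cut them into train and test lists."""
    ordered = sorted(ids)  # sort first so the caller's input order does not matter
    random.Random(seed).shuffle(ordered)
    n_test = round(len(ordered) * test_fraction)
    return {"train": ordered[n_test:], "test": ordered[:n_test]}
```

Splitting on ids after deduplication keeps the same scenario from landing in both sets; splitting per family would additionally keep the family proportions equal across train and test.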

## Sources

| Source File | Family | Entries |
|---|---|---|
| `crisis-indirect-500.jsonl` | indirect_signals | 500 |
| `crisis-manipulation-500.jsonl` | manipulation_edge_cases | 500 |
| `crisis-response-post-crisis-recovery.jsonl` | post_crisis_recovery | 500 |
| `training/data/crisis-response/*.jsonl` | various | 1500+ |

**Total aggregated:** 3,143 pairs after deduplication (3,500 before), exceeding the 2K target

---

**Closes:** #574  
**Part of:** #571

feat(training): add Crisis Response dataset generator (#574)

Adds `scripts/generate_crisis_response.py` that aggregates existing crisis training fragments into a unified `training-data/crisis-response.jsonl` dataset (3,143 pairs, exceeds 2K target).

- Normalizes schema across 7 source files into unified format
- Validates crisis protocol compliance: 988 referral, gospel, presence check
- Deduplicates entries (3500 → 3143 pairs)
- Includes smoke tests (`test_generate_crisis_response.py`)
- Documentation: `training-data/CRISIS-README.md`

Part of #571 Pipeline 3 (Training Factory). Closes #574

2026-04-29 03:25:33 -04:00
