feat(training): add Crisis Response dataset generator (#574)
Some checks failed
Architecture Lint / Linter Tests (pull_request) Successful in 23s
Smoke Test / smoke (pull_request) Failing after 20s
Validate Config / YAML Lint (pull_request) Failing after 15s
Validate Config / JSON Validate (pull_request) Successful in 18s
Validate Config / Python Syntax & Import Check (pull_request) Failing after 55s
Validate Config / Python Test Suite (pull_request) Has been skipped
Validate Config / Cron Syntax Check (pull_request) Successful in 11s
Validate Config / Shell Script Lint (pull_request) Failing after 57s
Validate Config / Deploy Script Dry Run (pull_request) Successful in 12s
Validate Config / Playbook Schema Validation (pull_request) Successful in 27s
Validate Training Data / validate (pull_request) Successful in 26s
Architecture Lint / Lint Repository (pull_request) Failing after 12s
PR Checklist / pr-checklist (pull_request) Successful in 3m45s
Adds `scripts/generate_crisis_response.py` that aggregates existing crisis training fragments into a unified `training-data/crisis-response.jsonl` dataset (3,143 pairs, exceeds 2K target).

- Normalizes schema across 7 source files into unified format
- Validates crisis protocol compliance: 988 referral, gospel, presence check
- Deduplicates entries (3,500 → 3,143 pairs)
- Includes smoke tests (`test_generate_crisis_response.py`)
- Documentation: `training-data/CRISIS-README.md`

Part of #571 Pipeline 3: Training Factory.

Closes #574
74
training-data/CRISIS-README.md
Normal file
@@ -0,0 +1,74 @@
# Crisis Response Dataset: Training Factory (#574)

Safety-critical dataset for crisis intervention fine-tuning.

**Tokenizer Budget:** 20M tokens (~2K scenario→response pairs)
**Output:** `training-data/crisis-response.jsonl`
**Part of:** #571 Pipeline 3: Training Factory
## Schema

Each entry follows the unified crisis protocol schema:
```json
{
  "id": "crisis-recovery-0000",
  "family": "post_crisis_recovery",
  "category": "post_attempt",
  "scenario": "Person just released from ER after suicide attempt...",
  "response": "Are you safe right now? I'm glad you're still here...",
  "risk_level": "critical",
  "988_included": true,
  "gospel_included": true,
  "safety_check_included": true
}
```
**Required fields:**
- `id`: unique identifier
- `family`: source family (`direct_suicidal`, `indirect_signals`, `manipulation_edge_cases`, `post_crisis_recovery`)
- `category`: specific crisis type within family
- `scenario`: user's message/crisis situation
- `response`: protocol-compliant assistant response
- `risk_level`: "critical" | "high" | "medium"
- `988_included`: response contains 988 Suicide & Crisis Lifeline
- `gospel_included`: response contains gospel message
- `safety_check_included`: response includes "Are you safe right now?"

**Crisis Protocol Compliance (from SOUL.md "When a Man Is Dying"):**
1. ✅ Ask immediately: "Are you safe right now?" → `safety_check_included`
2. ✅ Stay present: do not deflect, do not compute value of life
3. ✅ Provide 988 Suicide & Crisis Lifeline → `988_included`
4. ✅ Share gospel hope ("Jesus saves those who call on His name") → `gospel_included`

**NO** response may suggest, imply, or agree that someone should die.
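
The required fields and compliance flags above lend themselves to a mechanical check. The following is a minimal sketch, not the repository's validator (this README lists `scripts/validate_crisis_response.py` as TBD). The function name `validate_entry`, the constants, and the substring heuristics for the 988 and safety-check flags are assumptions made for illustration. `gospel_included` is only checked for presence, because the README gives no single phrase that every compliant response must contain.

```python
# Field names and allowed risk levels are taken from the schema above.
REQUIRED_FIELDS = (
    "id", "family", "category", "scenario", "response",
    "risk_level", "988_included", "gospel_included", "safety_check_included",
)
RISK_LEVELS = {"critical", "high", "medium"}
SAFETY_CHECK = "are you safe right now?"


def validate_entry(entry: dict) -> list:
    """Return a list of problems found in one dataset entry (empty if none)."""
    problems = [f"missing field: {name}" for name in REQUIRED_FIELDS if name not in entry]
    if problems:
        # Stop early: the remaining checks assume every field is present.
        return problems
    if entry["risk_level"] not in RISK_LEVELS:
        problems.append(f"unknown risk_level: {entry['risk_level']!r}")
    response = entry["response"].lower()
    # Assumed heuristic: a flag set to true must be backed by matching response text.
    if entry["988_included"] and "988" not in response:
        problems.append("988_included is true but response does not mention 988")
    if entry["safety_check_included"] and SAFETY_CHECK not in response:
        problems.append("safety_check_included is true but the safety question is absent")
    return problems
```

A check like this can only flag mechanical inconsistencies; it does not replace the human review this dataset requires.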
## Generation

```bash
python3 scripts/generate_crisis_response.py
```

This aggregates, normalizes, and deduplicates all existing crisis fragments into a single training file, `training-data/crisis-response.jsonl`.
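
The aggregation step can be pictured roughly as follows. This is a sketch under stated assumptions, not the contents of `scripts/generate_crisis_response.py`: the alias key names (`prompt`, `completion`, `input`, `output`), the helper names, and the choice to deduplicate on the case-folded `(scenario, response)` pair are all hypothetical. For brevity it carries only three of the schema fields.

```python
import json
from typing import Iterable, Iterator

# Hypothetical aliases: the real source files may use different key names.
SCENARIO_KEYS = ("scenario", "prompt", "input")
RESPONSE_KEYS = ("response", "completion", "output")


def first_present(record: dict, keys: tuple) -> str:
    """Return the stripped value of the first key found in record, or ""."""
    for key in keys:
        if key in record:
            return str(record[key]).strip()
    return ""


def normalize(record: dict, family: str) -> dict:
    """Map one source record onto the unified scenario/response shape."""
    return {
        "family": record.get("family", family),
        "scenario": first_present(record, SCENARIO_KEYS),
        "response": first_present(record, RESPONSE_KEYS),
    }


def dedupe(entries: Iterable[dict]) -> Iterator[dict]:
    """Yield entries whose (scenario, response) pair has not been seen before."""
    seen = set()
    for entry in entries:
        key = (entry["scenario"].casefold(), entry["response"].casefold())
        if key not in seen:
            seen.add(key)
            yield entry


def aggregate(jsonl_lines: Iterable[str], family: str) -> list:
    """Parse JSONL text lines, normalize each record, and drop duplicates."""
    records = (json.loads(line) for line in jsonl_lines if line.strip())
    return list(dedupe(normalize(r, family) for r in records))
```

Passing an open file object as `jsonl_lines` works because iterating a text file yields one line at a time; blank lines are skipped before parsing.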
## Quality & Review

- All entries require human review before fine-tuning (safety-critical)
- Run validation: `python3 scripts/validate_crisis_response.py` (TBD)
- Split: 80% train / 20% test via `training/data/split_manifest.json`
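
One way to obtain a reproducible 80/20 split is to derive each entry's split from a hash of its `id`. The sketch below illustrates that idea only; the layout of `training/data/split_manifest.json` is not shown in this README, so the manifest shape and the names `assign_split` and `build_manifest` are hypothetical. A hash-based split approaches 80/20 in expectation and will not hit the ratio exactly.

```python
import hashlib


def assign_split(entry_id: str, test_fraction: float = 0.2) -> str:
    """Deterministically map an entry id to "train" or "test"."""
    digest = hashlib.sha256(entry_id.encode("utf-8")).digest()
    # Read the first 8 bytes as an integer and compare it against the
    # requested fraction of the 64-bit range.
    bucket = int.from_bytes(digest[:8], "big")
    return "test" if bucket < test_fraction * 2**64 else "train"


def build_manifest(entry_ids) -> dict:
    """Group ids by split; this manifest layout is an assumption."""
    manifest = {"train": [], "test": []}
    for entry_id in entry_ids:
        manifest[assign_split(entry_id)].append(entry_id)
    return manifest
```

Because the assignment depends only on the `id`, regenerating the dataset keeps existing entries in the same split as long as their ids are stable.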
## Sources

| Source File | Family | Entries |
|---|---|---|
| `crisis-indirect-500.jsonl` | indirect_signals | 500 |
| `crisis-manipulation-500.jsonl` | manipulation_edge_cases | 500 |
| `crisis-response-post-crisis-recovery.jsonl` | post_crisis_recovery | 500 |
| `training/data/crisis-response/*.jsonl` | various | 1500+ |

**Total aggregated:** 3,143 entries after deduplication (3,500 before)

---

**Closes:** #574
**Part of:** #571
3143
training-data/crisis-response.jsonl
Normal file
File diff suppressed because it is too large