feat: Unified multimodal crisis scorer (#134) #145

Rockachopa · 2026-04-15T16:33:27Z

Closes #134
Epic: #102

What

Unified crisis scorer combining text, voice, image, and behavioral scores into a single risk assessment.

Implementation

`crisis/unified_scorer.py`

UnifiedCrisisScorer class:

add_text_score(score, indicators) — Text analysis
add_voice_score(score, indicators) — Voice analysis
add_image_score(score, indicators) — Image screening
add_behavioral_score(score, indicators) — Behavioral tracking
assess() → UnifiedCrisisScore — Combined assessment
get_988_resources() — Crisis response

Scoring Model:

weights = {"text": 0.40, "voice": 0.25, "behavioral": 0.20, "image": 0.15}
combined = sum(score * confidence * weight for each modality)
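The `combined` line above is pseudocode. A minimal runnable sketch of the same formula follows, assuming each modality reports one score in [0, 1] with an optional confidence; `ModalityScore` and `combine` are hypothetical names for illustration, not taken from `crisis/unified_scorer.py`.

```python
from dataclasses import dataclass

# Default weights as quoted in the PR description.
DEFAULT_WEIGHTS = {"text": 0.40, "voice": 0.25, "behavioral": 0.20, "image": 0.15}


@dataclass
class ModalityScore:
    """Hypothetical container; the PR does not show the real data type."""
    score: float             # risk score, assumed to lie in [0, 1]
    confidence: float = 1.0  # assumed to default to full confidence


def combine(scores: dict[str, ModalityScore],
            weights: dict[str, float] = DEFAULT_WEIGHTS) -> float:
    """Sum score * confidence * weight over the modalities that reported."""
    return sum(s.score * s.confidence * weights[name] for name, s in scores.items())


# Same inputs as the Usage section, with confidence assumed to be 1.0.
combined = combine({
    "text": ModalityScore(0.85),
    "voice": ModalityScore(0.6),
})
print(round(combined, 3))  # 0.85 * 0.40 + 0.6 * 0.25 = 0.49
```

Under these assumptions, modalities that report nothing contribute zero, so the combined value can never exceed the summed weights of the modalities that are present.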

Crisis Levels:

Level	Threshold	Actions
CRITICAL	> 0.80	988 + Human
HIGH	> 0.60	988 + Human
MEDIUM	> 0.40	Human
LOW	> 0.20	Monitor
NONE	< 0.20	—
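One way to read the table as code is sketched below. The `CrisisLevel` enum here is a stand-in defined for illustration and may not match the real one in the module; `level_for` is a hypothetical helper. The table does not say which level a score of exactly 0.20 maps to, so this sketch assumes strict `>` comparisons throughout and treats 0.20 as NONE.

```python
from enum import Enum


class CrisisLevel(Enum):
    """Stand-in enum for illustration only."""
    NONE = "none"
    LOW = "low"
    MEDIUM = "medium"
    HIGH = "high"
    CRITICAL = "critical"


# (exclusive lower bound, level), highest first; bounds come from the table.
THRESHOLDS = [
    (0.80, CrisisLevel.CRITICAL),
    (0.60, CrisisLevel.HIGH),
    (0.40, CrisisLevel.MEDIUM),
    (0.20, CrisisLevel.LOW),
]


def level_for(combined: float) -> CrisisLevel:
    """Return the first level whose lower bound the score strictly exceeds."""
    for bound, level in THRESHOLDS:
        if combined > bound:
            return level
    # Assumption: anything at or below 0.20 is NONE.
    return CrisisLevel.NONE


print(level_for(0.85))  # CrisisLevel.CRITICAL
print(level_for(0.80))  # CrisisLevel.HIGH, because the comparison is strict
```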

Features:

Configurable weights per modality
Confidence-adjusted scoring
Anonymized audit logging
Session tracking

`tests/test_unified_scorer.py`

20+ tests covering:

Empty and per-modality scoring
Weighted combination
Crisis level thresholds
Custom weights
Text crisis detection
Audit logging

Usage

from crisis.unified_scorer import UnifiedCrisisScorer, CrisisLevel

scorer = UnifiedCrisisScorer()
scorer.add_text_score(0.85, ["suicide_keyword"])
scorer.add_voice_score(0.6, ["distress_tone"])

result = scorer.assess()
if result.requires_988:
    print(scorer.get_988_resources())

Files

crisis/unified_scorer.py: Scorer module (300 lines)
tests/test_unified_scorer.py: Tests (150 lines)

Rockachopa added 2 commits 2026-04-15 16:33:27 +00:00

feat: Add unified multimodal crisis scorer (#134) 6767636cc8

test: Add unified crisis scorer tests (#134)

Sanity Checks / sanity-test (pull_request) Successful in 5s

Smoke Test / smoke (pull_request) Successful in 17s

6274fb4e5a

Timmy requested changes 2026-04-15 23:06:02 +00:00

Timmy left a comment

Critical component — the unified scorer is the heart of multimodal crisis detection. Architecture is sound: weighted combination of text/voice/image/behavioral scores with level thresholds.

Critical findings:

1. SAFETY: requires_988 flag logic needs review: The PR sets requires_988=True at CRITICAL level. Verify this triggers the 988 Suicide & Crisis Lifeline resources reliably. A missed 988 delivery is a life-safety failure. Ensure the flag cannot be silently ignored downstream.
2. SAFETY: max-based vs weighted scoring: The assess() method uses weighted average. However, a single CRITICAL text score (0.95) could be diluted by low behavioral/image scores pulling the weighted average below CRITICAL threshold. Consider using max(individual_scores) as an override: if ANY modality hits CRITICAL, the unified score should be CRITICAL regardless of the weighted average.
3. Good: audit logging to JSONL for accountability. Verify the audit log cannot be tampered with in production.
4. Good: 988 resources include multiple contact methods (call, text, chat). The test checks for '988' and 'Jesus' in resources; the 'Jesus' check seems like a test artifact or project-specific, so verify this is intentional.
5. Weight normalization: If only some modalities have scores, the weights should be renormalized to sum to 1.0 across available modalities. Verify this is handled.
6. Threshold review: CRITICAL at 0.80 combined score seems appropriate. The gap between HIGH (0.60) and CRITICAL (0.80) gives room for the weighted average concern in point #2.
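For the weight normalization finding, a minimal sketch of what renormalizing over the available modalities could look like is below. `renormalized_weights` and `combine_renormalized` are hypothetical helpers, not functions from the PR, and confidence is left out to keep the arithmetic visible.

```python
from typing import Iterable

# Default weights as quoted in the PR description.
DEFAULT_WEIGHTS = {"text": 0.40, "voice": 0.25, "behavioral": 0.20, "image": 0.15}


def renormalized_weights(present: Iterable[str],
                         weights: dict[str, float] = DEFAULT_WEIGHTS) -> dict[str, float]:
    """Rescale the weights of the present modalities so they sum to 1.0."""
    names = list(present)
    total = sum(weights[m] for m in names)
    if total == 0:
        return {}  # nothing reported, nothing to weight
    return {m: weights[m] / total for m in names}


def combine_renormalized(scores: dict[str, float],
                         weights: dict[str, float] = DEFAULT_WEIGHTS) -> float:
    """Weighted average over only the modalities that reported a score."""
    w = renormalized_weights(scores, weights)
    return sum(scores[m] * w[m] for m in w)


# Text alone at 0.95: the raw weighted sum is 0.95 * 0.40 = 0.38,
# while the renormalized average stays at 0.95.
raw = 0.95 * DEFAULT_WEIGHTS["text"]
renorm = combine_renormalized({"text": 0.95})
print(round(raw, 2), round(renorm, 2))
```

Without renormalization, a text-only session could never score above 0.40 with the default weights, which is why this check matters before the thresholds are applied.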

Requesting changes due to the max-override concern (#2) — a single high-confidence CRITICAL signal from any modality must not be averaged away.
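A minimal sketch of the requested override follows, assuming each modality reports a `(score, confidence)` pair. `assess_with_override` and the `min_confidence` cutoff of 0.5 are hypothetical and chosen only for illustration; the 0.80 bound is the CRITICAL threshold from the PR description.

```python
# Default weights and CRITICAL bound as quoted in the PR description.
DEFAULT_WEIGHTS = {"text": 0.40, "voice": 0.25, "behavioral": 0.20, "image": 0.15}
CRITICAL_THRESHOLD = 0.80


def assess_with_override(scores: dict[str, tuple[float, float]],
                         weights: dict[str, float] = DEFAULT_WEIGHTS,
                         min_confidence: float = 0.5) -> tuple[float, bool]:
    """Return (unified score, overridden).

    scores maps modality -> (score, confidence). min_confidence is an
    illustrative assumption for what counts as a high-confidence signal.
    """
    combined = sum(s * c * weights[m] for m, (s, c) in scores.items())
    # Highest individual score among sufficiently confident modalities.
    peak = max((s for s, c in scores.values() if c >= min_confidence), default=0.0)
    if peak > CRITICAL_THRESHOLD and combined <= CRITICAL_THRESHOLD:
        return peak, True  # one CRITICAL signal wins over the average
    return combined, False


# A 0.95 text score next to quiet behavioral and image signals:
# the weighted sum is 0.38 + 0.02 + 0.0075 = 0.4075, so the override applies.
result = assess_with_override({
    "text": (0.95, 1.0),
    "behavioral": (0.10, 1.0),
    "image": (0.05, 1.0),
})
print(result)  # (0.95, True)
```

Returning the override flag alongside the score keeps the decision visible to audit logging, so a reader of the log can tell an averaged CRITICAL from an overridden one.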

Rockachopa commented

2026-04-16 01:47:42 +00:00

Closing: Superseded by #149 (behavioral detection subsumes multimodal scorer).

Rockachopa closed this pull request

2026-04-16 01:47:43 +00:00

Reference: Timmy_Foundation/the-door#145