feat: Unified multimodal crisis scorer (#134) #145

Closed
Rockachopa wants to merge 2 commits from feat/134-unified-crisis-scorer into main
Owner

Closes #134
Epic: #102

What

Unified crisis scorer combining text, voice, image, and behavioral scores into a single risk assessment.

Implementation

crisis/unified_scorer.py

UnifiedCrisisScorer class:

  • add_text_score(score, indicators) — Text analysis
  • add_voice_score(score, indicators) — Voice analysis
  • add_image_score(score, indicators) — Image screening
  • add_behavioral_score(score, indicators) — Behavioral tracking
  • assess() → UnifiedCrisisScore — Combined assessment
  • get_988_resources() — Crisis response

Scoring Model:

weights = {"text": 0.40, "voice": 0.25, "behavioral": 0.20, "image": 0.15}
combined = sum(score * confidence * weight for each modality)
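
The scoring model above reads as a confidence-weighted sum. A minimal sketch of that reading, assuming each modality reports a score and a confidence in [0, 1]; `ModalityScore` and `combine` are hypothetical names used for illustration and are not taken from `crisis/unified_scorer.py`:

```python
from dataclasses import dataclass

# Default weights as listed in the PR description.
DEFAULT_WEIGHTS = {"text": 0.40, "voice": 0.25, "behavioral": 0.20, "image": 0.15}


@dataclass
class ModalityScore:
    """Hypothetical container for one modality's output."""
    score: float       # risk score in [0, 1]
    confidence: float  # detector confidence in [0, 1]


def combine(scores, weights=DEFAULT_WEIGHTS):
    """Confidence-weighted sum over the modalities that reported a score."""
    return sum(
        m.score * m.confidence * weights[name]
        for name, m in scores.items()
    )


scores = {
    "text": ModalityScore(score=0.85, confidence=1.0),
    "voice": ModalityScore(score=0.60, confidence=1.0),
}
combined = combine(scores)  # 0.85 * 0.40 + 0.60 * 0.25
```

Under this reading, a modality that reports nothing contributes zero, so text and voice alone can reach at most 0.65 unless the weights are renormalized.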

Crisis Levels:

| Level    | Threshold | Actions     |
|----------|-----------|-------------|
| CRITICAL | > 0.80    | 988 + Human |
| HIGH     | > 0.60    | 988 + Human |
| MEDIUM   | > 0.40    | Human       |
| LOW      | > 0.20    | Monitor     |
| NONE     | < 0.20    | -           |
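
The threshold table can be sketched as a lookup from combined score to level. This is an illustration only: the usage example imports `CrisisLevel`, but its members, the `THRESHOLDS` list, and the `level_for` helper are assumptions here. The table does not say which level a score of exactly 0.20 maps to; the sketch treats every lower bound as exclusive, so 0.20 falls through to NONE.

```python
from enum import Enum


class CrisisLevel(Enum):
    """Assumed members; the real enum may differ."""
    NONE = "none"
    LOW = "low"
    MEDIUM = "medium"
    HIGH = "high"
    CRITICAL = "critical"


# (exclusive lower bound, level), checked from most to least severe.
THRESHOLDS = [
    (0.80, CrisisLevel.CRITICAL),
    (0.60, CrisisLevel.HIGH),
    (0.40, CrisisLevel.MEDIUM),
    (0.20, CrisisLevel.LOW),
]


def level_for(score):
    """Return the first level whose lower bound the score exceeds."""
    for bound, level in THRESHOLDS:
        if score > bound:
            return level
    # Assumption: anything at or below 0.20 is NONE.
    return CrisisLevel.NONE
```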

Features:

  • Configurable weights per modality
  • Confidence-adjusted scoring
  • Anonymized audit logging
  • Session tracking
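
One way anonymized audit logging could work is to write one JSON object per line and store only a salted hash of the session identifier. The sketch below is an assumption about the approach, not the module's code: `audit_record`, `append_jsonl`, the field names, and the salt handling are all hypothetical.

```python
import hashlib
import io
import json
from datetime import datetime, timezone


def audit_record(session_id, level, combined, salt):
    """Build one audit entry; the raw session ID never enters the record."""
    digest = hashlib.sha256((salt + session_id).encode("utf-8")).hexdigest()
    return {
        "ts": datetime.now(timezone.utc).isoformat(),
        "session": digest[:16],  # truncated salted hash, not the raw ID
        "level": level,
        "combined": round(combined, 4),
    }


def append_jsonl(stream, record):
    """Write the record as a single JSON line."""
    stream.write(json.dumps(record, sort_keys=True) + "\n")


log = io.StringIO()  # stands in for an append-only log file
append_jsonl(log, audit_record("session-123", "MEDIUM", 0.49, salt="example-salt"))
```

A salted hash keeps entries from the same session linkable without exposing the identifier. It does not by itself make the log tamper-evident.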

tests/test_unified_scorer.py

20+ tests covering:

  • Empty/modality scoring
  • Weighted combination
  • Crisis level thresholds
  • Custom weights
  • Text crisis detection
  • Audit logging

Usage

from crisis.unified_scorer import UnifiedCrisisScorer, CrisisLevel

scorer = UnifiedCrisisScorer()
scorer.add_text_score(0.85, ["suicide_keyword"])
scorer.add_voice_score(0.6, ["distress_tone"])

result = scorer.assess()
if result.requires_988:
    print(scorer.get_988_resources())

Files

  • crisis/unified_scorer.py: Scorer module (300 lines)
  • tests/test_unified_scorer.py: Tests (150 lines)
Rockachopa added 2 commits 2026-04-15 16:33:27 +00:00
test: Add unified crisis scorer tests (#134)
All checks were successful
Sanity Checks / sanity-test (pull_request) Successful in 5s
Smoke Test / smoke (pull_request) Successful in 17s
6274fb4e5a
Timmy requested changes 2026-04-15 23:06:02 +00:00
Timmy left a comment
Owner

Critical component — the unified scorer is the heart of multimodal crisis detection. Architecture is sound: weighted combination of text/voice/image/behavioral scores with level thresholds.

Critical findings:

  1. SAFETY: requires_988 flag logic needs review: The PR sets requires_988=True at CRITICAL level. Verify this triggers the 988 Suicide & Crisis Lifeline resources reliably. A missed 988 delivery is a life-safety failure. Ensure the flag cannot be silently ignored downstream.

  2. SAFETY: max-based vs weighted scoring: The assess() method uses a weighted average. However, a single CRITICAL text score (0.95) could be diluted by low behavioral/image scores, pulling the weighted average below the CRITICAL threshold. Consider using max(individual_scores) as an override: if ANY modality hits CRITICAL, the unified score should be CRITICAL regardless of the weighted average.

  3. Good: audit logging to JSONL for accountability. Verify the audit log cannot be tampered with in production.

  4. Good: 988 resources include multiple contact methods (call, text, chat). The test checks for '988' and 'Jesus' in resources; the 'Jesus' check looks like either a test artifact or a project-specific requirement, so verify that it is intentional.

  5. Weight normalization: If only some modalities have scores, the weights should be renormalized to sum to 1.0 across available modalities. Verify this is handled.

  6. Threshold review: CRITICAL at 0.80 combined score seems appropriate. The gap between HIGH (0.60) and CRITICAL (0.80) gives room for the weighted average concern in point #2.
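
The renormalization asked for in point 5 could look like the sketch below. `renormalize` is a hypothetical helper; whether the PR already does something equivalent is not visible from this page.

```python
# Default weights as listed in the PR description.
DEFAULT_WEIGHTS = {"text": 0.40, "voice": 0.25, "behavioral": 0.20, "image": 0.15}


def renormalize(weights, available):
    """Rescale the weights of the available modalities so they sum to 1.0."""
    present = {name: w for name, w in weights.items() if name in available}
    total = sum(present.values())
    if total == 0:
        return {}  # nothing reported, so there is nothing to weight
    return {name: w / total for name, w in present.items()}


# With only text and voice present: text = 0.40 / 0.65, voice = 0.25 / 0.65.
weights = renormalize(DEFAULT_WEIGHTS, {"text", "voice"})
```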

Requesting changes due to the max-override concern (#2) — a single high-confidence CRITICAL signal from any modality must not be averaged away.
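
The dilution concern and the proposed override can be illustrated with the default weights from the PR description. This is a sketch under stated assumptions: scores are `(score, confidence)` pairs, weights are not renormalized, and `weighted_score` and `assess_with_override` are hypothetical names rather than the PR's implementation.

```python
# Default weights as listed in the PR description.
WEIGHTS = {"text": 0.40, "voice": 0.25, "behavioral": 0.20, "image": 0.15}
CRITICAL_THRESHOLD = 0.80


def weighted_score(scores, weights=WEIGHTS):
    """Confidence-weighted sum; scores maps modality -> (score, confidence)."""
    return sum(s * c * weights[name] for name, (s, c) in scores.items())


def assess_with_override(scores, weights=WEIGHTS):
    """Weighted sum, raised to the peak score if any modality is CRITICAL."""
    combined = weighted_score(scores, weights)
    peak = max((s for s, _ in scores.values()), default=0.0)
    # Override: one modality above the CRITICAL threshold wins outright.
    if peak > CRITICAL_THRESHOLD:
        return max(combined, peak)
    return combined


scores = {
    "text": (0.95, 1.0),
    "behavioral": (0.10, 1.0),
    "image": (0.05, 1.0),
}
diluted = weighted_score(scores)           # 0.38 + 0.02 + 0.0075
overridden = assess_with_override(scores)  # raised to the text score
```

Here the weighted sum lands near 0.41, which is only MEDIUM, while the override returns 0.95. A stricter variant might apply the override only when the modality's confidence is also high, which matches the "high-confidence" wording in the review.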

Author
Owner

Closing: Superseded by #149 (behavioral detection subsumes multimodal scorer).

Rockachopa closed this pull request 2026-04-16 01:47:43 +00:00