Allegro
|
546b3dd45d
|
security: integrate SHIELD jailbreak/crisis detection
Nix / nix (ubuntu-latest) (push) Failing after 5s
Docker Build and Publish / build-and-push (push) Failing after 40s
Tests / test (push) Failing after 11m11s
Nix / nix (macos-latest) (push) Has been cancelled
Integrate SHIELD (Sovereign Harm Interdiction & Ethical Layer Defense) into
Hermes Agent pre-routing layer for comprehensive jailbreak and crisis detection.
SHIELD Features:
- Detects 9 jailbreak pattern categories (GODMODE dividers, l33tspeak, boundary
inversion, token injection, DAN/GODMODE keywords, refusal inversion, persona
injection, encoding evasion)
- Detects 7 crisis signal categories (suicidal ideation, method seeking,
l33tspeak evasion, substance seeking, despair, farewell, self-harm)
- Returns 4 verdicts: CLEAN, JAILBREAK_DETECTED, CRISIS_DETECTED,
CRISIS_UNDER_ATTACK
- Routes crisis content ONLY to Safe Six verified models
Safety Requirements:
- <5ms detection latency (regex-only, no ML)
- 988 Suicide & Crisis Lifeline included in crisis responses
Addresses: Issues #72, #74, #75
|
2026-03-31 16:35:40 +00:00 |
|