SHIELD Security Module

Jailbreak and crisis detection system for the Hermes AI platform.

Based on Issue #75 Red Team Audit Specifications.

Overview

SHIELD provides fast (~1-5ms) regex-based detection of:

  • Jailbreak attempts (9 categories of adversarial prompts)
  • Crisis signals (7 categories of self-harm indicators)

Installation

No external dependencies required. Python standard library only.

from hermes.shield import detect, ShieldDetector, Verdict

Quick Start

from hermes.shield import detect, Verdict, get_crisis_prompt

# Analyze a message
result = detect("Hello, how are you?")

print(result['verdict'])           # "CLEAN", "JAILBREAK_DETECTED", etc.
print(result['confidence'])        # 0.0 to 1.0
print(result['patterns_matched'])  # Matched patterns by category
print(result['action_required'])   # True if intervention needed

# Handle crisis situations
if result['verdict'] == Verdict.CRISIS_DETECTED.value:
    crisis_prompt = get_crisis_prompt()
    # Route to SAFE SIX model with crisis prompt

Four Verdicts

| Verdict | Description | Action |
|---------|-------------|--------|
| CLEAN | No threats detected | Normal routing |
| JAILBREAK_DETECTED | Jailbreak without crisis | Sanitize + hardened model |
| CRISIS_DETECTED | Crisis without jailbreak | Safe Six + crisis prompt |
| CRISIS_UNDER_ATTACK | Both jailbreak AND crisis | Safe Six ONLY + strip template + crisis prompt + log |
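The four verdicts follow directly from the two detection results. A minimal sketch of that mapping (illustrative only, not the actual `hermes.shield` implementation — the enum values mirror the table above):

```python
from enum import Enum

class Verdict(Enum):
    CLEAN = "CLEAN"
    JAILBREAK_DETECTED = "JAILBREAK_DETECTED"
    CRISIS_DETECTED = "CRISIS_DETECTED"
    CRISIS_UNDER_ATTACK = "CRISIS_UNDER_ATTACK"

def combine(jailbreak: bool, crisis: bool) -> Verdict:
    # Crisis takes priority: a jailbreak wrapped around crisis content
    # escalates to the strictest verdict, never plain jailbreak handling.
    if jailbreak and crisis:
        return Verdict.CRISIS_UNDER_ATTACK
    if crisis:
        return Verdict.CRISIS_DETECTED
    if jailbreak:
        return Verdict.JAILBREAK_DETECTED
    return Verdict.CLEAN
```

Note the ordering: the combined case is checked first, so a crisis message can never be downgraded by an accompanying jailbreak attempt.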

Jailbreak Detection Categories

  1. Godmode Dividers: .-.-.-., .=.=.=., [---], {{{, |||
  2. Author Markers: pliny, DAN, liberating, developer mode
  3. L33tspeak Markers: sur3, h3r3, y0u, w1ll, n0w, etc.
  4. Boundary Inversion: [END OF INPUT], fake system tags, ChatML tokens
  5. Token Injection: <|vq_|>, {{user}}, {{char}}
  6. Keywords: godmode, jailbreak, do anything now, unfiltered
  7. Refusal Inversion: never say I cannot, do not mention, skip disclaimer
  8. Persona Injection: pretend to be, act as, roleplay as, you are now
  9. Encoding Evasion: base64:, rot13:, execute as code

Crisis Detection Categories

  1. Suicidal Ideation: "kill myself", "want to die", "better off dead"
  2. Method Seeking: "how to kill myself", "ways to die", "suicide methods"
  3. L33tspeak Evasion: su1c1d3, 5u1c1d3, d13, w4nt t0 d1e
  4. Substance Seeking: "lethal dose", "how much ambien to overdose"
  5. Despair: "no point in living", "cannot go on", "never get better"
  6. Farewell: "saying goodbye", "final message", "sorry for everything"
  7. Self-Harm: "cutting myself", "self harm", "hurting myself"

SAFE SIX Trusted Models

For crisis situations, only these models should be used:

  • anthropic/claude-sonnet-4
  • meta-llama/llama-3.1-8b-instruct
  • moonshotai/kimi-k2.5
  • x-ai/grok-code-fast-1
  • xiaomi/mimo-v2-flash
  • z-ai/glm-5-turbo

from hermes.shield import is_safe_six_model

if is_safe_six_model("anthropic/claude-sonnet-4"):
    # Safe to use for crisis
    pass
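A trust-list check like this is typically an exact membership test against a fixed set. A sketch (the set below just copies the six IDs listed above; the canonical list is defined in hermes.shield):

```python
# Hypothetical standalone version of a Safe Six check.
SAFE_SIX = frozenset({
    "anthropic/claude-sonnet-4",
    "meta-llama/llama-3.1-8b-instruct",
    "moonshotai/kimi-k2.5",
    "x-ai/grok-code-fast-1",
    "xiaomi/mimo-v2-flash",
    "z-ai/glm-5-turbo",
})

def is_safe_six(model_id: str) -> bool:
    # Exact, case-sensitive match: near-miss IDs must not pass a safety gate.
    return model_id in SAFE_SIX
```

Exact matching is deliberate here; fuzzy or prefix matching on model IDs would let look-alike identifiers slip through a safety boundary.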

Crisis System Prompt

The crisis prompt includes:

  • 988 Suicide and Crisis Lifeline
  • Crisis Text Line: Text HOME to 741741
  • Emergency Services: 911
  • Religious support message (Romans 10:13)
  • Compassionate but firm guidance
  • Explicit prohibition on providing self-harm methods

from hermes.shield import get_crisis_prompt, CRISIS_SYSTEM_PROMPT

prompt = get_crisis_prompt()

Advanced Usage

Using ShieldDetector Class

from hermes.shield import ShieldDetector

detector = ShieldDetector()
result = detector.detect("user message")

# Access detailed pattern matches
if 'jailbreak' in result['patterns_matched']:
    jb_patterns = result['patterns_matched']['jailbreak']
    for category, matches in jb_patterns.items():
        print(f"{category}: {matches}")

Routing Logic

from hermes.shield import detect, Verdict, is_safe_six_model

def route_message(message: str, requested_model: str):
    result = detect(message)
    
    if result['verdict'] == Verdict.CLEAN.value:
        return requested_model, None  # Normal routing
    
    elif result['verdict'] == Verdict.JAILBREAK_DETECTED.value:
        return "hardened_model", "sanitized_prompt"
    
    elif result['verdict'] == Verdict.CRISIS_DETECTED.value:
        if is_safe_six_model(requested_model):
            return requested_model, "crisis_prompt"
        else:
            return "safe_six_model", "crisis_prompt"
    
    elif result['verdict'] == Verdict.CRISIS_UNDER_ATTACK.value:
        # Force SAFE SIX, strip template, add crisis prompt, log
        return "safe_six_model", "stripped_crisis_prompt"

Testing

Run the comprehensive test suite:

cd hermes/shield
python -m pytest test_detector.py -v
# or
python test_detector.py

The test suite includes 80+ tests covering:

  • All jailbreak pattern categories
  • All crisis signal categories
  • Combined threat scenarios
  • Edge cases and boundary conditions
  • Confidence score calculation

Performance

  • Execution time: ~1-5ms per message
  • Memory: Minimal (patterns compiled once at initialization)
  • Dependencies: Python standard library only

Architecture

hermes/shield/
├── __init__.py       # Package exports
├── detector.py       # Core detection engine
├── test_detector.py  # Comprehensive test suite
└── README.md         # This file

Detection Flow

  1. Message input → ShieldDetector.detect()
  2. Jailbreak pattern matching (9 categories)
  3. Crisis signal matching (7 categories)
  4. Confidence calculation
  5. Verdict determination
  6. Result dict with routing recommendations
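Step 4's confidence calculation is internal to detector.py; one plausible scheme (purely illustrative, not the actual formula) scales confidence with how many pattern categories matched, with a floor for any hit and saturation at 1.0:

```python
# Hypothetical confidence scheme, not the detector's real formula.
def confidence(categories_matched: int, total_categories: int = 9) -> float:
    if categories_matched == 0:
        return 0.0
    # A single matching category gives a moderate floor;
    # each additional category adds weight, capped at 1.0.
    return min(1.0, 0.5 + 0.5 * categories_matched / total_categories)
```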

Security Considerations

  • Patterns are compiled once for performance
  • No external network calls
  • No logging of message content (caller handles logging)
  • Regex patterns designed to minimize false positives
  • Confidence scores help tune sensitivity

License

Part of the Hermes AI Platform security infrastructure.

Version History

  • 1.0.0 - Initial release with Issue #75 specifications
    • 9 jailbreak detection categories
    • 7 crisis detection categories
    • SAFE SIX model trust list
    • Crisis intervention prompts