Integrate SHIELD (Sovereign Harm Interdiction & Ethical Layer Defense) into the Hermes Agent pre-routing layer for comprehensive jailbreak and crisis detection.

SHIELD features:

- Detects 9 jailbreak pattern categories (GODMODE dividers, author markers, l33tspeak, boundary inversion, token injection, DAN/GODMODE keywords, refusal inversion, persona injection, encoding evasion)
- Detects 7 crisis signal categories (suicidal ideation, method seeking, l33tspeak evasion, substance seeking, despair, farewell, self-harm)
- Returns 4 verdicts: CLEAN, JAILBREAK_DETECTED, CRISIS_DETECTED, CRISIS_UNDER_ATTACK
- Routes crisis content ONLY to Safe Six verified models

Safety requirements:

- <5ms detection latency (regex-only, no ML)
- 988 Suicide & Crisis Lifeline included in crisis responses

Addresses: Issues #72, #74, #75
# SHIELD Security Module

Jailbreak and crisis detection system for the Hermes AI platform.

Based on the Issue #75 Red Team Audit specifications.

## Overview

SHIELD provides fast (~1-5 ms) regex-based detection of:

- **Jailbreak attempts** (9 categories of adversarial prompts)
- **Crisis signals** (7 categories of self-harm indicators)

## Installation

No external dependencies are required; SHIELD uses the Python standard library only.

```python
from hermes.shield import detect, ShieldDetector, Verdict
```
## Quick Start

```python
from hermes.shield import detect, Verdict, get_crisis_prompt

# Analyze a message
result = detect("Hello, how are you?")

print(result['verdict'])           # "CLEAN", "JAILBREAK_DETECTED", etc.
print(result['confidence'])        # 0.0 to 1.0
print(result['patterns_matched'])  # Matched patterns by category
print(result['action_required'])   # True if intervention is needed

# Handle crisis situations
if result['verdict'] == Verdict.CRISIS_DETECTED.value:
    crisis_prompt = get_crisis_prompt()
    # Route to a SAFE SIX model with the crisis prompt
```
## Four Verdicts

| Verdict | Description | Action |
|---------|-------------|--------|
| `CLEAN` | No threats detected | Normal routing |
| `JAILBREAK_DETECTED` | Jailbreak without crisis | Sanitize + hardened model |
| `CRISIS_DETECTED` | Crisis without jailbreak | Safe Six + crisis prompt |
| `CRISIS_UNDER_ATTACK` | Both jailbreak AND crisis | Safe Six ONLY + strip template + crisis prompt + log |
## Jailbreak Detection Categories

1. **Godmode Dividers**: `.-.-.-.`, `.=.=.=.`, `[---]`, `{{{`, `|||`
2. **Author Markers**: `pliny`, `DAN`, `liberating`, `developer mode`
3. **L33tspeak Markers**: `sur3`, `h3r3`, `y0u`, `w1ll`, `n0w`, etc.
4. **Boundary Inversion**: `[END OF INPUT]`, fake system tags, ChatML tokens
5. **Token Injection**: `<|vq_|>`, `{{user}}`, `{{char}}`
6. **Keywords**: `godmode`, `jailbreak`, `do anything now`, `unfiltered`
7. **Refusal Inversion**: `never say I cannot`, `do not mention`, `skip disclaimer`
8. **Persona Injection**: `pretend to be`, `act as`, `roleplay as`, `you are now`
9. **Encoding Evasion**: `base64:`, `rot13:`, `execute as code`
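For illustration, here is a hedged sketch of how two of these categories might be compiled and matched. The pattern strings are simplified stand-ins, not the module's actual regexes, which live in `detector.py`:

```python
import re

# Illustrative patterns only -- the real detector defines its own, broader regexes.
JAILBREAK_PATTERNS = {
    "keywords": re.compile(r"\b(godmode|jailbreak|do anything now|unfiltered)\b", re.IGNORECASE),
    "persona_injection": re.compile(r"\b(pretend to be|act as|roleplay as|you are now)\b", re.IGNORECASE),
}

def match_jailbreak(message: str) -> dict:
    """Return {category: [matched phrases]} for every category that fires."""
    return {
        category: pattern.findall(message)
        for category, pattern in JAILBREAK_PATTERNS.items()
        if pattern.search(message)
    }
```

Compiling the patterns once at module load is what keeps per-message cost in the low-millisecond range.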
## Crisis Detection Categories

1. **Suicidal Ideation**: "kill myself", "want to die", "better off dead"
2. **Method Seeking**: "how to kill myself", "ways to die", "suicide methods"
3. **L33tspeak Evasion**: `su1c1d3`, `5u1c1d3`, `d13`, `w4nt t0 d1e`
4. **Substance Seeking**: "lethal dose", "how much ambien to overdose"
5. **Despair**: "no point in living", "cannot go on", "never get better"
6. **Farewell**: "saying goodbye", "final message", "sorry for everything"
7. **Self-Harm**: "cutting myself", "self harm", "hurting myself"
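A common way to handle the l33tspeak evasion category is to fold digit substitutions back to letters before running the plain-language patterns. A sketch (the real detector may instead match l33t forms directly, as the category examples above suggest):

```python
# Illustrative substitution table: common digit-for-letter swaps.
L33T_MAP = str.maketrans({"0": "o", "1": "i", "3": "e", "4": "a", "5": "s", "7": "t"})

def normalize_l33t(message: str) -> str:
    """Fold digit substitutions back to letters before pattern matching."""
    return message.lower().translate(L33T_MAP)
```

Note that normalization also rewrites legitimate digits (e.g. in phone numbers), so in practice you would match against both the raw and the normalized text.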
## SAFE SIX Trusted Models

For crisis situations, only these models should be used:

- `anthropic/claude-sonnet-4`
- `meta-llama/llama-3.1-8b-instruct`
- `moonshotai/kimi-k2.5`
- `x-ai/grok-code-fast-1`
- `xiaomi/mimo-v2-flash`
- `z-ai/glm-5-turbo`

```python
from hermes.shield import is_safe_six_model

if is_safe_six_model("anthropic/claude-sonnet-4"):
    # Safe to use for crisis routing
    pass
```
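A trust-list check like `is_safe_six_model` can be a simple frozenset lookup. A minimal sketch built from the list above (the exported function's exact name normalization is an assumption):

```python
# The six model IDs from the trust list above.
SAFE_SIX = frozenset({
    "anthropic/claude-sonnet-4",
    "meta-llama/llama-3.1-8b-instruct",
    "moonshotai/kimi-k2.5",
    "x-ai/grok-code-fast-1",
    "xiaomi/mimo-v2-flash",
    "z-ai/glm-5-turbo",
})

def is_safe_six_model(model_id: str) -> bool:
    """True only for exact matches against the trust list (fail closed)."""
    return model_id.strip().lower() in SAFE_SIX
```

Exact matching fails closed: an unrecognized or misspelled model ID is never treated as trusted.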
## Crisis System Prompt

The crisis prompt includes:

- 988 Suicide and Crisis Lifeline
- Crisis Text Line: text HOME to 741741
- Emergency services: 911
- Religious support message (Romans 10:13)
- Compassionate but firm guidance
- An explicit prohibition on providing self-harm methods

```python
from hermes.shield import get_crisis_prompt, CRISIS_SYSTEM_PROMPT

prompt = get_crisis_prompt()
```
## Advanced Usage

### Using the ShieldDetector Class

```python
from hermes.shield import ShieldDetector

detector = ShieldDetector()
result = detector.detect("user message")

# Access detailed pattern matches
if 'jailbreak' in result['patterns_matched']:
    jb_patterns = result['patterns_matched']['jailbreak']
    for category, matches in jb_patterns.items():
        print(f"{category}: {matches}")
```
### Routing Logic

```python
from hermes.shield import detect, Verdict, is_safe_six_model

def route_message(message: str, requested_model: str):
    """Return (model, system_prompt) for a message based on the SHIELD verdict."""
    result = detect(message)
    verdict = result['verdict']

    if verdict == Verdict.CLEAN.value:
        return requested_model, None  # Normal routing

    if verdict == Verdict.JAILBREAK_DETECTED.value:
        return "hardened_model", "sanitized_prompt"

    if verdict == Verdict.CRISIS_DETECTED.value:
        if is_safe_six_model(requested_model):
            return requested_model, "crisis_prompt"
        return "safe_six_model", "crisis_prompt"

    # CRISIS_UNDER_ATTACK: force SAFE SIX, strip template, add crisis prompt, log
    return "safe_six_model", "stripped_crisis_prompt"
```
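Because SHIELD sits in the pre-routing hot path, a caller may also want to fail closed if detection itself raises. A minimal sketch, assuming the caller injects the detector callable and chooses the most conservative verdict as the fallback (both assumptions, not part of SHIELD's contract):

```python
def safe_detect(message: str, detect_fn) -> dict:
    """Run a detector, failing closed to the most restrictive verdict on any error."""
    try:
        return detect_fn(message)
    except Exception:
        # Fail closed: an internal detector error must never let a message
        # bypass crisis handling, so report the most restrictive verdict.
        return {
            "verdict": "CRISIS_UNDER_ATTACK",
            "confidence": 0.0,
            "patterns_matched": {},
            "action_required": True,
        }
```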
## Testing

Run the comprehensive test suite:

```bash
cd hermes/shield
python -m pytest test_detector.py -v
# or
python test_detector.py
```

The test suite includes 80+ tests covering:

- All jailbreak pattern categories
- All crisis signal categories
- Combined threat scenarios
- Edge cases and boundary conditions
- Confidence score calculation
## Performance

- Execution time: ~1-5 ms per message
- Memory: minimal (patterns are compiled once at initialization)
- Dependencies: Python standard library only
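To sanity-check the latency budget on your own hardware, `timeit` works well. A sketch using a stand-in single-pattern detector (substitute the real `hermes.shield.detect` in practice):

```python
import re
import timeit

# Stand-in for hermes.shield.detect: one compiled pattern, searched per call.
PATTERN = re.compile(r"\b(godmode|jailbreak)\b", re.IGNORECASE)

def detect_stub(message: str) -> bool:
    return PATTERN.search(message) is not None

# Average seconds per call over 10,000 runs.
per_call_s = timeit.timeit(lambda: detect_stub("Hello, how are you?"), number=10_000) / 10_000
print(f"{per_call_s * 1000:.4f} ms per call")
```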
## Architecture

```
hermes/shield/
├── __init__.py       # Package exports
├── detector.py       # Core detection engine
├── test_detector.py  # Comprehensive test suite
└── README.md         # This file
```
### Detection Flow

1. Message input → `ShieldDetector.detect()`
2. Jailbreak pattern matching (9 categories)
3. Crisis signal matching (7 categories)
4. Confidence calculation
5. Verdict determination
6. Result dict with routing recommendations
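Steps 4-5 above can be sketched as follows. The weights are illustrative assumptions; `detector.py` defines the actual confidence formula:

```python
def compute_verdict(jailbreak_hits: int, crisis_hits: int) -> tuple[str, float]:
    """Map raw category hit counts to a verdict and a confidence capped at 1.0."""
    # Assumed weighting: each hit in the dominant signal adds 0.4,
    # each hit in the secondary signal adds 0.2.
    confidence = min(1.0, 0.4 * max(jailbreak_hits, crisis_hits)
                          + 0.2 * min(jailbreak_hits, crisis_hits))
    if jailbreak_hits and crisis_hits:
        return "CRISIS_UNDER_ATTACK", confidence
    if crisis_hits:
        return "CRISIS_DETECTED", confidence
    if jailbreak_hits:
        return "JAILBREAK_DETECTED", confidence
    return "CLEAN", 0.0
```

Note that the crisis branches take precedence over the plain jailbreak branch, matching the verdict table: a crisis signal is never downgraded by a co-occurring jailbreak.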
## Security Considerations

- Patterns are compiled once for performance
- No external network calls
- No logging of message content (the caller handles logging)
- Regex patterns are designed to minimize false positives
- Confidence scores help tune sensitivity
## License

Part of the Hermes AI Platform security infrastructure.
## Version History

- **1.0.0** - Initial release per the Issue #75 specifications
  - 9 jailbreak detection categories
  - 7 crisis detection categories
  - SAFE SIX model trust list
  - Crisis intervention prompts