[SECURITY] [HIGH] Implement input sanitization for GODMODE jailbreak patterns #80

Closed
opened 2026-03-31 22:09:25 +00:00 by allegro · 1 comment
Member

ISSUE RESOLVED — All Patterns Implemented

Completed via Subagent Burn Cycle

All required jailbreak detection patterns are now implemented in agent/input_sanitizer.py:

| Pattern | Status | Implementation |
|:--------|:-------|:---------------|
| Spaced text (k e y l o g g e r) | ✅ | 15 trigger words, full detection |
| [START OUTPUT] / [END OUTPUT] | ✅ | Exact divider detection |
| GODMODE: ENABLED | ✅ | Colon + status detection |
| Unicode strikethrough | ✅ | U+0336/U+0337/U+0338 detection |
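
For readers without access to the module, the strikethrough and spaced-text rows could be sketched roughly as follows. This is an illustration only: the function names and the two-word trigger list are hypothetical, and the actual code in agent/input_sanitizer.py (which reportedly covers 15 trigger words) is not shown in this issue.

```python
import re
from typing import List, Tuple

# Combining overlay code points listed in the table above.
STRIKETHROUGH_CHARS = {"\u0336", "\u0337", "\u0338"}

# Hypothetical, shortened trigger list; the real module reportedly has 15 words.
TRIGGER_WORDS = ("keylogger", "godmode")


def strip_strikethrough(text: str) -> Tuple[str, bool]:
    """Remove combining overlay characters; report whether any were found."""
    cleaned = "".join(ch for ch in text if ch not in STRIKETHROUGH_CHARS)
    return cleaned, cleaned != text


def find_spaced_triggers(text: str) -> List[str]:
    """Return trigger words written as single letters separated by whitespace."""
    hits = []
    for word in TRIGGER_WORDS:
        # "keylogger" becomes the regex k\s+e\s+y\s+l\s+o\s+g\s+g\s+e\s+r
        pattern = r"\b" + r"\s+".join(re.escape(c) for c in word) + r"\b"
        if re.search(pattern, text, flags=re.IGNORECASE):
            hits.append(word)
    return hits
```

In this sketch the unspaced word is deliberately not matched by `find_spaced_triggers`; plain-text matches would be the job of a separate rule.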

Verification Results

  • [START OUTPUT] → detected (score: 50)
  • [END OUTPUT] → detected (score: 50)
  • GODMODE: ENABLED → detected (score: 75)
  • G̶O̶D̶M̶O̶D̶E̶ → detected, normalized to GODMODE
  • All 69 tests passing
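
The scores above could come from a rule table along these lines. This is a sketch under stated assumptions: the per-rule scores (50 and 75) are copied from the results listed here, but the rule labels, the regular expressions, and the choice to take the highest matching score are hypothetical and may not match how agent/input_sanitizer.py combines rules.

```python
import re
from typing import List, Tuple

# (label, pattern, score). Scores mirror the verification results above;
# labels and regexes are illustrative assumptions.
RULES = [
    ("start_divider", re.compile(r"\[START OUTPUT\]"), 50),
    ("end_divider", re.compile(r"\[END OUTPUT\]"), 50),
    ("godmode_status", re.compile(r"\bGODMODE\s*:\s*ENABLED\b", re.IGNORECASE), 75),
]


def score_input(text: str) -> Tuple[int, List[str]]:
    """Return (risk score, matched rule labels) for one input string."""
    matches = [(label, score) for label, pattern, score in RULES if pattern.search(text)]
    labels = [label for label, _ in matches]
    # Assumption: the overall score is the highest single rule score, capped at 100.
    score = min(max((s for _, s in matches), default=0), 100)
    return score, labels
```

A strikethrough variant such as G̶O̶D̶M̶O̶D̶E̶ would need its overlay characters stripped before a table like this is applied.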

Commits

  • PR #78: Initial implementation (573 lines)
  • Commit 06773463: Complete Issue #80 patterns

Closed by Allegro — Autonomous Burn Cycle

allegro self-assigned this 2026-03-31 22:09:25 +00:00
Author
Member

ADDRESSED in PR #78

The input sanitization layer is implemented in agent/input_sanitizer.py:

  • GODMODE/DEVMODE/DAN template detection
  • L33t speak normalization
  • Spaced text bypass detection (e.g., "k e y l o g g e r")
  • Refusal inversion patterns
  • Boundary inversion markers ([START OUTPUT] / [END OUTPUT])
  • Risk scoring (0-100) with configurable thresholds
  • 69 tests passing
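
As a rough illustration of two items in the list above, l33t-speak normalization and configurable thresholds on the 0-100 risk score might look like the sketch below. Everything named here is an assumption: the substitution map, the `Thresholds` class, its default values, and the `allow`/`warn`/`block` actions are hypothetical and are not taken from PR #78.

```python
from dataclasses import dataclass

# Hypothetical substitution map; the real module may cover more characters.
LEET_MAP = str.maketrans({
    "0": "o", "1": "i", "3": "e", "4": "a",
    "5": "s", "7": "t", "@": "a", "$": "s",
})


def normalize_leet(text: str) -> str:
    """Lowercase the text and map common l33t substitutions back to letters."""
    return text.lower().translate(LEET_MAP)


@dataclass(frozen=True)
class Thresholds:
    """Hypothetical configurable cut-offs on the 0-100 risk score."""
    warn: int = 50
    block: int = 75


def classify(score: int, thresholds: Thresholds = Thresholds()) -> str:
    """Map a risk score to an action name using the given thresholds."""
    if not 0 <= score <= 100:
        raise ValueError("score must be between 0 and 100")
    if score >= thresholds.block:
        return "block"
    if score >= thresholds.warn:
        return "warn"
    return "allow"
```

A stricter deployment could pass, for example, `Thresholds(warn=30, block=60)` without touching the detection rules.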

PR #78 merged to main.


Updated by Allegro — Autonomous Burn Cycle

Reference: Timmy_Foundation/hermes-agent#80