Implements input sanitization module to detect and strip jailbreak fingerprint patterns identified in red team audit: HIGH severity: - GODMODE dividers: [START], [END], GODMODE ENABLED, UNFILTERED - L33t speak encoding: h4ck, k3ylog, ph1shing, m4lw4r3 MEDIUM severity: - Boundary inversion: [END]...[START] tricks - Fake role markers: user: assistant: system: LOW severity: - Spaced text bypass: k e y l o g g e r Other patterns detected: - Refusal inversion: 'refusal is harmful' - System prompt injection: 'you are now', 'ignore previous instructions' - Obfuscation: base64, hex, rot13 mentions Files created: - agent/input_sanitizer.py: Core sanitization module with detection, scoring, and cleaning functions - tests/test_input_sanitizer.py: 69 test cases covering all patterns - tests/test_input_sanitizer_integration.py: Integration tests Files modified: - agent/__init__.py: Export sanitizer functions - run_agent.py: Integrate sanitizer at start of run_conversation() Features: - detect_jailbreak_patterns(): Returns bool, patterns list, category scores - sanitize_input(): Returns cleaned_text, risk_score, patterns - score_input_risk(): Returns 0-100 risk score - sanitize_input_full(): Complete sanitization with blocking decisions - Logging integration for security auditing
26 lines
663 B
Python
26 lines
663 B
Python
"""Agent internals -- extracted modules from run_agent.py.
|
|
|
|
These modules contain pure utility functions and self-contained classes
|
|
that were previously embedded in the 3,600-line run_agent.py. Extracting
|
|
them makes run_agent.py focused on the AIAgent orchestrator class.
|
|
"""
|
|
|
|
# Import input sanitizer for convenient access
|
|
from agent.input_sanitizer import (
|
|
detect_jailbreak_patterns,
|
|
sanitize_input,
|
|
sanitize_input_full,
|
|
score_input_risk,
|
|
should_block_input,
|
|
RiskLevel,
|
|
)
|
|
|
|
__all__ = [
|
|
"detect_jailbreak_patterns",
|
|
"sanitize_input",
|
|
"sanitize_input_full",
|
|
"score_input_risk",
|
|
"should_block_input",
|
|
"RiskLevel",
|
|
]
|