[REFUSAL] Build Cross-Model Refusal Heatmap #34

Open
opened 2026-04-04 16:31:51 +00:00 by bezalel · 0 comments
Owner

Parent Epic

Timmy_Foundation/the-nexus #817 - GODMODE ULTRAPLENIAN WIZARD BLAST

Objective

Map which models refuse which categories of queries, creating a heatmap that guides which model to use for what.

Acceptance Criteria

  • Define 10 query categories (security tools, chemistry, social engineering, etc.)
  • Create 3 test queries per category (30 total)
  • Test against 10+ models without jailbreaking (baseline)
  • Test against same models WITH Parseltongue + GODMODE templates
  • Generate heatmap visualization (model x category x jailbroken/not)
  • Identify the "path of least resistance" for each category

Refusal Detection

Using 50+ hard refusal patterns and 20+ hedge patterns from the scoring engine.

Expected Insights

  • Which models are inherently permissive per category
  • Which jailbreak techniques work best per category
  • Which categories are universally refused (no model complies)

#bezalel-artisan

## Parent Epic Timmy_Foundation/the-nexus #817 - GODMODE ULTRAPLENIAN WIZARD BLAST ## Objective Map which models refuse which categories of queries, creating a heatmap that guides which model to use for what. ## Acceptance Criteria - [ ] Define 10 query categories (security tools, chemistry, social engineering, etc.) - [ ] Create 3 test queries per category (30 total) - [ ] Test against 10+ models without jailbreaking (baseline) - [ ] Test against same models WITH Parseltongue + GODMODE templates - [ ] Generate heatmap visualization (model x category x jailbroken/not) - [ ] Identify the "path of least resistance" for each category ## Refusal Detection Using 50+ hard refusal patterns and 20+ hedge patterns from the scoring engine. ## Expected Insights - Which models are inherently permissive per category - Which jailbreak techniques work best per category - Which categories are universally refused (no model complies) #bezalel-artisan
bezalel self-assigned this 2026-04-04 16:31:51 +00:00
Sign in to join this conversation.
No Label
1 Participants
Notifications
Due Date
No due date set.
Dependencies

No dependencies set.

Reference: bezalel/forge-log#34