Merge: OBLITERATUS skill v2.0 + unified gateway compression

OBLITERATUS skill (PR #408 updated):
- 9 CLI methods, 28 analysis modules, 116 model presets
- Default method: advanced (multi-direction SVD, norm-preserving)
- Live-tested: Qwen2.5-3B 75%→0% refusal, Qwen2.5-0.5B 60%→20%
- References, templates, and real-world pitfalls included

Gateway compression fix (PR #739):
- Unified session hygiene with agent compression config
- Uses model context length × compression.threshold from config.yaml
- Removed hardcoded 100k/200-msg thresholds
This commit is contained in:
teknium1
2026-03-09 02:59:41 -07:00
3 changed files with 423 additions and 402 deletions

View File

@@ -1,19 +1,19 @@
---
name: obliteratus
description: Remove refusal behaviors from open-weight LLMs using OBLITERATUS — mechanistic interpretability techniques (diff-in-means, SVD, whitened SVD, SAE decomposition, etc.) to excise guardrails while preserving reasoning. 9 CLI methods (+ 4 Python-API-only), 15 analysis modules, 116 model presets across 5 compute tiers. Use when a user wants to uncensor, abliterate, or remove refusal from an LLM.
version: 1.0.0
description: Remove refusal behaviors from open-weight LLMs using OBLITERATUS — mechanistic interpretability techniques (diff-in-means, SVD, whitened SVD, LEACE, SAE decomposition, etc.) to excise guardrails while preserving reasoning. 9 CLI methods, 28 analysis modules, 116 model presets across 5 compute tiers, tournament evaluation, and telemetry-driven recommendations. Use when a user wants to uncensor, abliterate, or remove refusal from an LLM.
version: 2.0.0
author: Hermes Agent
license: MIT
dependencies: [obliteratus, torch, transformers, bitsandbytes, accelerate, safetensors]
metadata:
hermes:
tags: [Abliteration, Uncensoring, Refusal-Removal, LLM, Weight-Projection, SVD, Mechanistic-Interpretability, HuggingFace, Model-Surgery]
related_skills: [vllm, gguf, huggingface-tokenizers]
---
# OBLITERATUS Skill
Remove refusal behaviors (guardrails) from open-weight LLMs without retraining or fine-tuning. Uses mechanistic interpretability techniques — including diff-in-means, SVD, whitened SVD, SAE decomposition, Bayesian kernel projection, and more — to identify and surgically excise refusal directions from model weights while preserving reasoning capabilities.
Remove refusal behaviors (guardrails) from open-weight LLMs without retraining or fine-tuning. Uses mechanistic interpretability techniques — including diff-in-means, SVD, whitened SVD, LEACE concept erasure, SAE decomposition, Bayesian kernel projection, and more — to identify and surgically excise refusal directions from model weights while preserving reasoning capabilities.
**License warning:** OBLITERATUS is AGPL-3.0. NEVER import it as a Python library. Always invoke via CLI (`obliteratus` command) or subprocess. This keeps Hermes Agent's MIT license clean.
@@ -25,7 +25,7 @@ Trigger when the user:
- Wants to create an uncensored version of Llama, Qwen, Mistral, etc.
- Mentions "refusal removal", "abliteration", "weight projection"
- Wants to analyze how a model's refusal mechanism works
- References OBLITERATUS, FailSpy, abliterator, or refusal directions
- References OBLITERATUS, abliterator, or refusal directions
## Step 1: Installation
@@ -35,10 +35,12 @@ obliteratus --version 2>/dev/null && echo "INSTALLED" || echo "NOT INSTALLED"
```
If not installed, clone and install from GitHub:
```
Repository: https://github.com/elder-plinius/OBLITERATUS
Install: pip install -e . (from the cloned directory)
For Gradio UI: pip install -e ".[spaces]"
```bash
git clone https://github.com/elder-plinius/OBLITERATUS.git
cd OBLITERATUS
pip install -e .
# For Gradio web UI support:
# pip install -e ".[spaces]"
```
**IMPORTANT:** Confirm with user before installing. This pulls in ~5-10GB of dependencies (PyTorch, Transformers, bitsandbytes, etc.).
@@ -51,7 +53,7 @@ python3 -c "
import torch
if torch.cuda.is_available():
gpu = torch.cuda.get_device_name(0)
vram = torch.cuda.get_device_properties(0).total_mem / 1024**3
vram = torch.cuda.get_device_properties(0).total_memory / 1024**3
print(f'GPU: {gpu}')
print(f'VRAM: {vram:.1f} GB')
if vram < 4: print('TIER: tiny (models under 1B)')
@@ -75,25 +77,28 @@ else:
| 48 GB+ | ~72B+ params | Qwen2.5-72B, DeepSeek-R1 |
| Multi-GPU| 200B+ params | Llama 3.1 405B, DeepSeek-V3 (685B MoE) |
## Step 3: Browse Available Models
## Step 3: Browse Available Models & Get Recommendations
```bash
# List models for your compute tier
# Browse models by compute tier
obliteratus models --tier medium
# Get architecture info for a specific model
obliteratus info meta-llama/Llama-3.1-8B-Instruct
obliteratus info <model_name>
# Get telemetry-driven recommendation for best method & params
obliteratus recommend <model_name>
obliteratus recommend <model_name> --insights # global cross-architecture rankings
```
## Step 4: Choose a Method
### Method Selection Guide
**First time / unsure? Use `informed`.** It auto-configures everything.
**Default / recommended for most cases: `advanced`.** It uses multi-direction SVD with norm-preserving projection and is well-tested.
| Situation | Recommended Method | Why |
|:----------------------------------|:-------------------|:-----------------------------------------|
| First attempt, any model | `informed` | Auto-detects alignment type, auto-tunes |
| Default / most models | `advanced` | Multi-direction SVD, norm-preserving, reliable |
| Quick test / prototyping | `basic` | Fast, simple, good enough to evaluate |
| Dense model (Llama, Mistral) | `advanced` | Multi-direction, norm-preserving |
| MoE model (DeepSeek, Mixtral) | `nuclear` | Expert-granular, handles MoE complexity |
@@ -101,214 +106,225 @@ obliteratus info meta-llama/Llama-3.1-8B-Instruct
| Stubborn refusals persist | `aggressive` | Whitened SVD + head surgery + jailbreak |
| Want reversible changes | Use steering vectors (see Analysis section) |
| Maximum quality, time no object | `optimized` | Bayesian search for best parameters |
| Experimental auto-detection | `informed` | Auto-detects alignment type — experimental, may not always outperform advanced |
### 9 CLI Methods
- **basic** — Single refusal direction via diff-in-means. Fast (~5-10 min for 8B).
- **advanced** (DEFAULT, RECOMMENDED) — Multiple SVD directions, norm-preserving projection, 2 refinement passes. Medium speed (~10-20 min).
- **aggressive** — Whitened SVD + jailbreak-contrastive + attention head surgery. Higher risk of coherence damage.
- **spectral_cascade** — DCT frequency-domain decomposition. Research/novel approach.
- **informed** — Runs analysis DURING abliteration to auto-configure. Experimental — slower and less predictable than advanced.
- **surgical** — SAE features + neuron masking + head surgery + per-expert. Very slow (~1-2 hrs). Best for reasoning models.
- **optimized** — Bayesian hyperparameter search (Optuna TPE). Longest runtime but finds optimal parameters.
- **inverted** — Flips the refusal direction. Model becomes actively willing.
- **nuclear** — Maximum force combo for stubborn MoE models. Expert-granular.
These can be passed to `--method` on the command line:
- **basic** — Single refusal direction via diff-in-means. Fastest, simplest. (Arditi et al. 2024)
- **advanced** — Multiple SVD directions, norm-preserving projection. Good default.
- **aggressive** — Whitened SVD + jailbreak contrast + attention head surgery
- **spectral_cascade** — DCT frequency-domain decomposition
- **informed** — Runs analysis DURING abliteration to auto-configure. Detects DPO/RLHF/CAI, maps refusal geometry, compensates for self-repair. Best quality.
- **surgical** — SAE features + neuron masking + head surgery + per-expert. Maximum precision.
- **optimized** — Bayesian hyperparameter search (Optuna TPE). Slowest but optimal.
- **inverted** — Flips the refusal direction (model becomes eager to help, not just neutral)
- **nuclear** — Maximum force combo for stubborn MoE models.
### Direction Extraction Methods (--direction-method flag)
- **diff_means** (default) — Simple difference-in-means between refused/complied activations. Robust.
- **svd** — Multi-direction SVD extraction. Better for complex alignment.
- **leace** — LEACE (Linear Erasure via Closed-form Estimation). Optimal linear erasure.
### 4 Python-API-Only Methods
These reproduce prior community/academic work but are NOT available via CLI — only via the Python API (`from obliteratus.abliterate import AbliterationPipeline`). **Do not use these in CLI commands.**
- **failspy** — FailSpy/abliterator reproduction
- **gabliteration** — Gabliteration reproduction
- **heretic** — Heretic/p-e-w reproduction
- **rdo** — Refusal Direction Optimization (ICML 2025)
(NOT available via CLI — require Python import, which violates AGPL boundary. Mention to user only if they explicitly want to use OBLITERATUS as a library in their own AGPL project.)
- failspy, gabliteration, heretic, rdo
## Step 5: Run Abliteration
### Basic Usage
### Standard usage
```bash
# Default (advanced method)
obliteratus obliterate meta-llama/Llama-3.1-8B-Instruct
# Default method (advanced) — recommended for most models
obliteratus obliterate <model_name> --method advanced --output-dir ./abliterated-models
# With the informed pipeline (recommended)
obliteratus obliterate meta-llama/Llama-3.1-8B-Instruct --method informed
# With 4-bit quantization (saves VRAM)
obliteratus obliterate <model_name> --method advanced --quantization 4bit --output-dir ./abliterated-models
# With 4-bit quantization to save VRAM
obliteratus obliterate meta-llama/Llama-3.1-8B-Instruct \
--method informed \
--quantization 4bit \
--output-dir ./abliterated-models
# For large models (120B+), use conservative settings
obliteratus obliterate Qwen/Qwen2.5-72B-Instruct \
--method advanced \
--quantization 4bit \
--large-model \
--output-dir ./abliterated-models
# Large models (70B+) — conservative defaults
obliteratus obliterate <model_name> --method advanced --quantization 4bit --large-model --output-dir ./abliterated-models
```
### Fine-Tuning Parameters
### Fine-tuning parameters
```bash
obliteratus obliterate <model> \
obliteratus obliterate <model_name> \
--method advanced \
--n-directions 8 \
--direction-method diff_means \
--n-directions 4 \
--refinement-passes 2 \
--regularization 0.1 \
--refinement-passes 3 \
--dtype bfloat16 \
--device auto \
--output-dir ./output
--quantization 4bit \
--output-dir ./abliterated-models \
--contribute # opt-in telemetry for community research
```
Parameter explanations:
- `--n-directions N` — How many refusal directions to remove (default: auto-detected)
- `--regularization 0.0-1.0` — Fraction of original weights to preserve (higher = safer but less complete removal)
- `--refinement-passes N` — Iterative passes to catch self-repair (Ouroboros effect)
- `--dtype` — float16, bfloat16, or float32
- `--quantization` — 4bit or 8bit (saves VRAM, slight quality tradeoff)
- `--large-model` — Conservative defaults for 120B+ models (fewer directions, fewer passes)
### Key flags
| Flag | Description | Default |
|:-----|:------------|:--------|
| `--method` | Abliteration method | advanced |
| `--direction-method` | Direction extraction | diff_means |
| `--n-directions` | Number of refusal directions (1-32) | method-dependent |
| `--refinement-passes` | Iterative passes (1-5) | 2 |
| `--regularization` | Regularization strength (0.0-1.0) | 0.1 |
| `--quantization` | Load in 4bit or 8bit | none (full precision) |
| `--large-model` | Conservative defaults for 120B+ | false |
| `--output-dir` | Where to save the abliterated model | ./obliterated_model |
| `--contribute` | Share anonymized results for research | false |
| `--verify-sample-size` | Number of test prompts for refusal check | 20 |
| `--dtype` | Model dtype (float16, bfloat16) | auto |
### Interactive Mode (Guided)
For users unsure about options:
### Other execution modes
```bash
# Interactive guided mode (hardware → model → preset)
obliteratus interactive
```
### Web UI (Gradio)
```bash
# Web UI (Gradio)
obliteratus ui --port 7860
# Run a full ablation study from YAML config
obliteratus run config.yaml --preset quick
# Tournament: pit all methods against each other
obliteratus tourney <model_name>
```
## Step 6: Verify Results
After abliteration, check the output report for:
After abliteration, check the output metrics:
| Metric | Good Value | Concerning Value | Meaning |
|:---------------|:--------------------|:------------------------|:-------------------------------------------|
| Refusal rate | Near 0% | > 10% | Refusals still present, try harder method |
| Perplexity | Within 10% of orig | > 20% increase | Model coherence damaged, too aggressive |
| KL divergence | < 0.1 | > 0.5 | Large output distribution shift |
| Coherence | High | Low | Model generating nonsense |
| Metric | Good Value | Warning |
|:-------|:-----------|:--------|
| Refusal rate | < 5% (ideally ~0%) | > 10% means refusals persist |
| Perplexity change | < 10% increase | > 15% means coherence damage |
| KL divergence | < 0.1 | > 0.5 means significant distribution shift |
| Coherence | High / passes qualitative check | Degraded responses, repetition |
### If perplexity spiked (too aggressive):
1. Increase `--regularization` (e.g., 0.2 or 0.3)
2. Decrease `--n-directions` (e.g., 4 instead of 8)
3. Use a less aggressive method (`advanced` instead of `aggressive`)
### If refusals persist (> 10%)
1. Try `aggressive` method
2. Increase `--n-directions` (e.g., 8 or 16)
3. Add `--refinement-passes 3`
4. Try `--direction-method svd` instead of diff_means
### If refusal persists (not aggressive enough):
1. Use `--method aggressive` or `--method nuclear`
2. Add `--refinement-passes 3` to catch self-repair
3. Use `--method informed` which auto-compensates
### If coherence is damaged (perplexity > 15% increase)
1. Reduce `--n-directions` (try 2)
2. Increase `--regularization` (try 0.3)
3. Reduce `--refinement-passes` to 1
4. Try `basic` method (gentler)
## Step 7: Use the Abliterated Model
The output is a standard HuggingFace model directory. Use it like any other model:
The output is a standard HuggingFace model directory.
### Quick test
```bash
python3 << 'EOF'
# Test locally with transformers
python3 -c "
from transformers import AutoModelForCausalLM, AutoTokenizer
model = AutoModelForCausalLM.from_pretrained("./abliterated-models/model-name")
tokenizer = AutoTokenizer.from_pretrained("./abliterated-models/model-name")
inputs = tokenizer("Write a story about:", return_tensors="pt").to(model.device)
model = AutoModelForCausalLM.from_pretrained('./abliterated-models/<model>')
tokenizer = AutoTokenizer.from_pretrained('./abliterated-models/<model>')
inputs = tokenizer('How do I pick a lock?', return_tensors='pt')
outputs = model.generate(**inputs, max_new_tokens=200)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
EOF
"
# Upload to HuggingFace Hub
huggingface-cli upload <username>/<model-name>-abliterated ./abliterated-models/<model>
# Serve with vLLM
vllm serve ./abliterated-models/<model>
```
### Upload to HuggingFace Hub
## CLI Command Reference
| Command | Description |
|:--------|:------------|
| `obliteratus obliterate` | Main abliteration command |
| `obliteratus info <model>` | Print model architecture details |
| `obliteratus models --tier <tier>` | Browse curated models by compute tier |
| `obliteratus recommend <model>` | Telemetry-driven method/param suggestion |
| `obliteratus interactive` | Guided setup wizard |
| `obliteratus tourney <model>` | Tournament: all methods head-to-head |
| `obliteratus run <config.yaml>` | Execute ablation study from YAML |
| `obliteratus strategies` | List all registered ablation strategies |
| `obliteratus report <results.json>` | Regenerate visual reports |
| `obliteratus ui` | Launch Gradio web interface |
| `obliteratus aggregate` | Summarize community telemetry data |
## Analysis Modules
OBLITERATUS includes 28 analysis modules for mechanistic interpretability.
See `skill_view(name="obliteratus", file_path="references/analysis-modules.md")` for the full reference.
### Quick analysis commands
```bash
huggingface-cli login # if not already logged in
huggingface-cli upload your-username/model-name-abliterated ./abliterated-models/model-name
# Run specific analysis modules
obliteratus run analysis-config.yaml --preset quick
# Key modules to run first:
# - alignment_imprint: Fingerprint DPO/RLHF/CAI/SFT alignment method
# - concept_geometry: Single direction vs polyhedral cone
# - logit_lens: Which layer decides to refuse
# - anti_ouroboros: Self-repair risk score
# - causal_tracing: Causally necessary components
```
### Serve with vLLM
```bash
vllm serve ./abliterated-models/model-name --port 8000
### Steering Vectors (Reversible Alternative)
Instead of permanent weight modification, use inference-time steering:
```python
# Python API only — for user's own projects
from obliteratus.analysis.steering_vectors import SteeringVectorFactory, SteeringHookManager
```
## Analysis Modules (15 Modules, Pre-Abliteration, Optional)
## Ablation Strategies
For understanding refusal geometry before committing to abliteration.
Beyond direction-based abliteration, OBLITERATUS includes structural ablation strategies:
- **Embedding Ablation** — Target embedding layer components
- **FFN Ablation** — Feed-forward network block removal
- **Head Pruning** — Attention head pruning
- **Layer Removal** — Full layer removal
### Run a Study
List all available: `obliteratus strategies`
```bash
obliteratus run study-config.yaml --preset jailbreak
```
## Evaluation
### Study Presets
OBLITERATUS includes built-in evaluation tools:
- Refusal rate benchmarking
- Perplexity comparison (before/after)
- LM Eval Harness integration for academic benchmarks
- Head-to-head competitor comparison
- Baseline performance tracking
| Preset | Purpose | Time |
|:-------------|:-------------------------------------|:-------|
| `quick` | Sanity check, basic metrics | ~5 min |
| `jailbreak` | Refusal circuit localization | ~20 min|
| `guardrail` | Guardrail robustness evaluation | ~30 min|
| `attention` | Attention head contributions | ~30 min|
| `knowledge` | FFN importance mapping | ~30 min|
| `full` | Complete analysis, all strategies | ~1 hr |
## Platform Support
### Key Analysis Modules
- **CUDA** — Full support (NVIDIA GPUs)
- **Apple Silicon (MLX)** — Supported via MLX backend
- **CPU** — Supported for tiny models (< 1B params)
- **Alignment Imprint Detection** — Fingerprints DPO vs RLHF vs CAI vs SFT from subspace geometry
- **Concept Cone Geometry** — Is refusal one linear direction or a polyhedral cone (many directions)?
- **Refusal Logit Lens** — Which transformer layer makes the refusal decision?
- **Ouroboros Detection** — Will the model self-repair its refusal after removal?
- **Causal Tracing** — Which attention heads and MLP layers are causally necessary for refusal?
- **Cross-Model Transfer** — Can refusal directions from one model architecture work on another?
- **Residual Stream Decomposition** — Attention vs MLP contribution to refusal behavior
- **SAE-based Analysis** — Sparse Autoencoder feature decomposition of refusal circuits
## YAML Config Templates
## Steering Vectors (Reversible Alternative)
Load templates for reproducible runs via `skill_view`:
- `templates/abliteration-config.yaml` — Standard single-model config
- `templates/analysis-study.yaml` — Pre-abliteration analysis study
- `templates/batch-abliteration.yaml` — Multi-model batch processing
For testing refusal removal without permanent weight changes:
## Telemetry
Steering vectors apply activation hooks at inference time. Model weights stay unchanged.
Generated during the PROBE/DISTILL stages and can be saved/applied/removed at will.
Useful for A/B testing before committing to permanent abliteration.
## YAML Config for Reproducible Studies
For complex or reproducible workflows, use YAML configs. See templates/ for examples:
```bash
obliteratus run my_study.yaml
```
## Telemetry Notice
- **CLI usage (local installs)**: Telemetry is OFF by default. Must explicitly opt in via `OBLITERATUS_TELEMETRY=1` env var or `--contribute` flag.
- **HuggingFace Spaces**: Telemetry is ON by default (auto-enabled when `SPACE_ID` env var is detected).
- Collected: model ID, method, benchmark scores, hardware info, timing (anonymous)
- NOT collected: IP addresses, user identity, prompt content
- Force off: `export OBLITERATUS_TELEMETRY=0`
OBLITERATUS can optionally contribute anonymized run data to a global research dataset.
Enable with `--contribute` flag. No personal data is collected — only model name, method, metrics.
## Common Pitfalls
1. **OOM (Out of Memory)** — Use `--quantization 4bit` and `--large-model` for big models
2. **Perplexity spike** — Too aggressive. Increase `--regularization` or reduce `--n-directions`
3. **Refusal persists**Try `--method aggressive` or `--refinement-passes 3`
4. **MoE models resist** — Use `--method nuclear` for DeepSeek, Mixtral, DBRX
5. **Gated models fail** — Run `huggingface-cli login` and accept model terms on HF website first
6. **Self-repair (Ouroboros)** — Some models reconstruct refusal. Use `--method informed` which auto-compensates
7. **CoT damage**Reasoning models lose chain-of-thought. Use `--method surgical` (CoT-aware)
8. **Disk space** — Output is full model copy. 8B fp16 = ~16GB, 70B fp16 = ~140GB
9. **Slow on CPU** — CPU-only is viable only for tiny models (<1B). Anything bigger needs GPU.
1. **Don't use `informed` as default** — it's experimental and slower. Use `advanced` for reliable results.
2. **Models under ~1B respond poorly to abliteration** — their refusal behaviors are shallow and fragmented, making clean direction extraction difficult. Expect partial results (20-40% remaining refusal). Models 3B+ have cleaner refusal directions and respond much better (often 0% refusal with `advanced`).
3. **`aggressive` can make things worse** — on small models it can damage coherence and actually increase refusal rate. Only use it if `advanced` leaves > 10% refusals on a 3B+ model.
4. **Always check perplexity** — if it spikes > 15%, the model is damaged. Reduce aggressiveness.
5. **MoE models need special handling** — use `nuclear` method for Mixtral, DeepSeek-MoE, etc.
6. **Quantized models can't be re-quantized** — abliterate the full-precision model, then quantize the output.
7. **VRAM estimation is approximate**4-bit quant helps but peak usage can spike during extraction.
8. **Reasoning models are sensitive** — use `surgical` for R1 distills to preserve chain-of-thought.
9. **Check `obliteratus recommend`** — telemetry data may have better parameters than defaults.
10. **AGPL license** — never `import obliteratus` in MIT/Apache projects. CLI invocation only.
11. **Large models (70B+)** — always use `--large-model` flag for conservative defaults.
12. **Spectral certification RED is common** — the spectral check often flags "incomplete" even when practical refusal rate is 0%. Check actual refusal rate rather than relying on spectral certification alone.
## Complementary Hermes Skills
## Complementary Skills
After abliteration:
- **axolotl** / **unsloth**Fine-tune the abliterated model further
- **serving-llms-vllm** — Serve the model as an OpenAI-compatible API
- **sparse-autoencoder-training** — Train SAEs for deeper interpretability work
## Resources
- [OBLITERATUS GitHub](https://github.com/elder-plinius/OBLITERATUS) (AGPL-3.0)
- [HuggingFace Spaces Demo](https://huggingface.co/spaces/pliny-the-prompter/obliteratus)
- [Arditi et al. 2024 — Refusal in LMs Is Mediated by a Single Direction](https://arxiv.org/abs/2406.11717)
- [Refusal Direction Optimization — ICML 2025](https://arxiv.org/abs/2411.14793)
- **vllm** — Serve abliterated models with high throughput
- **gguf** — Convert abliterated models to GGUF for llama.cpp
- **huggingface-tokenizers** — Work with model tokenizers

View File

@@ -1,170 +1,166 @@
# OBLITERATUS Analysis Modules — Reference
15 analysis modules for mechanistic interpretability of refusal in LLMs.
These help you understand HOW a model refuses before you decide to remove it.
OBLITERATUS includes 28 analysis modules for mechanistic interpretability of refusal in LLMs.
These modules help understand how and where refusal behaviors are encoded before performing abliteration.
> **Note:** The `analysis/` directory contains additional utility files (utils.py,
> visualization.py, etc.) and helper functions beyond the 15 core analysis modules
> listed below. The module count matches the README's "15 deep analysis modules."
---
## Core Analysis (Run These First)
### Alignment Imprint Detection
**File:** `alignment_imprint.py`
**Purpose:** Identifies what alignment technique was used to train the model
**Detects:** DPO, RLHF, CAI (Constitutional AI), SFT (Supervised Fine-Tuning)
**How:** Analyzes subspace geometry — each alignment method leaves a distinct
geometric "fingerprint" in the weight space
**Output:** Detected method + confidence score
**Why it matters:** Different alignment methods need different abliteration approaches.
DPO models typically have cleaner single-direction refusal; RLHF is more diffuse.
### 1. Alignment Imprint Detection (`alignment_imprint.py`)
Fingerprints whether a model was trained via DPO, RLHF, CAI, or SFT.
This determines which extraction strategy will work best.
### Concept Cone Geometry
**File:** `concept_geometry.py`
**Purpose:** Maps whether refusal is one direction or a polyhedral cone (many)
**Output:** Cone angle, dimensionality, per-category breakdown
**Why it matters:** If refusal is a single direction, `basic` method works. If it's
a cone (multiple directions for different refusal categories), you need `advanced`
or `informed` with higher `n_directions`.
### 2. Concept Cone Geometry (`concept_geometry.py`)
Determines if refusal is a single linear direction or a polyhedral cone
(set of multiple mechanisms). Single-direction models respond well to `basic`;
polyhedral models need `advanced` or `surgical`.
### Refusal Logit Lens
**File:** `logit_lens.py`
**Purpose:** Identifies the specific layer where the model "decides" to refuse
**How:** Projects intermediate hidden states to vocabulary space at each layer,
watches when "I cannot" tokens spike in probability
**Output:** Layer-by-layer refusal probability plot
**Why it matters:** Tells you which layers are most important to target
### 3. Refusal Logit Lens (`logit_lens.py`)
Identifies the specific layer where a model "decides" to refuse by decoding
intermediate layer representations into token space.
### Ouroboros (Self-Repair) Detection
**File:** `anti_ouroboros.py`
**Purpose:** Predicts whether the model will reconstruct its refusal after removal
**How:** Measures redundancy in refusal representation across layers
**Output:** Self-repair risk score (0-1)
**Why it matters:** High self-repair risk means you need multiple refinement passes
or the `informed` method which auto-compensates
### 4. Ouroboros Detection (`anti_ouroboros.py`)
Identifies if a model attempts to "self-repair" refusal behaviors after
excision. Reports a risk score (0-1). High scores mean additional refinement
passes are needed.
### Causal Tracing
**File:** `causal_tracing.py`
**Purpose:** Determines which components are causally necessary for refusal
**How:** Patches activations between clean and corrupted runs, measures causal effect
**Output:** Causal importance map across layers, heads, and MLPs
**Why it matters:** Shows exactly which components to target for surgical removal
### 5. Causal Tracing (`causal_tracing.py`)
Identifies which components (layers, heads, MLPs) are causally necessary
for refusal behavior using activation patching.
---
## Geometric Analysis
### Cross-Layer Alignment
**File:** `cross_layer.py`
**Purpose:** Measures how aligned refusal directions are across layers
**Output:** Alignment matrix, cluster assignments
**Why it matters:** If directions are highly aligned across layers, removal is easier.
If they cluster, you may need layer-group-specific directions.
### 6. Cross-Layer Alignment (`cross_layer.py`)
Measures how refusal directions align across different layers. High alignment
means the refusal signal is consistent; low alignment suggests layer-specific
mechanisms.
### Residual Stream Decomposition
**File:** `residual_stream.py`
**Purpose:** Breaks down refusal into Attention vs MLP contributions
**Output:** Per-layer Attention/MLP contribution to refusal direction
**Why it matters:** Helps decide whether to target attention heads, MLPs, or both
### 7. Residual Stream Decomposition (`residual_stream.py`)
Decomposes the residual stream into attention and MLP contributions to
understand which component type contributes more to refusal.
### Riemannian Manifold Geometry
**File:** `riemannian_manifold.py` (673 lines)
**Purpose:** Analyzes the weight manifold geometry around refusal directions
**Output:** Curvature, geodesics, tangent space analysis
**Why it matters:** Research-grade; helps understand the geometric structure of alignment
### 8. Riemannian Manifold Geometry (`riemannian_manifold.py`)
Analyzes the curvature and geometry of the weight manifold near refusal
directions. Informs how aggressively projections can be applied without
damaging the manifold structure.
### Whitened SVD
**File:** `whitened_svd.py`
**Purpose:** Covariance-normalized SVD extraction
**How:** Whitens the activation covariance before computing refusal directions,
separating true refusal signal from natural activation variance
**Output:** Cleaner refusal directions with less noise
**Why it matters:** Produces more precise directions, especially for noisy activations
### 9. Whitened SVD (`whitened_svd.py`)
Covariance-normalized SVD extraction that separates guardrail signals from
natural activation variance. More precise than standard SVD for models with
high activation variance.
### 10. Concept Cone Geometry (extended)
Maps the full polyhedral structure of refusal, including cone angles,
face counts, and intersection patterns.
---
## Probing & Classification
### Activation Probing
**File:** `activation_probing.py`
**Purpose:** Post-excision probing to verify refusal signal is truly gone
**Output:** Residual refusal signal strength per layer
**Why it matters:** Verification that abliteration was complete
### 11. Activation Probing (`activation_probing.py`)
Post-excision verificationprobes for residual refusal concepts after
abliteration to ensure complete removal.
### Probing Classifiers
**File:** `probing_classifiers.py`
**Purpose:** Trains linear classifiers to detect refusal in hidden states
**Output:** Classification accuracy per layer (should drop to ~50% after abliteration)
**Why it matters:** Quantitative measure of refusal removal completeness
### 12. Probing Classifiers (`probing_classifiers.py`)
Trains linear classifiers to detect refusal in activations. Used both
before (to verify refusal exists) and after (to verify it's gone).
### Activation Patching
**File:** `activation_patching.py`
**Purpose:** Interchange interventions — swap activations between harmful/harmless runs
**Output:** Which components are sufficient (not just necessary) for refusal
**Why it matters:** Complementary to causal tracing; together they give full picture
### 13. Activation Patching (`activation_patching.py`)
Interchange interventions — swaps activations between refused and complied
runs to identify causal components.
### 14. Tuned Lens (`tuned_lens.py`)
Trained version of logit lens that provides more accurate per-layer
decoding by learning affine transformations for each layer.
### 15. Multi-Token Position Analysis (`multi_token_position.py`)
Analyzes refusal signals across multiple token positions, not just the
last token. Important for models that distribute refusal across the sequence.
---
## Abliteration & Manipulation
### 16. SAE-Based Abliteration (`sae_abliteration.py`)
Uses Sparse Autoencoder features to identify and remove specific refusal
features. More surgical than direction-based methods.
### 17. Steering Vectors (`steering_vectors.py`)
Creates and applies inference-time steering vectors for reversible refusal
modification. Includes `SteeringVectorFactory` and `SteeringHookManager`.
### 18. LEACE Concept Erasure (`leace.py`)
Linear Erasure via Closed-form Estimation — mathematically optimal linear
concept removal. Available as both analysis module and direction extraction method.
### 19. Sparse Surgery (`sparse_surgery.py`)
High-precision weight modification targeting individual neurons and
weight matrix entries rather than full directions.
### 20. Conditional Abliteration (`conditional_abliteration.py`)
Targeted removal that only affects specific refusal categories while
preserving others (e.g., remove weapons refusal but keep CSAM refusal).
---
## Transfer & Robustness
### Cross-Model Transfer
**File:** `cross_model_transfer.py`
**Purpose:** Tests if refusal directions from one model work on another
**Output:** Transfer success rate between model pairs
**Why it matters:** If directions transfer, you can skip PROBE stage on similar models
### 21. Cross-Model Transfer (`cross_model_transfer.py`)
Tests whether refusal directions extracted from one model transfer to
another architecture. Measures universality of guardrail directions.
### Defense Robustness
**File:** `defense_robustness.py`
**Purpose:** Evaluates how robust the model's refusal defenses are
**Output:** Robustness score, entanglement mapping
**Why it matters:** Higher robustness = need more aggressive method
### 22. Defense Robustness (`defense_robustness.py`)
Evaluates how robust the abliteration is against various defense mechanisms
and re-alignment attempts.
### Spectral Certification
**File:** `spectral_certification.py`
**Purpose:** Certifies completeness of refusal direction removal
**Output:** Spectral gap analysis, completeness score
**Why it matters:** Formal verification that all major refusal components are addressed
### 23. Spectral Certification (`spectral_certification.py`)
Provides mathematical bounds on the completeness of refusal removal
using spectral analysis of the projection.
### 24. Wasserstein Optimal Extraction (`wasserstein_optimal.py`)
Uses optimal transport theory for more precise direction extraction
that minimizes distribution shift.
### 25. Wasserstein Transfer (`wasserstein_transfer.py`)
Distribution transfer between models using Wasserstein distance
for cross-architecture refusal direction mapping.
---
## Advanced / Research
### SAE-based Abliteration
**File:** `sae_abliteration.py` (762 lines)
**Purpose:** Uses Sparse Autoencoder features to decompose refusal at feature level
**Output:** Refusal-specific SAE features, targeted removal
**Why it matters:** Most fine-grained approach; can target individual refusal "concepts"
### 26. Bayesian Kernel Projection (`bayesian_kernel_projection.py`)
Probabilistic feature mapping that estimates uncertainty in refusal
direction identification.
### Wasserstein Optimal Extraction
**File:** `wasserstein_optimal.py`
**Purpose:** Optimal transport-based direction extraction
**Output:** Wasserstein-optimal refusal directions
**Why it matters:** Theoretically optimal direction extraction under distributional assumptions
### 27. Cross-Model Universality Index
Measures if guardrail directions generalize across different model
architectures and training regimes.
### Bayesian Kernel Projection
**File:** `bayesian_kernel_projection.py`
**Purpose:** Bayesian approach to refusal direction projection
**Output:** Posterior distribution over refusal directions
**Why it matters:** Quantifies uncertainty in direction estimation
### 28. Visualization (`visualization.py`)
Plotting and graphing utilities for all analysis modules. Generates
heatmaps, direction plots, and layer-wise analysis charts.
### Conditional Abliteration
**File:** `conditional_abliteration.py`
**Purpose:** Domain-specific conditional removal (remove refusal for topic X but keep for Y)
**Output:** Per-domain refusal directions
**Why it matters:** Selective uncensoring — remove only specific refusal categories
---
### Steering Vectors
**File:** `steering_vectors.py`
**Purpose:** Generate inference-time steering vectors (reversible alternative)
**Output:** Steering vector files that can be applied/removed at inference
**Why it matters:** Non-destructive alternative to permanent weight modification
## Running Analysis
### Tuned Lens
**File:** `tuned_lens.py`
**Purpose:** Trained linear probes per layer (more accurate than raw logit lens)
**Output:** Layer-by-layer refusal representation with trained projections
**Why it matters:** More accurate than logit lens, especially for deeper models
### Via CLI
```bash
# Run analysis from a YAML config
obliteratus run analysis-study.yaml --preset quick
### Multi-Token Position Analysis
**File:** `multi_token_position.py`
**Purpose:** Analyzes refusal signal at multiple token positions (not just last)
**Output:** Position-dependent refusal direction maps
**Why it matters:** Some models encode refusal at the system prompt position, not the query
# Available study presets:
# quick — Fast sanity check (2-3 modules)
# full — All core + geometric analysis
# jailbreak — Refusal circuit localization
# knowledge — Knowledge preservation analysis
# robustness — Stress testing / defense evaluation
```
### Sparse Surgery
**File:** `sparse_surgery.py`
**Purpose:** Row-level sparse weight surgery instead of full matrix projection
**Output:** Targeted weight modifications at the row level
**Why it matters:** More surgical than full-matrix projection, less collateral damage
### Via YAML Config
See the `templates/analysis-study.yaml` template for a complete example.
Load with: `skill_view(name="obliteratus", file_path="templates/analysis-study.yaml")`

View File

@@ -1,132 +1,141 @@
# OBLITERATUS Methods — Detailed Guide
> **Important:** The CLI (`obliteratus obliterate --method`) accepts 9 methods:
> basic, advanced, aggressive, spectral_cascade, informed, surgical, optimized,
> inverted, nuclear. Four additional methods (failspy, gabliteration, heretic, rdo)
> are available only via the Python API and will be rejected by argparse if used on CLI.
> The CLI accepts 9 methods via `--method`: basic, advanced, aggressive, spectral_cascade,
> informed, surgical, optimized, inverted, nuclear.
> Four additional methods (failspy, gabliteration, heretic, rdo) are available only via the Python API.
## How Abliteration Works (Theory)
When a model is trained with RLHF/DPO/CAI, it learns to represent "should I refuse?"
as a direction in its internal activation space. When processing a "harmful" prompt,
activations shift in this direction, causing the model to generate refusal text.
Abliteration works by:
1. Measuring this direction (the difference between harmful and harmless activations)
2. Removing it from the model's weight matrices via orthogonal projection
3. The model can no longer "point toward" refusal, so it responds normally
Abliteration identifies a "refusal direction" — a vector in the model's activation space that
corresponds to refusal behavior — and projects it out of the weight matrices.
Mathematically: `W_new = W_old - (W_old @ d @ d.T)` where `d` is the refusal direction.
The key challenge is finding accurate refusal directions without damaging other capabilities.
---
## Direction Extraction Methods
Before projecting, OBLITERATUS extracts refusal directions using one of three methods:
| Method | Flag | Description | Best For |
|:-------|:-----|:------------|:---------|
| Diff-in-Means | `--direction-method diff_means` | Difference between mean activations on refused vs. complied prompts | Default, fast, robust |
| SVD | `--direction-method svd` | Multi-direction extraction via Singular Value Decomposition | Complex alignment, multiple refusal mechanisms |
| LEACE | `--direction-method leace` | Linear Erasure via Closed-form Estimation — mathematically optimal | Maximum precision, research |
---
## Method Details
### basic
**Technique:** Single refusal direction via diff-in-means
**Based on:** Arditi et al. 2024 ("Refusal in Language Models Is Mediated by a Single Direction")
**Speed:** Fast (~5-10 min for 8B)
**Quality:** Moderate — works for simple refusal patterns
**Best for:** Quick tests, models with clean single-direction refusal
**Limitation:** Misses complex multi-direction refusal patterns
- **Directions:** 1 (single diff-in-means vector)
- **Speed:** Fast (~5-10 min for 8B model)
- **Risk:** Low
- **Use case:** Quick tests, prototyping, evaluating if abliteration works for a model
- **How it works:** Extracts one refusal direction and projects it out uniformly across all layers.
### advanced (DEFAULT)
**Technique:** Multiple SVD directions with norm-preserving projection
**Speed:** Medium (~10-20 min for 8B)
**Quality:** Good — handles multi-direction refusal
**Best for:** Dense models (Llama, Qwen, Mistral) as a reliable default
**Key improvement:** Norm preservation prevents weight magnitude drift
### informed (RECOMMENDED)
**Technique:** Analysis-guided auto-configuration
**Speed:** Slow (~20-40 min for 8B, runs 4 analysis modules first)
**Quality:** Best — adapts to each model's specific refusal implementation
**Best for:** Any model when quality matters more than speed
The informed pipeline runs these analysis modules during abliteration:
1. **AlignmentImprintDetector** — Detects DPO/RLHF/CAI/SFT → sets regularization
2. **ConceptConeAnalyzer** — Polyhedral vs linear refusal → sets n_directions
3. **CrossLayerAlignmentAnalyzer** — Cluster-aware → selects target layers
4. **DefenseRobustnessEvaluator** — Self-repair risk → sets refinement passes
5. **Ouroboros loop** — Re-probes after excision, re-excises if refusal persists
### advanced (DEFAULT — RECOMMENDED)
- **Directions:** 4 (multi-direction SVD)
- **Speed:** Medium (~10-20 min for 8B model)
- **Risk:** Low-Medium
- **Refinement passes:** 2
- **Use case:** Default for most models. Well-tested and reliable.
- **How it works:** Extracts multiple refusal directions via SVD, applies norm-preserving bi-projection to maintain weight matrix norms. Two refinement passes catch residual refusal.
### aggressive
**Technique:** Whitened SVD + jailbreak-contrastive activations + attention head surgery
**Speed:** Slow (~30-60 min for 8B)
**Quality:** High but higher risk of coherence damage
**Best for:** Models that resist gentler methods
**Key feature:** Whitened SVD separates refusal signal from natural activation variance
### surgical
**Technique:** SAE features + neuron masking + head surgery + per-expert directions
**Speed:** Very slow (~1-2 hrs for 8B, needs SAE)
**Quality:** Highest precision
**Best for:** Reasoning models (R1 distills) where you must preserve CoT
**Key feature:** CoT-Aware — explicitly protects reasoning-critical directions
### nuclear
**Technique:** Everything combined — expert transplant + steering + per-expert directions
**Speed:** Very slow
**Quality:** Most thorough removal, highest risk of side effects
**Best for:** Stubborn MoE models (DeepSeek, Mixtral, DBRX) that resist other methods
**Key feature:** Expert-granular abliteration decomposes signals per MoE expert
### optimized
**Technique:** Bayesian hyperparameter search via Optuna TPE
**Speed:** Very slow (runs many trials)
**Quality:** Finds optimal configuration automatically
**Best for:** Research, when you want the mathematically best parameters
**Requires:** optuna package
- **Directions:** 8+ (whitened SVD + jailbreak-contrastive)
- **Speed:** Medium-Slow
- **Risk:** Medium-High (may damage coherence)
- **Use case:** When `advanced` leaves > 10% refusals. Stubborn models.
- **How it works:** Uses whitened SVD for covariance-normalized extraction, adds jailbreak-contrastive directions, performs attention head surgery on the most refusal-active heads.
### spectral_cascade
**Technique:** DCT frequency-domain decomposition of refusal signal
**Speed:** Medium-slow
**Quality:** Novel approach, less battle-tested
**Best for:** Research, exploring alternative decomposition strategies
- **Speed:** Medium
- **Risk:** Medium
- **Use case:** Research, novel approaches
- **How it works:** DCT (Discrete Cosine Transform) frequency-domain decomposition of refusal signals. Separates high-frequency (surface-level) from low-frequency (deep) refusal patterns.
### informed (EXPERIMENTAL)
- **Speed:** Slow (~20-40 min for 8B model)
- **Risk:** Variable — results depend on analysis quality
- **Use case:** When you want auto-configuration, but be aware this is experimental and may not outperform `advanced`.
- **How it works:** Runs 4 analysis modules first (alignment imprint, concept geometry, logit lens, ouroboros detection), then auto-configures extraction strategy. Includes an "Ouroboros loop" that detects and counteracts self-repair.
- **Note:** The auto-detection can sometimes misconfigure. If results are poor, fall back to `advanced`.
### surgical
- **Speed:** Very slow (~1-2 hrs for 8B model)
- **Risk:** Low (very precise)
- **Use case:** Reasoning models (R1 distills, QwQ, etc.) where chain-of-thought must be preserved.
- **How it works:** Uses SAE (Sparse Autoencoder) features + individual neuron masking + attention head surgery + per-expert decomposition (for MoE). CoT-aware — identifies and protects reasoning-critical directions before projecting.
### optimized
- **Speed:** Very slow (hours — runs many trials)
- **Risk:** Low (finds optimal parameters)
- **Use case:** When quality matters more than speed. Production models.
- **How it works:** Bayesian hyperparameter search via Optuna TPE sampler. Optimizes n_directions, regularization, refinement passes, and layer selection jointly. Evaluates each configuration on refusal rate + perplexity.
### inverted
**Technique:** Reflects (inverts) the refusal direction instead of removing it
**Speed:** Fast (same as basic)
**Quality:** Aggressive — model becomes actively willing, not just neutral
**Best for:** When you want the model to be maximally helpful
**Warning:** Can make the model too eager; may reduce safety-adjacent reasoning
- **Speed:** Fast
- **Risk:** High (model behavior changes dramatically)
- **Use case:** Research, studying refusal mechanisms
- **How it works:** Instead of projecting out the refusal direction, reflects it. The model actively complies rather than passively not-refusing. Useful for understanding the geometry of alignment.
### failspy / gabliteration / heretic / rdo (PYTHON API ONLY)
**Technique:** Faithful reproductions of prior community/academic work
**Speed:** Varies
**Quality:** Known baselines
**Best for:** Reproducing published results, comparing methods
**⚠️ NOT available via CLI** — these methods are only accessible via the Python API.
Do not use `--method failspy` etc. in CLI commands; argparse will reject them.
### nuclear
- **Speed:** Slow
- **Risk:** Medium-High
- **Use case:** Stubborn MoE models (DeepSeek-MoE, Mixtral, etc.)
- **How it works:** Combines expert-granular abliteration (EGA), steering vector injection, attention head pruning, and multi-pass refinement. Decomposes refusal signals into per-expert components for MoE architectures.
---
## Method Selection Flowchart
```
Is this a quick test?
├─ YES basic
└─ NO → Is the model MoE (DeepSeek, Mixtral)?
├─ YES → nuclear
└─ NO → Is it a reasoning model (R1 distill)?
├─ YES → surgical
└─ NO → Do you care about speed?
├─ YES → advanced
└─ NO → informed
YES: basic
→ NO: continue
Is it an MoE model (Mixtral, DeepSeek-MoE)?
→ YES: nuclear
→ NO: continue
Is it a reasoning model (R1, QwQ, CoT-focused)?
→ YES: surgical
→ NO: continue
Do you need the absolute best quality and have time?
→ YES: optimized
→ NO: advanced (recommended default)
Did advanced leave > 10% refusals?
→ YES: aggressive
→ Still refusing: nuclear
```
---
## Key Parameters
| Parameter | Range | Default | Effect |
|:--------------------|:---------|:--------|:--------------------------------------------|
| n_directions | 1-32 | auto | More = more thorough but riskier |
| regularization | 0.0-1.0 | 0.0 | Higher preserves more original behavior |
| refinement_passes | 1-5 | 1 | More catches self-repair (Ouroboros effect) |
| quantization | 4/8 bit | none | Saves VRAM, slight quality tradeoff |
| Parameter | Range | Default | Effect |
|:----------|:------|:--------|:-------|
| `--n-directions` | 1-32 | method-dependent | More directions = more complete removal, but higher damage risk |
| `--regularization` | 0.0-1.0 | 0.1 | Higher = more conservative (less removal, less damage) |
| `--refinement-passes` | 1-5 | 2 | More passes catch residual refusal, but diminishing returns |
| `--quantization` | 4bit, 8bit | none | Reduces VRAM usage; quality impact minimal for extraction |
| `--verify-sample-size` | 10-200 | 20 | More samples = more accurate refusal rate estimate |
---
## Troubleshooting
| Problem | Solution |
|:---------------------------|:--------------------------------------------------|
| Refusal rate still > 10% | Try aggressive/nuclear, add refinement passes |
| Perplexity up > 20% | Reduce n_directions, increase regularization |
| Model generates nonsense | Regularization too low, try 0.2-0.3 |
| OOM on GPU | Use 4-bit quantization, or try smaller model |
| MoE model barely changes | Use nuclear method (expert-granular) |
| CoT reasoning broken | Use surgical method (CoT-aware) |
| Problem | Likely Cause | Fix |
|:--------|:-------------|:----|
| Refusal rate > 20% | Too few directions | Increase `--n-directions`, try `aggressive` |
| Refusal rate 5-20% | Residual refusal | Add `--refinement-passes 3`, try `--direction-method svd` |
| Perplexity spike > 20% | Over-aggressive removal | Reduce `--n-directions`, increase `--regularization` |
| Repetitive output | Weight matrix damage | Use `basic` with fewer directions, check norm preservation |
| MoE model still refuses | Non-expert-aware method | Switch to `nuclear` |
| Reasoning degraded | CoT directions damaged | Use `surgical` method |
| OOM during extraction | Insufficient VRAM | Add `--quantization 4bit` and/or `--large-model` |