refactor: reorganize skills into sub-categories

The skills directory was getting disorganized — mlops alone had 40 skills in a flat list, and 12 categories were singletons with just one skill each. Code change: - prompt_builder.py: Support sub-categories in skill scanner. skills/mlops/training/axolotl/SKILL.md now shows as category 'mlops/training' instead of just 'mlops'. Backwards-compatible with existing flat structure. Split mlops (40 skills) into 7 sub-categories: - mlops/training (12): accelerate, axolotl, flash-attention, grpo-rl-training, peft, pytorch-fsdp, pytorch-lightning, simpo, slime, torchtitan, trl-fine-tuning, unsloth - mlops/inference (8): gguf, guidance, instructor, llama-cpp, obliteratus, outlines, tensorrt-llm, vllm - mlops/models (6): audiocraft, clip, llava, segment-anything, stable-diffusion, whisper - mlops/vector-databases (4): chroma, faiss, pinecone, qdrant - mlops/evaluation (5): huggingface-tokenizers, lm-evaluation-harness, nemo-curator, saelens, weights-and-biases - mlops/cloud (2): lambda-labs, modal - mlops/research (1): dspy Merged singleton categories: - gifs → media (gif-search joins youtube-content) - music-creation → media (heartmula, songsee) - diagramming → creative (excalidraw joins ascii-art) - ocr-and-documents → productivity - domain → research (domain-intel) - feeds → research (blogwatcher) - market-data → research (polymarket) Fixed misplaced skills: - mlops/code-review → software-development (not ML-specific) - mlops/ml-paper-writing → research (academic writing) Added DESCRIPTION.md files for all new/updated categories.
2026-03-09 03:35:53 -07:00
parent d6c710706f
commit 732c66b0f3
217 changed files with 39 additions and 4 deletions
--- a/skills/mlops/evaluation/saelens/references/tutorials.md
+++ b/skills/mlops/evaluation/saelens/references/tutorials.md
@@ -0,0 +1,318 @@
+# SAELens Tutorials
+
+## Tutorial 1: Loading and Analyzing Pre-trained SAEs
+
+### Goal
+Load a pre-trained SAE and analyze which features activate on specific inputs.
+
+### Step-by-Step
+
+```python
+from transformer_lens import HookedTransformer
+from sae_lens import SAE
+import torch
+
+# 1. Load model and SAE
+model = HookedTransformer.from_pretrained("gpt2-small", device="cuda")
+sae, cfg_dict, sparsity = SAE.from_pretrained(
+    release="gpt2-small-res-jb",
+    sae_id="blocks.8.hook_resid_pre",
+    device="cuda"
+)
+
+print(f"SAE input dim: {sae.cfg.d_in}")
+print(f"SAE hidden dim: {sae.cfg.d_sae}")
+print(f"Expansion factor: {sae.cfg.d_sae / sae.cfg.d_in:.1f}x")
+
+# 2. Get model activations
+prompt = "The capital of France is Paris"
+tokens = model.to_tokens(prompt)
+_, cache = model.run_with_cache(tokens)
+activations = cache["resid_pre", 8]  # [1, seq_len, 768]
+
+# 3. Encode to SAE features
+features = sae.encode(activations)  # [1, seq_len, d_sae]
+
+# 4. Analyze sparsity
+active_per_token = (features > 0).sum(dim=-1)
+print(f"Average active features per token: {active_per_token.float().mean():.1f}")
+
+# 5. Find top features for each token
+str_tokens = model.to_str_tokens(prompt)
+for pos in range(len(str_tokens)):
+    top_features = features[0, pos].topk(5)
+    print(f"\nToken '{str_tokens[pos]}':")
+    for feat_idx, feat_val in zip(top_features.indices, top_features.values):
+        print(f"  Feature {feat_idx.item()}: {feat_val.item():.3f}")
+
+# 6. Check reconstruction quality
+reconstructed = sae.decode(features)
+mse = ((activations - reconstructed) ** 2).mean()
+print(f"\nReconstruction MSE: {mse.item():.6f}")
+```
+
+---
+
+## Tutorial 2: Training a Custom SAE
+
+### Goal
+Train a Sparse Autoencoder on GPT-2 activations.
+
+### Step-by-Step
+
+```python
+from sae_lens import LanguageModelSAERunnerConfig, SAETrainingRunner
+
+# 1. Configure training
+cfg = LanguageModelSAERunnerConfig(
+    # Model
+    model_name="gpt2-small",
+    hook_name="blocks.6.hook_resid_pre",
+    hook_layer=6,
+    d_in=768,
+
+    # SAE architecture
+    architecture="standard",
+    d_sae=768 * 8,  # 8x expansion
+    activation_fn="relu",
+
+    # Training
+    lr=4e-4,
+    l1_coefficient=8e-5,
+    l1_warm_up_steps=1000,
+    train_batch_size_tokens=4096,
+    training_tokens=10_000_000,  # Small run for demo
+
+    # Data
+    dataset_path="monology/pile-uncopyrighted",
+    streaming=True,
+    context_size=128,
+
+    # Dead feature prevention
+    use_ghost_grads=True,
+    dead_feature_window=5000,
+
+    # Logging
+    log_to_wandb=True,
+    wandb_project="sae-training-demo",
+
+    # Hardware
+    device="cuda",
+    dtype="float32",
+)
+
+# 2. Train
+runner = SAETrainingRunner(cfg)
+sae = runner.run()
+
+# 3. Save
+sae.save_model("./my_trained_sae")
+```
+
+### Hyperparameter Tuning Guide
+
+| If you see... | Try... |
+|---------------|--------|
+| High L0 (>200) | Increase `l1_coefficient` |
+| Low CE recovery (<80%) | Decrease `l1_coefficient`, increase `d_sae` |
+| Many dead features (>5%) | Enable `use_ghost_grads`, increase `l1_warm_up_steps` |
+| Training instability | Lower `lr`, increase `lr_warm_up_steps` |
+
+---
+
+## Tutorial 3: Feature Attribution and Steering
+
+### Goal
+Identify which SAE features contribute to specific predictions and use them for steering.
+
+### Step-by-Step
+
+```python
+from transformer_lens import HookedTransformer
+from sae_lens import SAE
+import torch
+
+model = HookedTransformer.from_pretrained("gpt2-small", device="cuda")
+sae, _, _ = SAE.from_pretrained(
+    release="gpt2-small-res-jb",
+    sae_id="blocks.8.hook_resid_pre",
+    device="cuda"
+)
+
+# 1. Feature attribution for a specific prediction
+prompt = "The capital of France is"
+tokens = model.to_tokens(prompt)
+_, cache = model.run_with_cache(tokens)
+activations = cache["resid_pre", 8]
+features = sae.encode(activations)
+
+# Target token
+target_token = model.to_single_token(" Paris")
+
+# Compute feature contributions to target logit
+# contribution = feature_activation * decoder_weight * unembedding
+W_dec = sae.W_dec  # [d_sae, d_model]
+W_U = model.W_U    # [d_model, d_vocab]
+
+# Feature direction projected to vocabulary
+feature_to_logit = W_dec @ W_U  # [d_sae, d_vocab]
+
+# Contribution of each feature to "Paris" at final position
+feature_acts = features[0, -1]  # [d_sae]
+contributions = feature_acts * feature_to_logit[:, target_token]
+
+# Top contributing features
+top_features = contributions.topk(10)
+print("Top features contributing to 'Paris':")
+for idx, val in zip(top_features.indices, top_features.values):
+    print(f"  Feature {idx.item()}: {val.item():.3f}")
+
+# 2. Feature steering
+def steer_with_feature(feature_idx, strength=5.0):
+    """Add a feature direction to the residual stream."""
+    feature_direction = sae.W_dec[feature_idx]  # [d_model]
+
+    def hook(activation, hook_obj):
+        activation[:, -1, :] += strength * feature_direction
+        return activation
+
+    output = model.generate(
+        tokens,
+        max_new_tokens=10,
+        fwd_hooks=[("blocks.8.hook_resid_pre", hook)]
+    )
+    return model.to_string(output[0])
+
+# Try steering with top feature
+top_feature_idx = top_features.indices[0].item()
+print(f"\nSteering with feature {top_feature_idx}:")
+print(steer_with_feature(top_feature_idx, strength=10.0))
+```
+
+---
+
+## Tutorial 4: Feature Ablation
+
+### Goal
+Test the causal importance of features by ablating them.
+
+### Step-by-Step
+
+```python
+from transformer_lens import HookedTransformer
+from sae_lens import SAE
+import torch
+
+model = HookedTransformer.from_pretrained("gpt2-small", device="cuda")
+sae, _, _ = SAE.from_pretrained(
+    release="gpt2-small-res-jb",
+    sae_id="blocks.8.hook_resid_pre",
+    device="cuda"
+)
+
+prompt = "The capital of France is"
+tokens = model.to_tokens(prompt)
+
+# Baseline prediction
+baseline_logits = model(tokens)
+target_token = model.to_single_token(" Paris")
+baseline_prob = torch.softmax(baseline_logits[0, -1], dim=-1)[target_token].item()
+print(f"Baseline P(Paris): {baseline_prob:.4f}")
+
+# Get features to ablate
+_, cache = model.run_with_cache(tokens)
+activations = cache["resid_pre", 8]
+features = sae.encode(activations)
+top_features = features[0, -1].topk(10).indices
+
+# Ablate top features one by one
+for feat_idx in top_features:
+    def ablation_hook(activation, hook, feat_idx=feat_idx):
+        # Encode → zero feature → decode
+        feats = sae.encode(activation)
+        feats[:, :, feat_idx] = 0
+        return sae.decode(feats)
+
+    ablated_logits = model.run_with_hooks(
+        tokens,
+        fwd_hooks=[("blocks.8.hook_resid_pre", ablation_hook)]
+    )
+    ablated_prob = torch.softmax(ablated_logits[0, -1], dim=-1)[target_token].item()
+    change = (ablated_prob - baseline_prob) / baseline_prob * 100
+    print(f"Ablate feature {feat_idx.item()}: P(Paris)={ablated_prob:.4f} ({change:+.1f}%)")
+```
+
+---
+
+## Tutorial 5: Comparing Features Across Prompts
+
+### Goal
+Find which features activate consistently for a concept.
+
+### Step-by-Step
+
+```python
+from transformer_lens import HookedTransformer
+from sae_lens import SAE
+import torch
+
+model = HookedTransformer.from_pretrained("gpt2-small", device="cuda")
+sae, _, _ = SAE.from_pretrained(
+    release="gpt2-small-res-jb",
+    sae_id="blocks.8.hook_resid_pre",
+    device="cuda"
+)
+
+# Test prompts about the same concept
+prompts = [
+    "The Eiffel Tower is located in",
+    "Paris is the capital of",
+    "France's largest city is",
+    "The Louvre museum is in",
+]
+
+# Collect feature activations
+all_features = []
+for prompt in prompts:
+    tokens = model.to_tokens(prompt)
+    _, cache = model.run_with_cache(tokens)
+    activations = cache["resid_pre", 8]
+    features = sae.encode(activations)
+    # Take max activation across positions
+    max_features = features[0].max(dim=0).values
+    all_features.append(max_features)
+
+all_features = torch.stack(all_features)  # [n_prompts, d_sae]
+
+# Find features that activate consistently
+mean_activation = all_features.mean(dim=0)
+min_activation = all_features.min(dim=0).values
+
+# Features active in ALL prompts
+consistent_features = (min_activation > 0.5).nonzero().squeeze(-1)
+print(f"Features active in all prompts: {len(consistent_features)}")
+
+# Top consistent features
+top_consistent = mean_activation[consistent_features].topk(min(10, len(consistent_features)))
+print("\nTop consistent features (possibly 'France/Paris' related):")
+for idx, val in zip(top_consistent.indices, top_consistent.values):
+    feat_idx = consistent_features[idx].item()
+    print(f"  Feature {feat_idx}: mean activation {val.item():.3f}")
+```
+
+---
+
+## External Resources
+
+### Official Tutorials
+- [Basic Loading & Analysis](https://github.com/jbloomAus/SAELens/blob/main/tutorials/basic_loading_and_analysing.ipynb)
+- [Training SAEs](https://github.com/jbloomAus/SAELens/blob/main/tutorials/training_a_sparse_autoencoder.ipynb)
+- [Logits Lens with Features](https://github.com/jbloomAus/SAELens/blob/main/tutorials/logits_lens_with_features.ipynb)
+
+### ARENA Curriculum
+Comprehensive SAE course: https://www.lesswrong.com/posts/LnHowHgmrMbWtpkxx/intro-to-superposition-and-sparse-autoencoders-colab
+
+### Key Papers
+- [Towards Monosemanticity](https://transformer-circuits.pub/2023/monosemantic-features) - Anthropic (2023)
+- [Scaling Monosemanticity](https://transformer-circuits.pub/2024/scaling-monosemanticity/) - Anthropic (2024)
+- [Sparse Autoencoders Find Interpretable Features](https://arxiv.org/abs/2309.08600) - ICLR 2024