---
name: slime-rl-training
description: Provides guidance for LLM post-training with RL using slime, a Megatron+SGLang framework. Use when training GLM models, implementing custom data generation workflows, or needing tight Megatron-LM integration for RL scaling.
version: 1.0.0
author: Orchestra Research
license: MIT
dependencies: [sglang-router>=0.2.3, ray, torch>=2.0.0, transformers>=4.40.0]
metadata:
  hermes:
    tags: [Reinforcement Learning, Megatron-LM, SGLang, GRPO, Post-Training, GLM]
---

# slime: LLM Post-Training Framework for RL Scaling

slime is an LLM post-training framework from Tsinghua's THUDM team, powering GLM-4.5, GLM-4.6, and GLM-4.7. It connects Megatron-LM for training with SGLang for high-throughput rollout generation.

## When to Use slime

**Choose slime when you need:**
- Megatron-LM native training with SGLang inference
- Custom data generation workflows with flexible data buffers
- Training GLM, Qwen3, DeepSeek V3, or Llama 3 models
- Research-grade framework with production backing (Z.ai)

**Consider alternatives when:**
- You need enterprise-grade stability features → use **miles**
- You want flexible backend swapping → use **verl**
- You need PyTorch-native abstractions → use **torchforge**

## Key Features

- **Training**: Megatron-LM with full parallelism support (TP, PP, DP, SP)
- **Rollout**: SGLang-based high-throughput generation with router
- **Data Buffer**: Flexible prompt management and sample storage
- **Models**: GLM-4.x, Qwen3, DeepSeek V3/R1, Llama 3

## Architecture Overview

```
┌─────────────────────────────────────────────────────────┐
│                       Data Buffer                       │
│   - Prompt initialization and management                │
│   - Custom data generation and filtering                │
│   - Rollout sample storage                              │
└─────────────┬───────────────────────────┬───────────────┘
              │                           │
┌─────────────▼───────────┐ ┌─────────────▼───────────────┐
│ Training (Megatron-LM)  │ │ Rollout (SGLang + Router)   │
│ - Actor model training  │ │ - Response generation       │
│ - Critic (optional)     │ │ - Reward/verifier output    │
│ - Weight sync to rollout│ │ - Multi-turn support        │
└─────────────────────────┘ └─────────────────────────────┘
```

## Installation

```bash
# Recommended: Docker
docker pull slimerl/slime:latest
docker run --rm --gpus all --ipc=host --shm-size=16g \
  -it slimerl/slime:latest /bin/bash

# Inside container
cd /root/slime && pip install -e . --no-deps
```

### From Source

```bash
git clone https://github.com/THUDM/slime.git
cd slime
pip install -r requirements.txt
pip install -e .
```

## Quick Start: GRPO Training

```bash
# Source model configuration
source scripts/models/qwen3-4B.sh

# Launch training
python train.py \
  --actor-num-nodes 1 \
  --actor-num-gpus-per-node 4 \
  --rollout-num-gpus 4 \
  --advantage-estimator grpo \
  --use-kl-loss --kl-loss-coef 0.001 \
  --rollout-batch-size 32 \
  --n-samples-per-prompt 8 \
  --global-batch-size 256 \
  --num-rollout 3000 \
  --prompt-data /path/to/data.jsonl \
  ${MODEL_ARGS[@]} ${CKPT_ARGS[@]}
```
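
With `--advantage-estimator grpo`, the 8 samples drawn for each prompt form a group, and each sample's advantage is its reward normalized within that group. A standalone sketch of the standard group-relative computation (illustrative only, not slime's internal code):

```python
# grpo_advantage.py -- standard group-relative advantage (illustrative sketch)
import numpy as np

def grpo_advantages(rewards: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """rewards: (num_prompts, n_samples_per_prompt) -> same-shape advantages."""
    mean = rewards.mean(axis=1, keepdims=True)
    std = rewards.std(axis=1, keepdims=True)
    # Normalize each sample against the other samples of the same prompt
    return (rewards - mean) / (std + eps)

# Two prompts, four samples each; 1.0 = verifier accepted the answer
rewards = np.array([[1.0, 0.0, 0.0, 1.0],
                    [0.0, 0.0, 0.0, 1.0]])
print(grpo_advantages(rewards))
```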

---

## Workflow 1: Standard GRPO Training

Use this workflow for training reasoning models with group-relative advantages.

### Prerequisites Checklist
- [ ] Docker environment or Megatron-LM + SGLang installed
- [ ] Model checkpoint (HuggingFace or Megatron format)
- [ ] Training data in JSONL format

### Step 1: Prepare Data

Each line of `data.jsonl` is one JSON object:

```jsonl
{"prompt": "What is 2 + 2?", "label": "4"}
{"prompt": "Solve: 3x = 12", "label": "x = 4"}
```

Or with chat format:

```json
{
  "prompt": [
    {"role": "system", "content": "You are a math tutor."},
    {"role": "user", "content": "What is 15 + 27?"}
  ],
  "label": "42"
}
```
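
If you generate the file programmatically, a minimal sketch (standard library only; in practice use one prompt style per file):

```python
# make_data.py -- minimal sketch for emitting slime-compatible JSONL
import json

records = [
    {"prompt": "What is 2 + 2?", "label": "4"},
    {"prompt": "Solve: 3x = 12", "label": "x = 4"},
]

with open("data.jsonl", "w") as f:
    for record in records:
        # One JSON object per line, no pretty-printing
        f.write(json.dumps(record, ensure_ascii=False) + "\n")
```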

### Step 2: Configure Model

Choose a pre-configured model script:

```bash
# List available models
ls scripts/models/
# glm4-9B.sh, qwen3-4B.sh, qwen3-30B-A3B.sh, deepseek-v3.sh, llama3-8B.sh, ...

# Source your model
source scripts/models/qwen3-4B.sh
```

### Step 3: Launch Training

```bash
python train.py \
  --actor-num-nodes 1 \
  --actor-num-gpus-per-node 8 \
  --rollout-num-gpus 8 \
  --advantage-estimator grpo \
  --use-kl-loss \
  --kl-loss-coef 0.001 \
  --prompt-data /path/to/train.jsonl \
  --input-key prompt \
  --label-key label \
  --apply-chat-template \
  --rollout-batch-size 32 \
  --n-samples-per-prompt 8 \
  --global-batch-size 256 \
  --num-rollout 3000 \
  --save-interval 100 \
  --eval-interval 50 \
  ${MODEL_ARGS[@]}
```

### Step 4: Monitor Training
- [ ] Check TensorBoard: `tensorboard --logdir outputs/`
- [ ] Verify reward curves are increasing (see the sketch below)
- [ ] Monitor GPU utilization across nodes
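
For a quick programmatic check of the reward curve, TensorBoard's `EventAccumulator` can read the event files directly. The scalar tag `rollout/reward` below is an assumption; print the available tags first and substitute the real one:

```python
# check_rewards.py -- sketch for inspecting logged reward scalars
from tensorboard.backend.event_processing.event_accumulator import EventAccumulator

acc = EventAccumulator("outputs/")  # point at a specific run directory
acc.Reload()
print("scalar tags:", acc.Tags()["scalars"])

# Assumed tag name; replace with one of the tags printed above
events = acc.Scalars("rollout/reward")
rewards = [e.value for e in events]
print(f"steps logged: {len(rewards)}, first: {rewards[0]:.3f}, last: {rewards[-1]:.3f}")
```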

---

## Workflow 2: Asynchronous Training

Use async mode for higher throughput by overlapping rollout generation and training.

### When to Use Async
- Large models with long generation times
- High GPU idle time in synchronous mode
- Sufficient memory for buffering

### Launch Async Training

```bash
python train_async.py \
  --actor-num-nodes 1 \
  --actor-num-gpus-per-node 8 \
  --rollout-num-gpus 8 \
  --advantage-estimator grpo \
  --async-buffer-size 4 \
  --prompt-data /path/to/train.jsonl \
  ${MODEL_ARGS[@]}
```

### Async-Specific Parameters

```bash
--async-buffer-size 4        # Number of rollouts to buffer
--update-weights-interval 2  # Sync weights every N rollouts
```

---

## Workflow 3: Multi-Turn Agentic Training

Use this workflow for training agents with tool use or multi-step reasoning.

### Prerequisites
- [ ] Custom generate function for multi-turn logic
- [ ] Tool/environment interface

### Step 1: Define Custom Generate Function

```python
# custom_generate.py
# generate_single, extract_tool_call, execute_tool, and compute_reward are
# user-supplied helpers: your SGLang generation call, tool-call parser, tool
# executor, and reward function.
async def custom_generate(args, samples, evaluation=False):
    """Multi-turn generation with tool calling."""
    for sample in samples:
        conversation = sample.prompt

        for turn in range(args.max_turns):
            # Generate the next assistant response for the conversation so far
            response = await generate_single(conversation)

            # If the model issued a tool call, run the tool and append both
            # the assistant turn and the tool result, then continue the loop
            tool_call = extract_tool_call(response)
            if tool_call:
                tool_result = execute_tool(tool_call)
                conversation.append({"role": "assistant", "content": response})
                conversation.append({"role": "tool", "content": tool_result})
            else:
                break

        sample.response = response
        sample.reward = compute_reward(sample)

    return samples
```
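
All four helpers are user code. As one illustration, a minimal `extract_tool_call` that parses a JSON payload from `<tool_call>` tags; the tag format is an assumption, so match whatever your chat template actually emits:

```python
# tool_parsing.py -- sketch of a tool-call parser (assumed <tool_call> tag format)
import json
import re

TOOL_CALL_RE = re.compile(r"<tool_call>(.*?)</tool_call>", re.DOTALL)

def extract_tool_call(response: str):
    """Return the parsed tool-call dict, or None if the response has none."""
    match = TOOL_CALL_RE.search(response)
    if match is None:
        return None
    try:
        # e.g. {"name": "search", "arguments": {"query": "..."}}
        return json.loads(match.group(1))
    except json.JSONDecodeError:
        return None  # malformed call: treat as a normal text response
```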

### Step 2: Launch with Custom Function

```bash
python train.py \
  --custom-generate-function-path custom_generate.py \
  --max-turns 5 \
  --prompt-data /path/to/agent_data.jsonl \
  ${MODEL_ARGS[@]}
```

See `examples/search-r1/` for a complete multi-turn search example.

---

## Configuration Reference

### Three Argument Categories

slime uses three types of arguments:

**1. Megatron Arguments** (passed directly):
```bash
--tensor-model-parallel-size 2
--pipeline-model-parallel-size 1
--num-layers 32
--hidden-size 4096
```

**2. SGLang Arguments** (prefixed with `--sglang-`):
```bash
--sglang-mem-fraction-static 0.8
--sglang-context-length 8192
--sglang-log-level INFO
```

**3. slime Arguments**:
```bash
# Resource allocation
--actor-num-nodes 1
--actor-num-gpus-per-node 8
--rollout-num-gpus 8
--colocate                   # Share GPUs between training/inference

# Data
--prompt-data /path/to/data.jsonl
--input-key prompt
--label-key label

# Training loop
--num-rollout 3000
--rollout-batch-size 32
--n-samples-per-prompt 8
--global-batch-size 256

# Algorithm
--advantage-estimator grpo   # or: gspo, ppo, reinforce_plus_plus
--use-kl-loss
--kl-loss-coef 0.001
```

### Key Constraints

```
rollout_batch_size × n_samples_per_prompt = global_batch_size × num_steps_per_rollout
```

Example: 32 × 8 = 256 × 1, i.e. one rollout of 32 prompts with 8 samples each yields 256 training samples, consumed in a single optimizer step per rollout.
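
A quick pre-launch sanity check of this identity (plain Python; the variable names mirror the flags):

```python
# batch_math.py -- verify the batch-size identity before launching
rollout_batch_size = 32
n_samples_per_prompt = 8
global_batch_size = 256

total = rollout_batch_size * n_samples_per_prompt
assert total % global_batch_size == 0, "global_batch_size must divide rollout samples"
print(f"{total} samples per rollout -> {total // global_batch_size} optimizer step(s)")
```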

---

## Data Buffer System

slime's data buffer enables flexible data management:

### Basic Data Source

```python
class RolloutDataSource:
    def get_samples(self, num_samples):
        """Fetch prompts from the dataset."""
        return self.dataset.sample(num_samples)

    def add_samples(self, samples):
        """Called after generation (no-op by default)."""
        pass
```

### Buffered Data Source (Off-Policy)

```python
class RolloutDataSourceWithBuffer(RolloutDataSource):
    def __init__(self):
        super().__init__()
        self.buffer = []

    def add_samples(self, samples):
        """Store generated samples for reuse."""
        self.buffer.extend(samples)

    def buffer_filter(self, args, buffer, num_samples):
        """Custom selection logic (prioritized, stratified, etc.)."""
        # select_best is a user-supplied strategy; one sketch follows below
        return select_best(buffer, num_samples)
```
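
`select_best` is a placeholder. One possible reward-prioritized implementation (a sketch; it assumes each buffered sample exposes a numeric `reward` attribute):

```python
def select_best(buffer, num_samples):
    """Pop the num_samples highest-reward samples from the buffer."""
    ranked = sorted(buffer, key=lambda s: s.reward, reverse=True)
    chosen = ranked[:num_samples]
    for sample in chosen:
        buffer.remove(sample)  # consume selections so they are not replayed forever
    return chosen
```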

---

## Common Issues and Solutions

### Issue: SGLang Engine Crash

**Symptoms**: Inference engine dies mid-training

**Solutions**:
```bash
# Enable fault tolerance
--use-fault-tolerance

# Increase memory allocation
--sglang-mem-fraction-static 0.85

# Reduce batch size
--rollout-batch-size 16
```

### Issue: Weight Sync Timeout

**Symptoms**: Training hangs after rollout

**Solutions**:
```bash
# Increase sync interval
--update-weights-interval 5

# Use colocated mode (no network transfer)
--colocate
```

### Issue: OOM During Training

**Symptoms**: CUDA OOM in the backward pass

**Solutions**:
```bash
# Enable gradient checkpointing
--recompute-activations

# Reduce micro-batch size
--micro-batch-size 1

# Enable sequence parallelism
--sequence-parallel
```

### Issue: Slow Data Loading

**Symptoms**: GPU idle during data fetch

**Solutions**:
```bash
# Increase data workers
--num-data-workers 4

# Use streaming dataset
--streaming-data
```

---

## Supported Models

| Model Family | Configurations |
|--------------|----------------|
| GLM | GLM-4.5, GLM-4.6, GLM-4.7, GLM-Z1-9B |
| Qwen | Qwen3 (4B, 8B, 30B-A3B), Qwen3-MoE, Qwen2.5 |
| DeepSeek | V3, V3.1, R1 |
| Llama | Llama 3 (8B, 70B) |
| Others | Kimi K2, Moonlight-16B |

Each model has pre-configured scripts in `scripts/models/`.

---

## Advanced Topics

### Co-location Mode

Share GPUs between training and inference to reduce the total number of GPUs required; lower `--sglang-mem-fraction-static` so both fit in memory:

```bash
python train.py \
  --colocate \
  --actor-num-gpus-per-node 8 \
  --sglang-mem-fraction-static 0.4 \
  ${MODEL_ARGS[@]}
```

### Custom Reward Model

```python
# custom_rm.py
# load_model and tokenize are user-supplied: load your scoring model and
# build its inputs from prompt/response pairs.
class CustomRewardModel:
    def __init__(self, model_path):
        self.model = load_model(model_path)

    def compute_reward(self, prompts, responses):
        inputs = self.tokenize(prompts, responses)
        scores = self.model(inputs)
        return scores.tolist()
```

```bash
--custom-rm-path custom_rm.py
```
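
For verifiable tasks like the math data in Workflow 1, the reward can be a plain rule instead of a learned model. A sketch under the same `(prompts, responses)` interface shown above, with gold labels looked up from the training file; the interface itself is an assumption, so match what slime actually passes your reward model:

```python
# rule_rm.py -- rule-based reward for verifiable tasks (sketch)
import json

class RuleRewardModel:
    def __init__(self, data_path):
        # Map each prompt string to its gold label from the JSONL data
        # (works for string prompts; chat-format prompts need a hashable key)
        with open(data_path) as f:
            self.labels = {r["prompt"]: r["label"] for r in map(json.loads, f)}

    def compute_reward(self, prompts, responses):
        rewards = []
        for prompt, response in zip(prompts, responses):
            label = self.labels.get(prompt)
            # 1.0 if the gold label appears in the response, else 0.0
            rewards.append(1.0 if label is not None and label.strip() in response else 0.0)
        return rewards
```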

### Multi-Task Evaluation

```bash
--eval-prompt-data aime /path/to/aime.jsonl \
--eval-prompt-data gsm8k /path/to/gsm8k.jsonl \
--n-samples-per-eval-prompt 16
```

---

## Resources

- **Documentation**: https://thudm.github.io/slime/
- **GitHub**: https://github.com/THUDM/slime
- **Blog**: https://lmsys.org/blog/2025-07-09-slime/
- **Examples**: See `examples/` directory for 14+ worked examples