# slime Troubleshooting Guide

## Common Issues and Solutions

### SGLang Issues

#### Issue: SGLang Engine Crash

**Symptoms**: Inference engine dies mid-training, connection errors

**Solutions**:

1. **Enable fault tolerance**:

```bash
--use-fault-tolerance
```

2. **Increase memory allocation**:

```bash
--sglang-mem-fraction-static 0.85  # Increase from 0.8
```

3. **Reduce batch size**:

```bash
--rollout-batch-size 16  # Reduce from 32
```

4. **Disable CUDA graphs** (for debugging):

```bash
--sglang-disable-cuda-graph
```

#### Issue: SGLang Router Load Imbalance

**Symptoms**: Some SGLang engines overloaded while others idle

**Solutions**:

1. **Adjust routing strategy**:

```bash
--sglang-router-strategy round_robin
```

2. **Increase number of engines**:

```bash
--rollout-num-gpus-per-engine 1  # More engines, fewer GPUs each
```

### Weight Synchronization Issues

#### Issue: Weight Sync Timeout

**Symptoms**: Training hangs after rollout, timeout errors

**Solutions**:

1. **Increase sync interval** (async mode):

```bash
--update-weights-interval 5  # Increase from 2
```

2. **Use colocated mode** (eliminates network transfer):

```bash
--colocate
```

3. **Check network bandwidth**:

```bash
# Verify InfiniBand is enabled
ibstat
```

#### Issue: Weight Sync Failures in Multi-Node

**Symptoms**: Nodes fail to receive updated weights

**Solutions**:

1. **Set NCCL environment**:

```bash
export NCCL_DEBUG=INFO
export NCCL_SOCKET_IFNAME=eth0
export NCCL_IB_DISABLE=0
```

2. **Increase timeout**:

```bash
export NCCL_TIMEOUT=1800
```

### Memory Issues

#### Issue: OOM During Training

**Symptoms**: CUDA OOM in backward pass

**Solutions**:

1. **Enable gradient checkpointing**:

```bash
--recompute-activations
```

2. **Reduce micro-batch size**:

```bash
--micro-batch-size 1
```

3. **Enable sequence parallelism**:

```bash
--sequence-parallel
```

4. **Reduce global batch size**:

```bash
--global-batch-size 128  # Reduce from 256
```

#### Issue: OOM in Colocated Mode

**Symptoms**: OOM when training and inference run on the same GPUs

**Solutions**:

1. **Reduce SGLang memory**:

```bash
--sglang-mem-fraction-static 0.4  # Reduce from 0.8
```

2. **Enable offloading**:

```bash
--offload-optimizer-states
```

3. **Use a smaller sequence length**:

```bash
--seq-length 2048  # Reduce from 4096
```

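In colocated mode, the static SGLang allocation and the training step's peak memory must fit on the same device. A quick budget check catches an over-subscribed configuration before launch (a sketch with assumed numbers: the 80 GB device size and the 40 GB measured training peak are placeholders for your own measurements):

```python
# Rough GPU memory budget check for colocated mode (assumed numbers).
gpu_mem_gb = 80.0        # total device memory, e.g. an 80 GB accelerator
sglang_fraction = 0.4    # value passed to --sglang-mem-fraction-static
training_peak_gb = 40.0  # measured peak memory of one training step

sglang_gb = gpu_mem_gb * sglang_fraction
headroom_gb = gpu_mem_gb - sglang_gb - training_peak_gb
print(f"SGLang: {sglang_gb:.0f} GB, headroom: {headroom_gb:.0f} GB")
assert headroom_gb > 0, "Reduce --sglang-mem-fraction-static or enable offloading"
```

If the assertion fires, lower the SGLang fraction or offload optimizer states as above.
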
### Data Loading Issues

#### Issue: Slow Data Loading

**Symptoms**: GPU idle during data fetch, low GPU utilization

**Solutions**:

1. **Increase data workers**:

```bash
--num-data-workers 4
```

2. **Use streaming dataset**:

```bash
--streaming-data
```

3. **Pre-tokenize data**:

```python
# Pre-process data offline: tokenize once and save, so training
# does not pay the tokenization cost at load time
import json
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("model_path")
with open("data.jsonl") as fin, open("data.tokenized.jsonl", "w") as fout:
    for line in fin:
        sample = json.loads(line)
        sample["input_ids"] = tokenizer(sample["prompt"])["input_ids"]
        fout.write(json.dumps(sample) + "\n")
```

#### Issue: Data Format Errors

**Symptoms**: KeyError, missing fields, parsing failures

**Solutions**:

1. **Verify data format**:

```python
import json

with open("data.jsonl") as f:
    for line in f:
        data = json.loads(line)
        assert "prompt" in data, "Missing prompt field"
        assert "label" in data, "Missing label field"
```

2. **Check key names**:

```bash
--input-key prompt  # Must match your data
--label-key label   # Must match your data
```

### Training Stability Issues

#### Issue: Loss Explosion / NaN

**Symptoms**: Loss becomes NaN or explodes

**Solutions**:

1. **Reduce learning rate**:

```bash
--lr 1e-6  # Reduce from 5e-6
```

2. **Enable gradient clipping**:

```bash
--clip-grad 1.0
```

3. **Check for data issues**:

```python
# Verify no empty prompts or responses
for sample in dataset:
    assert len(sample["prompt"]) > 0
    assert len(sample["response"]) > 0
```

4. **Use BF16 instead of FP16**:

```bash
--bf16  # More numerically stable
```

#### Issue: Reward Collapse

**Symptoms**: Reward drops to zero, model outputs garbage

**Solutions**:

1. **Increase KL penalty**:

```bash
--kl-loss-coef 0.01  # Increase from 0.001
```

2. **Reduce number of samples**:

```bash
--n-samples-per-prompt 4  # Reduce from 8
```

3. **Verify reward function**:

```python
# Test the reward function independently of training.
# `Sample` and `args` come from your training setup.
from custom_rm import reward_func

sample = Sample(prompt="test", response="test response")
reward = reward_func(args, sample)
print(f"Reward: {reward}")  # Should be a reasonable value
```

### Async Training Issues

#### Issue: Async Training Not Supported with Colocate

**Symptoms**: Error when using `--colocate` with `train_async.py`

**Solution**: Colocated mode is NOT supported for async training. Use separate GPUs:

```bash
# Remove the --colocate flag
python train_async.py \
    --actor-num-gpus-per-node 4 \
    --rollout-num-gpus 4
```

#### Issue: Stale Weights in Async Mode

**Symptoms**: Policy divergence, inconsistent behavior

**Solutions**:

1. **Reduce async buffer size**:

```bash
--async-buffer-size 2  # Reduce from 4
```

2. **Increase weight update frequency**:

```bash
--update-weights-interval 1  # Sync every rollout
```

### Multi-Turn Training Issues

#### Issue: Tool Responses Included in Loss

**Symptoms**: Model learns to output tool responses verbatim

**Solution**: Properly set the loss mask in your custom generate function:

```python
def build_loss_mask(sample):
    """Create a loss mask that excludes tool responses.

    `is_tool_response` is a user-defined helper that decides, from the
    sample metadata, whether a token belongs to a tool response.
    """
    mask = []
    for token in sample.tokens:
        if is_tool_response(token, sample.metadata):
            mask.append(0)  # Don't compute loss on tool output
        else:
            mask.append(1)  # Compute loss on model output
    return mask
```

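To illustrate how such a mask is consumed, a token-level loss can be averaged over only the trainable positions. This is a plain-Python sketch (real training code would do the same with tensors; the per-token loss values here are made up):

```python
# Hypothetical per-token losses and a loss mask (1 = trainable token).
token_losses = [0.5, 1.2, 0.3, 0.8]
loss_mask = [1, 1, 0, 1]  # third token is a tool response, excluded

# Zero out masked positions, then average over unmasked tokens only.
masked = [l * m for l, m in zip(token_losses, loss_mask)]
loss = sum(masked) / max(sum(loss_mask), 1)
print(round(loss, 4))  # → 0.8333
```

Averaging by `sum(loss_mask)` rather than the sequence length keeps the loss scale comparable across samples with different amounts of tool output.
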
#### Issue: Multi-Turn Context Too Long

**Symptoms**: OOM or truncation in multi-turn conversations

**Solutions**:

1. **Limit conversation history**:

```python
# In custom generate function
conversation = sample.prompt[-10:]  # Keep last 10 turns
```

2. **Increase context length**:

```bash
--sglang-context-length 16384
```

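A slightly safer variant of the history trim above keeps the first turn (often a system prompt) while dropping the oldest middle turns. This is a sketch that assumes the prompt is a list of turn dicts; adapt it to however your generate function represents conversations:

```python
def trim_history(turns, max_turns=10):
    """Keep the first turn plus the most recent turns, at most max_turns total."""
    if len(turns) <= max_turns:
        return turns
    return [turns[0]] + turns[-(max_turns - 1):]

# Example: a system turn followed by 20 user turns gets trimmed to 10 turns.
conversation = [{"role": "system", "content": "You are helpful."}] + [
    {"role": "user", "content": f"turn {i}"} for i in range(20)
]
trimmed = trim_history(conversation)
print(len(trimmed))  # → 10
```
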
### Checkpoint Issues

#### Issue: Checkpoint Loading Fails

**Symptoms**: Cannot load saved checkpoint

**Solutions**:

1. **Verify the checkpoint path**:

```bash
ls -la /path/to/checkpoint/
```

2. **Check that parallelism matches**:

```bash
# A checkpoint saved with TP=2 must be loaded with TP=2
--tensor-model-parallel-size 2
```

3. **Convert HuggingFace to Megatron** (if needed):

```bash
python tools/convert_hf_to_megatron.py \
    --hf_model_path /path/to/hf/model \
    --save_path /path/to/megatron/checkpoint
```

### Debugging Tips

#### Enable Verbose Logging

```bash
--log-level DEBUG
export SLIME_DEBUG=1
```

#### Check GPU Utilization

```bash
watch -n 1 nvidia-smi
```

#### Monitor Training

```bash
tensorboard --logdir outputs/
```

#### Test Custom Functions Independently

```python
# Test an async reward function outside of training.
# `Sample` and `args` come from your training setup.
import asyncio

from custom_rm import reward_func


async def test():
    sample = Sample(prompt="test", response="test", label="expected")
    reward = await reward_func(args, sample)
    print(f"Reward: {reward}")


asyncio.run(test())
```

## Constraint Reference

Key constraint to remember:

```
rollout_batch_size × n_samples_per_prompt = global_batch_size × num_steps_per_rollout
```

Example: `32 × 8 = 256 × 1`

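Before launching a run, the constraint can be checked in a few lines of Python. This is a sketch; the variable names simply mirror the CLI flags, and the values are the example above:

```python
# Sanity-check the slime batch-size constraint before launching.
rollout_batch_size = 32
n_samples_per_prompt = 8
global_batch_size = 256
num_steps_per_rollout = 1

lhs = rollout_batch_size * n_samples_per_prompt
rhs = global_batch_size * num_steps_per_rollout
assert lhs == rhs, f"Constraint violated: {lhs} != {rhs}"
print(f"OK: {lhs} samples per rollout")
```

A mismatch here typically surfaces later as a cryptic shape or batching error, so failing fast is cheaper.
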
## Resources

- GitHub Issues: https://github.com/THUDM/slime/issues
- Documentation: https://thudm.github.io/slime/
- Examples: `examples/` directory