# slime Troubleshooting Guide

## Common Issues and Solutions

### SGLang Issues

#### Issue: SGLang Engine Crash

**Symptoms:** inference engine dies mid-training; connection errors.

**Solutions:**

- Enable fault tolerance: `--use-fault-tolerance`
- Increase memory allocation: `--sglang-mem-fraction-static 0.85` (increase from 0.8)
- Reduce the batch size: `--rollout-batch-size 16` (reduce from 32)
- Disable CUDA graphs (for debugging): `--sglang-disable-cuda-graph`

#### Issue: SGLang Router Load Imbalance

**Symptoms:** some SGLang engines are overloaded while others sit idle.

**Solutions:**

- Adjust the routing strategy: `--sglang-router-strategy round_robin`
- Run more engines with fewer GPUs each: `--rollout-num-gpus-per-engine 1`
### Weight Synchronization Issues

#### Issue: Weight Sync Timeout

**Symptoms:** training hangs after rollout; timeout errors.

**Solutions:**

- Increase the sync interval (async mode): `--update-weights-interval 5` (increase from 2)
- Use colocated mode, which eliminates the network transfer: `--colocate`
- Check network bandwidth:

  ```bash
  # Verify InfiniBand is enabled
  ibstat
  ```

#### Issue: Weight Sync Failures in Multi-Node

**Symptoms:** nodes fail to receive updated weights.

**Solutions:**

- Set the NCCL environment variables:

  ```bash
  export NCCL_DEBUG=INFO
  export NCCL_SOCKET_IFNAME=eth0
  export NCCL_IB_DISABLE=0
  ```

- Increase the timeout:

  ```bash
  export NCCL_TIMEOUT=1800
  ```
### Memory Issues

#### Issue: OOM During Training

**Symptoms:** CUDA OOM in the backward pass.

**Solutions:**

- Enable gradient checkpointing: `--recompute-activations`
- Reduce the micro-batch size: `--micro-batch-size 1`
- Enable sequence parallelism: `--sequence-parallel`
- Reduce the global batch size: `--global-batch-size 128` (reduce from 256)

#### Issue: OOM in Colocated Mode

**Symptoms:** OOM when training and inference share the same GPUs.

**Solutions:**

- Reduce SGLang's memory share: `--sglang-mem-fraction-static 0.4` (reduce from 0.8)
- Enable offloading: `--offload-optimizer-states`
- Use a smaller sequence length: `--seq-length 2048` (reduce from 4096)
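To see why lowering the static fraction is the first lever, here is a rough back-of-envelope split of colocated GPU memory; the 80 GB card size is illustrative, not from slime:

```python
gpu_mem_gb = 80        # illustrative: one 80 GB GPU
sglang_fraction = 0.4  # --sglang-mem-fraction-static
inference_gb = gpu_mem_gb * sglang_fraction
training_gb = gpu_mem_gb - inference_gb
print(f"Reserved for inference: {inference_gb:.0f} GB; left for training: {training_gb:.0f} GB")
```

Everything outside the static fraction must hold model weights, gradients, optimizer states, and activations, which is why offloading and shorter sequences are the complementary levers.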
### Data Loading Issues

#### Issue: Slow Data Loading

**Symptoms:** GPU idle during data fetch; low GPU utilization.

**Solutions:**

- Increase the number of data workers: `--num-data-workers 4`
- Use a streaming dataset: `--streaming-data`
- Pre-tokenize the data offline:

  ```python
  # Pre-process data offline: tokenize each prompt and save the result
  import json
  from transformers import AutoTokenizer

  tokenizer = AutoTokenizer.from_pretrained("model_path")
  with open("data.jsonl") as fin, open("data.tokenized.jsonl", "w") as fout:
      for line in fin:
          sample = json.loads(line)
          sample["input_ids"] = tokenizer(sample["prompt"])["input_ids"]
          fout.write(json.dumps(sample) + "\n")
  ```
#### Issue: Data Format Errors

**Symptoms:** `KeyError`, missing fields, parsing failures.

**Solutions:**

- Verify the data format:

  ```python
  import json

  with open("data.jsonl") as f:
      for line in f:
          data = json.loads(line)
          assert "prompt" in data, "Missing prompt field"
          assert "label" in data, "Missing label field"
  ```

- Check that the key names match your data:

  ```bash
  --input-key prompt  # Must match your data
  --label-key label   # Must match your data
  ```
### Training Stability Issues

#### Issue: Loss Explosion / NaN

**Symptoms:** loss becomes NaN or explodes.

**Solutions:**

- Reduce the learning rate: `--lr 1e-6` (reduce from 5e-6)
- Enable gradient clipping: `--clip-grad 1.0`
- Check for data issues:

  ```python
  # Verify there are no empty prompts
  for sample in dataset:
      assert len(sample["prompt"]) > 0
  ```

- Use BF16 instead of FP16: `--bf16` (more numerically stable)

#### Issue: Reward Collapse

**Symptoms:** reward drops to zero; model outputs garbage.

**Solutions:**

- Increase the KL penalty: `--kl-loss-coef 0.01` (increase from 0.001)
- Reduce the number of samples per prompt: `--n-samples-per-prompt 4` (reduce from 8)
- Verify the reward function:

  ```python
  # Test the reward function independently
  from custom_rm import reward_func

  sample = Sample(prompt="test", response="test response")
  reward = reward_func(args, sample)
  print(f"Reward: {reward}")  # Should be a reasonable value
  ```
### Async Training Issues

#### Issue: Async Training Not Supported with Colocate

**Symptoms:** error when using `--colocate` with `train_async.py`.

**Solution:** colocated mode is NOT supported for async training. Use separate GPUs instead:

```bash
# Remove the --colocate flag
python train_async.py \
    --actor-num-gpus-per-node 4 \
    --rollout-num-gpus 4
# No --colocate
```

#### Issue: Stale Weights in Async Mode

**Symptoms:** policy divergence; inconsistent behavior.

**Solutions:**

- Reduce the async buffer size: `--async-buffer-size 2` (reduce from 4)
- Increase the weight-update frequency: `--update-weights-interval 1` (sync every rollout)
### Multi-Turn Training Issues

#### Issue: Tool Responses Included in Loss

**Symptoms:** the model learns to output tool responses verbatim.

**Solution:** set the loss mask correctly in your custom generate function:

```python
def build_loss_mask(sample):
    """Create a loss mask that excludes tool responses."""
    mask = []
    for i, _token in enumerate(sample.tokens):
        if is_tool_response(i, sample.metadata):
            mask.append(0)  # Tool output: don't compute loss
        else:
            mask.append(1)  # Model output: compute loss
    return mask
```
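The `is_tool_response` check is left to the user. A minimal hypothetical sketch, assuming your generate function records each tool response as a half-open `(start, end)` token-index span under `metadata["tool_spans"]` (both the helper name and the metadata key are illustrative, not slime APIs):

```python
def is_tool_response(index, metadata):
    """True if the token at `index` falls inside a recorded tool-response span.

    Assumes metadata["tool_spans"] holds half-open (start, end) index pairs,
    one per tool response, recorded during generation.
    """
    return any(start <= index < end for start, end in metadata.get("tool_spans", []))
```

Recording spans at generation time avoids trying to recognize tool output from token values alone, which is unreliable.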
#### Issue: Multi-Turn Context Too Long

**Symptoms:** OOM or truncation in multi-turn conversations.

**Solutions:**

- Limit the conversation history:

  ```python
  # In your custom generate function
  conversation = sample.prompt[-10:]  # Keep only the last 10 turns
  ```

- Increase the context length: `--sglang-context-length 16384`
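A plain slice of the history can silently drop the system message. A minimal sketch of a truncation helper that preserves it, assuming the prompt is a chat-style list of `{"role", "content"}` dicts (the helper name is illustrative):

```python
def truncate_history(messages, max_turns=10):
    """Keep any system messages plus the last `max_turns` other messages."""
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    return system + rest[-max_turns:]
```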
### Checkpoint Issues

#### Issue: Checkpoint Loading Fails

**Symptoms:** cannot load a saved checkpoint.

**Solutions:**

- Verify the checkpoint path:

  ```bash
  ls -la /path/to/checkpoint/
  ```

- Check that the parallelism settings match:

  ```bash
  # A checkpoint saved with TP=2 must be loaded with TP=2
  --tensor-model-parallel-size 2
  ```

- Convert HuggingFace to Megatron format (if needed):

  ```bash
  python tools/convert_hf_to_megatron.py \
      --hf_model_path /path/to/hf/model \
      --save_path /path/to/megatron/checkpoint
  ```
## Debugging Tips

### Enable Verbose Logging

```bash
--log-level DEBUG
export SLIME_DEBUG=1
```

### Check GPU Utilization

```bash
watch -n 1 nvidia-smi
```

### Monitor Training

```bash
tensorboard --logdir outputs/
```

### Test Custom Functions Independently

```python
# Test an async reward function
import asyncio
from custom_rm import reward_func

async def test():
    sample = Sample(prompt="test", response="test", label="expected")
    reward = await reward_func(args, sample)
    print(f"Reward: {reward}")

asyncio.run(test())
```
## Constraint Reference

Key constraint to remember:

```
rollout_batch_size × n_samples_per_prompt = global_batch_size × num_steps_per_rollout
```

Example: 32 × 8 = 256 × 1 (both sides equal 256).
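It is cheap to verify this constraint before launching a run; a minimal sketch (the function name is illustrative, not a slime API):

```python
def check_batch_constraint(rollout_batch_size, n_samples_per_prompt,
                           global_batch_size, num_steps_per_rollout):
    """Assert rollout_batch_size * n_samples_per_prompt
    == global_batch_size * num_steps_per_rollout."""
    produced = rollout_batch_size * n_samples_per_prompt
    consumed = global_batch_size * num_steps_per_rollout
    assert produced == consumed, (
        f"Constraint violated: rollout produces {produced} samples "
        f"but training consumes {consumed} per rollout"
    )

check_batch_constraint(32, 8, 256, 1)  # the example above: 256 == 256
```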
## Resources

- GitHub Issues: https://github.com/THUDM/slime/issues
- Documentation: https://thudm.github.io/slime/
- Examples: the `examples/` directory