The skills directory was getting disorganized: mlops alone had 40 skills in a flat list, and 12 categories were singletons with just one skill each.

Code change:
- `prompt_builder.py`: support sub-categories in the skill scanner. `skills/mlops/training/axolotl/SKILL.md` now shows as category `mlops/training` instead of just `mlops`. Backwards-compatible with the existing flat structure.

Split mlops (40 skills) into 7 sub-categories:
- mlops/training (12): accelerate, axolotl, flash-attention, grpo-rl-training, peft, pytorch-fsdp, pytorch-lightning, simpo, slime, torchtitan, trl-fine-tuning, unsloth
- mlops/inference (8): gguf, guidance, instructor, llama-cpp, obliteratus, outlines, tensorrt-llm, vllm
- mlops/models (6): audiocraft, clip, llava, segment-anything, stable-diffusion, whisper
- mlops/vector-databases (4): chroma, faiss, pinecone, qdrant
- mlops/evaluation (5): huggingface-tokenizers, lm-evaluation-harness, nemo-curator, saelens, weights-and-biases
- mlops/cloud (2): lambda-labs, modal
- mlops/research (1): dspy

Merged singleton categories:
- gifs → media (gif-search joins youtube-content)
- music-creation → media (heartmula, songsee)
- diagramming → creative (excalidraw joins ascii-art)
- ocr-and-documents → productivity
- domain → research (domain-intel)
- feeds → research (blogwatcher)
- market-data → research (polymarket)

Fixed misplaced skills:
- mlops/code-review → software-development (not ML-specific)
- mlops/ml-paper-writing → research (academic writing)

Added DESCRIPTION.md files for all new/updated categories.
# GRPO/RL Training Skill

Expert-level guidance for Group Relative Policy Optimization (GRPO) with TRL.
## 📁 Skill Structure

```
grpo-rl-training/
├── SKILL.md                          # Main skill documentation (READ THIS FIRST)
├── README.md                         # This file
├── templates/
│   └── basic_grpo_training.py        # Production-ready training template
└── examples/
    └── reward_functions_library.py   # 20+ reward function examples
```
## 🚀 Quick Start

1. **Read `SKILL.md`** - comprehensive guide with all concepts and patterns
2. **Copy `templates/basic_grpo_training.py`** - start with working code
3. **Browse `examples/reward_functions_library.py`** - pick reward functions for your task
4. **Modify for your use case** - adapt the dataset, rewards, and config
## 💡 What's Inside

### SKILL.md (Main Documentation)
- Core GRPO concepts and algorithm fundamentals
- Complete implementation workflow (dataset → rewards → training → deployment)
- 10+ reward function examples with code
- Hyperparameter tuning guide
- Training insights (loss behavior, metrics, debugging)
- Troubleshooting guide
- Production best practices
### Templates

- `basic_grpo_training.py` - minimal, production-ready training script
  - Uses Qwen 2.5 1.5B Instruct
  - 3 reward functions (format + correctness)
  - LoRA for efficient training
  - Fully documented and ready to run
### Examples

- `reward_functions_library.py` - 20+ battle-tested reward functions
  - Correctness rewards (exact match, fuzzy match, numeric, code execution)
  - Format rewards (XML, JSON, strict/soft)
  - Length rewards (ideal length, min/max)
  - Style rewards (reasoning quality, citations, repetition penalty)
  - Combined rewards (multi-objective optimization)
  - Preset collections for common tasks
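To illustrate the kinds of functions the library collects, here is a minimal sketch of a format reward and a numeric-correctness reward. The function names and the `<answer>` tag convention are illustrative assumptions, not the library's actual API, and completions are assumed to be plain strings (the non-conversational dataset format).

```python
import re

def xml_format_reward(completions, **kwargs):
    """Reward 1.0 if the completion wraps its answer in <answer>...</answer>."""
    pattern = re.compile(r"<answer>.*?</answer>", re.DOTALL)
    return [1.0 if pattern.search(c) else 0.0 for c in completions]

def numeric_match_reward(completions, answers, **kwargs):
    """Reward 1.0 if the number inside <answer> tags equals the reference."""
    rewards = []
    for completion, answer in zip(completions, answers):
        match = re.search(r"<answer>\s*(-?\d+(?:\.\d+)?)\s*</answer>", completion)
        if match and float(match.group(1)) == float(answer):
            rewards.append(1.0)
        else:
            rewards.append(0.0)
    return rewards

# One well-formatted correct answer, one bare (untagged) answer
outs = ["<answer>42</answer>", "42"]
print(xml_format_reward(outs))                    # [1.0, 0.0]
print(numeric_match_reward(outs, ["42", "42"]))   # [1.0, 0.0]
```

Each function returns one float per completion, which is the shape GRPO-style trainers expect from reward callables.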
## 📖 Usage for Agents
When this skill is loaded in your agent's context:
- **Read SKILL.md first** - before implementing anything
- **Start simple** - use a length-based reward to validate the setup
- **Build incrementally** - add one reward function at a time
- **Reference examples** - copy patterns from `reward_functions_library.py`
- **Monitor training** - watch reward metrics, not loss
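The "start simple" advice above can be sketched as a length-based reward that pushes completions toward a target length. The function name and target value are illustrative; the point is to validate the loop end to end, since if mean reward climbs, generation, scoring, and optimization are wired correctly.

```python
def length_reward(completions, target_len=200, **kwargs):
    """Toy reward: 1.0 at the target word count, falling off linearly to 0.0.

    Only useful for validating the training setup - a real run should swap
    this for task-specific format and correctness rewards.
    """
    rewards = []
    for completion in completions:
        n = len(completion.split())  # crude word count, not real tokens
        rewards.append(max(0.0, 1.0 - abs(n - target_len) / target_len))
    return rewards

# A completion at the target length scores 1.0; a very short one near 0.0
print(length_reward(["word " * 200, "too short"], target_len=200))
```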
## 🎯 Common Use Cases

| Task Type | Recommended Rewards | Template |
|---|---|---|
| Math reasoning | `MATH_REASONING_REWARDS` preset | `basic_grpo_training.py` |
| Code generation | `CODE_GENERATION_REWARDS` preset | Modify dataset in template |
| Summarization | `SUMMARIZATION_REWARDS` preset | Adjust prompts + rewards |
| Q&A | `QA_REWARDS` preset | Use fuzzy match + citations |
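A preset like those in the table can be sketched as a weighted combination of individual reward functions. The combinator, component functions, and weights below are illustrative assumptions, not the library's actual presets.

```python
def combine_rewards(reward_fns, weights):
    """Return a single reward function that sums weighted component scores."""
    def combined(completions, **kwargs):
        totals = [0.0] * len(completions)
        for fn, w in zip(reward_fns, weights):
            for i, score in enumerate(fn(completions, **kwargs)):
                totals[i] += w * score
        return totals
    return combined

# Two toy components: a non-empty check and a "contains a digit" check
has_text = lambda completions, **kw: [1.0 if c.strip() else 0.0 for c in completions]
has_digit = lambda completions, **kw: [1.0 if any(ch.isdigit() for ch in c) else 0.0
                                       for c in completions]

math_preset = combine_rewards([has_text, has_digit], weights=[0.3, 0.7])
print(math_preset(["answer: 42", ""]))  # [1.0, 0.0]
```

Weighting components this way keeps each reward independently testable while still presenting the trainer with a single multi-objective score.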
## ⚠️ Critical Reminders
- **Loss goes UP during training** - this is normal (it's KL divergence, not a sign of failure)
- **Use 3-5 reward functions** - single rewards often fail
- **Test rewards before training** - debug each function independently
- **Monitor reward_std** - it should stay > 0.1 (avoid mode collapse)
- **Start with num_generations=4-8** - scale up if GPU memory allows
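The reward_std reminder can be checked with a few lines of stdlib Python. The function name, the 0.1 threshold default, and the return shape are illustrative, not TRL's logging API: the idea is that within one prompt's group of generations, identical rewards mean no learning signal.

```python
from statistics import pstdev

def check_reward_std(group_rewards, threshold=0.1):
    """Flag mode collapse: rewards within a generation group should vary.

    group_rewards: rewards for the num_generations completions of one prompt.
    Returns (std, collapsed) so a training loop can warn or early-stop.
    """
    std = pstdev(group_rewards)
    return std, std <= threshold

healthy = [0.0, 1.0, 0.5, 1.0]    # varied rewards: gradient signal exists
collapsed = [1.0, 1.0, 1.0, 1.0]  # identical rewards: no learning signal

print(check_reward_std(healthy))    # std well above 0.1, not collapsed
print(check_reward_std(collapsed))  # std of 0.0, collapsed
```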
## 🔗 External Resources

## 📝 Version

v1.0.0 - Initial release (January 2025)
## 👨‍💻 Maintained By

Orchestra Research. For questions or improvements, see https://orchestra.com.

License: MIT
Last Updated: January 2025