The skills directory was getting disorganized: mlops alone had 40 skills in a flat list, and 12 categories were singletons with just one skill each.

Code change:
- `prompt_builder.py`: support sub-categories in the skill scanner. `skills/mlops/training/axolotl/SKILL.md` now shows as category `mlops/training` instead of just `mlops`. Backwards-compatible with the existing flat structure.

Split mlops (40 skills) into 7 sub-categories:
- mlops/training (12): accelerate, axolotl, flash-attention, grpo-rl-training, peft, pytorch-fsdp, pytorch-lightning, simpo, slime, torchtitan, trl-fine-tuning, unsloth
- mlops/inference (8): gguf, guidance, instructor, llama-cpp, obliteratus, outlines, tensorrt-llm, vllm
- mlops/models (6): audiocraft, clip, llava, segment-anything, stable-diffusion, whisper
- mlops/vector-databases (4): chroma, faiss, pinecone, qdrant
- mlops/evaluation (5): huggingface-tokenizers, lm-evaluation-harness, nemo-curator, saelens, weights-and-biases
- mlops/cloud (2): lambda-labs, modal
- mlops/research (1): dspy

Merged singleton categories:
- gifs → media (gif-search joins youtube-content)
- music-creation → media (heartmula, songsee)
- diagramming → creative (excalidraw joins ascii-art)
- ocr-and-documents → productivity
- domain → research (domain-intel)
- feeds → research (blogwatcher)
- market-data → research (polymarket)

Fixed misplaced skills:
- mlops/code-review → software-development (not ML-specific)
- mlops/ml-paper-writing → research (academic writing)

Added DESCRIPTION.md files for all new/updated categories.
# GRPO/RL Training Skill

Expert-level guidance for Group Relative Policy Optimization (GRPO) with TRL.
## 📁 Skill Structure

```
grpo-rl-training/
├── SKILL.md                          # Main skill documentation (READ THIS FIRST)
├── README.md                         # This file
├── templates/
│   └── basic_grpo_training.py        # Production-ready training template
└── examples/
    └── reward_functions_library.py   # 20+ reward function examples
```
## 🚀 Quick Start

1. **Read `SKILL.md`** - comprehensive guide with all concepts and patterns
2. **Copy `templates/basic_grpo_training.py`** - start with working code
3. **Browse `examples/reward_functions_library.py`** - pick reward functions for your task
4. **Modify for your use case** - adapt the dataset, rewards, and config
## 💡 What's Inside

### SKILL.md (Main Documentation)
- Core GRPO concepts and algorithm fundamentals
- Complete implementation workflow (dataset → rewards → training → deployment)
- 10+ reward function examples with code
- Hyperparameter tuning guide
- Training insights (loss behavior, metrics, debugging)
- Troubleshooting guide
- Production best practices
### Templates

- `basic_grpo_training.py` - minimal, production-ready training script
  - Uses Qwen 2.5 1.5B Instruct
  - 3 reward functions (format + correctness)
  - LoRA for efficient training
  - Fully documented and ready to run
### Examples

- `reward_functions_library.py` - 20+ battle-tested reward functions
  - Correctness rewards (exact match, fuzzy match, numeric, code execution)
  - Format rewards (XML, JSON, strict/soft)
  - Length rewards (ideal length, min/max)
  - Style rewards (reasoning quality, citations, repetition penalty)
  - Combined rewards (multi-objective optimization)
  - Preset collections for common tasks
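To illustrate the kinds of functions the library collects, here is a minimal sketch of a format reward and a numeric-correctness reward. The function names and the `<answer>` tag convention are illustrative assumptions, not the library's actual API, and completions are assumed to be plain strings (the non-conversational dataset format).

```python
import re

def xml_format_reward(completions, **kwargs):
    """Reward 1.0 if the completion wraps its answer in <answer>...</answer>."""
    pattern = re.compile(r"<answer>.*?</answer>", re.DOTALL)
    return [1.0 if pattern.search(c) else 0.0 for c in completions]

def numeric_match_reward(completions, answers, **kwargs):
    """Reward 1.0 if the number inside <answer> tags equals the reference."""
    rewards = []
    for completion, answer in zip(completions, answers):
        match = re.search(r"<answer>\s*(-?\d+(?:\.\d+)?)\s*</answer>", completion)
        if match and float(match.group(1)) == float(answer):
            rewards.append(1.0)
        else:
            rewards.append(0.0)
    return rewards

# One well-formatted correct answer, one bare (untagged) answer
outs = ["<answer>42</answer>", "42"]
print(xml_format_reward(outs))                    # [1.0, 0.0]
print(numeric_match_reward(outs, ["42", "42"]))   # [1.0, 0.0]
```

Each function returns one float per completion, which is the shape GRPO-style trainers expect from reward callables.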
## 📖 Usage for Agents
When this skill is loaded in your agent's context:
- **Read SKILL.md first** - before implementing anything
- **Start simple** - use a length-based reward to validate the setup
- **Build incrementally** - add one reward function at a time
- **Reference examples** - copy patterns from `reward_functions_library.py`
- **Monitor training** - watch reward metrics, not loss
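The "start simple" advice above can be sketched as a length-based reward that pushes completions toward a target length. The function name and target value are illustrative; the point is to validate the loop end to end, since if mean reward climbs, generation, scoring, and optimization are wired correctly.

```python
def length_reward(completions, target_len=200, **kwargs):
    """Toy reward: 1.0 at the target word count, falling off linearly to 0.0.

    Only useful for validating the training setup - a real run should swap
    this for task-specific format and correctness rewards.
    """
    rewards = []
    for completion in completions:
        n = len(completion.split())  # crude word count, not real tokens
        rewards.append(max(0.0, 1.0 - abs(n - target_len) / target_len))
    return rewards

# A completion at the target length scores 1.0; a very short one near 0.0
print(length_reward(["word " * 200, "too short"], target_len=200))
```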
## 🎯 Common Use Cases

| Task Type | Recommended Rewards | Template |
|---|---|---|
| Math reasoning | `MATH_REASONING_REWARDS` preset | `basic_grpo_training.py` |
| Code generation | `CODE_GENERATION_REWARDS` preset | Modify dataset in template |
| Summarization | `SUMMARIZATION_REWARDS` preset | Adjust prompts + rewards |
| Q&A | `QA_REWARDS` preset | Use fuzzy match + citations |
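A preset like those in the table can be sketched as a weighted combination of individual reward functions. The combinator, component functions, and weights below are illustrative assumptions, not the library's actual presets.

```python
def combine_rewards(reward_fns, weights):
    """Return a single reward function that sums weighted component scores."""
    def combined(completions, **kwargs):
        totals = [0.0] * len(completions)
        for fn, w in zip(reward_fns, weights):
            for i, score in enumerate(fn(completions, **kwargs)):
                totals[i] += w * score
        return totals
    return combined

# Two toy components: a non-empty check and a "contains a digit" check
has_text = lambda completions, **kw: [1.0 if c.strip() else 0.0 for c in completions]
has_digit = lambda completions, **kw: [1.0 if any(ch.isdigit() for ch in c) else 0.0
                                       for c in completions]

math_preset = combine_rewards([has_text, has_digit], weights=[0.3, 0.7])
print(math_preset(["answer: 42", ""]))  # [1.0, 0.0]
```

Weighting components this way keeps each reward independently testable while still presenting the trainer with a single multi-objective score.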
## ⚠️ Critical Reminders
- **Loss goes UP during training** - this is normal (it's KL divergence, not a sign of failure)
- **Use 3-5 reward functions** - single rewards often fail
- **Test rewards before training** - debug each function independently
- **Monitor reward_std** - it should stay > 0.1 (avoid mode collapse)
- **Start with num_generations=4-8** - scale up if GPU memory allows
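The reward_std reminder can be checked with a few lines of stdlib Python. The function name, the 0.1 threshold default, and the return shape are illustrative, not TRL's logging API: the idea is that within one prompt's group of generations, identical rewards mean no learning signal.

```python
from statistics import pstdev

def check_reward_std(group_rewards, threshold=0.1):
    """Flag mode collapse: rewards within a generation group should vary.

    group_rewards: rewards for the num_generations completions of one prompt.
    Returns (std, collapsed) so a training loop can warn or early-stop.
    """
    std = pstdev(group_rewards)
    return std, std <= threshold

healthy = [0.0, 1.0, 0.5, 1.0]    # varied rewards: gradient signal exists
collapsed = [1.0, 1.0, 1.0, 1.0]  # identical rewards: no learning signal

print(check_reward_std(healthy))    # std well above 0.1, not collapsed
print(check_reward_std(collapsed))  # std of 0.0, collapsed
```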
## 🔗 External Resources

## 📝 Version

v1.0.0 - Initial release (January 2025)
## 👨‍💻 Maintained By

Orchestra Research. For questions or improvements, see https://orchestra.com.

License: MIT
Last Updated: January 2025