Files
hermes-agent/skills/mlops/training/accelerate/SKILL.md
teknium1 732c66b0f3 refactor: reorganize skills into sub-categories
The skills directory was getting disorganized — mlops alone had 40
skills in a flat list, and 12 categories were singletons with just
one skill each.

Code change:
- prompt_builder.py: Support sub-categories in skill scanner.
  skills/mlops/training/axolotl/SKILL.md now shows as category
  'mlops/training' instead of just 'mlops'. Backwards-compatible
  with existing flat structure.

Split mlops (40 skills) into 7 sub-categories:
- mlops/training (12): accelerate, axolotl, flash-attention,
  grpo-rl-training, peft, pytorch-fsdp, pytorch-lightning,
  simpo, slime, torchtitan, trl-fine-tuning, unsloth
- mlops/inference (8): gguf, guidance, instructor, llama-cpp,
  obliteratus, outlines, tensorrt-llm, vllm
- mlops/models (6): audiocraft, clip, llava, segment-anything,
  stable-diffusion, whisper
- mlops/vector-databases (4): chroma, faiss, pinecone, qdrant
- mlops/evaluation (5): huggingface-tokenizers,
  lm-evaluation-harness, nemo-curator, saelens, weights-and-biases
- mlops/cloud (2): lambda-labs, modal
- mlops/research (1): dspy

Merged singleton categories:
- gifs → media (gif-search joins youtube-content)
- music-creation → media (heartmula, songsee)
- diagramming → creative (excalidraw joins ascii-art)
- ocr-and-documents → productivity
- domain → research (domain-intel)
- feeds → research (blogwatcher)
- market-data → research (polymarket)

Fixed misplaced skills:
- mlops/code-review → software-development (not ML-specific)
- mlops/ml-paper-writing → research (academic writing)

Added DESCRIPTION.md files for all new/updated categories.
2026-03-09 03:35:53 -07:00

8.2 KiB

name, description, version, author, license, dependencies, metadata
name description version author license dependencies metadata
huggingface-accelerate Simplest distributed training API. 4 lines to add distributed support to any PyTorch script. Unified API for DeepSpeed/FSDP/Megatron/DDP. Automatic device placement, mixed precision (FP16/BF16/FP8). Interactive config, single launch command. HuggingFace ecosystem standard. 1.0.0 Orchestra Research MIT
accelerate
torch
transformers
hermes
tags
Distributed Training
HuggingFace
Accelerate
DeepSpeed
FSDP
Mixed Precision
PyTorch
DDP
Unified API
Simple

HuggingFace Accelerate - Unified Distributed Training

Quick start

Accelerate simplifies distributed training to 4 lines of code.

Installation:

pip install accelerate

Convert PyTorch script (4 lines):

import torch
+ from accelerate import Accelerator

+ accelerator = Accelerator()

  model = torch.nn.Transformer()
  optimizer = torch.optim.Adam(model.parameters())
  dataloader = torch.utils.data.DataLoader(dataset)

+ model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)

  for batch in dataloader:
      optimizer.zero_grad()
      loss = model(batch)
-     loss.backward()
+     accelerator.backward(loss)
      optimizer.step()

Run (single command):

accelerate launch train.py

Common workflows

Workflow 1: From single GPU to multi-GPU

Original script:

# train.py
import torch

model = torch.nn.Linear(10, 2).to('cuda')
optimizer = torch.optim.Adam(model.parameters())
dataloader = torch.utils.data.DataLoader(dataset, batch_size=32)

for epoch in range(10):
    for batch in dataloader:
        batch = batch.to('cuda')
        optimizer.zero_grad()
        loss = model(batch).mean()
        loss.backward()
        optimizer.step()

With Accelerate (4 lines added):

# train.py
import torch
from accelerate import Accelerator  # +1

accelerator = Accelerator()  # +2

model = torch.nn.Linear(10, 2)
optimizer = torch.optim.Adam(model.parameters())
dataloader = torch.utils.data.DataLoader(dataset, batch_size=32)

model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)  # +3

for epoch in range(10):
    for batch in dataloader:
        # No .to('cuda') needed - automatic!
        optimizer.zero_grad()
        loss = model(batch).mean()
        accelerator.backward(loss)  # +4
        optimizer.step()

Configure (interactive):

accelerate config

Questions:

  • Which machine? (single/multi GPU/TPU/CPU)
  • How many machines? (1)
  • Mixed precision? (no/fp16/bf16/fp8)
  • DeepSpeed? (no/yes)

Launch (works on any setup):

# Single GPU
accelerate launch train.py

# Multi-GPU (8 GPUs)
accelerate launch --multi_gpu --num_processes 8 train.py

# Multi-node
accelerate launch --multi_gpu --num_processes 16 \
  --num_machines 2 --machine_rank 0 \
  --main_process_ip $MASTER_ADDR \
  train.py

Workflow 2: Mixed precision training

Enable FP16/BF16:

from accelerate import Accelerator

# FP16 (with gradient scaling)
accelerator = Accelerator(mixed_precision='fp16')

# BF16 (no scaling, more stable)
accelerator = Accelerator(mixed_precision='bf16')

# FP8 (H100+)
accelerator = Accelerator(mixed_precision='fp8')

model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)

# Everything else is automatic!
for batch in dataloader:
    with accelerator.autocast():  # Optional, done automatically
        loss = model(batch)
    accelerator.backward(loss)

Workflow 3: DeepSpeed ZeRO integration

Enable DeepSpeed ZeRO-2:

from accelerate import Accelerator

accelerator = Accelerator(
    mixed_precision='bf16',
    deepspeed_plugin={
        "zero_stage": 2,  # ZeRO-2
        "offload_optimizer": False,
        "gradient_accumulation_steps": 4
    }
)

# Same code as before!
model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)

Or via config:

accelerate config
# Select: DeepSpeed → ZeRO-2

deepspeed_config.json:

{
    "fp16": {"enabled": false},
    "bf16": {"enabled": true},
    "zero_optimization": {
        "stage": 2,
        "offload_optimizer": {"device": "cpu"},
        "allgather_bucket_size": 5e8,
        "reduce_bucket_size": 5e8
    }
}

Launch:

accelerate launch --config_file deepspeed_config.json train.py

Workflow 4: FSDP (Fully Sharded Data Parallel)

Enable FSDP:

from accelerate import Accelerator, FullyShardedDataParallelPlugin

fsdp_plugin = FullyShardedDataParallelPlugin(
    sharding_strategy="FULL_SHARD",  # ZeRO-3 equivalent
    auto_wrap_policy="TRANSFORMER_AUTO_WRAP",
    cpu_offload=False
)

accelerator = Accelerator(
    mixed_precision='bf16',
    fsdp_plugin=fsdp_plugin
)

model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)

Or via config:

accelerate config
# Select: FSDP → Full Shard → No CPU Offload

Workflow 5: Gradient accumulation

Accumulate gradients:

from accelerate import Accelerator

accelerator = Accelerator(gradient_accumulation_steps=4)

model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)

for batch in dataloader:
    with accelerator.accumulate(model):  # Handles accumulation
        optimizer.zero_grad()
        loss = model(batch)
        accelerator.backward(loss)
        optimizer.step()

Effective batch size: batch_size * num_gpus * gradient_accumulation_steps

When to use vs alternatives

Use Accelerate when:

  • Want simplest distributed training
  • Need single script for any hardware
  • Use HuggingFace ecosystem
  • Want flexibility (DDP/DeepSpeed/FSDP/Megatron)
  • Need quick prototyping

Key advantages:

  • 4 lines: Minimal code changes
  • Unified API: Same code for DDP, DeepSpeed, FSDP, Megatron
  • Automatic: Device placement, mixed precision, sharding
  • Interactive config: No manual launcher setup
  • Single launch: Works everywhere

Use alternatives instead:

  • PyTorch Lightning: Need callbacks, high-level abstractions
  • Ray Train: Multi-node orchestration, hyperparameter tuning
  • DeepSpeed: Direct API control, advanced features
  • Raw DDP: Maximum control, minimal abstraction

Common issues

Issue: Wrong device placement

Don't manually move to device:

# WRONG
batch = batch.to('cuda')

# CORRECT
# Accelerate handles it automatically after prepare()

Issue: Gradient accumulation not working

Use context manager:

# CORRECT
with accelerator.accumulate(model):
    optimizer.zero_grad()
    accelerator.backward(loss)
    optimizer.step()

Issue: Checkpointing in distributed

Use accelerator methods:

# Save only on main process
if accelerator.is_main_process:
    accelerator.save_state('checkpoint/')

# Load on all processes
accelerator.load_state('checkpoint/')

Issue: Different results with FSDP

Ensure same random seed:

from accelerate.utils import set_seed
set_seed(42)

Advanced topics

Megatron integration: See references/megatron-integration.md for tensor parallelism, pipeline parallelism, and sequence parallelism setup.

Custom plugins: See references/custom-plugins.md for creating custom distributed plugins and advanced configuration.

Performance tuning: See references/performance.md for profiling, memory optimization, and best practices.

Hardware requirements

  • CPU: Works (slow)
  • Single GPU: Works
  • Multi-GPU: DDP (default), DeepSpeed, or FSDP
  • Multi-node: DDP, DeepSpeed, FSDP, Megatron
  • TPU: Supported
  • Apple MPS: Supported

Launcher requirements:

  • DDP: torch.distributed.run (built-in)
  • DeepSpeed: deepspeed (pip install deepspeed)
  • FSDP: PyTorch 1.12+ (built-in)
  • Megatron: Custom setup

Resources