Checkpointing in TorchTitan
TorchTitan uses PyTorch Distributed Checkpoint (DCP) for fault-tolerant, interoperable checkpointing.
Basic Configuration
[checkpoint]
enable = true
folder = "checkpoint"
interval = 500
Save Model Only (Smaller Checkpoints)
Exclude optimizer state and training metadata:
[checkpoint]
enable = true
last_save_model_only = true
export_dtype = "bfloat16" # Optional: export in lower precision
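As a rough illustration of why a model-only, lower-precision export is smaller, the sketch below estimates checkpoint sizes for an 8B-parameter model. The assumptions are illustrative and do not come from TorchTitan: weights are held in float32 during training, the optimizer is Adam-style with two float32 moment tensors per parameter, and metadata is small enough to ignore.

```python
# Back-of-the-envelope checkpoint sizes. Assumptions (not from TorchTitan):
# float32 training weights, an Adam-style optimizer keeping two float32
# moment tensors per parameter, and negligible metadata.

BYTES_PER_ELEMENT = {"float32": 4, "bfloat16": 2}
GIB = 1024 ** 3


def checkpoint_size_gib(num_params: int, dtype: str, with_optimizer: bool) -> float:
    """Estimate checkpoint size in GiB for the given layout."""
    weights = num_params * BYTES_PER_ELEMENT[dtype]
    # Two float32 moments per parameter when optimizer state is kept.
    optimizer = num_params * 2 * BYTES_PER_ELEMENT["float32"] if with_optimizer else 0
    return (weights + optimizer) / GIB


if __name__ == "__main__":
    n = 8_000_000_000  # illustrative 8B-parameter model
    full = checkpoint_size_gib(n, "float32", with_optimizer=True)
    model_only = checkpoint_size_gib(n, "float32", with_optimizer=False)
    exported = checkpoint_size_gib(n, "bfloat16", with_optimizer=False)
    print(f"full: {full:.1f} GiB, model only: {model_only:.1f} GiB, "
          f"bf16 export: {exported:.1f} GiB")
```

Under these assumptions, dropping the optimizer state shrinks the checkpoint by a factor of three, and exporting in bfloat16 halves it again.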
Excluding Keys from Loading
To load a checkpoint only partially, for example after changing the data loader or learning-rate schedule, list the keys to skip:
[checkpoint]
enable = true
exclude_from_loading = ["data_loader", "lr_scheduler"]
CLI equivalent:
--checkpoint.exclude_from_loading data_loader,lr_scheduler
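The two forms above carry the same information: the CLI takes a comma-separated string, while the config file takes a list. The sketch below shows that correspondence and the filtering it implies. It is not TorchTitan's loader; the helper names and the set of available state keys are hypothetical and used only for illustration.

```python
# Illustrative only: shows the shape of the option, not TorchTitan's loader.
# The helper names and the state keys used in the demo are hypothetical.


def parse_exclude_list(cli_value: str) -> list[str]:
    """Split the comma-separated CLI form into the list form used in the config."""
    return [key.strip() for key in cli_value.split(",") if key.strip()]


def keys_to_load(available: list[str], excluded: list[str]) -> list[str]:
    """Return the checkpoint keys that remain after exclusion, order preserved."""
    unknown = set(excluded) - set(available)
    if unknown:
        # Failing loudly on a typo is a design choice of this sketch.
        raise ValueError(f"cannot exclude unknown keys: {sorted(unknown)}")
    return [key for key in available if key not in excluded]


if __name__ == "__main__":
    excluded = parse_exclude_list("data_loader,lr_scheduler")
    # Hypothetical top-level keys stored in a checkpoint.
    available = ["model", "optimizer", "lr_scheduler", "data_loader", "train_state"]
    print(keys_to_load(available, excluded))
```

Excluded keys are simply not restored, so the corresponding components start from whatever the current configuration defines.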
Creating Seed Checkpoints
Required for Pipeline Parallelism to ensure consistent initialization:
NGPU=1 CONFIG_FILE=<path_to_config> ./run_train.sh \
--checkpoint.enable \
--checkpoint.create_seed_checkpoint \
--parallelism.data_parallel_replicate_degree 1 \
--parallelism.data_parallel_shard_degree 1 \
--parallelism.tensor_parallel_degree 1 \
--parallelism.pipeline_parallel_degree 1 \
--parallelism.context_parallel_degree 1 \
--parallelism.expert_parallel_degree 1
This initializes the model on a single CPU, so the initialization is reproducible regardless of how many GPUs are used for training.
Async Checkpointing
Reduce checkpoint overhead with async writes:
[checkpoint]
enable = true
async_mode = "async" # Options: "disabled", "async", "async_with_pinned_mem"
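The general idea behind asynchronous checkpointing can be sketched with the standard library alone: take a consistent snapshot of the state (the only blocking part), then write it in the background while training continues. This is a conceptual sketch, not TorchTitan's or DCP's implementation; `save_async` and the JSON file format are hypothetical stand-ins.

```python
import copy
import json
import tempfile
import threading
from pathlib import Path


def save_async(state: dict, path: Path) -> threading.Thread:
    """Snapshot `state`, then write the snapshot to `path` in a background thread."""
    snapshot = copy.deepcopy(state)  # blocking part: a consistent copy

    def _write() -> None:
        tmp = path.with_suffix(".tmp")
        tmp.write_text(json.dumps(snapshot))
        tmp.replace(path)  # rename last so readers never see a partial file

    writer = threading.Thread(target=_write)
    writer.start()
    return writer


if __name__ == "__main__":
    state = {"step": 500, "weights": [0.1, 0.2]}
    out = Path(tempfile.mkdtemp()) / "step-500.json"
    writer = save_async(state, out)
    state["step"] = 501  # training continues and mutates state during the write
    writer.join()        # wait before starting the next save or exiting
    print(json.loads(out.read_text())["step"])  # prints 500
```

Because the snapshot is taken before the write starts, later mutations of the live state do not leak into the saved file.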
HuggingFace Conversion
During Training
Save directly in HuggingFace format:
[checkpoint]
last_save_in_hf = true
last_save_model_only = true
Load from HuggingFace:
[checkpoint]
initial_load_in_hf = true
[model]
hf_assets_path = "./path/to/hf/checkpoint"
Offline Conversion
Convert without running training:
# HuggingFace -> TorchTitan
python ./scripts/checkpoint_conversion/convert_from_hf.py \
<input_dir> <output_dir> \
--model_name llama3 \
--model_flavor 8B
# TorchTitan -> HuggingFace
python ./scripts/checkpoint_conversion/convert_to_hf.py \
<input_dir> <output_dir> \
--hf_assets_path ./assets/hf/Llama3.1-8B \
--model_name llama3 \
--model_flavor 8B
Example
python ./scripts/checkpoint_conversion/convert_from_hf.py \
~/.cache/huggingface/hub/models--meta-llama--Meta-Llama-3-8B/snapshots/8cde5ca8380496c9a6cc7ef3a8b46a0372a1d920/ \
./initial_load_path/ \
--model_name llama3 \
--model_flavor 8B
Converting to Single .pt File
Convert DCP sharded checkpoint to single PyTorch file:
python -m torch.distributed.checkpoint.format_utils \
dcp_to_torch \
torchtitan/outputs/checkpoint/step-1000 \
checkpoint.pt
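When several steps need converting, the command above can be generated in a loop. The sketch below only builds the argument lists and prints them. The helper names and the `exported` output directory are hypothetical; the module path and the `dcp_to_torch` mode are taken from the command above.

```python
import sys
from pathlib import Path


def dcp_to_torch_command(step_dir: Path, output_file: Path) -> list[str]:
    """Build the argv for the dcp_to_torch conversion shown above."""
    return [
        sys.executable, "-m", "torch.distributed.checkpoint.format_utils",
        "dcp_to_torch", str(step_dir), str(output_file),
    ]


def commands_for_steps(checkpoint_root: Path, steps: list[int],
                       out_dir: Path) -> list[list[str]]:
    """One conversion command per step, writing step-<N>.pt into out_dir."""
    return [
        dcp_to_torch_command(checkpoint_root / f"step-{step}",
                             out_dir / f"step-{step}.pt")
        for step in steps
    ]


if __name__ == "__main__":
    root = Path("torchtitan/outputs/checkpoint")
    for cmd in commands_for_steps(root, [500, 1000], Path("exported")):
        print(" ".join(cmd))
        # To actually run it (requires PyTorch and the checkpoint on disk):
        # import subprocess; subprocess.run(cmd, check=True)
```

Building the argv as a list rather than a shell string avoids quoting problems if a checkpoint path contains spaces.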
Checkpoint Structure
DCP saves sharded checkpoints that can be resharded for different parallelism configurations:
checkpoint/
├── step-500/
│ ├── .metadata
│ ├── __0_0.distcp
│ ├── __0_1.distcp
│ └── ...
└── step-1000/
└── ...
Resume Training
Training auto-resumes from the latest checkpoint in the configured folder. To resume from a specific step:
[checkpoint]
load_step = 500 # Resume from step 500
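The resume rule can be mimicked in a few lines: scan the checkpoint folder for `step-<N>` directories, then pick either the highest step or the requested one. This is a sketch of the documented behavior, not TorchTitan's code. The function names are hypothetical, and raising an error for a missing step is an assumption of this sketch.

```python
import re
import tempfile
from pathlib import Path

STEP_DIR = re.compile(r"^step-(\d+)$")


def available_steps(folder: Path) -> list[int]:
    """Return the step numbers of all step-<N> directories, ascending."""
    steps = []
    for entry in folder.iterdir():
        match = STEP_DIR.match(entry.name)
        if entry.is_dir() and match:
            steps.append(int(match.group(1)))
    return sorted(steps)


def resolve_load_step(folder: Path, load_step: int = -1) -> int:
    """Map load_step to a concrete step: -1 means latest, otherwise it must exist."""
    steps = available_steps(folder)
    if not steps:
        raise FileNotFoundError(f"no step-<N> directories in {folder}")
    if load_step == -1:
        return steps[-1]
    if load_step not in steps:
        raise ValueError(f"step {load_step} not found; available: {steps}")
    return load_step


if __name__ == "__main__":
    root = Path(tempfile.mkdtemp())
    for step in (500, 1000):
        (root / f"step-{step}").mkdir()
    print(resolve_load_step(root))       # prints 1000
    print(resolve_load_step(root, 500))  # prints 500
```

Steps are compared as integers because a plain string sort would place `step-1000` before `step-500`.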
Interoperability with TorchTune
Checkpoints saved with last_save_model_only = true can be loaded directly into torchtune for fine-tuning.
Full Configuration Example
[checkpoint]
enable = true
folder = "checkpoint"
interval = 500
load_step = -1 # -1 = latest, or specify step number
last_save_model_only = true
export_dtype = "bfloat16"
async_mode = "async"
exclude_from_loading = []
last_save_in_hf = false
initial_load_in_hf = false
create_seed_checkpoint = false
Best Practices
- Large models: Use async_mode = "async" to overlap checkpoint saves with training
- Fine-tuning export: Enable last_save_model_only and export_dtype = "bfloat16" for smaller files
- Pipeline parallelism: Always create seed checkpoint first
- Debugging: Save checkpoints frequently during development and less often in production
- HF interop: Use conversion scripts for offline conversion, direct save/load for training workflows