Megatron GRPO

If you are new to GRPO, please refer to the GRPO documentation first.

Megatron GRPO currently supports the following features:

  • Training Modes: Full parameter training and LoRA fine-tuning

  • Parallelism Strategies: Context Parallelism (CP), Pipeline Parallelism (PP), Tensor Parallelism (TP), and Expert Parallelism (EP)

  • Inference Acceleration: vLLM colocate mode and server mode

  • Model Support: Compatible with LLMs and MLLMs (multimodal large language models) in Megatron-SWIFT

  • Algorithm Support: Covers most features of ms-swift GRPO

As in ms-swift GRPO, all batch-size-related parameters in Megatron GRPO are completion-level: they count the completions generated by the model, not the prompts. For example, with num_generations = 8, a batch size of 64 corresponds to 8 prompts.

Parameter Comparison

The following table compares the batch-related parameters between ms-swift and Megatron-SWIFT:

| ms-swift Parameter | Megatron-SWIFT Parameter | Description |
|---|---|---|
| per_device_train_batch_size | micro_batch_size | Training batch size per DP group (completion-level) |
| gradient_accumulation_steps | - | Gradient accumulation steps, already included in global_batch_size calculation in Megatron-SWIFT |
| - | global_batch_size | Global batch size (completion-level). Megatron-SWIFT: micro_batch_size × dp_size × gradient_accumulation_steps; ms-swift: per_device_train_batch_size × world_size × gradient_accumulation_steps |
| num_generations | num_generations | Number of completions generated per prompt |
| steps_per_generation | steps_per_generation | Ratio of Rollout batch size to training batch size. Note: In ms-swift, must be an integer multiple of gradient_accumulation_steps |
| generation_batch_size | generation_batch_size | Batch size during Rollout phase (completion-level), must be an integer multiple of global_batch_size |
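As a rough illustration of the mapping in the table, the sketch below translates a set of ms-swift batch settings into the two Megatron-SWIFT parameters that replace them. The helper name megatron_batch_args is hypothetical and is not part of either library; it only applies the ms-swift formula for the global batch size listed above, and the input values are made up.

```python
def megatron_batch_args(per_device_train_batch_size: int,
                        gradient_accumulation_steps: int,
                        world_size: int) -> dict:
    """Hypothetical helper: derive Megatron-SWIFT batch parameters
    from ms-swift settings, using the formulas in the table above."""
    return {
        # per_device_train_batch_size maps directly to micro_batch_size
        "micro_batch_size": per_device_train_batch_size,
        # ms-swift: per_device_train_batch_size x world_size x gradient_accumulation_steps
        "global_batch_size": (per_device_train_batch_size
                              * world_size
                              * gradient_accumulation_steps),
    }


# Illustrative values only
args = megatron_batch_args(per_device_train_batch_size=4,
                           gradient_accumulation_steps=2,
                           world_size=8)
print(args)  # {'micro_batch_size': 4, 'global_batch_size': 64}
```

Because Megatron-SWIFT takes global_batch_size directly, gradient_accumulation_steps does not need to be passed separately; it is implied by global_batch_size, micro_batch_size, and dp_size.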

The following formulas are used to calculate batch sizes in Megatron GRPO:

  • Data Parallel Size: dp_size = world_size / (TP × PP × CP)

  • Global Batch Size: global_batch_size = micro_batch_size × dp_size × gradient_accumulation_steps

  • Generation Batch Size: generation_batch_size = global_batch_size × steps_per_generation

  • Rollout Prompt Count: num_rollout_prompts = generation_batch_size / num_generations

  • Training Prompt Count: num_train_prompts = global_batch_size / num_generations

  • Training Prompt Count per DP Group: num_prompts_per_dp_group = global_batch_size / num_generations / dp_size

Note: In Megatron GRPO, num_prompts_per_dp_group (the training prompt count per DP group) must be an integer multiple of micro_batch_size so that batches can be allocated evenly during training.
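The formulas and the constraint above can be checked before launching a job. The following is a minimal sketch, not Megatron-SWIFT code: the names GRPOBatchPlan and plan_batches are hypothetical, and the extra exact-division checks (beyond the micro_batch_size constraint stated in the note) are an assumption made so that every derived count is an integer.

```python
from dataclasses import dataclass


@dataclass
class GRPOBatchPlan:
    dp_size: int
    global_batch_size: int
    generation_batch_size: int
    num_rollout_prompts: int
    num_train_prompts: int
    num_prompts_per_dp_group: int


def _exact_div(a: int, b: int, what: str) -> int:
    # Assumption: every derived count must be a whole number.
    if a % b != 0:
        raise ValueError(f"{what}: {a} is not divisible by {b}")
    return a // b


def plan_batches(world_size: int, tp: int, pp: int, cp: int,
                 micro_batch_size: int, gradient_accumulation_steps: int,
                 steps_per_generation: int, num_generations: int) -> GRPOBatchPlan:
    """Hypothetical helper applying the batch-size formulas listed above."""
    dp_size = _exact_div(world_size, tp * pp * cp, "world_size / (TP x PP x CP)")
    global_batch_size = micro_batch_size * dp_size * gradient_accumulation_steps
    generation_batch_size = global_batch_size * steps_per_generation
    num_rollout_prompts = _exact_div(
        generation_batch_size, num_generations, "generation_batch_size / num_generations")
    num_train_prompts = _exact_div(
        global_batch_size, num_generations, "global_batch_size / num_generations")
    num_prompts_per_dp_group = _exact_div(
        num_train_prompts, dp_size, "num_train_prompts / dp_size")
    # Constraint from the note: must be an integer multiple of micro_batch_size.
    if num_prompts_per_dp_group % micro_batch_size != 0:
        raise ValueError(
            f"num_prompts_per_dp_group ({num_prompts_per_dp_group}) must be an "
            f"integer multiple of micro_batch_size ({micro_batch_size})")
    return GRPOBatchPlan(dp_size, global_batch_size, generation_batch_size,
                         num_rollout_prompts, num_train_prompts,
                         num_prompts_per_dp_group)


# Illustrative values only: 8 GPUs with TP=2 gives dp_size=4.
plan = plan_batches(world_size=8, tp=2, pp=1, cp=1,
                    micro_batch_size=2, gradient_accumulation_steps=16,
                    steps_per_generation=2, num_generations=8)
print(plan)
```

With these made-up values, global_batch_size is 2 × 4 × 16 = 128, generation_batch_size is 256, and each DP group receives 128 / 8 / 4 = 4 training prompts, which is a multiple of micro_batch_size = 2.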

For more parameters, please refer to the Command-line Parameters documentation.

For training scripts, please refer to Megatron GRPO Scripts.