Megatron GRPO

If you are new to GRPO, please refer to the GRPO documentation first.

Megatron GRPO currently supports the following features:

Training Modes: Full parameter training and LoRA fine-tuning
Parallelism Strategies: Context Parallelism (CP), Pipeline Parallelism (PP), Tensor Parallelism (TP), and Expert Parallelism (EP)
Inference Acceleration: vLLM colocate mode and server mode
Model Support: Compatible with the LLMs and MLLMs (multimodal large language models) supported in Megatron-SWIFT
Algorithm Support: Covers most features of ms-swift GRPO

As in ms-swift GRPO, all batch-size parameters in Megatron GRPO are completion-level: they count the completions generated by the model, not the prompts.

Parameter Comparison

The following table compares the batch-related parameters between ms-swift and Megatron-SWIFT:

ms-swift Parameter	Megatron-SWIFT Parameter	Description
`per_device_train_batch_size`	`micro_batch_size`	Training batch size per DP group (completion-level)
`gradient_accumulation_steps`	-	Gradient accumulation steps; in Megatron-SWIFT this is already accounted for in the `global_batch_size` calculation
-	`global_batch_size`	Global batch size (completion-level). Megatron-SWIFT: `micro_batch_size × dp_size × gradient_accumulation_steps`; ms-swift: `per_device_train_batch_size × world_size × gradient_accumulation_steps`
`num_generations`	`num_generations`	Number of completions generated per prompt
`steps_per_generation`	`steps_per_generation`	Ratio of Rollout batch size to training batch size. Note: in ms-swift, it must be an integer multiple of `gradient_accumulation_steps`
`generation_batch_size`	`generation_batch_size`	Batch size during the Rollout phase (completion-level); must be an integer multiple of `global_batch_size`

The following formulas are used to calculate batch sizes in Megatron GRPO:

Data Parallel Size: dp_size = world_size / (TP × PP × CP)
Global Batch Size: global_batch_size = micro_batch_size × dp_size × gradient_accumulation_steps
Generation Batch Size: generation_batch_size = global_batch_size × steps_per_generation
Rollout Prompt Count: num_rollout_prompts = generation_batch_size / num_generations
Training Prompt Count: num_train_prompts = global_batch_size / num_generations
Training Prompt Count per DP Group: num_prompts_per_dp_group = global_batch_size / num_generations / dp_size
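
The formulas above can be sketched as a small helper. This is an illustrative calculation only, not part of Megatron-SWIFT: the names `BatchSizes` and `compute_batch_sizes` are hypothetical, and `gradient_accumulation_steps` is taken as an input here purely to mirror the formulas (in Megatron-SWIFT it is implied by `global_batch_size`). The divisibility checks are an assumption of this sketch, added so that every derived count is a whole number.

```python
from dataclasses import dataclass


@dataclass
class BatchSizes:
    """Derived batch sizes (hypothetical container, for illustration only)."""
    dp_size: int
    global_batch_size: int
    generation_batch_size: int
    num_rollout_prompts: int
    num_train_prompts: int
    num_prompts_per_dp_group: int


def compute_batch_sizes(world_size, tp, pp, cp, micro_batch_size,
                        gradient_accumulation_steps, steps_per_generation,
                        num_generations):
    """Apply the batch-size formulas listed above, in order."""
    model_parallel_size = tp * pp * cp
    # Assumption of this sketch: reject settings that give fractional sizes.
    if world_size % model_parallel_size != 0:
        raise ValueError("world_size must be divisible by TP x PP x CP")
    dp_size = world_size // model_parallel_size

    global_batch_size = micro_batch_size * dp_size * gradient_accumulation_steps
    generation_batch_size = global_batch_size * steps_per_generation

    if global_batch_size % (num_generations * dp_size) != 0:
        raise ValueError(
            "global_batch_size must be divisible by num_generations x dp_size")

    return BatchSizes(
        dp_size=dp_size,
        global_batch_size=global_batch_size,
        generation_batch_size=generation_batch_size,
        num_rollout_prompts=generation_batch_size // num_generations,
        num_train_prompts=global_batch_size // num_generations,
        num_prompts_per_dp_group=global_batch_size // num_generations // dp_size,
    )


# Example: 8 GPUs with TP=2 gives dp_size=4.
sizes = compute_batch_sizes(
    world_size=8, tp=2, pp=1, cp=1,
    micro_batch_size=1, gradient_accumulation_steps=8,
    steps_per_generation=2, num_generations=8,
)
print(sizes)
```

With these example values, the global batch holds 32 completions (4 prompts × 8 completions each), and each Rollout phase generates 64 completions from 8 prompts.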

Note: In Megatron GRPO, num_prompts_per_dp_group must be an integer multiple of micro_batch_size so that batches are allocated correctly during training.
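
A configuration can be checked against this constraint before launching a job. The function below is a minimal sketch with a hypothetical name (`prompts_per_dp_group_is_valid`); it is not a Megatron-SWIFT API and only restates the rule above in code.

```python
def prompts_per_dp_group_is_valid(global_batch_size, num_generations,
                                  dp_size, micro_batch_size):
    """Return True if num_prompts_per_dp_group is a whole number and an
    integer multiple of micro_batch_size (hypothetical pre-flight check)."""
    divisor = num_generations * dp_size
    # The per-group prompt count must itself be an integer.
    if global_batch_size % divisor != 0:
        return False
    num_prompts_per_dp_group = global_batch_size // divisor
    return num_prompts_per_dp_group % micro_batch_size == 0


# 128 completions / 8 per prompt / 4 DP groups = 4 prompts per group.
print(prompts_per_dp_group_is_valid(128, 8, 4, micro_batch_size=2))
# 32 completions / 8 per prompt / 4 DP groups = 1 prompt per group.
print(prompts_per_dp_group_is_valid(32, 8, 4, micro_batch_size=2))
```

The first call satisfies the constraint (4 is a multiple of 2); the second does not (1 is not a multiple of 2).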

For more parameters, please refer to the Command-line Parameters documentation.

For training scripts, please refer to Megatron GRPO Scripts.