# Megatron GRPO

If you are new to GRPO, please refer to the [GRPO documentation](../Instruction/GRPO/GetStarted/GRPO.md) first.

Megatron GRPO currently supports the following features:

- **Training Modes**: Full parameter training and LoRA fine-tuning
- **Parallelism Strategies**: Context Parallelism (CP), Pipeline Parallelism (PP), Tensor Parallelism (TP), and Expert Parallelism (EP)
- **Inference Acceleration**: vLLM colocate mode and server mode
- **Model Support**: Compatible with the LLMs and MLLMs (multimodal large models) available in Megatron-SWIFT
- **Algorithm Support**: Covers most features of ms-swift GRPO

As in ms-swift GRPO, all batch size-related parameters in Megatron GRPO are at the **completion level**: they count the completions generated by the model, not the prompts.

#### Parameter Comparison

The following table compares the batch-related parameters of ms-swift and Megatron-SWIFT:

| ms-swift Parameter | Megatron-SWIFT Parameter | Description |
|-------------------|--------------------------|-------------|
| `per_device_train_batch_size` | `micro_batch_size` | Training batch size per DP group (completion-level) |
| `gradient_accumulation_steps` | - | Gradient accumulation steps; in Megatron-SWIFT this is already included in the `global_batch_size` calculation |
| - | `global_batch_size` | Global batch size (completion-level)<br>**Megatron-SWIFT**: `micro_batch_size × dp_size × gradient_accumulation_steps`<br>**ms-swift**: `per_device_train_batch_size × world_size × gradient_accumulation_steps` |
| `num_generations` | `num_generations` | Number of completions generated per prompt |
| `steps_per_generation` | `steps_per_generation` | Ratio of Rollout batch size to training batch size<br>**Note**: In ms-swift, must be an integer multiple of `gradient_accumulation_steps` |
| `generation_batch_size` | `generation_batch_size` | Batch size during Rollout phase (completion-level), must be an integer multiple of `global_batch_size` |

The following formulas are used to calculate batch sizes in Megatron GRPO:

- **Data Parallel Size**: `dp_size = world_size / (TP × PP × CP)`
- **Global Batch Size**: `global_batch_size = micro_batch_size × dp_size × gradient_accumulation_steps`
- **Generation Batch Size**: `generation_batch_size = global_batch_size × steps_per_generation`
- **Rollout Prompt Count**: `num_rollout_prompts = generation_batch_size / num_generations`
- **Training Prompt Count**: `num_train_prompts = global_batch_size / num_generations`
- **Training Prompt Count per DP Group**: `num_prompts_per_dp_group = global_batch_size / num_generations / dp_size`

**Note**: In Megatron GRPO, `num_prompts_per_dp_group` must be an integer multiple of `micro_batch_size` to ensure proper batch allocation during training.

For more parameters, please refer to the [Command-line Parameters documentation](./Command-line-parameters.md#grpo-parameters).

For training scripts, please refer to [Megatron GRPO Scripts](https://github.com/modelscope/ms-swift/blob/main/examples/megatron/grpo).
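
To sanity-check a configuration before launching a run, the batch-size formulas on this page can be evaluated with a short script. The following is a minimal sketch under stated assumptions, not part of ms-swift or Megatron-SWIFT: the function name `megatron_grpo_batch_sizes`, its argument names, and its error messages are hypothetical, and it follows the formulas exactly as written here (so `dp_size` is derived from TP, PP, and CP only).

```python
def megatron_grpo_batch_sizes(world_size, tp, pp, cp, micro_batch_size,
                              gradient_accumulation_steps,
                              steps_per_generation, num_generations):
    """Hypothetical helper: evaluate the batch-size formulas from this page."""
    model_parallel = tp * pp * cp
    if world_size % model_parallel != 0:
        raise ValueError("world_size must be divisible by TP x PP x CP")
    dp_size = world_size // model_parallel

    # All batch sizes below are completion-level.
    global_batch_size = micro_batch_size * dp_size * gradient_accumulation_steps
    generation_batch_size = global_batch_size * steps_per_generation

    # Prompt counts must come out as whole numbers.
    if global_batch_size % (num_generations * dp_size) != 0:
        raise ValueError(
            "global_batch_size must be divisible by num_generations x dp_size")
    num_rollout_prompts = generation_batch_size // num_generations
    num_train_prompts = global_batch_size // num_generations
    num_prompts_per_dp_group = num_train_prompts // dp_size

    # Constraint stated in the note on this page.
    if num_prompts_per_dp_group % micro_batch_size != 0:
        raise ValueError(
            "num_prompts_per_dp_group must be a multiple of micro_batch_size")

    return {
        "dp_size": dp_size,
        "global_batch_size": global_batch_size,
        "generation_batch_size": generation_batch_size,
        "num_rollout_prompts": num_rollout_prompts,
        "num_train_prompts": num_train_prompts,
        "num_prompts_per_dp_group": num_prompts_per_dp_group,
    }


# Illustrative values only: 8 GPUs with TP=2, PP=1, CP=1.
example = megatron_grpo_batch_sizes(
    world_size=8, tp=2, pp=1, cp=1,
    micro_batch_size=2, gradient_accumulation_steps=8,
    steps_per_generation=2, num_generations=8,
)
print(example)
```

With these illustrative values, `dp_size` is 4, `global_batch_size` is 64, and `generation_batch_size` is 128, so each Rollout covers 16 prompts and each training step covers 8 prompts, 2 per DP group.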