# Megatron GRPO


If you are new to GRPO, please refer to the [GRPO documentation](../Instruction/GRPO/GetStarted/GRPO.md) first.

Megatron GRPO currently supports the following features:

- **Training Modes**: Full parameter training and LoRA fine-tuning
- **Parallelism Strategies**: Context Parallelism (CP), Pipeline Parallelism (PP), Tensor Parallelism (TP), and Expert Parallelism (EP)
- **Inference Acceleration**: vLLM colocate mode and server mode
- **Model Support**: Compatible with the LLMs and MLLMs (multimodal large language models) supported by Megatron-SWIFT
- **Algorithm Support**: Covers most features of ms-swift GRPO

As in ms-swift GRPO, all batch-size-related parameters in Megatron GRPO are **completion-level**: they count the completions generated by the model, not the prompts.

## Parameter Comparison

The following table compares the batch-related parameters between ms-swift and Megatron-SWIFT:

| ms-swift Parameter | Megatron-SWIFT Parameter | Description |
|-------------------|--------------------------|-------------|
| `per_device_train_batch_size` | `micro_batch_size` | Training batch size per DP group (completion-level) |
| `gradient_accumulation_steps` | - | Gradient accumulation steps. Megatron-SWIFT has no separate parameter for this; the value is already included in the `global_batch_size` calculation |
| - | `global_batch_size` | Global batch size (completion-level)<br/>**Megatron-SWIFT**: `micro_batch_size × dp_size × gradient_accumulation_steps`<br/>**ms-swift**: `per_device_train_batch_size × world_size × gradient_accumulation_steps` |
| `num_generations` | `num_generations` | Number of completions generated per prompt |
| `steps_per_generation` | `steps_per_generation` | Ratio of the rollout batch size to the training batch size<br/>**Note**: In ms-swift, this must be an integer multiple of `gradient_accumulation_steps` |
| `generation_batch_size` | `generation_batch_size` | Batch size during the rollout phase (completion-level); must be an integer multiple of `global_batch_size` |
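
Because Megatron-SWIFT exposes `global_batch_size` rather than `gradient_accumulation_steps`, it can help to recover the implied number of accumulation steps by rearranging the formula in the table. The following is a minimal sketch, not part of either library: the function names are hypothetical, and the divisibility check is an assumption made here so that the result is a whole number.

```python
def implied_grad_accum_steps(global_batch_size: int,
                             micro_batch_size: int,
                             dp_size: int) -> int:
    """Solve global_batch_size = micro_batch_size * dp_size * steps for steps.

    Hypothetical helper for illustration; not a Megatron-SWIFT API.
    """
    completions_per_step = micro_batch_size * dp_size
    # Assumption of this sketch: the division must be exact.
    if global_batch_size % completions_per_step != 0:
        raise ValueError(
            "global_batch_size must be a multiple of micro_batch_size * dp_size"
        )
    return global_batch_size // completions_per_step


def ms_swift_global_batch_size(per_device_train_batch_size: int,
                               world_size: int,
                               gradient_accumulation_steps: int) -> int:
    """Global batch size as the table defines it for ms-swift."""
    return per_device_train_batch_size * world_size * gradient_accumulation_steps


# Illustrative values only: dp_size = 4, micro_batch_size = 4, global_batch_size = 64.
print(implied_grad_accum_steps(global_batch_size=64, micro_batch_size=4, dp_size=4))  # 4
```

Note that the ms-swift formula multiplies by `world_size`, while the Megatron-SWIFT formula multiplies by `dp_size`, so the same numeric settings can yield different global batch sizes when model parallelism is enabled.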

The following formulas are used to calculate batch sizes in Megatron GRPO:

- **Data Parallel Size**: `dp_size = world_size / (TP × PP × CP)`
- **Global Batch Size**: `global_batch_size = micro_batch_size × dp_size × gradient_accumulation_steps`
- **Generation Batch Size**: `generation_batch_size = global_batch_size × steps_per_generation`
- **Rollout Prompt Count**: `num_rollout_prompts = generation_batch_size / num_generations`
- **Training Prompt Count**: `num_train_prompts = global_batch_size / num_generations`
- **Training Prompt Count per DP Group**: `num_prompts_per_dp_group = global_batch_size / num_generations / dp_size`

**Note**: In Megatron GRPO, `num_prompts_per_dp_group` must be an integer multiple of `micro_batch_size` so that batches can be allocated properly during training.
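
The formulas and the constraint above can be checked before launching a job. The sketch below is an illustration under stated assumptions rather than Megatron-SWIFT code: `BatchPlan`, `plan_batches`, and `_exact_div` are hypothetical names, and requiring every division to be exact is an assumption of this sketch (only the `micro_batch_size` constraint is stated in this document).

```python
from dataclasses import dataclass


@dataclass
class BatchPlan:
    dp_size: int
    global_batch_size: int
    generation_batch_size: int
    num_rollout_prompts: int
    num_train_prompts: int
    num_prompts_per_dp_group: int


def _exact_div(numerator: int, denominator: int, what: str) -> int:
    # Assumption of this sketch: all derived counts must be whole numbers.
    if numerator % denominator != 0:
        raise ValueError(f"{what}: {numerator} is not divisible by {denominator}")
    return numerator // denominator


def plan_batches(world_size: int, tp: int, pp: int, cp: int,
                 micro_batch_size: int, gradient_accumulation_steps: int,
                 steps_per_generation: int, num_generations: int) -> BatchPlan:
    """Apply the batch-size formulas listed above, in order."""
    dp_size = _exact_div(world_size, tp * pp * cp, "dp_size")
    global_batch_size = micro_batch_size * dp_size * gradient_accumulation_steps
    generation_batch_size = global_batch_size * steps_per_generation
    num_rollout_prompts = _exact_div(
        generation_batch_size, num_generations, "num_rollout_prompts")
    num_train_prompts = _exact_div(
        global_batch_size, num_generations, "num_train_prompts")
    num_prompts_per_dp_group = _exact_div(
        num_train_prompts, dp_size, "num_prompts_per_dp_group")
    # Constraint from the note above.
    if num_prompts_per_dp_group % micro_batch_size != 0:
        raise ValueError(
            "num_prompts_per_dp_group must be an integer multiple of micro_batch_size"
        )
    return BatchPlan(dp_size, global_batch_size, generation_batch_size,
                     num_rollout_prompts, num_train_prompts,
                     num_prompts_per_dp_group)


# Illustrative values only: 8 GPUs with TP=2, PP=1, CP=1.
plan = plan_batches(world_size=8, tp=2, pp=1, cp=1,
                    micro_batch_size=2, gradient_accumulation_steps=8,
                    steps_per_generation=2, num_generations=4)
print(plan)
```

With these illustrative values, `dp_size` is 4, `global_batch_size` is 64, `generation_batch_size` is 128, and each DP group trains on 4 prompts per step, which is a multiple of `micro_batch_size = 2`. Lowering `gradient_accumulation_steps` to 2 would leave 1 prompt per DP group and fail the check.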

For more parameters, please refer to the [Command-line Parameters documentation](./Command-line-parameters.md#grpo-parameters).

For training scripts, please refer to [Megatron GRPO Scripts](https://github.com/modelscope/ms-swift/blob/main/examples/megatron/grpo).