# REINFORCE++: An Efficient RLHF Algorithm with Robustness to Both Prompt and Reward Models [REINFORCE++ Baseline](https://arxiv.org/abs/2501.03262) is a simplified version of the REINFORCE++ algorithm, designed for outcome rewards (response-level scalar rewards). Similar to GRPO, it samples multiple model outputs for each prompt and uses an intra-group baseline to estimate advantages. The key difference lies in the statistics used for normalization. ## Algorithm Overview For clarity, we explain REINFORCE++ Baseline by contrasting it with GRPO (Group Relative Policy Optimization). Both GRPO and REINFORCE++ Baseline estimate advantages via intra-group comparisons. Their main differences are: ### Difference 1: Statistics Used for Normalization **GRPO (Group Relative Policy Optimization)** For each prompt, GRPO generates $G$ response samples and normalizes using the **mean and standard deviation of all samples within the group**: $$ \hat{A}_{i} = \frac{R_i - \text{mean}(\{R_j\}_{j=1}^G)}{\text{std}(\{R_j\}_{j=1}^G)} $$ When `scale_rewards='batch'` is set, it uses the **batch-level std of original rewards**: $$ \hat{A}_{i} = \frac{R_i - \text{mean}(\{R_j\}_{j=1}^G)}{\text{std}(\{R_j\}_{j=1}^{N})} $$ where $N$ is the total number of samples in the batch. **REINFORCE++ Baseline** For each prompt, REINFORCE++ generates $G$ response samples, subtracts the group mean, and then normalizes using the **standard deviation of the group-mean-subtracted rewards**: $$ \begin{align} \tilde{A}_{i} &= R_i - \text{mean}(\{R_j\}_{j=1}^G) \\ \hat{A}_{i} &= \frac{\tilde{A}_{i}}{\text{std}(\{\tilde{A}_k\}_{k=1}^{N})} \end{align} $$ where $N$ is the total number of samples in the batch. **Key Difference**: - **GRPO**: Uses the std of **original rewards $R$** for normalization - **REINFORCE++**: Uses the std of **group-mean-subtracted rewards $\tilde{A}$** for normalization ### Difference 2: KL Divergence Regularization Similar to RLOO, REINFORCE++ Baseline integrates KL divergence directly into the reward: $$ R'_i = R_i - \beta \cdot \text{KL}(\pi_\theta || \pi_{\text{ref}}) $$ where $\beta$ is the KL divergence weight coefficient (corresponding to the parameter `beta`), and $\pi_{\text{ref}}$ is the reference policy. ## Parameter Configuration We can implement REINFORCE++ Baseline training by configuring the following parameters with `GRPOTrainer`: ```bash --advantage_estimator reinforce_plus_plus --scale_rewards batch --kl_in_reward true ``` For training examples, please refer to this [script](https://github.com/modelscope/ms-swift/tree/main/examples/train/grpo/internal/reinforce_plus_plus.sh) ### Key Parameter Descriptions - **`--advantage_estimator`**: Selects the advantage estimation method - `grpo` (default): Uses the std of original rewards for normalization - `reinforce_plus_plus`: Uses the std of group-mean-subtracted rewards for normalization - **`--kl_in_reward`**: Controls where the KL divergence regularization term is applied - `false`: KL divergence is an independent regularization term in the loss function (GRPO default) - `true`: KL divergence is subtracted directly from the reward (REINFORCE++ original implementation) - **`--scale_rewards`**: Controls the normalization method - `group` (default): Intra-group normalization - `batch`: Global batch-level normalization (REINFORCE++ original implementation) - `none`: No normalization - **`--num_generations`**: Number of samples generated per prompt ($G$) - **`--beta`**: KL divergence regularization coefficient ($\beta$) For other parameters, please refer to [GRPO Parameters](../../Command-line-parameters.md#grpo-arguments)