REINFORCE++: An Efficient RLHF Algorithm with Robustness to Both Prompt and Reward Models

REINFORCE++ Baseline is a simplified version of the REINFORCE++ algorithm, designed for outcome rewards (response-level scalar rewards). Like GRPO, it samples multiple outputs per prompt and uses an intra-group baseline to estimate advantages. It differs from GRPO in the statistics used for normalization and in where the KL penalty is applied.

Algorithm Overview

For clarity, we explain REINFORCE++ Baseline by contrasting it with GRPO (Group Relative Policy Optimization).

Both GRPO and REINFORCE++ Baseline estimate advantages via intra-group comparisons. Their main differences are:

Difference 1: Statistics Used for Normalization

GRPO (Group Relative Policy Optimization)

For each prompt, GRPO generates \(G\) response samples and normalizes using the mean and standard deviation of all samples within the group:

\[ \hat{A}_{i} = \frac{R_i - \text{mean}(\{R_j\}_{j=1}^G)}{\text{std}(\{R_j\}_{j=1}^G)} \]

When scale_rewards='batch' is set, the group mean is still subtracted, but the denominator becomes the std of the original rewards over the whole batch:

\[ \hat{A}_{i} = \frac{R_i - \text{mean}(\{R_j\}_{j=1}^G)}{\text{std}(\{R_j\}_{j=1}^{N})} \]

where \(N\) is the total number of samples in the batch.
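
The two GRPO formulas above can be sketched in NumPy as follows. This is a minimal illustration rather than the trainer's actual code: the function name `grpo_advantages`, the flat prompt-major layout of `rewards`, the use of the population std (NumPy's default), and the small `eps` added to the denominator are all assumptions made for the example.

```python
import numpy as np

def grpo_advantages(rewards, num_generations, scale_rewards="group", eps=1e-8):
    """Illustrative GRPO advantages (hypothetical helper, not a library API).

    `rewards` is a flat sequence of N = num_prompts * G rewards in which the
    G samples of each prompt are stored contiguously.
    """
    r = np.asarray(rewards, dtype=np.float64).reshape(-1, num_generations)
    centered = r - r.mean(axis=1, keepdims=True)  # R_i - mean of its group
    if scale_rewards == "group":
        scale = r.std(axis=1, keepdims=True)      # std within each group
    elif scale_rewards == "batch":
        scale = r.std()                           # std of all N original rewards
    else:
        raise ValueError(f"unsupported scale_rewards: {scale_rewards!r}")
    # eps is an assumption: it avoids division by zero when all rewards match.
    return (centered / (scale + eps)).reshape(-1)

rewards = [0.0, 1.0, 2.0, 4.0]  # two prompts, G = 2
group_adv = grpo_advantages(rewards, 2, scale_rewards="group")
batch_adv = grpo_advantages(rewards, 2, scale_rewards="batch")
```

With group scaling, both prompts produce advantages of the same magnitude; with batch scaling, the prompt whose rewards are more spread out keeps larger advantages.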

REINFORCE++ Baseline

For each prompt, REINFORCE++ Baseline generates \(G\) response samples, subtracts the group mean, and then normalizes by the batch-level standard deviation of the group-mean-subtracted rewards:

\[ \begin{aligned} \tilde{A}_{i} &= R_i - \text{mean}(\{R_j\}_{j=1}^G) \\ \hat{A}_{i} &= \frac{\tilde{A}_{i}}{\text{std}(\{\tilde{A}_k\}_{k=1}^{N})} \end{aligned} \]

where \(N\) is the total number of samples in the batch.
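
The two-step computation above can be sketched in NumPy as follows. As before, this is an illustration under stated assumptions and not the library's implementation: the name `reinforce_pp_baseline_advantages`, the prompt-major layout of `rewards`, the population std, and the `eps` term are all hypothetical choices.

```python
import numpy as np

def reinforce_pp_baseline_advantages(rewards, num_generations, eps=1e-8):
    """Illustrative REINFORCE++ Baseline advantages (hypothetical helper).

    `rewards` is a flat sequence of N = num_prompts * G rewards in which the
    G samples of each prompt are stored contiguously.
    """
    r = np.asarray(rewards, dtype=np.float64).reshape(-1, num_generations)
    centered = r - r.mean(axis=1, keepdims=True)  # A-tilde: subtract group mean
    scale = centered.std()                        # std of A-tilde over all N samples
    # eps is an assumption: it avoids division by zero when all rewards match.
    return (centered / (scale + eps)).reshape(-1)

rewards = [0.0, 1.0, 2.0, 4.0]  # two prompts, G = 2
adv = reinforce_pp_baseline_advantages(rewards, 2)
```

Because every group of mean-subtracted rewards sums to zero, the batch of \(\tilde{A}\) values already has zero mean, so dividing by its std yields advantages with (approximately) unit variance across the batch.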

Key Difference:

GRPO: Normalizes by the std of the original rewards \(R\) (within the group by default, or over the batch when scale_rewards='batch')
REINFORCE++ Baseline: Normalizes by the batch-level std of the group-mean-subtracted rewards \(\tilde{A}\)
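
One way to see why this choice matters is the following decomposition, a sketch that assumes population (not sample) statistics and equal group sizes. The symbols \(\mu_g\) (mean reward of group \(g\)), \(\bar{R}\) (batch mean reward), and \(M = N/G\) (number of prompts) are introduced here for illustration only. The batch variance of the original rewards splits into a within-group part and a between-group part:

\[ \underbrace{\frac{1}{N}\sum_{i=1}^{N}(R_i - \bar{R})^2}_{\text{std}(\{R_j\}_{j=1}^{N})^2} = \underbrace{\frac{1}{N}\sum_{i=1}^{N}\tilde{A}_i^2}_{\text{std}(\{\tilde{A}_k\}_{k=1}^{N})^2} + \frac{1}{M}\sum_{g=1}^{M}(\mu_g - \bar{R})^2 \]

The first term on the right equals the squared std of \(\tilde{A}\) because each group of mean-subtracted rewards sums to zero. The batch-level std of the original rewards is therefore never smaller than that of the group-mean-subtracted rewards, and the gap grows with how much the mean reward varies across prompts. With group-level scaling the two choices coincide, since subtracting a constant within a group leaves that group's std unchanged.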

Difference 2: KL Divergence Regularization

Similar to RLOO, REINFORCE++ Baseline integrates KL divergence directly into the reward:

\[ R'_i = R_i - \beta \cdot \text{KL}(\pi_\theta \,\|\, \pi_{\text{ref}}) \]

where \(\beta\) is the KL divergence weight coefficient (corresponding to the parameter beta), and \(\pi_{\text{ref}}\) is the reference policy.
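
The reward shaping above can be sketched as follows. The text does not say how the KL term is estimated or aggregated over tokens, so this example assumes one common choice: summing the per-token log-probability differences between the policy and the reference on the sampled response. The function name `kl_penalized_rewards` and its arguments are hypothetical.

```python
import numpy as np

def kl_penalized_rewards(rewards, policy_logps, ref_logps, beta):
    """Illustrative R'_i = R_i - beta * KL_i (hypothetical helper).

    policy_logps[i] and ref_logps[i] hold the per-token log-probabilities of
    response i under the current policy and the reference policy.
    """
    shaped = []
    for r, lp, ref_lp in zip(rewards, policy_logps, ref_logps):
        # Assumed estimator: sum of log-ratios over the sampled tokens.
        kl = float(np.sum(np.asarray(lp) - np.asarray(ref_lp)))
        shaped.append(r - beta * kl)
    return np.array(shaped)

shaped = kl_penalized_rewards(
    rewards=[1.0, 0.0],
    policy_logps=[[-1.0, -2.0], [-0.5]],
    ref_logps=[[-1.5, -2.5], [-0.5]],
    beta=0.1,
)
```

In this toy input, the first response has drifted from the reference and loses part of its reward, while the second matches the reference and is left unchanged.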

Parameter Configuration

REINFORCE++ Baseline training can be run with GRPOTrainer by setting the following parameters:

--advantage_estimator reinforce_plus_plus
--scale_rewards batch
--kl_in_reward true

For training examples, please refer to this script

Key Parameter Descriptions

--advantage_estimator: Selects the advantage estimation method
  - grpo (default): Uses the std of original rewards for normalization
  - reinforce_plus_plus: Uses the std of group-mean-subtracted rewards for normalization
--kl_in_reward: Controls where the KL divergence regularization term is applied
  - false: KL divergence is an independent regularization term in the loss function (GRPO default)
  - true: KL divergence is subtracted directly from the reward (REINFORCE++ original implementation)
--scale_rewards: Controls the normalization method
  - group (default): Intra-group normalization
  - batch: Global batch-level normalization (REINFORCE++ original implementation)
  - none: No normalization
--num_generations: Number of samples generated per prompt (\(G\))
--beta: KL divergence regularization coefficient (\(\beta\))

For other parameters, please refer to GRPO Parameters