Loss Types

GRPO training supports several loss types. They differ mainly in how the token-level loss is normalized and in how gradients are handled (hard clipping, detaching, or soft gating).

Loss Function

At the token level, GRPO training uses the following loss function:

\[\mathcal{L}_{i,t} = -\min\left(\rho_{i,t} A_{i,t}, \text{clip}(\rho_{i,t}, 1-\epsilon, 1+\epsilon) A_{i,t}\right)\]

When loss_type is set to cispo, the CISPO loss is used:

\[\mathcal{L}_{i,t}^{\text{CISPO}} = -\text{detach}\left(\min(\rho_{i,t}, \epsilon_{\text{high}})\right) \cdot A_{i,t} \cdot \log \pi_\theta(y_{i,t}|y_{i,<t})\]

When loss_type is set to sapo, soft gating replaces hard clipping (see SAPO):

\[\mathcal{L}_{i,t}^{\text{SAPO}} = -g_{i,t} \cdot A_{i,t}\]

Here \(g_{i,t} = \sigma(\tau \cdot (\rho_{i,t} - 1))\) is the temperature-controlled soft gate function.

In the formulas above:

  • \(\rho_{i,t} = \frac{\pi_\theta(y_{i,t}|y_{i,<t})}{\pi_{\theta_{\text{old}}}(y_{i,t}|y_{i,<t})}\) is the importance sampling weight

  • \(A_{i,t}\) is the advantage

  • \(\epsilon\) and \(\epsilon_{\text{high}}\) are the clipping parameters

  • \(\text{detach}(\cdot)\) indicates that this term does not participate in gradient computation

  • \(\sigma(\cdot)\) is the sigmoid function, \(\tau\) is the temperature parameter
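The three token-level formulas above can be sketched in plain Python for a single token. This is a minimal illustration of the arithmetic only, not the library's implementation: the function names are hypothetical, the default values of `eps`, `eps_high`, and `tau` are chosen for illustration, and since there is no autograd here, the detached CISPO coefficient is simply treated as a constant.

```python
import math


def clip(x, lo, hi):
    """Clamp x into the interval [lo, hi]."""
    return max(lo, min(x, hi))


def grpo_token_loss(rho, adv, eps=0.2):
    """Clipped surrogate: -min(rho * A, clip(rho, 1 - eps, 1 + eps) * A)."""
    return -min(rho * adv, clip(rho, 1.0 - eps, 1.0 + eps) * adv)


def cispo_token_loss(rho, adv, logp, eps_high=2.0):
    """CISPO value: -min(rho, eps_high) * A * log pi(y_t | y_<t).

    In a real training loop the coefficient min(rho, eps_high) would be
    detached, so gradients flow only through logp. Here it is just a number.
    """
    coef = min(rho, eps_high)
    return -coef * adv * logp


def sapo_token_loss(rho, adv, tau=1.0):
    """Soft-gated loss: -sigmoid(tau * (rho - 1)) * A."""
    gate = 1.0 / (1.0 + math.exp(-tau * (rho - 1.0)))
    return -gate * adv


# With a positive advantage, a ratio above 1 + eps is clipped ...
print(grpo_token_loss(1.5, 1.0))    # -1.2
# ... but with a negative advantage the unclipped term is the smaller one.
print(grpo_token_loss(1.5, -1.0))   # 1.5
# At rho = 1 the SAPO gate equals sigmoid(0) = 0.5.
print(sapo_token_loss(1.0, 1.0))    # -0.5
```

The second call shows why the `min` matters: clipping only ever limits how much a token can improve the objective, never how much it can be penalized.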

GRPO

--loss_type grpo

GRPO is the standard loss: it averages the token-level losses within each sample, then averages the per-sample means across the batch.

Formula:

\[\mathcal{L}_{\text{GRPO}} = \frac{1}{N} \sum_{i=1}^{N} \frac{1}{T_i} \sum_{t=1}^{T_i} \mathcal{L}_{i,t}\]

where:

  • \(N\) is the number of samples in the batch

  • \(T_i\) is the number of completion tokens for the \(i\)-th sample

Normalization Dimension: Sample dimension (first average over tokens for each sample, then average over all samples)
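As a small worked example of this sample-level normalization, the sketch below takes already-computed token losses as a list of per-sample lists. The helper name `grpo_loss` and the toy numbers are hypothetical, and every sample is assumed to contain at least one completion token.

```python
def grpo_loss(token_losses):
    """Mean over tokens within each sample, then mean over samples."""
    per_sample = [sum(sample) / len(sample) for sample in token_losses]
    return sum(per_sample) / len(per_sample)


# Two samples: a short one (2 tokens) and a longer one (4 tokens).
losses = [[4.0, 2.0], [1.0, 1.0, 1.0, 1.0]]
print(grpo_loss(losses))  # (3.0 + 1.0) / 2 = 2.0
```

Each sample contributes equally regardless of its length, so a token in the short sample carries more weight than a token in the long one.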

BNPO (Batch Normalized Policy Optimization)

--loss_type bnpo

BNPO sums all token losses from all samples and then divides by the total number of completion tokens.

Formula:

\[\mathcal{L}_{\text{BNPO}} = \frac{\sum_{i=1}^{N} \sum_{t=1}^{T_i} \mathcal{L}_{i,t}}{\sum_{i=1}^{N} T_i}\]

where:

  • \(N\) is the number of samples in the batch

  • \(T_i\) is the number of completion tokens for the \(i\)-th sample

Normalization Dimension: Token dimension (average over all completion tokens)
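The token-level normalization can be sketched the same way, with token losses given as a list of per-sample lists. The helper name `bnpo_loss` and the toy numbers are hypothetical.

```python
def bnpo_loss(token_losses):
    """Sum of all token losses divided by the total number of tokens."""
    total = sum(sum(sample) for sample in token_losses)
    n_tokens = sum(len(sample) for sample in token_losses)
    return total / n_tokens


# Two samples: a short one (2 tokens) and a longer one (4 tokens).
losses = [[4.0, 2.0], [1.0, 1.0, 1.0, 1.0]]
print(bnpo_loss(losses))  # 10 / 6, about 1.667
```

Here every token carries the same weight, so longer samples contribute more to the batch loss than shorter ones.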

DR-GRPO

--loss_type dr_grpo

DR-GRPO sums all token losses from all samples and then divides by the batch size multiplied by the maximum completion length.

Formula:

\[\mathcal{L}_{\text{DR-GRPO}} = \frac{\sum_{i=1}^{N} \sum_{t=1}^{T_i} \mathcal{L}_{i,t}}{N \times L_{\text{max}}}\]

where:

  • \(N\) is the number of samples in the batch

  • \(T_i\) is the number of completion tokens for the \(i\)-th sample

  • \(L_{\text{max}}\) is the maximum completion length

Normalization Dimension: Fixed dimension (batch size × maximum completion length)
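A minimal sketch of the fixed normalization follows. The helper name `dr_grpo_loss`, its parameter name, and the value 8 used for the maximum completion length are hypothetical choices for illustration.

```python
def dr_grpo_loss(token_losses, max_completion_length):
    """Sum of all token losses divided by N * L_max."""
    total = sum(sum(sample) for sample in token_losses)
    return total / (len(token_losses) * max_completion_length)


# Two samples with 2 and 4 completion tokens, and L_max = 8.
losses = [[4.0, 2.0], [1.0, 1.0, 1.0, 1.0]]
print(dr_grpo_loss(losses, max_completion_length=8))  # 10 / (2 * 8) = 0.625
```

Because the denominator is a constant, the weight of a token does not depend on how long its own sample or any other sample in the batch happens to be.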

CISPO

--loss_type cispo

CISPO loss is normalized by the total number of completion tokens across all processes.

Formula:

\[\mathcal{L}_{\text{CISPO}} = \frac{\sum_{i=1}^{N} \sum_{t=1}^{T_i} \mathcal{L}_{i,t}^{\text{CISPO}}}{\sum_{\text{all processes}} \sum_{i=1}^{N_p} T_{p,i}}\]

where:

  • \(N\) is the number of samples in the current process batch

  • \(T_i\) is the number of completion tokens for the \(i\)-th sample

  • \(N_p\) is the number of samples in the \(p\)-th process, and \(T_{p,i}\) is the number of completion tokens of its \(i\)-th sample

Normalization Dimension: Global token dimension (total completion tokens across all processes)

DAPO

--loss_type dapo

DAPO is similar to BNPO in that it normalizes at the token level, but it divides by the total number of completion tokens across all processes rather than by the token count of the local batch.

Formula:

\[\mathcal{L}_{\text{DAPO}} = \frac{\sum_{i=1}^{N} \sum_{t=1}^{T_i} \mathcal{L}_{i,t}}{\sum_{\text{all processes}} \sum_{i=1}^{N_p} T_{p,i}}\]

where:

  • \(N\) is the number of samples in the current process batch

  • \(T_i\) is the number of completion tokens for the \(i\)-th sample

  • \(N_p\) is the number of samples in the \(p\)-th process, and \(T_{p,i}\) is the number of completion tokens of its \(i\)-th sample

Normalization Dimension: Global token dimension (total completion tokens across all processes)
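The global token normalization can be illustrated by simulating several processes in a single Python program. This is only a sketch of the arithmetic in the formula: the helper name `dapo_local_loss` and the toy batches are hypothetical, and in real distributed training the global token count would come from a cross-process reduction rather than a local loop.

```python
def dapo_local_loss(local_token_losses, global_token_count):
    """Local sum of token losses divided by the token count of ALL processes."""
    local_sum = sum(sum(sample) for sample in local_token_losses)
    return local_sum / global_token_count


# Token losses held by two simulated processes.
process_batches = [
    [[4.0, 2.0], [1.0, 1.0, 1.0, 1.0]],  # process 0: 6 completion tokens
    [[3.0, 3.0]],                        # process 1: 2 completion tokens
]

# Denominator shared by every process: 6 + 2 = 8 tokens.
global_token_count = sum(
    len(sample) for batch in process_batches for sample in batch
)

local_losses = [dapo_local_loss(b, global_token_count) for b in process_batches]
print(local_losses)       # [1.25, 0.75]
print(sum(local_losses))  # 2.0, the mean over all 8 tokens
```

Summing the per-process values recovers the plain token mean over all processes (16 / 8), which is what distinguishes this scheme from normalizing each process by its own token count.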

FIPO

--loss_type fipo

FIPO adds a Future-KL influence weight on top of the DAPO/GRPO clipped policy loss. The sequence-level advantage at each token is reweighted by the discounted, accumulated KL shift from that token through the end of the completion:

\[f_{i,t} = \text{clip}\left(\exp\left(\sum_{k=t}^{T_i} \gamma^{k-t} M_{i,k} \Delta \log p_{i,k}\right), 1-\epsilon_f, 1+\epsilon_f\right)\]
\[\mathcal{L}_{i,t}^{\text{FIPO}} = f_{i,t} \cdot \mathcal{L}_{i,t}\]

The FIPO influence weight is detached by default and uses the same global token normalization as DAPO.

Normalization Dimension: Global token dimension (total completion tokens across all processes)
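The influence weight \(f_{i,t}\) for one completion can be computed with a single backward pass over its tokens, using the recurrence \(S_t = M_t \Delta \log p_t + \gamma S_{t+1}\). The sketch below rests on assumptions not stated above: `mask` is taken to be a 0/1 completion mask for \(M_{i,k}\), `delta_logp` is taken to be a per-token log-probability shift for \(\Delta \log p_{i,k}\), and the values of `gamma` and `eps_f` are illustrative. The function name is hypothetical.

```python
import math


def fipo_weights(delta_logp, mask, gamma=0.9, eps_f=0.2):
    """Return f_t = clip(exp(sum_{k>=t} gamma^(k-t) * M_k * dlogp_k), 1-eps_f, 1+eps_f)."""
    weights = [0.0] * len(delta_logp)
    acc = 0.0  # discounted sum over tokens t..T, built from the last token backward
    for t in range(len(delta_logp) - 1, -1, -1):
        acc = mask[t] * delta_logp[t] + gamma * acc
        weights[t] = max(1.0 - eps_f, min(math.exp(acc), 1.0 + eps_f))
    return weights


# Toy completion with three tokens; the last token has a large negative shift.
weights = fipo_weights([0.0, 0.1, -0.5], [1, 1, 1], gamma=0.5, eps_f=0.2)
print(weights)  # the last entry is clipped to the lower bound 1 - eps_f
```

Each token loss would then be multiplied by its weight before the global token normalization is applied. Since the weight is described as detached by default, it would act as a constant scaling factor in the backward pass.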

SAPO

--loss_type sapo

SAPO uses temperature-controlled soft gating instead of hard clipping to achieve smooth gradient attenuation. The normalization method is the same as GRPO.

For details, please refer to SAPO