# Rewards as Labels: Revisiting RLVR from a Classification Perspective

Author: [li2zhi](https://github.com/li2zhi)

[Rewards as Labels: Revisiting RLVR from a Classification Perspective](https://arxiv.org/abs/2602.05630) proposes a reformulation of GRPO by treating rewards as labels and performing **in-group classification** instead of advantage estimation. This converts the policy optimization problem into a classification problem, thereby addressing two key issues in the GRPO loss:
- **Gradient Misassignment** for positive samples
- **Gradient Domination** for negative samples

## Background and Motivation

GRPO Objective

$$
J_{\mathrm{GRPO}}(\theta)=\mathbb{E}_{q,o\sim\pi_{\mathrm{old}}(\cdot|q)}\left[\frac{1}{|o|}\sum_{t=1}^{|o|}\left(\min\left(\rho_tA_t,\mathrm{clip}(\rho_t,1-\epsilon,1+\epsilon)A_t\right)\right)\right]
$$

where:
- $\rho_t = \frac{\pi_\theta(o_t|q)}{\pi_{\mathrm{old}}(o_t|q)}$ is the probability ratio
- $A_t$ is the advantage function

The corresponding gradient is:

$$
\nabla_{\theta} J_{\mathrm{GRPO}}=\mathbb{E}\left[\frac{1}{|o|}\sum_{t=1}^{|o|}\mathbb{I}_{\mathrm{clip}}\cdot A_t e^{s_t}\nabla_{\theta}\log\pi_{\theta}(o_t|q)\right]
$$

where:
- $s_t = \log \frac{\pi_\theta(o_t|q)}{\pi_{\mathrm{old}}(o_t|q)}$ is the relative log-probability
- $\mathbb{I}_{\mathrm{clip}}$ is the clipping indicator

Thus, the per-token gradient weight in GRPO is:

$$
|\mathcal{W}_{\mathrm{GRPO}}|=\left\{ \begin{array} {ll}\left|A\cdot e^s\right|, & \mathrm{if~}\mathbb{I}_{\mathrm{clip}}=1, \\ 0, & \text{otherwise.} \end{array}\right.
$$

![Gradient magnitude visualizations in GRPO](../../../../resources/real.png)

1. **Gradient Misassignment (Positive Samples)**:
For positive samples, the gradient magnitude decreases as the relative log-probability $s$ decreases.
This is counterintuitive: correct tokens that the model is less confident about should receive larger updates. Instead, GRPO assigns more weight to tokens the model is already confident about, so under-trained tokens receive too little learning signal.

2. **Gradient Domination (Negative Samples)**:
For negative samples, the gradient magnitude grows exponentially as $s$ increases.
As a result, a few overconfident incorrect tokens dominate the gradient and overwhelm the other negative signals in the same group. Because the weight has no upper bound on this side, parameter updates can become unstable and excessively large.
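Both effects can be reproduced numerically from the weight formula above. The following is a minimal sketch, assuming a clip range of $\epsilon = 0.2$ and unit-magnitude advantages (illustrative choices, not values taken from the paper); `grpo_token_weight` is a hypothetical helper, not an ms-swift function.

```python
import math


def grpo_token_weight(s: float, advantage: float, eps: float = 0.2) -> float:
    """Return |A * e^s| when the token is not clipped, else 0.

    s is the relative log-probability log(pi_theta / pi_old).
    eps is an assumed clip range used only for illustration.
    """
    ratio = math.exp(s)
    if advantage > 0:
        # With A > 0, the min() selects the unclipped term only while ratio <= 1 + eps.
        active = ratio <= 1.0 + eps
    else:
        # With A < 0, the unclipped term is selected while ratio >= 1 - eps,
        # so there is no upper limit on the ratio.
        active = ratio >= 1.0 - eps
    return abs(advantage) * ratio if active else 0.0


if __name__ == "__main__":
    for s in (-2.0, -0.1, 0.0, 0.1, 2.0):
        pos = grpo_token_weight(s, advantage=1.0)
        neg = grpo_token_weight(s, advantage=-1.0)
        print(f"s={s:+.1f}  positive={pos:.3f}  negative={neg:.3f}")
```

In this sketch, a positive token with low $s$ receives a small weight (misassignment), while a negative token with high $s$ receives a weight that keeps growing as $e^s$ (domination).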

To address these issues, REAL (Rewards as Labels) treats rewards directly as labels and performs **group-wise classification training**.

![Real Framework](../../../../resources/real_framework.png)

The classification logit for each sample is defined as:

$$
\bar{s}^k=\frac{1}{|o^k|}\sum_{t=1}^{|o^k|}\left(\log\frac{\pi_\theta(o_t^k\mid q)}{\pi_{\mathrm{old}}(o_t^k\mid q)}\right)
$$

- $\bar{s}^k > 0$: The sample is more likely under the current policy than the old policy → the model tends to **promote** this sample
- $\bar{s}^k < 0$: The sample is less likely under the current policy → the model tends to **suppress** this sample
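The logit is simply the length-normalized sum of per-token log-ratios. A minimal sketch in plain Python, assuming the per-token log-probabilities under both policies are already available as lists (`sequence_logit` is a hypothetical name used only here):

```python
def sequence_logit(logp_new: list[float], logp_old: list[float]) -> float:
    """Mean per-token log-ratio between the current and the old policy."""
    if not logp_new or len(logp_new) != len(logp_old):
        raise ValueError("both inputs must be non-empty and of equal length")
    # Each term is log pi_theta(o_t | q) - log pi_old(o_t | q).
    return sum(new - old for new, old in zip(logp_new, logp_old)) / len(logp_new)


if __name__ == "__main__":
    # Every token became more likely under the current policy, so the logit is positive.
    print(sequence_logit([-1.0, -2.0], [-1.5, -2.5]))
```

A real training loop would compute this on batched tensors with a completion mask; the list version only illustrates the arithmetic.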

Loss Function

$$
\mathcal{L}_{\mathrm{REAL}}=\log\left(1+\sum_{\mathcal{O}_+}e^{-\bar{s}^i/\tau}\right)+\log\left(1+\sum_{\mathcal{O}_-}e^{\bar{s}^j/\tau}\right)
$$
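The loss for a single group can be evaluated directly from the sequence logits and the binary rewards. The sketch below is an illustration under stated assumptions: rewards are exactly `1` or `0`, `real_loss` is a hypothetical name, and the code works on plain Python lists rather than the tensors a real trainer would use.

```python
import math


def real_loss(logits: list[float], rewards: list[int], tau: float = 0.5) -> float:
    """Evaluate the REAL loss for one group of sampled responses.

    logits  : sequence-level logits (mean log-ratios), one per response
    rewards : 1 for a positive (correct) response, 0 for a negative one
    tau     : temperature
    """
    if len(logits) != len(rewards):
        raise ValueError("logits and rewards must have the same length")
    # Positive samples are penalized when their logit is low.
    pos_sum = sum(math.exp(-s / tau) for s, r in zip(logits, rewards) if r == 1)
    # Negative samples are penalized when their logit is high.
    neg_sum = sum(math.exp(s / tau) for s, r in zip(logits, rewards) if r == 0)
    # log1p(x) computes log(1 + x); an empty side contributes log(1) = 0.
    return math.log1p(pos_sum) + math.log1p(neg_sum)


if __name__ == "__main__":
    rewards = [1, 1, 0, 0]
    print(real_loss([0.0, 0.0, 0.0, 0.0], rewards))    # all logits at zero
    print(real_loss([1.0, 1.0, -1.0, -1.0], rewards))  # positives up, negatives down
```

Raising the logits of positive samples and lowering those of negative samples reduces the loss, which is the classification behavior the formula encodes.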

Gradient Properties

$$
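|\mathcal{W}_{\mathrm{REAL}}|=
\begin{cases}
\frac{1}{\tau}\frac{1}{1+C_{+}e^{\bar{s}^{k}/\tau}}, & r=1 \\
\frac{1}{\tau}\frac{1}{1+C_{-}e^{-\bar{s}^{k}/\tau}}, & r=0
\end{cases}
$$

Differentiating the loss with respect to $\bar{s}^k$ gives $C_{+}=1+\sum_{i\in\mathcal{O}_+,\,i\neq k}e^{-\bar{s}^{i}/\tau}$ and $C_{-}=1+\sum_{j\in\mathcal{O}_-,\,j\neq k}e^{\bar{s}^{j}/\tau}$, which collect the contributions of the other samples with the same label. In both cases the weight is bounded above by $1/\tau$: it is largest for positive samples with low $\bar{s}^k$ and for negative samples with high $\bar{s}^k$.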
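The behavior of this weight can be checked numerically. The following is a minimal sketch, assuming binary rewards and taking $C_{+}$ and $C_{-}$ as one plus the sum of the corresponding exponential terms over the other samples with the same label (what differentiating the loss yields); `real_weight` is a hypothetical helper.

```python
import math


def real_weight(logits: list[float], rewards: list[int], k: int, tau: float = 0.5) -> float:
    """Magnitude of the REAL gradient weight for sample k within one group."""
    r_k = rewards[k]
    # Positives use e^{+s/tau} in the denominator, negatives use e^{-s/tau}.
    sign = 1.0 if r_k == 1 else -1.0
    # C = 1 + sum over the OTHER same-label samples of e^{-sign * s / tau}.
    c = 1.0 + sum(
        math.exp(-sign * s / tau)
        for i, (s, r) in enumerate(zip(logits, rewards))
        if r == r_k and i != k
    )
    return (1.0 / tau) / (1.0 + c * math.exp(sign * logits[k] / tau))


if __name__ == "__main__":
    for s in (-2.0, 0.0, 2.0):
        pos = real_weight([s], [1], k=0)
        neg = real_weight([s], [0], k=0)
        print(f"s={s:+.1f}  positive={pos:.4f}  negative={neg:.4f}")
```

In this sketch the weight of a positive sample grows as its logit falls, and the weight of a negative sample saturates at $1/\tau$ instead of growing exponentially, which is the contrast with the GRPO weight.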

## Parameter Settings

| Parameter | Type | Default | Description                                                        |
|-----------|------|---------|--------------------------------------------------------------------|
| `--loss_type` | `str` | -       | Set to `real`                                                      |
| `--real_tau` | `float` | `0.5`   | Temperature parameter controlling decision boundary sharpness |

Training Script Reference

[swift](https://github.com/modelscope/ms-swift/tree/main/examples/train/grpo/internal/real.sh)

## Important Notes

When configuring training parameters, ensure that:
- `per_device_train_batch_size` is divisible by `num_generations`

This guarantees that each training batch contains complete groups, which is required for correct in-group classification.
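A quick pre-flight check of this constraint might look like the sketch below. It is an illustrative helper written for this note, not part of ms-swift; only the two parameter names come from the configuration above.

```python
def groups_per_batch(per_device_train_batch_size: int, num_generations: int) -> int:
    """Return how many complete groups fit in one per-device batch.

    Raises ValueError when the batch would split a group.
    """
    if num_generations <= 0:
        raise ValueError("num_generations must be positive")
    if per_device_train_batch_size % num_generations != 0:
        raise ValueError(
            f"per_device_train_batch_size={per_device_train_batch_size} is not "
            f"divisible by num_generations={num_generations}"
        )
    return per_device_train_batch_size // num_generations


if __name__ == "__main__":
    print(groups_per_batch(16, 8))  # two complete groups per device batch
```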