DAPO

Decoupled Clip and Dynamic sAmpling Policy Optimization (DAPO) introduces several tricks based on GRPO, including:

Clip Higher

PPO and GRPO use symmetric clipping ranges (e.g., ±0.2) to limit the magnitude of policy updates. While this ensures stability, it also restricts the model’s exploratory capabilities. Specifically, when certain tokens have extremely low probabilities under the old policy, even if the current gradient indicates they should be reinforced (A > 0), the maximum increase is strictly limited.

DAPO employs an asymmetric clipping range, raising the upper clipping limit to encourage exploration:

  • The upper bound (encouragement side) is relaxed to 0.28.

  • The lower bound (suppression side) remains unchanged at 0.2.

In GRPO, the default symmetric clipping range is set using epsilon.

Parameters:

  • epsilon_high sets the upper clipping range, while epsilon serves as the lower clipping range.

Dynamic Sampling

GRPO samples multiple responses per question to compute inter-group advantages:

\[ \hat{A}_{i,t} = \frac{R_i - \text{mean}(\{R_j\}_{j=1}^G)}{\text{std}(\{R_j\}_{j=1}^G)} \]

However, when all generated outputs {o_i} receive the same reward, the inter-group advantage becomes zero, leading to vanishing gradients and reduced training efficiency.

DAPO addresses this issue with a dynamic sampling strategy:

  • Skips data with zero inter-group reward standard deviation during sampling.

  • Continues generating samples until the batch is filled.

Parameters:

  • dynamic_sample true enables dynamic sampling.

  • max_resample_times sets the maximum number of resampling attempts.

Token level Loss

GRPO normalizes losses at the sentence level, which introduces bias based on response length.

DAPO uses token-level normalization to avoid this bias in loss calculation.

Parameters:

  • loss_type bnpo enables token-level normalization.

Overlong Filtering

DAPO argues that forcibly truncated responses contain high reward noise, making it difficult for the model to distinguish between quality issues and length issues. To address this, DAPO filters out truncated data during training, excluding it from loss computation.

Parameters:

  • overlong_filter enables filtering of overly long samples.

Soft Overlong Punishment

Language models often struggle with controlling output length:

  • Overly long outputs may be truncated, leading to incorrect judgments of valid content.

  • Unconstrained length generation affects practicality and computational efficiency.

DAPO designs a three-stage length penalty function:

\[\begin{split} R_{\text{length}}(L) = \begin{cases} 0, & L \leq L_{\text{max}} - L_{\text{cache}} \\[10pt] \dfrac{(L_{\text{max}} - L_{\text{cache}}) - L}{L_{\text{cache}}}, & L_{\text{max}} - L_{\text{cache}} < L \leq L_{\text{max}} \\[10pt] -1, & L > L_{\text{max}} \end{cases} \end{split}\]

When the length falls within the interval \((L_{\text{max}} - L_{\text{cache}} < L \leq L_{\text{max}})\), a linearly increasing penalty is applied. For lengths \((L > L_{\text{max}})\), the maximum penalty (-1) is imposed.

Parameters:

  • reward_funcs soft_overlong enables this reward function.

  • soft_max_length sets L_max, which defaults to the model’s maximum output length (max_completion_length).

  • soft_cache_length sets L_cache.

Parameter Settings

In summary, the following parameters can be set based on GRPOTrainer to implement DAPO training.

Parameter Type Value
--loss_type str bnpo
--epsilon_high float 0.28
--dynamic_sample bool true
--max_resample_times int 3
--overlong_filter bool true
--reward_funcs str soft_overlong
--soft_cache_length int 4096