DAPO
Decoupled Clip and Dynamic sAmpling Policy Optimization (DAPO) introduces several tricks based on GRPO, including:
Clip Higher
PPO and GRPO use symmetric clipping ranges (e.g., ±0.2) to limit the magnitude of policy updates. While this improves stability, it also restricts the model’s ability to explore. Specifically, when a token has a very low probability under the old policy, its probability can grow by at most a factor of 1 + ε in a single update, even if its advantage is positive (A > 0). With ε = 0.2, for example, a token at probability 0.01 can rise to at most 0.012.
DAPO employs an asymmetric clipping range, raising the upper clipping limit to encourage exploration:
The upper bound (encouragement side) is relaxed to 0.28.
The lower bound (suppression side) remains unchanged at 0.2.
In GRPO, the default symmetric clipping range is set using epsilon.
Parameters:
`epsilon_high` sets the upper clipping range, while `epsilon` serves as the lower clipping range.
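The effect of the asymmetric range can be illustrated with a minimal per-token sketch. The function name `clipped_objective` is hypothetical and this is not the trainer's actual implementation; only the parameter names `epsilon` and `epsilon_high` come from the settings above.

```python
def clipped_objective(ratio, advantage, epsilon=0.2, epsilon_high=0.28):
    """Per-token clipped surrogate with separate lower and upper clip ranges.

    ratio     -- new-policy probability divided by old-policy probability
    advantage -- advantage estimate for the token
    """
    # Clip the ratio to [1 - epsilon, 1 + epsilon_high].
    clipped_ratio = min(max(ratio, 1.0 - epsilon), 1.0 + epsilon_high)
    # PPO-style pessimistic bound: take the smaller of the two surrogates.
    return min(ratio * advantage, clipped_ratio * advantage)


# A token whose probability grew by 25% and has a positive advantage.
symmetric = clipped_objective(1.25, 1.0, epsilon=0.2, epsilon_high=0.2)
asymmetric = clipped_objective(1.25, 1.0, epsilon=0.2, epsilon_high=0.28)
```

With the symmetric range the ratio is clipped at 1.2, so the token receives no further encouragement. With `epsilon_high=0.28` the same ratio is still inside the range and keeps its full contribution. The suppression side (negative advantage) behaves identically in both cases.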
Dynamic Sampling
GRPO samples multiple responses per question and computes each response's advantage relative to the other responses in its group:
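In the usual GRPO formulation, the advantage of a response is its reward normalized by the group's mean and standard deviation. The notation here is an assumption for illustration: \(R_i\) is the reward of response \(o_i\) and \(G\) is the number of responses sampled for the question.

\[
\hat{A}_{i} = \frac{R_i - \operatorname{mean}\left(\{R_j\}_{j=1}^{G}\right)}{\operatorname{std}\left(\{R_j\}_{j=1}^{G}\right)}
\]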
However, when all generated outputs {o_i} receive the same reward, every advantage in the group is zero, so the group produces no gradient and the compute spent on it is wasted.
DAPO addresses this issue with a dynamic sampling strategy:
Skips questions whose sampled responses have zero reward standard deviation within the group.
Continues generating samples until the batch is filled.
Parameters:
`dynamic_sample true` enables dynamic sampling.
`max_resample_times` sets the maximum number of resampling attempts.
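The sampling loop can be sketched as follows. This is an illustration under assumptions, not the trainer's code: `fill_batch`, `has_signal`, and the `sample_rewards` callback are hypothetical names, and treating `max_resample_times` as the number of extra sampling rounds after the first pass is one plausible reading of that parameter.

```python
import statistics


def has_signal(rewards):
    """A group is useful only if its rewards are not all identical."""
    return statistics.pstdev(rewards) > 0.0


def fill_batch(prompt_iter, sample_rewards, batch_size, max_resample_times=3):
    """Collect (prompt, rewards) groups with non-zero reward std.

    prompt_iter    -- iterator over prompts
    sample_rewards -- hypothetical callback: prompt -> list of group rewards
    """
    batch = []
    # One initial pass plus up to max_resample_times resampling rounds.
    for _ in range(1 + max_resample_times):
        for _ in range(batch_size - len(batch)):
            prompt = next(prompt_iter, None)
            if prompt is None:  # ran out of data
                return batch
            rewards = sample_rewards(prompt)
            if has_signal(rewards):
                batch.append((prompt, rewards))
        if len(batch) == batch_size:
            break
    return batch


# Toy rewards: groups "a" and "c" have identical rewards and are skipped.
toy_rewards = {
    "a": [1, 1, 1],
    "b": [0, 1, 0],
    "c": [0, 0, 0],
    "d": [1, 0, 1],
    "e": [1, 1, 0],
}
batch = fill_batch(iter("abcde"), toy_rewards.get, batch_size=2)
kept_prompts = [prompt for prompt, _ in batch]
```

If the resampling budget runs out before the batch is full, this sketch returns a partial batch. How a real trainer handles that case is outside what this section specifies.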
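Token-Level Loss
GRPO first averages the loss over the tokens of each response and then averages across responses. Every response therefore carries the same weight regardless of its length, so each token in a long response contributes less than a token in a short one.
DAPO instead normalizes at the token level, averaging over all tokens in the batch so that every token contributes equally to the loss.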
Parameters:
`loss_type bnpo` enables token-level normalization.
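The difference between the two normalizations is easiest to see on a toy batch. The function names below are hypothetical and the per-token losses are made-up numbers chosen for easy arithmetic; the sketch only contrasts the two averaging schemes described above.

```python
def sequence_level_loss(per_token_losses):
    """Average within each response, then across responses."""
    per_response = [sum(seq) / len(seq) for seq in per_token_losses]
    return sum(per_response) / len(per_response)


def token_level_loss(per_token_losses):
    """Average over every token in the batch."""
    total = sum(sum(seq) for seq in per_token_losses)
    count = sum(len(seq) for seq in per_token_losses)
    return total / count


# One short response (2 tokens, loss 1.0 each) and one long one (8 tokens, loss 0.0 each).
toy_losses = [[1.0, 1.0], [0.0] * 8]
seq_level = sequence_level_loss(toy_losses)  # (1.0 + 0.0) / 2
tok_level = token_level_loss(toy_losses)     # 2.0 / 10
```

Under sequence-level averaging the two-token response accounts for half of the batch loss. Under token-level averaging its weight matches its share of the tokens.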
Overlong Filtering
DAPO argues that forcibly truncated responses contain high reward noise, making it difficult for the model to distinguish between quality issues and length issues. To address this, DAPO filters out truncated data during training, excluding it from loss computation.
Parameters:
`overlong_filter` enables filtering of overlong (truncated) samples.
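A minimal sketch of the idea, assuming each response comes with a flag saying whether it was cut off at the length limit. The function name `filtered_token_loss` and the `truncated` flags are hypothetical, and the token-level averaging is chosen only to keep the example short.

```python
def filtered_token_loss(per_token_losses, truncated):
    """Token-level mean loss over responses that were not truncated.

    per_token_losses -- list of per-token loss lists, one per response
    truncated        -- list of bools, True if the response hit the length limit
    """
    kept = [seq for seq, cut in zip(per_token_losses, truncated) if not cut]
    count = sum(len(seq) for seq in kept)
    if count == 0:
        # Every response was truncated: nothing contributes to the loss.
        return 0.0
    return sum(sum(seq) for seq in kept) / count


# The second response was truncated, so its (noisy) loss is ignored.
loss = filtered_token_loss([[1.0, 1.0], [4.0, 4.0, 4.0]], [False, True])
```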
Soft Overlong Punishment
Language models often struggle with controlling output length:
Overly long outputs may be truncated, leading to incorrect judgments of valid content.
Unconstrained length generation affects practicality and computational efficiency.
DAPO designs a three-stage length penalty function:
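Written out for a response of length \(L\), the penalty takes the following piecewise form. The name \(R_{\text{length}}\) is a label chosen here for illustration.

\[
R_{\text{length}}(L) =
\begin{cases}
0, & L \leq L_{\text{max}} - L_{\text{cache}} \\
\dfrac{(L_{\text{max}} - L_{\text{cache}}) - L}{L_{\text{cache}}}, & L_{\text{max}} - L_{\text{cache}} < L \leq L_{\text{max}} \\
-1, & L > L_{\text{max}}
\end{cases}
\]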
When the length satisfies \(L \leq L_{\text{max}} - L_{\text{cache}}\), no penalty is applied. When it falls within the interval \(L_{\text{max}} - L_{\text{cache}} < L \leq L_{\text{max}}\), a linearly increasing penalty is applied. For lengths \(L > L_{\text{max}}\), the maximum penalty (-1) is imposed.
Parameters:
`reward_funcs soft_overlong` enables this reward function.
`soft_max_length` sets L_max, which defaults to the model’s maximum output length (`max_completion_length`).
`soft_cache_length` sets L_cache.
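The three stages can be sketched in a few lines. The function name `soft_overlong_penalty` is hypothetical and this is not the library's reward function; the argument names simply mirror the parameters above, and the example lengths are made up.

```python
def soft_overlong_penalty(length, soft_max_length, soft_cache_length):
    """Length penalty in [-1, 0] for a response of the given token length."""
    threshold = soft_max_length - soft_cache_length
    if length <= threshold:
        # Stage 1: short enough, no penalty.
        return 0.0
    if length <= soft_max_length:
        # Stage 2: penalty grows linearly from 0 down to -1 across the cache window.
        return (threshold - length) / soft_cache_length
    # Stage 3: over the hard limit, maximum penalty.
    return -1.0


# L_max = 8192 and L_cache = 4096, so the penalty starts above 4096 tokens.
penalties = [soft_overlong_penalty(n, 8192, 4096) for n in (4096, 6144, 8192, 9000)]
```

A response halfway through the cache window (6144 tokens here) receives half of the maximum penalty.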
Parameter Settings
In summary, DAPO training can be implemented on top of GRPOTrainer by setting the following parameters.
| Parameter | Type | Value |
|---|---|---|
| `--loss_type` | str | `bnpo` |
| `--epsilon_high` | float | `0.28` |
| `--dynamic_sample` | bool | `true` |
| `--max_resample_times` | int | `3` |
| `--overlong_filter` | bool | `true` |
| `--reward_funcs` | str | `soft_overlong` |
| `--soft_cache_length` | int | `4096` |