# Training-Inference Mismatch

**TL;DR**: GRPO uses vLLM to accelerate sampling, but doing so introduces a training-inference mismatch that may affect training stability. This document explains the background, causes, and solutions to this problem.

## Background

### Basic Assumptions of GRPO

The training objective of GRPO (Group Relative Policy Optimization) can be expressed as:

$$
\mathcal{L}_{\text{GRPO}} = - \mathbb{E}_{y \sim \pi_\theta} \left[ \min \left( r_t(\theta) \hat{A}_t, \text{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon) \hat{A}_t \right) \right]
$$

Where:

- $r_t(\theta) = \frac{\pi_\theta(y_t \mid x, y_{<t})}{\pi_{\text{old}}(y_t \mid x, y_{<t})}$

[...]

1. [...] $> \tau$
2. **AND** $\hat{A}_i < 0$

Where:

- $\pi_{\text{old}}$ preferentially uses `rollout_per_token_logps` (the log-probabilities from the rollout/behavior policy); if these are unavailable, it falls back to `old_per_token_logps`
- $\tau$ is the user-set threshold (`--off_policy_sequence_mask_delta`, default `None`, i.e. disabled)

## References

1. https://yingru.notion.site/When-Speed-Kills-Stability-Demystifying-RL-Collapse-from-the-Training-Inference-Mismatch-271211a558b7808d8b12d403fd15edda
2. https://fengyao.notion.site/off-policy-rl
3. https://github.com/volcengine/verl/blob/main/verl/trainer/ppo/rollout_corr_helper.py
4. [DeepSeek-V3.2: Pushing the Frontier of Open Large Language Models](https://arxiv.org/abs/2512.02556)
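
As a rough illustration of the clipped objective in the Background section, and of the rule that $\pi_{\text{old}}$ prefers `rollout_per_token_logps` over `old_per_token_logps`, the following is a minimal pure-Python sketch. The function names `select_old_logps`, `grpo_token_loss`, and `grpo_loss`, the default `epsilon=0.2`, and the mean over tokens are assumptions made for illustration; this is not the library's actual implementation.

```python
import math


def select_old_logps(rollout_per_token_logps, old_per_token_logps):
    """Hypothetical helper: prefer rollout (behavior-policy) logprobs,
    fall back to the training engine's old logprobs when unavailable."""
    if rollout_per_token_logps is not None:
        return rollout_per_token_logps
    return old_per_token_logps


def grpo_token_loss(logp_new, logp_old, advantage, epsilon=0.2):
    """Negated clipped surrogate for a single token.

    epsilon=0.2 is an assumed default, not taken from the source.
    """
    ratio = math.exp(logp_new - logp_old)  # r_t(theta)
    clipped = min(max(ratio, 1.0 - epsilon), 1.0 + epsilon)
    return -min(ratio * advantage, clipped * advantage)


def grpo_loss(logps_new, logps_old, advantages, epsilon=0.2):
    """Mean of the per-token losses (token-mean is an assumption)."""
    losses = [
        grpo_token_loss(n, o, a, epsilon)
        for n, o, a in zip(logps_new, logps_old, advantages)
    ]
    return sum(losses) / len(losses)


if __name__ == "__main__":
    # Same current-policy logprobs, but two different sources for pi_old.
    new = [-1.00, -0.50]
    old_train = [-1.00, -0.50]    # recomputed by the training engine
    old_rollout = [-1.30, -0.45]  # reported by the inference engine
    adv = [1.0, 1.0]
    print(grpo_loss(new, select_old_logps(None, old_train), adv))
    print(grpo_loss(new, select_old_logps(old_rollout, old_train), adv))
```

When the two sources of $\pi_{\text{old}}$ agree, the ratio is exactly 1 and the loss reduces to the negated advantage; any gap between the rollout and training logprobs shifts the ratio even though the policy parameters are unchanged, which is the mismatch the document describes.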