# Router Replay (R2/R3)

**TL;DR**: In RL training of MoE models, routing inconsistency between the training engine and the inference engine can significantly amplify training-inference mismatch, and even cause training collapse. Router Replay eliminates this inconsistency by replaying fixed routing masks during the training forward pass. Depending on the replay source, there are two strategies: R2 (Vanilla Routing Replay) and R3 (Rollout Routing Replay).

## Background

### Three Policies in MoE RL

In GRPO training of MoE models, there are three distinct policy stages that share the same model weights but may differ in routing behavior:

| Policy | Notation | Routing Result | Description |
|--------|----------|---------------|-------------|
| **Training Policy** | $\pi_\theta$ | $e^{\pi}_t$ | The model during gradient updates |
| **Old Policy** | $\pi_{\theta_{\text{old}}}$ | $e^{\pi}_{\text{old},t}$ | The model state before batch updates |
| **Rollout Policy** | $\mu_{\theta_{\text{old}}}$ | $e^{\mu}_{\text{old},t}$ | The sampling policy in the inference engine (e.g., vLLM), with the same weights as old policy, but different routing due to kernel implementation differences, precision, etc. |

Here, $\pi_{\theta_{\text{old}}}$ and $\mu_{\theta_{\text{old}}}$ have identical weights at sampling time, but due to implementation differences between the inference and training engines (e.g., operator implementations), routing results may differ even for the same input.

### Decomposition of Training-Inference Mismatch

According to the [paper](https://arxiv.org/abs/2507.18071), the token-level importance sampling ratio can be decomposed into two factors:

$$ \frac{\pi_\theta(y_t|x, y_{