# FIPO: Future-KL Influenced Policy Optimization

Author: [li2zhi](https://github.com/li2zhi)

[FIPO](https://arxiv.org/abs/2603.19835) is a value-free RL method for eliciting longer and deeper reasoning. It keeps the GRPO/DAPO training scaffold but changes how token-level policy updates are weighted: instead of applying one sequence-level advantage uniformly to every token, FIPO uses a discounted Future-KL signal to estimate whether the future trajectory after each token is being reinforced or suppressed.

## Core Idea

In GRPO/DAPO, tokens in the same response usually share the same sequence-level advantage:

$$
\hat{A}_{i,t} = \hat{A}_{i}
$$

This is simple and stable, but the credit assignment is coarse. FIPO starts from the signed log-probability shift between the current policy and the old policy:

$$
\Delta \log p_t = \log \pi_\theta(y_t \mid x, y_{