FIPO: Future-KL Influenced Policy Optimization

Author: li2zhi

FIPO is a value-free RL method for eliciting longer and deeper reasoning. It keeps the GRPO/DAPO training scaffold, but changes how token-level policy updates are weighted: instead of applying one sequence-level advantage uniformly to every token, FIPO uses a discounted Future-KL signal to estimate whether the future trajectory after each token is being reinforced or suppressed.

Core Idea

In GRPO/DAPO, tokens in the same response usually share the same sequence-level advantage:

\[ \hat{A}_{i,t} = \hat{A}_{i} \]

This is simple and stable, but the credit assignment is coarse. FIPO starts from the signed log-probability shift between the current policy and the old policy:

\[ \Delta \log p_t = \log \pi_\theta(y_t \mid x, y_{<t}) - \log \pi_{\mathrm{old}}(y_t \mid x, y_{<t}) \]

A positive value means the token probability is being increased by the current update, while a negative value means it is being suppressed. FIPO then accumulates this signal from the current token to the end of the response:

\[ \mathrm{FutureKL}_t = \sum_{k=t}^{T}\gamma^{k-t} M_k \Delta \log p_k \]

where \(M_k\) is the completion mask and \(\gamma = 2^{-1 / \text{decay\_rate}}\). A larger decay_rate gives farther future tokens more influence; a smaller value makes the signal more local. FIPO maps the Future-KL value into a bounded influence weight:

\[ f_t = \mathrm{clip}(\exp(\mathrm{FutureKL}_t), 1-\epsilon_f, 1+\epsilon_f) \]

The original advantage is then replaced by a future-aware advantage:

\[ \tilde{A}_{i,t} = \hat{A}_{i} \cdot f_{i,t} \]

Parameters

Parameter Type Default Description
--loss_type str grpo Set tofipo to enable FIPO loss
--delta float None When enabled, it is used for both Future-KL high-IS-ratio token filtering and the main-loss dual-clip upper bound, and should be greater than 1 + epsilon_high. Set it to 10.0 to match the official 32B script
--fipo_decay_rate float 32.0 Half-life parameter for Future-KL; the actual discount is2 ** (-1 / fipo_decay_rate)
--fipo_clip_range float 0.2 Influence weight clipping range;0.2 clips to [0.8, 1.2]
--fipo_clip_high_only bool true Iftrue, clips the weight to [1.0, 1.0 + fipo_clip_range]
--fipo_safety_threshold float 4.0 Caps the FIPO weight to [0.8, 1.0] for negative-advantage tokens whose IS ratio exceeds this threshold

Training Example

swift