FIPO: Future-KL Influenced Policy Optimization
Author: li2zhi
FIPO is a value-free RL method for eliciting longer and deeper reasoning. It keeps the GRPO/DAPO training scaffold, but changes how token-level policy updates are weighted: instead of applying one sequence-level advantage uniformly to every token, FIPO uses a discounted Future-KL signal to estimate whether the future trajectory after each token is being reinforced or suppressed.
Core Idea
In GRPO/DAPO, tokens in the same response usually share the same sequence-level advantage:
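For reference, the usual group-normalized form can be written as follows. The notation is assumed here: \(R_i\) is the reward of the \(i\)-th response in a group of \(G\) samples, and \(t\) indexes tokens.

\[
\hat{A}_{i,t} = \hat{A}_i = \frac{R_i - \operatorname{mean}\left(\{R_j\}_{j=1}^{G}\right)}{\operatorname{std}\left(\{R_j\}_{j=1}^{G}\right)} \quad \text{for every token } t
\]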
This is simple and stable, but the credit assignment is coarse. FIPO starts from the signed log-probability shift between the current policy and the old policy:
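Written out, with notation assumed here (\(q\) is the prompt, \(o_t\) the token at position \(t\), and \(\pi_{\theta_{\text{old}}}\) the policy that generated the rollout):

\[
\Delta \log p_t = \log \pi_\theta(o_t \mid q, o_{<t}) - \log \pi_{\theta_{\text{old}}}(o_t \mid q, o_{<t})
\]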
A positive value means the token probability is being increased by the current update, while a negative value means it is being suppressed. FIPO then accumulates this signal from the current token to the end of the response:
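A form consistent with the mask and discount factor this section defines is shown below. Here \(\Delta \log p_k\) denotes the signed log-probability shift at position \(k\), and \(T\) is assumed to be the response length.

\[
\text{FutureKL}_t = \sum_{k=t}^{T} M_k \, \gamma^{k-t} \, \Delta \log p_k
\]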
where \(M_k\) is the completion mask and \(\gamma = 2^{-1 / \text{decay\_rate}}\). A larger decay_rate gives farther future tokens more influence; a smaller value makes the signal more local. FIPO maps the Future-KL value into a bounded influence weight:
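One plausible form is sketched below. It assumes the weight is the clipped exponential of the Future-KL value, with \(\epsilon\) standing for `fipo_clip_range`; the exponential mapping and the symbol \(f_t\) are assumptions of this sketch rather than something stated in this section.

\[
f_t = \operatorname{clip}\left(\exp(\text{FutureKL}_t),\ 1 - \epsilon,\ 1 + \epsilon\right)
\]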
The original advantage is then replaced by a future-aware advantage:
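In the simplest reading, the influence weight (written \(f_t\) here, an assumed symbol) rescales the shared advantage token by token:

\[
\tilde{A}_t = f_t \cdot \hat{A}_t
\]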
Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| `--loss_type` | str | `grpo` | Set to `fipo` to enable the FIPO loss |
| `--delta` | float | None | When enabled, it is used for both Future-KL high-IS-ratio token filtering and the main-loss dual-clip upper bound, and should be greater than `1 + epsilon_high`. Set it to `10.0` to match the official 32B script |
| `--fipo_decay_rate` | float | 32.0 | Half-life parameter for Future-KL; the actual discount is `2 ** (-1 / fipo_decay_rate)` |
| `--fipo_clip_range` | float | 0.2 | Influence weight clipping range; `0.2` clips to `[0.8, 1.2]` |
| `--fipo_clip_high_only` | bool | true | If `true`, clips the weight to `[1.0, 1.0 + fipo_clip_range]` |
| `--fipo_safety_threshold` | float | 4.0 | Caps the FIPO weight to `[0.8, 1.0]` for negative-advantage tokens whose IS ratio exceeds this threshold |
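
To make the interaction between `fipo_decay_rate`, `fipo_clip_range`, and `fipo_clip_high_only` concrete, here is a minimal pure-Python sketch for a single response. It is not the framework's implementation: the function names `future_kl` and `influence_weight` are hypothetical, the exponential mapping from Future-KL to weight is an assumption, and the `delta` filtering and `fipo_safety_threshold` cap are deliberately left out.

```python
import math


def future_kl(logp_new, logp_old, mask, decay_rate=32.0):
    """Discounted, masked sum of log-prob shifts from each token to the end.

    Uses the backward recurrence F_t = M_t * d_t + gamma * F_{t+1},
    which equals sum_{k >= t} gamma**(k - t) * M_k * d_k.
    """
    gamma = 2.0 ** (-1.0 / decay_rate)  # discount derived from the half-life
    out = [0.0] * len(logp_new)
    running = 0.0
    for t in reversed(range(len(logp_new))):
        delta = logp_new[t] - logp_old[t]  # signed log-probability shift
        running = mask[t] * delta + gamma * running
        out[t] = running
    return out


def influence_weight(fkl, clip_range=0.2, clip_high_only=True):
    """Map Future-KL values to bounded weights.

    Assumption of this sketch: the weight is exp(Future-KL), then clipped.
    """
    low = 1.0 if clip_high_only else 1.0 - clip_range
    high = 1.0 + clip_range
    return [min(max(math.exp(v), low), high) for v in fkl]


# Toy usage: three completion tokens sharing one sequence-level advantage.
logp_new = [-1.0, -1.0, -1.0]
logp_old = [-1.5, -1.0, -2.0]
mask = [1, 1, 1]
advantage = 0.7  # made-up value for illustration

fkl = future_kl(logp_new, logp_old, mask, decay_rate=1.0)
weights = influence_weight(fkl)
future_aware_adv = [w * advantage for w in weights]
```

Because the discount is `2 ** (-1 / fipo_decay_rate)`, a token exactly `fipo_decay_rate` positions ahead contributes with half the weight of the current token; with the default of 32, that is 32 tokens into the future. With `fipo_clip_high_only` left at `true`, weights in this sketch never drop below 1.0, so tokens whose future is being suppressed keep their original advantage instead of being down-weighted.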