Advanced Research
- Beyond the 80/20 Rule: High-Entropy Minority Tokens Drive Effective Reinforcement Learning for LLM Reasoning
- Clipped Importance Sampling Policy Optimization (CISPO)
- DAPO: An Open-Source LLM Reinforcement Learning System at Scale
- DeepEyes: Incentivizing "Thinking with Images" via Reinforcement Learning
- Group Sequence Policy Optimization
- On-Policy RL Meets Off-Policy Experts: Harmonizing SFT and RL via Dynamic Weighting (CHORD)
- REINFORCE Leave-One-Out (RLOO)
- REINFORCE++: An Efficient RLHF Algorithm with Robustness to Both Prompt and Reward Models
- Soft Adaptive Policy Optimization (SAPO)
- Training-Inference Mismatch
- TreePO: Bridging the Gap of Policy Optimization and Efficacy and Inference Efficiency with Heuristic Tree-based Modeling