Clipped Importance Sampling Policy Optimization (CISPO)

Clipped Importance Sampling Policy Optimization (CISPO) is a reinforcement learning algorithm proposed in the MiniMax-M1 paper. Compared to GRPO (Group Relative Policy Optimization), CISPO clips the importance sampling weights themselves.

Algorithm Overview

For clarity, we explain CISPO by contrasting it with GRPO.

GRPO limits the magnitude of policy updates by clipping the policy ratio. Its loss function is:

\[ \mathcal{L}_{\text{GRPO}}(\theta) = -\mathbb{E}\left[\min\left(r_t(\theta) \cdot \hat{A}_t, \text{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon) \cdot \hat{A}_t\right)\right] \]

where \(r_t(\theta) = \frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_{\text{old}}}(a_t|s_t)}\) is the importance sampling ratio.

When handling long reasoning chains, this clipping approach can lead to the following issues:

Gradient Suppression of Critical Tokens: In complex reasoning tasks, certain critical low-probability tokens (such as However, Recheck, Wait, Aha) are crucial for triggering deep thinking and reasoning error correction. These tokens have low probability in the old policy \(\pi_{\theta_{\text{old}}}\). When the new policy attempts to increase their probability, it results in a large policy ratio \(r_t(\theta)\), and GRPO’s clipping mechanism will discard these tokens.

CISPO’s Solution

The core idea of CISPO is to clip the importance sampling weights while preserving gradient updates. Specifically, CISPO’s loss function is:

\[ \mathcal{L}_{\text{CISPO}}(\theta) = -\mathbb{E}\left[\text{detach}\left(\min(r_t(\theta), \epsilon_{\text{high}})\right) \cdot \hat{A}_t \cdot \log \pi_\theta(a_t|s_t)\right] \]

where \(r_t(\theta) = \frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_{\text{old}}}(a_t|s_t)}\) is the importance sampling ratio.

Key Mechanisms:

  • Clip the importance sampling weights: \(\min(r_t(\theta), \epsilon_{\text{high}})\)

  • Detach operation: The clipped weights do not participate in gradient computation and serve as constant coefficients

  • Gradients come from the \(\log \pi_\theta(a_t|s_t)\) term, ensuring all tokens contribute gradients

Implementation Details

The pseudo-code implementation of CISPO is as follows:

log_ratio = per_token_logps - old_per_token_logps
importance_weights = torch.exp(log_ratio)  # r_t(θ) = π_θ / π_θ_old

clamped_ratios = torch.clamp(importance_weights, max=epsilon_high).detach()

per_token_loss = -clamped_ratios * advantages.unsqueeze(1) * per_token_logps

Parameter Configuration

CISPO training can be enabled based on GRPOTrainer by setting the following parameters:

--loss_type cispo
--epsilon_high 5.0

Compared to other algorithms, cispo generally uses a larger value for epsilon_high. The minimax paper does not provide specific parameter settings; the value used here refers to the experimental setup in the paper ScaleRL.

For other training parameters, refer to the GRPO parameter documentation.