Clipped Importance Sampling Policy Optimization (CISPO)

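Clipped Importance Sampling Policy Optimization (CISPO) is a reinforcement learning algorithm proposed in the MiniMax-M1 paper. Whereas GRPO (Group Relative Policy Optimization) clips the policy-ratio term of the objective, which can zero out the gradient of clipped tokens, CISPO clips the importance sampling weights themselves and keeps a gradient contribution from every token.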

Algorithm Overview

For clarity, we explain CISPO by contrasting it with GRPO.

GRPO limits the magnitude of policy updates by clipping the policy ratio. Its loss function is:

\[ \mathcal{L}_{\text{GRPO}}(\theta) = -\mathbb{E}\left[\min\left(r_t(\theta) \cdot \hat{A}_t, \text{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon) \cdot \hat{A}_t\right)\right] \]

where \(r_t(\theta) = \frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_{\text{old}}}(a_t|s_t)}\) is the importance sampling ratio.

When handling long reasoning chains, this clipping approach can lead to the following issue:

Gradient Suppression of Critical Tokens: In complex reasoning tasks, certain low-probability tokens (such as "However", "Recheck", "Wait", and "Aha") are crucial for triggering deeper thinking and for correcting reasoning errors. Because these tokens have low probability under the old policy \(\pi_{\theta_{\text{old}}}\), raising their probability in the new policy produces a large policy ratio \(r_t(\theta)\). Once \(r_t(\theta)\) exceeds \(1+\epsilon\) with a positive advantage, the \(\min\) selects the clipped term, which is constant in \(\theta\), so GRPO’s clipping mechanism effectively drops these tokens from the gradient update.
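The effect can be checked numerically on a single token. The following is a minimal sketch using only the Python standard library. The function names, the probabilities, and the value \(\epsilon = 0.2\) are illustrative assumptions, not settings taken from the paper or from GRPOTrainer. It estimates the derivative of the per-token GRPO loss with respect to the new log-probability by central finite differences.

```python
import math

def grpo_token_loss(logp, old_logp, advantage, eps=0.2):
    """Per-token clipped surrogate loss from the GRPO formula (eps is an assumed value)."""
    ratio = math.exp(logp - old_logp)
    clipped = max(1.0 - eps, min(ratio, 1.0 + eps))
    return -min(ratio * advantage, clipped * advantage)

def grad_wrt_logp(loss_fn, logp, h=1e-5, **kwargs):
    """Central finite-difference estimate of d(loss)/d(logp)."""
    return (loss_fn(logp + h, **kwargs) - loss_fn(logp - h, **kwargs)) / (2 * h)

old_logp = math.log(0.01)  # the token was rare under the old policy
advantage = 1.0            # and the sampled response was rewarded

# Ratio r = 1.1 lies inside [1 - eps, 1 + eps]: the unclipped term is active.
g_inside = grad_wrt_logp(grpo_token_loss, math.log(0.011),
                         old_logp=old_logp, advantage=advantage)

# Ratio r = 3.0 lies above 1 + eps: the clipped (constant) term is active.
g_outside = grad_wrt_logp(grpo_token_loss, math.log(0.03),
                          old_logp=old_logp, advantage=advantage)

print(f"gradient at r=1.1: {g_inside:.4f}")
print(f"gradient at r=3.0: {g_outside:.4f}")
```

In this sketch the token whose ratio stays inside the clip range receives a gradient of about -1.1, while the token whose ratio has grown to 3.0 receives a gradient of exactly zero, which is the suppression described above.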

CISPO’s Solution

The core idea of CISPO is to clip the importance sampling weights while preserving gradient updates. Specifically, CISPO’s loss function is:

\[ \mathcal{L}_{\text{CISPO}}(\theta) = -\mathbb{E}\left[\text{detach}\left(\min(r_t(\theta), \epsilon_{\text{high}})\right) \cdot \hat{A}_t \cdot \log \pi_\theta(a_t|s_t)\right] \]

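where \(r_t(\theta) = \frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_{\text{old}}}(a_t|s_t)}\) is the same importance sampling ratio as in GRPO, and \(\text{detach}(\cdot)\) is the stop-gradient operation: its argument is treated as a constant during backpropagation.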

Key Mechanisms:

Clip the importance sampling weights: \(\min(r_t(\theta), \epsilon_{\text{high}})\)
Detach operation: The clipped weights do not participate in gradient computation and serve as constant coefficients
Gradients come from the \(\log \pi_\theta(a_t|s_t)\) term, ensuring all tokens contribute gradients
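To make the role of the detach operation concrete, the gradient of the loss above can be written out. This is a short derivation from the stated loss rather than a formula quoted from the paper, and \(w_t\) is shorthand introduced here for the detached weight:

\[ w_t = \text{detach}\left(\min(r_t(\theta), \epsilon_{\text{high}})\right), \qquad \nabla_\theta \mathcal{L}_{\text{CISPO}}(\theta) = -\mathbb{E}\left[w_t \cdot \hat{A}_t \cdot \nabla_\theta \log \pi_\theta(a_t|s_t)\right] \]

Because \(w_t > 0\) for every token, each token with a nonzero advantage contributes to the gradient. For tokens with \(r_t(\theta) \ge \epsilon_{\text{high}}\), the coefficient is capped at \(\epsilon_{\text{high}}\) instead of dropping to zero.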

Implementation Details

The pseudo-code implementation of CISPO is as follows:

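import torch

# per_token_logps, old_per_token_logps: log-probs of the sampled tokens, assumed shape (batch, seq_len)
log_ratio = per_token_logps - old_per_token_logps
importance_weights = torch.exp(log_ratio)  # r_t(θ) = π_θ / π_θ_old

# Cap the weights from above, then detach so they act as constant coefficients
clamped_ratios = torch.clamp(importance_weights, max=epsilon_high).detach()

# advantages: assumed shape (batch,), broadcast over the sequence dimension;
# gradients flow only through per_token_logps
per_token_loss = -clamped_ratios * advantages.unsqueeze(1) * per_token_logps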
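As a usage illustration, the snippet above can be wrapped in a function and run on toy tensors to see which tokens receive gradient. This is a sketch under stated assumptions. The function name `cispo_per_token_loss`, the toy probabilities, and the plain sum over tokens are hypothetical choices made here. How GRPOTrainer masks and aggregates the per-token loss is not covered.

```python
import torch

def cispo_per_token_loss(per_token_logps, old_per_token_logps, advantages, epsilon_high):
    """Hypothetical wrapper around the pseudo-code above.

    per_token_logps, old_per_token_logps: (batch, seq_len); advantages: (batch,)
    """
    importance_weights = torch.exp(per_token_logps - old_per_token_logps)
    # Detached weights are constants; only per_token_logps carries gradient.
    clamped_ratios = torch.clamp(importance_weights, max=epsilon_high).detach()
    return -clamped_ratios * advantages.unsqueeze(1) * per_token_logps

# Toy inputs: one sequence of three tokens with ratios 1, 3, and 10.
old_logps = torch.log(torch.tensor([[0.10, 0.01, 0.01]]))
new_logps = torch.log(torch.tensor([[0.10, 0.03, 0.10]])).requires_grad_()
advantages = torch.tensor([1.0])

loss = cispo_per_token_loss(new_logps, old_logps, advantages, epsilon_high=5.0)
loss.sum().backward()  # plain sum, for illustration only
grads = new_logps.grad
print(grads)
```

With these inputs the per-token gradients with respect to the log-probabilities come out at approximately -1, -3, and -5. The token with ratio 10 still receives a gradient, with its weight capped at epsilon_high rather than set to zero.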

Parameter Configuration

CISPO training is built on GRPOTrainer and can be enabled by setting the following parameters:

--loss_type cispo
--epsilon_high 5.0

Compared to other algorithms, CISPO generally uses a larger value for epsilon_high. The MiniMax-M1 paper does not provide specific parameter settings; the value used here follows the experimental setup in the ScaleRL paper.

For other training parameters, refer to the GRPO parameter documentation.