# Clipped Importance Sampling Policy Optimization (CISPO)

Clipped Importance Sampling Policy Optimization (CISPO) is a reinforcement learning algorithm proposed in the [MiniMax-M1](https://arxiv.org/abs/2506.13585) paper. Compared to GRPO (Group Relative Policy Optimization), CISPO clips the importance sampling weights themselves.

## Algorithm Overview

For clarity, we explain CISPO by contrasting it with GRPO.

GRPO limits the magnitude of policy updates by clipping the policy ratio. Its loss function is:

$$
\mathcal{L}_{\text{GRPO}}(\theta) = -\mathbb{E}\left[\min\left(r_t(\theta) \cdot \hat{A}_t, \text{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon) \cdot \hat{A}_t\right)\right]
$$

where $r_t(\theta) = \frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_{\text{old}}}(a_t|s_t)}$ is the importance sampling ratio.

When handling long reasoning chains, this clipping approach can lead to the following issues:

**Gradient Suppression of Critical Tokens**: In complex reasoning tasks, certain critical low-probability tokens (such as *However, Recheck, Wait, Aha*) are crucial for triggering deep thinking and reasoning error correction. These tokens have low probability in the old policy $\pi_{\theta_{\text{old}}}$. When the new policy attempts to increase their probability, it results in a large policy ratio $r_t(\theta)$, and GRPO's clipping mechanism will discard these tokens.


### CISPO's Solution

The core idea of CISPO is to clip the importance sampling weights while preserving gradient updates. Specifically, CISPO's loss function is:

$$
\mathcal{L}_{\text{CISPO}}(\theta) = -\mathbb{E}\left[\text{detach}\left(\min(r_t(\theta), \epsilon_{\text{high}})\right) \cdot \hat{A}_t \cdot \log \pi_\theta(a_t|s_t)\right]
$$

where $r_t(\theta) = \frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_{\text{old}}}(a_t|s_t)}$ is the importance sampling ratio.

**Key Mechanisms**:
- Clip the importance sampling weights: $\min(r_t(\theta), \epsilon_{\text{high}})$
- **Detach operation**: The clipped weights do not participate in gradient computation and serve as constant coefficients
- Gradients come from the $\log \pi_\theta(a_t|s_t)$ term, ensuring all tokens contribute gradients


## Implementation Details

The pseudo-code implementation of CISPO is as follows:

```python
log_ratio = per_token_logps - old_per_token_logps
importance_weights = torch.exp(log_ratio)  # r_t(θ) = π_θ / π_θ_old

clamped_ratios = torch.clamp(importance_weights, max=epsilon_high).detach()

per_token_loss = -clamped_ratios * advantages.unsqueeze(1) * per_token_logps
```

## Parameter Configuration

CISPO training can be enabled based on `GRPOTrainer` by setting the following parameters:

```bash
--loss_type cispo
--epsilon_high 5.0
```

> Compared to other algorithms, cispo generally uses a larger value for epsilon_high. The minimax paper does not provide specific parameter settings; the value used here refers to the experimental setup in the paper [ScaleRL](https://arxiv.org/pdf/2510.13786).

For other training parameters, refer to the [GRPO parameter documentation](../../Command-line-parameters.md#grpo-arguments).