Advanced Research

  • Beyond the 80/20 Rule: High-Entropy Minority Tokens Drive Effective Reinforcement Learning for LLM Reasoning
  • Clipped Importance Sampling Policy Optimization (CISPO)
  • DAPO: An Open-Source LLM Reinforcement Learning System at Scale
  • DeepEyes: Incentivizing “Thinking with Images” via Reinforcement Learning
  • FIPO: Future-KL Influenced Policy Optimization
  • Group Sequence Policy Optimization (GSPO)
  • On-Policy RL Meets Off-Policy Experts: Harmonizing SFT and RL via Dynamic Weighting (CHORD)
  • REINFORCE Leave-One-Out (RLOO)
  • REINFORCE++: An Efficient RLHF Algorithm with Robustness to Both Prompt and Reward Models
  • Rewards as Labels: Revisiting RLVR from a Classification Perspective
  • Router Replay (R2/R3)
  • Soft Adaptive Policy Optimization (SAPO)
  • Training-Inference-Mismatch
  • TreePO: Bridging the Gap of Policy Optimization and Efficacy and Inference Efficiency with Heuristic Tree-based Modeling

© Copyright 2022-2025, Alibaba ModelScope.
