
GRPO

Get Started

  • Get Started
    • GRPO

Developer Guide

  • Developer Guide
    • Loss Types
    • Multi-turn Training
    • Multi-Task Training
    • Reward Function
    • Reward Model
    • GYM Environment Training

Advanced Research

  • Advanced Research
    • Beyond the 80/20 Rule: High-Entropy Minority Tokens Drive Effective Reinforcement Learning for LLM Reasoning
    • Clipped Importance Sampling Policy Optimization (CISPO)
    • DAPO: An Open-Source LLM Reinforcement Learning System at Scale
    • DeepEyes: Incentivizing “Thinking with Images” via Reinforcement Learning
    • FIPO: Future-KL Influenced Policy Optimization
    • Group Sequence Policy Optimization (GSPO)
    • On-Policy RL Meets Off-Policy Experts: Harmonizing SFT and RL via Dynamic Weighting (CHORD)
    • REINFORCE Leave-One-Out (RLOO)
    • REINFORCE++: An Efficient RLHF Algorithm with Robustness to Both Prompt and Reward Models
    • Rewards as Labels: Revisiting RLVR from a Classification Perspective
    • Router Replay (R2/R3)
    • Soft Adaptive Policy Optimization (SAPO)
    • Training-Inference-Mismatch
    • TreePO: Bridging the Gap of Policy Optimization and Efficacy and Inference Efficiency with Heuristic Tree-based Modeling

© Copyright 2022-2025, Alibaba ModelScope.
