NPU Support

ms-swift supports Ascend NPUs: you can fine-tune models and run inference on them.

This document describes how to prepare the environment, train models, save and merge checkpoints, run inference, deploy services, and troubleshoot common issues on Ascend NPUs.

If this is your first time using ms-swift on NPUs, we recommend reading this document in the following order:

  1. Check “Support Scope at a Glance” first to confirm whether your model, algorithm, and backend have been verified.

  2. Use “Choose Your Usage Path” to decide whether you only need the base environment or also need MindSpeed/Megatron-SWIFT.

  3. Choose “Local Environment Installation” or “Image/Container Environment Installation” according to your own environment management preference, then run “NPU Availability Check”.

  4. Use “Quick Start” to run one end-to-end flow with a ModelScope model: LoRA training, merge, inference, and deployment.

  5. For larger-scale training, continue reading the DDP, DeepSpeed, and MindSpeed/Megatron-SWIFT sections.

Hardware and Supported Operating Systems

Table 1 Product hardware support list

Product | Supported
Ascend 950 series products | √
Atlas A3 training series products | √
Atlas A3 inference series products | x
Atlas A2 training series products | √
Atlas A2 inference series products | x
Atlas 200I/500 A2 inference products | x
Atlas inference series products | x
Atlas training series products | x

Note: In the table above, “√” indicates supported and “x” indicates not supported.

  • For operating systems supported by each hardware product in physical-machine deployment scenarios, see the Compatibility Query Assistant.

  • For operating systems supported by each hardware product in VM and container deployment scenarios, see “Operating System Compatibility” in CANN Software Installation for the commercial edition or community edition.

Support Scope at a Glance

Recommended base environment versions:

Software | Version
Python | >= 3.10, < 3.12
CANN | >= 8.5.1
torch | >= 2.7.1
torch_npu | >= 2.7.1.post4

Note: The compatibility matrix officially recommended for the vLLM Ascend series now lists CANN 9.0.0, torch 2.9.0, and torch_npu 2.9.0, with vllm-ascend 0.18.0 for A3 and vllm-ascend 0.19.1 for A5. For details, see the vLLM Ascend installation guide.

For base environment setup, see the Ascend PyTorch installation guide. The examples in this document were verified on 8 * Ascend 910B3 64G.
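
The minimum versions above can be checked programmatically before launching a job. The following is a minimal sketch, not part of ms-swift: the helper names (version_key, meets_minimum, report) are hypothetical, and the version parsing is a simplification that only understands numeric release segments and a ".postN" suffix rather than full PEP 440.

```python
import re
from importlib import metadata

# Minimums copied from the table above; adjust them if your target stack differs.
MINIMUMS = {"torch": "2.7.1", "torch_npu": "2.7.1.post4"}


def version_key(version: str) -> tuple:
    """Turn '2.7.1.post4+cpu' into ((2, 7, 1, 0), 4) so versions can be ordered."""
    public = version.split("+", 1)[0]  # drop a local suffix such as '+cpu'
    release = re.match(r"\d+(?:\.\d+)*", public)
    if release is None:
        raise ValueError(f"unrecognized version: {version!r}")
    numbers = tuple(int(part) for part in release.group().split("."))
    numbers += (0,) * max(0, 4 - len(numbers))  # pad so '2.9' compares equal to '2.9.0'
    post = re.search(r"\.?post(\d+)", public)
    post_number = int(post.group(1)) if post else 0
    return numbers, post_number


def meets_minimum(installed: str, minimum: str) -> bool:
    return version_key(installed) >= version_key(minimum)


def report() -> dict:
    """Map each package to True/False, or None when it is not installed."""
    result = {}
    for name, minimum in MINIMUMS.items():
        try:
            result[name] = meets_minimum(metadata.version(name), minimum)
        except metadata.PackageNotFoundError:
            result[name] = None
    return result


print(report())
```

A None entry means the package was not found in the current Python environment, which usually points to the wrong conda environment or container rather than a version problem.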

Primary feature | Feature | Status
Training | CPT | Supported
Training | SFT | Supported
Training | DPO | Supported
Training | RM | Supported
Distributed | DDP | Supported
Distributed | FSDP | Supported
Distributed | FSDP2 | Supported
Distributed | DeepSpeed | Supported
Distributed | MindSpeed(Megatron) | Supported
PEFT | FULL | Supported
PEFT | LoRA | Supported
PEFT | QLoRA | Not supported
RLHF | GRPO | Supported
RLHF | PPO | Supported
Optimization | FA and other fused ops | Supported
Optimization | Liger-Kernel | Not supported
Deployment | PT | Supported
Deployment | vLLM | Supported
Deployment | SGLang | Not supported

Verified SFT Combinations

Algorithm | Model | Strategy | Hardware
SFT | Qwen2.5-0.5B-Instruct | FSDP1/FSDP2/deepspeed | Atlas 900 A2 PODc
SFT | Qwen2.5-1.5B-Instruct | FSDP1/FSDP2/deepspeed | Atlas 900 A2 PODc
SFT | Qwen2.5-7B-Instruct | FSDP1/FSDP2/deepspeed | Atlas 900 A2 PODc
SFT | Qwen2.5-VL-3B-Instruct | FSDP1/FSDP2/deepspeed | Atlas 900 A2 PODc
SFT | Qwen2.5-VL-7B-Instruct | FSDP1/FSDP2/deepspeed | Atlas 900 A2 PODc
SFT | Qwen2.5-Omni-3B | FSDP1/FSDP2/deepspeed | Atlas 900 A2 PODc
SFT | Qwen3-8B | FSDP1/FSDP2/deepspeed | Atlas 900 A2 PODc
SFT | Qwen3-30B-A3B | FSDP1/FSDP2/deepspeed | Atlas 900 A2 PODc
SFT | Qwen3-32B | FSDP1/FSDP2/deepspeed | Atlas 900 A2 PODc
SFT | Qwen3-VL-30B-A3B-Instruct | FSDP1/FSDP2/deepspeed | Atlas 900 A2 PODc
SFT | Qwen3-Omni-30B-A3B-Instruct | FSDP1/FSDP2/deepspeed/Megatron | Atlas 900 A2 PODc/A3 SuperPoD
SFT | InternVL3-8B | FSDP1/FSDP2/deepspeed | Atlas 900 A2 PODc
SFT | Ovis2.5-2B | FSDP1/FSDP2/deepspeed | Atlas 900 A2 PODc
SFT | Qwen3.5-27B | FSDP1/FSDP2/deepspeed/Megatron | Atlas 900 A2 PODc/A3 SuperPoD
SFT | Qwen3.5-35B-A3B | FSDP1/FSDP2/deepspeed/Megatron | Atlas 900 A2 PODc/A3 SuperPoD

Verified RL Combinations

Algorithm | Model | Strategy | Rollout engine | Hardware
GRPO | Qwen2.5-7B-Instruct | deepspeed | vllm-ascend | Atlas 900 A2 PODc
GRPO | Qwen3-8B | deepspeed | vllm-ascend | Atlas 900 A2 PODc
DPO | Qwen2.5-7B-Instruct | deepspeed | vllm-ascend | Atlas 900 A2 PODc
DPO | Qwen3-8B | deepspeed | vllm-ascend | Atlas 900 A2 PODc
PPO | Qwen2.5-7B-Instruct | deepspeed | vllm-ascend | Atlas 900 A2 PODc
PPO | Qwen3-8B | deepspeed | vllm-ascend | Atlas 900 A2 PODc

Not Yet Supported or Not Fully Verified

item
Liger-kernel
Quantization/QLoRA related features
Using SGLang as the inference engine
Enabling ETP for LoRA training in Megatron

PEFT Transformers 5 MoE Fused Expert LoRA Limitation

When training Transformers 5 MoE models such as Qwen3.5-MoE and Qwen3-Omni-MoE with LoRA, some expert weights may be stored as fused nn.Parameter tensors instead of ordinary nn.Linear modules. Injecting LoRA into these parameters relies on PEFT’s target_parameters path.

This path is not fully stable yet with combinations such as lora_dropout, ZeRO-3/FSDP, and multiple adapters. Typical trigger conditions include:

  • using a MoE model;

  • using LoRA and trying to cover fused expert parameters;

  • triggering PEFT’s Transformers 5 MoE target conversion path through the model config or the command-line --model_type;

  • leaving lora_dropout at its nonzero default, or using parameter-sharding backends such as ZeRO-3/FSDP.

For regular Qwen3.5 GRPO/SFT LoRA training, avoid explicitly passing --model_type when it is not needed. If the model config itself already triggers this PEFT path, prefer full-parameter training or disable the affected LoRA combination. If fused expert LoRA is required, wait for the PEFT upstream support to stabilize, or use it only after validating the target model and training backend with lora_dropout=0.

Choose Your Usage Path

Scenario | Recommended path | Need MindSpeed
Ordinary SFT/LoRA/inference | Local environment installation or image/container installation | No
Megatron-SWIFT large-model training | Install the base environment, then install MindSpeed/Megatron/mcore-bridge | Yes
GRPO/PPO/DPO and other RLHF workflows | Base training environment + vLLM-Ascend rollout/deploy | Usually no
Only verifying whether NPUs are usable | Run the NPU availability check script | No

Environment Preparation

Image/Container Environment Installation

The official NPU image is available at quay.io/ascend/ms-swift. We recommend choosing an image tag that matches your device generation, Python version, CANN version, and OS version first. If you need to pin a branch or customize dependencies, build the image from the Dockerfile provided by the project. The container approach makes dependency versions easier to freeze and helps reproduce the same environment across multiple Ascend machines.

The following example uses the A2, Python 3.11, CANN 9.0.0, Ubuntu 22.04 tag. In actual use, choose the latest tag from the Quay tag page that matches your machine and software stack.

docker pull quay.io/ascend/ms-swift:v4.3.0-A2-py311-CANN9.0.0-ubuntu22.04
export IMAGE_NAME=quay.io/ascend/ms-swift:v4.3.0-A2-py311-CANN9.0.0-ubuntu22.04
export WORKSPACE=/path/to/workspace

If you need to build the image yourself, clone the modelscope repository first, then use Dockerfile.ascend and build_image.py:

git clone https://github.com/modelscope/modelscope.git
cd modelscope
DOCKER_REGISTRY=ms-swift python docker/build_image.py \
  --image_type ascend \
  --python_version 3.11.11 \
  --soc_version ascend910b1 \
  --arch arm

The current build_image.py generates Ascend image names in the format {DOCKER_REGISTRY}:{swift_branch}-{atlas_hardware}-{python_tag}-{arch}. The command above uses the ARM-based Atlas 900 A2 PODc as an example and usually generates ms-swift:main-A2-py311-arm. If you use a self-built image, replace IMAGE_NAME above with the actual image name from your build log.

export IMAGE_NAME=ms-swift:main-A2-py311-arm
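
To predict the image name before the build finishes, the naming format described above can be sketched as follows. This is an illustration only and not the code in build_image.py: the function names are hypothetical, and deriving the Python tag ("py311") from the first two components of the --python_version value is an assumption based on the example output. The hardware label (for example "A2") is passed in directly because the mapping from --soc_version is not described here.

```python
def python_tag(python_version: str) -> str:
    """'3.11.11' -> 'py311' (assumed rule: major and minor components only)."""
    major, minor = python_version.split(".")[:2]
    return f"py{major}{minor}"


def ascend_image_name(registry: str, swift_branch: str, atlas_hardware: str,
                      python_version: str, arch: str) -> str:
    """Compose {DOCKER_REGISTRY}:{swift_branch}-{atlas_hardware}-{python_tag}-{arch}."""
    tag = f"{swift_branch}-{atlas_hardware}-{python_tag(python_version)}-{arch}"
    return f"{registry}:{tag}"


print(ascend_image_name("ms-swift", "main", "A2", "3.11.11", "arm"))
# ms-swift:main-A2-py311-arm
```

Always prefer the name printed in your build log if it differs from this prediction.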

Before starting the container, check which NPU devices are exposed on the host:

ls /dev/davinci*

When starting the container, mount the NPU devices, driver, firmware, npu-smi, and required log directories. The following example uses the common 8-card device range davinci0 to davinci7. Some machines may also expose devices up to davinci15; in that case, add the corresponding devices to docker run according to the output of ls /dev/davinci*.

docker run -it \
  --name swift-ascend \
  --network=host --ipc=host --shm-size=128g \
  --device=/dev/davinci0 --device=/dev/davinci1 \
  --device=/dev/davinci2 --device=/dev/davinci3 \
  --device=/dev/davinci4 --device=/dev/davinci5 \
  --device=/dev/davinci6 --device=/dev/davinci7 \
  --device=/dev/davinci_manager --device=/dev/devmm_svm --device=/dev/hisi_hdc \
  -v /usr/local/Ascend/driver:/usr/local/Ascend/driver:ro \
  -v /usr/local/Ascend/firmware:/usr/local/Ascend/firmware:ro \
  -v /usr/local/sbin/npu-smi:/usr/local/sbin/npu-smi:ro \
  -v /etc/ascend_install.info:/etc/ascend_install.info:ro \
  -v /var/log/npu:/var/log/npu \
  -v ${WORKSPACE}:/workspace \
  ${IMAGE_NAME} \
  /bin/bash
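
On machines that expose a different number of devices, typing the --device flags by hand is error-prone. The sketch below builds them from the /dev/davinci* listing; the helper name device_flags is hypothetical. It keeps only the numbered devices, because /dev/davinci_manager is already passed separately in the docker run command above.

```python
import glob
import re


def device_flags(paths):
    """Return sorted '--device=/dev/davinciN' flags for the numbered devices in paths."""
    indices = []
    for path in paths:
        match = re.fullmatch(r"/dev/davinci(\d+)", path)
        if match:  # skips /dev/davinci_manager and anything else without a number
            indices.append(int(match.group(1)))
    return [f"--device=/dev/davinci{index}" for index in sorted(indices)]


# On the host, this prints one flag per NPU device; it prints an empty line
# when no /dev/davinci* device exists.
print(" ".join(device_flags(glob.glob("/dev/davinci*"))))
```

Paste the printed flags into docker run in place of the hard-coded davinci0 to davinci7 lines.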

After entering the container, run source /usr/local/Ascend/ascend-toolkit/set_env.sh first, then run the NPU availability check below to confirm that the container can access the Ascend devices. If the container cannot detect NPUs, check /dev/davinci*, /dev/davinci_manager, the driver directory, and npu-smi mounts first.

Local Environment Installation

# Create a new conda virtual environment (optional)
conda create -n swift-npu python=3.11 -y
conda activate swift-npu

# Source the CANN environment before the following steps
source /usr/local/Ascend/ascend-toolkit/set_env.sh

# Set the global pip mirror (optional, speeds up downloads)
pip config set global.index-url https://mirrors.aliyun.com/pypi/simple/

# Option 1: install the latest release from PyPI
pip install ms-swift -U

# Option 2: install from source instead (editable install)
git clone https://github.com/modelscope/ms-swift.git
cd ms-swift
pip install -e .

# Install torch_npu
pip install torch_npu==2.9.0 decorator
# If you want to use deepspeed (to reduce memory usage, with some speed overhead)
pip install deepspeed

# If you need evaluation features, install the following package
pip install 'evalscope[opencompass]'

# If you need vllm-ascend for inference, install the following packages (for more versions, see the [vLLM-Ascend official website](https://docs.vllm.ai/projects/ascend/en/latest/installation.html))
pip install vllm==0.18.0
pip install vllm-ascend==0.18.0

NPU Availability Check

Check whether the environment is installed correctly and whether NPUs can be loaded:

from transformers.utils import is_torch_npu_available
import torch

print(is_torch_npu_available())  # True
print(torch.npu.device_count())  # 8
print(torch.randn(10, device='npu:0'))
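
The check above raises an exception as soon as something is missing. When you want a report instead, for example in a container health check, a more defensive variant may help. This is a sketch under assumptions: the function name npu_status and the keys of the returned dictionary are hypothetical. It only relies on torch_npu registering the torch.npu namespace on import.

```python
import importlib.util


def npu_status() -> dict:
    """Report which parts of the NPU stack are usable, without raising."""
    status = {"torch": False, "torch_npu": False, "npu_count": 0, "error": None}
    if importlib.util.find_spec("torch") is None:
        return status
    status["torch"] = True
    if importlib.util.find_spec("torch_npu") is None:
        return status
    try:
        import torch
        import torch_npu  # noqa: F401  importing it registers the torch.npu backend

        status["torch_npu"] = True
        if torch.npu.is_available():
            status["npu_count"] = torch.npu.device_count()
    except Exception as exc:
        # Typical cause: CANN shared libraries not found because set_env.sh was not sourced.
        status["error"] = f"{type(exc).__name__}: {exc}"
    return status


print(npu_status())
```

If torch_npu is installed but "error" is set or "npu_count" is 0, re-run source /usr/local/Ascend/ascend-toolkit/set_env.sh and check the device mounts before suspecting the Python packages.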

Optional MindSpeed/Megatron-SWIFT Installation

If you need MindSpeed(Megatron-LM), install the required dependencies as follows.

# 1. Clone Megatron-LM and switch to v0.16.0
git clone https://github.com/NVIDIA/Megatron-LM.git
cd Megatron-LM
git checkout core_v0.16.0
cd ..

# 2. Clone and install MindSpeed
git clone https://gitcode.com/Ascend/MindSpeed.git
cd MindSpeed
git checkout core_r0.16.0
pip install -e .
cd ..

# 3. Clone and install mcore-bridge
git clone https://github.com/modelscope/mcore-bridge.git
cd mcore-bridge
pip install -e .
cd ..

# 4. Download and install triton-ascend
pip install triton-ascend==3.2.1 --extra-index-url=https://triton-ascend.osinfra.cn/pypi/simple

# 5. Set environment variables
export PYTHONPATH=$PYTHONPATH:<your_local_megatron_lm_path>
export MEGATRON_LM_PATH=<your_local_megatron_lm_path>

# 6. Disable Megatron GDN if you need to fall back to the transformers GatedDeltaNet implementation
export USE_MCORE_GDN=0

Run the following command to verify that MindSpeed(Megatron-LM) is configured correctly:

python -c "import mindspeed.megatron_adaptor; from swift.megatron.init import init_megatron_env; init_megatron_env(); print('✓ Megatron-SWIFT configuration verified successfully in the NPU environment!')"
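
If the verification command fails, it can help to rule out the environment variables from step 5 first. The following sketch checks only what this section states: MEGATRON_LM_PATH should point at the cloned Megatron-LM directory and the same path should appear in PYTHONPATH. The function name megatron_env_problems is hypothetical.

```python
import os
from pathlib import Path


def megatron_env_problems(env) -> list:
    """Return human-readable problems found in the given environment mapping."""
    problems = []
    megatron_path = env.get("MEGATRON_LM_PATH", "")
    if not megatron_path:
        problems.append("MEGATRON_LM_PATH is not set")
    elif not Path(megatron_path).is_dir():
        problems.append(f"MEGATRON_LM_PATH is not a directory: {megatron_path}")
    python_path_entries = env.get("PYTHONPATH", "").split(os.pathsep)
    if megatron_path and megatron_path not in python_path_entries:
        problems.append("MEGATRON_LM_PATH is not listed in PYTHONPATH")
    return problems


for line in megatron_env_problems(os.environ) or ["environment variables look consistent"]:
    print(line)
```

The comparison is a plain string match, so a trailing slash or a relative path in one of the two variables is reported as a mismatch.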

Qwen3.5 FLA Patch Notes

The current repository already includes a built-in Qwen3.5 linear attention patch for Ascend NPUs, so users do not need to manually modify the transformers or fla source code. This patch does not replace the entire flash-linear-attention package directly. Instead, it redirects the low-level GPU Triton operator path used by Qwen3.5 through chunk_gated_delta_rule to the MindSpeed NPU implementation.

When the patch takes effect, ms-swift performs the following replacements:

  1. Set transformers.utils.is_flash_linear_attention_available and transformers.utils.import_utils.is_flash_linear_attention_available to return True, so that transformers.models.qwen3_5.modeling_qwen3_5 can complete initialization through the FLA fast path.

  2. Redirect transformers.models.qwen3_5.modeling_qwen3_5.chunk_gated_delta_rule and transformers.models.qwen3_5_moe.modeling_qwen3_5_moe.chunk_gated_delta_rule to the built-in ms-swift implementation swift.model.chunk_gated_delta_rule.chunk_gated_delta_rule.

  3. Inside swift.model.chunk_gated_delta_rule, continue calling the native Triton operators provided by MindSpeed, including:

    • mindspeed.lite.ops.triton.chunk_delta_h

    • mindspeed.lite.ops.triton.chunk_o

    • mindspeed.lite.ops.triton.chunk_scaled_dot_kkt

    • mindspeed.lite.ops.triton.wy_fast

  4. Keep the native torch l2norm helper, reducing per-layer per-step launch overhead as well as compile/autotune overhead during cold start, which improves model performance on NPU.

  5. For FusedRMSNormGated, which depends on torch.cuda.current_device() during FLA initialization, NPU keeps the native Qwen3.5 torch path to avoid compatibility issues caused by CUDA-only initialization logic.

The call chain can be understood as:

Qwen3.5 modeling.chunk_gated_delta_rule
    -> swift.model.chunk_gated_delta_rule.chunk_gated_delta_rule
    -> MindSpeed Triton kernels

Therefore:

  • This patch mainly covers the gated-delta-rule path of Qwen3.5 linear attention.

  • It is not equivalent to “fully replacing the entire fla package with MindSpeed”.

  • To make this path effective, ensure that MindSpeed can be imported correctly in the current environment.

  • Verified versions for accuracy alignment: torch 2.9.0 + MindSpeed 0.16.0 + flash-linear-attention 0.4.2 + triton-ascend 3.2.1 + transformers 5.2.0
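
The redirection in step 2 is ordinary attribute rebinding on a module. The sketch below illustrates the mechanism with stand-in objects only. It is not the ms-swift patch code, it does not import transformers or MindSpeed, and every name in it apart from chunk_gated_delta_rule is hypothetical.

```python
from types import SimpleNamespace

# Hypothetical stand-in for a modeling module such as
# transformers.models.qwen3_5.modeling_qwen3_5; this is not the real module.
modeling = SimpleNamespace()


def gpu_chunk_gated_delta_rule(values):
    """Stand-in for the GPU Triton path, which cannot run on an NPU."""
    raise RuntimeError("GPU Triton kernel is unavailable on this device")


def npu_chunk_gated_delta_rule(values):
    """Stand-in for the NPU implementation; the arithmetic is a placeholder."""
    return [2 * value for value in values]


modeling.chunk_gated_delta_rule = gpu_chunk_gated_delta_rule


def layer_forward(values):
    # The attribute is looked up on the module at call time, so rebinding it
    # changes which implementation every later call reaches.
    return modeling.chunk_gated_delta_rule(values)


def redirect(module, name, replacement):
    """Rebind module.<name> and return the original so it can be restored."""
    original = getattr(module, name)
    setattr(module, name, replacement)
    return original


original = redirect(modeling, "chunk_gated_delta_rule", npu_chunk_gated_delta_rule)
print(layer_forward([1, 2, 3]))  # [2, 4, 6]
```

Because only the one attribute is rebound, everything else in the module keeps its original behavior, which is why the patch is not equivalent to replacing the whole fla package.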

When running Qwen3.5 with Megatron-SWIFT on NPU, note the following version and feature constraints:

  1. The MindSpeed training combination currently pinned by the NPU documentation is Megatron-LM v0.16.0 + MindSpeed core_r0.16.0. With this combination, megatron-core already ships the native GDN kernel core.ssm.gated_delta_net, and mcore-bridge defaults to the Megatron-Core/MindSpeed GDN path with USE_MCORE_GDN=1. Set USE_MCORE_GDN=0 only when you intentionally need to fall back to the transformers-native GDN implementation wrapped by mcore-bridge. Combined with ms-swift’s built-in Qwen3.5 FLA NPU patch, chunk_gated_delta_rule is then redirected to MindSpeed’s Triton kernels. The known costs of this fallback path are:

    • The transformers GDN implementation does not support packing, nor TP/CP for the GDN layer.

  2. When using the native 0.16 GDN path with USE_MCORE_GDN=1, the fallback-only limitations above do not apply. You should still verify the native path’s packing, TP/CP, mask routing, and parallel combinations against your current MindSpeed/Megatron-LM, mcore-bridge, and target script combination.

Environment Viewing

Check the P2P topology of the NPUs. In the output below, each NPU connects to the other 7 NPUs through HCCS links.

(valle) root@valle:~/src# npu-smi info -t topo
	   NPU0       NPU1       NPU2       NPU3       NPU4       NPU5       NPU6       NPU7       CPU Affinity
NPU0       X          HCCS       HCCS       HCCS       HCCS       HCCS       HCCS       HCCS       144-167
NPU1       HCCS       X          HCCS       HCCS       HCCS       HCCS       HCCS       HCCS       144-167
NPU2       HCCS       HCCS       X          HCCS       HCCS       HCCS       HCCS       HCCS       96-119
NPU3       HCCS       HCCS       HCCS       X          HCCS       HCCS       HCCS       HCCS       96-119
NPU4       HCCS       HCCS       HCCS       HCCS       X          HCCS       HCCS       HCCS       0-23
NPU5       HCCS       HCCS       HCCS       HCCS       HCCS       X          HCCS       HCCS       0-23
NPU6       HCCS       HCCS       HCCS       HCCS       HCCS       HCCS       X          HCCS       48-71
NPU7       HCCS       HCCS       HCCS       HCCS       HCCS       HCCS       HCCS       X          48-71

Legend:

  X    = Self
  SYS  = Path traversing PCIe and NUMA nodes. Nodes are connected through SMP, such as QPI, UPI.
  PHB  = Path traversing PCIe and the PCIe host bridge of a CPU.
  PIX  = Path traversing a single PCIe switch
  PXB  = Path traversing multiple PCIe switches
  HCCS = Connection traversing HCCS.
  NA   = Unknown relationship.

Check NPU status. For details about the npu-smi command, see the official documentation.

(valle) root@valle:~/src# npu-smi info
+------------------------------------------------------------------------------------------------+
| npu-smi 24.1.rc1.b030            Version: 24.1.rc1.b030                                        |
+---------------------------+---------------+----------------------------------------------------+
| NPU   Name                | Health        | Power(W)    Temp(C)           Hugepages-Usage(page)|
| Chip                      | Bus-Id        | AICore(%)   Memory-Usage(MB)  HBM-Usage(MB)        |
+===========================+===============+====================================================+
| 0     910B3               | OK            | 101.8       43                0    / 0             |
| 0                         | 0000:C1:00.0  | 0           0    / 0          3318 / 65536         |
+===========================+===============+====================================================+
| 1     910B3               | OK            | 92.0        39                0    / 0             |
| 0                         | 0000:C2:00.0  | 0           0    / 0          3314 / 65536         |
+===========================+===============+====================================================+
| 2     910B3               | OK            | 102.0       40                0    / 0             |
| 0                         | 0000:81:00.0  | 0           0    / 0          3314 / 65536         |
+===========================+===============+====================================================+
| 3     910B3               | OK            | 99.8        40                0    / 0             |
| 0                         | 0000:82:00.0  | 0           0    / 0          3314 / 65536         |
+===========================+===============+====================================================+
| 4     910B3               | OK            | 98.6        45                0    / 0             |
| 0                         | 0000:01:00.0  | 0           0    / 0          3314 / 65536         |
+===========================+===============+====================================================+
| 5     910B3               | OK            | 99.7        44                0    / 0             |
| 0                         | 0000:02:00.0  | 0           0    / 0          3314 / 65536         |
+===========================+===============+====================================================+
| 6     910B3               | OK            | 103.8       45                0    / 0             |
| 0                         | 0000:41:00.0  | 0           0    / 0          3314 / 65536         |
+===========================+===============+====================================================+
| 7     910B3               | OK            | 98.2        44                0    / 0             |
| 0                         | 0000:42:00.0  | 0           0    / 0          3315 / 65536         |
+===========================+===============+====================================================+

Quick Start: ModelScope Model + Dataset

If you want to quickly verify the environment with a ModelScope model and dataset, you can use this section to complete the full flow: train LoRA, find the latest checkpoint, Merge LoRA, run CLI inference, start a service, and validate it with curl. The example uses a small model and a small data sample so that it can run quickly. To use your own model or dataset, modify the ID variables at the beginning.

export MODEL_ID=Qwen/Qwen3-0.6B
export DATASET_ID=AI-ModelScope/alpaca-gpt4-data-zh
export WORK_DIR=output/npu-modelscope-qwen3-0_6b-lora

Train and save a LoRA checkpoint:

ASCEND_RT_VISIBLE_DEVICES=0 \
swift sft \
    --model $MODEL_ID \
    --dataset $DATASET_ID#1000 \
    --split_dataset_ratio 0.01 \
    --torch_dtype bfloat16 \
    --tuner_type lora \
    --target_modules all-linear \
    --lora_rank 8 \
    --lora_alpha 32 \
    --num_train_epochs 1 \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --gradient_accumulation_steps 8 \
    --learning_rate 1e-4 \
    --max_length 2048 \
    --save_steps 100 \
    --eval_steps 100 \
    --save_total_limit 2 \
    --logging_steps 10 \
    --output_dir $WORK_DIR

After training finishes, checkpoints are saved under $WORK_DIR/*/checkpoint-*. Use the following commands to select the latest checkpoint and merge LoRA into a full model:

export CKPT_DIR=$(ls -dt $WORK_DIR/*/checkpoint-* | head -n 1)

ASCEND_RT_VISIBLE_DEVICES=0 \
swift export \
    --adapters $CKPT_DIR \
    --merge_lora true

export MERGED_DIR=${CKPT_DIR}-merged
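
The ls -dt command above picks the most recently modified directory, and its checkpoint-* pattern also matches checkpoint-*-merged. If you re-run it after merging, it may therefore select the merged directory. As an alternative, the sketch below selects the checkpoint with the highest step number and ignores merged directories. The function name latest_checkpoint and the demo directory v0-demo are hypothetical.

```python
import re
import tempfile
from pathlib import Path


def latest_checkpoint(work_dir):
    """Return the checkpoint-<step> directory with the largest step, or None."""
    best_step, best_path = -1, None
    for path in Path(work_dir).glob("*/checkpoint-*"):
        match = re.fullmatch(r"checkpoint-(\d+)", path.name)  # skips 'checkpoint-100-merged'
        if match and path.is_dir() and int(match.group(1)) > best_step:
            best_step, best_path = int(match.group(1)), path
    return best_path


# Small self-contained demo on a temporary directory tree.
with tempfile.TemporaryDirectory() as root:
    for name in ("checkpoint-100", "checkpoint-200", "checkpoint-200-merged"):
        (Path(root) / "v0-demo" / name).mkdir(parents=True)
    print(latest_checkpoint(root).name)  # checkpoint-200
```

Note that "highest step" and "most recently modified" can differ when several runs share one output directory, so choose the rule that matches how you organize runs.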

You can verify inference either with the LoRA checkpoint directly or with the merged full model:

# Load the LoRA checkpoint directly
ASCEND_RT_VISIBLE_DEVICES=0 \
swift infer \
    --adapters $CKPT_DIR \
    --stream true \
    --temperature 0 \
    --max_new_tokens 512

# Load the merged full model
ASCEND_RT_VISIBLE_DEVICES=0 \
swift infer \
    --model $MERGED_DIR \
    --stream true \
    --temperature 0 \
    --max_new_tokens 512

To start an OpenAI-compatible deployment service, use the merged full model:

ASCEND_RT_VISIBLE_DEVICES=0 \
swift deploy \
    --model $MERGED_DIR \
    --host 0.0.0.0 \
    --port 8000 \
    --max_new_tokens 512 \
    --served_model_name npu-modelscope-qwen3-0_6b

After the service starts, validate the API with curl:

curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "npu-modelscope-qwen3-0_6b",
"messages": [{"role": "user", "content": "Briefly introduce Ascend NPU in one sentence."}],
"max_tokens": 128,
"temperature": 0
}'
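
The same request can be sent from Python using only the standard library. This is a minimal sketch: the helper names are hypothetical, and the response parsing assumes the usual OpenAI-compatible layout (choices[0].message.content).

```python
import json
import urllib.request


def build_chat_request(base_url, model, prompt, max_tokens=128):
    """Build the POST request that the curl command above sends."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "temperature": 0,
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(base_url, model, prompt):
    """Send one chat request and return the assistant's reply text."""
    request = build_chat_request(base_url, model, prompt)
    with urllib.request.urlopen(request, timeout=60) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]
```

With the service from the previous step running, chat("http://localhost:8000", "npu-modelscope-qwen3-0_6b", "Briefly introduce Ascend NPU in one sentence.") returns the generated text. The model argument must match the value passed to --served_model_name.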

Training

The following examples introduce LoRA fine-tuning. For full-parameter fine-tuning, set --tuner_type full. For more training scripts, see examples/ascend/train. For general pre-training, SFT, LoRA, full-parameter training, and custom dataset usage, see Pre-training and Fine-tuning.

Model size | NPU count | DeepSpeed type | Max memory usage
7B | 1 | None | 1 * 28 GB
7B | 4 | None | 4 * 22 GB
7B | 4 | zero2 | 4 * 28 GB
7B | 4 | zero3 | 4 * 22 GB
7B | 8 | None | 8 * 22 GB
14B | 1 | None | 1 * 45 GB
14B | 8 | None | 8 * 51 GB
14B | 8 | zero2 | 8 * 49 GB
14B | 8 | zero3 | 8 * 31 GB

Single-Card Training

Start single-card fine-tuning with the following command:

# Experiment environment: Ascend 910B3
# Memory requirement: 28 GB
# Runtime: 8 hours
ASCEND_RT_VISIBLE_DEVICES=0 \
swift sft \
    --model Qwen/Qwen2-7B-Instruct \
    --dataset AI-ModelScope/blossom-math-v2 \
    --split_dataset_ratio 0.01 \
    --num_train_epochs 5 \
    --tuner_type lora \
    --output_dir output \
    --learning_rate 1e-4 \
    --gradient_accumulation_steps 16 \
    --save_steps 100 \
    --eval_steps 100

Data Parallel Training

The following example uses 4 NPUs for DDP training.

# Experiment environment: 4 * Ascend 910B3
# Memory requirement: 4 * 22 GB
# Runtime: 2 hours
NPROC_PER_NODE=4 \
ASCEND_RT_VISIBLE_DEVICES=0,1,2,3 \
swift sft \
    --model Qwen/Qwen2-7B-Instruct \
    --dataset AI-ModelScope/blossom-math-v2 \
    --split_dataset_ratio 0.01 \
    --num_train_epochs 5 \
    --tuner_type lora \
    --output_dir output \
    ...

DeepSpeed Training

ZeRO2:

# Experiment environment: 4 * Ascend 910B3
# Memory requirement: 4 * 28 GB
# Runtime: 3.5 hours
NPROC_PER_NODE=4 \
ASCEND_RT_VISIBLE_DEVICES=0,1,2,3 \
swift sft \
    --model Qwen/Qwen2-7B-Instruct \
    --dataset AI-ModelScope/blossom-math-v2 \
    --split_dataset_ratio 0.01 \
    --num_train_epochs 5 \
    --tuner_type lora \
    --output_dir output \
    --deepspeed zero2 \
    ...

ZeRO3:

# Experiment environment: 4 * Ascend 910B3
# Memory requirement: 4 * 22 GB
# Runtime: 8.5 hours
NPROC_PER_NODE=4 \
ASCEND_RT_VISIBLE_DEVICES=0,1,2,3 \
swift sft \
    --model Qwen/Qwen2-7B-Instruct \
    --dataset AI-ModelScope/blossom-math-v2 \
    --split_dataset_ratio 0.01 \
    --num_train_epochs 5 \
    --tuner_type lora \
    --output_dir output \
    --deepspeed zero3 \
    ...

Qwen3.5 Single-Node Multi-Card LoRA Example

The following is an NPU LoRA example for a newer model. It uses Qwen3.5-4B for demonstration. Four-card data parallelism is usually faster than single-card training. If you already have local model and dataset files, replace --model and --dataset with local paths.

# Experiment environment: 4 * Ascend 910B3
NPROC_PER_NODE=4 \
ASCEND_RT_VISIBLE_DEVICES=0,1,2,3 \
swift sft \
    --model Qwen/Qwen3.5-4B \
    --dataset AI-ModelScope/alpaca-gpt4-data-zh#2000 \
    --split_dataset_ratio 0.01 \
    --torch_dtype bfloat16 \
    --num_train_epochs 1 \
    --tuner_type lora \
    --target_modules all-linear \
    --lora_rank 8 \
    --lora_alpha 32 \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --gradient_accumulation_steps 8 \
    --learning_rate 1e-4 \
    --max_length 2048 \
    --group_by_length true \
    --dataset_num_proc 4 \
    --dataloader_num_workers 4 \
    --save_steps 100 \
    --eval_steps 100 \
    --save_total_limit 2 \
    --output_dir output/Qwen3.5-4B-NPU

When tuning parameters, focus on memory, throughput, and stability:

  • Reduce memory usage: first reduce --max_length, --per_device_train_batch_size, and --lora_rank; if OOM still occurs, enable --deepspeed zero2/zero3. ZeRO can significantly reduce memory pressure but introduces communication and scheduling overhead.

  • Improve throughput: increase --per_device_train_batch_size when memory allows, and use --gradient_accumulation_steps to keep the global batch size. Increase --dataset_num_proc if preprocessing is slow, and increase --dataloader_num_workers if data loading is the bottleneck.

  • Control save overhead: do not set --save_steps too small, because frequent checkpoint saving slows down training. --save_total_limit 2 is usually enough to keep the best checkpoint and the last checkpoint.

  • Improve stability: on NPUs, prefer bfloat16. If you see abnormal loss or NaN, first lower the learning rate and batch size; if necessary, temporarily switch to float32 for comparison.
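
The throughput advice above depends on keeping the global batch size constant while changing the per-device batch size. The arithmetic can be sketched as follows, assuming plain data parallelism, where the global batch size is per-device batch size × gradient accumulation steps × number of NPUs. The function names are hypothetical.

```python
def global_batch_size(per_device_batch_size, gradient_accumulation_steps, num_devices):
    """Samples contributing to one optimizer step under plain data parallelism."""
    return per_device_batch_size * gradient_accumulation_steps * num_devices


def accumulation_steps_for(target_global_batch, per_device_batch_size, num_devices):
    """Accumulation steps needed to keep a target global batch size."""
    per_step = per_device_batch_size * num_devices
    if target_global_batch % per_step:
        raise ValueError("target global batch is not a multiple of per-step samples")
    return target_global_batch // per_step


# The Qwen3.5-4B example: 4 NPUs, batch size 1 per device, 8 accumulation steps.
print(global_batch_size(1, 8, 4))        # 32
# Doubling the per-device batch size keeps 32 with half the accumulation steps.
print(accumulation_steps_for(32, 2, 4))  # 4
```

Keeping the global batch size fixed this way means the learning rate from the original configuration usually remains a reasonable starting point.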

For more parameter details, see Command-line Parameters.

NPU Model Patch Switch

ms-swift enables model-level patches by default in NPU environments to adapt some Transformers models to Ascend NPU operators and compatibility requirements. You usually do not need to disable them. If you suspect abnormal loss or forward errors are related to the NPU model patch and want to compare against native Transformers behavior, set:

swift sft ... --enable_npu_model_patch false

Model Saving, Merge LoRA, and Resume Training

Use --output_dir to set the output directory, --save_steps to control checkpoint save intervals, and --save_total_limit to control how many checkpoints to keep. After LoRA training, the checkpoint directory contains adapter weights, training arguments, and trainer state. A typical directory layout is:

output/Qwen3.5-4B-NPU/vx-xxx/
├── checkpoint-100/
├── checkpoint-200/
└── ...

If you only need inference or want to continue LoRA training, use the checkpoint directory directly. If you want an independent full-model directory for vLLM-Ascend deployment, offline distribution, or later quantization, run Merge LoRA:

ASCEND_RT_VISIBLE_DEVICES=0 \
swift export \
    --adapters output/Qwen3.5-4B-NPU/vx-xxx/checkpoint-xxx \
    --merge_lora true

The merged model is saved under checkpoint-xxx-merged by default. You can then load it like any regular model with --model checkpoint-xxx-merged.

If training is interrupted and you need to resume from a checkpoint, keep the original training arguments unchanged and add --resume_from_checkpoint:

NPROC_PER_NODE=4 \
ASCEND_RT_VISIBLE_DEVICES=0,1,2,3 \
swift sft \
    --model Qwen/Qwen3.5-4B \
    --dataset AI-ModelScope/alpaca-gpt4-data-zh#2000 \
    --tuner_type lora \
    --output_dir output/Qwen3.5-4B-NPU \
    --resume_from_checkpoint output/Qwen3.5-4B-NPU/vx-xxx/checkpoint-xxx \
    ...

--resume_from_checkpoint restores model weights, optimizer state, random seeds, and training progress. If you only want to load model weights without restoring the optimizer state or data skipping state, also set --resume_only_model true. For details, see resume_from_checkpoint, resume_only_model, save_steps, and save_total_limit in Command-line Parameters.

Inference

Original model:

ASCEND_RT_VISIBLE_DEVICES=0 swift infer \
    --model Qwen/Qwen2-7B-Instruct \
    --stream true --max_new_tokens 2048

After LoRA fine-tuning:

ASCEND_RT_VISIBLE_DEVICES=0 swift infer \
    --adapters xxx/checkpoint-xxx --load_data_args true \
    --stream true --max_new_tokens 2048

For full-parameter training or a merged LoRA model, point --model to the full-weight directory:

ASCEND_RT_VISIBLE_DEVICES=0 swift infer \
    --model xxx/checkpoint-xxx-merged \
    --stream true --max_new_tokens 2048

Deployment

Deployment with Native Transformers

Original model:

ASCEND_RT_VISIBLE_DEVICES=0 swift deploy --model Qwen/Qwen2-7B-Instruct --max_new_tokens 2048

After LoRA fine-tuning:

ASCEND_RT_VISIBLE_DEVICES=0 swift deploy --adapters xxx/checkpoint-xxx --max_new_tokens 2048

# Deploy the full weights after Merge LoRA
ASCEND_RT_VISIBLE_DEVICES=0 swift export --adapters xxx/checkpoint-xxx --merge_lora true
ASCEND_RT_VISIBLE_DEVICES=0 swift deploy --model xxx/checkpoint-xxx-merged --max_new_tokens 2048

Deployment with vLLM-Ascend

Install from PyPI:

# Refer to the official vLLM-Ascend compatibility matrix; the following versions are verified in this document.
pip install vllm==0.14.0
pip install vllm-ascend==0.14.0rc1

Original model:

ASCEND_RT_VISIBLE_DEVICES=0 swift deploy \
    --model Qwen/Qwen2.5-7B-Instruct \
    --infer_backend vllm \
    --max_new_tokens 2048

After LoRA fine-tuning:

ASCEND_RT_VISIBLE_DEVICES=0 swift deploy \
    --adapters xxx/checkpoint-xxx \
    --infer_backend vllm \
    --max_new_tokens 2048

# Deploy the full weights after Merge LoRA
ASCEND_RT_VISIBLE_DEVICES=0 swift export \
    --adapters xxx/checkpoint-xxx \
    --merge_lora true

ASCEND_RT_VISIBLE_DEVICES=0 swift deploy \
    --model xxx/checkpoint-xxx-merged \
    --infer_backend vllm \
    --max_new_tokens 2048

Evaluation

After training, inference, or deployment, you can evaluate the original model or fine-tuned checkpoint with SWIFT’s built-in EvalScope integration. For complete arguments and examples, see Evaluation.

Release

If you need to publish NPU-trained checkpoints, merged models, or quantized models to ModelScope/HuggingFace, use the push capability in swift export. For complete arguments and examples, see Export and Push.

FAQ

For general questions, see Frequently Asked Questions. This section records common NPU-specific issues and troubleshooting steps.

Q1: How do I confirm that the current environment detects NPUs correctly?

First confirm that you have run source /usr/local/Ascend/ascend-toolkit/set_env.sh, then run the environment check script in this document. Normally, is_torch_npu_available() should return True, torch.npu.device_count() should show the number of available NPUs, and you should be able to create a tensor on npu:0. If this fails, first check whether CANN, torch, and torch_npu match the recommended versions in this document.

Q2: How should I choose between FSDP, DeepSpeed, and Megatron-SWIFT?

For ordinary SFT, start with the FSDP1, FSDP2, or DeepSpeed combinations verified in this document. For larger models or higher parallelism requirements, use Megatron-SWIFT and install MindSpeed, Megatron-LM, and mcore-bridge as described in the installation section. DeepSpeed can reduce memory pressure but may reduce speed, so compare it with FSDP when tuning performance.

Q3: Do I need to manually disable the NPU model patch?

Usually no. ms-swift enables model-level patches by default in NPU environments to adapt some Transformers models to Ascend NPU operators and compatibility requirements. Set --enable_npu_model_patch false only temporarily, when you are troubleshooting abnormal loss or forward errors and suspect the NPU patch, so that you can compare against native Transformers behavior.

Q4: What should I know when using vLLM-Ascend for deployment or RL rollout?

Install the vllm and vllm-ascend versions recommended in this document, and prioritize model and algorithm combinations that have been verified in the compatibility tables. The sglang inference engine is not yet fully verified for NPU scenarios. For high-performance inference or RL rollout on NPUs, prefer vllm-ascend.

Q5: What happens if I forget to run source set_env.sh?

Common symptoms include is_torch_npu_available() returning False, torch.npu.device_count() returning 0, or runtime errors about missing CANN/HCCL shared libraries. After entering a new shell or container, run:

source /usr/local/Ascend/ascend-toolkit/set_env.sh

If NNAL/ATB or other components are installed, source their corresponding set_env.sh files according to your environment.
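
A quick way to tell whether the current shell has sourced the toolkit environment is to look for the variables it exports. The sketch below assumes that `set_env.sh` exports `ASCEND_TOOLKIT_HOME` and `ASCEND_HOME_PATH`; these names are an assumption about your CANN version, so adjust the tuple if yours differs. `missing_ascend_vars` is a hypothetical helper.

```python
import os

# Assumed names: variables that the CANN toolkit set_env.sh typically exports.
EXPECTED_VARS = ("ASCEND_TOOLKIT_HOME", "ASCEND_HOME_PATH")


def missing_ascend_vars(env=None, expected=EXPECTED_VARS):
    """Return the expected variables that are unset or empty in env."""
    env = os.environ if env is None else env
    return [name for name in expected if not env.get(name)]


# Usage: an empty list suggests set_env.sh was sourced in this shell.
#   print(missing_ascend_vars())
```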

Q6: How do I diagnose a torch and torch_npu version mismatch?

Install according to the recommended versions in this document. When versions mismatch, common symptoms include import torch_npu failures, invisible NPU devices, operator registration failures, and C++ symbol/runtime errors. Check versions with:

python -c "import torch, torch_npu; print(torch.__version__); print(torch_npu.__version__)"

If versions do not match, uninstall and reinstall the full CANN/PyTorch/torch_npu stack consistently. Do not upgrade only one package.
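
As a rough first check, the two version strings can be compared on their leading release numbers. The sketch below assumes that a `torch_npu` release shares its major.minor.patch number with the `torch` release it targets (for example `2.7.1` and `2.7.1.post1`); treat this as a heuristic and rely on the official compatibility matrix for the final answer. Both function names are hypothetical.

```python
import re


def release_tuple(version, parts=3):
    """Extract the leading numeric components, ignoring any '+local' suffix."""
    numbers = re.findall(r"\d+", version.split("+")[0])
    return tuple(int(n) for n in numbers[:parts])


def versions_look_consistent(torch_version, torch_npu_version):
    """Heuristic: both packages share the same major.minor.patch release."""
    return release_tuple(torch_version) == release_tuple(torch_npu_version)


# Usage (on a machine with both packages installed):
#   import torch, torch_npu
#   print(versions_look_consistent(torch.__version__, torch_npu.__version__))
```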

Q7: What happens if ASCEND_RT_VISIBLE_DEVICES and NPROC_PER_NODE do not match?

For distributed training, they should match. For example, ASCEND_RT_VISIBLE_DEVICES=0,1,2,3 usually corresponds to NPROC_PER_NODE=4. If the process count is larger than the number of visible devices, ranks may fail to bind devices, multiple processes may compete for the same device, initialization may hang, or HCCL may report errors. If the process count is smaller, only part of the visible NPUs will be used.
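
The rule above can be checked before launching a job. This is a minimal sketch with a hypothetical helper name; it only compares counts and does not verify that the listed device IDs exist.

```python
def check_device_process_match(visible_devices, nproc_per_node):
    """Compare the number of visible devices with the per-node process count."""
    devices = [d.strip() for d in visible_devices.split(",") if d.strip()]
    nproc = int(nproc_per_node)
    if nproc > len(devices):
        return "too_many_processes"  # ranks may fail to bind or share a device
    if nproc < len(devices):
        return "idle_devices"  # only part of the visible NPUs will be used
    return "ok"


# Usage:
#   import os
#   print(check_device_process_match(
#       os.environ.get("ASCEND_RT_VISIBLE_DEVICES", ""),
#       os.environ.get("NPROC_PER_NODE", "1"),
#   ))
```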

Q8: What should I check first when multi-card training hangs?

First confirm that every rank has started and that ASCEND_RT_VISIBLE_DEVICES matches NPROC_PER_NODE. Then identify whether logs stop during data preprocessing, model construction, weight loading, or HCCL initialization. For NPU/HCCL low-level logs, check:

ls ~/ascend/log/debug/plog

If the Python process is still running but produces no output for a long time, use pystack to inspect the stack of each rank and determine whether it is stuck in data loading, communication, or model forward/backward.
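
When the plog directory contains many files, it helps to look at the most recently modified ones first. The sketch below assumes the log files end in `.log`; that suffix and the `newest_logs` helper name are assumptions, not part of any Ascend tooling.

```python
from pathlib import Path


def newest_logs(log_dir, limit=5):
    """Return up to `limit` log files under log_dir, most recently modified first."""
    root = Path(log_dir).expanduser()
    if not root.is_dir():
        return []
    files = [p for p in root.rglob("*.log") if p.is_file()]
    files.sort(key=lambda p: p.stat().st_mtime, reverse=True)
    return files[:limit]


# Usage:
#   for path in newest_logs("~/ascend/log/debug/plog"):
#       print(path)
```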

Q9: How do I initially troubleshoot HCCL connection or timeout issues?

Use npu-smi info and npu-smi info -t topo to check device health and topology, then check whether other jobs are occupying the same NPU group. For single-node training, first verify card IDs, process count, and visible devices. For multi-node training, also verify networking, rank configuration, communication ports, and environment variables on all nodes. If old training processes remain on the same machine, clean up the corresponding training processes and retry.

Q10: Why is npu-smi unavailable inside the container?

Usually some device or driver files were not mounted. Check whether docker run includes /dev/davinci*, /dev/davinci_manager, /dev/devmm_svm, /dev/hisi_hdc, /usr/local/Ascend/driver, /usr/local/Ascend/firmware, /usr/local/sbin/npu-smi, and /etc/ascend_install.info. If npu-smi info fails on the host itself, fix the host driver environment first.
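
The mount list above can be verified from inside the container. This is a minimal sketch that only tests whether each path exists; it does not validate permissions or driver versions. `missing_mounts` is a hypothetical helper, and the glob pattern for the numbered `/dev/davinci*` devices is an assumption.

```python
import glob
import os

# Paths taken from the list above; /dev/davinci* is handled by the glob below.
REQUIRED_PATHS = (
    "/dev/davinci_manager",
    "/dev/devmm_svm",
    "/dev/hisi_hdc",
    "/usr/local/Ascend/driver",
    "/usr/local/Ascend/firmware",
    "/usr/local/sbin/npu-smi",
    "/etc/ascend_install.info",
)


def missing_mounts(required=REQUIRED_PATHS, device_glob="/dev/davinci[0-9]*"):
    """Return required paths that do not exist, plus the glob if no device matches."""
    missing = [p for p in required if not os.path.exists(p)]
    if not glob.glob(device_glob):
        missing.append(device_glob)
    return missing


# Usage inside the container: print(missing_mounts())
```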

Q11: How should I choose between native transformers deployment and vLLM-Ascend deployment?

Native transformers deployment has better compatibility and is suitable for validating whether the model, adapter, template, and output are correct. vLLM-Ascend is better for high-throughput services, RL rollout, or OpenAI-compatible serving. If you encounter vLLM-Ascend version or operator issues, first confirm that the model itself works with the transformers backend, then switch back to vLLM-Ascend so that any remaining failure can be attributed to the vLLM-Ascend stack rather than to the model.

Q12: What should I do if vLLM-Ascend reports device type mismatch or undefined symbol?

This is usually not caused by training script arguments. It often means the vllm-ascend wheel does not match the current hardware, PyTorch version, or C++ ABI. First check package build information and current versions:

python -c "import torch, vllm_ascend; print(torch.__version__); print(vllm_ascend.__file__)"

If the error message contains Current device type ... does not match the installed version's device type ..., undefined symbol, or similar text, reinstall torch, torch_npu, vllm, and vllm-ascend according to the device type (A2/A3/other) and the official compatibility matrix. Do not replace only one package.
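
To triage a long log quickly, you can scan it for the two messages quoted above. This is a minimal sketch with hypothetical names; the marker strings are copied from the messages in this answer, and a match only suggests a wheel mismatch, it does not prove one.

```python
# Substrings taken from the error messages described above.
MISMATCH_MARKERS = (
    "does not match the installed version's device type",
    "undefined symbol",
)


def looks_like_wheel_mismatch(log_text):
    """Return True if the log contains one of the known mismatch messages."""
    return any(marker in log_text for marker in MISMATCH_MARKERS)


# Usage:
#   with open("deploy.log", encoding="utf-8", errors="replace") as f:
#       print(looks_like_wheel_mismatch(f.read()))
```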

Q13: Can FP8 or quantized models be trained directly on NPUs?

Do not assume they can. Before downloading or loading a large model, check whether config.json contains quantization_config, and check the actual dtype of the tensors stored in the safetensors files. Quantization/QLoRA is still listed as not supported or not fully verified in the NPU support scope. If the model weights are FP8 block-quantized and your NPU software stack does not support that FP8 path, use BF16 weights first, or convert the model offline to BF16 before training or loading.
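
Both checks can be done without loading any weights. The sketch below reads `quantization_config` from `config.json` and reads only the header of a safetensors file, which stores an 8-byte little-endian header length followed by a JSON header listing each tensor's dtype. The function names are hypothetical, and the file layout of your model directory is an assumption.

```python
import json
import struct
from pathlib import Path


def quantization_config(model_dir):
    """Return the quantization_config entry from config.json, or None if absent."""
    text = (Path(model_dir) / "config.json").read_text(encoding="utf-8")
    return json.loads(text).get("quantization_config")


def safetensors_dtypes(path):
    """Read only the safetensors header and return the set of tensor dtypes."""
    with open(path, "rb") as f:
        (header_len,) = struct.unpack("<Q", f.read(8))
        header = json.loads(f.read(header_len))
    # "__metadata__" is a reserved key, not a tensor entry.
    return {meta["dtype"] for name, meta in header.items() if name != "__metadata__"}


# Usage (paths are placeholders):
#   print(quantization_config("<model_dir>"))
#   print(safetensors_dtypes("<model_dir>/<shard>.safetensors"))
```

A dtype such as `F8_E4M3` in the result indicates FP8 weights, while `BF16` indicates weights that can be loaded without a quantized path.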

Q14: How do I troubleshoot Megatron-SWIFT importing the wrong Megatron/MindSpeed?

Before running Megatron-SWIFT, PYTHONPATH and MEGATRON_LM_PATH must point to the same Megatron-LM source tree. Otherwise Python may start successfully while importing a different Megatron/MindSpeed combination, which can make later errors look like model or argument issues.

export PYTHONPATH=$PYTHONPATH:<your_local_megatron_lm_path>
export MEGATRON_LM_PATH=<your_local_megatron_lm_path>
python -c "import megatron, os; print(megatron.__file__); print(os.environ.get('MEGATRON_LM_PATH'))"

If they do not match, fix the environment variables before continuing with model construction, weight loading, or parallel configuration troubleshooting.
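
The comparison can also be automated. This is a minimal sketch with a hypothetical helper; it checks whether the imported module file lives under the configured source tree after resolving symlinks, and it treats a missing `__file__` (for example, a namespace package) as "not inside".

```python
import os


def is_inside(module_file, root):
    """True if module_file lives under root after resolving symlinks."""
    if not module_file or not root:
        return False
    module_path = os.path.realpath(module_file)
    root_path = os.path.realpath(root)
    return os.path.commonpath([module_path, root_path]) == root_path


# Usage (requires Megatron-LM to be importable):
#   import megatron
#   print(is_inside(megatron.__file__, os.environ.get("MEGATRON_LM_PATH")))
```

Comparing resolved paths with `os.path.commonpath` avoids false positives from simple string prefix checks, such as a sibling directory whose name starts with the same characters.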

NPU WeChat Group