NPU Support
ms-swift supports Ascend NPUs. You can fine-tune models and run inference on Ascend NPUs.
This document describes how to prepare the environment, train models, save and merge checkpoints, run inference, deploy services, and troubleshoot common issues on Ascend NPUs.
If this is your first time using ms-swift on NPUs, we recommend reading this document in the following order:
Check “Support Scope at a Glance” first to confirm whether your model, algorithm, and backend have been verified.
Use “Choose Your Usage Path” to decide whether you only need the base environment or also need MindSpeed/Megatron-SWIFT.
Choose “Local Environment Installation” or “Image/Container Environment Installation” according to your own environment management preference, then run “NPU Availability Check”.
Use “Quick Start” to run one end-to-end flow with a ModelScope model: LoRA training, merge, inference, and deployment.
For larger-scale training, continue reading the DDP, DeepSpeed, and MindSpeed/Megatron-SWIFT sections.
Hardware and Supported Operating Systems
Table 1 Product hardware support list
| Product | Supported |
|---|---|
| √ | |
| √ | |
| x | |
| √ | |
| x | |
| x | |
| x | |
| x |
[!NOTE]
In this section, “√” indicates supported and “x” indicates not supported.
For operating systems supported by each hardware product in physical-machine deployment scenarios, see the Compatibility Query Assistant.
For operating systems supported by each hardware product in VM and container deployment scenarios, see “Operating System Compatibility” in CANN Software Installation for the commercial edition or community edition.
Support Scope at a Glance
Recommended base environment versions:
| software | version |
|---|---|
| Python | >= 3.10, < 3.12 |
| CANN | >= 8.5.1 |
| torch | >= 2.7.1 |
| torch_npu | >= 2.7.1.post4 |
Note: The officially recommended version compatibility matrix for the vLLM Ascend series has been updated to CANN 9.0.0, torch 2.9.0, torch_npu 2.9.0, vllm-ascend 0.18.0 for A3, and vllm-ascend 0.19.1 for A5. For details, see the vLLM Ascend installation guide.
For base environment setup, see the Ascend PyTorch installation guide. The examples in this document were verified on 8 * Ascend 910B3 64G.
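The version table above can be turned into a quick pre-flight check. The following is a minimal sketch, not part of ms-swift: the helper names are hypothetical, the distribution names `torch` and `torch_npu` are assumed to match what pip reports, pre-release suffixes are ignored, and CANN is not covered because it is not a Python package.

```python
import re
import sys
from importlib import metadata

# Minimums copied from the table above; the distribution names are assumptions.
MINIMUMS = {"torch": "2.7.1", "torch_npu": "2.7.1.post4"}


def parse_version(text):
    """Map '2.7.1.post4' to (2, 7, 1, 4). Local suffixes such as '+cpu' are ignored."""
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?(?:\.post(\d+))?", text.split("+", 1)[0])
    if match is None:
        raise ValueError(f"unrecognized version string: {text!r}")
    return tuple(int(part or 0) for part in match.groups())


def python_in_range(version_info=None):
    """True when the interpreter satisfies >= 3.10 and < 3.12."""
    major_minor = tuple((version_info or sys.version_info)[:2])
    return (3, 10) <= major_minor < (3, 12)


def check_packages(minimums=MINIMUMS, lookup=metadata.version):
    """Return {name: (installed_or_None, ok)} for each package in `minimums`."""
    report = {}
    for name, minimum in minimums.items():
        try:
            installed = lookup(name)
        except metadata.PackageNotFoundError:
            report[name] = (None, False)
            continue
        report[name] = (installed, parse_version(installed) >= parse_version(minimum))
    return report


if __name__ == "__main__":
    print("python ok:", python_in_range())
    for name, (installed, ok) in check_packages().items():
        print(f"{name}: installed={installed} ok={ok}")
```

The `lookup` parameter exists only so the check can be exercised without the real packages installed.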
| Primary feature | Feature | Status |
|---|---|---|
| Training | CPT | Supported |
| Training | SFT | Supported |
| Training | DPO | Supported |
| Training | RM | Supported |
| Distributed | DDP | Supported |
| Distributed | FSDP | Supported |
| Distributed | FSDP2 | Supported |
| Distributed | DeepSpeed | Supported |
| Distributed | MindSpeed(Megatron) | Supported |
| PEFT | FULL | Supported |
| PEFT | LoRA | Supported |
| PEFT | QLoRA | Not supported |
| RLHF | GRPO | Supported |
| RLHF | PPO | Supported |
| Optimization | FA and other fused ops | Supported |
| Optimization | Liger-Kernel | Not supported |
| Deployment | PT | Supported |
| Deployment | vLLM | Supported |
| Deployment | SGLang | Not supported |
Verified SFT Combinations
| algorithm | model families | strategy | hardware |
|---|---|---|---|
| SFT | Qwen2.5-0.5B-Instruct | FSDP1/FSDP2/deepspeed | Atlas 900 A2 PODc |
| SFT | Qwen2.5-1.5B-Instruct | FSDP1/FSDP2/deepspeed | Atlas 900 A2 PODc |
| SFT | Qwen2.5-7B-Instruct | FSDP1/FSDP2/deepspeed | Atlas 900 A2 PODc |
| SFT | Qwen2.5-VL-3B-Instruct | FSDP1/FSDP2/deepspeed | Atlas 900 A2 PODc |
| SFT | Qwen2.5-VL-7B-Instruct | FSDP1/FSDP2/deepspeed | Atlas 900 A2 PODc |
| SFT | Qwen2.5-Omni-3B | FSDP1/FSDP2/deepspeed | Atlas 900 A2 PODc |
| SFT | Qwen3-8B | FSDP1/FSDP2/deepspeed | Atlas 900 A2 PODc |
| SFT | Qwen3-30B-A3B | FSDP1/FSDP2/deepspeed | Atlas 900 A2 PODc |
| SFT | Qwen3-32B | FSDP1/FSDP2/deepspeed | Atlas 900 A2 PODc |
| SFT | Qwen3-VL-30B-A3B-Instruct | FSDP1/FSDP2/deepspeed | Atlas 900 A2 PODc |
| SFT | Qwen3-Omni-30B-A3B-Instruct | FSDP1/FSDP2/deepspeed/Megatron | Atlas 900 A2 PODc/A3 SuperPoD |
| SFT | InternVL3-8B | FSDP1/FSDP2/deepspeed | Atlas 900 A2 PODc |
| SFT | Ovis2.5-2B | FSDP1/FSDP2/deepspeed | Atlas 900 A2 PODc |
| SFT | Qwen3.5-27B | FSDP1/FSDP2/deepspeed/Megatron | Atlas 900 A2 PODc/A3 SuperPoD |
| SFT | Qwen3.5-35B-A3B | FSDP1/FSDP2/deepspeed/Megatron | Atlas 900 A2 PODc/A3 SuperPoD |
Verified RL Combinations
| algorithm | model families | strategy | rollout engine | hardware |
|---|---|---|---|---|
| GRPO | Qwen2.5-7B-Instruct | deepspeed | vllm-ascend | Atlas 900 A2 PODc |
| GRPO | Qwen3-8B | deepspeed | vllm-ascend | Atlas 900 A2 PODc |
| DPO | Qwen2.5-7B-Instruct | deepspeed | vllm-ascend | Atlas 900 A2 PODc |
| DPO | Qwen3-8B | deepspeed | vllm-ascend | Atlas 900 A2 PODc |
| PPO | Qwen2.5-7B-Instruct | deepspeed | vllm-ascend | Atlas 900 A2 PODc |
| PPO | Qwen3-8B | deepspeed | vllm-ascend | Atlas 900 A2 PODc |
Not Yet Supported or Not Fully Verified
| item |
|---|
| Liger-kernel |
| Quantization/QLoRA related features |
| Using SGLang as the inference engine |
| Enabling ETP for LoRA training in Megatron |
PEFT Transformers 5 MoE Fused Expert LoRA Limitation
When training Transformers 5 MoE models such as Qwen3.5-MoE and Qwen3-Omni-MoE with LoRA, some expert weights may be stored as fused nn.Parameter tensors instead of ordinary nn.Linear modules. Injecting LoRA into these parameters relies on PEFT’s target_parameters path.
This path is not fully stable yet with combinations such as lora_dropout, ZeRO-3/FSDP, and multiple adapters. Typical trigger conditions include:
using a MoE model;
using LoRA and trying to cover fused expert parameters;
triggering PEFT’s Transformers 5 MoE target conversion path through the model config or the command-line --model_type;
using the default lora_dropout != 0, or using parameter-sharding backends such as ZeRO-3/FSDP.
For regular Qwen3.5 GRPO/SFT LoRA training, avoid explicitly passing --model_type when it is not needed. If the model config itself already triggers this PEFT path, prefer full-parameter training or disable the affected LoRA combination. If fused expert LoRA is required, wait for the PEFT upstream support to stabilize, or use it only after validating the target model and training backend with lora_dropout=0.
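The trigger conditions above can be written down as a small pre-flight check before launching a job. This is only a sketch of the decision logic described in this section: the function and its parameter names are hypothetical and are not ms-swift or PEFT arguments, and the backend labels are illustrative strings.

```python
# Hypothetical helper: flags the risky fused-expert LoRA combinations listed above.
SHARDING_BACKENDS = {"zero3", "fsdp", "fsdp2"}  # illustrative labels, not CLI values


def fused_expert_lora_risks(is_moe, tuner_type, lora_dropout, backend,
                            explicit_model_type=False):
    """Return a list of human-readable risk notes; an empty list means none found."""
    if not (is_moe and tuner_type == "lora"):
        return []  # the limitation only concerns LoRA on MoE models
    risks = []
    if lora_dropout != 0:
        risks.append("lora_dropout != 0 with fused expert parameters")
    if backend in SHARDING_BACKENDS:
        risks.append(f"parameter-sharding backend: {backend}")
    if explicit_model_type:
        risks.append("--model_type passed explicitly")
    return risks


if __name__ == "__main__":
    for note in fused_expert_lora_risks(True, "lora", 0.05, "zero3"):
        print("risk:", note)
```

An empty result does not prove the combination is safe; it only means none of the conditions named in this section were matched.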
Choose Your Usage Path
| Scenario | Recommended path | Need MindSpeed |
|---|---|---|
| Ordinary SFT/LoRA/inference | Local environment installation or image/container installation | No |
| Megatron-SWIFT large-model training | Install the base environment, then install MindSpeed/Megatron/mcore-bridge | Yes |
| GRPO/PPO/DPO and other RLHF workflows | Base training environment + vLLM-Ascend rollout/deploy | Usually no |
| Only verifying whether NPUs are usable | Run the NPU availability check script | No |
Environment Preparation
Image/Container Environment Installation
The official NPU image is available at quay.io/ascend/ms-swift. We recommend choosing an image tag that matches your device generation, Python version, CANN version, and OS version first. If you need to pin a branch or customize dependencies, build the image from the Dockerfile provided by the project. The container approach makes dependency versions easier to freeze and helps reproduce the same environment across multiple Ascend machines.
The following example uses the A2, Python 3.11, CANN 9.0.0, Ubuntu 22.04 tag. In actual use, choose the latest tag from the Quay tag page that matches your machine and software stack.
docker pull quay.io/ascend/ms-swift:v4.3.0-A2-py311-CANN9.0.0-ubuntu22.04
export IMAGE_NAME=quay.io/ascend/ms-swift:v4.3.0-A2-py311-CANN9.0.0-ubuntu22.04
export WORKSPACE=/path/to/workspace
If you need to build the image yourself, clone the modelscope repository first, then use Dockerfile.ascend and build_image.py:
git clone https://github.com/modelscope/modelscope.git
cd modelscope
DOCKER_REGISTRY=ms-swift python docker/build_image.py \
--image_type ascend \
--python_version 3.11.11 \
--soc_version ascend910b1 \
--arch arm
The current build_image.py generates Ascend image names in the format {DOCKER_REGISTRY}:{swift_branch}-{atlas_hardware}-{python_tag}-{arch}. The command above uses the ARM-based Atlas 900 A2 PODc as an example and usually generates ms-swift:main-A2-py311-arm. If you use a self-built image, replace IMAGE_NAME above with the actual image name from your build log.
export IMAGE_NAME=ms-swift:main-A2-py311-arm
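To predict the image name before the build finishes, the naming format above can be reproduced as follows. This is a sketch under stated assumptions: the function is hypothetical, the `py311` tag is inferred from the example name rather than read from build_image.py, and the mapping from `--soc_version` to the hardware label (for example `A2`) is not reproduced, so the label is passed in directly.

```python
def ascend_image_name(registry, swift_branch, atlas_hardware, python_version, arch):
    """Compose '{registry}:{branch}-{hardware}-{python_tag}-{arch}'."""
    # Assumption: the Python tag keeps only major and minor, e.g. 3.11.11 -> py311.
    major, minor = python_version.split(".")[:2]
    python_tag = f"py{major}{minor}"
    return f"{registry}:{swift_branch}-{atlas_hardware}-{python_tag}-{arch}"


if __name__ == "__main__":
    print(ascend_image_name("ms-swift", "main", "A2", "3.11.11", "arm"))
```

The build log remains the authoritative source for the final name.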
Before starting the container, check which NPU devices are exposed on the host:
ls /dev/davinci*
When starting the container, mount the NPU devices, driver, firmware, npu-smi, and required log directories. The following example uses the common 8-card device range davinci0 to davinci7. Some machines may also expose devices up to davinci15; in that case, add the corresponding devices to docker run according to the output of ls /dev/davinci*.
docker run -it \
--name swift-ascend \
--network=host --ipc=host --shm-size=128g \
--device=/dev/davinci0 --device=/dev/davinci1 \
--device=/dev/davinci2 --device=/dev/davinci3 \
--device=/dev/davinci4 --device=/dev/davinci5 \
--device=/dev/davinci6 --device=/dev/davinci7 \
--device=/dev/davinci_manager --device=/dev/devmm_svm --device=/dev/hisi_hdc \
-v /usr/local/Ascend/driver:/usr/local/Ascend/driver:ro \
-v /usr/local/Ascend/firmware:/usr/local/Ascend/firmware:ro \
-v /usr/local/sbin/npu-smi:/usr/local/sbin/npu-smi:ro \
-v /etc/ascend_install.info:/etc/ascend_install.info:ro \
-v /var/log/npu:/var/log/npu \
-v ${WORKSPACE}:/workspace \
${IMAGE_NAME} \
/bin/bash
After entering the container, run source /usr/local/Ascend/ascend-toolkit/set_env.sh first, then run the NPU availability check below to confirm that the container can access the Ascend devices. If the container cannot detect NPUs, check /dev/davinci*, /dev/davinci_manager, the driver directory, and npu-smi mounts first.
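On machines that expose more than eight devices, the `--device` flags can be generated from the `/dev/davinci*` listing instead of being typed by hand. The helper below is a hypothetical sketch; the three control devices are the ones used in the docker run example above.

```python
import glob
import re

# Control devices taken from the docker run example in this section.
CONTROL_DEVICES = ("/dev/davinci_manager", "/dev/devmm_svm", "/dev/hisi_hdc")


def docker_device_flags(paths):
    """Build --device flags for numbered davinci devices plus the control devices."""
    numbered = [p for p in paths if re.fullmatch(r"/dev/davinci\d+", p)]
    # Sort numerically so that davinci10 comes after davinci2.
    numbered.sort(key=lambda p: int(p[len("/dev/davinci"):]))
    return [f"--device={p}" for p in numbered + list(CONTROL_DEVICES)]


if __name__ == "__main__":
    # Run on the host; prints one flag per line, ready to paste into docker run.
    print(" \\\n".join(docker_device_flags(glob.glob("/dev/davinci*"))))
```

The glob also matches `/dev/davinci_manager`, so the numbered devices are filtered first and the control devices are appended once at the end.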
Local Environment Installation
# Create a new conda virtual environment (optional)
conda create -n swift-npu python=3.11 -y
conda activate swift-npu
# Source the CANN environment before the following steps
source /usr/local/Ascend/ascend-toolkit/set_env.sh
# Set the global pip mirror (optional, speeds up downloads)
pip config set global.index-url https://mirrors.aliyun.com/pypi/simple/
# Option 1: install the latest release from PyPI
pip install ms-swift -U
# Option 2 (alternative to Option 1): install from source
git clone https://github.com/modelscope/ms-swift.git
cd ms-swift
pip install -e .
# Install torch_npu
pip install torch_npu==2.9.0 decorator
# If you want to use deepspeed (to reduce memory usage, with some speed overhead)
pip install deepspeed
# If you need evaluation features, install the following package
pip install 'evalscope[opencompass]'
# If you need vllm-ascend for inference, install the following packages (for more versions, see the [vLLM-Ascend official website](https://docs.vllm.ai/projects/ascend/en/latest/installation.html))
pip install vllm==0.18.0
pip install vllm-ascend==0.18.0
NPU Availability Check
Check whether the environment is installed correctly and whether NPUs can be loaded:
from transformers.utils import is_torch_npu_available
import torch
print(is_torch_npu_available()) # True
print(torch.npu.device_count()) # 8
print(torch.randn(10, device='npu:0'))
Optional MindSpeed/Megatron-SWIFT Installation
If you need MindSpeed(Megatron-LM), install the required dependencies as follows.
# 1. Clone Megatron-LM and switch to v0.16.0
git clone https://github.com/NVIDIA/Megatron-LM.git
cd Megatron-LM
git checkout core_v0.16.0
cd ..
# 2. Clone and install MindSpeed
git clone https://gitcode.com/Ascend/MindSpeed.git
cd MindSpeed
git checkout core_r0.16.0
pip install -e .
cd ..
# 3. Clone and install mcore-bridge
git clone https://github.com/modelscope/mcore-bridge.git
cd mcore-bridge
pip install -e .
cd ..
# 4. Download and install triton-ascend
pip install triton-ascend==3.2.1 --extra-index-url=https://triton-ascend.osinfra.cn/pypi/simple
# 5. Set environment variables
export PYTHONPATH=$PYTHONPATH:<your_local_megatron_lm_path>
export MEGATRON_LM_PATH=<your_local_megatron_lm_path>
# 6. Disable Megatron GDN if you need to fall back to the transformers GatedDeltaNet implementation
export USE_MCORE_GDN=0
Run the following command to verify that MindSpeed(Megatron-LM) is configured correctly:
python -c "import mindspeed.megatron_adaptor; from swift.megatron.init import init_megatron_env; init_megatron_env(); print('✓ Megatron-SWIFT configuration verified successfully in the NPU environment!')"
Qwen3.5 FLA Patch Notes
The current repository already includes a built-in Qwen3.5 linear attention patch for Ascend NPUs, so users do not need to manually modify the transformers or fla source code. This patch does not replace the entire flash-linear-attention package directly. Instead, it redirects the low-level GPU Triton operator path used by Qwen3.5 through chunk_gated_delta_rule to the MindSpeed NPU implementation.
When the patch takes effect, ms-swift performs the following replacements:
Set transformers.utils.is_flash_linear_attention_available and transformers.utils.import_utils.is_flash_linear_attention_available to return True, so that transformers.models.qwen3_5.modeling_qwen3_5 can complete initialization through the FLA fast path.
Redirect transformers.models.qwen3_5.modeling_qwen3_5.chunk_gated_delta_rule and transformers.models.qwen3_5_moe.modeling_qwen3_5_moe.chunk_gated_delta_rule to the built-in ms-swift implementation swift.model.chunk_gated_delta_rule.chunk_gated_delta_rule.
Inside swift.model.chunk_gated_delta_rule, continue calling the native Triton operators provided by MindSpeed, including:
mindspeed.lite.ops.triton.chunk_delta_h
mindspeed.lite.ops.triton.chunk_o
mindspeed.lite.ops.triton.chunk_scaled_dot_kkt
mindspeed.lite.ops.triton.wy_fast
Keep the native torch l2norm helper, reducing per-layer per-step launch overhead as well as compile/autotune overhead during cold start, which improves model performance on NPU.
For FusedRMSNormGated, which depends on torch.cuda.current_device() during FLA initialization, NPU keeps the native Qwen3.5 torch path to avoid compatibility issues caused by CUDA-only initialization logic.
The call chain can be understood as:
Qwen3.5 modeling.chunk_gated_delta_rule
-> swift.model.chunk_gated_delta_rule.chunk_gated_delta_rule
-> MindSpeed Triton kernels
Therefore:
This patch mainly covers the gated-delta-rule path of Qwen3.5 linear attention.
It is not equivalent to “fully replacing the entire fla package with MindSpeed”.
To make this path effective, ensure that MindSpeed can be imported correctly in the current environment.
Verified versions for accuracy alignment: torch 2.9.0 + MindSpeed 0.16.0 + flash-linear-attention 0.4.2 + triton-ascend 3.2.1 + transformers 5.2.0
When running Qwen3.5 with Megatron-SWIFT on NPU, note the following version and feature constraints:
The MindSpeed training combination currently pinned by the NPU documentation is Megatron-LM v0.16.0 + MindSpeed core_r0.16.0. With this combination, megatron-core already ships the native GDN kernel core.ssm.gated_delta_net, and mcore-bridge defaults to the Megatron-Core/MindSpeed GDN path with USE_MCORE_GDN=1. Set USE_MCORE_GDN=0 only when you intentionally need to fall back to the transformers-native GDN implementation wrapped by mcore-bridge. Combined with ms-swift’s built-in Qwen3.5 FLA NPU patch, chunk_gated_delta_rule is then redirected to MindSpeed’s Triton kernels. The known costs of this fallback path are:
The transformers GDN implementation does not support packing, nor TP/CP for the GDN layer.
When using the native 0.16 GDN path with USE_MCORE_GDN=1, do not apply the fallback-only limitations above to that path. The native path’s packing, TP/CP, mask routing, and parallel combinations should still be verified against the current MindSpeed/Megatron-LM, mcore-bridge, and target script combination.
Checking the Environment
Check the P2P topology of the NPUs. In the output below, each NPU connects to each of the other seven NPUs over HCCS.
(valle) root@valle:~/src# npu-smi info -t topo
NPU0 NPU1 NPU2 NPU3 NPU4 NPU5 NPU6 NPU7 CPU Affinity
NPU0 X HCCS HCCS HCCS HCCS HCCS HCCS HCCS 144-167
NPU1 HCCS X HCCS HCCS HCCS HCCS HCCS HCCS 144-167
NPU2 HCCS HCCS X HCCS HCCS HCCS HCCS HCCS 96-119
NPU3 HCCS HCCS HCCS X HCCS HCCS HCCS HCCS 96-119
NPU4 HCCS HCCS HCCS HCCS X HCCS HCCS HCCS 0-23
NPU5 HCCS HCCS HCCS HCCS HCCS X HCCS HCCS 0-23
NPU6 HCCS HCCS HCCS HCCS HCCS HCCS X HCCS 48-71
NPU7 HCCS HCCS HCCS HCCS HCCS HCCS HCCS X 48-71
Legend:
X = Self
SYS = Path traversing PCIe and NUMA nodes. Nodes are connected through SMP, such as QPI, UPI.
PHB = Path traversing PCIe and the PCIe host bridge of a CPU.
PIX = Path traversing a single PCIe switch
PXB = Path traversing multiple PCIe switches
HCCS = Connection traversing HCCS.
NA = Unknown relationship.
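To check the topology on many machines, the matrix can be parsed rather than read by eye. The following sketch counts HCCS peers per NPU from the text printed by npu-smi info -t topo. The function is hypothetical, the sample holds only the header and the first two rows of the output above, and the parser assumes the whitespace-separated layout shown there.

```python
import re

# Header and first two rows copied from the sample output above.
SAMPLE = """\
NPU0 NPU1 NPU2 NPU3 NPU4 NPU5 NPU6 NPU7 CPU Affinity
NPU0 X HCCS HCCS HCCS HCCS HCCS HCCS HCCS 144-167
NPU1 HCCS X HCCS HCCS HCCS HCCS HCCS HCCS 144-167
"""


def hccs_links(topo_text):
    """Return {npu_name: number of HCCS peers} parsed from the topology matrix."""
    counts = {}
    for line in topo_text.splitlines():
        tokens = line.split()
        if len(tokens) < 2 or not re.fullmatch(r"NPU\d+", tokens[0]):
            continue  # legend lines and blank lines
        if re.fullmatch(r"NPU\d+", tokens[1]):
            continue  # header row listing the column names
        counts[tokens[0]] = tokens.count("HCCS")
    return counts


if __name__ == "__main__":
    for name, links in hccs_links(SAMPLE).items():
        print(name, links)
```

On a fully connected 8-card node, every NPU should report 7; a lower count points at a link that is not using HCCS.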
Check NPU status. For details about the npu-smi command, see the official documentation.
(valle) root@valle:~/src# npu-smi info
+------------------------------------------------------------------------------------------------+
| npu-smi 24.1.rc1.b030 Version: 24.1.rc1.b030 |
+---------------------------+---------------+----------------------------------------------------+
| NPU Name | Health | Power(W) Temp(C) Hugepages-Usage(page)|
| Chip | Bus-Id | AICore(%) Memory-Usage(MB) HBM-Usage(MB) |
+===========================+===============+====================================================+
| 0 910B3 | OK | 101.8 43 0 / 0 |
| 0 | 0000:C1:00.0 | 0 0 / 0 3318 / 65536 |
+===========================+===============+====================================================+
| 1 910B3 | OK | 92.0 39 0 / 0 |
| 0 | 0000:C2:00.0 | 0 0 / 0 3314 / 65536 |
+===========================+===============+====================================================+
| 2 910B3 | OK | 102.0 40 0 / 0 |
| 0 | 0000:81:00.0 | 0 0 / 0 3314 / 65536 |
+===========================+===============+====================================================+
| 3 910B3 | OK | 99.8 40 0 / 0 |
| 0 | 0000:82:00.0 | 0 0 / 0 3314 / 65536 |
+===========================+===============+====================================================+
| 4 910B3 | OK | 98.6 45 0 / 0 |
| 0 | 0000:01:00.0 | 0 0 / 0 3314 / 65536 |
+===========================+===============+====================================================+
| 5 910B3 | OK | 99.7 44 0 / 0 |
| 0 | 0000:02:00.0 | 0 0 / 0 3314 / 65536 |
+===========================+===============+====================================================+
| 6 910B3 | OK | 103.8 45 0 / 0 |
| 0 | 0000:41:00.0 | 0 0 / 0 3314 / 65536 |
+===========================+===============+====================================================+
| 7 910B3 | OK | 98.2 44 0 / 0 |
| 0 | 0000:42:00.0 | 0 0 / 0 3315 / 65536 |
+===========================+===============+====================================================+
Quick Start: ModelScope Model + Dataset
If you want to quickly verify the environment with a ModelScope model and dataset, you can use this section to complete the full flow: train LoRA, find the latest checkpoint, Merge LoRA, run CLI inference, start a service, and validate it with curl. The example uses a small model and a small data sample so that it can run quickly. To use your own model or dataset, modify the ID variables at the beginning.
export MODEL_ID=Qwen/Qwen3-0.6B
export DATASET_ID=AI-ModelScope/alpaca-gpt4-data-zh
export WORK_DIR=output/npu-modelscope-qwen3-0_6b-lora
Train and save a LoRA checkpoint:
ASCEND_RT_VISIBLE_DEVICES=0 \
swift sft \
--model $MODEL_ID \
--dataset $DATASET_ID#1000 \
--split_dataset_ratio 0.01 \
--torch_dtype bfloat16 \
--tuner_type lora \
--target_modules all-linear \
--lora_rank 8 \
--lora_alpha 32 \
--num_train_epochs 1 \
--per_device_train_batch_size 1 \
--per_device_eval_batch_size 1 \
--gradient_accumulation_steps 8 \
--learning_rate 1e-4 \
--max_length 2048 \
--save_steps 100 \
--eval_steps 100 \
--save_total_limit 2 \
--logging_steps 10 \
--output_dir $WORK_DIR
After training finishes, checkpoints are saved under $WORK_DIR/*/checkpoint-*. Use the following commands to select the latest checkpoint and merge LoRA into a full model:
export CKPT_DIR=$(ls -dt $WORK_DIR/*/checkpoint-* | head -n 1)
ASCEND_RT_VISIBLE_DEVICES=0 \
swift export \
--adapters $CKPT_DIR \
--merge_lora true
export MERGED_DIR=${CKPT_DIR}-merged
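If you drive the flow from Python, checkpoint selection can be done without shell globbing. Note that this sketch uses a different rule from the `ls -dt` command above: it picks the largest step number instead of the newest modification time, and it skips directories such as checkpoint-100-merged. The function name is hypothetical.

```python
import re
from pathlib import Path


def latest_checkpoint(work_dir):
    """Return the checkpoint-<step> directory with the largest step, or None."""
    best = None
    for path in Path(work_dir).glob("*/checkpoint-*"):
        match = re.fullmatch(r"checkpoint-(\d+)", path.name)
        if match is None or not path.is_dir():
            continue  # skips e.g. checkpoint-100-merged
        step = int(match.group(1))
        if best is None or step > best[0]:
            best = (step, path)
    return None if best is None else best[1]


if __name__ == "__main__":
    print(latest_checkpoint("output/npu-modelscope-qwen3-0_6b-lora"))
```

When the work directory contains several runs, the largest step may belong to an older run, so the modification-time rule used by the shell command can give a different answer.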
You can verify inference either with the LoRA checkpoint directly or with the merged full model:
# Load the LoRA checkpoint directly
ASCEND_RT_VISIBLE_DEVICES=0 \
swift infer \
--adapters $CKPT_DIR \
--stream true \
--temperature 0 \
--max_new_tokens 512
# Load the merged full model
ASCEND_RT_VISIBLE_DEVICES=0 \
swift infer \
--model $MERGED_DIR \
--stream true \
--temperature 0 \
--max_new_tokens 512
To start an OpenAI-compatible deployment service, use the merged full model:
ASCEND_RT_VISIBLE_DEVICES=0 \
swift deploy \
--model $MERGED_DIR \
--host 0.0.0.0 \
--port 8000 \
--max_new_tokens 512 \
--served_model_name npu-modelscope-qwen3-0_6b
After the service starts, validate the API with curl:
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "npu-modelscope-qwen3-0_6b",
"messages": [{"role": "user", "content": "Briefly introduce Ascend NPU in one sentence."}],
"max_tokens": 128,
"temperature": 0
}'
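The same request can be sent from Python with only the standard library. This is a minimal sketch that mirrors the curl call above; the function names are hypothetical, and the response parsing assumes the usual OpenAI-style chat completion layout with a choices list.

```python
import json
import urllib.request


def build_chat_request(prompt, model="npu-modelscope-qwen3-0_6b",
                       base_url="http://localhost:8000/v1", max_tokens=128):
    """Build the POST request for the /chat/completions endpoint."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "temperature": 0,
    }
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt, timeout=60, **kwargs):
    """Send the request and return the assistant text (needs the running service)."""
    request = build_chat_request(prompt, **kwargs)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    # Assumed OpenAI-style response layout.
    return body["choices"][0]["message"]["content"]
```

Call `chat("Briefly introduce Ascend NPU in one sentence.")` once the service from the previous step is listening on port 8000.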
Training
The following examples introduce LoRA fine-tuning. For full-parameter fine-tuning, set --tuner_type full. For more training scripts, see examples/ascend/train. For general pre-training, SFT, LoRA, full-parameter training, and custom dataset usage, see Pre-training and Fine-tuning.
| Model size | NPU count | DeepSpeed type | Max memory usage |
|---|---|---|---|
| 7B | 1 | None | 1 * 28 GB |
| 7B | 4 | None | 4 * 22 GB |
| 7B | 4 | zero2 | 4 * 28 GB |
| 7B | 4 | zero3 | 4 * 22 GB |
| 7B | 8 | None | 8 * 22 GB |
| 14B | 1 | None | 1 * 45 GB |
| 14B | 8 | None | 8 * 51 GB |
| 14B | 8 | zero2 | 8 * 49 GB |
| 14B | 8 | zero3 | 8 * 31 GB |
Single-Card Training
Start single-card fine-tuning with the following command:
# Experiment environment: Ascend 910B3
# Memory requirement: 28 GB
# Runtime: 8 hours
ASCEND_RT_VISIBLE_DEVICES=0 \
swift sft \
--model Qwen/Qwen2-7B-Instruct \
--dataset AI-ModelScope/blossom-math-v2 \
--split_dataset_ratio 0.01 \
--num_train_epochs 5 \
--tuner_type lora \
--output_dir output \
--learning_rate 1e-4 \
--gradient_accumulation_steps 16 \
--save_steps 100 \
--eval_steps 100
Data Parallel Training
The following example uses 4 NPUs for DDP training.
# Experiment environment: 4 * Ascend 910B3
# Memory requirement: 4 * 22 GB
# Runtime: 2 hours
NPROC_PER_NODE=4 \
ASCEND_RT_VISIBLE_DEVICES=0,1,2,3 \
swift sft \
--model Qwen/Qwen2-7B-Instruct \
--dataset AI-ModelScope/blossom-math-v2 \
--split_dataset_ratio 0.01 \
--num_train_epochs 5 \
--tuner_type lora \
--output_dir output \
...
DeepSpeed Training
ZeRO2:
# Experiment environment: 4 * Ascend 910B3
# Memory requirement: 4 * 28 GB
# Runtime: 3.5 hours
NPROC_PER_NODE=4 \
ASCEND_RT_VISIBLE_DEVICES=0,1,2,3 \
swift sft \
--model Qwen/Qwen2-7B-Instruct \
--dataset AI-ModelScope/blossom-math-v2 \
--split_dataset_ratio 0.01 \
--num_train_epochs 5 \
--tuner_type lora \
--output_dir output \
--deepspeed zero2 \
...
ZeRO3:
# Experiment environment: 4 * Ascend 910B3
# Memory requirement: 4 * 22 GB
# Runtime: 8.5 hours
NPROC_PER_NODE=4 \
ASCEND_RT_VISIBLE_DEVICES=0,1,2,3 \
swift sft \
--model Qwen/Qwen2-7B-Instruct \
--dataset AI-ModelScope/blossom-math-v2 \
--split_dataset_ratio 0.01 \
--num_train_epochs 5 \
--tuner_type lora \
--output_dir output \
--deepspeed zero3 \
...
Qwen3.5 Single-Node Multi-Card LoRA Example
The following is an NPU LoRA example for a newer model. It uses Qwen3.5-4B for demonstration. Four-card data parallelism is usually faster than single-card training. If you already have local model and dataset files, replace --model and --dataset with local paths.
# Experiment environment: 4 * Ascend 910B3
NPROC_PER_NODE=4 \
ASCEND_RT_VISIBLE_DEVICES=0,1,2,3 \
swift sft \
--model Qwen/Qwen3.5-4B \
--dataset AI-ModelScope/alpaca-gpt4-data-zh#2000 \
--split_dataset_ratio 0.01 \
--torch_dtype bfloat16 \
--num_train_epochs 1 \
--tuner_type lora \
--target_modules all-linear \
--lora_rank 8 \
--lora_alpha 32 \
--per_device_train_batch_size 1 \
--per_device_eval_batch_size 1 \
--gradient_accumulation_steps 8 \
--learning_rate 1e-4 \
--max_length 2048 \
--group_by_length true \
--dataset_num_proc 4 \
--dataloader_num_workers 4 \
--save_steps 100 \
--eval_steps 100 \
--save_total_limit 2 \
--output_dir output/Qwen3.5-4B-NPU
When tuning parameters, focus on memory, throughput, and stability:
Reduce memory usage: first reduce --max_length, --per_device_train_batch_size, and --lora_rank; if OOM still occurs, enable --deepspeed zero2/zero3. ZeRO can significantly reduce memory pressure but introduces communication and scheduling overhead.
Improve throughput: increase --per_device_train_batch_size when memory allows, and use --gradient_accumulation_steps to keep the global batch size. Increase --dataset_num_proc if preprocessing is slow, and increase --dataloader_num_workers if data loading is the bottleneck.
Control save overhead: do not set --save_steps too small, because frequent checkpoint saving slows down training. --save_total_limit 2 is usually enough to keep the best checkpoint and the last checkpoint.
Improve stability: on NPUs, prefer bfloat16. If you see abnormal loss or NaN, first lower the learning rate and batch size; if necessary, temporarily switch to float32 for comparison.
For more parameter details, see Command-line Parameters.
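Keeping the global batch size constant while changing the per-device batch size is simple arithmetic. The sketch below assumes pure data parallelism, where every NPU processes its own micro-batch; both function names are hypothetical.

```python
def global_batch_size(per_device_batch, grad_accum_steps, num_devices):
    """Samples consumed per optimizer step under plain data parallelism."""
    return per_device_batch * grad_accum_steps * num_devices


def accumulation_steps_for(target_global_batch, per_device_batch, num_devices):
    """Gradient accumulation steps that keep the global batch at the target."""
    per_step = per_device_batch * num_devices
    if target_global_batch % per_step:
        raise ValueError("target global batch is not divisible by batch * devices")
    return target_global_batch // per_step


if __name__ == "__main__":
    # Qwen3.5-4B example above: batch 1, accumulation 8, 4 NPUs.
    current = global_batch_size(1, 8, 4)
    print(current, accumulation_steps_for(current, 2, 4))
```

For the 4-card example in this section, the global batch is 32; raising --per_device_train_batch_size to 2 means lowering --gradient_accumulation_steps to 4 to keep it.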
NPU Model Patch Switch
ms-swift enables model-level patches by default in NPU environments to adapt some Transformers models to Ascend NPU operators and compatibility requirements. You usually do not need to disable them. If you suspect abnormal loss or forward errors are related to the NPU model patch and want to compare against native Transformers behavior, set:
swift sft ... --enable_npu_model_patch false
Model Saving, Merge LoRA, and Resume Training
Use --output_dir to set the output directory, --save_steps to control checkpoint save intervals, and --save_total_limit to control how many checkpoints to keep. After LoRA training, the checkpoint directory contains adapter weights, training arguments, and trainer state. A typical directory layout is:
output/Qwen3.5-4B-NPU/vx-xxx/
├── checkpoint-100/
├── checkpoint-200/
└── ...
If you only need inference or want to continue LoRA training, use the checkpoint directory directly. If you want an independent full-model directory for vLLM-Ascend deployment, offline distribution, or later quantization, run Merge LoRA:
ASCEND_RT_VISIBLE_DEVICES=0 \
swift export \
--adapters output/Qwen3.5-4B-NPU/vx-xxx/checkpoint-xxx \
--merge_lora true
The merged model is saved under checkpoint-xxx-merged by default. You can then load it like any regular model with --model checkpoint-xxx-merged.
If training is interrupted and you need to resume from a checkpoint, keep the original training arguments unchanged and add --resume_from_checkpoint:
NPROC_PER_NODE=4 \
ASCEND_RT_VISIBLE_DEVICES=0,1,2,3 \
swift sft \
--model Qwen/Qwen3.5-4B \
--dataset AI-ModelScope/alpaca-gpt4-data-zh#2000 \
--tuner_type lora \
--output_dir output/Qwen3.5-4B-NPU \
--resume_from_checkpoint output/Qwen3.5-4B-NPU/vx-xxx/checkpoint-xxx \
...
--resume_from_checkpoint restores model weights, optimizer state, random seeds, and training progress. If you only want to load model weights without restoring the optimizer state or data skipping state, also set --resume_only_model true. For details, see resume_from_checkpoint, resume_only_model, save_steps, and save_total_limit in Command-line Parameters.
Inference
Original model:
ASCEND_RT_VISIBLE_DEVICES=0 swift infer \
--model Qwen/Qwen2-7B-Instruct \
--stream true --max_new_tokens 2048
After LoRA fine-tuning:
ASCEND_RT_VISIBLE_DEVICES=0 swift infer \
--adapters xxx/checkpoint-xxx --load_data_args true \
--stream true --max_new_tokens 2048
For full-parameter training or a merged LoRA model, point --model to the full-weight directory:
ASCEND_RT_VISIBLE_DEVICES=0 swift infer \
--model xxx/checkpoint-xxx-merged \
--stream true --max_new_tokens 2048
Deployment
Deployment with Native Transformers
Original model:
ASCEND_RT_VISIBLE_DEVICES=0 swift deploy --model Qwen/Qwen2-7B-Instruct --max_new_tokens 2048
After LoRA fine-tuning:
ASCEND_RT_VISIBLE_DEVICES=0 swift deploy --adapters xxx/checkpoint-xxx --max_new_tokens 2048
# Deploy the full weights after Merge LoRA
ASCEND_RT_VISIBLE_DEVICES=0 swift export --adapters xxx/checkpoint-xxx --merge_lora true
ASCEND_RT_VISIBLE_DEVICES=0 swift deploy --model xxx/checkpoint-xxx-merged --max_new_tokens 2048
Deployment with vLLM-Ascend
Install from PyPI:
# Refer to the official vLLM-Ascend compatibility matrix; the following versions are verified in this document.
pip install vllm==0.14.0
pip install vllm-ascend==0.14.0rc1
Original model:
ASCEND_RT_VISIBLE_DEVICES=0 swift deploy \
--model Qwen/Qwen2.5-7B-Instruct \
--infer_backend vllm \
--max_new_tokens 2048
After LoRA fine-tuning:
ASCEND_RT_VISIBLE_DEVICES=0 swift deploy \
--adapters xxx/checkpoint-xxx \
--infer_backend vllm \
--max_new_tokens 2048
# Deploy the full weights after Merge LoRA
ASCEND_RT_VISIBLE_DEVICES=0 swift export \
--adapters xxx/checkpoint-xxx \
--merge_lora true
ASCEND_RT_VISIBLE_DEVICES=0 swift deploy \
--model xxx/checkpoint-xxx-merged \
--infer_backend vllm \
--max_new_tokens 2048
Evaluation
After training, inference, or deployment, you can evaluate the original model or fine-tuned checkpoint with SWIFT’s built-in EvalScope integration. For complete arguments and examples, see Evaluation.
Release
If you need to publish NPU-trained checkpoints, merged models, or quantized models to ModelScope/HuggingFace, use the push capability in swift export. For complete arguments and examples, see Export and Push.
FAQ
For general questions, see Frequently Asked Questions. This section records common NPU-specific issues and troubleshooting steps.
Q1: How do I confirm that the current environment detects NPUs correctly?
First confirm that you have run source /usr/local/Ascend/ascend-toolkit/set_env.sh, then run the environment check script in this document. Normally, is_torch_npu_available() should return True, torch.npu.device_count() should show the number of available NPUs, and you should be able to create a tensor on npu:0. If this fails, first check whether CANN, torch, and torch_npu match the recommended versions in this document.
Q2: How should I choose between FSDP, DeepSpeed, and Megatron-SWIFT?
For ordinary SFT, first refer to the verified FSDP1/FSDP2/deepspeed combinations in this document. For larger models or higher parallelism requirements, use Megatron-SWIFT and install MindSpeed, Megatron-LM, and mcore-bridge as described in the installation section. DeepSpeed can reduce memory pressure but may reduce speed, so compare it with FSDP when tuning performance.
Q3: Do I need to manually disable the NPU model patch?
Usually no. ms-swift enables model-level patches by default in NPU environments to adapt some Transformers models to Ascend NPU operators and compatibility requirements. Only when troubleshooting abnormal loss or forward errors, and when you suspect the issue is related to the NPU patch, should you temporarily set --enable_npu_model_patch false and compare against native Transformers behavior.
Q4: What should I know when using vLLM-Ascend for deployment or RL rollout?
Install the vllm and vllm-ascend versions recommended in this document, and prioritize model and algorithm combinations that have been verified in the compatibility tables. The sglang inference engine is not yet fully verified for NPU scenarios. For high-performance inference or RL rollout on NPUs, prefer vllm-ascend.
Q5: What happens if I forget to run source set_env.sh?
Common symptoms include is_torch_npu_available() returning False, torch.npu.device_count() returning 0, or runtime errors about missing CANN/HCCL shared libraries. After entering a new shell or container, run:
source /usr/local/Ascend/ascend-toolkit/set_env.sh
If NNAL/ATB or other components are installed, source their corresponding set_env.sh files according to your environment.
Q6: How do I diagnose a torch and torch_npu version mismatch?
Install according to the recommended versions in this document. When versions mismatch, common symptoms include import torch_npu failures, invisible NPU devices, operator registration failures, and C++ symbol/runtime errors. Check versions with:
python -c "import torch, torch_npu; print(torch.__version__); print(torch_npu.__version__)"
If versions do not match, uninstall and reinstall the full CANN/PyTorch/torch_npu stack consistently. Do not upgrade only one package.
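The comparison can also be scripted. The sketch below assumes that a matching torch_npu release shares its major.minor.patch prefix with torch, as in the recommended pair torch 2.7.1 and torch_npu 2.7.1.post4; the helper names release_tuple and versions_match are hypothetical. It works on version strings only, so it can be fed the values printed by the command above.

```python
import re


def release_tuple(version):
    """Extract the leading numeric release, e.g. '2.7.1.post4' -> (2, 7, 1)."""
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    # A missing patch component is treated as 0.
    return tuple(int(part) for part in match.groups(default="0"))


def versions_match(torch_version, torch_npu_version):
    """True when both packages share the same major.minor.patch release."""
    return release_tuple(torch_version) == release_tuple(torch_npu_version)


print(versions_match("2.7.1", "2.7.1.post4"))
print(versions_match("2.9.0", "2.7.1.post4"))
```

A matching prefix is a necessary check rather than a sufficient one: the CANN version must still agree with the official compatibility matrix.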
Q7: What happens if ASCEND_RT_VISIBLE_DEVICES and NPROC_PER_NODE do not match?
For distributed training, the number of devices listed in ASCEND_RT_VISIBLE_DEVICES should equal NPROC_PER_NODE. For example, ASCEND_RT_VISIBLE_DEVICES=0,1,2,3 usually corresponds to NPROC_PER_NODE=4. If the process count is larger than the number of visible devices, ranks may fail to bind devices, multiple processes may compete for the same device, initialization may hang, or HCCL may report errors. If the process count is smaller, only part of the visible NPUs will be used.
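A launcher script can verify this before starting training. The following is a minimal sketch; the function name check_device_process_match and its return labels are hypothetical, and it only covers the case where ASCEND_RT_VISIBLE_DEVICES is set explicitly as a comma-separated list.

```python
def check_device_process_match(visible_devices, nproc_per_node):
    """Compare the visible device list with the requested process count.

    visible_devices: value of ASCEND_RT_VISIBLE_DEVICES, e.g. "0,1,2,3"
    nproc_per_node:  value of NPROC_PER_NODE, e.g. "4"
    """
    devices = [d.strip() for d in visible_devices.split(",") if d.strip()]
    nproc = int(nproc_per_node)
    if nproc > len(devices):
        # More ranks than devices: binding failures, contention, or hangs.
        return "too_many_processes"
    if nproc < len(devices):
        # Fewer ranks than devices: some visible NPUs stay idle.
        return "unused_devices"
    return "ok"


print(check_device_process_match("0,1,2,3", "4"))
```

In a real launcher the two arguments would come from os.environ, and any result other than "ok" would abort the run before HCCL initialization.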
Q8: What should I check first when multi-card training hangs?
First confirm that every rank has started and that ASCEND_RT_VISIBLE_DEVICES matches NPROC_PER_NODE. Then identify whether logs stop during data preprocessing, model construction, weight loading, or HCCL initialization. For NPU/HCCL low-level logs, check:
ls ~/ascend/log/debug/plog
If the Python process is still running but produces no output for a long time, use pystack to inspect the stack of each rank and determine whether it is stuck in data loading, communication, or model forward/backward.
Q9: How do I initially troubleshoot HCCL connection or timeout issues?
Use npu-smi info and npu-smi info -t topo to check device health and topology, then check whether other jobs are occupying the same NPU group. For single-node training, first verify card IDs, process count, and visible devices. For multi-node training, also verify networking, rank configuration, communication ports, and environment variables on all nodes. If old training processes remain on the same machine, clean up the corresponding training processes and retry.
Q11: How should I choose between native transformers deployment and vLLM-Ascend deployment?
Native transformers deployment has better compatibility and is suitable for validating whether the model, adapter, template, and output are correct. vLLM-Ascend is better for high-throughput services, RL rollout, or OpenAI-compatible serving. If you encounter vLLM-Ascend version or operator issues, first confirm that the model itself works with the transformers backend, then switch back to vLLM-Ascend so that any remaining failure can be attributed to the inference backend rather than to the model.
Q12: What should I do if vLLM-Ascend reports device type mismatch or undefined symbol?
This is usually not caused by training script arguments. It often means the vllm-ascend wheel does not match the current hardware, PyTorch version, or C++ ABI. First check package build information and current versions:
python -c "import torch, vllm_ascend; print(torch.__version__); print(vllm_ascend.__file__)"
If the error message contains Current device type ... does not match the installed version's device type ..., undefined symbol, or similar text, reinstall torch, torch_npu, vllm, and vllm-ascend according to the device type (A2/A3/other) and the official compatibility matrix. Do not replace only one package.
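When triaging logs in bulk, the two message fragments quoted above can be matched mechanically. This is a minimal sketch; the function name looks_like_wheel_mismatch is hypothetical, and it recognizes only the two fragments named in this answer, not other variants of the error.

```python
# Fragments taken from the error messages quoted in this answer.
DEVICE_TYPE_MARKER = "does not match the installed version's device type"
SYMBOL_MARKER = "undefined symbol"


def looks_like_wheel_mismatch(error_text):
    """True if the text contains one of the known mismatch fragments."""
    text = error_text.lower()
    return DEVICE_TYPE_MARKER in text or SYMBOL_MARKER in text


print(looks_like_wheel_mismatch("ImportError: libfoo.so: undefined symbol: bar"))
```

A positive match points toward reinstalling the package stack rather than changing training or deployment arguments.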
Q13: Can FP8 or quantized models be trained directly on NPUs?
Do not assume they can. Before downloading or loading a large model, check whether config.json contains quantization_config, and check the actual dtype in safetensors. Quantization/QLoRA is still listed as not supported or not fully verified in the NPU support scope. If the model weights are FP8 block-quantized and your NPU software stack does not support that FP8 path, use BF16 weights first, or convert the model offline to BF16 before training or loading.
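Both checks can be done without loading any weights. The sketch below uses only the standard library and relies on the safetensors file layout (an 8-byte little-endian header length followed by a JSON header that records each tensor's dtype); the function names are hypothetical, and the file paths are whatever your local checkpoint directory contains.

```python
import json
import struct


def has_quantization_config(config_path):
    """True if the model's config.json declares a quantization_config."""
    with open(config_path, encoding="utf-8") as f:
        return "quantization_config" in json.load(f)


def safetensors_dtypes(path):
    """Read only the JSON header of a safetensors file and return its dtypes."""
    with open(path, "rb") as f:
        # The first 8 bytes hold the header size as an unsigned 64-bit integer.
        (header_len,) = struct.unpack("<Q", f.read(8))
        header = json.loads(f.read(header_len))
    # "__metadata__" is an optional free-form entry, not a tensor.
    return {
        meta["dtype"] for name, meta in header.items() if name != "__metadata__"
    }
```

A result containing dtypes such as F8_E4M3 indicates FP8 weights, while BF16 indicates weights that can follow the BF16 path described above. For sharded checkpoints, run the dtype check on each shard.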
Q14: How do I troubleshoot Megatron-SWIFT importing the wrong Megatron/MindSpeed?
Before running Megatron-SWIFT, PYTHONPATH and MEGATRON_LM_PATH must point to the same Megatron-LM source tree. Otherwise Python may start successfully while importing a different Megatron/MindSpeed combination, which can make later errors look like model or argument issues.
export PYTHONPATH=$PYTHONPATH:<your_local_megatron_lm_path>
export MEGATRON_LM_PATH=<your_local_megatron_lm_path>
python -c "import megatron, os; print(megatron.__file__); print(os.environ.get('MEGATRON_LM_PATH'))"
If they do not match, fix the environment variables before continuing with model construction, weight loading, or parallel configuration troubleshooting.
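The comparison of the two printed paths can be automated. This is a minimal sketch with a hypothetical helper name, module_is_under; it takes the module file path and the source tree path as strings, so it can be fed the values printed by the command above.

```python
import os


def module_is_under(module_file, source_root):
    """True if module_file is located inside the source_root directory tree."""
    module_real = os.path.realpath(module_file)
    root_real = os.path.realpath(source_root)
    # commonpath compares whole path components, so "/opt/Megatron-LM-old"
    # is not mistaken for a child of "/opt/Megatron-LM".
    return os.path.commonpath([module_real, root_real]) == root_real


print(module_is_under("/opt/Megatron-LM/megatron/__init__.py", "/opt/Megatron-LM"))
```

A False result means Python resolved megatron from somewhere other than MEGATRON_LM_PATH, for example an older copy in site-packages or an earlier PYTHONPATH entry.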
NPU WeChat Group