# Metax Support

## 1. Using Swift on the Metax Platform

You can build your own image or pull an existing prebuilt one. This guide uses the prebuilt image to show how to run ms-swift on Metax.

### 1.1. Start the ms-swift container

```bash
docker pull mx-devops-acr-cn-shanghai.cr.volces.com/opensource/public-ai-release/maca/ms-swift:3.10.3-maca.ai3.3.0.16-torch2.6-py310-ubuntu22.04-amd64

# Adjust the --privileged flag as needed, and mount only specific GPUs if you prefer.
# For more information, see the official documentation: https://developer.metax-tech.com
# The Metax GPU devices must be mounted with --device: --device=/dev/dri --device=/dev/mxcd
docker run -it --net=host --uts=host --ipc=host --privileged=true --group-add video \
    --shm-size 100gb --ulimit memlock=-1 \
    --security-opt seccomp=unconfined --security-opt apparmor=unconfined \
    --device=/dev/dri --device=/dev/mxcd \
    -v /root/workspace:/external \
    --name swift_test \
    mx-devops-acr-cn-shanghai.cr.volces.com/opensource/public-ai-release/maca/ms-swift:3.10.3-maca.ai3.3.0.16-torch2.6-py310-ubuntu22.04-amd64
```

## 2. Environment Checks

### 2.1. Check that the Metax GPUs are available

Because Metax is compatible with CUDA, you can check for Metax devices the same way you would for NVIDIA GPUs:

```python
import torch
print(torch.cuda.is_available())  # True
```

### 2.2. Check the P2P link topology between GPUs

```bash
mx-smi topo -m

# output
=================== MetaX System Management Interface Log ===================
Timestamp                                         : Wed Feb 11 16:37:10 2026

Attached GPUs                                     : 8
Device link type matrix
        GPU0  GPU1  GPU2  GPU3  GPU4  GPU5  GPU6  GPU7  Node Affinity  CPU Affinity
GPU0    X     MX    MX    MX    NODE  NODE  NODE  NODE  0              0-31,64-95
GPU1    MX    X     MX    MX    NODE  NODE  NODE  NODE  0              0-31,64-95
GPU2    MX    MX    X     MX    NODE  NODE  NODE  NODE  0              0-31,64-95
GPU3    MX    MX    MX    X     NODE  NODE  NODE  NODE  0              0-31,64-95
GPU4    NODE  NODE  NODE  NODE  X     MX    MX    MX    0              0-31,64-95
GPU5    NODE  NODE  NODE  NODE  MX    X     MX    MX    0              0-31,64-95
GPU6    NODE  NODE  NODE  NODE  MX    MX    X     MX    0              0-31,64-95
GPU7    NODE  NODE  NODE  NODE  MX    MX    MX    X     0              0-31,64-95

Legend:
  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  MX   = Connection traversing MetaXLink
  ETH  = Connection traversing Eth
  NA   = Connection type is unknown
```

### 2.3. Check GPU status

```bash
mx-smi

# output
=================== MetaX System Management Interface Log ===================
Timestamp                                         : Wed Feb 11 09:55:49 2026

Attached GPUs                                     : 8
+---------------------------------------------------------------------------------+
| MX-SMI 2.2.9                                  Kernel Mode Driver Version: 3.4.4 |
| MACA Version: 3.3.0.15                                   BIOS Version: 1.30.0.0 |
|------------------+-----------------+---------------------+----------------------|
| Board Name       | GPU Persist-M   | Bus-id              | GPU-Util sGPU-M      |
| Pwr:Usage/Cap    | Temp Perf       | Memory-Usage        | GPU-State            |
|==================+=================+=====================+======================|
| 0 MetaX C500     | 0 Off           | 0000:0e:00.0        | 0% Disabled          |
| 57W / 350W       | 35C P0          | 826/65536 MiB       | Available            |
+------------------+-----------------+---------------------+----------------------+
| 1 MetaX C500     | 1 Off           | 0000:0f:00.0        | 0% Disabled          |
| 58W / 350W       | 37C P0          | 826/65536 MiB       | Available            |
+------------------+-----------------+---------------------+----------------------+
| 2 MetaX C500     | 2 Off           | 0000:10:00.0        | 0% Disabled          |
| 58W / 350W       | 36C P0          | 826/65536 MiB       | Available            |
+------------------+-----------------+---------------------+----------------------+
| 3 MetaX C500     | 3 Off           | 0000:12:00.0        | 0% Disabled          |
| 60W / 350W       | 35C P0          | 826/65536 MiB       | Available            |
+------------------+-----------------+---------------------+----------------------+
| 4 MetaX C500     | 4 Off           | 0000:35:00.0        | 0% Disabled          |
| 57W / 350W       | 33C P0          | 826/65536 MiB       | Available            |
+------------------+-----------------+---------------------+----------------------+
| 5 MetaX C500     | 5 Off           | 0000:36:00.0        | 0% Disabled          |
| 56W / 350W       | 34C P0          | 826/65536 MiB       | Available            |
+------------------+-----------------+---------------------+----------------------+
| 6 MetaX C500     | 6 Off           | 0000:37:00.0        | 0% Disabled          |
| 55W / 350W       | 34C P0          | 826/65536 MiB       | Available            |
+------------------+-----------------+---------------------+----------------------+
| 7 MetaX C500     | 7 Off           | 0000:38:00.0        | 0% Disabled          |
| 56W / 350W       | 36C P0          | 826/65536 MiB       | Available            |
+------------------+-----------------+---------------------+----------------------+

+---------------------------------------------------------------------------------+
| Process:                                                                        |
| GPU      PID      Process Name                         GPU Memory               |
|                                                        Usage(MiB)               |
|=================================================================================|
| no process found                                                                |
+---------------------------------------------------------------------------------+
```

## 3. Running Examples

The community version of Swift runs on Metax directly. The image also ships a more heavily optimized build under `/workspace`, and we strongly recommend using the packages in that directory first.

### 3.1. Run a Swift example

In most cases, the Swift training examples run as-is:

```bash
# We assume that the ms-swift code is under /workspace
cd /workspace/ms-swift/
bash examples/train/full/train.sh
```

Sample output (excerpt):

```bash
# output:
{'loss': 1.47077751, 'grad_norm': 10.5625, 'learning_rate': 2e-06, 'token_acc': 0.65511727, 'epoch': 0.01, 'global_step/max_steps': '1/94', 'percentage': '1.06%', 'elapsed_time': '2s', 'remaining_time': '4m 28s', 'memory(GiB)': 4.87, 'train_speed(iter/s)': 0.345807}
{'loss': 1.58882141, 'grad_norm': 10.75, 'learning_rate': 1e-05, 'token_acc': 0.61763144, 'epoch': 0.05, 'global_step/max_steps': '5/94', 'percentage': '5.32%', 'elapsed_time': '10s', 'remaining_time': '3m 12s', 'memory(GiB)': 5.64, 'train_speed(iter/s)': 0.461462}
{'loss': 1.56617603, 'grad_norm': 12.8125, 'learning_rate': 9.92e-06, 'token_acc': 0.61519274, 'epoch': 0.11, 'global_step/max_steps': '10/94', 'percentage': '10.64%', 'elapsed_time': '20s', 'remaining_time': '2m 52s', 'memory(GiB)': 5.64, 'train_speed(iter/s)': 0.485796}
{'loss': 1.63347206, 'grad_norm': 13.6875, 'learning_rate': 9.69e-06, 'token_acc': 0.60373975, 'epoch': 0.16, 'global_step/max_steps': '15/94', 'percentage': '15.96%', 'elapsed_time': '30s', 'remaining_time': '2m 39s', 'memory(GiB)': 6.5, 'train_speed(iter/s)': 0.493855}
{'loss': 1.60613976, 'grad_norm': 11.0, 'learning_rate': 9.32e-06, 'token_acc': 0.59997221, 'epoch': 0.21, 'global_step/max_steps': '20/94', 'percentage': '21.28%', 'elapsed_time': '39s', 'remaining_time': '2m 27s', 'memory(GiB)': 6.5, 'train_speed(iter/s)': 0.500516}
{'loss': 1.45015478, 'grad_norm': 15.25, 'learning_rate': 8.8e-06, 'token_acc': 0.62373584, 'epoch': 0.27, 'global_step/max_steps': '25/94', 'percentage': '26.60%', 'elapsed_time': '49s', 'remaining_time': '2m 16s', 'memory(GiB)': 6.5, 'train_speed(iter/s)': 0.50548}
{'loss': 1.39427547, 'grad_norm': 13.9375, 'learning_rate': 8.18e-06, 'token_acc': 0.6357994, 'epoch': 0.32, 'global_step/max_steps': '30/94', 'percentage': '31.91%', 'elapsed_time': '59s', 'remaining_time': '2m 5s', 'memory(GiB)': 6.5, 'train_speed(iter/s)': 0.508409}
{'loss': 1.53672237, 'grad_norm': 11.125, 'learning_rate': 7.45e-06, 'token_acc': 0.61650612, 'epoch': 0.37, 'global_step/max_steps': '35/94', 'percentage': '37.23%', 'elapsed_time': '1m 8s', 'remaining_time': '1m 55s', 'memory(GiB)': 6.5, 'train_speed(iter/s)': 0.510425}
{'loss': 1.54039021, 'grad_norm': 13.8125, 'learning_rate': 6.65e-06, 'token_acc': 0.61613974, 'epoch': 0.43, 'global_step/max_steps': '40/94', 'percentage': '42.55%', 'elapsed_time': '1m 18s', 'remaining_time': '1m 45s', 'memory(GiB)': 6.5, 'train_speed(iter/s)': 0.512302}
{'loss': 1.40159426, 'grad_norm': 9.4375, 'learning_rate': 5.79e-06, 'token_acc': 0.64041773, 'epoch': 0.48, 'global_step/max_steps': '45/94', 'percentage': '47.87%', 'elapsed_time': '1m 27s', 'remaining_time': '1m 35s', 'memory(GiB)': 6.5, 'train_speed(iter/s)': 0.512983}
{'loss': 1.54977188, 'grad_norm': 11.9375, 'learning_rate': 4.91e-06, 'token_acc': 0.61078816, 'epoch': 0.53, 'global_step/max_steps': '50/94', 'percentage': '53.19%', 'elapsed_time': '1m 37s', 'remaining_time': '1m 25s', 'memory(GiB)': 6.5, 'train_speed(iter/s)': 0.514489}
{'loss': 1.6754509, 'grad_norm': 13.0625, 'learning_rate': 4.04e-06, 'token_acc': 0.58574393, 'epoch': 0.59, 'global_step/max_steps': '55/94', 'percentage': '58.51%', 'elapsed_time': '1m 46s', 'remaining_time': '1m 15s', 'memory(GiB)': 6.5, 'train_speed(iter/s)': 0.515752}
{'loss': 1.37204351, 'grad_norm': 9.25, 'learning_rate': 3.19e-06, 'token_acc': 0.6391937, 'epoch': 0.64, 'global_step/max_steps': '60/94', 'percentage': '63.83%', 'elapsed_time': '1m 56s', 'remaining_time': '1m 5s', 'memory(GiB)': 6.5, 'train_speed(iter/s)': 0.516829}
{'loss': 1.47697926, 'grad_norm': 11.375, 'learning_rate': 2.4e-06, 'token_acc': 0.62817259, 'epoch': 0.69, 'global_step/max_steps': '65/94', 'percentage': '69.15%', 'elapsed_time': '2m 5s', 'remaining_time': '55s', 'memory(GiB)': 6.5, 'train_speed(iter/s)': 0.517947}
{'loss': 1.4336628, 'grad_norm': 8.125, 'learning_rate': 1.69e-06, 'token_acc': 0.63453862, 'epoch': 0.75, 'global_step/max_steps': '70/94', 'percentage': '74.47%', 'elapsed_time': '2m 14s', 'remaining_time': '46s', 'memory(GiB)': 6.5, 'train_speed(iter/s)': 0.518833}
{'loss': 1.54315252, 'grad_norm': 9.625, 'learning_rate': 1.08e-06, 'token_acc': 0.60202073, 'epoch': 0.8, 'global_step/max_steps': '75/94', 'percentage': '79.79%', 'elapsed_time': '2m 24s', 'remaining_time': '36s', 'memory(GiB)': 6.5, 'train_speed(iter/s)': 0.519627}
{'loss': 1.47180223, 'grad_norm': 9.5625, 'learning_rate': 6e-07, 'token_acc': 0.62211501, 'epoch': 0.85, 'global_step/max_steps': '80/94', 'percentage': '85.11%', 'elapsed_time': '2m 33s', 'remaining_time': '26s', 'memory(GiB)': 6.5, 'train_speed(iter/s)': 0.520284}
{'loss': 1.44068375, 'grad_norm': 10.125, 'learning_rate': 2.5e-07, 'token_acc': 0.62673112, 'epoch': 0.91, 'global_step/max_steps': '85/94', 'percentage': '90.43%', 'elapsed_time': '2m 43s', 'remaining_time': '17s', 'memory(GiB)': 6.5, 'train_speed(iter/s)': 0.520331}
{'loss': 1.44893646, 'grad_norm': 8.375, 'learning_rate': 5e-08, 'token_acc': 0.63837478, 'epoch': 0.96, 'global_step/max_steps': '90/94', 'percentage': '95.74%', 'elapsed_time': '2m 52s', 'remaining_time': '7s', 'memory(GiB)': 6.5, 'train_speed(iter/s)': 0.520707}
{'train_runtime': 183.4332, 'train_samples_per_second': 8.177, 'train_steps_per_second': 0.512, 'train_loss': 1.50650934, 'token_acc': 0.6194337, 'epoch': 1.0, 'global_step/max_steps': '94/94', 'percentage': '100.00%', 'elapsed_time': '3m 3s', 'remaining_time': '0s', 'memory(GiB)': 6.5, 'train_speed(iter/s)': 0.512463}
Train: 100%|████████████████████████████████████████| 94/94 [03:03<00:00, 1.95s/it]
[INFO:swift] last_model_checkpoint: /workspace/ms-swift/output/v0-20260211-143035/checkpoint-94
[INFO:swift] best_model_checkpoint: None
[INFO:swift] images_dir: /workspace/ms-swift/output/v0-20260211-143035/images
[INFO:swift] End time of running main: 2026-02-11 14:34:09.521336
```

### 3.2. Use Megatron-LM as the Swift backend

To use Megatron-LM as the Swift backend, set the `MEGATRON_LM_PATH` environment variable:

```bash
export MEGATRON_LM_PATH=/workspace/Megatron-LM-0.15.0
cd /workspace/ms-swift
bash examples/megatron/pretrain.sh
```

### 3.3. Use a different version of ms-swift

The Metax platform requires MACA-compatible packages. For example, if a build depends on torch2.8, you need the torch2.8+maca3.3.x.x build.

By default, installing ms-swift overwrites the PyTorch already present in the environment, so we recommend installing with `--no-deps`:

```bash
# SWIFT_VERSION is the ms-swift branch or tag you want to install.
git clone -b "${SWIFT_VERSION}" https://github.com/modelscope/ms-swift.git
cd ms-swift
pip install . --no-deps
```

After every environment change, check the PyTorch version and confirm that the GPUs are still available:

```bash
pip list | grep torch

# output:
# torch2.x.x+metax3.x.x.x
```

```python
import torch
print(torch.cuda.is_available())
```

### 3.4. Differences between Metax and NVIDIA CUDA

Metax matches NVIDIA on most interfaces but differs in some software behavior and environment variables.

#### 3.4.1. MACA_MPS_MODE

By default, MACA does not allow multiple processes to share the same GPU. If a GPU is already in use, a new process cannot start on it.

To enable behavior similar to MPS (Multi-Process Service), set `MACA_MPS_MODE=1`:

```bash
# Run other scripts...
export MACA_MPS_MODE=1
cd /workspace/ms-swift/
bash examples/train/full/train.sh
```

#### 3.4.2. MCCL_SOCKET_IFNAME, GLOO_SOCKET_IFNAME & MCCL_IB_HCA

For multi-node training, we recommend setting the following environment variables so that communication between nodes works correctly:

> MCCL_SOCKET_IFNAME: the network interface used for MCCL communication

> GLOO_SOCKET_IFNAME: the network interface used for GLOO communication

> MCCL_IB_HCA: the InfiniBand devices to use

Use `ifconfig` and `mx-smi` to identify the network interface and the IB devices:

```bash
ifconfig

# output
ens20f0np0: xxx
    inet: your node ip xxx
...
```

```bash
mx-smi topo -n

# output
mx-smi  version: 2.2.9

=================== MetaX System Management Interface Log ===================
Timestamp                                         : Wed Feb 11 18:53:44 2026

Attached GPUs                                     : 8
Device link type matrix
        GPU0  GPU1  GPU2  GPU3  GPU4  GPU5  GPU6  GPU7  NIC0  NIC1  NIC2  NIC3  NIC4  NIC5  Node Affinity  CPU Affinity
GPU0    X     MX    MX    MX    NODE  NODE  NODE  NODE  PIX   PIX   NODE  NODE  SYS   SYS   0              0-31,64-95
GPU1    MX    X     MX    MX    NODE  NODE  NODE  NODE  PIX   PIX   NODE  NODE  SYS   SYS   0              0-31,64-95
GPU2    MX    MX    X     MX    NODE  NODE  NODE  NODE  PIX   PIX   NODE  NODE  SYS   SYS   0              0-31,64-95
GPU3    MX    MX    MX    X     NODE  NODE  NODE  NODE  PIX   PIX   NODE  NODE  SYS   SYS   0              0-31,64-95
GPU4    NODE  NODE  NODE  NODE  X     MX    MX    MX    NODE  NODE  PIX   PIX   SYS   SYS   0              0-31,64-95
GPU5    NODE  NODE  NODE  NODE  MX    X     MX    MX    NODE  NODE  PIX   PIX   SYS   SYS   0              0-31,64-95
GPU6    NODE  NODE  NODE  NODE  MX    MX    X     MX    NODE  NODE  PIX   PIX   SYS   SYS   0              0-31,64-95
GPU7    NODE  NODE  NODE  NODE  MX    MX    MX    X     NODE  NODE  PIX   PIX   SYS   SYS   0              0-31,64-95
NIC0    PIX   PIX   PIX   PIX   NODE  NODE  NODE  NODE  X     PIX   NODE  NODE  SYS   SYS
NIC1    PIX   PIX   PIX   PIX   NODE  NODE  NODE  NODE  PIX   X     NODE  NODE  SYS   SYS
NIC2    NODE  NODE  NODE  NODE  PIX   PIX   PIX   PIX   NODE  NODE  X     PIX   SYS   SYS
NIC3    NODE  NODE  NODE  NODE  PIX   PIX   PIX   PIX   NODE  NODE  PIX   X     SYS   SYS
NIC4    SYS   SYS   SYS   SYS   SYS   SYS   SYS   SYS   SYS   SYS   SYS   SYS   X     PIX
NIC5    SYS   SYS   SYS   SYS   SYS   SYS   SYS   SYS   SYS   SYS   SYS   SYS   PIX   X

Legend:
  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  MX   = Connection traversing MetaXLink
  ETH  = Connection traversing Eth
  NA   = Connection type is unknown

NIC Legend:
  NIC0: mlx5_0
  NIC1: mlx5_1
  NIC2: mlx5_2
  NIC3: mlx5_3
  NIC4: mlx5_4
  NIC5: mlx5_5

# From the topology above:
# 1. GPU0-GPU3 communicate through NIC0/NIC1 (mlx5_0, mlx5_1)
# 2. GPU4-GPU7 communicate through NIC2/NIC3 (mlx5_2, mlx5_3)
```

The recommended settings are therefore:

`MCCL_SOCKET_IFNAME=ens20f0np0`

`GLOO_SOCKET_IFNAME=ens20f0np0`

`MCCL_IB_HCA=mlx5_0,mlx5_1,mlx5_2,mlx5_3`

```bash
# node 1
export MCCL_SOCKET_IFNAME=ens20f0np0
export GLOO_SOCKET_IFNAME=ens20f0np0
export MCCL_IB_HCA=mlx5_0,mlx5_1,mlx5_2,mlx5_3
cd /workspace/ms-swift/
bash examples/train/multi-node/torchrun/train_node1.sh
```

```bash
# node 2
# Change master_addr in the script to the IP of node 1
export MCCL_SOCKET_IFNAME=ens20f0np0
export GLOO_SOCKET_IFNAME=ens20f0np0
export MCCL_IB_HCA=mlx5_0,mlx5_1,mlx5_2,mlx5_3
cd /workspace/ms-swift/
bash examples/train/multi-node/torchrun/train_node2.sh
```
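
Reading the `mx-smi topo -n` matrix by hand, as in section 3.4.2, can also be scripted. The following is a minimal sketch and not an official Metax tool. It assumes the text layout shown in the output above (a header row of `GPUn`/`NICn` labels, one row per device, and a `NIC Legend` mapping NICs to `mlx5_*` names), and it assumes that a `PIX` link between a GPU and a NIC is the criterion for "near". The names `select_ib_hcas`, `near_links`, and `SAMPLE` are hypothetical, and `SAMPLE` is a trimmed, made-up matrix in the same format, not real output.

```python
import re

# Matches device labels such as GPU0 or NIC3 in the topology matrix.
_DEVICE = re.compile(r"^(GPU|NIC)\d+$")


def select_ib_hcas(topo_text, near_links=("PIX",)):
    """Return IB device names for NICs linked to any GPU by a 'near' link type.

    Assumption: PIX (at most one PCIe bridge) marks a NIC as near a GPU.
    """
    columns = None
    near_nics = set()
    for line in topo_text.splitlines():
        tokens = line.split()
        if len(tokens) < 2 or not _DEVICE.match(tokens[0]):
            continue
        if _DEVICE.match(tokens[1]):
            # Header row: keep only the device labels, in column order.
            columns = [t for t in tokens if _DEVICE.match(t)]
            continue
        if columns is None or not tokens[0].startswith("GPU"):
            continue  # only GPU rows decide which NICs are near
        for column, link in zip(columns, tokens[1:]):
            if column.startswith("NIC") and link in near_links:
                near_nics.add(column)
    # "NIC Legend" entries look like "NIC0: mlx5_0".
    legend = dict(re.findall(r"\b(NIC\d+):\s*(\S+)", topo_text))
    ordered = sorted(near_nics, key=lambda name: int(name[3:]))
    return [legend[name] for name in ordered if name in legend]


# Hypothetical, trimmed sample in the same layout as `mx-smi topo -n`.
SAMPLE = """\
        GPU0  GPU1  NIC0  NIC1  NIC2  Node Affinity  CPU Affinity
GPU0    X     NODE  PIX   NODE  SYS   0              0-31,64-95
GPU1    NODE  X     NODE  PIX   SYS   0              0-31,64-95
NIC0    PIX   NODE  X     NODE  SYS
NIC1    NODE  PIX   NODE  X     SYS
NIC2    SYS   SYS   SYS   SYS   X

NIC Legend:
NIC0: mlx5_0
NIC1: mlx5_1
NIC2: mlx5_2
"""

print("MCCL_IB_HCA=" + ",".join(select_ib_hcas(SAMPLE)))
# prints MCCL_IB_HCA=mlx5_0,mlx5_1
```

On a real node you would pass the saved output of `mx-smi topo -n` instead of `SAMPLE`. Check the result against the matrix before exporting it, since the right link types to accept depend on your hardware layout.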