Metax Support

1. Use ms-swift with Metax

You can either build an image or pull an existing one. This guide demonstrates ms-swift on Metax using a pre-built image.

1.1. Start the ms-swift container

docker pull mx-devops-acr-cn-shanghai.cr.volces.com/opensource/public-ai-release/maca/ms-swift:3.10.3-maca.ai3.3.0.16-torch2.6-py310-ubuntu22.04-amd64
# You may adjust the --privileged option and mount only specific GPU cards.
# For details, refer to the documentation at https://developer.metax-tech.com
# Metax GPUs must be mounted via --device=/dev/dri --device=/dev/mxcd
docker run -it --net=host --uts=host --ipc=host --privileged=true --group-add video \
    --shm-size 100gb --ulimit memlock=-1 \
    --security-opt seccomp=unconfined --security-opt apparmor=unconfined \
    --device=/dev/dri --device=/dev/mxcd \
    -v /root/workspace:/external \
    --name swift_test \
    mx-devops-acr-cn-shanghai.cr.volces.com/opensource/public-ai-release/maca/ms-swift:3.10.3-maca.ai3.3.0.16-torch2.6-py310-ubuntu22.04-amd64

2. Environment check

2.1. Check that Metax devices are available

Because MACA is compatible with CUDA, Metax devices can be checked the same way as NVIDIA GPUs.

import torch
print(torch.cuda.is_available())
# True
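
As a slightly fuller check, the sketch below lists every visible device by index and name. It assumes that the standard `torch.cuda.device_count` and `torch.cuda.get_device_name` calls behave on MACA as they do on CUDA, which the compatibility described above suggests but which is not verified here. `list_devices` is a hypothetical helper name.

```python
import importlib.util


def list_devices():
    """Return [(index, name), ...] for visible devices, or [] if none are usable."""
    # Avoid an ImportError on machines where torch is not installed.
    if importlib.util.find_spec("torch") is None:
        return []
    import torch

    if not torch.cuda.is_available():
        return []
    return [(i, torch.cuda.get_device_name(i))
            for i in range(torch.cuda.device_count())]


for index, name in list_devices():
    print(f"device {index}: {name}")
```

An empty result means either that torch is missing or that no device is visible inside the container, so it is worth checking the --device options used when starting it.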

2.2. Check the P2P connections

mx-smi topo -m
# output
=================== MetaX System Management Interface Log ===================
Timestamp                                         : Wed Feb 11 16:37:10 2026

Attached GPUs                                     : 8
Device link type matrix
        GPU0    GPU1    GPU2    GPU3    GPU4    GPU5    GPU6    GPU7    Node Affinity  CPU Affinity
GPU0    X       MX      MX      MX      NODE    NODE    NODE    NODE    0              0-31,64-95
GPU1    MX      X       MX      MX      NODE    NODE    NODE    NODE    0              0-31,64-95
GPU2    MX      MX      X       MX      NODE    NODE    NODE    NODE    0              0-31,64-95
GPU3    MX      MX      MX      X       NODE    NODE    NODE    NODE    0              0-31,64-95
GPU4    NODE    NODE    NODE    NODE    X       MX      MX      MX      0              0-31,64-95
GPU5    NODE    NODE    NODE    NODE    MX      X       MX      MX      0              0-31,64-95
GPU6    NODE    NODE    NODE    NODE    MX      MX      X       MX      0              0-31,64-95
GPU7    NODE    NODE    NODE    NODE    MX      MX      MX      X       0              0-31,64-95

Legend:
  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  MX   = Connection traversing MetaXLink
  ETH  = Connection traversing Eth
  NA   = Connection type is unknown
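
The matrix above can also be read programmatically. The following is a minimal sketch, assuming the text layout shown in this section; `parse_link_matrix` and `metaxlink_groups` are hypothetical helpers, and `SAMPLE` is a made-up 4-GPU matrix in the same layout, not real output.

```python
def parse_link_matrix(text):
    """Parse the 'Device link type matrix' into {row: {column: link_type}}."""
    lines = text.splitlines()
    start = next(i for i, line in enumerate(lines)
                 if line.strip() == "Device link type matrix")
    # Header row: keep only the device columns, drop the affinity columns.
    columns = [t for t in lines[start + 1].split() if t.startswith("GPU")]
    matrix = {}
    for line in lines[start + 2:]:
        tokens = line.split()
        if not tokens or not tokens[0].startswith("GPU"):
            break  # a blank line or the legend ends the matrix
        matrix[tokens[0]] = dict(zip(columns, tokens[1:1 + len(columns)]))
    return matrix


def metaxlink_groups(matrix):
    """Group each GPU with its direct MetaXLink (MX) peers."""
    groups, seen = [], set()
    for device, links in matrix.items():
        if device in seen:
            continue
        group = [device] + [peer for peer, link in links.items() if link == "MX"]
        seen.update(group)
        groups.append(group)
    return groups


# Hypothetical sample in the same layout as the output above.
SAMPLE = """
Device link type matrix
        GPU0    GPU1    GPU2    GPU3    Node Affinity  CPU Affinity
GPU0    X       MX      NODE    NODE    0              0-31,64-95
GPU1    MX      X       NODE    NODE    0              0-31,64-95
GPU2    NODE    NODE    X       MX      0              0-31,64-95
GPU3    NODE    NODE    MX      X       0              0-31,64-95
"""

print(metaxlink_groups(parse_link_matrix(SAMPLE)))
```

Applied to the 8-GPU output above, the same grouping would separate GPU0 to GPU3 from GPU4 to GPU7, which is useful when choosing which cards to place in one tensor-parallel group.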

2.3. Check the status of the GPUs

mx-smi
# output
=================== MetaX System Management Interface Log ===================
Timestamp                                         : Wed Feb 11 09:55:49 2026

Attached GPUs                                     : 8
+---------------------------------------------------------------------------------+
| MX-SMI 2.2.9                       Kernel Mode Driver Version: 3.4.4            |
| MACA Version: 3.3.0.15             BIOS Version: 1.30.0.0                       |
|------------------+-----------------+---------------------+----------------------|
| Board       Name | GPU   Persist-M | Bus-id              | GPU-Util      sGPU-M |
| Pwr:Usage/Cap    | Temp       Perf | Memory-Usage        | GPU-State            |
|==================+=================+=====================+======================|
| 0     MetaX C500 | 0           Off | 0000:0e:00.0        | 0%          Disabled |
| 57W / 350W       | 35C          P0 | 826/65536 MiB       | Available            |
+------------------+-----------------+---------------------+----------------------+
| 1     MetaX C500 | 1           Off | 0000:0f:00.0        | 0%          Disabled |
| 58W / 350W       | 37C          P0 | 826/65536 MiB       | Available            |
+------------------+-----------------+---------------------+----------------------+
| 2     MetaX C500 | 2           Off | 0000:10:00.0        | 0%          Disabled |
| 58W / 350W       | 36C          P0 | 826/65536 MiB       | Available            |
+------------------+-----------------+---------------------+----------------------+
| 3     MetaX C500 | 3           Off | 0000:12:00.0        | 0%          Disabled |
| 60W / 350W       | 35C          P0 | 826/65536 MiB       | Available            |
+------------------+-----------------+---------------------+----------------------+
| 4     MetaX C500 | 4           Off | 0000:35:00.0        | 0%          Disabled |
| 57W / 350W       | 33C          P0 | 826/65536 MiB       | Available            |
+------------------+-----------------+---------------------+----------------------+
| 5     MetaX C500 | 5           Off | 0000:36:00.0        | 0%          Disabled |
| 56W / 350W       | 34C          P0 | 826/65536 MiB       | Available            |
+------------------+-----------------+---------------------+----------------------+
| 6     MetaX C500 | 6           Off | 0000:37:00.0        | 0%          Disabled |
| 55W / 350W       | 34C          P0 | 826/65536 MiB       | Available            |
+------------------+-----------------+---------------------+----------------------+
| 7     MetaX C500 | 7           Off | 0000:38:00.0        | 0%          Disabled |
| 56W / 350W       | 36C          P0 | 826/65536 MiB       | Available            |
+------------------+-----------------+---------------------+----------------------+

+---------------------------------------------------------------------------------+
| Process:                                                                        |
|  GPU                    PID         Process Name                 GPU Memory     |
|                                                                  Usage(MiB)     |
|=================================================================================|
|  no process found                                                               |
+---------------------------------------------------------------------------------+
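
For scripted checks, the memory figures can be pulled out of this report with a regular expression. This is only a sketch that assumes the `used/total MiB` layout shown above; `memory_usage` is a hypothetical helper, and the two sample rows are copied from the output above.

```python
import re

# Matches the "826/65536 MiB" field in each GPU row.
MEMORY_PATTERN = re.compile(r"(\d+)/(\d+) MiB")


def memory_usage(smi_text):
    """Return [(used_mib, total_mib), ...] in the order the GPUs are listed."""
    return [(int(used), int(total))
            for used, total in MEMORY_PATTERN.findall(smi_text)]


SAMPLE_ROWS = """
| 57W / 350W       | 35C          P0 | 826/65536 MiB       | Available            |
| 58W / 350W       | 37C          P0 | 826/65536 MiB       | Available            |
"""

for gpu, (used, total) in enumerate(memory_usage(SAMPLE_ROWS)):
    print(f"GPU{gpu}: {used} of {total} MiB in use")
```

In practice the text would come from running mx-smi, for example via `subprocess.run(["mx-smi"], capture_output=True, text=True).stdout`.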

3. Run examples

The community version of ms-swift can be used directly. However, the image also ships a more optimized version under /workspace, and we strongly recommend using it.

3.1. Run an ms-swift example

In most scenarios, the ms-swift examples run without modification.

# We assume that the ms-swift code is under /workspace
cd /workspace/ms-swift/
bash examples/train/full/train.sh
# output:
{'loss': 1.47077751, 'grad_norm': 10.5625, 'learning_rate': 2e-06, 'token_acc': 0.65511727, 'epoch': 0.01, 'global_step/max_steps': '1/94', 'percentage': '1.06%', 'elapsed_time': '2s', 'remaining_time': '4m 28s', 'memory(GiB)': 4.87, 'train_speed(iter/s)': 0.345807}
{'loss': 1.58882141, 'grad_norm': 10.75, 'learning_rate': 1e-05, 'token_acc': 0.61763144, 'epoch': 0.05, 'global_step/max_steps': '5/94', 'percentage': '5.32%', 'elapsed_time': '10s', 'remaining_time': '3m 12s', 'memory(GiB)': 5.64, 'train_speed(iter/s)': 0.461462}
{'loss': 1.56617603, 'grad_norm': 12.8125, 'learning_rate': 9.92e-06, 'token_acc': 0.61519274, 'epoch': 0.11, 'global_step/max_steps': '10/94', 'percentage': '10.64%', 'elapsed_time': '20s', 'remaining_time': '2m 52s', 'memory(GiB)': 5.64, 'train_speed(iter/s)': 0.485796}
{'loss': 1.63347206, 'grad_norm': 13.6875, 'learning_rate': 9.69e-06, 'token_acc': 0.60373975, 'epoch': 0.16, 'global_step/max_steps': '15/94', 'percentage': '15.96%', 'elapsed_time': '30s', 'remaining_time': '2m 39s', 'memory(GiB)': 6.5, 'train_speed(iter/s)': 0.493855}
{'loss': 1.60613976, 'grad_norm': 11.0, 'learning_rate': 9.32e-06, 'token_acc': 0.59997221, 'epoch': 0.21, 'global_step/max_steps': '20/94', 'percentage': '21.28%', 'elapsed_time': '39s', 'remaining_time': '2m 27s', 'memory(GiB)': 6.5, 'train_speed(iter/s)': 0.500516}
{'loss': 1.45015478, 'grad_norm': 15.25, 'learning_rate': 8.8e-06, 'token_acc': 0.62373584, 'epoch': 0.27, 'global_step/max_steps': '25/94', 'percentage': '26.60%', 'elapsed_time': '49s', 'remaining_time': '2m 16s', 'memory(GiB)': 6.5, 'train_speed(iter/s)': 0.50548}
{'loss': 1.39427547, 'grad_norm': 13.9375, 'learning_rate': 8.18e-06, 'token_acc': 0.6357994, 'epoch': 0.32, 'global_step/max_steps': '30/94', 'percentage': '31.91%', 'elapsed_time': '59s', 'remaining_time': '2m 5s', 'memory(GiB)': 6.5, 'train_speed(iter/s)': 0.508409}
{'loss': 1.53672237, 'grad_norm': 11.125, 'learning_rate': 7.45e-06, 'token_acc': 0.61650612, 'epoch': 0.37, 'global_step/max_steps': '35/94', 'percentage': '37.23%', 'elapsed_time': '1m 8s', 'remaining_time': '1m 55s', 'memory(GiB)': 6.5, 'train_speed(iter/s)': 0.510425}
{'loss': 1.54039021, 'grad_norm': 13.8125, 'learning_rate': 6.65e-06, 'token_acc': 0.61613974, 'epoch': 0.43, 'global_step/max_steps': '40/94', 'percentage': '42.55%', 'elapsed_time': '1m 18s', 'remaining_time': '1m 45s', 'memory(GiB)': 6.5, 'train_speed(iter/s)': 0.512302}
{'loss': 1.40159426, 'grad_norm': 9.4375, 'learning_rate': 5.79e-06, 'token_acc': 0.64041773, 'epoch': 0.48, 'global_step/max_steps': '45/94', 'percentage': '47.87%', 'elapsed_time': '1m 27s', 'remaining_time': '1m 35s', 'memory(GiB)': 6.5, 'train_speed(iter/s)': 0.512983}
{'loss': 1.54977188, 'grad_norm': 11.9375, 'learning_rate': 4.91e-06, 'token_acc': 0.61078816, 'epoch': 0.53, 'global_step/max_steps': '50/94', 'percentage': '53.19%', 'elapsed_time': '1m 37s', 'remaining_time': '1m 25s', 'memory(GiB)': 6.5, 'train_speed(iter/s)': 0.514489}
{'loss': 1.6754509, 'grad_norm': 13.0625, 'learning_rate': 4.04e-06, 'token_acc': 0.58574393, 'epoch': 0.59, 'global_step/max_steps': '55/94', 'percentage': '58.51%', 'elapsed_time': '1m 46s', 'remaining_time': '1m 15s', 'memory(GiB)': 6.5, 'train_speed(iter/s)': 0.515752}
{'loss': 1.37204351, 'grad_norm': 9.25, 'learning_rate': 3.19e-06, 'token_acc': 0.6391937, 'epoch': 0.64, 'global_step/max_steps': '60/94', 'percentage': '63.83%', 'elapsed_time': '1m 56s', 'remaining_time': '1m 5s', 'memory(GiB)': 6.5, 'train_speed(iter/s)': 0.516829}
{'loss': 1.47697926, 'grad_norm': 11.375, 'learning_rate': 2.4e-06, 'token_acc': 0.62817259, 'epoch': 0.69, 'global_step/max_steps': '65/94', 'percentage': '69.15%', 'elapsed_time': '2m 5s', 'remaining_time': '55s', 'memory(GiB)': 6.5, 'train_speed(iter/s)': 0.517947}
{'loss': 1.4336628, 'grad_norm': 8.125, 'learning_rate': 1.69e-06, 'token_acc': 0.63453862, 'epoch': 0.75, 'global_step/max_steps': '70/94', 'percentage': '74.47%', 'elapsed_time': '2m 14s', 'remaining_time': '46s', 'memory(GiB)': 6.5, 'train_speed(iter/s)': 0.518833}
{'loss': 1.54315252, 'grad_norm': 9.625, 'learning_rate': 1.08e-06, 'token_acc': 0.60202073, 'epoch': 0.8, 'global_step/max_steps': '75/94', 'percentage': '79.79%', 'elapsed_time': '2m 24s', 'remaining_time': '36s', 'memory(GiB)': 6.5, 'train_speed(iter/s)': 0.519627}
{'loss': 1.47180223, 'grad_norm': 9.5625, 'learning_rate': 6e-07, 'token_acc': 0.62211501, 'epoch': 0.85, 'global_step/max_steps': '80/94', 'percentage': '85.11%', 'elapsed_time': '2m 33s', 'remaining_time': '26s', 'memory(GiB)': 6.5, 'train_speed(iter/s)': 0.520284}
{'loss': 1.44068375, 'grad_norm': 10.125, 'learning_rate': 2.5e-07, 'token_acc': 0.62673112, 'epoch': 0.91, 'global_step/max_steps': '85/94', 'percentage': '90.43%', 'elapsed_time': '2m 43s', 'remaining_time': '17s', 'memory(GiB)': 6.5, 'train_speed(iter/s)': 0.520331}
{'loss': 1.44893646, 'grad_norm': 8.375, 'learning_rate': 5e-08, 'token_acc': 0.63837478, 'epoch': 0.96, 'global_step/max_steps': '90/94', 'percentage': '95.74%', 'elapsed_time': '2m 52s', 'remaining_time': '7s', 'memory(GiB)': 6.5, 'train_speed(iter/s)': 0.520707}
{'train_runtime': 183.4332, 'train_samples_per_second': 8.177, 'train_steps_per_second': 0.512, 'train_loss': 1.50650934, 'token_acc': 0.6194337, 'epoch': 1.0, 'global_step/max_steps': '94/94', 'percentage': '100.00%', 'elapsed_time': '3m 3s', 'remaining_time': '0s', 'memory(GiB)': 6.5, 'train_speed(iter/s)': 0.512463}
Train: 100%|██████████| 94/94 [03:03<00:00,  1.95s/it]
[INFO:swift] last_model_checkpoint: /workspace/ms-swift/output/v0-20260211-143035/checkpoint-94
[INFO:swift] best_model_checkpoint: None
[INFO:swift] images_dir: /workspace/ms-swift/output/v0-20260211-143035/images
[INFO:swift] End time of running main: 2026-02-11 14:34:09.521336
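
Each metrics line in the log above is printed as a Python dict literal, so a run can be summarized with a few lines of code. The sketch below assumes that format holds for the whole log; `parse_metrics` is a hypothetical helper, and the sample lines are taken from the output above with most fields trimmed for brevity.

```python
import ast


def parse_metrics(log_text):
    """Return the dict-literal lines of a training log as a list of dicts."""
    records = []
    for line in log_text.splitlines():
        line = line.strip()
        if not (line.startswith("{") and line.endswith("}")):
            continue  # progress bars and [INFO:swift] lines are skipped
        try:
            record = ast.literal_eval(line)
        except (ValueError, SyntaxError):
            continue  # not a valid literal, ignore it
        if isinstance(record, dict):
            records.append(record)
    return records


# Lines from the log above, with most fields trimmed for brevity.
SAMPLE_LOG = """
{'loss': 1.47077751, 'global_step/max_steps': '1/94', 'train_speed(iter/s)': 0.345807}
{'loss': 1.58882141, 'global_step/max_steps': '5/94', 'train_speed(iter/s)': 0.461462}
[INFO:swift] best_model_checkpoint: None
"""

records = parse_metrics(SAMPLE_LOG)
losses = [r["loss"] for r in records if "loss" in r]
print(f"{len(records)} metric records, last loss {losses[-1]}")
```

`ast.literal_eval` is used rather than `eval` because it only accepts literals, so an unexpected log line cannot execute code.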

3.2. Run an ms-swift example with Megatron-LM

To use Megatron-LM as the ms-swift backend, set MEGATRON_LM_PATH to /workspace/Megatron-LM-0.15.0, or to the path of another version.

export MEGATRON_LM_PATH=/workspace/Megatron-LM-0.15.0
cd /workspace/ms-swift
bash examples/megatron/pretrain.sh

3.3. Use other versions of ms-swift

The Metax platform requires MACA-compatible software packages. For instance, if compilation depends on torch 2.8, you need a torch2.8+maca3.3.x.x build. By default, pip also installs dependencies and will overwrite the torch already in the environment, so we recommend installing with the --no-deps option.

git clone -b ${SWIFT_VERSION} https://github.com/modelscope/ms-swift.git
cd ms-swift
pip install . --no-deps

After each environment change, check the installed torch build and confirm that the devices are still available.

pip list | grep torch
# output:
# torch2.x.x+metax3.x.x.x
python -c "import torch; print(torch.cuda.is_available())"
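
The same check can be scripted so that an accidental overwrite by a stock torch wheel is caught early. This is a sketch under assumptions: it treats a local version tag containing "metax" or "maca" (the part after the "+") as the marker of a MACA build, based only on the version strings shown in this section. The helper names and the example version strings are hypothetical.

```python
from importlib import metadata

# Assumption: MACA builds carry one of these markers after the "+".
VENDOR_TAGS = ("metax", "maca")


def looks_like_maca_build(version):
    """Return True if the local version tag (after '+') names a MACA build."""
    local_tag = version.partition("+")[2].lower()
    return any(tag in local_tag for tag in VENDOR_TAGS)


def installed_torch_version():
    """Return the installed torch version string, or None if torch is absent."""
    try:
        return metadata.version("torch")
    except metadata.PackageNotFoundError:
        return None


version = installed_torch_version()
if version is None:
    print("torch is not installed")
elif looks_like_maca_build(version):
    print(f"torch {version} looks like a MACA build")
else:
    print(f"torch {version} does not look like a MACA build; it may have been overwritten")
```

Reading the version through `importlib.metadata` avoids importing torch itself, so the check still runs when the installed build is broken.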

3.4. Differences between Metax and NVIDIA CUDA

MACA is largely aligned with NVIDIA CUDA, but some software and environment variables differ.

3.4.1. MACA_MPS_MODE

By default, MACA does not allow multiple processes to run on a single GPU, so you cannot launch another process on a GPU that is already occupied. To allow this, set MACA_MPS_MODE=1.

# other scripts are already running on the GPU ...
export MACA_MPS_MODE=1
cd /workspace/ms-swift/
bash examples/train/full/train.sh

3.4.2. MCCL_SOCKET_IFNAME, GLOO_SOCKET_IFNAME, and MCCL_IB_HCA

When using MACA in a multi-node setup, set the environment variables MCCL_SOCKET_IFNAME, GLOO_SOCKET_IFNAME, and MCCL_IB_HCA to ensure proper inter-node communication. Use ifconfig to find the network interface and mx-smi to find the InfiniBand devices in use.

ifconfig
# output
ens20f0np0: xxx
            inet: your node ip
            xxx
...
mx-smi topo -n
# output
mx-smi  version: 2.2.9

=================== MetaX System Management Interface Log ===================
Timestamp                                         : Wed Feb 11 18:53:44 2026

Attached GPUs                                     : 8
Device link type matrix
        GPU0    GPU1    GPU2    GPU3    GPU4    GPU5    GPU6    GPU7    NIC0    NIC1    NIC2    NIC3    NIC4    NIC5    Node Affinity  CPU Affinity
GPU0    X       MX      MX      MX      NODE    NODE    NODE    NODE    PIX     PIX     NODE    NODE    SYS     SYS     0              0-31,64-95
GPU1    MX      X       MX      MX      NODE    NODE    NODE    NODE    PIX     PIX     NODE    NODE    SYS     SYS     0              0-31,64-95
GPU2    MX      MX      X       MX      NODE    NODE    NODE    NODE    PIX     PIX     NODE    NODE    SYS     SYS     0              0-31,64-95
GPU3    MX      MX      MX      X       NODE    NODE    NODE    NODE    PIX     PIX     NODE    NODE    SYS     SYS     0              0-31,64-95
GPU4    NODE    NODE    NODE    NODE    X       MX      MX      MX      NODE    NODE    PIX     PIX     SYS     SYS     0              0-31,64-95
GPU5    NODE    NODE    NODE    NODE    MX      X       MX      MX      NODE    NODE    PIX     PIX     SYS     SYS     0              0-31,64-95
GPU6    NODE    NODE    NODE    NODE    MX      MX      X       MX      NODE    NODE    PIX     PIX     SYS     SYS     0              0-31,64-95
GPU7    NODE    NODE    NODE    NODE    MX      MX      MX      X       NODE    NODE    PIX     PIX     SYS     SYS     0              0-31,64-95
NIC0    PIX     PIX     PIX     PIX     NODE    NODE    NODE    NODE    X       PIX     NODE    NODE    SYS     SYS
NIC1    PIX     PIX     PIX     PIX     NODE    NODE    NODE    NODE    PIX     X       NODE    NODE    SYS     SYS
NIC2    NODE    NODE    NODE    NODE    PIX     PIX     PIX     PIX     NODE    NODE    X       PIX     SYS     SYS
NIC3    NODE    NODE    NODE    NODE    PIX     PIX     PIX     PIX     NODE    NODE    PIX     X       SYS     SYS
NIC4    SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     X       PIX
NIC5    SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     PIX     X

Legend:
  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  MX   = Connection traversing MetaXLink
  ETH  = Connection traversing Eth
  NA   = Connection type is unknown

NIC Legend:

  NIC0: mlx5_0
  NIC1: mlx5_1
  NIC2: mlx5_2
  NIC3: mlx5_3
  NIC4: mlx5_4
  NIC5: mlx5_5
# The output shows:
#  1. GPU0 to GPU3 are closest to NIC0 and NIC1 (PIX), while GPU4 to GPU7 are closest to NIC2 and NIC3 (PIX)
#  2. NIC4 and NIC5 reach every GPU only via SYS, so they are left out
#  3. NIC0 to NIC3 map to the IB devices mlx5_0, mlx5_1, mlx5_2, and mlx5_3 respectively

Therefore, set MCCL_SOCKET_IFNAME=ens20f0np0, GLOO_SOCKET_IFNAME=ens20f0np0, and MCCL_IB_HCA=mlx5_0,mlx5_1,mlx5_2,mlx5_3:

# node 1
export MCCL_SOCKET_IFNAME=ens20f0np0
export GLOO_SOCKET_IFNAME=ens20f0np0
export MCCL_IB_HCA=mlx5_0,mlx5_1,mlx5_2,mlx5_3
cd /workspace/ms-swift/
bash examples/train/multi-node/torchrun/train_node1.sh
# node 2
# Update the value of the master_addr parameter in the script.
export MCCL_SOCKET_IFNAME=ens20f0np0
export GLOO_SOCKET_IFNAME=ens20f0np0
export MCCL_IB_HCA=mlx5_0,mlx5_1,mlx5_2,mlx5_3
cd /workspace/ms-swift/
bash examples/train/multi-node/torchrun/train_node2.sh
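
The NIC selection used for MCCL_IB_HCA above can be derived from the mx-smi topo -n text. The sketch below assumes the selection rule implied by this section (keep every NIC that has a PIX link to at least one GPU) and the text layout shown above. `parse_matrix`, `parse_nic_legend`, and `nearby_ib_devices` are hypothetical helpers, and `SAMPLE` is a made-up 2-GPU, 3-NIC report, not real output.

```python
def parse_matrix(text):
    """Parse the 'Device link type matrix' into {row: {column: link_type}}."""
    lines = text.splitlines()
    start = next(i for i, line in enumerate(lines)
                 if line.strip() == "Device link type matrix")
    columns = [t for t in lines[start + 1].split()
               if t.startswith(("GPU", "NIC"))]
    matrix = {}
    for line in lines[start + 2:]:
        tokens = line.split()
        if not tokens or not tokens[0].startswith(("GPU", "NIC")):
            break  # a blank line ends the matrix
        matrix[tokens[0]] = dict(zip(columns, tokens[1:1 + len(columns)]))
    return matrix


def parse_nic_legend(text):
    """Parse the 'NIC Legend:' block into {'NIC0': 'mlx5_0', ...}."""
    legend, in_legend = {}, False
    for line in text.splitlines():
        stripped = line.strip()
        if stripped == "NIC Legend:":
            in_legend = True
        elif in_legend and stripped.startswith("NIC") and ":" in stripped:
            name, device = stripped.split(":", 1)
            legend[name.strip()] = device.strip()
    return legend


def nearby_ib_devices(text, link="PIX"):
    """Return IB device names of NICs that have `link` to at least one GPU."""
    matrix = parse_matrix(text)
    legend = parse_nic_legend(text)
    gpu_rows = [row for device, row in matrix.items() if device.startswith("GPU")]
    nics = [column for column in gpu_rows[0] if column.startswith("NIC")]
    return [legend[nic] for nic in nics
            if any(row[nic] == link for row in gpu_rows)]


# Hypothetical sample in the same layout as the output above.
SAMPLE = """
Device link type matrix
        GPU0    GPU1    NIC0    NIC1    NIC2    Node Affinity  CPU Affinity
GPU0    X       NODE    PIX     NODE    SYS     0              0-31,64-95
GPU1    NODE    X       NODE    PIX     SYS     0              0-31,64-95
NIC0    PIX     NODE    X       NODE    SYS
NIC1    NODE    PIX     NODE    X       SYS
NIC2    SYS     SYS     SYS     SYS     X

NIC Legend:

  NIC0: mlx5_0
  NIC1: mlx5_1
  NIC2: mlx5_2
"""

print("MCCL_IB_HCA=" + ",".join(nearby_ib_devices(SAMPLE)))
```

Treat the result as a starting point and confirm it against the actual cabling, since a NIC with a close PCIe link may still be unconnected or reserved for other traffic.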