Metax Support

1. Use ms-swift with Metax

You can either build an image yourself or pull an existing one. This guide uses a pre-built image to demonstrate how to run ms-swift on Metax.

1.1. Start the ms-swift container

docker pull mx-devops-acr-cn-shanghai.cr.volces.com/opensource/public-ai-release/maca/ms-swift:3.10.3-maca.ai3.3.0.16-torch2.6-py310-ubuntu22.04-amd64
# You may adjust the --privileged option and mount only specific GPU cards.
# For details, refer to the documentation at https://developer.metax-tech.com
# Metax GPUs must be mounted via --device=/dev/dri --device=/dev/mxcd
docker run -it --net=host --uts=host --ipc=host --privileged=true --group-add video \
    --shm-size 100gb --ulimit memlock=-1 \
    --security-opt seccomp=unconfined --security-opt apparmor=unconfined \
    --device=/dev/dri --device=/dev/mxcd \
    -v /root/workspace:/external \
    --name swift_test \
    mx-devops-acr-cn-shanghai.cr.volces.com/opensource/public-ai-release/maca/ms-swift:3.10.3-maca.ai3.3.0.16-torch2.6-py310-ubuntu22.04-amd64
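
Once the container is running, it can be useful to confirm that the two device nodes passed with --device are visible inside it. The Python sketch below is only an illustration: the helper name missing_device_nodes is hypothetical, and the paths are the ones used in the docker run command above.

```python
import os

# Device nodes that the docker run command above passes into the container.
REQUIRED_NODES = ("/dev/dri", "/dev/mxcd")


def missing_device_nodes(paths=REQUIRED_NODES):
    """Return the entries of `paths` that do not exist on this filesystem."""
    return [path for path in paths if not os.path.exists(path)]


if __name__ == "__main__":
    missing = missing_device_nodes()
    if missing:
        print("Missing device nodes:", ", ".join(missing))
    else:
        print("All expected device nodes are present.")
```

If a node is reported as missing, the --device options of the docker run command are the first thing to recheck.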

2. Environment check

2.1. Check that Metax devices are available

Because Metax is compatible with CUDA, the availability of Metax devices can be checked the same way as on NVIDIA GPUs.

import torch
print(torch.cuda.is_available())
# True
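
The same torch.cuda interface can also enumerate the visible devices. The sketch below assumes that torch.cuda.device_count() and torch.cuda.get_device_name() behave on Metax as they do on NVIDIA, which this document implies but does not state; the helper name list_accelerators is hypothetical.

```python
def list_accelerators():
    """Return the names of the devices visible through the torch.cuda API."""
    try:
        import torch
    except ImportError:
        return []  # torch is not installed in this environment
    if not torch.cuda.is_available():
        return []
    return [torch.cuda.get_device_name(i) for i in range(torch.cuda.device_count())]


if __name__ == "__main__":
    names = list_accelerators()
    print(f"{len(names)} device(s) visible")
    for index, name in enumerate(names):
        print(index, name)
```

An empty list means either that torch is missing or that no device is visible, so it is worth checking both.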

2.2. Check the P2P connections

mx-smi topo -m
# output
=================== MetaX System Management Interface Log ===================
Timestamp                                         : Wed Feb 11 16:37:10 2026

Attached GPUs                                     : 8
Device link type matrix
        GPU0    GPU1    GPU2    GPU3    GPU4    GPU5    GPU6    GPU7    Node Affinity  CPU Affinity
GPU0    X       MX      MX      MX      NODE    NODE    NODE    NODE    0              0-31,64-95
GPU1    MX      X       MX      MX      NODE    NODE    NODE    NODE    0              0-31,64-95
GPU2    MX      MX      X       MX      NODE    NODE    NODE    NODE    0              0-31,64-95
GPU3    MX      MX      MX      X       NODE    NODE    NODE    NODE    0              0-31,64-95
GPU4    NODE    NODE    NODE    NODE    X       MX      MX      MX      0              0-31,64-95
GPU5    NODE    NODE    NODE    NODE    MX      X       MX      MX      0              0-31,64-95
GPU6    NODE    NODE    NODE    NODE    MX      MX      X       MX      0              0-31,64-95
GPU7    NODE    NODE    NODE    NODE    MX      MX      MX      X       0              0-31,64-95

Legend:
  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  MX   = Connection traversing MetaXLink
  ETH  = Connection traversing Eth
  NA   = Connection type is unknown
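
To read the link matrix programmatically, for example to see which GPUs share a MetaXLink group, the text output can be parsed directly. The following Python sketch is an assumption-laden illustration: it relies on the column layout shown above, the function names are hypothetical, and SAMPLE contains only the header and two rows copied from that output.

```python
import re

DEVICE = re.compile(r"^(GPU|NIC)\d+$")

# Header plus two rows copied from the sample output above.
SAMPLE = """\
        GPU0    GPU1    GPU2    GPU3    GPU4    GPU5    GPU6    GPU7    Node Affinity  CPU Affinity
GPU0    X       MX      MX      MX      NODE    NODE    NODE    NODE    0              0-31,64-95
GPU4    NODE    NODE    NODE    NODE    X       MX      MX      MX      0              0-31,64-95
"""


def parse_link_matrix(text):
    """Parse the device link type matrix into {row: {column: link_type}}."""
    columns, matrix = [], {}
    for line in text.splitlines():
        tokens = line.split()
        if not tokens or not DEVICE.match(tokens[0]):
            continue
        if not columns:
            # The first line that starts with a device name is the header row.
            columns = [token for token in tokens if DEVICE.match(token)]
            continue
        matrix[tokens[0]] = dict(zip(columns, tokens[1:1 + len(columns)]))
    return matrix


def metaxlink_groups(matrix):
    """Group each GPU with the peers it reaches over MX (MetaXLink)."""
    groups = []
    for gpu, links in matrix.items():
        peers = {gpu} | {other for other, kind in links.items() if kind == "MX"}
        if peers not in groups:
            groups.append(peers)
    return [sorted(group) for group in groups]


print(metaxlink_groups(parse_link_matrix(SAMPLE)))
```

On a real machine the text would come from running mx-smi topo -m and capturing its standard output.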

2.3. Check the status of the GPUs

mx-smi
# output
=================== MetaX System Management Interface Log ===================
Timestamp                                         : Wed Feb 11 09:55:49 2026

Attached GPUs                                     : 8
+---------------------------------------------------------------------------------+
| MX-SMI 2.2.9                       Kernel Mode Driver Version: 3.4.4            |
| MACA Version: 3.3.0.15             BIOS Version: 1.30.0.0                       |
|------------------+-----------------+---------------------+----------------------|
| Board       Name | GPU   Persist-M | Bus-id              | GPU-Util      sGPU-M |
| Pwr:Usage/Cap    | Temp       Perf | Memory-Usage        | GPU-State            |
|==================+=================+=====================+======================|
| 0     MetaX C500 | 0           Off | 0000:0e:00.0        | 0%          Disabled |
| 57W / 350W       | 35C          P0 | 826/65536 MiB       | Available            |
+------------------+-----------------+---------------------+----------------------+
| 1     MetaX C500 | 1           Off | 0000:0f:00.0        | 0%          Disabled |
| 58W / 350W       | 37C          P0 | 826/65536 MiB       | Available            |
+------------------+-----------------+---------------------+----------------------+
| 2     MetaX C500 | 2           Off | 0000:10:00.0        | 0%          Disabled |
| 58W / 350W       | 36C          P0 | 826/65536 MiB       | Available            |
+------------------+-----------------+---------------------+----------------------+
| 3     MetaX C500 | 3           Off | 0000:12:00.0        | 0%          Disabled |
| 60W / 350W       | 35C          P0 | 826/65536 MiB       | Available            |
+------------------+-----------------+---------------------+----------------------+
| 4     MetaX C500 | 4           Off | 0000:35:00.0        | 0%          Disabled |
| 57W / 350W       | 33C          P0 | 826/65536 MiB       | Available            |
+------------------+-----------------+---------------------+----------------------+
| 5     MetaX C500 | 5           Off | 0000:36:00.0        | 0%          Disabled |
| 56W / 350W       | 34C          P0 | 826/65536 MiB       | Available            |
+------------------+-----------------+---------------------+----------------------+
| 6     MetaX C500 | 6           Off | 0000:37:00.0        | 0%          Disabled |
| 55W / 350W       | 34C          P0 | 826/65536 MiB       | Available            |
+------------------+-----------------+---------------------+----------------------+
| 7     MetaX C500 | 7           Off | 0000:38:00.0        | 0%          Disabled |
| 56W / 350W       | 36C          P0 | 826/65536 MiB       | Available            |
+------------------+-----------------+---------------------+----------------------+

+---------------------------------------------------------------------------------+
| Process:                                                                        |
|  GPU                    PID         Process Name                 GPU Memory     |
|                                                                  Usage(MiB)     |
|=================================================================================|
|  no process found                                                               |
+---------------------------------------------------------------------------------+
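
When the status report needs to be checked from a script, the memory figures can be extracted with a regular expression. This is a minimal sketch that assumes the "used/total MiB" layout shown above; the function name memory_usage is hypothetical, and SAMPLE holds the two lines for GPU 0 copied from that output.

```python
import re

# Matches fields such as "826/65536 MiB" in the report.
MEMORY = re.compile(r"(\d+)\s*/\s*(\d+)\s*MiB")

# The two lines describing GPU 0, copied from the sample output above.
SAMPLE = """\
| 0     MetaX C500 | 0           Off | 0000:0e:00.0        | 0%          Disabled |
| 57W / 350W       | 35C          P0 | 826/65536 MiB       | Available            |
"""


def memory_usage(report):
    """Return [(used_mib, total_mib), ...] in the order the GPUs are listed."""
    return [(int(used), int(total)) for used, total in MEMORY.findall(report)]


for index, (used, total) in enumerate(memory_usage(SAMPLE)):
    print(f"GPU {index}: {used} MiB of {total} MiB in use")
```

The power field ("57W / 350W") is not matched because the pattern requires the MiB unit after the second number.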

3. Run examples

The community version of ms-swift can be used directly. However, the image also ships a more optimized version under /workspace, and we strongly recommend using it.

3.1. Run an ms-swift example

In most scenarios, the ms-swift examples can be run directly.

# We assume that the ms-swift code is under /workspace
cd /workspace/ms-swift/
bash examples/train/full/train.sh

# output:
{'loss': 1.47077751, 'grad_norm': 10.5625, 'learning_rate': 2e-06, 'token_acc': 0.65511727, 'epoch': 0.01, 'global_step/max_steps': '1/94', 'percentage': '1.06%', 'elapsed_time': '2s', 'remaining_time': '4m 28s', 'memory(GiB)': 4.87, 'train_speed(iter/s)': 0.345807}
{'loss': 1.58882141, 'grad_norm': 10.75, 'learning_rate': 1e-05, 'token_acc': 0.61763144, 'epoch': 0.05, 'global_step/max_steps': '5/94', 'percentage': '5.32%', 'elapsed_time': '10s', 'remaining_time': '3m 12s', 'memory(GiB)': 5.64, 'train_speed(iter/s)': 0.461462}
{'loss': 1.56617603, 'grad_norm': 12.8125, 'learning_rate': 9.92e-06, 'token_acc': 0.61519274, 'epoch': 0.11, 'global_step/max_steps': '10/94', 'percentage': '10.64%', 'elapsed_time': '20s', 'remaining_time': '2m 52s', 'memory(GiB)': 5.64, 'train_speed(iter/s)': 0.485796}
{'loss': 1.63347206, 'grad_norm': 13.6875, 'learning_rate': 9.69e-06, 'token_acc': 0.60373975, 'epoch': 0.16, 'global_step/max_steps': '15/94', 'percentage': '15.96%', 'elapsed_time': '30s', 'remaining_time': '2m 39s', 'memory(GiB)': 6.5, 'train_speed(iter/s)': 0.493855}
{'loss': 1.60613976, 'grad_norm': 11.0, 'learning_rate': 9.32e-06, 'token_acc': 0.59997221, 'epoch': 0.21, 'global_step/max_steps': '20/94', 'percentage': '21.28%', 'elapsed_time': '39s', 'remaining_time': '2m 27s', 'memory(GiB)': 6.5, 'train_speed(iter/s)': 0.500516}
{'loss': 1.45015478, 'grad_norm': 15.25, 'learning_rate': 8.8e-06, 'token_acc': 0.62373584, 'epoch': 0.27, 'global_step/max_steps': '25/94', 'percentage': '26.60%', 'elapsed_time': '49s', 'remaining_time': '2m 16s', 'memory(GiB)': 6.5, 'train_speed(iter/s)': 0.50548}
{'loss': 1.39427547, 'grad_norm': 13.9375, 'learning_rate': 8.18e-06, 'token_acc': 0.6357994, 'epoch': 0.32, 'global_step/max_steps': '30/94', 'percentage': '31.91%', 'elapsed_time': '59s', 'remaining_time': '2m 5s', 'memory(GiB)': 6.5, 'train_speed(iter/s)': 0.508409}
{'loss': 1.53672237, 'grad_norm': 11.125, 'learning_rate': 7.45e-06, 'token_acc': 0.61650612, 'epoch': 0.37, 'global_step/max_steps': '35/94', 'percentage': '37.23%', 'elapsed_time': '1m 8s', 'remaining_time': '1m 55s', 'memory(GiB)': 6.5, 'train_speed(iter/s)': 0.510425}
{'loss': 1.54039021, 'grad_norm': 13.8125, 'learning_rate': 6.65e-06, 'token_acc': 0.61613974, 'epoch': 0.43, 'global_step/max_steps': '40/94', 'percentage': '42.55%', 'elapsed_time': '1m 18s', 'remaining_time': '1m 45s', 'memory(GiB)': 6.5, 'train_speed(iter/s)': 0.512302}
{'loss': 1.40159426, 'grad_norm': 9.4375, 'learning_rate': 5.79e-06, 'token_acc': 0.64041773, 'epoch': 0.48, 'global_step/max_steps': '45/94', 'percentage': '47.87%', 'elapsed_time': '1m 27s', 'remaining_time': '1m 35s', 'memory(GiB)': 6.5, 'train_speed(iter/s)': 0.512983}
{'loss': 1.54977188, 'grad_norm': 11.9375, 'learning_rate': 4.91e-06, 'token_acc': 0.61078816, 'epoch': 0.53, 'global_step/max_steps': '50/94', 'percentage': '53.19%', 'elapsed_time': '1m 37s', 'remaining_time': '1m 25s', 'memory(GiB)': 6.5, 'train_speed(iter/s)': 0.514489}
{'loss': 1.6754509, 'grad_norm': 13.0625, 'learning_rate': 4.04e-06, 'token_acc': 0.58574393, 'epoch': 0.59, 'global_step/max_steps': '55/94', 'percentage': '58.51%', 'elapsed_time': '1m 46s', 'remaining_time': '1m 15s', 'memory(GiB)': 6.5, 'train_speed(iter/s)': 0.515752}
{'loss': 1.37204351, 'grad_norm': 9.25, 'learning_rate': 3.19e-06, 'token_acc': 0.6391937, 'epoch': 0.64, 'global_step/max_steps': '60/94', 'percentage': '63.83%', 'elapsed_time': '1m 56s', 'remaining_time': '1m 5s', 'memory(GiB)': 6.5, 'train_speed(iter/s)': 0.516829}
{'loss': 1.47697926, 'grad_norm': 11.375, 'learning_rate': 2.4e-06, 'token_acc': 0.62817259, 'epoch': 0.69, 'global_step/max_steps': '65/94', 'percentage': '69.15%', 'elapsed_time': '2m 5s', 'remaining_time': '55s', 'memory(GiB)': 6.5, 'train_speed(iter/s)': 0.517947}
{'loss': 1.4336628, 'grad_norm': 8.125, 'learning_rate': 1.69e-06, 'token_acc': 0.63453862, 'epoch': 0.75, 'global_step/max_steps': '70/94', 'percentage': '74.47%', 'elapsed_time': '2m 14s', 'remaining_time': '46s', 'memory(GiB)': 6.5, 'train_speed(iter/s)': 0.518833}
{'loss': 1.54315252, 'grad_norm': 9.625, 'learning_rate': 1.08e-06, 'token_acc': 0.60202073, 'epoch': 0.8, 'global_step/max_steps': '75/94', 'percentage': '79.79%', 'elapsed_time': '2m 24s', 'remaining_time': '36s', 'memory(GiB)': 6.5, 'train_speed(iter/s)': 0.519627}
{'loss': 1.47180223, 'grad_norm': 9.5625, 'learning_rate': 6e-07, 'token_acc': 0.62211501, 'epoch': 0.85, 'global_step/max_steps': '80/94', 'percentage': '85.11%', 'elapsed_time': '2m 33s', 'remaining_time': '26s', 'memory(GiB)': 6.5, 'train_speed(iter/s)': 0.520284}
{'loss': 1.44068375, 'grad_norm': 10.125, 'learning_rate': 2.5e-07, 'token_acc': 0.62673112, 'epoch': 0.91, 'global_step/max_steps': '85/94', 'percentage': '90.43%', 'elapsed_time': '2m 43s', 'remaining_time': '17s', 'memory(GiB)': 6.5, 'train_speed(iter/s)': 0.520331}
{'loss': 1.44893646, 'grad_norm': 8.375, 'learning_rate': 5e-08, 'token_acc': 0.63837478, 'epoch': 0.96, 'global_step/max_steps': '90/94', 'percentage': '95.74%', 'elapsed_time': '2m 52s', 'remaining_time': '7s', 'memory(GiB)': 6.5, 'train_speed(iter/s)': 0.520707}
{'train_runtime': 183.4332, 'train_samples_per_second': 8.177, 'train_steps_per_second': 0.512, 'train_loss': 1.50650934, 'token_acc': 0.6194337, 'epoch': 1.0, 'global_step/max_steps': '94/94', 'percentage': '100.00%', 'elapsed_time': '3m 3s', 'remaining_time': '0s', 'memory(GiB)': 6.5, 'train_speed(iter/s)': 0.512463}
Train: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 94/94 [03:03<00:00,  1.95s/it]
[INFO:swift] last_model_checkpoint: /workspace/ms-swift/output/v0-20260211-143035/checkpoint-94
[INFO:swift] best_model_checkpoint: None
[INFO:swift] images_dir: /workspace/ms-swift/output/v0-20260211-143035/images
[INFO:swift] End time of running main: 2026-02-11 14:34:09.521336
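
Each metrics line in the output above is printed in Python dict syntax, so it can be read back with ast.literal_eval for a quick look at the loss curve. The sketch below is an illustration under that assumption; the function names are hypothetical, and SAMPLE contains three lines copied from the log above.

```python
import ast

# Three lines copied from the training output above.
SAMPLE = """\
{'loss': 1.47077751, 'grad_norm': 10.5625, 'learning_rate': 2e-06, 'token_acc': 0.65511727, 'epoch': 0.01, 'global_step/max_steps': '1/94', 'percentage': '1.06%', 'elapsed_time': '2s', 'remaining_time': '4m 28s', 'memory(GiB)': 4.87, 'train_speed(iter/s)': 0.345807}
{'train_runtime': 183.4332, 'train_samples_per_second': 8.177, 'train_steps_per_second': 0.512, 'train_loss': 1.50650934, 'token_acc': 0.6194337, 'epoch': 1.0, 'global_step/max_steps': '94/94', 'percentage': '100.00%', 'elapsed_time': '3m 3s', 'remaining_time': '0s', 'memory(GiB)': 6.5, 'train_speed(iter/s)': 0.512463}
[INFO:swift] best_model_checkpoint: None
"""


def parse_metrics(log_text):
    """Return the dict-formatted lines of the log as Python dictionaries."""
    records = []
    for line in log_text.splitlines():
        line = line.strip()
        if line.startswith("{") and line.endswith("}"):
            records.append(ast.literal_eval(line))
    return records


def step_losses(records):
    """Return [(global_step, loss), ...] for per-step records only."""
    result = []
    for record in records:
        if "loss" in record:
            step = int(record["global_step/max_steps"].split("/")[0])
            result.append((step, record["loss"]))
    return result


records = parse_metrics(SAMPLE)
print(step_losses(records))
print("final train_loss:", records[-1]["train_loss"])
```

Per-step records carry a loss key, while the final summary record carries train_loss instead, which is why the two are handled separately.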

3.2. Run an ms-swift example with Megatron-LM

To use Megatron-LM as the ms-swift backend, set MEGATRON_LM_PATH to /workspace/Megatron-LM-0.15.0 or to the path of another version.

export MEGATRON_LM_PATH=/workspace/Megatron-LM-0.15.0
cd /workspace/ms-swift
bash examples/megatron/pretrain.sh
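
Before launching, a small pre-flight check can catch an unset or mistyped MEGATRON_LM_PATH. The helper below is a hypothetical sketch, not part of ms-swift; it only verifies that the variable is set and points to an existing directory.

```python
import os


def megatron_path_problem(environ=os.environ):
    """Return a description of what is wrong with MEGATRON_LM_PATH, or None."""
    path = environ.get("MEGATRON_LM_PATH")
    if not path:
        return "MEGATRON_LM_PATH is not set"
    if not os.path.isdir(path):
        return f"MEGATRON_LM_PATH points to a missing directory: {path}"
    return None


if __name__ == "__main__":
    print(megatron_path_problem() or "MEGATRON_LM_PATH looks usable")
```

The check says nothing about whether the directory holds a compatible Megatron-LM version; it only rules out the most common path mistakes.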

3.3. Use other versions of ms-swift

The Metax platform requires MACA-compatible software packages. For instance, when a build depends on torch2.8, the matching MACA build (torch2.8+maca3.3.x.x) must be used. By default, installation overwrites the torch already present in the environment, so we recommend installing with the --no-deps option.

git clone -b ${SWIFT_VERSION} https://github.com/modelscope/ms-swift.git
cd ms-swift
pip install . --no-deps

After each environment change, check the installed torch build and confirm that the devices are still available.

pip list | grep torch
# output:
# torch2.x.x+metax3.x.x.x

import torch
print(torch.cuda.is_available())
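
The version check can also be scripted. The sketch below assumes that a Metax build of torch carries "metax" or "maca" in the local version label (the part after "+"), as the version strings in this section suggest; the function names and the example version strings used with them are hypothetical.

```python
from importlib import metadata


def local_version_label(version):
    """Return the part after '+' in a version string, or '' if there is none."""
    return version.partition("+")[2]


def looks_like_metax_build(version, markers=("metax", "maca")):
    """Guess from the local version label whether this is a Metax build."""
    label = local_version_label(version).lower()
    return any(marker in label for marker in markers)


def installed_torch_version():
    """Return the installed torch version string, or None if torch is absent."""
    try:
        return metadata.version("torch")
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    version = installed_torch_version()
    if version is None:
        print("torch is not installed")
    elif looks_like_metax_build(version):
        print("torch", version, "looks like a Metax build")
    else:
        print("torch", version, "does not look like a Metax build; it may have been overwritten")
```

Running this after every pip install makes an accidentally replaced torch visible before a training job fails on it.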

3.4. Differences between Metax and NVIDIA CUDA

We are largely aligned with NVIDIA, but there are some differences in certain software and environment variables.

3.4.1. MACA_MPS_MODE

By default, MACA does not allow multiple processes to run on a single GPU, so another process cannot be launched on a GPU that is already occupied. To allow this, set MACA_MPS_MODE=1.

# run other scripts ...
export MACA_MPS_MODE=1
cd /workspace/ms-swift/
bash examples/train/full/train.sh
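
When jobs are started from Python rather than from a shell, the variable can be set for the child process only, leaving the parent environment untouched. This is a minimal sketch; the helper names with_mps_mode and launch are hypothetical, and the script path in the usage note is the one from the shell example above.

```python
import os
import subprocess


def with_mps_mode(base_env=None):
    """Return a copy of the environment with MACA_MPS_MODE set to 1."""
    env = dict(os.environ if base_env is None else base_env)
    env["MACA_MPS_MODE"] = "1"
    return env


def launch(script, cwd):
    """Run a shell script with MACA_MPS_MODE=1 set only for that process."""
    return subprocess.run(["bash", script], cwd=cwd, env=with_mps_mode(), check=True)
```

For example, launch("examples/train/full/train.sh", "/workspace/ms-swift") would correspond to the shell commands above.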

3.4.2. MCCL_SOCKET_IFNAME, GLOO_SOCKET_IFNAME, and MCCL_IB_HCA

When using MACA in a multi-node setup, set the environment variables MCCL_SOCKET_IFNAME, GLOO_SOCKET_IFNAME, and MCCL_IB_HCA to ensure proper inter-node communication. Use ifconfig to find the network interface and mx-smi to find the InfiniBand devices in use.

ifconfig
# output
ens20f0np0: xxx
            inet: your node ip
            xxx
...

mx-smi topo -n
# output
mx-smi  version: 2.2.9

=================== MetaX System Management Interface Log ===================
Timestamp                                         : Wed Feb 11 18:53:44 2026

Attached GPUs                                     : 8
Device link type matrix
        GPU0    GPU1    GPU2    GPU3    GPU4    GPU5    GPU6    GPU7    NIC0    NIC1    NIC2    NIC3    NIC4    NIC5    Node Affinity  CPU Affinity
GPU0    X       MX      MX      MX      NODE    NODE    NODE    NODE    PIX     PIX     NODE    NODE    SYS     SYS     0              0-31,64-95
GPU1    MX      X       MX      MX      NODE    NODE    NODE    NODE    PIX     PIX     NODE    NODE    SYS     SYS     0              0-31,64-95
GPU2    MX      MX      X       MX      NODE    NODE    NODE    NODE    PIX     PIX     NODE    NODE    SYS     SYS     0              0-31,64-95
GPU3    MX      MX      MX      X       NODE    NODE    NODE    NODE    PIX     PIX     NODE    NODE    SYS     SYS     0              0-31,64-95
GPU4    NODE    NODE    NODE    NODE    X       MX      MX      MX      NODE    NODE    PIX     PIX     SYS     SYS     0              0-31,64-95
GPU5    NODE    NODE    NODE    NODE    MX      X       MX      MX      NODE    NODE    PIX     PIX     SYS     SYS     0              0-31,64-95
GPU6    NODE    NODE    NODE    NODE    MX      MX      X       MX      NODE    NODE    PIX     PIX     SYS     SYS     0              0-31,64-95
GPU7    NODE    NODE    NODE    NODE    MX      MX      MX      X       NODE    NODE    PIX     PIX     SYS     SYS     0              0-31,64-95
NIC0    PIX     PIX     PIX     PIX     NODE    NODE    NODE    NODE    X       PIX     NODE    NODE    SYS     SYS
NIC1    PIX     PIX     PIX     PIX     NODE    NODE    NODE    NODE    PIX     X       NODE    NODE    SYS     SYS
NIC2    NODE    NODE    NODE    NODE    PIX     PIX     PIX     PIX     NODE    NODE    X       PIX     SYS     SYS
NIC3    NODE    NODE    NODE    NODE    PIX     PIX     PIX     PIX     NODE    NODE    PIX     X       SYS     SYS
NIC4    SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     X       PIX
NIC5    SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     PIX     X

Legend:
  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  MX   = Connection traversing MetaXLink
  ETH  = Connection traversing Eth
  NA   = Connection type is unknown

NIC Legend:

  NIC0: mlx5_0
  NIC1: mlx5_1
  NIC2: mlx5_2
  NIC3: mlx5_3
  NIC4: mlx5_4
  NIC5: mlx5_5
# The output shows:
#  1. GPU0 to GPU3 are closest (PIX) to NIC0 and NIC1, while GPU4 to GPU7 are closest to NIC2 and NIC3
#  2. NIC0 to NIC3 correspond to the IB devices mlx5_0 to mlx5_3, respectively

Therefore: MCCL_SOCKET_IFNAME=ens20f0np0 GLOO_SOCKET_IFNAME=ens20f0np0 MCCL_IB_HCA=mlx5_0,mlx5_1,mlx5_2,mlx5_3

# node 1
export MCCL_SOCKET_IFNAME=ens20f0np0
export GLOO_SOCKET_IFNAME=ens20f0np0
export MCCL_IB_HCA=mlx5_0,mlx5_1,mlx5_2,mlx5_3
cd /workspace/ms-swift/
bash examples/train/multi-node/torchrun/train_node1.sh

# node 2
# Update the value of the master_addr parameter in the script.
export MCCL_SOCKET_IFNAME=ens20f0np0
export GLOO_SOCKET_IFNAME=ens20f0np0
export MCCL_IB_HCA=mlx5_0,mlx5_1,mlx5_2,mlx5_3
cd /workspace/ms-swift/
bash examples/train/multi-node/torchrun/train_node2.sh
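
The MCCL_IB_HCA value can also be derived from the mx-smi topo -n output by script. The Python sketch below rests on an assumption taken from the reasoning above: a NIC is selected when at least one GPU reaches it over a PIX link. The function name nearby_hcas is hypothetical, and SAMPLE contains the header, two GPU rows, and the NIC legend copied from that output.

```python
import re

DEVICE = re.compile(r"^(GPU|NIC)\d+$")
NIC_NAME = re.compile(r"^\s*(NIC\d+):\s*(\S+)\s*$")

# Header, two GPU rows, and the NIC legend copied from the output above.
SAMPLE = """\
        GPU0    GPU1    GPU2    GPU3    GPU4    GPU5    GPU6    GPU7    NIC0    NIC1    NIC2    NIC3    NIC4    NIC5    Node Affinity  CPU Affinity
GPU0    X       MX      MX      MX      NODE    NODE    NODE    NODE    PIX     PIX     NODE    NODE    SYS     SYS     0              0-31,64-95
GPU4    NODE    NODE    NODE    NODE    X       MX      MX      MX      NODE    NODE    PIX     PIX     SYS     SYS     0              0-31,64-95
  NIC0: mlx5_0
  NIC1: mlx5_1
  NIC2: mlx5_2
  NIC3: mlx5_3
  NIC4: mlx5_4
  NIC5: mlx5_5
"""


def nearby_hcas(topo_text, wanted=("PIX",)):
    """Return IB device names of NICs that some GPU reaches via a wanted link type."""
    columns, nic_links, names = [], {}, {}
    for line in topo_text.splitlines():
        legend = NIC_NAME.match(line)
        if legend:
            names[legend.group(1)] = legend.group(2)
            continue
        tokens = line.split()
        if not tokens or not DEVICE.match(tokens[0]):
            continue
        if not columns:
            # The first line that starts with a device name is the header row.
            columns = [token for token in tokens if DEVICE.match(token)]
            continue
        if tokens[0].startswith("GPU"):
            for column, kind in zip(columns, tokens[1:]):
                if column.startswith("NIC"):
                    nic_links.setdefault(column, set()).add(kind)
    chosen = [nic for nic in columns if nic_links.get(nic, set()) & set(wanted)]
    return [names.get(nic, nic) for nic in chosen]


print("MCCL_IB_HCA=" + ",".join(nearby_hcas(SAMPLE)))
```

For the sample above this yields the same four devices chosen by hand, and NIC4 and NIC5 are left out because every GPU reaches them only over SYS links.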