# AMD GPU Support

## 1. Environment setup

### 1.1 Base environment

Pull the ms-swift image built for the AMD ROCm stack, then start the container with the commands below.

If you need a newer ms-swift version, upgrade with pip or install from source code (adding `--no-deps` is recommended to avoid pulling in dependency upgrades that may cause issues).

```bash
IMAGE_NAME=amdagi/modelscope:ubuntu22.04-rocm7.2.0-py312-torch2.10.0-vllm0.18.1-modelscope1.35.1-swift4.1.0
docker pull ${IMAGE_NAME}

CONTAINER_NAME=swift_test
docker run -it --network=host --ipc=host --privileged --group-add video \
    --device=/dev/dri --device=/dev/kfd \
    --shm-size 512G --ulimit memlock=-1 \
    --security-opt seccomp=unconfined --cap-add SYS_PTRACE \
    --name ${CONTAINER_NAME} \
    ${IMAGE_NAME} \
    /bin/bash
```

### 1.2 Environment check

- Confirm the availability of AMD devices for PyTorch in the container.

```bash
python -c "import torch;print(torch.cuda.is_available())"  # output: True
```

- Inspect GPU topology and NUMA: `rocm-smi --showtopo`

```
============================ ROCm System Management Interface ============================
WARNING: AMD GPU device(s) is/are in a low-power state. Check power control/runtime_status

================================ Weight between two GPUs =================================
       GPU0         GPU1         GPU2         GPU3         GPU4         GPU5         GPU6         GPU7
GPU0   0            15           15           15           15           15           15           15
GPU1   15           0            15           15           15           15           15           15
GPU2   15           15           0            15           15           15           15           15
GPU3   15           15           15           0            15           15           15           15
GPU4   15           15           15           15           0            15           15           15
GPU5   15           15           15           15           15           0            15           15
GPU6   15           15           15           15           15           15           0            15
GPU7   15           15           15           15           15           15           15           0

================================= Hops between two GPUs ==================================
       GPU0         GPU1         GPU2         GPU3         GPU4         GPU5         GPU6         GPU7
GPU0   0            1            1            1            1            1            1            1
GPU1   1            0            1            1            1            1            1            1
GPU2   1            1            0            1            1            1            1            1
GPU3   1            1            1            0            1            1            1            1
GPU4   1            1            1            1            0            1            1            1
GPU5   1            1            1            1            1            0            1            1
GPU6   1            1            1            1            1            1            0            1
GPU7   1            1            1            1            1            1            1            0

=============================== Link Type between two GPUs ===============================
       GPU0         GPU1         GPU2         GPU3         GPU4         GPU5         GPU6         GPU7
GPU0   0            XGMI         XGMI         XGMI         XGMI         XGMI         XGMI         XGMI
GPU1   XGMI         0            XGMI         XGMI         XGMI         XGMI         XGMI         XGMI
GPU2   XGMI         XGMI         0            XGMI         XGMI         XGMI         XGMI         XGMI
GPU3   XGMI         XGMI         XGMI         0            XGMI         XGMI         XGMI         XGMI
GPU4   XGMI         XGMI         XGMI         XGMI         0            XGMI         XGMI         XGMI
GPU5   XGMI         XGMI         XGMI         XGMI         XGMI         0            XGMI         XGMI
GPU6   XGMI         XGMI         XGMI         XGMI         XGMI         XGMI         0            XGMI
GPU7   XGMI         XGMI         XGMI         XGMI         XGMI         XGMI         XGMI         0

======================================= Numa Nodes =======================================
GPU[0]          : (Topology) Numa Node: 0
GPU[0]          : (Topology) Numa Affinity: 0
GPU[1]          : (Topology) Numa Node: 0
GPU[1]          : (Topology) Numa Affinity: 0
GPU[2]          : (Topology) Numa Node: 0
GPU[2]          : (Topology) Numa Affinity: 0
GPU[3]          : (Topology) Numa Node: 0
GPU[3]          : (Topology) Numa Affinity: 0
GPU[4]          : (Topology) Numa Node: 1
GPU[4]          : (Topology) Numa Affinity: 1
GPU[5]          : (Topology) Numa Node: 1
GPU[5]          : (Topology) Numa Affinity: 1
GPU[6]          : (Topology) Numa Node: 1
GPU[6]          : (Topology) Numa Affinity: 1
GPU[7]          : (Topology) Numa Node: 1
GPU[7]          : (Topology) Numa Affinity: 1
================================== End of ROCm SMI Log ===================================
```

- Check GPU utilization and VRAM usage (`rocm-smi` or `rocm-smi -u --showmeminfo vram`):

```
# output of 'rocm-smi'
============================================ ROCm System Management Interface ============================================
====================================================== Concise Info ======================================================
Device  Node  IDs              Temp        Power     Partitions          SCLK    MCLK    Fan  Perf  PwrCap  VRAM%  GPU%
              (DID,     GUID)  (Junction)  (Socket)  (Mem, Compute, ID)
==========================================================================================================================
0       2     0x74a2,   1017   43.0°C      155.0W    NPS1, SPX, 0        94Mhz   900Mhz  0%   auto  650.0W  0%     0%
1       3     0x74a2,   47713  41.0°C      155.0W    NPS1, SPX, 0        91Mhz   900Mhz  0%   auto  650.0W  0%     0%
2       4     0x74a2,   37449  45.0°C      159.0W    NPS1, SPX, 0        95Mhz   900Mhz  0%   auto  650.0W  0%     0%
3       5     0x74a2,   11217  41.0°C      155.0W    NPS1, SPX, 0        95Mhz   900Mhz  0%   auto  650.0W  0%     0%
4       6     0x74a2,   41880  44.0°C      160.0W    NPS1, SPX, 0        91Mhz   900Mhz  0%   auto  650.0W  0%     0%
5       7     0x74a2,   6656   42.0°C      157.0W    NPS1, SPX, 0        95Mhz   900Mhz  0%   auto  650.0W  0%     0%
6       8     0x74a2,   12840  45.0°C      160.0W    NPS1, SPX, 0        96Mhz   900Mhz  0%   auto  650.0W  0%     0%
7       9     0x74a2,   35760  43.0°C      158.0W    NPS1, SPX, 0        107Mhz  900Mhz  0%   auto  650.0W  0%     0%
==========================================================================================================================
================================================== End of ROCm SMI Log ===================================================

# output of 'rocm-smi -u --showmeminfo vram'
============================ ROCm System Management Interface ============================
=================================== % time GPU is busy ===================================
GPU[0]          : GPU use (%): 0
GPU[0]          : GFX Activity: 3862538534
GPU[1]          : GPU use (%): 0
GPU[1]          : GFX Activity: 4053246251
GPU[2]          : GPU use (%): 0
GPU[2]          : GFX Activity: 3114103535
GPU[3]          : GPU use (%): 0
GPU[3]          : GFX Activity: 4026776444
GPU[4]          : GPU use (%): 0
GPU[4]          : GFX Activity: 1224255679
GPU[5]          : GPU use (%): 0
GPU[5]          : GFX Activity: 1191191242
GPU[6]          : GPU use (%): 0
GPU[6]          : GFX Activity: 1184652679
GPU[7]          : GPU use (%): 0
GPU[7]          : GFX Activity: 2145209382
==========================================================================================
================================== Memory Usage (Bytes) ==================================
GPU[0]          : VRAM Total Memory (B): 206141652992
GPU[0]          : VRAM Total Used Memory (B): 297611264
GPU[1]          : VRAM Total Memory (B): 206141652992
GPU[1]          : VRAM Total Used Memory (B): 297623552
GPU[2]          : VRAM Total Memory (B): 206141652992
GPU[2]          : VRAM Total Used Memory (B): 297623552
GPU[3]          : VRAM Total Memory (B): 206141652992
GPU[3]          : VRAM Total Used Memory (B): 297623552
GPU[4]          : VRAM Total Memory (B): 206141652992
GPU[4]          : VRAM Total Used Memory (B): 297623552
GPU[5]          : VRAM Total Memory (B): 206141652992
GPU[5]          : VRAM Total Used Memory (B): 297623552
GPU[6]          : VRAM Total Memory (B): 206141652992
GPU[6]          : VRAM Total Used Memory (B): 297623552
GPU[7]          : VRAM Total Memory (B): 206141652992
GPU[7]          : VRAM Total Used Memory (B): 297623552
==========================================================================================
================================== End of ROCm SMI Log ===================================
```

## 2. Run examples

### 2.1 Full fine-tuning Qwen3.5 with Megatron-Swift

AMD GPUs often have large VRAM, so you can tune several knobs together to improve training throughput:

- **Parallelism tuning**: Large per-GPU memory lets you reduce communication from aggressive splits (prefer tuning PP/EP before TP).
- **Optimizer CPU offload**: If VRAM allows, disable with `--optimizer_cpu_offload false`.
- **Activation / gradient checkpointing**: If VRAM allows, use `--recompute_granularity none`, or `--recompute_granularity selective` with `--recompute_modules` for finer control.
- **MoE models**: Set `export NVTE_USE_GROUPED_GEMM_TRITON=1` to use grouped GEMM triton kernel.
- **Models with GatedDeltaNet**: Set `USE_MCORE_GDN=1` to use the Megatron-Core implementation.
- **Stability on some AMD GPUs**: Set `export HSA_NO_SCRATCH_RECLAIM=1` to avoid known issues and stabilize performance.

Single-node training:

```bash
export HSA_NO_SCRATCH_RECLAIM=1
export NVTE_USE_GROUPED_GEMM_TRITON=1

output_dir=${PWD}/megatron_output/Qwen3.5-35B-A3B
mkdir -p ${output_dir}
current_time=$(date "+%Y.%m.%d-%H.%M.%S")
log_file=${output_dir}/"1node_full_megatron_Qwen3.5-35B-A3B_${current_time}.log"

PYTORCH_CUDA_ALLOC_CONF='expandable_segments:True' \
NPROC_PER_NODE=8 \
MAX_PIXELS=1003520 \
VIDEO_MAX_PIXELS=50176 \
FPS_MAX_FRAMES=12 \
SKIP_MULTIMODAL_MTP_VALIDATION=1 \
USE_MCORE_GDN=1 \
megatron sft \
    --model Qwen/Qwen3.5-35B-A3B \
    --dataset 'AI-ModelScope/LongAlpaca-12k' \
    --save_safetensors true \
    --load_from_cache_file true \
    --tuner_type full \
    --add_non_thinking_prefix true \
    --split_dataset_ratio 0.01 \
    --tensor_model_parallel_size 1 \
    --pipeline_model_parallel_size 1 \
    --expert_model_parallel_size 8 \
    --sequence_parallel true \
    --moe_permute_fusion true \
    --moe_grouped_gemm true \
    --moe_shared_expert_overlap true \
    --moe_aux_loss_coeff 1e-6 \
    --moe_expert_capacity_factor 2 \
    --micro_batch_size 1 \
    --global_batch_size 8 \
    --recompute_granularity selective \
    --recompute_modules core_attn mlp moe \
    --gradient_accumulation_fusion false \
    --num_train_epochs 500 \
    --group_by_length true \
    --finetune true \
    --freeze_llm false \
    --freeze_vit false \
    --freeze_aligner false \
    --cross_entropy_loss_fusion true \
    --lr 1e-5 \
    --lr_warmup_fraction 0.05 \
    --min_lr 1e-6 \
    --weight_decay 0.1 \
    --adam_beta2 0.95 \
    --eval_steps 500 \
    --save_steps 500 \
    --save_total_limit 10 \
    --logging_steps 1 \
    --max_length 16384 \
    --dataloader_num_workers 8 \
    --dataset_num_proc 8 \
    --no_save_optim true \
    --no_save_rng true \
    --optimizer_cpu_offload false \
    --attention_backend flash \
    --padding_free false \
    --output_dir ${output_dir} \
    2>&1 | tee ${log_file}
```

Multi-node training:

```bash
export NNODES=2  # example: 2 nodes
export NODE_RANK=0  # 0 on master, 1 on workers
export MASTER_ADDR=<MASTER_NODE_IP>  # set to master node IP
export MASTER_PORT=29500  # communication port
export NCCL_SOCKET_IFNAME=ens50f1np1  # actual NIC name, check with ifconfig
export GLOO_SOCKET_IFNAME=ens50f1np1  # same as above
export NCCL_IB_HCA=mlx5_0,mlx5_1,mlx5_2,mlx5_3  # IB HCAs, check with ibv_devices
export NCCL_IB_GID_INDEX=3

# Main training script below: same as single-node script above
...
```

### 2.2 Reinforcement learning training for Qwen3.5 with Megatron-Swift

```bash
# Single-node training example
export HSA_NO_SCRATCH_RECLAIM=1
export NVTE_USE_GROUPED_GEMM_TRITON=1

SYSTEM_PROMPT="""You are a helpful math assistant. Solve the problem step by step and put your final answer within \\boxed{}."""

CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
NPROC_PER_NODE=8 \
PYTORCH_CUDA_ALLOC_CONF='expandable_segments:True' \
megatron rlhf \
    --rlhf_type grpo \
    --model Qwen/Qwen3.5-35B-A3B \
    --save_safetensors true \
    --enable_thinking false \
    --merge_lora true \
    --context_parallel_size 1 \
    --tensor_model_parallel_size 1 \
    --expert_model_parallel_size 8 \
    --pipeline_model_parallel_size 1 \
    --moe_permute_fusion true \
    --dataset open-r1/DAPO-Math-17k-Processed \
    --system "$SYSTEM_PROMPT" \
    --num_train_epochs 1 \
    --global_batch_size 64 \
    --micro_batch_size 1 \
    --steps_per_generation 2 \
    --num_generations 8 \
    --reward_funcs accuracy \
    --use_vllm true \
    --vllm_mode colocate \
    --vllm_gpu_memory_utilization 0.5 \
    --vllm_tensor_parallel_size 2 \
    --vllm_max_model_len 9192 \
    --max_length 1000 \
    --max_completion_length 8192 \
    --tuner_type lora \
    --target_modules all-linear \
    --lr 5e-5 \
    --bf16 true \
    --beta 0.00 \
    --epsilon 0.2 \
    --epsilon_high 0.28 \
    --dynamic_sample false \
    --overlong_filter true \
    --loss_type grpo \
    --sleep_level 1 \
    --offload_model true \
    --offload_bridge false \
    --offload_optimizer true \
    --logging_steps 1 \
    --recompute_granularity none \
    --gradient_accumulation_fusion false \
    --finetune \
    --dataloader_num_workers 8 \
    --dataset_num_proc 8 \
    --no_save_optim \
    --no_save_rng \
    --save_steps 20 \
    --attention_backend flash \
    --moe_expert_capacity_factor 2 \
    --temperature 1.0 \
    --padding_free false \
    --sequence_parallel true \
    --log_completions true \
    --report_to tensorboard
```

## Known issues

- **Reinforcement learning**: If vLLM is the inference engine, use vLLM ≥ 0.11.0. It is recommended to use ROCm 7.0 or the image we provide to avoid the sleep mode memory leak issue.
  - When using [Ray Megatron](../../source/Instruction/Ray.md) instead of `torchrun` for multi-GPU/Node training, don't set `CUDA_VISIBLE_DEVICES`/`HIP_VISIBLE_DEVICES` etc. to avoid conflicts.
- **MoE training**: Set `NVTE_USE_GROUPED_GEMM_TRITON=1` and `--gradient_accumulation_fusion false` to reduce occasional GPU hangs.