# Ascend NPU

To prepare the Megatron-SWIFT environment on Ascend NPU, refer to [NPU Best Practices](../BestPractices/NPU-support.md).

## NPU Performance Data Collection

NPU performance data is collected through the `torch_npu.profiler.profile` interface. Create an instance of `torch_npu.profiler.profile`, then use its `start` and `stop` methods to control the collection process. This requires modifying the ms-swift source code, specifically the `train` function in the `swift/megatron/trainers/base.py` file. Below is an example of the collection process:

```python
import torch_npu
...

experimental_config = torch_npu.profiler._ExperimentalConfig(
    profiler_level=torch_npu.profiler.ProfilerLevel.Level1,
    aic_metrics=torch_npu.profiler.AiCMetrics.PipeUtilization,
)
prof = torch_npu.profiler.profile(
    activities=[
        torch_npu.profiler.ProfilerActivity.CPU,
        torch_npu.profiler.ProfilerActivity.NPU
    ],
    schedule=torch_npu.profiler.schedule(wait=0, warmup=0, active=1, repeat=1, skip_first=6),
    on_trace_ready=torch_npu.profiler.tensorboard_trace_handler("./result"),
    profile_memory=False,  # Disable collection of memory information
    with_stack=False,  # Disable collection of stack information
    experimental_config=experimental_config)
prof.start()
# ms-swift code
while state.iteration < args.train_iters:
    ...
    metric, grad_norm, update_successful = train_step(train_data_iterator)
    # collect performance data
    prof.step()
    ...
prof.stop()
```

## NPU Accuracy Data Collection

### Installing msprobe

```shell
pip install mindstudio-probe
```

### Code Modification

To support accuracy debugging with the msprobe tool, modify the `_patch_word_embeddings` function in the `swift/megatron/model/mm_gpt_model.py` file. The main changes adjust the function parameters and internal implementation logic so that it can correctly patch the embedding layer.
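As background for the change, the following is a minimal sketch in plain Python of the difference between a class-level patch and an instance-level patch. The names `Embedding` and `patch_instance_forward` are hypothetical stand-ins for illustration and are not part of ms-swift or Megatron; the sketch only shows the general pattern of temporarily replacing `forward` on a single instance.

```python
from contextlib import contextmanager


class Embedding:
    """Hypothetical stand-in for an embedding module (not a Megatron class)."""

    def __init__(self, scale):
        self.scale = scale

    def forward(self, x):
        return x * self.scale


@contextmanager
def patch_instance_forward(module):
    """Temporarily replace `forward` on one instance only."""
    origin_forward = module.forward  # bound method: the instance is already attached

    def forward(x):  # no explicit `self`; the original bound method is captured by closure
        x = max(x, 0)  # example pre-processing, loosely mirroring the masked_fill step
        return origin_forward(x)

    module.forward = forward  # instance attribute shadows the class method
    try:
        yield
    finally:
        module.forward = origin_forward  # restore the original behavior


a, b = Embedding(2), Embedding(3)
with patch_instance_forward(a):
    patched = a.forward(-5)    # input clamped to 0 before the original forward runs
    untouched = b.forward(-5)  # a different instance is not affected by the patch
restored = a.forward(-5)       # original behavior is back after the context exits
```

Because the replacement is attached to the instance, its signature drops the explicit `_self` argument, which is the same shape of change the modification applies to `emb.word_embeddings`.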
The specific modifications are as follows:

Before modification:

```python
def _patch_word_embeddings(self, kwargs):
    origin_forward = VocabParallelEmbedding.forward

    def forward(_self, input_):
        args = get_args()
        reduce_scatter_embeddings = _self.reduce_scatter_embeddings
        _self.reduce_scatter_embeddings = False
        input_ = torch.masked_fill(input_, input_ < 0, 0)
        res = origin_forward(_self, input_)
        _self.reduce_scatter_embeddings = reduce_scatter_embeddings
        packed_seq_params = kwargs.get('packed_seq_params')
        # ...other logic...
        return res

    VocabParallelEmbedding.forward = forward
    try:
        yield
    finally:
        VocabParallelEmbedding.forward = origin_forward

def forward(
    self,
    input_ids: torch.Tensor,
    position_ids: torch.Tensor,
    attention_mask: torch.Tensor = None,
    decoder_input: torch.Tensor = None,
    labels: torch.Tensor = None,
    inference_params: InferenceParams = None,
    packed_seq_params: PackedSeqParams = None,
    **kwargs,
) -> torch.Tensor:
    if decoder_input is not None:
        pass
    elif self.pre_process:
        kwargs.update({'input_ids': input_ids, 'packed_seq_params': packed_seq_params})
        with self._patch_word_embeddings(kwargs):
            decoder_input = self.language_model.embedding(input_ids=input_ids, position_ids=position_ids)

    # ...other logic...
```

After modification:

```python
def _patch_word_embeddings(self, kwargs, emb):  # Modification 1
    origin_forward = emb.word_embeddings.forward  # Modification 2

    def forward(input_):  # Modification 3
        args = get_args()
        _self = emb.word_embeddings  # Modification 4
        reduce_scatter_embeddings = _self.reduce_scatter_embeddings
        _self.reduce_scatter_embeddings = False
        input_ = torch.masked_fill(input_, input_ < 0, 0)
        res = origin_forward(input_)  # Modification 5
        _self.reduce_scatter_embeddings = reduce_scatter_embeddings
        packed_seq_params = kwargs.get('packed_seq_params')
        # ...other logic...
        return res

    emb.word_embeddings.forward = forward  # Modification 6
    try:
        yield
    finally:
        emb.word_embeddings.forward = origin_forward  # Modification 7

def forward(
    self,
    input_ids: torch.Tensor,
    position_ids: torch.Tensor,
    attention_mask: torch.Tensor = None,
    decoder_input: torch.Tensor = None,
    labels: torch.Tensor = None,
    inference_params: InferenceParams = None,
    packed_seq_params: PackedSeqParams = None,
    **kwargs,
) -> torch.Tensor:
    if decoder_input is not None:
        pass
    elif self.pre_process:
        kwargs.update({'input_ids': input_ids, 'packed_seq_params': packed_seq_params})
        with self._patch_word_embeddings(kwargs, self.language_model.embedding):  # Modification 8
            decoder_input = self.language_model.embedding(input_ids=input_ids, position_ids=position_ids)

    # ...other logic...
```

The main changes are:

1. The `_patch_word_embeddings` method adds an `emb` parameter to receive the embedding module instance (Modification 1).
2. `origin_forward` is taken from `emb.word_embeddings.forward` instead of `VocabParallelEmbedding.forward` (Modification 2).
3. The inner `forward` function signature changes from `(_self, input_)` to `(input_)` (Modification 3).
4. `_self` is obtained from `emb.word_embeddings` inside the function (Modification 4).
5. Only `input_` is passed when calling the original forward (Modification 5).
6. `emb.word_embeddings.forward` is used for the replacement and restore operations (Modifications 6, 7).
7. The `self.language_model.embedding` instance is passed when calling `_patch_word_embeddings` (Modification 8).

Then modify the `train_step` function in the `swift/megatron/trainers/base.py` file.

Before modification:

```python
def train_step(self, forward_step_func, data_iterator, model, optimizer, opt_param_scheduler, config, *args, **kwargs):
    new_data_iterator = self._replace_data_iterator(data_iterator, model)
    return self._origin_train_step(forward_step_func, new_data_iterator, model, optimizer, opt_param_scheduler,
                                   config, *args, **kwargs)
```

After modification:

```python
def train_step(self, forward_step_func, data_iterator, model, optimizer, opt_param_scheduler, config, *args, **kwargs):
    new_data_iterator = self._replace_data_iterator(data_iterator, model)
    from msprobe.pytorch import PrecisionDebugger
    debugger = PrecisionDebugger(dump_path='./dump_path', level='mix', model=model)
    debugger.start()
    try:
        origin_train_step_out = self._origin_train_step(
            forward_step_func, new_data_iterator, model, optimizer, opt_param_scheduler, config, *args, **kwargs)
    finally:
        # Always stop the dump and advance the debugger step, even if the training step raises
        debugger.stop()
        debugger.step()
    return origin_train_step_out
```

### Enable

Because msprobe does not support fused computation, add `--bias_dropout_fusion false`, `--bias_swiglu_fusion false`, and `--cross_entropy_loss_fusion false` to the launch script.

#### Example

```shell
PYTORCH_NPU_ALLOC_CONF='expandable_segments:True' \
NPROC_PER_NODE=2 \
CUDA_VISIBLE_DEVICES=0,1 \
megatron sft \
    --mcore_model Qwen2.5-7B-Instruct-mcore \
    --dataset 'AI-ModelScope/alpaca-gpt4-data-zh#500' \
              'AI-ModelScope/alpaca-gpt4-data-en#500' \
              'swift/self-cognition#500' \
    --tensor_model_parallel_size 2 \
    ...
    --bias_dropout_fusion false \
    --bias_swiglu_fusion false \
    --cross_entropy_loss_fusion false
```
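Since a forgotten flag is easy to miss in a long launch command, a small pre-flight check can help. The following is a minimal sketch using only the Python standard library; the helper `missing_fusion_flags` is hypothetical and not part of ms-swift or msprobe, and it assumes the command is given as a single-line string with each flag followed by its value.

```python
import shlex

# The three flags named in this section; each should be followed by "false".
REQUIRED_FLAGS = (
    "--bias_dropout_fusion",
    "--bias_swiglu_fusion",
    "--cross_entropy_loss_fusion",
)


def missing_fusion_flags(command):
    """Return the fusion flags that are not explicitly set to false in `command`."""
    tokens = shlex.split(command)
    missing = []
    for flag in REQUIRED_FLAGS:
        try:
            value = tokens[tokens.index(flag) + 1]
        except (ValueError, IndexError):
            # Flag is absent, or present without a following value
            missing.append(flag)
            continue
        if value.lower() != "false":
            missing.append(flag)
    return missing


# Hypothetical single-line command that omits one of the three flags
cmd = ("megatron sft --tensor_model_parallel_size 2 "
       "--bias_dropout_fusion false --bias_swiglu_fusion false")
print(missing_fusion_flags(cmd))
```

An empty list means all three fusion options are disabled on the command line; any returned flag should be added before starting an msprobe dump.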