# Ascend NPU 关于Megatron-SWIFT在Ascend NPU上的环境准备,请参考[NPU最佳实践](../BestPractices/NPU-support.md)。 ## NPU 性能数据采集 NPU性能采集通过`torch_npu.profiler.profile`接口进行采集,创建torch_npu.profiler.profile对象,通过start和stop接口控制采集性能数据,采集过程需要修改依赖的megatron源码,修改Megatron-LM/megatron/training/training.py文件中的train函数,采集示例如下: ```python import torch_npu ... experimental_config = torch_npu.profiler._ExperimentalConfig( profiler_level=torch_npu.profiler.ProfilerLevel.Level1, aic_metrics=torch_npu.profiler.AiCMetrics.PipeUtilization, ) prof = torch_npu.profiler.profile( activities=[ torch_npu.profiler.ProfilerActivity.CPU, torch_npu.profiler.ProfilerActivity.NPU ], schedule=torch_npu.profiler.schedule(wait=0, warmup=0, active=1, repeat=1, skip_first=6), on_trace_ready=torch_npu.profiler.tensorboard_trace_handler("./result"), profile_memory=False, # 关闭采集内存信息 with_stack=False, # 关闭采集堆栈信息 experimental_config=experimental_config) prof.start() # megatron 逻辑 while iteration < args.train_iters: ... ( loss_dict, skipped_iter, should_checkpoint, should_exit, exit_code, grad_norm, num_zeros_in_grad, ) = train_step( forward_step_func, train_data_iterator, model, optimizer, opt_param_scheduler, config, forward_backward_func) # 性能数据采集 prof.step() ... prof.stop() ``` ## NPU 精度数据采集 ### 安装msprobe ```shell pip install mindstudio-probe ``` ### 代码修改 为了支持 msprobe 工具进行精度调试,我们需要修改 `swift/megatron/model/mm_gpt_model.py` 文件中的 `_patch_word_embeddings` 函数。主要改动是调整函数参数和内部实现逻辑,使其能够正确地对嵌入层进行patch 下面是具体的修改内容: 修改前: ```python def _patch_word_embeddings(self, kwargs): origin_forward = VocabParallelEmbedding.forward def forward(_self, input_): from ..trainers.utils import split_cp_inputs args = get_args() reduce_scatter_embeddings = _self.reduce_scatter_embeddings _self.reduce_scatter_embeddings = False input_ = torch.masked_fill(input_, input_ < 0, 0) res = origin_forward(_self, input_) _self.reduce_scatter_embeddings = reduce_scatter_embeddings packed_seq_params = kwargs.get('packed_seq_params') # ...其他逻辑... return res VocabParallelEmbedding.forward = forward try: yield finally: VocabParallelEmbedding.forward = origin_forward def forward( self, input_ids: torch.Tensor, position_ids: torch.Tensor, attention_mask: torch.Tensor = None, decoder_input: torch.Tensor = None, labels: torch.Tensor = None, inference_params: InferenceParams = None, packed_seq_params: PackedSeqParams = None, **kwargs, ) -> torch.Tensor: if decoder_input is not None: pass elif self.pre_process: kwargs.update({'input_ids': input_ids, 'packed_seq_params': packed_seq_params}) with self._patch_word_embeddings(kwargs): decoder_input = self.language_model.embedding(input_ids=input_ids, position_ids=position_ids) # ...其他逻辑... ``` 修改后: ```python def _patch_word_embeddings(self, kwargs, emb): # 修改1 origin_forward = emb.word_embeddings.forward # 修改2 def forward(input_): # 修改3 from ..trainers.utils import split_cp_inputs args = get_args() _self = emb.word_embeddings # 修改4 reduce_scatter_embeddings = _self.reduce_scatter_embeddings _self.reduce_scatter_embeddings = False input_ = torch.masked_fill(input_, input_ < 0, 0) res = origin_forward(input_) # 修改5 _self.reduce_scatter_embeddings = reduce_scatter_embeddings packed_seq_params = kwargs.get('packed_seq_params') # ...其他逻辑... return res emb.word_embeddings.forward = forward # 修改6 try: yield finally: emb.word_embeddings.forward = origin_forward # 修改7 def forward( self, input_ids: torch.Tensor, position_ids: torch.Tensor, attention_mask: torch.Tensor = None, decoder_input: torch.Tensor = None, labels: torch.Tensor = None, inference_params: InferenceParams = None, packed_seq_params: PackedSeqParams = None, **kwargs, ) -> torch.Tensor: if decoder_input is not None: pass elif self.pre_process: kwargs.update({'input_ids': input_ids, 'packed_seq_params': packed_seq_params}) with self._patch_word_embeddings(kwargs, self.language_model.embedding): # 修改8 decoder_input = self.language_model.embedding(input_ids=input_ids, position_ids=position_ids) # ...其他逻辑... ``` 主要变化包括: 1. `_patch_word_embeddings` 方法增加了 `emb` 参数,用于接收 embedding 模块实例 2. 直接获取 `emb.word_embeddings.forward` 而不是 `VocabParallelEmbedding.forward` 3. 内部 `forward` 函数签名从 `(_self, input_)` 改为 `(input_)` 4. 在函数内部通过 `emb.word_embeddings` 获取 `_self` 5. 调用原始 forward 时直接传入 `input_` 6. 使用 `emb.word_embeddings.forward` 进行替换和恢复操作(修改6、7) 7. 在调用 `_patch_word_embeddings` 时传入 `self.language_model.embedding` 实例 对文件swift/megatron/trainers/base.py中的train_step函数进行修改 修改前: ```python def train_step(self, forward_step_func, data_iterator, model, optimizer, opt_param_scheduler, config, *args, **kwargs): new_data_iterator = self._replace_data_iterator(data_iterator, model) return self._origin_train_step(forward_step_func, new_data_iterator, model, optimizer, opt_param_scheduler, config, *args, **kwargs) ``` 修改后: ```python def train_step(self, forward_step_func, data_iterator, model, optimizer, opt_param_scheduler, config, *args, **kwargs): new_data_iterator = self._replace_data_iterator(data_iterator, model) from msprobe.pytorch import PrecisionDebugger debugger = PrecisionDebugger(dump_path='./dump_path', level='mix', model=model) debugger.start() try: origin_train_step_out = self._origin_train_step( forward_step_func, new_data_iterator, model, optimizer, opt_param_scheduler,config, *args, **kwargs) finally: debugger.stop() debugger.step() return origin_train_step_out ``` ### 使能 另外,由于msprobe不支持融合计算,需要在启动脚本添加`--no_bias_dropout_fusion True`、`--no_bias_swiglu_fusion True`、`--cross_entropy_loss_fusion False` #### 示例 ```shell PYTORCH_CUDA_ALLOC_CONF='expandable_segments:True' \ NPROC_PER_NODE=2 \ CUDA_VISIBLE_DEVICES=0,1 \ megatron sft \ --load Qwen2.5-7B-Instruct-mcore \ --dataset 'AI-ModelScope/alpaca-gpt4-data-zh#500' \ 'AI-ModelScope/alpaca-gpt4-data-en#500' \ 'swift/self-cognition#500' \ --tensor_model_parallel_size 2 \ ... --no_bias_dropout_fusion True \ --no_bias_swiglu_fusion True \ --cross_entropy_loss_fusion False ```