# GKD
If you are new to GKD, please refer to the [GKD Documentation](../Instruction/GKD.md) first.
GKD (Generalized Knowledge Distillation) is a training method that transfers knowledge from a teacher model to a student model by computing the Jensen-Shannon Divergence (JSD) loss between their output distributions.
## Feature Support
Megatron GKD currently supports the following features:
- **Training Modes**: Full parameter training and LoRA fine-tuning
- **Parallelism Strategies**: Context Parallel (CP), Pipeline Parallel (PP), Tensor Parallel (TP), and Expert Parallel (EP)
- **Model Support**: Compatible with LLMs and MLLMs in Megatron-SWIFT
- **Teacher Offload**: Supports offloading teacher model to CPU to save GPU memory
- **Online Generation**: Supports on-policy generation using vLLM for student model
## Parameters
### GKD-specific Parameters
| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `--teacher_model` | str | - | Path or model ID of the teacher model
*Can be omitted when using `teacher_model_server` |
| `--teacher_model_server` | str | None | Teacher model service URL (`vllm serve` only), e.g. `http://localhost:8000` |
| `--gkd_logits_topk` | int | None | Number of Top-K logits; required when using external API |
| `--beta` | float | 0.5 | JSD divergence interpolation coefficient:
• 0.0: Forward KL
• 0.5: Symmetric JSD
• 1.0: Reverse KL |
| `--lmbda` | float | 0.5 | On-Policy learning probability:
• 0.0: Pure Off-Policy
• 1.0: Pure On-Policy |
| `--temperature` | float | 0.9 | Temperature for sampling and loss computation |
| `--sft_alpha` | float | 0 | Mix in a proportion of SFT loss; applied to non-student-generated completions |
| `--max_completion_length` | int | 512 | Maximum tokens for generation |
### Batch-related Parameters
Same as Megatron SFT, use the following parameters to control batch size:
| Parameter | Description |
|-----------|-------------|
| `--micro_batch_size` | Training batch size per DP group |
| `--global_batch_size` | Global batch size: `micro_batch_size × dp_size × gradient_accumulation_steps` |
## Reference
For more parameters, please refer to [Command-line Parameters](./Command-line-parameters.md)
For training scripts, please refer to [Megatron GKD Scripts](https://github.com/modelscope/ms-swift/blob/main/examples/megatron/rlhf/gkd)
Training script using Teacher Server reference [here](https://github.com/modelscope/ms-swift/blob/main/examples/megatron/rlhf/gkd/teacher_server.sh)