# ModelLink

**Repository Path**: edwardvitor/ModelLink

## Basic Information

- **Project Name**: ModelLink
- **Description**: Huawei Ascend large model repository
- **Primary Language**: Python
- **License**: MIT
- **Default Branch**: master
- **Homepage**: https://gitee.com/ascend/ModelLink/
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 1210
- **Created**: 2024-07-29
- **Last Updated**: 2024-07-29

## Categories & Tags

**Categories**: Uncategorized

**Tags**: None

## README


ModelLink aims to provide end-to-end large language model solutions for Huawei Ascend chips, including models, algorithms, and downstream tasks.

---

## ModelLink Solution Overview

### Supported features

ModelLink currently supports the following features for large-model workflows:

* [Dataset Preparation for Pre-training](#jump11) / [Fine-tuning Instruction Dataset Preparation](#jump12)
* [Pre-training](#jump13) / [Full-parameter Fine-tuning](#jump14) / [Low-parameter Fine-tuning](#jump15)
* [Inference: human-machine dialogue](#jump16)
* [Evaluation with numerous benchmarks](#jump17)
* [Utilizing Acceleration Features (Acceleration Algorithms + Fusion Operators)](#jump18)
* [Profiling data based on Ascend chips](#jump19)
* [Converting checkpoints between Hugging Face and Megatron formats](#jump19)
* [Enabling deterministic computing on Ascend chips](#jump21)
* [Enabling high-availability features based on Ascend chips](#jump22)

More novel and useful features for LLM training on Ascend are under development...

---

## ModelLink Maintenance Policies

Each ModelLink release passes through the following five maintenance phases:

| **Status**        | **Duration** | **Description** |
|-------------------|--------------|-----------------|
| Planning          | 1-3 months   | Plan features. |
| Development       | 3 months     | Develop features. |
| Maintained        | 6-12 months  | Incorporate all resolved issues and release the version. Different versions of ModelLink adopt different support policies: the maintenance periods for Regular Releases and Long-Term Support versions are 6 and 12 months, respectively. |
| Unmaintained      | 0-3 months   | Incorporate all resolved issues. No dedicated maintenance personnel are available and no version will be released. |
| End Of Life (EOL) | N/A          | Do not accept any modification to the branch. |
The maintenance policy for released ModelLink versions is as follows:

| **ModelLink** | **Maintenance Policy** | **Status** | **Launch Date** | **Subsequent Status** | **EOL Date** |
|---------------|------------------------|------------|-----------------|------------------------------------|--------------|
| 1.0           | Regular Release        | Maintained | 2024/03/30      | Unmaintained, estimated 2024/09/30 |              |
| 1.1           | Regular Release        | Maintained | 2024/06/30      | Unmaintained, estimated 2024/12/30 |              |

---

### Supported versions and models

【Please note the environment versions required to use the models, listed below.】

| Software  | [Version](https://www.hiascend.com/zh/) |
| :-------: | :-------------------------------------: |
| Python    | 3.8                                     |
| driver    | under development                       |
| firmware  | under development                       |
| CANN      | under development                       |
| torch     | 2.1.0, 2.2.0                            |
| torch_npu | under development                       |

【Based on the current version of Megatron, the performance statistics from our testing are as follows (hardware: Atlas 900 A2 PODc).】

For the models listed below, we provide training scripts and README instructions in the `examples` folder, covering the full process of model training, inference, and evaluation. The figures below were measured with the MC2 acceleration feature enabled. MC2 is only supported on versions above 24RC2 and is disabled by default in the repository code; to use it, please refer to the `Acceleration Features` chapter on the home page.
| Model    | Parameters | Sequence | Cluster | Mode | Performance | Reference | Contributor |
|----------|------------|----------|---------|------|-------------|-----------|-------------|
| Aquila   | 7B         | 2K       | 1x8     | BF16 | 2849        | 2874      | 【Ascend】    |
| Aquila2  | 7B         | 2K       | 1x8     | FP16 | 3323        | 2673      | 【Community】 |
| Aquila2  | 34B        | 4K       | 2x8     | BF16 | 854         | 732       | 【Community】 |
| Baichuan | 7B         | 4K       | 1x8     | FP16 | 2685        | 2036      | 【Ascend】    |
| Baichuan | 13B        | 4K       | 1x8     | FP16 | 1213        | 862       | 【Ascend】    |
| Baichuan2 | 7B        | 4K       | 1x8     | BF16 | 2664        | 3969      | 【Ascend】    |
| Baichuan2 | 13B       | 4K       | 1x8     | BF16 | 1668        | 2062      | 【Ascend】    |
| Bloom    | 7B1        | 2K       | 1x8     | FP16 | 2034        | 2525      | 【Ascend】    |
| Bloom    | 176B       | 2K       | 12x8    | BF16 | 100         | 107       | 【Ascend】    |
| ChatGLM3 | 6B         | 8K       | 1x8     | FP16 | 4297        | 4267      | 【Community】 |
| CodeLlama | 34B       | 4K       | 2x8     | BF16 | 837         | 762       | 【Community】 |
| InternLM | 7B         | 2K       | 1x8     | BF16 | 2776        | 2854      | 【Ascend】    |
| InternLM | 65B        | 2K       | 4x8     | BF16 | 341         | 414       | 【Ascend】    |
| LLaMA    | 7B         | 2K       | 1x8     | FP16 | 3600        | 3804      | 【Ascend】    |
| LLaMA    | 13B        | 2K       | 1x8     | FP16 | 1895        | 2012      | 【Ascend】    |
| LLaMA    | 33B        | 2K       | 4x8     | FP16 | 621         | 776       | 【Ascend】    |
| LLaMA    | 65B        | 2K       | 4x8     | BF16 | 348         | 426       | 【Ascend】    |
| LLaMA2   | 7B         | 4K       | 1x8     | BF16 | 4200        | 3850      | 【Ascend】    |
| LLaMA2   | 13B        | 4K       | 1x8     | BF16 | 1990        | 1920      | 【Ascend】    |
| LLaMA2   | 34B        | 4K       | 2x8     | BF16 | 749         | 796       | 【Ascend】    |
| LLaMA2   | 70B        | 4K       | 4x8     | BF16 | 420         | 430       | 【Ascend】    |
| LLaMA3   | 8B         | 8K       | 1x8     | BF16 | 2483        | 2674      | 【Ascend】    |
| LLaMA3   | 70B        | 8K       | 8x8     | BF16 | 283         | 355       | 【Ascend】    |
| Qwen     | 7B         | 8K       | 1x8     | BF16 | 2499        | 2867      | 【Ascend】    |
| Qwen     | 14B        | 2K       | 1x8     | BF16 | 1560        | 1578      | 【Ascend】    |
| Qwen     | 72B        | 8K       | 16x8    | BF16 | 285         | 345       | 【Ascend】    |
| Qwen1.5  | 0.5B       | 8K       | 1x8     | BF16 | 22834       | 25306     | 【Community】 |
| Qwen1.5  | 1.8B       | 8K       | 1x8     | BF16 | 13029       | 12181     | 【Community】 |
| Qwen1.5  | 4B         | 8K       | 1x8     | BF16 | 5033        | 5328      | 【Community】 |
| Qwen1.5  | 7B         | 8K       | 1x8     | BF16 | 2862        | 2621      | 【Community】 |
| Qwen1.5  | 14B        | 8K       | 1x8     | BF16 | 1717        | 1702      | 【Community】 |
| Qwen1.5  | 32B        | 8K       | 4x8     | BF16 | 751         | 708       | 【Community】 |
| Qwen1.5  | 72B        | 8K       | 8x8     | BF16 | 301         | 317       | 【Ascend】    |
| Yi       | 34B        | 4K       | 2x8     | BF16 | 809         | 730       | 【Community】 |
| Mixtral  | 8x7B       | 32K      | 8x8     | BF16 | 702         | 837       | 【Ascend】    |
| Mistral  | 7B         | 32K      | 1x8     | BF16 | 2806        | 2734      | 【Ascend】    |
| Gemma    | 2B         | 8K       | 1x8     | BF16 | 6821        | 7602      | 【Ascend】    |
| Gemma    | 7B         | 8K       | 1x8     | BF16 | 2938        | 2607      | 【Ascend】    |
| GPT3     | 175B       | 2K       | 16x8    | FP16 | 153         | --        | 【Community】 |
| GPT3     | 15B        | 2K       | 1x8     | FP16 | 1890        | 1840      | 【Community】 |
| Grok1    | 40B        | 8K       | 2x8     | BF16 | 1646        | 2057      | 【Ascend】    |
---

## Acceleration Features

ModelLink supports various acceleration algorithms such as tensor parallelism, pipeline parallelism, context parallelism, sequence parallelism, recomputation, the distributed optimizer, and more. The table below lists the switch that enables each acceleration feature:
| Scenario              | Feature                                    | Arguments                         | Mcore Support | Legacy Support |
|-----------------------|--------------------------------------------|-----------------------------------|---------------|----------------|
| PTD Parallel          | Tensor Parallel                            | `--tensor-model-parallel-size`    | Yes           | Yes            |
| PTD Parallel          | Pipeline Parallel                          | `--pipeline-model-parallel-size`  | Yes           | Yes            |
| PTD Parallel          | Dynamic division for PP                    | `--num-layer-list`                | Yes           | Yes            |
| PTD Parallel          | Sequence Parallel                          | `--sequence-parallel`             | Yes           | Yes            |
| PTD Parallel          | Distributed Optimizer                      | `--use-distributed-optimizer`     | Yes           | Yes            |
| Context Parallel      | Context Parallel                           | `--context-parallel-size`         | Yes           | No             |
| Context Parallel      | Various CP Algorithms                      | `--context-parallel-algo`         | Yes           | No             |
| Context Parallel      | Send/Recv Overlap                          | `--cp-send-recv-overlap`          | Yes           | No             |
| MoE Parallel          | MoE Parallel                               | `--expert-model-parallel-size`    | Yes           | No             |
| MoE Parallel          | MoE permutation communication optimization | `--moe-permutation-async-comm`    | Yes           | No             |
| Memory Optimization   | Re-computation                             | `--recompute-granularity`         | No            | Yes            |
| Fused Kernels         | Flash Attention                            | `--use-flash-attn`                | Yes           | Yes            |
| Fused Kernels         | Fused RMSNorm                              | `--use-fused-rmsnorm`             | Yes           | Yes            |
| Fused Kernels         | Fused SwiGLU                               | `--use-fused-swiglu`              | Yes           | Yes            |
| Fused Kernels         | Fused Rotary Position Embedding            | `--use-fused-rotary-pos-emb`      | Yes           | Yes            |
| Fused Kernels         | Sliding Window Attention                   | `--sliding-window`                | Yes           | Yes            |
| Communication Overlap | Grad Reduce                                | `--overlap-grad-reduce`           | Yes           | Yes            |
| Communication Overlap | Overlap Param Gather                       | `--overlap-param-gather`          | Yes           | No             |
| Communication Overlap | MC2                                        | `--use-mc2`                       | Yes           | Yes            |
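The Mcore-only switches in the table (context parallel and MoE parallel) are passed to the launch script the same way as the others. The sketch below is illustrative only: the parallel sizes are placeholder values, not settings taken from this README.

```shell
#!/usr/bin/env bash
# Illustrative sketch: composing the Mcore-only switches from the table above
# into one argument list. The sizes here are placeholders, not recommendations.
CP_MOE_ARGS=(
    --context-parallel-size 2        # split long sequences across 2 ranks
    --cp-send-recv-overlap           # overlap CP send/recv with computation
    --expert-model-parallel-size 2   # place MoE experts across 2 ranks
    --moe-permutation-async-comm     # async MoE permutation communication
)
# These would be appended to the torchrun command line, e.g.:
echo "pretrain_gpt.py ${CP_MOE_ARGS[*]}"
```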
The script below shows how these switches are combined in a training launch command:

```bash
torchrun $DISTRIBUTED_ARGS pretrain_gpt.py \
    --tensor-model-parallel-size ${TP} \
    --pipeline-model-parallel-size ${PP} \
    --num-layer-list 1,2,2,2,1 \
    --sequence-parallel \
    --recompute-granularity full \
    --recompute-method block \
    --recompute-num-layers 72 \
    --use-distributed-optimizer \
    --use-flash-attn \
    --use-fused-rmsnorm \
    --use-fused-swiglu \
    --overlap-grad-reduce \
    --use-fused-rotary-pos-emb \
    --use-mc2 \
    --sliding-window 4096 \
    ... \
    ...
```

Note: to enable MC2, ensure the following:

1. The environment version matches the description on the repository homepage.
2. Comment out line 431 in the `validate_args_decorator` function within `modellink/arguments.py`: `#args.use_mc2 = False`

---

## Analyze profiling data based on Ascend chips

ModelLink supports analyzing profiling data on Ascend chips, which is useful for performance modelling:

```bash
--profile                          # enable profiling
--profile-step-start 5             # the start step
--profile-step-end 6               # the end step
--profile-ranks 0 1 2 3 4          # ranks to profile; the default value of -1 profiles all ranks
--profile-level level2             # profiling level: level0, level1, or level2
--profile-with-cpu                 # profile CPU information
--profile-with-stack               # profile stack information
--profile-with-memory              # profile memory information
--profile-record-shapes            # profile shape information
--profile-save-path ./profile_dir  # path to save the profiling data
```

## Enable deterministic computing based on Ascend chips

- Add the option to the script:

```shell
--use-deter-comp
```

- Add the environment variable:

```shell
export HCCL_DETERMINISTIC=True
```

---

## Enable high availability features based on Ascend chips

The motivation for the distributed optimizer is to save memory by distributing the optimizer state evenly across data parallel ranks. Building on this idea, a scheme is designed that divides the data parallel ranks into two replica data parallel ranks. The replica optimizer distributes the optimizer state evenly across replica data parallel ranks, so that the optimizer state is
backed up. The following functions can be implemented based on the Huawei-developed HA framework:

1. During training, the last checkpoint can be saved in a fault scenario, ensuring zero loss of the training result.
2. During training, UCE fault detection of the HBM is supported, and online repair is completed to achieve step-level recomputation.

When the HA feature is enabled, the static memory used by the replica optimizer increases. The theoretical number of bytes per parameter is as follows (where `d` is the data parallel size):

|                                  | Non-distributed optim | Distributed optim | Replica optim |
|----------------------------------|-----------------------|-------------------|---------------|
| fp16/bf16 param, fp16/bf16 grads | 20                    | 4 + 16/d          | 4 + 32/d      |
| fp16/bf16 param, fp32 grads      | 18                    | 6 + 12/d          | Supporting    |
| fp32 param, fp32 grads           | 16                    | 8 + 8/d           | Supporting    |

- Add the options below to the script, and install the Huawei-developed HA framework `mindio_ttp.whl`.
- mindio_ttp documentation: https://www.hiascend.com/document/detail/zh/mindx-dl/60rc1/mindio/mindiottp

```shell
--enable-high-availability           # enable the high availability feature
--enable-optimizer-state-local-copy  # also save the last step's optimizer state; memory increases further. Disabled by default.
```

---

## Acknowledgments

ModelLink is jointly contributed to by the following departments of Huawei Corporation:

- Ascend Computing Product Unit
- Algorithm Unit of Computing Product Unit
- Research Unit of Computing Product Unit
- Open Computing Kit of Computing Product Unit
- General Development Department
- Global Technical Service Department

We appreciate every PR from the community, and contributions to ModelLink are welcome.

## Appendix

- Safety Statement: [Safety Statement](https://gitee.com/ascend/ModelLink/wikis/%E5%AE%89%E5%85%A8%E7%9B%B8%E5%85%B3/%E5%AE%89%E5%85%A8%E5%A3%B0%E6%98%8E)
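As a quick sanity check of the bytes-per-parameter table in the high-availability section above, the sketch below evaluates the three formulas for the fp16/bf16 param, fp16/bf16 grads row. The data parallel size `d = 8` is just an example value, not one prescribed by this README.

```shell
#!/usr/bin/env bash
# Worked example of the bytes-per-parameter formulas for the
# fp16/bf16 param + fp16/bf16 grads row, at an example d=8.
d=8
non_distributed=20            # full optimizer state on every rank
distributed=$((4 + 16 / d))   # 4 + 16/8 = 6 bytes per parameter
replica=$((4 + 32 / d))       # 4 + 32/8 = 8 bytes per parameter (state backed up)
echo "non-distributed=${non_distributed} distributed=${distributed} replica=${replica}"
# prints: non-distributed=20 distributed=6 replica=8
```

As the table suggests, the replica optimizer's backup copy costs an extra 16/d bytes per parameter over the plain distributed optimizer, which shrinks as the data parallel size grows.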