# ModelLink

**Repository Path**: edwardvitor/ModelLink

## Basic Information

- **Project Name**: ModelLink
- **Description**: Huawei Ascend large model repository
- **Primary Language**: Python
- **License**: MIT
- **Default Branch**: master
- **Homepage**: https://gitee.com/ascend/ModelLink/
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 1210
- **Created**: 2024-07-29
- **Last Updated**: 2024-07-29

## Categories & Tags

**Categories**: Uncategorized

**Tags**: None

## README


ModelLink aims to provide end-to-end large language model solutions for Huawei Ascend chips, including models, algorithms, and downstream tasks.

---

## ModelLink Solution Overview

### Supported features

ModelLink currently supports the following features for large-model workflows:

* [Dataset Preparation for Pre-training](#jump11) / [Fine-tuning Instruction Dataset Preparation](#jump12)
* [Pre-training](#jump13) / [Full-parameter Fine-tuning](#jump14) / [Low-parameter Fine-tuning](#jump15)
* [Inference: human-machine dialogue](#jump16)
* [Evaluation with numerous benchmarks](#jump17)
* [Utilizing Acceleration Features (Acceleration Algorithms + Fusion Operators)](#jump18)
* [Profiling data based on Ascend chips](#jump19)
* [Converting checkpoints between Hugging Face and Megatron formats](#jump19)
* [Enabling deterministic computing on Ascend chips](#jump21)
* [Enabling high-availability features based on Ascend chips](#jump22)

More novel and useful features for LLM training on Ascend are under development...

---

## ModelLink Maintenance Policies

Each ModelLink release passes through the following five maintenance phases:

| **Status**        | **Duration** | **Description** |
|-------------------|--------------|-----------------|
| Planning          | 1-3 months   | Plan features. |
| Development       | 3 months     | Develop features. |
| Maintained        | 6-12 months  | Incorporate all resolved issues and release the version. Different versions of ModelLink adopt different support policies: the maintenance periods for Regular Releases and Long-Term Support versions are 6 and 12 months, respectively. |
| Unmaintained      | 0-3 months   | Incorporate all resolved issues. No dedicated maintenance personnel are available and no version will be released. |
| End Of Life (EOL) | N/A          | Do not accept any modification to the branch. |
The maintenance policy for released ModelLink versions is as follows:

| **ModelLink** | **Maintenance Policy** | **Status** | **Launch Date** | **Subsequent Status** | **EOL Date** |
|---------------|------------------------|------------|-----------------|------------------------------------|--------------|
| 1.0           | Regular Release        | Maintained | 2024/03/30      | Unmaintained, estimated 2024/09/30 |              |
| 1.1           | Regular Release        | Maintained | 2024/06/30      | Unmaintained, estimated 2024/12/30 |              |

---

### Supported versions and models

【Please note the environment versions required to use the models, listed below.】

| Software  | [Version](https://www.hiascend.com/zh/) |
| :-------: | :-------------------------------------: |
| Python    | 3.8                                     |
| driver    | under development                       |
| firmware  | under development                       |
| CANN      | under development                       |
| torch     | 2.1.0, 2.2.0                            |
| torch_npu | under development                       |

【Based on the current version of Megatron, the performance statistics from our testing are as follows (hardware: Atlas 900 A2 PODc).】

For the models listed below, we provide training scripts and README instructions in the `examples` folder, covering the full process of model training, inference, and evaluation. The figures below were measured with the MC2 acceleration feature enabled. MC2 is only supported on versions above 24RC2 and is disabled by default in the repository code; to use it, please refer to the `Acceleration Features` chapter on the home page.
| Model    | Parameters | Sequence | Cluster | Mode | Performance | Reference | Contributor |
|----------|------------|----------|---------|------|-------------|-----------|-------------|
| Aquila   | 7B         | 2K       | 1x8     | BF16 | 2849        | 2874      | 【Ascend】    |
| Aquila2  | 7B         | 2K       | 1x8     | FP16 | 3323        | 2673      | 【Community】 |
| Aquila2  | 34B        | 4K       | 2x8     | BF16 | 854         | 732       | 【Community】 |
| Baichuan | 7B         | 4K       | 1x8     | FP16 | 2685        | 2036      | 【Ascend】    |
| Baichuan | 13B        | 4K       | 1x8     | FP16 | 1213        | 862       | 【Ascend】    |
| Baichuan2 | 7B        | 4K       | 1x8     | BF16 | 2664        | 3969      | 【Ascend】    |
| Baichuan2 | 13B       | 4K       | 1x8     | BF16 | 1668        | 2062      | 【Ascend】    |
| Bloom    | 7B1        | 2K       | 1x8     | FP16 | 2034        | 2525      | 【Ascend】    |
| Bloom    | 176B       | 2K       | 12x8    | BF16 | 100         | 107       | 【Ascend】    |
| ChatGLM3 | 6B         | 8K       | 1x8     | FP16 | 4297        | 4267      | 【Community】 |
| CodeLlama | 34B       | 4K       | 2x8     | BF16 | 837         | 762       | 【Community】 |
| InternLM | 7B         | 2K       | 1x8     | BF16 | 2776        | 2854      | 【Ascend】    |
| InternLM | 65B        | 2K       | 4x8     | BF16 | 341         | 414       | 【Ascend】    |
| LLaMA    | 7B         | 2K       | 1x8     | FP16 | 3600        | 3804      | 【Ascend】    |
| LLaMA    | 13B        | 2K       | 1x8     | FP16 | 1895        | 2012      | 【Ascend】    |
| LLaMA    | 33B        | 2K       | 4x8     | FP16 | 621         | 776       | 【Ascend】    |
| LLaMA    | 65B        | 2K       | 4x8     | BF16 | 348         | 426       | 【Ascend】    |
| LLaMA2   | 7B         | 4K       | 1x8     | BF16 | 4200        | 3850      | 【Ascend】    |
| LLaMA2   | 13B        | 4K       | 1x8     | BF16 | 1990        | 1920      | 【Ascend】    |
| LLaMA2   | 34B        | 4K       | 2x8     | BF16 | 749         | 796       | 【Ascend】    |
| LLaMA2   | 70B        | 4K       | 4x8     | BF16 | 420         | 430       | 【Ascend】    |
| LLaMA3   | 8B         | 8K       | 1x8     | BF16 | 2483        | 2674      | 【Ascend】    |
| LLaMA3   | 70B        | 8K       | 8x8     | BF16 | 283         | 355       | 【Ascend】    |
| Qwen     | 7B         | 8K       | 1x8     | BF16 | 2499        | 2867      | 【Ascend】    |
| Qwen     | 14B        | 2K       | 1x8     | BF16 | 1560        | 1578      | 【Ascend】    |
| Qwen     | 72B        | 8K       | 16x8    | BF16 | 285         | 345       | 【Ascend】    |
| Qwen1.5  | 0.5B       | 8K       | 1x8     | BF16 | 22834       | 25306     | 【Community】 |
| Qwen1.5  | 1.8B       | 8K       | 1x8     | BF16 | 13029       | 12181     | 【Community】 |
| Qwen1.5  | 4B         | 8K       | 1x8     | BF16 | 5033        | 5328      | 【Community】 |
| Qwen1.5  | 7B         | 8K       | 1x8     | BF16 | 2862        | 2621      | 【Community】 |
| Qwen1.5  | 14B        | 8K       | 1x8     | BF16 | 1717        | 1702      | 【Community】 |
| Qwen1.5  | 32B        | 8K       | 4x8     | BF16 | 751         | 708       | 【Community】 |
| Qwen1.5  | 72B        | 8K       | 8x8     | BF16 | 301         | 317       | 【Ascend】    |
| Yi       | 34B        | 4K       | 2x8     | BF16 | 809         | 730       | 【Community】 |
| Mixtral  | 8x7B       | 32K      | 8x8     | BF16 | 702         | 837       | 【Ascend】    |
| Mistral  | 7B         | 32K      | 1x8     | BF16 | 2806        | 2734      | 【Ascend】    |
| Gemma    | 2B         | 8K       | 1x8     | BF16 | 6821        | 7602      | 【Ascend】    |
| Gemma    | 7B         | 8K       | 1x8     | BF16 | 2938        | 2607      | 【Ascend】    |
| GPT3     | 175B       | 2K       | 16x8    | FP16 | 153         | --        | 【Community】 |
| GPT3     | 15B        | 2K       | 1x8     | FP16 | 1890        | 1840      | 【Community】 |
| Grok1    | 40B        | 8K       | 2x8     | BF16 | 1646        | 2057      | 【Ascend】    |
---

## Acceleration Features

ModelLink supports various acceleration algorithms such as tensor parallelism, pipeline parallelism, context parallelism, sequence parallelism, recomputation, the distributed optimizer, and more. The table below lists the switch that enables each acceleration feature:
| Scenario              | Feature                                    | Arguments                         | Mcore Support | Legacy Support |
|-----------------------|--------------------------------------------|-----------------------------------|---------------|----------------|
| PTD Parallel          | Tensor Parallel                            | `--tensor-model-parallel-size`    | Yes           | Yes            |
| PTD Parallel          | Pipeline Parallel                          | `--pipeline-model-parallel-size`  | Yes           | Yes            |
| PTD Parallel          | Dynamic division for PP                    | `--num-layer-list`                | Yes           | Yes            |
| PTD Parallel          | Sequence Parallel                          | `--sequence-parallel`             | Yes           | Yes            |
| PTD Parallel          | Distributed Optimizer                      | `--use-distributed-optimizer`     | Yes           | Yes            |
| Context Parallel      | Context Parallel                           | `--context-parallel-size`         | Yes           | No             |
| Context Parallel      | Various CP Algorithms                      | `--context-parallel-algo`         | Yes           | No             |
| Context Parallel      | Send/Recv Overlap                          | `--cp-send-recv-overlap`          | Yes           | No             |
| MoE Parallel          | MoE Parallel                               | `--expert-model-parallel-size`    | Yes           | No             |
| MoE Parallel          | MoE permutation communication optimization | `--moe-permutation-async-comm`    | Yes           | No             |
| Memory Optimization   | Re-computation                             | `--recompute-granularity`         | No            | Yes            |
| Fused Kernels         | Flash Attention                            | `--use-flash-attn`                | Yes           | Yes            |
| Fused Kernels         | Fused RMSNorm                              | `--use-fused-rmsnorm`             | Yes           | Yes            |
| Fused Kernels         | Fused SwiGLU                               | `--use-fused-swiglu`              | Yes           | Yes            |
| Fused Kernels         | Fused Rotary Position Embedding            | `--use-fused-rotary-pos-emb`      | Yes           | Yes            |
| Fused Kernels         | Sliding Window Attention                   | `--sliding-window`                | Yes           | Yes            |
| Communication Overlap | Grad Reduce                                | `--overlap-grad-reduce`           | Yes           | Yes            |
| Communication Overlap | Overlap Param Gather                       | `--overlap-param-gather`          | Yes           | No             |
| Communication Overlap | MC2                                        | `--use-mc2`                       | Yes           | Yes            |
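The Mcore-only switches in the table (context parallel and MoE parallel) are passed to the launch script the same way as the others. The sketch below is illustrative only: the parallel sizes are placeholder values, not settings taken from this README.

```shell
#!/usr/bin/env bash
# Illustrative sketch: composing the Mcore-only switches from the table above
# into one argument list. The sizes here are placeholders, not recommendations.
CP_MOE_ARGS=(
    --context-parallel-size 2        # split long sequences across 2 ranks
    --cp-send-recv-overlap           # overlap CP send/recv with computation
    --expert-model-parallel-size 2   # place MoE experts across 2 ranks
    --moe-permutation-async-comm     # async MoE permutation communication
)
# These would be appended to the torchrun command line, e.g.:
echo "pretrain_gpt.py ${CP_MOE_ARGS[*]}"
```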
The script below shows how these switches are combined in a training launch command:

```bash
torchrun $DISTRIBUTED_ARGS pretrain_gpt.py \
    --tensor-model-parallel-size ${TP} \
    --pipeline-model-parallel-size ${PP} \
    --num-layer-list 1,2,2,2,1 \
    --sequence-parallel \
    --recompute-granularity full \
    --recompute-method block \
    --recompute-num-layers 72 \
    --use-distributed-optimizer \
    --use-flash-attn \
    --use-fused-rmsnorm \
    --use-fused-swiglu \
    --overlap-grad-reduce \
    --use-fused-rotary-pos-emb \
    --use-mc2 \
    --sliding-window 4096 \
    ... \
    ...
```

Note: to enable MC2, ensure the following:

1. The environment version matches the description on the repository homepage.
2. Comment out line 431 in the `validate_args_decorator` function within `modellink/arguments.py`: `#args.use_mc2 = False`

---

## Analyze profiling data based on Ascend chips

ModelLink supports analyzing profiling data on Ascend chips, which is useful for performance modelling:

```bash
--profile                          # enable profiling
--profile-step-start 5             # the start step
--profile-step-end 6               # the end step
--profile-ranks 0 1 2 3 4          # ranks to profile; the default value of -1 profiles all ranks
--profile-level level2             # profiling level: level0, level1, or level2
--profile-with-cpu                 # profile CPU information
--profile-with-stack               # profile stack information
--profile-with-memory              # profile memory information
--profile-record-shapes            # profile shape information
--profile-save-path ./profile_dir  # path to save the profiling data
```

## Enable deterministic computing based on Ascend chips

- Add the option to the script:

```shell
--use-deter-comp
```

- Add the environment variable:

```shell
export HCCL_DETERMINISTIC=True
```

---

## Enable high availability features based on Ascend chips

The motivation for the distributed optimizer is to save memory by distributing the optimizer state evenly across data parallel ranks. Building on this idea, a scheme is designed that divides the data parallel ranks into two replica data parallel ranks. The replica optimizer distributes the optimizer state evenly across replica data parallel ranks, so that the optimizer state is
backed up. The following functions can be implemented based on the Huawei-developed HA framework:

1. During training, the last checkpoint can be saved in a fault scenario, ensuring zero loss of the training result.
2. During training, UCE fault detection of the HBM is supported, and online repair is completed to achieve step-level recomputation.

When the HA feature is enabled, the static memory used by the replica optimizer increases. The theoretical number of bytes per parameter is as follows (where `d` is the data parallel size):

|                                  | Non-distributed optim | Distributed optim | Replica optim |
|----------------------------------|-----------------------|-------------------|---------------|
| fp16/bf16 param, fp16/bf16 grads | 20                    | 4 + 16/d          | 4 + 32/d      |
| fp16/bf16 param, fp32 grads      | 18                    | 6 + 12/d          | Supporting    |
| fp32 param, fp32 grads           | 16                    | 8 + 8/d           | Supporting    |

- Add the options below to the script, and install the Huawei-developed HA framework `mindio_ttp.whl`.
- mindio_ttp documentation: https://www.hiascend.com/document/detail/zh/mindx-dl/60rc1/mindio/mindiottp

```shell
--enable-high-availability           # enable the high availability feature
--enable-optimizer-state-local-copy  # also save the last step's optimizer state; memory increases further. Disabled by default.
```

---

## Acknowledgments

ModelLink is jointly contributed to by the following departments of Huawei Corporation:

- Ascend Computing Product Unit
- Algorithm Unit of Computing Product Unit
- Research Unit of Computing Product Unit
- Open Computing Kit of Computing Product Unit
- General Development Department
- Global Technical Service Department

We appreciate every PR from the community, and contributions to ModelLink are welcome.

## Appendix

- Safety Statement: [Safety Statement](https://gitee.com/ascend/ModelLink/wikis/%E5%AE%89%E5%85%A8%E7%9B%B8%E5%85%B3/%E5%AE%89%E5%85%A8%E5%A3%B0%E6%98%8E)
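As a quick sanity check of the bytes-per-parameter table in the high-availability section above, the sketch below evaluates the three formulas for the fp16/bf16 param, fp16/bf16 grads row. The data parallel size `d = 8` is just an example value, not one prescribed by this README.

```shell
#!/usr/bin/env bash
# Worked example of the bytes-per-parameter formulas for the
# fp16/bf16 param + fp16/bf16 grads row, at an example d=8.
d=8
non_distributed=20            # full optimizer state on every rank
distributed=$((4 + 16 / d))   # 4 + 16/8 = 6 bytes per parameter
replica=$((4 + 32 / d))       # 4 + 32/8 = 8 bytes per parameter (state backed up)
echo "non-distributed=${non_distributed} distributed=${distributed} replica=${replica}"
# prints: non-distributed=20 distributed=6 replica=8
```

As the table suggests, the replica optimizer's backup copy costs an extra 16/d bytes per parameter over the plain distributed optimizer, which shrinks as the data parallel size grows.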