
Best practice: accelerating Transformer model training

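PAI-Rapidformer provides a rich set of methods for accelerating model training. After you install the dedicated Rapidformer image, you can optimize model training in a black-box or white-box manner. This topic describes how to use Rapidformer to optimize the training of PyTorch Transformer models.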

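Prerequisites

Background information

Rapidformer can accelerate model training in a black-box or white-box manner:

Black-box acceleration: fine-tuning a Hugging Face model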

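  1. Register your dataset with Hugging Face, or find an existing dataset to use. You pass it to Rapidformer later through the --dataset-name switch.

  2. Register your model with Hugging Face, or use an existing model. You pass it to Rapidformer later through the --pretrained-model-name-or-path switch.

  3. Configure the Rapidformer CLI command that starts training, as in the following example.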

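    #!/bin/bash
    export CUDA_VISIBLE_DEVICES=4,5,6,7
    export MASTER_ADDR=localhost
    export MASTER_PORT=6010
    export NNODES=1
    export NODE_RANK=0

    # A comment cannot follow a line-continuation backslash in shell, so the
    # switches are described here instead of inline:
    #   --task                            task name
    #   --pretrained-model-name-or-path   name of the registered model
    #   --data-path                       name of the registered dataset path
    #   --data-name                       name of the registered dataset file
    #   --epochs                          number of training epochs
    #   --micro-batch-size                batch size on each GPU
    #   --global-batch-size               total batch size across distributed training
    #   --lr                              learning rate
    #   --lr-decay-style                  learning rate decay policy
    #   --lr-warmup-iters                 number of learning rate warmup steps
    #   --weight-decay                    weight decay coefficient
    #   --clip-grad                       gradient clipping coefficient
    #   --seed                            random seed
    #   --mixed-precision                 enables mixed-precision training
    #   --onnx-runtime-training           enables computation graph optimization
    #   --zero-1-memory-optimization      enables optimizer state partitioning
    rapidformer --task sequence_classification \
                --pretrained-model-name-or-path 'bert-base-cased' \
                --data-path glue \
                --data-name mrpc \
                --epochs 3 \
                --micro-batch-size 16 \
                --global-batch-size 64 \
                --lr 2e-5 \
                --lr-decay-style linear \
                --lr-warmup-iters 100 \
                --weight-decay 1e-2 \
                --clip-grad 1.0 \
                --seed 42 \
                --mixed-precision \
                --onnx-runtime-training \
                --zero-1-memory-optimization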

    For details about each parameter, see the parameter configuration guide.
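
The fine-tuning command above combines --lr 2e-5, --lr-warmup-iters 100, and --lr-decay-style linear. The following is a minimal sketch of what a linear warmup followed by a linear decay usually computes. The function name, the `total_iters` argument, and the exact formula are assumptions made for illustration; Rapidformer's own scheduler may differ in details such as the first warmup value.

```python
def linear_warmup_linear_decay(step, peak_lr, warmup_iters, total_iters, min_lr=0.0):
    """Return the learning rate for a 0-based training step.

    Hypothetical helper for illustration only; it is not part of Rapidformer.
    """
    if warmup_iters > 0 and step < warmup_iters:
        # Warmup phase: ramp linearly up to peak_lr over warmup_iters steps.
        return peak_lr * (step + 1) / warmup_iters
    if step >= total_iters:
        # Past the end of the schedule: stay at the floor.
        return min_lr
    # Decay phase: interpolate linearly from peak_lr down to min_lr.
    remaining = (total_iters - step) / (total_iters - warmup_iters)
    return min_lr + (peak_lr - min_lr) * remaining


# total_iters=1000 is an assumed value, not taken from the command above.
schedule = [linear_warmup_linear_decay(s, 2e-5, 100, 1000) for s in (0, 99, 550, 1000)]
```

With these assumed settings, the rate reaches its peak at the last warmup step, is half the peak midway through the decay phase, and is zero at the final step.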

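Black-box acceleration: pretraining a Hugging Face model

  1. Create a pretraining dataset of the mmap type.

    For details, see the Megatron data processing script. The following command is an example of building an mmap dataset.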

    python preprocess_data.py \
      --input book_wiki_owtv2_small.json  \
      --output-prefix gpt_small \
      --vocab gpt2-vocab.json \
      --dataset-impl mmap \
      --tokenizer-type GPT2BPETokenizer \
      --merge-file gpt2-merges.txt \
      --append-eod
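  2. Register your model with Hugging Face, or use an existing model. You pass it to Rapidformer later through the --pretrained-model-name-or-path switch.

  3. Configure the Rapidformer CLI command that starts training, as in the following example.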

    #!/bin/bash
    export CUDA_VISIBLE_DEVICES=4,5,6,7
    export MASTER_ADDR=localhost
    export MASTER_PORT=6010
    export NNODES=1
    export NODE_RANK=0

    # A comment cannot follow a line-continuation backslash in shell, so the
    # annotated switches are described here instead of inline:
    #   --global-batch-size 128       enables gradient accumulation
    #   --mixed-precision             enables mixed-precision training
    #   --onnx-runtime-training       enables computation graph optimization
    #   --fsdp-memory-optimization    enables model state partitioning
    rapidformer --task pretraining \
           --pretrained-model-name-or-path 'bert-base-uncased' \
           --num-layers 12 \
           --hidden-size 768 \
           --num-attention-heads 12 \
           --micro-batch-size 16 \
           --global-batch-size 128 \
           --seq-length 512 \
           --tokenizer-type BertWordPieceLowerCase \
           --max-position-embeddings 512 \
           --train-iters 100 \
           --data-path book_wiki_owtv2_small_text_sentence \
           --vocab-file bert-en-uncased-vocab.txt  \
           --data-impl mmap \
           --split 980,20 \
           --lr 1e-3 \
           --lr-decay-style linear \
           --min-lr 0.0 \
           --lr-decay-iters 2000 \
           --weight-decay 1e-2 \
           --clip-grad 1.0 \
           --lr-warmup-fraction .01 \
           --mixed-precision \
           --onnx-runtime-training \
           --fsdp-memory-optimization

    For details about each parameter, see the parameter configuration guide.
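
The pretraining command above sets --micro-batch-size 16 and --global-batch-size 128 on four GPUs, and notes that this enables gradient accumulation. The sketch below shows the arithmetic under the Megatron-style convention that the global batch equals the micro batch times the number of data-parallel replicas times the number of accumulation steps. That convention, the function name, and its parameters are assumptions for illustration, not a documented Rapidformer API.

```python
def gradient_accumulation_steps(global_batch_size, micro_batch_size, world_size,
                                tensor_parallel=1, pipeline_parallel=1):
    """Number of micro-batches each replica accumulates per optimizer step.

    Hypothetical helper; assumes
    global = micro * data_parallel * accumulation_steps.
    """
    model_parallel = tensor_parallel * pipeline_parallel
    if world_size % model_parallel != 0:
        raise ValueError("world_size must be divisible by the model-parallel size")
    # GPUs that are not used for model parallelism hold data-parallel replicas.
    data_parallel = world_size // model_parallel
    samples_per_pass = micro_batch_size * data_parallel
    if global_batch_size % samples_per_pass != 0:
        raise ValueError("global batch must be a multiple of micro batch * data-parallel size")
    return global_batch_size // samples_per_pass


# Four GPUs, no model parallelism: 128 / (16 * 4) = 2 accumulation steps.
steps = gradient_accumulation_steps(128, 16, world_size=4)
```

Under the same assumption, a global batch of 64 on the same four GPUs needs no accumulation, because one forward and backward pass per GPU already covers 64 samples.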

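White-box acceleration: fine-tuning a Hugging Face model with the Finetuner code template

This section describes how to use the Finetuner code template provided by Rapidformer to quickly build a Hugging Face fine-tuning task. Four functions in the template need attention:

  • train_valid_test_datasets_provider, which creates the datasets

  • model_optimizer_lr_scheduler_provider, which builds the model, optimizer, and learning rate scheduler

  • run_forward_step, which defines the forward computation logic

  • run_compute_metrics, which evaluates during training to compute accuracy

For details about these four functions, see the Rapidformer API. The following template briefly describes their inputs and outputs.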

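from rapidformer import Finetuner

class MyFintuner(Finetuner):

    def __init__(self, engine):
        super().__init__(engine=engine)

    # Get the training, validation, and test datasets.
    # Input: none
    # Output: three dataset objects and one collate function
    def train_valid_test_datasets_provider(self):

        return train_dataset, valid_dataset, test_dataset, collate_fn

    # Create the model, optimizer, and learning rate scheduler.
    # Input: none
    # Output: three objects
    def model_optimizer_lr_scheduler_provider(self):

        return model, optimizer, lr_scheduler

    # Write the forward logic.
    # Input: a batch or an iterator, and the model
    # Output: loss
    def run_forward_step(self, batch_or_iterator, model):
        return loss

    # Write the evaluation logic for the validation set (fine-tuning only).
    # Input: the model and the validation data loader
    # Output: a metric object
    def run_compute_metrics(self, model, eval_dataloader):
        return metric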
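
To make the division of labor between the four hooks concrete, the following toy shows one plausible order in which a base class could call them. It is a sketch only: `ToyFinetuner` and `MeanPredictor` are hypothetical names, the loop leaves out the backward pass, optimizer step, and distributed logic, and nothing here describes how Rapidformer itself is implemented.

```python
class ToyFinetuner:
    """Hypothetical stand-in for a template base class; not Rapidformer code."""

    def train(self, epochs=1):
        # 1. Ask the subclass for data and a collate function.
        train_ds, valid_ds, _test_ds, collate_fn = self.train_valid_test_datasets_provider()
        # 2. Ask the subclass for the model (optimizer and scheduler may be None).
        model, _optimizer, _lr_scheduler = self.model_optimizer_lr_scheduler_provider()
        history = []
        for _ in range(epochs):
            # 3. One forward step per single-example batch.
            losses = [self.run_forward_step(collate_fn([ex]), model) for ex in train_ds]
            # 4. Evaluate on the validation batches after each epoch.
            eval_batches = [collate_fn([ex]) for ex in valid_ds]
            metric = self.run_compute_metrics(model, eval_batches)
            history.append((sum(losses) / len(losses), metric))
        return history


class MeanPredictor(ToyFinetuner):
    """Toy subclass: the 'model' always predicts one constant value."""

    def train_valid_test_datasets_provider(self):
        return [1.0, 2.0, 3.0], [2.0], [], lambda examples: examples

    def model_optimizer_lr_scheduler_provider(self):
        return {"guess": 2.0}, None, None

    def run_forward_step(self, batch, model):
        # Mean squared error of the constant guess on this batch.
        return sum((x - model["guess"]) ** 2 for x in batch) / len(batch)

    def run_compute_metrics(self, model, eval_dataloader):
        errors = [abs(x - model["guess"]) for batch in eval_dataloader for x in batch]
        return {"mae": sum(errors) / len(errors)}


history = MeanPredictor().train(epochs=1)
```

The subclass only describes what the data, model, loss, and metric are; the base class decides when each hook runs, which is the property that lets a framework add acceleration without changes to user code.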
                

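After you are familiar with the template above, prepare the dataset and model as described in "Black-box acceleration: fine-tuning a Hugging Face model", and then perform the following steps.

  1. Import the Rapidformer and Hugging Face interfaces.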

    import torch
    # The source writes this module as "transformers/easytexmier"; the Hugging Face
    # transformers package is used here.
    from transformers import AutoConfig, AutoTokenizer, BertForSequenceClassification
    from datasets import load_dataset, load_metric
    from rapidformer import RapidformerEngine
    from rapidformer import get_args
    from rapidformer import get_logger
    from rapidformer import get_timers
    from rapidformer import Finetuner
    from rapidformer import Pretrainer
    from rapidformer import build_train_valid_test_datasets_for_huggingface
  2. Complete the four functions in the code template, as shown below.

    class MyFintuner(Finetuner):
        def __init__(self, engine):
            super().__init__(engine=engine)

        def train_valid_test_datasets_provider(self):
            # Command-line arguments parsed by Rapidformer.
            args = get_args()
            tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

            def tokenize_function(examples):
                # max_length=None => use the model max length (it's actually the default)
                outputs = tokenizer(examples["sentence1"], examples["sentence2"], truncation=True, max_length=None)
                return outputs

            datasets = load_dataset(args.dataset_path, args.dataset_name)
            # Apply the method we just defined to all the examples in all the splits of the dataset
            tokenized_datasets = datasets.map(
                tokenize_function,
                batched=True,
                remove_columns=["idx", "sentence1", "sentence2"],
            )
            # The model expects the target column to be named "labels".
            tokenized_datasets = tokenized_datasets.rename_column("label", "labels")

            train_dataset = tokenized_datasets["train"]
            valid_dataset = tokenized_datasets['validation']
            test_dataset = tokenized_datasets['test']

            def collate_fn(examples):
                # Pad every example in the batch to the longest one and return tensors.
                return tokenizer.pad(examples, padding="longest", return_tensors="pt")

            return train_dataset, valid_dataset, test_dataset, collate_fn

        def model_optimizer_lr_scheduler_provider(self):
            args = get_args()
            model = BertForSequenceClassification.from_pretrained(args.load)
            return model, None, None

        def run_forward_step(self, batch, model):
            output_tensor = model(**batch)
            return output_tensor.loss

        # after each epoch run metric on eval dataset
        def run_compute_metrics(self, model, eval_dataloader):
            args = get_args()
            model = model[0]
            metric = load_metric(args.dataset_path, args.dataset_name)
            for step, batch in enumerate(eval_dataloader):
                with torch.no_grad():
                    outputs = model(**batch)
                predictions = outputs.logits.argmax(dim=-1)

                metric.add_batch(
                    predictions=self.gather(predictions),
                    references=self.gather(batch["labels"]),
                )

            eval_metric = metric.compute()
            return eval_metric
                            
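  3. Initialize the Rapidformer engine, create the trainer object, and call the finetune() method (written as trainer.train() in the example below). Save the code as a file named rapidformer_finetune_huggingface_bert_trainer.py.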

    engine = RapidformerEngine()
    trainer = MyFintuner(engine=engine)
    trainer.train()
  4. Prepare a CLI-based startup script. Set --user-script to rapidformer_finetune_huggingface_bert_trainer.py and set the acceleration switches.

    #!/bin/bash
    export CUDA_VISIBLE_DEVICES=4,5,6,7
    export MASTER_ADDR=localhost
    export MASTER_PORT=6010
    export NNODES=1
    export NODE_RANK=0

    # A comment cannot follow a line-continuation backslash in shell, so the
    # annotated switches are described here instead of inline:
    #   --mixed-precision               enables mixed-precision training
    #   --zero-3-memory-optimization    enables model state partitioning
    #   --onnx-runtime-training         enables computation graph optimization
    rapidformer --user-script rapidformer_finetune_huggingface_bert_trainer.py \
                --task sequence_classification \
                --pretrained-model-name-or-path 'bert-base-cased' \
                --data-path glue \
                --data-name mrpc \
                --epochs 3 \
                --micro-batch-size 16 \
                --global-batch-size 16 \
                --lr 2e-5 \
                --lr-decay-style linear \
                --lr-warmup-iters 100 \
                --weight-decay 1e-2 \
                --clip-grad 1.0 \
                --mixed-precision \
                --zero-3-memory-optimization \
                --onnx-runtime-training
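
The collate_fn in step 2 pads each batch with padding="longest". The following plain-Python sketch shows the idea without a tokenizer: every sequence in a batch is extended to the length of the longest one, and an attention mask marks the real tokens. The function name, right-side padding, and a pad token ID of 0 are assumptions that match common BERT tokenizer defaults; the token IDs in the example are made up.

```python
def pad_to_longest(examples, pad_token_id=0):
    """Pad variable-length token ID lists to the longest one in the batch.

    Hypothetical helper for illustration; in the fine-tuning code this work
    is done by tokenizer.pad(...).
    """
    longest = max(len(ex["input_ids"]) for ex in examples)
    batch = {"input_ids": [], "attention_mask": []}
    for ex in examples:
        ids = list(ex["input_ids"])
        n_pad = longest - len(ids)
        # Real tokens first, then padding on the right.
        batch["input_ids"].append(ids + [pad_token_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore.
        batch["attention_mask"].append([1] * len(ids) + [0] * n_pad)
    return batch


batch = pad_to_longest([
    {"input_ids": [101, 7, 102]},
    {"input_ids": [101, 7, 8, 9, 102]},
])
```

Padding per batch instead of to a fixed maximum length keeps short batches small, which reduces wasted computation on padding tokens.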

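White-box acceleration: pretraining a Hugging Face model with the Pretrainer code template

When you use the Pretrainer code template provided by Rapidformer to quickly build a Hugging Face pretraining task, the following functions in the template need attention:

  • train_valid_test_datasets_provider, which creates the datasets

  • model_optimizer_lr_scheduler_provider, which builds the model, optimizer, and learning rate scheduler

  • run_forward_step, which defines the forward computation logic

For details about these functions, see the Rapidformer API. For a brief description of their inputs and outputs, see "White-box acceleration: fine-tuning a Hugging Face model with the Finetuner code template".

After you are familiar with the template above, prepare the dataset and model as described in "Black-box acceleration: fine-tuning a Hugging Face model", and then perform the following steps.

  1. Import the Rapidformer and Hugging Face interfaces.

    Note

    Pretraining reads data through an iterator, so mpu must be imported here for data parallelism.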

    import torch
    from megatron import mpu
    from transformers import AutoModelForPreTraining, BertConfig, BertForPreTraining
    from rapidformer import RapidformerEngine, get_args, PreTrainer
    from rapidformer import build_train_valid_test_datasets_for_huggingface
  2. Inherit from Pretrainer and complete the pretraining code, as shown below.

    class MyBertPreTrainer(PreTrainer):
    
        def __init__(self,engine):
            super().__init__(engine=engine)
    
        def train_valid_test_datasets_provider(self, train_val_test_num_samples):
            args = get_args()
    
            train_ds, valid_ds, test_ds = build_train_valid_test_datasets_for_huggingface(
                data_prefix=args.data_path,
                data_impl=args.data_impl,
                splits_string=args.split,
                train_valid_test_num_samples=train_val_test_num_samples,
                max_seq_length=args.seq_length,
                masked_lm_prob=args.mask_prob,
                short_seq_prob=args.short_seq_prob,
                seed=args.seed,
                skip_warmup=(not args.mmap_warmup),
                binary_head=True)
    
            return train_ds, valid_ds, test_ds
    
        def model_optimizer_lr_scheduler_provider(self):
            args = get_args()
            model = AutoModelForPreTraining.from_pretrained(args.pretrained_model_name_or_path)
            return model, None, None
    
        def run_forward_step(self, data_iterator, model):
            # Items and their type.
            keys = ['input_ids', 'attention_mask', 'token_type_ids', 'labels', 'next_sentence_label']
            datatype = torch.int64
    
            # Broadcast data.
            if data_iterator is not None:
                data = next(data_iterator)
            else:
                data = None
            data_b = mpu.broadcast_data(keys, data, datatype)
            input_ids = data_b['input_ids'].long()
            attention_mask = data_b['attention_mask'].long()
            token_type_ids = data_b['token_type_ids'].long()
            labels = data_b['labels'].long()
            next_sentence_label = data_b['next_sentence_label'].long()
            output_tensor = model(input_ids=input_ids, attention_mask=attention_mask,
                                  token_type_ids=token_type_ids, labels=labels, next_sentence_label=next_sentence_label)
    
            return output_tensor['loss']
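  3. Initialize the Rapidformer engine, create the trainer object, and call the pretrain() method (written as trainer.train() in the example below). Save the code as a file named rapidformer_pretrain_huggingface_bert_trainer.py.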

    engine = RapidformerEngine()
    trainer = MyBertPreTrainer(engine=engine)
    trainer.train()
  4. Prepare a CLI-based startup script and set the acceleration switches.

    #!/bin/bash
    export CUDA_VISIBLE_DEVICES=4,5,6,7
    export MASTER_ADDR=localhost
    export MASTER_PORT=6010
    export NNODES=1
    export NODE_RANK=0

    DATA_PATH=book_wiki_owtv2_small_text_sentence

    # A comment cannot follow a line-continuation backslash in shell, so the
    # annotated switches are described here instead of inline:
    #   --data-impl mmap                enables data acceleration
    #   --zero-3-memory-optimization    enables model state partitioning
    #   --onnx-runtime-training         enables computation graph optimization
    #   --mixed-precision               enables mixed-precision training
    rapidformer --user-script rapidformer_pretrain_huggingface_bert_trainer.py \
           --pretrained-model-name-or-path 'bert-base-uncased' \
           --num-layers 12 \
           --hidden-size 768 \
           --num-attention-heads 12 \
           --micro-batch-size 16 \
           --global-batch-size 64 \
           --seq-length 512 \
           --tokenizer-type BertWordPieceLowerCase \
           --max-position-embeddings 512 \
           --train-iters 100 \
           --data-path $DATA_PATH \
           --vocab-file bert-en-uncased-vocab.txt  \
           --data-impl mmap \
           --split 980,20 \
           --lr 1e-3 \
           --lr-decay-style linear \
           --weight-decay 1e-2 \
           --clip-grad 1.0 \
           --lr-warmup-fraction .01 \
           --zero-3-memory-optimization \
           --onnx-runtime-training \
           --mixed-precision
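
The startup script passes --split 980,20, which the trainer forwards as splits_string to the dataset builder. The sketch below shows one common way such a string is interpreted in Megatron-style data loaders: the comma-separated weights are padded to three entries (train, validation, test), normalized, and turned into document index boundaries. The function name and the exact rounding are assumptions for illustration; the builder used by Rapidformer may differ.

```python
def split_boundaries(splits_string, num_documents):
    """Turn a weight string such as "980,20" into [start, ..., end] boundaries.

    Hypothetical helper; segment i covers documents bounds[i] .. bounds[i+1]-1.
    """
    weights = [float(w) for w in splits_string.split(",")]
    # Missing entries (for example, no test weight) count as zero.
    weights = (weights + [0.0] * 3)[:3]
    total = sum(weights)
    if total <= 0:
        raise ValueError("at least one split weight must be positive")
    bounds = [0]
    for w in weights:
        bounds.append(bounds[-1] + int(round(w / total * num_documents)))
    # Rounding can leave the last boundary slightly off; shift so it ends exactly
    # at num_documents.
    drift = bounds[-1] - num_documents
    return [0] + [b - drift for b in bounds[1:]]


# num_documents=1000 is an assumed corpus size, not taken from the script.
bounds = split_boundaries("980,20", 1000)
```

Under these assumptions, 98% of the documents go to training, 2% go to validation, and the test split is empty.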

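White-box acceleration: fine-tuning a Hugging Face model with a user-defined trainer

For programs that use a user-defined trainer, Rapidformer provides only limited acceleration capabilities, such as the Apex optimizer, model state partitioning, and computation graph optimization. Because mixed-precision training requires many changes to your training procedure, we recommend the code-template-based methods described above for accelerating a training program. The following applies intrusive acceleration to a typical Hugging Face fine-tuning script.

The Hugging Face fine-tuning code is as follows.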

import torch
from datasets import load_dataset, load_metric
from torch.utils.data import DataLoader
from transformers import (
    AdamW,
    AutoModelForSequenceClassification,
    AutoTokenizer,
    get_linear_schedule_with_warmup,
    BertForSequenceClassification,

)

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
datasets = load_dataset("glue", "mrpc")
metric = load_metric("glue", "mrpc")

def tokenize_function(examples):
    # max_length=None => use the model max length (it's actually the default)
    outputs = tokenizer(examples["sentence1"], examples["sentence2"], truncation=True, max_length=None)
    return outputs

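tokenized_datasets = datasets.map(
    tokenize_function,
    batched=True,
    remove_columns=["idx", "sentence1", "sentence2"],
)

# Plain single-process PyTorch data loaders used by the loops below.
train_dataloader = DataLoader(tokenized_datasets["train"])
eval_dataloader = DataLoader(tokenized_datasets["validation"])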

model = AutoModelForSequenceClassification.from_pretrained("bert-base-cased", return_dict=True)

optimizer = AdamW(params=model.parameters(), lr=args.lr, correct_bias=True)

lr_scheduler = get_linear_schedule_with_warmup(
    optimizer=optimizer,
    num_warmup_steps=args.lr_warmup_iters,
    num_training_steps=args.train_iters
)

device = torch.device("cuda", args.local_rank)

for epoch in range(args.epochs):
    model.train()
    for step, batch in enumerate(train_dataloader):
        batch.to(device)
        outputs = model(**batch)
        loss = outputs.loss
        loss.backward()
        optimizer.step()
        lr_scheduler.step()
        optimizer.zero_grad()

    model.eval()
    for step, batch in enumerate(eval_dataloader):
        batch.to(device)
        with torch.no_grad():
            outputs = model(**batch)
            predictions = outputs.logits.argmax(dim=-1)
            metric.add_batch(
                    predictions=engine.gather(predictions),
                    references=engine.gather(batch["labels"]))

    eval_metric = metric.compute()
    print("epoch {}: {}".format(epoch, eval_metric))

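This code has several problems: it does not support data-parallel training, its optimizer is slow, and it does not support mixed-precision training. The following steps use the APIs provided by Rapidformer to adapt this custom code.

  1. Support data parallelism.

    First create a finetuner object, and then call finetuner.build_data_loader to obtain the data loaders. These loaders support data parallelism and automatically send the data to the GPU device, which means that batch.to(device) can be removed from the original code.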

    + from rapidformer import RapidformerEngine, Finetuner
    + engine = RapidformerEngine()
    + finetuner = Finetuner(engine=engine)

    - train_dataloader = DataLoader(tokenized_datasets["train"])
    - eval_dataloader = DataLoader(tokenized_datasets["validation"])

    + train_dataloader = finetuner.build_data_loader(tokenized_datasets["train"])
    + eval_dataloader = finetuner.build_data_loader(tokenized_datasets["validation"])
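  2. On top of data parallelism, use the Apex optimizer.

    Replace the optimizer with the faster Apex fused Adam: remove the original optimizer and use the fused Adam provided by Rapidformer. To do this, call engine.compose to wrap the model, optimizer, and learning rate scheduler.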

    + from functools import partial
    + from rapidformer import RapidformerEngine, Finetuner
    + engine = RapidformerEngine()
    + finetuner = Finetuner(engine=engine)
    
    - optimizer = AdamW(params=model.parameters(), lr=args.lr, correct_bias=True)
    - lr_scheduler = get_linear_schedule_with_warmup(optimizer=optimizer,
        num_warmup_steps=args.lr_warmup_iters,
        num_training_steps=args.train_iters
    )
    
    
    + lr_scheduler = partial(
            get_linear_schedule_with_warmup,
            num_warmup_steps=args.lr_warmup_iters,
            num_training_steps=args.train_iters
        )
    
    + model, optimizer, lr_scheduler = engine.compose(model_obj=model,
          lr_scheduler_fn=lr_scheduler)
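    Note

    When you use the Apex optimizer together with mixed precision on top of data parallelism, mixed-precision training requires changes to the training flow, such as switching the model to fp16 and applying loss scaling. Retrofitting a front-end program that has no trainer is costly, so a trainer-based solution is the better choice. With the Rapidformer Finetuner, many more acceleration options are available: in addition to combining the data parallelism above with Apex and PyTorch mixed-precision training, it provides Megatron optimizer mixed-precision training and memory optimization based on FairScale and DeepSpeed.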

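White-box acceleration: pretraining a Megatron model with the Pretrainer code template

After you are familiar with "White-box acceleration: fine-tuning a Hugging Face model with a user-defined trainer" above, you can go further and bypass the Data Hub and Model Hub: write your own dataset creation logic in train_valid_test_datasets_provider, your own model creation logic in model_optimizer_lr_scheduler_provider, and your own forward logic in run_forward_step.

  1. Create a pretraining dataset of the mmap type.

    For details, see the Megatron data processing script. The following command is an example of building an mmap dataset.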

    python preprocess_data.py \
      --input /apsarapangu/disk2/jerry.lp/pretrain_datasets/en/book_wiki_owtv2_small.json  \
      --output-prefix /apsarapangu/disk2/jerry.lp/pretrain_datasets/en/gpt_small \
      --vocab gpt2-vocab.json \
      --dataset-impl mmap \
      --tokenizer-type GPT2BPETokenizer \
      --merge-file gpt2-merges.txt \
      --append-eod
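  2. Inherit from Pretrainer and complete the custom dataset function train_valid_test_datasets_provider in the pretraining code.

    You can write custom logic that does not depend on any third-party library to generate the train, valid, and test datasets. Your dataset classes should inherit from torch.utils.data.Dataset.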

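    from rapidformer import RapidformerEngine, get_args, PreTrainer
    # Assumption: the builder used below is Megatron-LM's GPT dataset helper.
    from megatron.data.gpt_dataset import build_train_valid_test_datasets

    class MyGPTPreTrainer(PreTrainer):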
        def __init__(self,
                     engine,
                     ):
            super().__init__(engine=engine)
    
        def train_valid_test_datasets_provider(self, train_val_test_num_samples):
            args = get_args()
    
            train_ds, valid_ds, test_ds = build_train_valid_test_datasets(
                data_prefix=args.data_path,
                data_impl=args.data_impl,
                splits_string=args.split,
                train_valid_test_num_samples=train_val_test_num_samples,
                seq_length=args.seq_length,
                seed=args.seed,
                skip_warmup=(not args.mmap_warmup))
    
            return train_ds, valid_ds, test_ds
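  3. Inherit from Pretrainer and complete the custom model function model_optimizer_lr_scheduler_provider in the pretraining code.

    You can write custom logic that does not depend on any third-party library to create your model object. Your model should inherit from torch.nn.Module.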

    from rapidformer import RapidformerEngine, get_args, PreTrainer
    from yourmodel import GPTModel
    
    class MyGPTPreTrainer(PreTrainer):
        def __init__(self,
                     engine,
                     ):
            super().__init__(engine=engine)
    
        def model_optimizer_lr_scheduler_provider(self):
            model = GPTModel()
            return model, None, None
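  4. Inherit from Pretrainer and complete the custom forward function run_forward_step in the pretraining code.

    import torch
    # Assumption: these helpers come from Megatron-LM, as in its GPT pretraining script.
    from megatron import get_tokenizer, mpu
    from megatron.utils import get_ltor_masks_and_position_ids
    from rapidformer import RapidformerEngine, get_args, PreTrainer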
    
    class MyGPTPreTrainer(PreTrainer):
        def __init__(self,
                     engine,
                     ):
            super().__init__(engine=engine)
    
    
        def run_forward_step(self, data_iterator, model):
            """Forward step."""
            args = get_args()
    
            tokenizer = get_tokenizer()
    
            # Items and their type.
            keys = ['text']
            datatype = torch.int64
    
            # Broadcast data.
            if data_iterator is not None:
                data = next(data_iterator)
            else:
                data = None
            data_b = mpu.broadcast_data(keys, data, datatype)
    
            # Unpack.
            tokens_ = data_b['text'].long()
            labels = tokens_[:, 1:].contiguous()
            tokens = tokens_[:, :-1].contiguous()
    
            # Get the masks and position ids.
            attention_mask, loss_mask, position_ids = get_ltor_masks_and_position_ids(
                tokens,
                tokenizer.eod,
                args.reset_position_ids,
                args.reset_attention_mask,
                args.eod_mask_loss)
    
            output_tensor = model(tokens, position_ids, attention_mask,
                                  labels=labels)
    
            losses = output_tensor.float()
            loss_mask = loss_mask.view(-1).float()
            loss = torch.sum(losses.view(-1) * loss_mask) / loss_mask.sum()
    
            return loss
    
    
                            
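  5. Initialize the Rapidformer engine, create the trainer object, and call the pretrain() method (written as trainer.train() in the example below). Save the code as a file named rapidformer_pretrain_megatron_gpt_trainer.py.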

    engine = RapidformerEngine()
    trainer = MyGPTPreTrainer(engine=engine)
    trainer.train()
  6. Prepare the startup script and set the acceleration switches.

    #!/bin/bash
    export CUDA_VISIBLE_DEVICES=4,5,6,7
    export MASTER_ADDR=localhost
    export MASTER_PORT=6010
    export NNODES=1
    export NODE_RANK=0

    DATA_PATH=book_wiki_owtv2_small_text_sentence
    PRETRAINED_CHECKPOINT=

    # A comment cannot follow a line-continuation backslash in shell, so the
    # annotated switches are described here instead of inline:
    #   --tensor-model-parallel-size 2      enables operator splitting
    #   --pipeline-model-parallel-size 2    enables pipeline parallelism
    #   --global-batch-size 128             enables gradient accumulation
    #   --data-impl mmap                    enables data acceleration
    #   --zero-2-memory-optimization        enables model state partitioning
    #   --checkpoint-activations            enables gradient checkpointing
    #   --mixed-precision                   enables mixed-precision training
    rapidformer --user-script rapidformer_pretrain_megatron_gpt_trainer.py \
           --tensor-model-parallel-size 2 \
           --pipeline-model-parallel-size 2 \
           --num-layers 12 \
           --hidden-size 768 \
           --num-attention-heads 12 \
           --micro-batch-size 16 \
           --global-batch-size 128 \
           --seq-length 512 \
           --tokenizer-type GPT2BPETokenizer \
           --max-position-embeddings 512 \
           --train-iters 100 \
           --data-path $DATA_PATH \
           --vocab-file gpt2-vocab.json \
           --merge-file gpt2-merges.txt \
           --data-impl mmap \
           --split 980,20 \
           --lr 1e-3 \
           --lr-decay-style linear \
           --weight-decay 1e-2 \
           --clip-grad 1.0 \
           --lr-warmup-fraction .01 \
           --log-interval 1 \
           --zero-2-memory-optimization \
           --checkpoint-activations \
           --mixed-precision
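
The run_forward_step function in step 4 builds its inputs and labels by shifting the token sequence by one position (tokens_[:, :-1] and tokens_[:, 1:]) and then averages the per-token losses under loss_mask. The following plain-Python sketch reproduces both operations on a single sequence so that the indexing is easy to follow. The function names and the sample values are hypothetical; the real code works on batched torch tensors.

```python
def shift_for_next_token(token_ids):
    """Split one token sequence into model inputs and next-token labels."""
    inputs = token_ids[:-1]   # everything except the last token
    labels = token_ids[1:]    # the token that follows each input position
    return inputs, labels


def masked_mean(per_token_losses, loss_mask):
    """Average the losses at positions where loss_mask is 1."""
    kept = sum(loss_mask)
    if kept == 0:
        raise ValueError("loss_mask must keep at least one position")
    return sum(l * m for l, m in zip(per_token_losses, loss_mask)) / kept


inputs, labels = shift_for_next_token([5, 6, 7, 8])
# The third position is masked out, so only the first two losses count.
loss = masked_mean([2.0, 4.0, 9.0], [1, 1, 0])
```

Dividing by the number of kept positions, not by the sequence length, keeps masked positions from diluting the loss.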