
Convert a large language model into an inference service

A large language model (LLM) is a neural network language model with hundreds of millions of parameters or more, such as GPT-3, GPT-4, PaLM, and PaLM 2. When you need to process large amounts of natural language data or build a sophisticated language understanding system, you can convert a large language model into an inference service and integrate advanced NLP capabilities (such as text classification, sentiment analysis, and machine translation) into your applications through an API. By serving the LLM as a service, you avoid expensive infrastructure costs and can respond to market changes quickly. Because the model runs in the cloud, you can also scale the service at any time to handle peaks in user requests, which improves operational efficiency.

Prerequisites

Step 1: Build a custom runtime

Build a custom runtime that serves a HuggingFace LLM with a prompt-tuning configuration. In this example, the defaults point to a prebuilt custom runtime image and a prebuilt prompt-tuning configuration.

  1. Implement a class that inherits from the MLServer MLModel class.

    The peft_model_server.py file contains all the code for serving a HuggingFace LLM with a prompt-tuning configuration. The _load_model function in this file loads the pretrained LLM together with the trained PEFT prompt-tuning configuration. It also defines the tokenizer, which encodes and decodes the raw string inputs of inference requests so that users do not need to preprocess their inputs into tensor bytes. An optional local smoke test is described after the code.


    from typing import List
    
    from mlserver import MLModel, types
    from mlserver.codecs import decode_args
    
    from peft import PeftModel, PeftConfig
    from transformers import AutoModelForCausalLM, AutoTokenizer
    import torch
    import os
    
    class PeftModelServer(MLModel):
        async def load(self) -> bool:
            self._load_model()
            self.ready = True
            return self.ready
    
        @decode_args
        async def predict(self, content: List[str]) -> List[str]:
            return self._predict_outputs(content)
    
        def _load_model(self):
            model_name_or_path = os.environ.get("PRETRAINED_MODEL_PATH", "bigscience/bloomz-560m")
            peft_model_id = os.environ.get("PEFT_MODEL_ID", "aipipeline/bloomz-560m_PROMPT_TUNING_CAUSAL_LM")
            self.tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, local_files_only=True)
            config = PeftConfig.from_pretrained(peft_model_id)
            self.model = AutoModelForCausalLM.from_pretrained(config.base_model_name_or_path)
            self.model = PeftModel.from_pretrained(self.model, peft_model_id)
            self.text_column = os.environ.get("DATASET_TEXT_COLUMN_NAME", "Tweet text")
            return
    
        def _predict_outputs(self, content: List[str]) -> List[str]:
            output_list = []
            for text in content:
                inputs = self.tokenizer(
                    f'{self.text_column} : {text} Label : ',
                    return_tensors="pt",
                )
                with torch.no_grad():
                    outputs = self.model.generate(
                        input_ids=inputs["input_ids"], attention_mask=inputs["attention_mask"], max_new_tokens=10, eos_token_id=3
                    )
                    outputs = self.tokenizer.batch_decode(outputs.detach().cpu().numpy(), skip_special_tokens=True)
                output_list.append(outputs[0])
            return output_list
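
    (Optional) Before building the image, you can smoke-test the class with a local MLServer instance. The following is a minimal sketch, assuming peft_model_server.py is in the current directory, the mlserver, peft, transformers, and torch packages are installed, and the model files referenced by the environment variables are already available locally (the tokenizer is loaded with local_files_only=True). Create a model-settings.json file next to peft_model_server.py:

    {
        "name": "peft-model",
        "implementation": "peft_model_server.PeftModelServer"
    }

    Then start the server, which loads the model and exposes the V2 inference endpoints:

    mlserver start .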
    
  2. Build the Docker image.

    After implementing the model class, you need to package its dependencies, including MLServer, into an image that is supported as a ServingRuntime resource. You can refer to the following Dockerfile to build the image; sample build and push commands follow the Dockerfile.


    # TODO: choose appropriate base image, install Python, MLServer, and
    # dependencies of your MLModel implementation
    FROM python:3.8-slim-buster
    RUN pip install mlserver peft transformers datasets
    # ...
    
    # The custom `MLModel` implementation should be on the Python search path
    # instead of relying on the working directory of the image. If using a
    # single-file module, this can be accomplished with:
    COPY --chown=${USER} ./peft_model_server.py /opt/peft_model_server.py
    ENV PYTHONPATH=/opt/
    
    # environment variables to be compatible with ModelMesh Serving
    # these can also be set in the ServingRuntime, but this is recommended for
    # consistency when building and testing
    ENV MLSERVER_MODELS_DIR=/models/_mlserver_models \
        MLSERVER_GRPC_PORT=8001 \
        MLSERVER_HTTP_PORT=8002 \
        MLSERVER_LOAD_MODELS_AT_STARTUP=false \
        MLSERVER_MODEL_NAME=peft-model
    
    # With this setting, the implementation field is not required in the model
    # settings which eases integration by allowing the built-in adapter to generate
    # a basic model settings file
    ENV MLSERVER_MODEL_IMPLEMENTATION=peft_model_server.PeftModelServer
    
    CMD mlserver start ${MLSERVER_MODELS_DIR}
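
    Once the Dockerfile is ready, build the image and push it to a registry that your cluster can pull from. The commands below are a sketch; the image address registry.cn-beijing.aliyuncs.com/test/peft-model-server:latest is the one referenced by the ServingRuntime example in the next step, so replace it with your own repository if needed.

    docker build -t registry.cn-beijing.aliyuncs.com/test/peft-model-server:latest .
    docker push registry.cn-beijing.aliyuncs.com/test/peft-model-server:latest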
    
  3. Create a new ServingRuntime resource.

    1. Save the following content as sample-runtime.yaml to create a new ServingRuntime resource, and point it to the image that you just built.


      apiVersion: serving.kserve.io/v1alpha1
      kind: ServingRuntime
      metadata:
        name: peft-model-server
        namespace: modelmesh-serving
      spec:
        supportedModelFormats:
          - name: peft-model
            version: "1"
            autoSelect: true
        multiModel: true
        grpcDataEndpoint: port:8001
        grpcEndpoint: port:8085
        containers:
          - name: mlserver
            image: registry.cn-beijing.aliyuncs.com/test/peft-model-server:latest
            env:
              - name: MLSERVER_MODELS_DIR
                value: "/models/_mlserver_models/"
              - name: MLSERVER_GRPC_PORT
                value: "8001"
              - name: MLSERVER_HTTP_PORT
                value: "8002"
              - name: MLSERVER_LOAD_MODELS_AT_STARTUP
                value: "true"
              - name: MLSERVER_MODEL_NAME
                value: peft-model
              - name: MLSERVER_HOST
                value: "127.0.0.1"
              - name: MLSERVER_GRPC_MAX_MESSAGE_LENGTH
                value: "-1"
              - name: PRETRAINED_MODEL_PATH
                value: "bigscience/bloomz-560m"
              - name: PEFT_MODEL_ID
                value: "aipipeline/bloomz-560m_PROMPT_TUNING_CAUSAL_LM"
              # - name: "TRANSFORMERS_OFFLINE"
              #   value: "1"  
              # - name: "HF_DATASETS_OFFLINE"
              #   value: "1"    
            resources:
              requests:
                cpu: 500m
                memory: 4Gi
              limits:
                cpu: "5"
                memory: 5Gi
        builtInAdapter:
          serverType: mlserver
          runtimeManagementPort: 8001
          memBufferBytes: 134217728
          modelLoadingTimeoutMillis: 90000
      
    2. Run the following command to deploy the ServingRuntime resource.

      kubectl apply -f sample-runtime.yaml

      After the resource is created, you can see the new custom runtime in your ModelMesh deployment. A command for verifying this is shown below.
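
      To confirm that the runtime has been registered, list the ServingRuntime resources in the modelmesh-serving namespace, for example:

      kubectl get servingruntimes -n modelmesh-serving

      The peft-model-server runtime should appear in the output.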

Step 2: Deploy the LLM service

To deploy a model with the newly created runtime, you need to create an InferenceService resource to serve the model. This resource is the main interface that KServe and ModelMesh use to manage models, and it represents the model's logical endpoint for inference.

  1. Use the following content to create an InferenceService resource that serves the model.


    apiVersion: serving.kserve.io/v1beta1
    kind: InferenceService
    metadata:
      name: peft-demo
      namespace: modelmesh-serving
      annotations:
        serving.kserve.io/deploymentMode: ModelMesh
    spec:
      predictor:
        model:
          modelFormat:
            name: peft-model
          runtime: peft-model-server
          storage:
            key: localMinIO
            path: sklearn/mnist-svm.joblib
    

    In the YAML, the InferenceService is named peft-demo and declares its model format as peft-model, the same format used by the sample custom runtime created earlier. The optional runtime field is also set, which explicitly tells ModelMesh to use the peft-model-server runtime to deploy this model.

  2. Run the following command to deploy the InferenceService resource. You can then verify its status as shown below.

    kubectl apply -f ${actual YAML file name}.yaml
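
    After deployment, you can check whether ModelMesh has loaded the model by inspecting the InferenceService, for example:

    kubectl get inferenceservice peft-demo -n modelmesh-serving

    The READY column shows True once the model has been loaded into the runtime.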

Step 3: Run the inference service

Use the curl command to send an inference request to the LLM model service deployed above.

MODEL_NAME="peft-demo"
ASM_GW_IP="<ASM gateway IP address>"
curl -X POST -k http://${ASM_GW_IP}:8008/v2/models/${MODEL_NAME}/infer -d @./input.json

In the curl command, input.json contains the request data:

{
    "inputs": [
        {
          "name": "content",
          "shape": [1],
          "datatype": "BYTES",
          "contents": {"bytes_contents": ["RXZlcnkgZGF5IGlzIGEgbmV3IGJpbm5pbmcsIGZpbGxlZCB3aXRoIG9wdGlvbnBpZW5pbmcgYW5kIGhvcGU="]}
        }
    ]
}

bytes_contents is the Base64 encoding of the input string "Every day is a new beginning, filled with opportunities and hope".
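
If you want to send your own text, you can generate the Base64 value for the bytes_contents field with a command like the following (a standard base64 utility is assumed):

echo -n "Every day is a new beginning, filled with opportunities and hope" | base64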

The JSON response looks like the following:

{
 "modelName": "peft-demo__isvc-5c5315c302",
 "outputs": [
  {
   "name": "output-0",
   "datatype": "BYTES",
   "shape": [
    "1",
    "1"
   ],
   "parameters": {
    "content_type": {
     "stringParam": "str"
    }
   },
   "contents": {
    "bytesContents": [
     "VHdlZXQgdGV4dCA6IEV2ZXJ5IGRheSBpcyBhIG5ldyBiaW5uaW5nLCBmaWxsZWQgd2l0aCBvcHRpb25waWVuaW5nIGFuZCBob3BlIExhYmVsIDogbm8gY29tcGxhaW50"
    ]
   }
  }
 ]
}

The Base64-decoded content of bytesContents is shown below, which indicates that the request to the LLM model service returned the expected result.

Tweet text : Every day is a new binning, filled with optionpiening and hope Label : no complaint
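
To inspect a response yourself, decode the bytesContents value with a Base64 decoder, for example:

echo "VHdlZXQgdGV4dCA6IEV2ZXJ5IGRheSBpcyBhIG5ldyBiaW5uaW5nLCBmaWxsZWQgd2l0aCBvcHRpb25waWVuaW5nIGFuZCBob3BlIExhYmVsIDogbm8gY29tcGxhaW50" | base64 -d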