
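This article uses the ResNet50 model provided by PyTorch as an example to show how to find a model's performance bottlenecks with PyTorch Profiler, optimize the model with TensorRT, and then deploy the optimized model with Triton Inference Server.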

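Background

NVIDIA TensorRT is an SDK for accelerating deep learning inference. It includes an optimizer and a runtime that reduce inference latency and increase throughput. Triton Inference Server is an open source model inference framework from NVIDIA that supports mainstream model formats such as PyTorch, TensorFlow, TensorRT, and ONNX.

When a trained deep learning model is ready to go into production, it usually needs performance analysis and optimization to reduce inference latency and increase throughput. Optimization can also reduce the GPU memory that the model occupies, so that GPUs can be shared and GPU utilization improved.

The example classifies the image dog.jpg with the ResNet50 model provided by PyTorch, and walks through profiling the model with PyTorch Profiler, optimizing it with TensorRT, and deploying it with Triton Inference Server.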

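Prerequisites

- A Kubernetes cluster with GPU nodes has been created. For details, see Use the default GPU scheduling of Kubernetes.
- The cluster nodes can access the Internet. For details, see Enable Internet access for an existing cluster.
- The Arena tool is installed. For details, see Install Arena.
- A PVC for Arena has been configured for the cluster. For more information, see Configure NAS shared storage.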

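Step 1: Profile the PyTorch model

PyTorch has provided a Profiler since version 1.8.1. It helps analyze performance bottlenecks during model training and inference, and it integrates with TensorBoard so that reports are easy to view.

Run the following code to generate the Profiler logs.

Note

For the imagenet_classes.txt file that is opened by with open("imagenet_classes.txt") as f:, see imagenet_classes.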

import torch
from torchvision import models
import torchvision.transforms as T
from PIL import Image
import time

# Preprocess the image.
def preprocess_image(img_path):
    transform = T.Compose([
    T.Resize(224),
    T.CenterCrop(224),
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),])

    # Read the image.
    input_image = Image.open(img_path)
    # Convert the image to a normalized tensor.
    input_data = transform(input_image)

    batch_data = torch.unsqueeze(input_data, 0)

    return batch_data


# Post-process the prediction results.
def postprocess(output_data):
    # Read the ImageNet class labels.
    with open("imagenet_classes.txt") as f:
        classes = [line.strip() for line in f.readlines()]
    # Apply softmax to get human-readable confidences (in percent).
    confidences = torch.nn.functional.softmax(output_data, dim=1)[0] * 100
    _, indices = torch.sort(output_data, descending=True)
    i = 0
    # Print every class whose confidence is above 0.5%.
    while confidences[indices[0][i]] > 0.5:
        class_idx = indices[0][i]
        print(
            "class:",
            classes[class_idx],
            ", confidence:",
            confidences[class_idx].item(),
            "%, index:",
            class_idx.item(),
        )
        i += 1


def main():
    model = models.resnet50(pretrained=True)

    input = preprocess_image("dog.jpg").cuda()

    model.eval()
    model.cuda()

    with torch.profiler.profile(
        activities=[
            torch.profiler.ProfilerActivity.CPU,
            torch.profiler.ProfilerActivity.CUDA,
        ],
        on_trace_ready=torch.profiler.tensorboard_trace_handler('./logs'),
        profile_memory=True,
        record_shapes=True,
        with_stack=True
    ) as profiler:
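        start = time.time()
        output = model(input)
        # CUDA kernels run asynchronously; wait for them before stopping the timer.
        torch.cuda.synchronize()
        cost = time.time() - start
        print(f"predict_cost = {cost}")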

        postprocess(output)
        profiler.step()


if __name__ == '__main__':
    main()
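
The post-processing step above converts raw model outputs into percentages with softmax and prints the classes above a confidence threshold. The following is a minimal pure-Python sketch of the same idea. It assumes a tiny made-up score vector and hypothetical label names rather than real ResNet50 output, and the function names are illustrative.

```python
import math

def softmax_percent(logits):
    """Convert raw scores to percentages that sum to 100."""
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [100.0 * e / total for e in exps]

def top_predictions(logits, labels, threshold=0.5):
    """Return (label, confidence) pairs above the threshold, best first."""
    confidences = softmax_percent(logits)
    ranked = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    return [(labels[i], confidences[i]) for i in ranked if confidences[i] > threshold]

# Hypothetical scores and labels, for illustration only.
example_logits = [2.0, 1.0, -4.0]
example_labels = ["dog", "cat", "car"]
for label, confidence in top_predictions(example_logits, example_labels):
    print(f"class: {label}, confidence: {confidence:.2f}%")
```

Because softmax preserves the ordering of the scores, sorting by raw score and sorting by confidence give the same ranking.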

View the analysis report with TensorBoard.
Run the following commands to install the PyTorch Profiler TensorBoard plugin and start TensorBoard locally.
```
pip install torch_tb_profiler
tensorboard --logdir ./logs --port 6006
```
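Enter localhost:6006 in the browser address bar to view the TensorBoard results.
TensorBoard shows analysis results such as the GPU Kernel, PyTorch Op, and Trace Timeline views, along with optimization recommendations.
The TensorBoard results show that:
- GPU utilization of this ResNet50 model is low. Consider increasing the batch size to raise utilization.
- Most of the time is spent loading GPU kernels. Lowering the precision can speed up inference.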

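Step 2: Optimize the PyTorch model

To optimize a model with TensorRT, first convert the PyTorch model to ONNX, and then convert the ONNX model to a TensorRT engine.

Run the following code to export the model to ONNX.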

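# Requires the onnx package, plus the imports and preprocess_image() from Step 1.
import onnx

# Load the pretrained model.
model = models.resnet50(pretrained=True)

# Preprocess the input image.
input = preprocess_image("dog.jpg").cuda()

# Switch to inference mode and move the model to the GPU.
model.eval()
model.cuda()

# Export the model to ONNX.
ONNX_FILE_PATH = "resnet50.onnx"
torch.onnx.export(model, input, ONNX_FILE_PATH, input_names=["input"], output_names=["output"], export_params=True)
onnx_model = onnx.load(ONNX_FILE_PATH)

# Check that the exported model is well formed.
onnx.checker.check_model(onnx_model)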

print("Model was successfully converted to ONNX format.")
print("It was saved to", ONNX_FILE_PATH)

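Build and export the TensorRT engine.

Important

The TensorRT and CUDA versions used to build the TensorRT engine must match the versions used by Triton when the inference task is deployed in Step 4: Deploy the optimized model. They must also be consistent with the GPU driver and CUDA version of the ECS instance.

We recommend using the TensorRT image provided by NVIDIA. This example uses nvcr.io/nvidia/tensorrt:21.05-py3, and the corresponding Triton image is nvcr.io/nvidia/tritonserver:21.05-py3.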

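import os
import tensorrt as trt

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)
# Assumed output path; the file name matches default_model_filename in config.pbtxt (Step 4).
TRT_ENGINE_PATH = "resnet50.trt"


def build_engine(onnx_file_path, save_engine=False):
    if os.path.exists(TRT_ENGINE_PATH):
        # If a serialized engine already exists, load it instead of building a new one.
        print("Reading engine from file {}".format(TRT_ENGINE_PATH))
        with open(TRT_ENGINE_PATH, "rb") as f, trt.Runtime(TRT_LOGGER) as runtime:
            engine = runtime.deserialize_cuda_engine(f.read())
            context = engine.create_execution_context()
            return engine, context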

    # Initialize the TensorRT builder and the ONNX parser.
    builder = trt.Builder(TRT_LOGGER)

    explicit_batch = 1 << (int)(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)
    network = builder.create_network(explicit_batch)
    parser = trt.OnnxParser(network, TRT_LOGGER)

    # Allow TensorRT to use up to 1 GB of GPU memory for tactic selection.
    builder.max_workspace_size = 1 << 30
    # This example processes one image per batch.
    builder.max_batch_size = 1
    # Use FP16 mode when the platform supports it.
    if builder.platform_has_fast_fp16:
        builder.fp16_mode = True

    # Parse the ONNX model and stop if the parser reports errors.
    with open(onnx_file_path, 'rb') as model:
        print('Beginning ONNX file parsing')
        if not parser.parse(model.read()):
            for i in range(parser.num_errors):
                print(parser.get_error(i))
            raise RuntimeError("Failed to parse the ONNX file")
    print('Completed parsing of ONNX file')

    # Build a TensorRT engine optimized for the target platform.
    print('Building an engine...')
    engine = builder.build_cuda_engine(network)
    context = engine.create_execution_context()
    print("Completed creating Engine")

    with open(TRT_ENGINE_PATH, "wb") as f:
        print("Save engine to {}".format(TRT_ENGINE_PATH))
        f.write(engine.serialize())

    return engine, context
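
FP16 mode trades a little numeric precision for smaller weights and faster kernels on GPUs with fast half-precision support. As a rough illustration that is independent of TensorRT and uses only Python's struct module, the sketch below round-trips a value through IEEE 754 single and half precision to show the storage size and the rounding error involved. The value 0.1 is an arbitrary example, not a weight from the model.

```python
import struct

def round_trip(value, fmt):
    """Store a Python float in the given struct format and read it back."""
    return struct.unpack(fmt, struct.pack(fmt, value))[0]

fp32_bytes = struct.calcsize("<f")  # IEEE 754 single precision
fp16_bytes = struct.calcsize("<e")  # IEEE 754 half precision

weight = 0.1  # arbitrary example value, not taken from the model
as_fp32 = round_trip(weight, "<f")
as_fp16 = round_trip(weight, "<e")

print(f"FP32: {fp32_bytes} bytes per value, stored as {as_fp32!r}")
print(f"FP16: {fp16_bytes} bytes per value, stored as {as_fp16!r}")
print(f"FP16 rounding error: {abs(as_fp16 - weight):.2e}")
```

Each half-precision value takes half the bytes of a single-precision one, which is one plausible reason an FP16 engine ends up noticeably smaller than the original FP32 weights.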

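Compare the latency and size of the original PyTorch model with those of the TensorRT-optimized model.

Run the following code to measure the inference time of the original PyTorch model on the GPU.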

model = models.resnet50(pretrained=True)
input = preprocess_image("dog.jpg").cuda()
model.eval()
model.cuda()
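torch.cuda.synchronize()  # finish pending GPU work before starting the timer
start = time.time()
output = model(input)
torch.cuda.synchronize()  # CUDA kernels run asynchronously; wait for the result
cost = time.time() - start
print(f"pytorch predict_cost = {cost}")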
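
A single time.time() measurement can be skewed by one-off costs such as CUDA initialization and by the asynchronous nature of GPU kernel launches. The helper below is a minimal, framework-agnostic sketch of a more robust measurement: it runs a few warm-up iterations and then averages several timed runs. The benchmark name and its parameters are illustrative. The sync hook is where a function such as torch.cuda.synchronize could be passed when timing GPU work.

```python
import time

def benchmark(fn, warmup=3, iters=10, sync=None):
    """Return the mean latency of fn() in seconds over `iters` timed runs."""
    for _ in range(warmup):
        fn()  # warm-up runs are not timed
    if sync is not None:
        sync()  # make sure pending work is finished before timing starts
    start = time.perf_counter()
    for _ in range(iters):
        fn()
    if sync is not None:
        sync()  # wait for queued work before stopping the clock
    return (time.perf_counter() - start) / iters

# Stand-in workload for illustration; replace with e.g. lambda: model(input).
calls = []
mean_latency = benchmark(lambda: calls.append(sum(range(1000))))
print(f"mean latency = {mean_latency * 1000:.3f} ms, "
      f"throughput = {1 / mean_latency:.1f} calls/s")
```

Averaging over several runs also gives a throughput figure, which makes it easier to compare batch sizes.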

Run the following code to measure the inference time of the optimized TensorRT engine on the GPU.

import numpy as np
import pycuda.autoinit  # creates a CUDA context on import
import pycuda.driver as cuda

# Load the serialized engine, or build it from the ONNX model.
engine, context = build_engine(ONNX_FILE_PATH)
# Get the input and output sizes and allocate memory for the input and output data.
for binding in engine:
    if engine.binding_is_input(binding):  # we expect only one input
        input_shape = engine.get_binding_shape(binding)
        input_size = trt.volume(input_shape) * engine.max_batch_size * np.dtype(np.float32).itemsize  # in bytes
        device_input = cuda.mem_alloc(input_size)
    else:  # output
        output_shape = engine.get_binding_shape(binding)
        # Create a page-locked host buffer (it is never swapped to disk).
        host_output = cuda.pagelocked_empty(trt.volume(output_shape) * engine.max_batch_size, dtype=np.float32)
        device_output = cuda.mem_alloc(host_output.nbytes)
# Create a stream in which to copy inputs and outputs and run inference.
stream = cuda.Stream()
# Preprocess the input data.
host_input = np.array(preprocess_image("dog.jpg").numpy(), dtype=np.float32, order='C')
cuda.memcpy_htod_async(device_input, host_input, stream)
# Run inference.
start = time.time()
context.execute_async(bindings=[int(device_input), int(device_output)], stream_handle=stream.handle)
cuda.memcpy_dtoh_async(host_output, device_output, stream)
stream.synchronize()
cost = time.time() - start
print(f"tensorrt predict_cost = {cost}")
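
The buffer sizes allocated above follow directly from the binding shapes. As a quick check of the arithmetic, the sketch below assumes that the engine reports an input shape of (1, 3, 224, 224) and an output shape of (1, 1000). These shapes are inferred from the preprocessing code and the 1000 ImageNet classes, not read from an actual engine, and buffer_bytes is an illustrative helper.

```python
import math

FLOAT32_BYTES = 4  # same as np.dtype(np.float32).itemsize

def buffer_bytes(shape, max_batch_size=1, itemsize=FLOAT32_BYTES):
    """Bytes needed for a binding: element count x batch size x bytes per element."""
    return math.prod(shape) * max_batch_size * itemsize

input_bytes = buffer_bytes((1, 3, 224, 224))
output_bytes = buffer_bytes((1, 1000))
print(f"input buffer:  {input_bytes} bytes")
print(f"output buffer: {output_bytes} bytes")
```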

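Measuring the inference time of the original PyTorch model and of the optimized TensorRT engine on the GPU, and comparing the model file sizes before and after optimization, gives the following values:

| Metric | PyTorch model | TensorRT engine |
| --- | --- | --- |
| Latency | 16 ms | 3 ms |
| Size | 98 MB | 50 MB |

Latency drops to about 20% of the original value, and the model file shrinks by about half.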
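
The ratios quoted above can be reproduced from the measured values with a few lines of arithmetic:

```python
# Values taken from the latency and size comparison above.
pytorch_latency_ms, tensorrt_latency_ms = 16, 3
pytorch_size_mb, tensorrt_size_mb = 98, 50

latency_ratio = tensorrt_latency_ms / pytorch_latency_ms
size_ratio = tensorrt_size_mb / pytorch_size_mb
speedup = pytorch_latency_ms / tensorrt_latency_ms

print(f"latency: {latency_ratio:.1%} of the original ({speedup:.1f}x faster)")
print(f"size:    {size_ratio:.1%} of the original")
```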

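Step 3: Stress test the model

Before deploying the optimized model with Triton Inference Server, you can use Model Analyzer, a tool provided with Triton, to stress test the model and check whether latency, throughput, and GPU memory usage meet expectations. For more information about Model Analyzer, see model_analyzer.

Run the following command to profile the model.

The command creates an output_model_repository folder in the current directory that contains the profiling results.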

model-analyzer profile -m /triton_repository/ \
    --profile-models resnet50_trt \
    --run-config-search-max-concurrency 2 \
    --run-config-search-max-instance-count 2 \
    --run-config-search-preferred-batch-size-disable true

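Run the following commands to generate the analysis report.
To make the results easier to read, use the analyze subcommand of model-analyzer to turn the data from the previous step into a PDF report.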
```
mkdir analysis_results
model-analyzer analyze --analysis-models resnet50_trt -e analysis_results
```
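View the analysis report.
Figure 1. Throughput versus latency for the two best configurations
Figure 2. GPU memory versus latency for the two best configurations
Figure 3. Performance of the two models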

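Step 4: Deploy the optimized model

If the model performance meets expectations, you can use Arena to deploy the optimized model in the ACK cluster.

Write the config.pbtxt file.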

name: "resnet50_trt"
platform: "tensorrt_plan"
max_batch_size: 1
default_model_filename: "resnet50.trt"
input [
    {
        name: "input"
        format: FORMAT_NCHW
        data_type: TYPE_FP32
        dims: [ 3, 224, 224 ]
    }
]
output [
    {
        name: "output"
        data_type: TYPE_FP32
        dims: [ 1000 ]
    }
]
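
Triton expects each model in its own directory inside the model repository, with config.pbtxt at the top level of that directory and the model file in a numbered version subdirectory. The sketch below creates that layout in a temporary directory with empty placeholder files. The temporary root is a hypothetical stand-in for the real repository path on the PVC, version 1 is assumed, and create_model_repository is an illustrative helper.

```python
import tempfile
from pathlib import Path

def create_model_repository(root, model_name, model_file, version="1"):
    """Create <root>/<model_name>/config.pbtxt and <root>/<model_name>/<version>/<model_file>."""
    model_dir = Path(root) / model_name
    version_dir = model_dir / version
    version_dir.mkdir(parents=True, exist_ok=True)
    config_path = model_dir / "config.pbtxt"
    engine_path = version_dir / model_file
    config_path.touch()  # placeholder; copy the real config.pbtxt here
    engine_path.touch()  # placeholder; copy the serialized engine here
    return config_path, engine_path

repo_root = tempfile.mkdtemp()  # hypothetical repository root
config_path, engine_path = create_model_repository(repo_root, "resnet50_trt", "resnet50.trt")
for path in sorted(Path(repo_root).rglob("*")):
    print(path.relative_to(repo_root))
```

The directory name must match the name field in config.pbtxt, and the file name must match default_model_filename.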

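Run the following command to deploy the model inference task with Arena.
If you deploy with GPU sharing, set the GPU memory size (--gpumemory) based on the memory size recommended in the analysis report from Step 3: Stress test the model. For this model, the GPU memory can be set to 2 GB.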
```
arena serve triton \
  --name=resnet50 \
  --gpus=1 \
  --replicas=1 \
  --image=nvcr.io/nvidia/tritonserver:21.05-py3 \
  --data=model-pvc:/data \
  --model-repository=/data/profile/pytorch \
  --allow-metrics=true
```

Run the following command to check the status of the inference service.

arena serve list

Expected output:

NAME      TYPE    VERSION       DESIRED  AVAILABLE  ADDRESS         PORTS                   GPU
resnet50  Triton  202111121515  1        1          172.16.169.126  RESTFUL:8000,GRPC:8001  1
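
Once the service is listed as available, its readiness can also be checked over the REST port. Triton implements the KServe v2 inference protocol, which exposes a GET /v2/health/ready endpoint. The sketch below builds that URL and defines a small probe. The helper names are illustrative, and the address is the one from the sample output above, so replace it with your own.

```python
import urllib.error
import urllib.request

def ready_url(host, port=8000):
    """Build the server readiness URL for the REST endpoint."""
    return f"http://{host}:{port}/v2/health/ready"

def is_ready(host, port=8000, timeout=2.0):
    """Return True if the server answers the readiness probe with HTTP 200."""
    try:
        with urllib.request.urlopen(ready_url(host, port), timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False  # unreachable, timed out, or not ready

url = ready_url("172.16.169.126")
print(url)
# Call is_ready("172.16.169.126") from a machine that can reach the service.
```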

Call the inference service deployed in the ACK cluster with a gRPC client.

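import time
import grpc
from tritonclient.grpc import service_pb2, service_pb2_grpc

# parse_model() and requestGenerator() are helper functions that are not shown here.
model_name = "resnet50_trt"  # the name set in config.pbtxt
model_version = ""  # assumption: an empty string leaves the version choice to the server
batch_size = 1  # matches max_batch_size in config.pbtxt
img_file = "dog.jpg"
service_grpc_endpoint = "172.16.248.19:8001"  # replace with your service address

# Create the gRPC stub used to communicate with the server.
channel = grpc.insecure_channel(service_grpc_endpoint)
grpc_stub = service_pb2_grpc.GRPCInferenceServiceStub(channel)

# Make sure the model meets the requirements, and get the model properties needed for preprocessing.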
metadata_request = service_pb2.ModelMetadataRequest(
    name=model_name, version=model_version)
metadata_response = grpc_stub.ModelMetadata(metadata_request)
config_request = service_pb2.ModelConfigRequest(name=model_name,
                                                version=model_version)
config_response = grpc_stub.ModelConfig(config_request)
input_name, output_name, c, h, w, format, dtype = parse_model(
    metadata_response, config_response.config)
request = requestGenerator(input_name, output_name, c, h, w, format, dtype, batch_size, img_file)
start = time.time()
response = grpc_stub.ModelInfer(request)
cost = time.time() - start
print("predict cost: {}".format(cost))

View the monitoring data.

Metrics are available from the /metrics endpoint on port 8002. In this example, the address is 172.16.169.126:8002/metrics.

# HELP nv_inference_request_success Number of successful inference requests, all batch sizes
# TYPE nv_inference_request_success counter
nv_inference_request_success{gpu_uuid="GPU-0e9fdafb-5adb-91cd-26a8-308d34357efc",model="resnet50_trt",version="1"} 4.000000
# HELP nv_inference_request_failure Number of failed inference requests, all batch sizes
# TYPE nv_inference_request_failure counter
nv_inference_request_failure{gpu_uuid="GPU-0e9fdafb-5adb-91cd-26a8-308d34357efc",model="resnet50_trt",version="1"} 0.000000
# HELP nv_inference_count Number of inferences performed
# TYPE nv_inference_count counter
nv_inference_count{gpu_uuid="GPU-0e9fdafb-5adb-91cd-26a8-308d34357efc",model="resnet50_trt",version="1"} 4.000000
# HELP nv_inference_exec_count Number of model executions performed
# TYPE nv_inference_exec_count counter
nv_inference_exec_count{gpu_uuid="GPU-0e9fdafb-5adb-91cd-26a8-308d34357efc",model="resnet50_trt",version="1"} 4.000000
# HELP nv_inference_request_duration_us Cumulative inference request duration in microseconds
# TYPE nv_inference_request_duration_us counter
nv_inference_request_duration_us{gpu_uuid="GPU-0e9fdafb-5adb-91cd-26a8-308d34357efc",model="resnet50_trt",version="1"} 7222.000000
# HELP nv_inference_queue_duration_us Cumulative inference queuing duration in microseconds
# TYPE nv_inference_queue_duration_us counter
nv_inference_queue_duration_us{gpu_uuid="GPU-0e9fdafb-5adb-91cd-26a8-308d34357efc",model="resnet50_trt",version="1"} 116.000000
# HELP nv_inference_compute_input_duration_us Cumulative compute input duration in microseconds
# TYPE nv_inference_compute_input_duration_us counter
nv_inference_compute_input_duration_us{gpu_uuid="GPU-0e9fdafb-5adb-91cd-26a8-308d34357efc",model="resnet50_trt",version="1"} 1874.000000
# HELP nv_inference_compute_infer_duration_us Cumulative compute inference duration in microseconds
# TYPE nv_inference_compute_infer_duration_us counter
nv_inference_compute_infer_duration_us{gpu_uuid="GPU-0e9fdafb-5adb-91cd-26a8-308d34357efc",model="resnet50_trt",version="1"} 5154.000000
# HELP nv_inference_compute_output_duration_us Cumulative inference compute output duration in microseconds
# TYPE nv_inference_compute_output_duration_us counter
nv_inference_compute_output_duration_us{gpu_uuid="GPU-0e9fdafb-5adb-91cd-26a8-308d34357efc",model="resnet50_trt",version="1"} 66.000000
# HELP nv_gpu_utilization GPU utilization rate [0.0 - 1.0)
# TYPE nv_gpu_utilization gauge
nv_gpu_utilization{gpu_uuid="GPU-0e9fdafb-5adb-91cd-26a8-308d34357efc"} 0.000000
# HELP nv_gpu_memory_total_bytes GPU total memory, in bytes
# TYPE nv_gpu_memory_total_bytes gauge
nv_gpu_memory_total_bytes{gpu_uuid="GPU-0e9fdafb-5adb-91cd-26a8-308d34357efc"} 16945512448.000000
# HELP nv_gpu_memory_used_bytes GPU used memory, in bytes
# TYPE nv_gpu_memory_used_bytes gauge
nv_gpu_memory_used_bytes{gpu_uuid="GPU-0e9fdafb-5adb-91cd-26a8-308d34357efc"} 974913536.000000
# HELP nv_gpu_power_usage GPU power usage in watts
# TYPE nv_gpu_power_usage gauge
nv_gpu_power_usage{gpu_uuid="GPU-0e9fdafb-5adb-91cd-26a8-308d34357efc"} 55.137000
# HELP nv_gpu_power_limit GPU power management limit in watts
# TYPE nv_gpu_power_limit gauge
nv_gpu_power_limit{gpu_uuid="GPU-0e9fdafb-5adb-91cd-26a8-308d34357efc"} 300.000000
# HELP nv_energy_consumption GPU energy consumption in joules since the Triton Server started
# TYPE nv_energy_consumption counter
nv_energy_consumption{gpu_uuid="GPU-0e9fdafb-5adb-91cd-26a8-308d34357efc"} 9380.053000
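
The duration metrics are cumulative counters, so dividing them by the request count gives a per-request average. The following sketch parses a few lines copied from the sample output above and computes those averages. The parser is deliberately simple: it assumes one sample per line in the form name{labels} value, and it keeps only the last value seen for each metric name.

```python
def parse_metrics(text):
    """Map metric name -> float value, ignoring comments and labels."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        name_and_labels, value = line.rsplit(" ", 1)
        name = name_and_labels.split("{", 1)[0]
        values[name] = float(value)
    return values

# Samples copied from the metrics output (labels shortened for readability).
SAMPLE = """
nv_inference_request_success{model="resnet50_trt",version="1"} 4.000000
nv_inference_request_duration_us{model="resnet50_trt",version="1"} 7222.000000
nv_inference_queue_duration_us{model="resnet50_trt",version="1"} 116.000000
nv_inference_compute_infer_duration_us{model="resnet50_trt",version="1"} 5154.000000
"""

metrics = parse_metrics(SAMPLE)
requests = metrics["nv_inference_request_success"]
avg_request_us = metrics["nv_inference_request_duration_us"] / requests
avg_infer_us = metrics["nv_inference_compute_infer_duration_us"] / requests
print(f"average request duration: {avg_request_us:.1f} us")
print(f"average compute duration: {avg_infer_us:.1f} us")
```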

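Triton Inference Server has built-in support for Prometheus. You can also configure Grafana to display the monitoring data. For details, see Connect to Grafana.

Note

The cloud-native AI suite already provides model analysis and optimization, which simplifies model performance analysis and optimization. For details, see Model analysis and optimization.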


PyTorch model performance optimization example