
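Many PyTorch users implement the post-processing part of a detection model with a TensorRT Plugin so that the entire model can be exported to TensorRT. Blade is highly extensible: if you have already implemented your own TensorRT Plugin, you can use it together with Blade. This article describes how to use Blade to optimize a detection model whose post-processing is already implemented as a TensorRT Plugin.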

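Background information

TensorRT is a powerful tool for inference optimization on NVIDIA GPUs, and Blade's underlying optimizations make extensive use of TensorRT's techniques. Beyond that, Blade integrates several optimization technologies: computational graph optimization, vendor libraries such as TensorRT and oneDNN, AI compiler optimization, Blade's hand-tuned operator library, Blade mixed precision, and Blade EasyCompression.

RetinaNet is a one-stage detection network whose basic structure consists of a backbone, multiple subnetworks, and NMS post-processing. Many training frameworks implement RetinaNet, with Detectron2 being a typical example. An earlier article described how to export a RetinaNet (Detectron2) model with scripting_with_instances and quickly optimize it with Blade. For more information, see RetinaNet optimization case 1: Use Blade to optimize a RetinaNet (Detectron2) model.

However, for most PyTorch users, exporting to ONNX and then deploying with TensorRT is the common and familiar workflow. Both ONNX export and TensorRT's support for ONNX opsets are limited, so in many cases exporting to ONNX and optimizing with TensorRT is not robust. The post-processing part of a detection network is especially hard to export to ONNX and optimize with TensorRT directly. In addition, the post-processing code of detection models in real-world scenarios is often inefficient. For these reasons, many users implement the post-processing part with the plugin mechanism that TensorRT provides, so that the entire model can be exported to TensorRT.

In comparison, combining Blade with TorchScript Custom C++ Operators is simpler than implementing post-processing with the TensorRT plugin mechanism. For more information, see RetinaNet optimization case 2: Optimize a model with Blade and a Custom C++ Operator. Blade is also highly extensible: if you have already implemented your own TensorRT Plugin, you can use it together with Blade.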

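Limits

The environment used in this article must meet the following version requirements:

System environment: Linux with Python 3.6 or later, GCC 5.4 or later, NVIDIA Tesla T4, CUDA 10.2, cuDNN 8.0.5.39, and TensorRT 7.2.2.3.
Framework: PyTorch 1.8.1 or later, and Detectron2 0.4.1 or later.
Inference optimization tool: Blade 3.16.0 or later (the build that dynamically links TensorRT).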
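
The version floors above can be checked before starting. The following is a minimal sketch and not part of the original tutorial: the helper names `parse_version` and `meets_minimum` are hypothetical, and the comparison looks only at the leading numeric components of a version string, so a local suffix such as `+cu102` is ignored.

```python
import re
import sys


def parse_version(text):
    """Return the leading dotted numeric components of a version string as a tuple of ints."""
    match = re.match(r"\d+(?:\.\d+)*", text)
    if match is None:
        raise ValueError("no numeric version found in {!r}".format(text))
    return tuple(int(part) for part in match.group(0).split("."))


def meets_minimum(installed, minimum):
    """True if the installed version is at least the minimum version."""
    a, b = parse_version(installed), parse_version(minimum)
    width = max(len(a), len(b))
    # Pad with zeros so that "3.6" and "3.6.0" compare as equal.
    a = a + (0,) * (width - len(a))
    b = b + (0,) * (width - len(b))
    return a >= b


# Python itself: this article requires 3.6 or later.
python_version = "{}.{}.{}".format(*sys.version_info[:3])
print("Python >= 3.6:", meets_minimum(python_version, "3.6"))

# A PyTorch version string of the form reported by torch.__version__.
print("PyTorch >= 1.8.1:", meets_minimum("1.8.1+cu102", "1.8.1"))
```

In a real environment, pass `torch.__version__` (and the versions of the other packages) instead of the literal string.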

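Procedure

The process of optimizing a model with Blade and a TensorRT Plugin is as follows:

Step 1: Create a PyTorch model with a TensorRT Plugin
Implement the post-processing part of RetinaNet with a TensorRT Plugin.
Step 2: Call Blade to optimize the model
Call the blade.optimize interface to optimize the model, and save the optimized model.
Step 3: Load and run the optimized model
After benchmarking the model before and after optimization, if you are satisfied with the results, load the optimized model for inference.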

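Step 1: Create a PyTorch model with a TensorRT Plugin

Blade can work together with the TensorRT extension mechanism. The following describes how to implement the post-processing part of RetinaNet with a TensorRT extension. For tutorials on developing and compiling a TensorRT Plugin, see NVIDIA Deep Learning TensorRT Documentation. The program logic of the RetinaNet post-processing used in this article comes from the NVIDIA open source community. For more information, see Retinanet-Examples. This article extracts the core code to illustrate how to develop and implement a Custom Operator.

Download and decompress the sample code.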

wget -nv https://pai-blade.oss-cn-zhangjiakou.aliyuncs.com/tutorials/retinanet_example/retinanet-examples.tar.gz -O retinanet-examples.tar.gz
tar xvfz retinanet-examples.tar.gz 1>/dev/null

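Compile the TensorRT Plugin.

The sample code contains the TensorRT Plugin implementation and registration for the decode and nms steps of RetinaNet post-processing. The official PyTorch documentation (see EXTENDING TORCHSCRIPT WITH CUSTOM C++ OPERATORS) describes three ways to compile Custom Operators: Building with CMake, Building with JIT Compilation, and Building with Setuptools. These approaches suit different scenarios, and you can choose one based on your needs. For simplicity, this article uses Building with JIT Compilation. The sample code is as follows.

Note

Before compiling, configure the dependencies such as TensorRT, CUDA, and cuDNN.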

import torch.utils.cpp_extension
import os

codebase="retinanet-examples"
sources=['csrc/plugins/plugin.cpp',
         'csrc/cuda/decode.cu',
         'csrc/cuda/nms.cu',]
sources = [os.path.join(codebase,src) for src in sources]
torch.utils.cpp_extension.load(
    name="plugin",
    sources=sources,
    build_directory=codebase,
    extra_include_paths=['/usr/local/TensorRT/include/', '/usr/local/cuda/include/', '/usr/local/cuda/include/thrust/system/cuda/detail'],
    extra_cflags=['-std=c++14', '-O2', '-Wall'],
    extra_ldflags=['-L/usr/local/TensorRT/lib/', '-lnvinfer'],
    extra_cuda_cflags=[
        '-std=c++14', '--expt-extended-lambda',
        '--use_fast_math', '-Xcompiler', '-Wall,-fno-gnu-unique',
        '-gencode=arch=compute_75,code=sm_75',],
    is_python_module=False,
    with_cuda=True,
    verbose=False,
)
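
The `-gencode=arch=compute_75,code=sm_75` flag above targets the Tesla T4, whose compute capability is 7.5. On a different GPU the flag needs to match that device. A minimal sketch of deriving it follows; `gencode_flag` is a hypothetical helper, and the commented-out call to `torch.cuda.get_device_capability()` assumes a CUDA-enabled PyTorch build with a visible GPU.

```python
def gencode_flag(major, minor):
    """Build the nvcc -gencode flag for a given CUDA compute capability."""
    arch = "{}{}".format(major, minor)
    return "-gencode=arch=compute_{0},code=sm_{0}".format(arch)


# Tesla T4 has compute capability 7.5, which reproduces the flag used above.
print(gencode_flag(7, 5))

# On a machine with a GPU, the capability can be queried from PyTorch instead:
# import torch
# major, minor = torch.cuda.get_device_capability()
# print(gencode_flag(major, minor))
```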

Wrap the convolutional part of the RetinaNet model.

Wrap the model part of RetinaNet separately as a RetinaNetBackboneAndHeads Module.

import torch
from typing import List
from torch import Tensor
from torch.testing import assert_allclose
from detectron2 import model_zoo

# This class wraps the backbone and heads of RetinaNet.
class RetinaNetBackboneAndHeads(torch.nn.Module):

    def __init__(self, model):
        super().__init__()
        self.model = model

    def preprocess(self, img):
        batched_inputs = [{"image": img}]
        images = self.model.preprocess_image(batched_inputs)
        return images.tensor

    def forward(self, images):
        features = self.model.backbone(images)
        features = [features[f] for f in self.model.head_in_features]
        cls_heads, box_heads = self.model.head(features)
        cls_heads = [cls.sigmoid() for cls in cls_heads]
        box_heads = [b.contiguous() for b in box_heads]
        return cls_heads, box_heads

retinanet_model = model_zoo.get("COCO-Detection/retinanet_R_50_FPN_3x.yaml", trained=True).eval()
retinanet_bacbone_heads = RetinaNetBackboneAndHeads(retinanet_model)

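Build the RetinaNet post-processing network with the TensorRT Plugin. If you have already created a TensorRT Engine, skip this step.

Create a TensorRT Engine.

For the TensorRT Plugin to take effect, the following must be implemented:

Dynamically load the compiled plugin.so through ctypes.cdll.LoadLibrary.
Use build_retinanet_decode to construct the post-processing network with the tensorrt Python API and build it into an Engine.

The sample code is as follows.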

import os
import numpy as np
import tensorrt as trt

import ctypes
# Load the TensorRT Plugin shared library.
codebase="retinanet-examples"
ctypes.cdll.LoadLibrary(os.path.join(codebase, 'plugin.so'))

TRT_LOGGER = trt.Logger()
trt.init_libnvinfer_plugins(TRT_LOGGER, "")
PLUGIN_CREATORS = trt.get_plugin_registry().plugin_creator_list

# Look up a registered TensorRT Plugin by name and instantiate it.
def get_trt_plugin(plugin_name, field_collection):
    plugin = None
    for plugin_creator in PLUGIN_CREATORS:
        if plugin_creator.name != plugin_name:
            continue
        if plugin_name == "RetinaNetDecode":
            plugin = plugin_creator.create_plugin(
                name=plugin_name, field_collection=field_collection
            )
        if plugin_name == "RetinaNetNMS":
            plugin = plugin_creator.create_plugin(
                name=plugin_name, field_collection=field_collection
            )
    assert plugin is not None, "plugin not found"
    return plugin

# Build the TensorRT post-processing network and return the engine.
def build_retinanet_decode(example_outputs,
        input_image_shape,
        anchors_list,
        test_score_thresh = 0.05,
        test_nms_thresh = 0.5,
        test_topk_candidates = 1000,
        max_detections_per_image = 100,
    ):
    builder = trt.Builder(TRT_LOGGER)
    EXPLICIT_BATCH = 1 << (int)(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)
    network = builder.create_network(EXPLICIT_BATCH)
    config = builder.create_builder_config()
    config.max_workspace_size = 3 ** 20

    cls_heads, box_heads = example_outputs
    profile = builder.create_optimization_profile()
    decode_scores = []
    decode_boxes = []
    decode_class = []

    input_blob_names = []
    input_blob_types = []
    def _add_input(head_tensor, head_name):
        input_blob_names.append(head_name)
        input_blob_types.append("Float")
        head_shape = list(head_tensor.shape)[-3:]
        profile.set_shape(
             head_name, [1] + head_shape, [20] + head_shape, [1000] + head_shape)
        return network.add_input(
            name=head_name, dtype=trt.float32, shape=[-1] + head_shape
        )

    # Build network inputs.
    cls_head_inputs = []
    cls_head_strides = [input_image_shape[-1] // cls_head.shape[-1] for cls_head in cls_heads]
    for idx, cls_head in enumerate(cls_heads):
        cls_head_name = "cls_head" + str(idx)
        cls_head_inputs.append(_add_input(cls_head, cls_head_name))

    box_head_inputs = []
    for idx, box_head in enumerate(box_heads):
        box_head_name = "box_head" + str(idx)
        box_head_inputs.append(_add_input(box_head, box_head_name))

    output_blob_names = []
    output_blob_types = []
    # Build decode network.
    for idx, anchors in enumerate(anchors_list):
        field_coll = trt.PluginFieldCollection([
            trt.PluginField("topk_candidates", np.array([test_topk_candidates], dtype=np.int32), trt.PluginFieldType.INT32),
            trt.PluginField("score_thresh", np.array([test_score_thresh], dtype=np.float32), trt.PluginFieldType.FLOAT32),
            trt.PluginField("stride", np.array([cls_head_strides[idx]], dtype=np.int32), trt.PluginFieldType.INT32),
            trt.PluginField("num_anchors", np.array([anchors.numel()], dtype=np.int32), trt.PluginFieldType.INT32),
            trt.PluginField("anchors", anchors.contiguous().cpu().numpy().astype(np.float32), trt.PluginFieldType.FLOAT32),]
        )
        decode_layer = network.add_plugin_v2(
            inputs=[cls_head_inputs[idx], box_head_inputs[idx]],
            plugin=get_trt_plugin("RetinaNetDecode", field_coll),
        )
        decode_scores.append(decode_layer.get_output(0))
        decode_boxes.append(decode_layer.get_output(1))
        decode_class.append(decode_layer.get_output(2))

    # Build NMS network.
    scores_layer = network.add_concatenation(decode_scores)
    boxes_layer = network.add_concatenation(decode_boxes)
    class_layer = network.add_concatenation(decode_class)
    field_coll = trt.PluginFieldCollection([
            trt.PluginField("nms_thresh", np.array([test_nms_thresh], dtype=np.float32), trt.PluginFieldType.FLOAT32),
            trt.PluginField("max_detections_per_image", np.array([max_detections_per_image], dtype=np.int32), trt.PluginFieldType.INT32),]
        )
    nms_layer = network.add_plugin_v2(
       inputs=[scores_layer.get_output(0), boxes_layer.get_output(0), class_layer.get_output(0)],
       plugin=get_trt_plugin("RetinaNetNMS", field_coll),
    )
    nms_layer.get_output(0).name = "scores"
    nms_layer.get_output(1).name = "boxes"
    nms_layer.get_output(2).name = "classes"
    nms_outputs = [network.mark_output(nms_layer.get_output(k)) for k in range(3)]
    config.add_optimization_profile(profile)
    cuda_engine = builder.build_engine(network, config)
    assert cuda_engine is not None
    return cuda_engine
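
`get_trt_plugin` fails with a bare `plugin not found` assertion if `plugin.so` was not loaded or did not register its creators. A small check before building the network can give a clearer message. This is a sketch only: `missing_plugins` is a hypothetical helper, and it assumes nothing beyond each creator object exposing a `name` attribute, as the creators in `PLUGIN_CREATORS` do in the code above. The stand-in objects replace the real registry so that the snippet runs without TensorRT.

```python
from types import SimpleNamespace


def missing_plugins(creators, required):
    """Return the required plugin names that have no registered creator."""
    registered = {creator.name for creator in creators}
    return [name for name in required if name not in registered]


# Stand-ins for trt.get_plugin_registry().plugin_creator_list (illustration only).
fake_creators = [SimpleNamespace(name="RetinaNetDecode")]

print(missing_plugins(fake_creators, ["RetinaNetDecode", "RetinaNetNMS"]))
```

With the real registry, `missing_plugins(PLUGIN_CREATORS, ["RetinaNetDecode", "RetinaNetNMS"])` should return an empty list before `build_retinanet_decode` is called.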

Create the cuda_engine based on the actual number, types, and shapes of the outputs of RetinaNetBackboneAndHeads.

import numpy as np
from detectron2.data.detection_utils import read_image

!wget http://images.cocodataset.org/val2017/000000439715.jpg -q -O input.jpg
img = read_image('./input.jpg')
img = torch.from_numpy(np.ascontiguousarray(img.transpose(2, 0, 1)))

example_inputs = retinanet_bacbone_heads.preprocess(img)
example_outputs = retinanet_bacbone_heads(example_inputs)

cell_anchors = [c.contiguous() for c in retinanet_model.anchor_generator.cell_anchors]
cuda_engine = build_retinanet_decode(
            example_outputs, example_inputs.shape, cell_anchors)
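
Inside `build_retinanet_decode`, the stride passed to each `RetinaNetDecode` plugin is the input width divided by the width of the corresponding head output. The sketch below reproduces that arithmetic with plain tuples. The input shape `(1, 3, 480, 640)` is the one listed in the optimization report later in this article; the five head shapes are an assumption based on the usual RetinaNet FPN levels P3 to P7, not values printed by the tutorial, and `head_strides` is a hypothetical helper.

```python
def head_strides(input_shape, head_shapes):
    """Stride per feature level: input width // head width, as in build_retinanet_decode."""
    return [input_shape[-1] // shape[-1] for shape in head_shapes]


input_shape = (1, 3, 480, 640)
# Assumed head output shapes (N, C, H, W) for FPN levels P3..P7 with a 480x640 input.
head_shapes = [
    (1, 720, 60, 80),
    (1, 720, 30, 40),
    (1, 720, 15, 20),
    (1, 720, 8, 10),
    (1, 720, 4, 5),
]

print(head_strides(input_shape, head_shapes))  # [8, 16, 32, 64, 128]
```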

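Use the Blade extension to support a model that mixes PyTorch and a TensorRT Engine.

The following code uses RetinaNetWrapper, RetinaNetBackboneAndHeads, and RetinaNetPostProcess to recombine the backbone, the heads, and the TensorRT Plugin post-processing.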

import blade.torch

# Post-processing part, supported by the Blade TensorRT extension.
class RetinaNetPostProcess(torch.nn.Module):
    def __init__(self, cuda_engine):
        super().__init__()
        blob_names = [cuda_engine.get_binding_name(idx) for idx in range(cuda_engine.num_bindings)]
        input_blob_names = blob_names[:-3]
        input_blob_types = ["Float"] * len(input_blob_names)
        output_blob_names = blob_names[-3:]
        output_blob_types = ["Float"] * len(output_blob_names)

        self.trt_ext_plugin = torch.classes.torch_addons.TRTEngineExtension(
            bytes(cuda_engine.serialize()),
            (input_blob_names, output_blob_names, input_blob_types, output_blob_types),
        )

    def forward(self, inputs: List[Tensor]):
        return self.trt_ext_plugin.forward(inputs)

# Mix PyTorch and the TensorRT Engine in one module.
class RetinaNetWrapper(torch.nn.Module):

    def __init__(self, model, trt_postproc):
        super().__init__()
        self.backbone_and_heads = model
        self.trt_postproc = torch.jit.script(trt_postproc)

    def forward(self, images):
        cls_heads, box_heads = self.backbone_and_heads(images)
        return self.trt_postproc(cls_heads + box_heads)

trt_postproc = RetinaNetPostProcess(cuda_engine)
retinanet_mix_trt = RetinaNetWrapper(retinanet_bacbone_heads, trt_postproc)

# The combined model can be exported and saved as TorchScript.
retinanet_script = torch.jit.trace(retinanet_mix_trt, (example_inputs, ), check_trace=False)
torch.jit.save(retinanet_script, 'retinanet_script.pt')
torch.save(example_inputs, 'example_inputs.pth')
outputs = retinanet_script(example_inputs)

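The newly assembled torch.nn.Module has the following characteristics:

It uses Blade's TensorRT extension through the torch.classes.torch_addons.TRTEngineExtension interface.
It supports TorchScript export. The code above uses torch.jit.trace for the export.
It can be saved in TorchScript format.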
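
`RetinaNetPostProcess` assumes that the last three bindings of the engine are the outputs (`blob_names[-3:]`). A sketch of a less position-dependent split is shown below. It assumes the engine exposes `num_bindings`, `get_binding_name(idx)`, and `binding_is_input(idx)`, as the TensorRT 7 Python `ICudaEngine` does; `split_bindings` and the `FakeEngine` stand-in are hypothetical and exist only so that the snippet runs without TensorRT.

```python
def split_bindings(engine):
    """Partition binding names into inputs and outputs by asking the engine."""
    inputs, outputs = [], []
    for idx in range(engine.num_bindings):
        name = engine.get_binding_name(idx)
        (inputs if engine.binding_is_input(idx) else outputs).append(name)
    return inputs, outputs


class FakeEngine:
    """Minimal stand-in for a TensorRT engine (illustration only)."""

    def __init__(self, names, num_inputs):
        self._names = names
        self._num_inputs = num_inputs
        self.num_bindings = len(names)

    def get_binding_name(self, idx):
        return self._names[idx]

    def binding_is_input(self, idx):
        return idx < self._num_inputs


engine = FakeEngine(["cls_head0", "box_head0", "scores", "boxes", "classes"], 2)
inputs, outputs = split_bindings(engine)
print(inputs)
print(outputs)
```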

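Step 2: Call Blade to optimize the model

Call the Blade optimization interface.

Call the blade.optimize interface to optimize the model. The sample code is as follows. For details about the blade.optimize interface, see Optimize a PyTorch model.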

import blade
import blade.torch
import ctypes
import torch
import os

codebase="retinanet-examples"
ctypes.cdll.LoadLibrary(os.path.join(codebase, 'plugin.so'))

blade_config = blade.Config()
blade_config.gpu_config.disable_fp16_accuracy_check = True

script_model = torch.jit.load('retinanet_script.pt')
example_inputs = torch.load('example_inputs.pth')
test_data = [(example_inputs,)]  # PyTorch input data is a list of tuples.
with blade_config:
    optimized_model, opt_spec, report = blade.optimize(
        script_model,  # The TorchScript model exported in the previous step.
        'o1',  # Enable Blade O1-level optimization.
        device_type='gpu',  # The target device is GPU.
        test_data=test_data,  # Test data used to assist optimization and testing.
    )

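Print the optimization report and save the model.

The model optimized by Blade is still a TorchScript model. After the optimization is complete, you can print the optimization report and save the optimized model with the following code.

# Print the optimization report.
print("Report: {}".format(report))
# Save the optimized model.
torch.jit.save(optimized_model, 'optimized.pt')

The printed optimization report looks like the following. For details about the fields in the report, see Optimization report.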

Report: {
  "software_context": [
    {
      "software": "pytorch",
      "version": "1.8.1+cu102"
    },
    {
      "software": "cuda",
      "version": "10.2.0"
    }
  ],
  "hardware_context": {
    "device_type": "gpu",
    "microarchitecture": "T4"
  },
  "user_config": "",
  "diagnosis": {
    "model": "unnamed.pt",
    "test_data_source": "user provided",
    "shape_variation": "undefined",
    "message": "Unable to deduce model inputs information (data type, shape, value range, etc.)",
    "test_data_info": "0 shape: (1, 3, 480, 640) data type: float32"
  },
  "optimizations": [
    {
      "name": "PtTrtPassFp16",
      "status": "effective",
      "speedup": "4.37",
      "pre_run": "40.59 ms",
      "post_run": "9.28 ms"
    }
  ],
  "overall": {
    "baseline": "40.02 ms",
    "optimized": "9.27 ms",
    "speedup": "4.32"
  },
  "model_info": {
    "input_format": "torch_script"
  },
  "compatibility_list": [
    {
      "device_type": "gpu",
      "microarchitecture": "T4"
    }
  ],
  "model_sdk": {}
}
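
The `speedup` fields in the report are the ratio of the two latencies listed next to them. The sketch below recomputes them from the values shown above. `speedup` is a hypothetical helper, and the strings are copied from the printed report rather than read from the `report` object, whose exact type is not shown in this article.

```python
def speedup(before, after):
    """Ratio of two latency strings such as '40.02 ms', rounded to two decimals."""
    before_ms = float(before.split()[0])
    after_ms = float(after.split()[0])
    return round(before_ms / after_ms, 2)


print(speedup("40.59 ms", "9.28 ms"))  # PtTrtPassFp16 pass: 4.37
print(speedup("40.02 ms", "9.27 ms"))  # overall: 4.32
```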

Benchmark the model before and after optimization.

The sample code for the performance test is as follows.

import time

@torch.no_grad()
def benchmark(model, inp):
    for i in range(100):
        model(inp)
    torch.cuda.synchronize()
    start = time.time()
    for i in range(200):
        model(inp)
    torch.cuda.synchronize()
    elapsed_ms = (time.time() - start) * 1000
    print("Latency: {:.2f}".format(elapsed_ms / 200))

# Measure the latency of the model before optimization.
benchmark(script_model, example_inputs)
# Measure the latency of the model after optimization.
benchmark(optimized_model, example_inputs)

The reference results of this test are as follows.

Latency: 40.71
Latency: 9.35

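These results show that, over the same 200 iterations, the average latency of the model before and after optimization is 40.71 ms and 9.35 ms, respectively.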
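
An average hides tail latency. The sketch below is a variation on `benchmark` that times each iteration and also reports the median and a high percentile. It is an illustration, not the method used in this article: `latency_stats` and its `sync` parameter are hypothetical, and on a GPU `sync` is assumed to be `torch.cuda.synchronize` so that each measurement waits for the kernels to finish. Wrapping the call in `torch.no_grad()`, as `benchmark` does, is left to the caller.

```python
import statistics
import time


def latency_stats(fn, warmup=10, iters=50, sync=lambda: None):
    """Time fn() per iteration and return mean, median and p95 latency in milliseconds."""
    for _ in range(warmup):
        fn()
    sync()
    samples = []
    for _ in range(iters):
        start = time.perf_counter()
        fn()
        sync()  # Wait for asynchronous work before stopping the clock.
        samples.append((time.perf_counter() - start) * 1000)
    samples.sort()
    p95 = samples[min(len(samples) - 1, int(0.95 * len(samples)))]
    return {
        "mean": statistics.mean(samples),
        "median": statistics.median(samples),
        "p95": p95,
    }


# CPU-only stand-in workload so that the snippet runs anywhere.
stats = latency_stats(lambda: sum(range(10000)))
print({key: round(value, 3) for key, value in stats.items()})
```

For the models in this article, the call would look like `latency_stats(lambda: optimized_model(example_inputs), sync=torch.cuda.synchronize)`.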

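Step 3: Load and run the optimized model

Optional: During the trial period, you can set the following environment variable to prevent the program from exiting because of an authentication failure.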
```
export BLADE_AUTH_USE_COUNTING=1
```
Obtain authentication.
```
export BLADE_REGION=<region>
export BLADE_TOKEN=<token>
```
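Replace the following parameters with your actual values:
- <region>: a region supported by Blade. Join the Blade user group to obtain this information. For the QR code of the user group, see Obtain a Token.
- <token>: the authentication token. Join the Blade user group to obtain this information. For the QR code of the user group, see Obtain a Token.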
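
A script that loads the optimized model can check these variables up front instead of failing later during authentication. A minimal sketch follows; `missing_env` is a hypothetical helper, and how Blade itself reacts to a missing variable is not covered by this article.

```python
import os


def missing_env(names, env=None):
    """Return the variable names that are unset or empty in the given environment."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]


required = ["BLADE_REGION", "BLADE_TOKEN"]
missing = missing_env(required)
if missing:
    print("Set these variables before loading the model:", ", ".join(missing))
else:
    print("Blade authentication variables are set")
```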

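Load and run the optimized model.

The model optimized by Blade is still a TorchScript model, so you can load the optimized result without switching environments.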

import blade.runtime.torch
import torch

from torch.testing import assert_allclose
import ctypes
import os

codebase="retinanet-examples"
ctypes.cdll.LoadLibrary(os.path.join(codebase, 'plugin.so'))

optimized_model = torch.jit.load('optimized.pt')
example_inputs = torch.load('example_inputs.pth')

with torch.no_grad():
    pred = optimized_model(example_inputs)


RetinaNet optimization case 3: Optimize a model with Blade and a TensorRT Plugin