Optimizing YOLACT++ with TensorRT

Reposted article, 2022-03-03 17:14. Category: model optimization

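1. Introduction

About TensorRT

NVIDIA TensorRT is an SDK for deep learning inference. It provides APIs and parsers for importing trained models from all major deep learning frameworks, and then generates optimized runtime engines that can be deployed in data centers as well as in automotive and embedded environments.

TensorRT is designed to help deploy deep learning for these use cases. With support for every major framework, it helps process large volumes of data at low latency through aggressive optimization, reduced-precision arithmetic, and efficient memory use.
Official documentation
- TensorRT official documentation
- TensorRT official GitHub repository
Steps
1. Convert the pretrained image segmentation PyTorch model to ONNX.
2. Import the ONNX model into TensorRT.
3. Apply optimizations and build the engine.
4. Run inference on the GPU.
Notes

Importing an ONNX model means loading it from a file saved on disk and converting it from its native framework or format into a TensorRT network. ONNX is a standard for representing deep learning models that allows them to be transferred between frameworks.

Many frameworks, such as Caffe2, Chainer, CNTK, PaddlePaddle, PyTorch and MXNet, support the ONNX format. Next, an optimized TensorRT engine is built according to the input model, the target platform and other specified configuration parameters. The final step is to feed input data to the TensorRT engine to run inference.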

2. Environment setup

Requirements (the configuration used for this optimization)

Officially recommended configuration for each TensorRT version
- Ubuntu 18.04
- GeForce RTX 2080 Ti
- Driver Version 450.51.06
- NVIDIA-SMI 450.51.06
- CUDA Version: 11.0
- Python 3.6.8
- CMake 3.13.0 or later
- CUDA Toolkit 11.0.221
- cuDNN 8.0.5
- TensorRT 8.0-EA (Early Access)
- onnx 1.6
- onnx-tensorrt 8.0-EA

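2.1. Create a virtual environment

Create the virtual environment

# Create the virtual environment
ubuntu@ubuntu:/usr/local$ conda create --name conda-liyy python=3.6.8
# List the virtual environments
ubuntu@ubuntu:/usr/local$ conda env list
# Activate the virtual environment
ubuntu@ubuntu:/usr/local$ conda activate conda-liyy

For convenience, install directly from the Alibaba Cloud package index: PyTorch 1.7.1+cu110 with torch-addons==3.14.1+cu110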

pip install torch==1.7.1+cu110 torch-addons==3.14.1+cu110 torchvision==0.8.2+cu110 -f https://pai-blade.oss-cn-zhangjiakou.aliyuncs.com/release/repo_ext.html
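The pinned wheels above all carry the `+cu110` local version tag, which is what ties them to CUDA 11.0. As a minimal sketch (the helper names `cuda_tag` and `tags_match` are hypothetical, introduced only for illustration), the tag can be extracted from a version string and compared against the expected CUDA build:

```python
import re


def cuda_tag(version):
    """Return the CUDA build tag (e.g. 'cu110') of a wheel version string, or None."""
    match = re.search(r"\+(cu\d+)$", version)
    return match.group(1) if match else None


def tags_match(versions, expected="cu110"):
    """True when every version string carries the expected CUDA tag."""
    return all(cuda_tag(v) == expected for v in versions)


# Version strings taken from the pip command above.
pinned = ["1.7.1+cu110", "3.14.1+cu110", "0.8.2+cu110"]
print(tags_match(pinned))
```

After installation, the same check can be applied to the string reported by `torch.__version__` inside the virtual environment.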

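2.2. Install CUDA

About CUDA

CUDA stands for Compute Unified Device Architecture. It is the parallel computing architecture that NVIDIA brought to market in 2007. As the general-purpose computing engine of NVIDIA graphics processors, CUDA provides a complete toolkit for GPGPU (general-purpose computing on graphics processing units) development on NVIDIA cards.
CUDA 11.0 download (wget accepts -P to choose the download directory, e.g. wget -P <directory>)
- CUDA 11.0 download page
[Example screenshot]

The key installer options are shown in the screenshot.

(Note: when asked whether to reinstall the driver, first check whether the installed driver version meets the CUDA Toolkit requirement. Reinstall only if it does not; otherwise do not select the driver.)

Go to /usr/local and recreate the symlink to switch the CUDA version (make sure /usr/local/cuda/bin and /usr/local/cuda/lib64 are in $PATH and $LD_LIBRARY_PATH respectively), then run nvcc -V to confirm that the version changed

# Remove the previous cuda symlink
sudo rm -rf cuda
# Symlink cuda-11.0 to cuda
sudo ln -s cuda-11.0 ./cuda
# Add the paths
export PATH=$PATH:/usr/local/cuda/bin
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda/lib64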
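Whether the two directories really ended up on the search paths can be checked from Python as well. The following is a small sketch under the assumption that the paths are exactly the ones exported above; `missing_entries` is a hypothetical helper, not part of any CUDA tooling:

```python
import os

# Directories the exports above are expected to add.
REQUIRED = {
    "PATH": "/usr/local/cuda/bin",
    "LD_LIBRARY_PATH": "/usr/local/cuda/lib64",
}


def missing_entries(env, required=REQUIRED):
    """Return the variables whose colon-separated value lacks the required directory."""
    missing = []
    for var, directory in required.items():
        entries = [e.rstrip("/") for e in env.get(var, "").split(":") if e]
        if directory not in entries:
            missing.append(var)
    return missing


# An empty list means both variables contain the CUDA directories.
print(missing_entries(os.environ))
```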

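2.3. Install cuDNN

About cuDNN

The NVIDIA CUDA Deep Neural Network library (cuDNN) is a GPU-accelerated library of primitives for deep neural networks. It provides highly tuned implementations of standard routines such as forward and backward convolution, pooling, normalization and activation layers.

Deep learning researchers and framework developers around the world rely on cuDNN for high-performance GPU acceleration. It lets them focus on training neural networks and developing applications instead of spending time on low-level GPU performance tuning. cuDNN accelerates widely used deep learning frameworks, including Caffe2, Chainer, Keras, MATLAB, MXNet, PyTorch and TensorFlow.
cuDNN download
- cuDNN download page
Create a cudnn-8.0.5 folder in /usr/local and download the matching cuDNN version from the link above. Use scp to transfer the file to /tmp on the server (scp may fail in the Windows shell; use Git Bash if so), then move the archive from /tmp into the cudnn-8.0.5 folder and extract it.

Note: this is only an example; use the cuDNN build that matches your CUDA version.

Enter cudnn-8.0.5/cuda and copy the cuDNN headers and the libcudnn* libraries into the cuda-11.0 folder to complete the installation

# Copy the files (cuDNN 8 splits its API across several cudnn*.h headers)
sudo cp ./include/cudnn*.h /usr/local/cuda-11.0/include/
sudo cp ./lib64/libcudnn* /usr/local/cuda-11.0/lib64/
# Fix the permissions
sudo chmod a+r /usr/local/cuda-11.0/include/cudnn*.h
sudo chmod a+r /usr/local/cuda-11.0/lib64/libcudnn*
# Query the version (cuDNN 8 keeps the version macros in cudnn_version.h)
cat /usr/local/cuda-11.0/include/cudnn_version.h | grep CUDNN_MAJOR -A 2

Check the cuDNN version

cat /usr/local/cuda/include/cudnn.h | grep CUDNN_MAJOR -A 2
# If nothing is printed, locate the cuDNN version header under /usr/local/ and use its path in the cat command above
find ./ -name cudnn_version.h 2>&1
cat /usr/local/cudnn-8.0.5/cuda/include/cudnn_version.h | grep CUDNN_MAJOR -A 2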
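The grep commands above print the three version macros from the header. As an illustrative sketch, the same information can be read in Python; `parse_cudnn_version` is a hypothetical helper and the sample text only mimics the relevant lines of the header:

```python
import re


def parse_cudnn_version(header_text):
    """Extract (major, minor, patch) from the text of a cuDNN version header."""
    values = []
    for name in ("CUDNN_MAJOR", "CUDNN_MINOR", "CUDNN_PATCHLEVEL"):
        match = re.search(r"#define\s+" + name + r"\s+(\d+)", header_text)
        if match is None:
            return None  # macro not present in this file
        values.append(int(match.group(1)))
    return tuple(values)


# Sample lines in the shape the grep command prints for cuDNN 8.0.5.
sample = """
#define CUDNN_MAJOR 8
#define CUDNN_MINOR 0
#define CUDNN_PATCHLEVEL 5
#define CUDNN_VERSION (CUDNN_MAJOR * 1000 + CUDNN_MINOR * 100 + CUDNN_PATCHLEVEL)
"""
print(parse_cudnn_version(sample))  # (8, 0, 5)
```

To use it on the real installation, pass the contents of the cudnn_version.h file found by the find command.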

2.4. Install TensorRT

TensorRT download (TensorRT 8.0 was released in July 2021 in two flavors, GA (General Availability) and EA (Early Access). Because the onnx-tensorrt version used later corresponds to the EA release, we download the EA version here.)

TensorRT download page
Download the package shown in the screenshot (choose the tar format: using TensorRT from Python requires installing the whl packages, and this format makes the later steps easier)

Install TensorRT

# Use scp to copy the TensorRT archive from the local directory to /tmp on the server
K0802389@JYDZ106321 MINGW64 /d/work/source
$ scp ./TensorRT-8.0.0.3.Linux.x86_64-gnu.cuda-11.0.cudnn8.2.tar.gz ubuntu@172.28.1.243:/tmp
# Go to /usr/local
(base) ubuntu@ubuntu:~$ cd /usr/local
# Create the TensorRT-8.0.0.3 folder
(base) ubuntu@ubuntu:/usr/local$ sudo mkdir TensorRT-8.0.0.3
(base) ubuntu@ubuntu:/usr/local$ cd TensorRT-8.0.0.3/
# Move the TensorRT archive into the folder
(base) ubuntu@ubuntu:/usr/local/TensorRT-8.0.0.3$ sudo mv /tmp/TensorRT-8.0.0.3.Linux.x86_64-gnu.cuda-11.0.cudnn8.2.tar.gz  ./
# Extract TensorRT
(base) ubuntu@ubuntu:/usr/local/TensorRT-8.0.0.3$ sudo tar -xzvf TensorRT-8.0.0.3.Linux.x86_64-gnu.cuda-11.0.cudnn8.2.tar.gz
# Add TensorRT-8.0.0.3/TensorRT-8.0.0.3/lib to the library search path (append this export line to ~/.bashrc so that it persists)
(base) ubuntu@ubuntu:/usr/local/TensorRT-8.0.0.3/TensorRT-8.0.0.3/lib$ export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/TensorRT-8.0.0.3/TensorRT-8.0.0.3/lib
# Reload the configuration file so that the setting takes effect
(base) ubuntu@ubuntu:/usr/local/TensorRT-8.0.0.3/TensorRT-8.0.0.3/lib$ source ~/.bashrc

Install the TensorRT Python wheel files

# cd into the python folder inside the TensorRT folder
(base) ubuntu@ubuntu:/usr/local/TensorRT-8.0.0.3/TensorRT-8.0.0.3$ cd python
# Activate your own virtual environment
(base) ubuntu@ubuntu:/usr/local/TensorRT-8.0.0.3/TensorRT-8.0.0.3/python$ conda activate conda-blade-pytorch1.7.1
# Install python-tensorrt; the environment uses Python 3.6, so pick the cp36 wheel
(conda-blade-pytorch1.7.1) ubuntu@ubuntu:/usr/local/TensorRT-8.0.0.3/TensorRT-8.0.0.3/python$ pip install tensorrt-8.0.0.3-cp36-none-linux_x86_64.whl
# Install uff, which is needed when converting TensorFlow models
(conda-blade-pytorch1.7.1) ubuntu@ubuntu:/usr/local/TensorRT-8.0.0.3/TensorRT-8.0.0.3/uff$ pip install uff-0.6.9-py2.py3-none-any.whl
# Install graphsurgeon, which searches and manipulates graphs and supports custom plugins; this package is for TensorFlow
# Official documentation: https://docs.nvidia.com/deeplearning/tensorrt/api/python_api/graphsurgeon/graphsurgeon.html
(conda-blade-pytorch1.7.1) ubuntu@ubuntu:/usr/local/TensorRT-8.0.0.3/TensorRT-8.0.0.3/graphsurgeon$ pip install graphsurgeon-0.4.5-py2.py3-none-any.whl
# Install the ONNX flavor of graphsurgeon, which can be used with PyTorch
(conda-blade-pytorch1.7.1) ubuntu@ubuntu:/usr/local/TensorRT-8.0.0.3/TensorRT-8.0.0.3/onnx_graphsurgeon$ pip install onnx_graphsurgeon-0.2.6-py2.py3-none-any.whl

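2.5. Install onnx

Note: both onnx and onnx-tensorrt must be installed inside your own virtual environment, i.e. after entering it with conda activate

About the ONNX format

ONNX stands for Open Neural Network Exchange. Different deep learning frameworks can save models in the ONNX format, which allows models to be converted between frameworks.
In ONNX, every computation graph is defined as a list of nodes. Each node is an operator (OP) with one or more inputs and outputs, and together the nodes form a directed acyclic graph.
ONNX now supports all of the major deep learning frameworks. Some, such as PyTorch, integrate ONNX officially; others need third-party support. Even for a niche framework such as darknet, the ONNX graph can be built by hand to convert a model to ONNX.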
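The "list of nodes forming a directed acyclic graph" idea can be made concrete with a toy model. This is only a sketch using plain Python objects, not the onnx package; the `Node` class, the tensor names and the three-node graph are made up for illustration. A node list describes a valid acyclic graph in this sense when each node only consumes tensors that are graph inputs or outputs of earlier nodes:

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """One operator with named input and output tensors."""
    op_type: str
    inputs: List[str] = field(default_factory=list)
    outputs: List[str] = field(default_factory=list)


def is_topologically_ordered(nodes, graph_inputs):
    """Check that every node only consumes tensors produced earlier in the list."""
    available = set(graph_inputs)
    for node in nodes:
        if any(name not in available for name in node.inputs):
            return False
        available.update(node.outputs)
    return True


# Hypothetical three-node graph ending in a DCNv2 operator.
graph = [
    Node("Conv", ["image", "w0"], ["feat"]),
    Node("Relu", ["feat"], ["act"]),
    Node("DCNv2", ["act", "offset", "mask", "w1", "b1"], ["out"]),
]
inputs = ["image", "w0", "offset", "mask", "w1", "b1"]
print(is_topologically_ordered(graph, inputs))  # True
```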
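onnx is a standalone library. The parser depends on it, specifically on libonnx_proto.a. Download:
- onnx library download page

protobuf must be installed before onnx. protobuf is a binary data interchange format that serves a similar purpose to JSON but is more efficient to transmit

sudo apt-get install autoconf automake libtool curl
# Download protobuf
git clone https://github.com/google/protobuf.git
# Fetch the files of the other projects it depends on
cd protobuf
git submodule update --init --recursive
# Configure and build
./autogen.sh
./configure
make all -j16
# Install
sudo make install
# Refresh the dynamic linker cache
sudo ldconfig
# Check the installed version
protoc --version

Build onnx. The onnx-tensorrt 8.0-EA installed here uses onnx 1.6.0 (other versions may fail to compile)

# Download the source
git clone https://github.com/onnx/onnx.git
# Fetch the dependencies
cd onnx
git submodule update --init --recursive
# Build
mkdir build && cd build
cmake .. -DONNX_NAMESPACE=onnx2trt_onnx
make -j16
# Install
sudo make install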

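2.6. Install onnx-tensorrt

What the parser does

After the model has been converted to ONNX, this package maps the operators and parameters in the ONNX file onto their TensorRT counterparts, so that the ONNX model file can be converted into a trt file
onnx-tensorrt download

onnx-tensorrt GitHub repository
Before installing onnx-tensorrt, make sure that the CMake version is 3.13.0 or later. If it is not, see the following link
- Upgrading CMake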
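The version requirement can be checked programmatically. Below is a minimal sketch that parses the first line of `cmake --version` output and compares it with the 3.13.0 minimum stated above; the function names are hypothetical:

```python
import re


def parse_cmake_version(output):
    """Return the (major, minor, patch) tuple found in `cmake --version` output."""
    match = re.search(r"cmake version (\d+)\.(\d+)\.(\d+)", output)
    return tuple(int(part) for part in match.groups()) if match else None


def meets_minimum(output, minimum=(3, 13, 0)):
    """True when the reported version is at least the required minimum."""
    version = parse_cmake_version(output)
    return version is not None and version >= minimum


print(meets_minimum("cmake version 3.10.2\n"))  # False
print(meets_minimum("cmake version 3.16.3\n"))  # True
```

On a real machine the text can be obtained with `subprocess.check_output(["cmake", "--version"]).decode()`. Tuples are compared element by element, so 3.9.x correctly sorts below 3.13.0, which a plain string comparison would get wrong.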

Install onnx-tensorrt

# Download
git clone https://github.com/onnx/onnx-tensorrt.git
cd onnx-tensorrt
# Switch to the 8.0-EA branch
git branch -a
git checkout -b 8.0-EA origin/8.0-EA
# Fetch the dependencies
git submodule update --init --recursive
mkdir build && cd build
# TENSORRT_ROOT is the TensorRT installation directory, i.e. the directory TensorRT was extracted into
# cmake .. -DProtobuf_INCLUDE_DIR=/data1/xuduo/optimize/yolact_dir_0804/protobuf/src -DTENSORRT_ROOT=/usr/local/TensorRT-8.0.0.3/TensorRT-8.0.0.3
cmake .. -DTENSORRT_ROOT=/usr/local/TensorRT-8.0.0.3/TensorRT-8.0.0.3 && make -j16
# Ensure that you update your LD_LIBRARY_PATH to pick up the location of the newly built library:
sudo make install

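3. Convert the model file to ONNX

YOLACT official GitHub repository (train a model by following the official procedure before continuing)
- yolact
YOLACT++ relies on deformable convolution, which has to be compiled, so it cannot be used out of the box. In addition, the PyTorch and torch-addons builds installed in the previous step target CUDA 11.0, so the stock DCNv2 cannot be compiled directly. Download a DCNv2 port for CUDA 11.0 and run make.sh to build it (if the build fails, check that your PyTorch, CUDA, cuDNN and gcc versions meet the requirements). Download:
- DCNv2 port for CUDA 11.0
ONNX does not support the DCNv2 convolution when the YOLACT model is exported, so dcn_v2.py must be modified before running make.sh. References:
- Blog post
- ONNX plugin GitHub repository

dcn_v2.py (for C++, an ONNX plugin file has to be written instead; see the references for the full implementation)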

# dcn_v2.py
class _DCNv2(Function):

    ###########################  Modified part ########################
    # This function adds ONNX export support for DCNv2
    @staticmethod
    def symbolic(g, input, offset,mask, weight, bias, stride, padding, dilation, deformable_groups):
        # dilation: [1, 1],padding: [1, 1],stride: [1, 1],deformable_groups: 1
        # The _i suffix marks an int attribute and the _s suffix a string attribute
        return g.op("DCNv2", input, offset,mask, weight, bias, name_s="DCNv2",
                    dilation_i = dilation[0],
                    padding_i = padding[0],
                    stride_i = stride[0],
                    deformable_groups_i = deformable_groups)
            
    @staticmethod
    def forward(ctx, input, offset, mask, weight, bias,
                stride, padding, dilation, deformable_groups):
        ctx.stride = _pair(stride)
        ctx.padding = _pair(padding)
        ctx.dilation = _pair(dilation)
        ctx.kernel_size = _pair(weight.shape[2:4])
        ctx.deformable_groups = deformable_groups
        output = _backend.dcn_v2_forward(input, weight, bias,
                                         offset, mask,
                                         ctx.kernel_size[0], ctx.kernel_size[1],
                                         ctx.stride[0], ctx.stride[1],
                                         ctx.padding[0], ctx.padding[1],
                                         ctx.dilation[0], ctx.dilation[1],
                                         ctx.deformable_groups)
        ctx.save_for_backward(input, offset, mask, weight, bias)
        return output
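In the symbolic function above, each keyword argument name ends in a type suffix that tells the exporter how to store the attribute. The stand-alone sketch below only illustrates that naming convention for the suffixes mentioned in this article (`_i`, `_f`, `_s`); `split_attribute` is a hypothetical helper and is not PyTorch code:

```python
# Suffix-to-type mapping for the attribute suffixes discussed in this article.
SUFFIX_TYPES = {"i": int, "f": float, "s": str}


def split_attribute(keyword):
    """Split a keyword such as 'stride_i' into ('stride', int)."""
    name, _, suffix = keyword.rpartition("_")
    if not name or suffix not in SUFFIX_TYPES:
        raise ValueError("no recognised type suffix in %r" % keyword)
    return name, SUFFIX_TYPES[suffix]


# Keywords taken from the g.op(...) call above.
for key in ("dilation_i", "deformable_groups_i", "name_s"):
    print(split_attribute(key))
```

Only the last underscore separates the suffix, which is why `deformable_groups_i` resolves to the attribute name `deformable_groups`. These are the names the importer later reads back on the C++ side.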

Convert the PyTorch .pth model file to an ONNX file

import os
os.environ["CUDA_VISIBLE_DEVICES"] = '5'
import torch
from yolact import Yolact
from pathlib import Path
from utils.augmentations import BaseTransform, FastBaseTransform, Resize
import cv2
device = 'cuda'
# device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(device)
dir_path = os.path.dirname(__file__)
pth_path = os.path.join(dir_path,"yolact_pth_dir/yolact_plus_resnet50_252_129000.pth")
img_path = os.path.join(dir_path,"img")
yolact_net = Yolact()
yolact_net.load_weights(pth_path)
yolact_net = yolact_net.to(device)
print(pth_path)
print(yolact_net)
batch_size = 1  # batch size
input_shape = (3, 550, 550)   # input data; change this to your own input shape

# Set the model to inference mode
yolact_net.eval()
export_onnx_file = "/data1/xuduo/optimize/yolact_dir_0804/yolact.onnx"		# path of the output ONNX file
for p in Path(img_path).glob("*"):
    path = str(p)
    print(path)
    frame = torch.from_numpy(cv2.imread(path)).cuda().float()
    batch = FastBaseTransform()(frame.unsqueeze(0))
    preds = yolact_net(batch)
    torch.onnx.export(yolact_net,
                    batch,
                    export_onnx_file,
                    export_params=True,
                    keep_initializers_as_inputs=True,
                    opset_version=11,
                      # Disable the checker, otherwise it may raise an error on DCNv2
                    enable_onnx_checker=False 
                    )

Errors may still occur. Modify the following three places; drop the self.detect post-processing, because the ONNX format does not support the Python dict form

# yolact.py
def forward(self, x):
    ...
    else:
        ...
        if cfg.use_objectness_score:
            objectness = torch.sigmoid(pred_outs['conf'][:, :, 0])

            pred_outs['conf'][:, :, 1:] = (objectness > 0.10)[..., None] \
                * F.softmax(pred_outs['conf'][:, :, 1:], dim=-1)

        else:
            pred_outs['conf'] = F.softmax(pred_outs['conf'], -1)
        # TODO modified: return the predictions directly, without the self.detect post-processing
        return pred_outs
        # return self.detect(pred_outs, self)

def make_priors(self, conv_h, conv_w, device):
    ...
    # TODO modified: build the Tensor first, then move it with to(device)
    # print(prior_data.device)
    # self.priors = torch.Tensor(prior_data, device=device).view(-1, 4).detach()
    self.priors = torch.Tensor(prior_data).view(-1, 4)
    self.priors = self.priors.to(device).detach()


# layers/functions/detection.py
def traditional_nms(self, boxes, masks, scores, iou_threshold=0.5, conf_thresh=0.05):
    ...
    preds = torch.cat([boxes[conf_mask], cls_scores[:, None]], dim=1).cpu().detach().numpy()
    keep = cnms(preds, iou_threshold)
    # TODO modified: build the Tensor first, then move it with to(device)
    # keep = torch.Tensor(keep, device=boxes.device).long()
    keep = torch.Tensor(keep).long()
    keep = keep.to(boxes.device)
    idx_lst.append(idx[keep])
    cls_lst.append(keep * 0 + _cls)
    scr_lst.append(cls_scores[keep])

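4. Generate the optimized trt model

4.1. Build the open-source TensorRT code

Download the TensorRT source (this is the open-source part of TensorRT; building it requires the TensorRT core libraries from section 2.4, whose code is not open source)

TensorRT GitHub repository

# Clone the code
git clone https://github.com/liyuyuan6969/TensorRT.git
# List the branches and switch
cd TensorRT
git branch -a
git checkout -b rel-8.0 origin/release/8.0
# Fetch the submodules
git submodule update --init --recursive
# Copy the TensorRT core libraries into the TensorRT source directory
cp -r /usr/local/TensorRT-8.0.0.3/TensorRT-8.0.0.3 ./
# Edit the version settings in CMakeLists.txt
# Set the CUDA, cuDNN and protobuf versions to the ones installed locally; the commands for checking them are in section 2
# TensorRT/CMakeLists.txt
#set(DEFAULT_CUDA_VERSION 11.3.1)
#set(DEFAULT_CUDNN_VERSION 8.2)
#set(DEFAULT_PROTOBUF_VERSION 3..0)
#set(DEFAULT_CUB_VERSION 1.8.0)
# Change them to the locally installed versions:
set(DEFAULT_CUDA_VERSION 11.0.221)
set(DEFAULT_CUDNN_VERSION 8.05)
set(DEFAULT_PROTOBUF_VERSION 3.16.0)

# Set environment variables so that the build can find the libraries; `pwd` expands to the current directory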
export TRT_SOURCE=`pwd`
export TRT_RELEASE=`pwd`/TensorRT-8.0.0.3
export TENSORRT_LIBRARY_INFER=$TRT_RELEASE/targets/x86_64-linux-gnu/lib/libnvinfer.so
export TENSORRT_LIBRARY_INFER_PLUGIN=$TRT_RELEASE/targets/x86_64-linux-gnu/lib/libnvinfer_plugin.so
export TENSORRT_LIBRARY_MYELIN=$TRT_RELEASE/targets/x86_64-linux-gnu/lib/libmyelin.so
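# First build, to confirm that the environment works. TRT_RELEASE is the TensorRT core library directory, -DTRT_LIB_DIR is the lib folder under it, and -DTRT_OUT_DIR is the output directory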
mkdir build && cd build
cmake .. -DTRT_LIB_DIR=$TRT_RELEASE/lib -DTRT_OUT_DIR=`pwd`/out
make -j16
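If the build cannot locate the core libraries, it helps to confirm that the three exported paths really exist. The sketch below rebuilds those paths from TRT_RELEASE exactly as the export lines above do and reports the ones missing on disk; the helper names are hypothetical, and the fallback directory name is only the one used in this article:

```python
import os

# Sub-directory and library names taken from the export lines above.
LIB_SUBDIR = os.path.join("targets", "x86_64-linux-gnu", "lib")
LIB_NAMES = ("libnvinfer.so", "libnvinfer_plugin.so", "libmyelin.so")


def expected_libraries(trt_release):
    """Build the library paths that the exported variables point to."""
    return [os.path.join(trt_release, LIB_SUBDIR, name) for name in LIB_NAMES]


def missing_libraries(trt_release):
    """Return the expected library paths that do not exist on disk."""
    return [p for p in expected_libraries(trt_release) if not os.path.exists(p)]


# Falls back to the directory name used in this article when TRT_RELEASE is unset.
print(missing_libraries(os.environ.get("TRT_RELEASE", "TensorRT-8.0.0.3")))
```

An empty list means all three libraries were found; any path printed is one that the corresponding TENSORRT_LIBRARY_* variable would fail to resolve.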

(Optional) make downloads and builds the protobuf source, which is slow. The archive can be downloaded beforehand and the verification step skipped

Download protobuf-cpp-3.16.0.tar.gz into the /absolute_path/env directory beforehand
ln -sf $TRT_SOURCE/absolute_path/env/protobuf-cpp-3.16.0.tar.gz $TRT_SOURCE/build/third_party.protobuf/src

Edit $TRT_SOURCE/build/third_party.protobuf/src/third_party.protobuf-stamp/download-third_party.protobuf.cmake and comment out everything from if(EXISTS) to the last line of the file

# $TRT_SOURCE/build/third_party.protobuf/src/third_party.protobuf-stamp/download-third_party.protobuf.cmake
# Comment out everything from here to the end of the file
if(EXISTS "/data1/xuduo/optimize/yolact_dir_0804/TensorRT/build/third_party.protobuf/src/protobuf-cpp-3.16.0.tar.gz")
# Add return() here, or comment out this whole if/else block
	return()
	
  check_file_hash(has_hash hash_is_good)
   if(has_hash)
     if(hash_is_good)
       message(STATUS "File already exists and hash match (skip download):
   file='/data1/xuduo/optimize/yolact_dir_0804/TensorRT/build/third_party.protobuf/src/protobuf-cpp-3.16.0.tar.gz'
   =''"
   )
       return()
     else()
       message(STATUS "File already exists but hash mismatch. Removing...")
       file(REMOVE "/data1/xuduo/optimize/yolact_dir_0804/TensorRT/build/third_party.protobuf/src/protobuf-cpp-3.16.0.tar.gz")
     endif()
   else()
     message(STATUS "File already exists but no hash specified (use URL_HASH):   file='/data1/xuduo/optimize/yolact_dir_0804/TensorRT/build/third_party.protobuf/src/protobuf-cpp-3.16.0.tar.gz'
 Old file will be removed and new file downloaded from URL."
     )
     file(REMOVE "/data1/xuduo/optimize/yolact_dir_0804/TensorRT/build/third_party.protobuf/src/protobuf-cpp-3.16.0.tar.gz")
   endif()
 endif()

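4.2. Build the TensorRT plugin

Create a DCNv2Plugin folder inside the plugin folder and put the downloaded .cpp and .cu files in it to add the TensorRT plugin. Files ending in .cu are CUDA sources; the CUDA routines are invoked from the enqueue function in the .cpp file
- Plugin download page
The plugin was written for TensorRT 7.0 while the TensorRT used here is 8.0, so the following changes are needed
- In DCNv2Plugin.hpp and DCNv2Plugin.cpp, add the noexcept specifier in front of the override keyword of every overriding function. The TensorRT 8.0 sources declare these methods noexcept, so the derived classes must do the same
- In DCNv2Plugin.cpp, change nvinfer1::TensorFormat::kNCHW to nvinfer1::TensorFormat::kLINEAR; see the API documentation for details
  - TensorRT API documentation
For the purpose of each function in the header file, see the following article
- Purpose of each function in the header file

Add DCNv2Plugin to plugin/CMakeLists.txt

# plugin/CMakeLists.txt
# Add DCNv2Plugin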
set(PLUGIN_LISTS
    batchedNMSPlugin
    batchTilePlugin
    coordConvACPlugin
    cropAndResizePlugin
    detectionLayerPlugin
    efficientNMSPlugin
    flattenConcat
    generateDetectionPlugin
    gridAnchorPlugin
    groupNormalizationPlugin
    instanceNormalizationPlugin
    leakyReluPlugin
    multilevelCropAndResizePlugin
    multilevelProposeROI
    nmsPlugin
    normalizePlugin
    nvFasterRCNN
    priorBoxPlugin
    proposalLayerPlugin
    proposalPlugin
    pyramidROIAlignPlugin
    regionPlugin
    reorgPlugin
    resizeNearestPlugin
    scatterPlugin
    specialSlicePlugin
    splitPlugin
    # Added
    DCNv2Plugin
    )

Edit TensorRT/plugin/InferPlugin.cpp to initialize the DCNv2Plugin plugin

// TensorRT/plugin/InferPlugin.cpp
// Include the DCNv2Plugin header
#include "DCNv2Plugin.hpp"

extern "C"
{
    bool initLibNvInferPlugins(void* logger, const char* libNamespace)
    {
        initializePlugin<nvinfer1::plugin::BatchTilePluginCreator>(logger, libNamespace);
        initializePlugin<nvinfer1::plugin::BatchedNMSPluginCreator>(logger, libNamespace);
        initializePlugin<nvinfer1::plugin::BatchedNMSDynamicPluginCreator>(logger, libNamespace);
        initializePlugin<nvinfer1::plugin::CoordConvACPluginCreator>(logger, libNamespace);
        initializePlugin<nvinfer1::plugin::CropAndResizePluginCreator>(logger, libNamespace);
        initializePlugin<nvinfer1::plugin::CropAndResizeDynamicPluginCreator>(logger, libNamespace);
        initializePlugin<nvinfer1::plugin::DetectionLayerPluginCreator>(logger, libNamespace);
        initializePlugin<nvinfer1::plugin::EfficientNMSONNXPluginCreator>(logger, libNamespace);
        initializePlugin<nvinfer1::plugin::EfficientNMSPluginCreator>(logger, libNamespace);
        initializePlugin<nvinfer1::plugin::FlattenConcatPluginCreator>(logger, libNamespace);
        initializePlugin<nvinfer1::plugin::GenerateDetectionPluginCreator>(logger, libNamespace);
        initializePlugin<nvinfer1::plugin::GridAnchorPluginCreator>(logger, libNamespace);
        initializePlugin<nvinfer1::plugin::GridAnchorRectPluginCreator>(logger, libNamespace);
        initializePlugin<nvinfer1::plugin::InstanceNormalizationPluginCreator>(logger, libNamespace);
        initializePlugin<nvinfer1::plugin::LReluPluginCreator>(logger, libNamespace);
        initializePlugin<nvinfer1::plugin::MultilevelCropAndResizePluginCreator>(logger, libNamespace);
        initializePlugin<nvinfer1::plugin::MultilevelProposeROIPluginCreator>(logger, libNamespace);
        initializePlugin<nvinfer1::plugin::NMSPluginCreator>(logger, libNamespace);
        initializePlugin<nvinfer1::plugin::NMSDynamicPluginCreator>(logger, libNamespace);
        initializePlugin<nvinfer1::plugin::NormalizePluginCreator>(logger, libNamespace);
        initializePlugin<nvinfer1::plugin::PriorBoxPluginCreator>(logger, libNamespace);
        initializePlugin<nvinfer1::plugin::ProposalLayerPluginCreator>(logger, libNamespace);
        initializePlugin<nvinfer1::plugin::ProposalPluginCreator>(logger, libNamespace);
        initializePlugin<nvinfer1::plugin::ProposalDynamicPluginCreator>(logger, libNamespace);
        initializePlugin<nvinfer1::plugin::PyramidROIAlignPluginCreator>(logger, libNamespace);
        initializePlugin<nvinfer1::plugin::RegionPluginCreator>(logger, libNamespace);
        initializePlugin<nvinfer1::plugin::ReorgPluginCreator>(logger, libNamespace);
        initializePlugin<nvinfer1::plugin::ResizeNearestPluginCreator>(logger, libNamespace);
        initializePlugin<nvinfer1::plugin::RPROIPluginCreator>(logger, libNamespace);
        initializePlugin<nvinfer1::plugin::ScatterNDPluginCreator>(logger, libNamespace);
        initializePlugin<nvinfer1::plugin::SpecialSlicePluginCreator>(logger, libNamespace);
        initializePlugin<nvinfer1::plugin::SplitPluginCreator>(logger, libNamespace);
        // Register DCNv2PluginCreator
        initializePlugin<nvinfer1::plugin::DCNv2PluginCreator>(logger, libNamespace);
        return true;
    }
} // extern "C"

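Go to TensorRT/build and rebuild to produce the new libnvinfer_plugin.so

# Enter the build directory
cd TensorRT/build
# Delete all files, otherwise the build may fail
rm -rf *
# Re-run the build. -DTRT_LIB_DIR is the lib folder under the TensorRT installation directory and TRT_RELEASE is the TensorRT installation directory. Because TensorRT-8.0.0.3 was copied into the TensorRT source tree above, the settings from section 4.1 can be reused as they are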
cmake .. -DTRT_LIB_DIR=$TRT_RELEASE/lib -DTRT_OUT_DIR=`pwd`/out
make -j16

In /usr/local/TensorRT-8.0.0.3/TensorRT-8.0.0.3/lib, rename the original libnvinfer_plugin.so file with a _bk suffix as a backup, copy the newly built libnvinfer_plugin.so into this directory, and recreate the symlinks

# Enter the lib directory of the TensorRT installation
cd /usr/local/TensorRT-8.0.0.3/TensorRT-8.0.0.3/lib
# Back up the original libnvinfer_plugin.so by adding the _bk suffix to its name
sudo mv libnvinfer_plugin.so.8.0.0 libnvinfer_plugin.so.8.0.0_bk
# List the files
ls
# Use pwd to find the path of libnvinfer_plugin.so under build
pwd
# Copy libnvinfer_plugin.so.8.0.1 from the build folder into the TensorRT directory
sudo cp /data1/xuduo/optimize/yolact_dir_0804/TensorRT/build/libnvinfer_plugin.so.8.0.1 ./
# Copy libnvonnxparser.so.8.0.0 from the onnx-tensorrt folder into the TensorRT directory
sudo cp /data1/xuduo/optimize/yolact_dir_0804/onnx-tensorrt/build/libnvonnxparser.so.8.0.0 ./

# Remove the old symlink
sudo rm libnvinfer_plugin.so
# Create a symlink to the new libnvinfer_plugin.so.8.0.1 file
# ln -s [source file] [link name]
sudo ln -s libnvinfer_plugin.so.8.0.1 libnvinfer_plugin.so
# Remove the old symlink
sudo rm libnvinfer_plugin.so.8
sudo ln -s libnvinfer_plugin.so.8.0.1 libnvinfer_plugin.so.8
# Use ll to check that the symlinks changed
ll
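The remove-then-link sequence above leaves a short window in which the symlink does not exist. As an alternative sketch (not what the article ran), a link can be repointed in one step by creating it under a temporary name and renaming it over the old one; `repoint_symlink` is a hypothetical helper and assumes a POSIX file system:

```python
import os


def repoint_symlink(lib_dir, link_name, target_name):
    """Point lib_dir/link_name at target_name, replacing any existing link in one rename."""
    target_path = os.path.join(lib_dir, target_name)
    if not os.path.exists(target_path):
        raise FileNotFoundError(target_path)
    link_path = os.path.join(lib_dir, link_name)
    temp_path = link_path + ".tmp"
    if os.path.lexists(temp_path):
        os.remove(temp_path)            # clear a leftover from an earlier run
    os.symlink(target_name, temp_path)  # relative link, like `ln -s`
    os.replace(temp_path, link_path)    # rename over the old link
```

For the libraries in this section the call would look like `repoint_symlink("/usr/local/TensorRT-8.0.0.3/TensorRT-8.0.0.3/lib", "libnvinfer_plugin.so", "libnvinfer_plugin.so.8.0.1")`, run with sufficient privileges to write to that directory.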

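4.3. Build onnx-tensorrt

Edit builtin_op_importers.cpp in the onnx-tensorrt checkout. This file contains the importers that parse ONNX operators. Add support for the DCNv2 operator and map it to the TensorRT plugin, using the DEFINE_BUILTIN_OP_IMPORTER macro to register the importer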

DEFINE_BUILTIN_OP_IMPORTER(DCNv2)
{
    ASSERT(inputs.at(0).is_tensor(),  ErrorCode::kUNSUPPORTED_NODE); // input
    ASSERT(inputs.at(1).is_tensor(), ErrorCode::kUNSUPPORTED_NODE); // offset
    ASSERT(inputs.at(2).is_tensor(), ErrorCode::kUNSUPPORTED_NODE); // mask
    ASSERT(inputs.at(3).is_weights(), ErrorCode::kUNSUPPORTED_NODE); // weight

    nvinfer1::ITensor* InputTensors = &convertToTensor(inputs.at(0), ctx);
    nvinfer1::ITensor* Offset = &convertToTensor(inputs.at(1), ctx);
    nvinfer1::ITensor* mask = &convertToTensor(inputs.at(2), ctx);
    onnx2trt::ShapedWeights weights = inputs.at(3).weights();
    onnx2trt::ShapedWeights bias = inputs.at(4).weights();

    std::vector<nvinfer1::ITensor*> tensors {&inputs.at(0).tensor(), &inputs.at(1).tensor(),&inputs.at(2).tensor()};
    // Read the attributes of the current ONNX node; they correspond to the suffixed keyword arguments on the Python side (info_s, kernel_size_i, eps_f, ...)
    int out_channel,in_channel,kernel_H,kernel_W,deformable_group,dilation,groups,padding,stride;
    out_channel = weights.shape.d[0];
    in_channel = weights.shape.d[1];
    kernel_H = weights.shape.d[2];
    kernel_W = weights.shape.d[3];

    OnnxAttrs attrs(node, ctx);
    dilation = attrs.get<int>("dilation",1);
    padding = attrs.get<int>("padding",1);
    stride = attrs.get<int>("stride",1);
    groups = attrs.get("groups", 1);
    deformable_group = attrs.get<int>("deformable_groups",1);
    std::string name = "DCNv2";
    std::vector<nvinfer1::PluginField> f;
    f.emplace_back("in_channel", &in_channel, nvinfer1::PluginFieldType::kINT32, 1);
    f.emplace_back("out_channel", &out_channel, nvinfer1::PluginFieldType::kINT32, 1);
    f.emplace_back("kernel_h", &kernel_H, nvinfer1::PluginFieldType::kINT32, 1);
    f.emplace_back("kernel_w", &kernel_W, nvinfer1::PluginFieldType::kINT32, 1);
    f.emplace_back("deformable_group", &deformable_group, nvinfer1::PluginFieldType::kINT32, 1);
    f.emplace_back("groups", &groups, nvinfer1::PluginFieldType::kINT32, 1);
    f.emplace_back("padding", &padding, nvinfer1::PluginFieldType::kINT32, 1);
    f.emplace_back("stride", &stride, nvinfer1::PluginFieldType::kINT32, 1);
    f.emplace_back("dilation", &dilation, nvinfer1::PluginFieldType::kINT32, 1);
    f.emplace_back("weight", weights.values, nvinfer1::PluginFieldType::kFLOAT32, weights.count());
    f.emplace_back("bias", bias.values, nvinfer1::PluginFieldType::kFLOAT32, bias.count());

    nvinfer1::PluginFieldCollection fc;
    fc.nbFields = f.size();
    fc.fields = f.data();
    const auto mPluginRegistry = getPluginRegistry();
    const auto pluginCreator = mPluginRegistry->getPluginCreator(name.c_str(), "1");

    ASSERT(pluginCreator != nullptr, ErrorCode::kINVALID_VALUE);
     nvinfer1::IPluginV2* plugin = pluginCreator->createPlugin(node.name().c_str(),&fc);
    if(plugin == nullptr){
        printf("%s DCNv2 was not found in the plugin registry!", name.c_str());
        ASSERT(false, ErrorCode::kUNSUPPORTED_NODE);
    }
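    // Pass all three input tensors (input, offset, mask) to the plugin layer
    RETURN_FIRST_OUTPUT(ctx->network()->addPluginV2(tensors.data(), static_cast<int>(tensors.size()), *plugin));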

}

Re-run the build commands to compile and install onnx-tensorrt. The install step writes the .so files to /usr/local/lib

# Enter the onnx-tensorrt directory used during installation
cd onnx-tensorrt/build
# Delete all files
rm -rf *
# Rebuild and install
# The path given to TENSORRT_ROOT (path_to_trt) is the TensorRT installation directory, i.e. the directory TensorRT was extracted into. If the build reports that third_party/onnx is missing, copy the onnx checkout from above into third_party with cp and build again
# cmake .. -DProtobuf_INCLUDE_DIR=/data1/xuduo/optimize/yolact_dir_0804/protobuf/src -DTENSORRT_ROOT=/usr/local/TensorRT-8.0.0.3/TensorRT-8.0.0.3
cmake .. -DTENSORRT_ROOT=/usr/local/TensorRT-8.0.0.3/TensorRT-8.0.0.3 && make -j16
# Ensure that you update your LD_LIBRARY_PATH to pick up the location of the newly built library:
sudo make install

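Use Python code to convert the ONNX model into a trt file (serialization). Doing this in Python is convenient for debugging, and with the log level set to VERBOSE the output is very detailed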

import os
import torch
import tensorrt as trt
import sys

TRT_LOGGER = trt.Logger(trt.Logger.VERBOSE)
EXPLICIT_BATCH = 1 << (int)(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)

def GiB(val):
    return val * 1 << 30

def conver_engine(onnx_file_path, engine_file_path="", max_batch_size=1):
    """Parse an ONNX file, build a TensorRT engine from it and save the serialized engine."""
    with trt.Builder(TRT_LOGGER) as builder, builder.create_network(EXPLICIT_BATCH) as network, trt.OnnxParser(network, TRT_LOGGER) as parser:
        # builder.max_workspace_size = GiB(max_batch_size)
        builder.max_batch_size = max_batch_size
        # Parse model file
        if not os.path.exists(onnx_file_path):
            print('ONNX file {} not found, please run onnx2trt_yolact.py first to generate it.'.format(onnx_file_path))
            exit(0)
        print('Loading ONNX file from path {}...'.format(onnx_file_path))
        with open(onnx_file_path, 'rb') as model:
            print('Beginning ONNX file parsing')
            if not parser.parse(model.read()):
                print ('ERROR: Failed to parse the ONNX file.')
                for error in range(parser.num_errors):
                    print (parser.get_error(error))
                return None
        print('Completed parsing of ONNX file')
        print('Building an engine from file {}; this may take a while...'.format(onnx_file_path))
        engine_config = builder.create_builder_config()
        engine = builder.build_engine(network,engine_config)
        if engine is None:
            # build_engine returns None on failure; the VERBOSE log shows the cause
            print('ERROR: Failed to build the engine.')
            return None
        print("Completed creating Engine")
        with open(engine_file_path, "wb") as f:
            f.write(engine.serialize())
        print("Completed writing Engine. Well done!")

if __name__ == "__main__":
    onnx_file_path = 'yolact.onnx'
    engine_file_path = "yolact.trt"
    conver_engine(onnx_file_path, engine_file_path)

4.4. Run inference with the trt model

Once the trt conversion is complete, write the inference code to obtain the inference results

import os
import torch
import tensorrt as trt
from PIL import Image
import numpy as np
import time
import cv2
import glob
import config as cfg
import torch.nn.functional as F
import sys
sys.path.insert(1, os.path.join(sys.path[0], ".."))
import common
print("sys.path[0]",sys.path[0])

TRT_LOGGER = trt.Logger(trt.Logger.VERBOSE)
# Initialize the plugin library so that the custom plugin is registered in the plugin registry
trt.init_libnvinfer_plugins(TRT_LOGGER, '')
EXPLICIT_BATCH = 1 << (int)(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)

def GiB(val):
    return val * 1 << 30

def preprocess_image(path):
    # img [h,w,c]
    image = cv2.imread(path)
    img_raw_data = cv2.imencode('.jpg', image)[1].tobytes()
    img_data = cv2.imdecode(np.asarray(bytearray(img_raw_data), dtype=np.uint8),
                       cv2.IMREAD_COLOR)
    frame = torch.from_numpy(img_data).cuda().float()
    # print(frame.size)
    batch = FastBaseTransform()(frame.unsqueeze(0))
    return batch

class FastBaseTransform(torch.nn.Module):
    """
    Transform that does all operations on the GPU for super speed.
    This doesn't support a lot of config settings and should only be used for production.
    Maintain this as necessary.
    """

    def __init__(self):
        super().__init__()

        self.mean = torch.Tensor(cfg.MEANS).float().cuda()[None, :, None, None]
        self.std = torch.Tensor(cfg.STD).float().cuda()[None, :, None, None]
        self.transform = cfg.resnet_transform

    def forward(self, img):
        self.mean = self.mean.to(img.device)
        self.std = self.std.to(img.device)

        # img assumed to be a pytorch BGR image with channel order [n, h, w, c]

        img_size = (cfg.max_size, cfg.max_size)
        # Convert the image to [n,c,h,w] layout
        img = img.permute(0, 3, 1, 2).contiguous()
        img = F.interpolate(img, img_size, mode='bilinear', align_corners=False)

        if self.transform.normalize:
            img = (img - self.mean) / self.std
        elif self.transform.subtract_means:
            img = (img - self.mean)
        elif self.transform.to_float:
            img = img / 255

        if self.transform.channel_order != 'RGB':
            raise NotImplementedError

        img = img[:, (2, 1, 0), :, :].contiguous()

        # Return value is in channel order [n, c, h, w] and RGB
        return img

if __name__ == "__main__":
    onnx_file_path = 'yolact.onnx'
    engine_file_path = "yolact.trt"
    image_name = "/data1/xuduo/optimize/yolact_dir_0804/img/material_WholeBookQuestionData_7H1110B44850N_QuestionBookImage20210713083208_164_586_7H1110B44850N.jpg"
    if not os.path.exists(engine_file_path):
        print("no engine file")
        # conver_engine(onnx_file_path, engine_file_path)
        sys.exit(1)
    print(f"Reading engine from file {engine_file_path}")
    runtime = trt.Runtime(TRT_LOGGER)
    # Close the file as soon as the engine has been deserialized
    with open(engine_file_path, "rb") as f:
        engine = runtime.deserialize_cuda_engine(f.read())

    # Allocate buffers and create a CUDA stream.
    inputs, outputs, bindings, stream = common.allocate_buffers(engine)
    # Contexts are used to perform inference.
    context = engine.create_execution_context()
    batch = preprocess_image(image_name)
    np.copyto(inputs[0].host, batch.cpu().numpy().ravel())
    trt_outputs = common.do_inference_v2(context, bindings=bindings, inputs=inputs, outputs=outputs, stream=stream)
    print("Predictions:", trt_outputs)
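The host and device buffers allocated for each binding have to hold the whole flattened tensor, which is why the script copies `batch.cpu().numpy().ravel()` into `inputs[0].host`. As a rough sketch of the sizing arithmetic, assuming the 1 x 3 x 550 x 550 float32 input used when exporting the model (the helper names and the dtype table are illustrative, not the TensorRT API):

```python
from functools import reduce
from operator import mul

# Bytes per element for a few common tensor dtypes.
DTYPE_BYTES = {"float32": 4, "float16": 2, "int32": 4, "int8": 1}


def volume(shape):
    """Number of elements in a tensor of the given shape."""
    return reduce(mul, shape, 1)


def buffer_bytes(shape, dtype="float32"):
    """Bytes needed to hold a tensor of this shape and dtype."""
    return volume(shape) * DTYPE_BYTES[dtype]


input_shape = (1, 3, 550, 550)
print(volume(input_shape), buffer_bytes(input_shape))  # 907500 3630000
```

If the flattened batch does not have exactly `volume(input_shape)` elements, `np.copyto` raises a broadcasting error, which is a quick way to spot a mismatch between the preprocessing and the engine's input shape.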

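This project ultimately stalled at the inference debugging stage, and the DCNv2 part of the code still needs debugging. It is no longer being followed up for project reasons, but I am still interested in this optimization. My goal for this year is to get this code running end to end and to add some optimization strategies of my own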
