The TensorRT workflow here needs an OpenCV build environment present on the machine, so the first step is compiling OpenCV.
I. Compiling OpenCV
1. Install dependencies
sudo apt-get install cmake
sudo apt-get install build-essential libgtk2.0-dev libavcodec-dev libavformat-dev libjpeg-dev libswscale-dev libtiff5-dev
2. Download the version you need
After unpacking, put the source wherever you like; create a build folder inside the opencv-4.5.0 directory, enter it, and open a terminal:
unzip opencv-4.5.0.zip
cd opencv-4.5.0/
mkdir build
cd build/
3. Build with CMake
3.1 Install cmake and the dependency libraries
sudo apt-get install cmake
$ sudo apt-get install build-essential libgtk2.0-dev libavcodec-dev libavformat-dev libjpeg-dev libtiff5-dev libswscale-dev libjasper1 libjasper-dev
Error: E: Unable to locate package libjasper-dev. On Ubuntu 18.04 this package has to be pulled in from the xenial-security repository:
sudo add-apt-repository "deb http://security.ubuntu.com/ubuntu xenial-security main"
sudo apt update
sudo apt install libjasper1 libjasper-dev
That solved the problem; libjasper1 is a dependency of libjasper-dev.
3.2 Configure with CMake
$ cmake ..
# or, with the defaults spelled out explicitly:
$ cmake -D CMAKE_BUILD_TYPE=Release -D CMAKE_INSTALL_PREFIX=/usr/local ..
Tacking on the long option lists you find online invites all sorts of errors; this plain configure has always worked smoothly for me with none.
Wait for configuration to finish; if the final output shows no errors, you are halfway there.
3.3 Compile
$ sudo make    # or: sudo make -j4 to compile on 4 cores
Compilation takes a while here, so be patient.
3.4 Install
$ sudo make install
The corresponding uninstall, run from the same build directory, is:
$ sudo make uninstall
Test the OpenCV version:
opencv_version, or pkg-config --modversion opencv4 (OpenCV 4 names its pkg-config module opencv4, and only generates the .pc file when configured with OPENCV_GENERATE_PKGCONFIG=ON)
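If the Python bindings were also built (OpenCV enables them by default when the Python headers and numpy are found at configure time; check your configure summary, since the plain cmake .. above does not guarantee it), a minimal sanity check from Python looks like this:

import cv2
print(cv2.__version__)                    # expect 4.5.0 for this build
print(cv2.getBuildInformation()[:300])    # first lines of the build summary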
With that, the OpenCV source build is complete.
II. Downloading and installing TensorRT
1. Match your CUDA, cuDNN, and system versions
$ nvcc -V    # check the CUDA version
$ cat /usr/local/cuda/include/cudnn.h | grep CUDNN_MAJOR -A 2    # check the cuDNN version (on cuDNN 8+ the version macros moved to cudnn_version.h, so grep that file instead)
$ cat /etc/issue    # check the Ubuntu version
My setup, for reference, is Ubuntu 18.04 with CUDA 11.1 and cuDNN 8.0, which is exactly what the TensorRT package name below encodes.
2. Download
TensorRT download page: https://developer.nvidia.com/tensorrt
Click Download to get to the release list.
In general, pick a release with a relatively large user base.
Note: the download must match your CUDA and operating system versions exactly. Do not go higher or lower than your environment, or you will question your life choices (the first time I installed it I did not look carefully and chased errors forever).
3. Install TensorRT
(1) Unpack the downloaded archive:
$ tar zxvf TensorRT-7.2.2.3.Ubuntu-18.04.x86_64-gnu.cuda-11.1.cudnn8.0.tar.gz
(2) Environment configuration:
Unpacking yields the TensorRT folder; add the absolute path of its lib directory to the environment:
vim ~/.bashrc
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/home/software/TensorRT-7.2.2.3/lib
Reload the configuration: source ~/.bashrc
(3) Install the TensorRT Python package
$ cd TensorRT-7.2.2.3/python/
$ pip install tensorrt-7.2.2.3-cp37-none-linux_x86_64.whl
Tested personally: your Python environment is not locked down here; the tarball ships wheels for py35, py36 and py37, and installing the one that matches your own Python version is the most reliable choice.
# Install UFF, for converting TensorFlow models
$ cd TensorRT-7.2.2.3/uff/
$ pip install uff-0.6.9-py2.py3-none-any.whl
# Install graphsurgeon, for custom network structures
$ cd TensorRT-7.2.2.3/graphsurgeon
$ pip install graphsurgeon-0.4.5-py2.py3-none-any.whl
With that, TensorRT is fully installed. Launch python and try importing these modules; if import tensorrt succeeds, the installation worked.
PS: import uff requires the tensorflow module to be installed first.
pip install tensorflow-gpu==2.4.0
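A quick import check for the three wheels installed above; run it after source ~/.bashrc so LD_LIBRARY_PATH already points at TensorRT's lib directory:

import tensorrt as trt
print(trt.__version__)    # expect 7.2.2.3
import graphsurgeon
import uff                # only importable once tensorflow is installed
print("all TensorRT python modules imported OK")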
(4) Install PyCUDA
PyCUDA is the Python binding for NVIDIA CUDA; it maps the full CUDA API into Python.
Install:
pip3 install pycuda==2021.1
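A minimal PyCUDA sanity check, to confirm that the driver and a GPU are reachable from Python:

import pycuda.autoinit             # creates and activates a context on GPU 0
import pycuda.driver as cuda
dev = cuda.Device(0)
print(dev.name(), "compute capability:", dev.compute_capability())
print("free/total GPU memory:", cuda.mem_get_info())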
III. Deploying yolov5s (v5.0) with TensorRT
Reference: https://blog.csdn.net/xingtianyao/article/details/111353568
The end result is the default FP16 engine deployment of yolov5; running yolov5_trt.py shows the effect, and with a small modification it can run on video.
Modified code example:
"""
An example that uses TensorRT's Python api to make inferences.
"""
import ctypes
import os
import shutil
import random
import sys
import threading
import time
import cv2
import numpy as np
import pycuda.autoinit
import pycuda.driver as cuda
import tensorrt as trt
import torch
import torchvision
import argparse
CONF_THRESH = 0.5
IOU_THRESHOLD = 0.4
def get_img_path_batches(batch_size, img_dir):
ret = []
batch = []
for root, dirs, files in os.walk(img_dir):
for name in files:
if len(batch) == batch_size:
ret.append(batch)
batch = []
batch.append(os.path.join(root, name))
if len(batch) > 0:
ret.append(batch)
return ret
def plot_one_box(x, img, color=None, label=None, line_thickness=None):
"""
    description: Plots one bounding box on image img,
                 this function comes from the YOLOv5 project.
    param:
        x:      a box like [x1,y1,x2,y2]
        img:    an OpenCV image object
color: color to draw rectangle, such as (0,255,0)
label: str
line_thickness: int
return:
no return
"""
tl = (
line_thickness or round(0.002 * (img.shape[0] + img.shape[1]) / 2) + 1
) # line/font thickness
color = color or [random.randint(0, 255) for _ in range(3)]
c1, c2 = (int(x[0]), int(x[1])), (int(x[2]), int(x[3]))
cv2.rectangle(img, c1, c2, color, thickness=tl, lineType=cv2.LINE_AA)
if label:
tf = max(tl - 1, 1) # font thickness
t_size = cv2.getTextSize(label, 0, fontScale=tl / 3, thickness=tf)[0]
c2 = c1[0] + t_size[0], c1[1] - t_size[1] - 3
cv2.rectangle(img, c1, c2, color, -1, cv2.LINE_AA) # filled
cv2.putText(
img,
label,
(c1[0], c1[1] - 2),
0,
tl / 3,
[225, 255, 255],
thickness=tf,
lineType=cv2.LINE_AA,
)
class YoLov5TRT(object):
"""
    description: A YOLOv5 class that wraps TensorRT ops, preprocess and postprocess ops.
"""
def __init__(self, engine_file_path):
# Create a Context on this device,
self.ctx = cuda.Device(0).make_context()
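        # this context is pushed/popped around each infer() call (see below),
        # which lets the same wrapper be driven safely from worker threads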
stream = cuda.Stream()
TRT_LOGGER = trt.Logger(trt.Logger.INFO)
runtime = trt.Runtime(TRT_LOGGER)
# Deserialize the engine from file
with open(engine_file_path, "rb") as f:
engine = runtime.deserialize_cuda_engine(f.read())
context = engine.create_execution_context()
host_inputs = []
cuda_inputs = []
host_outputs = []
cuda_outputs = []
bindings = []
for binding in engine:
            print('binding:', binding, engine.get_binding_shape(binding))
size = trt.volume(engine.get_binding_shape(binding)) * engine.max_batch_size
dtype = trt.nptype(engine.get_binding_dtype(binding))
# Allocate host and device buffers
host_mem = cuda.pagelocked_empty(size, dtype)
cuda_mem = cuda.mem_alloc(host_mem.nbytes)
# Append the device buffer to device bindings.
bindings.append(int(cuda_mem))
# Append to the appropriate list.
if engine.binding_is_input(binding):
self.input_w = engine.get_binding_shape(binding)[-1]
self.input_h = engine.get_binding_shape(binding)[-2]
host_inputs.append(host_mem)
cuda_inputs.append(cuda_mem)
else:
host_outputs.append(host_mem)
cuda_outputs.append(cuda_mem)
# Store
self.stream = stream
self.context = context
self.engine = engine
self.host_inputs = host_inputs
self.cuda_inputs = cuda_inputs
self.host_outputs = host_outputs
self.cuda_outputs = cuda_outputs
self.bindings = bindings
self.batch_size = engine.max_batch_size
def infer(self, input_image_path):
# Make self the active context, pushing it on top of the context stack.
self.ctx.push()
self.input_image_path = input_image_path
# Restore
stream = self.stream
context = self.context
engine = self.engine
host_inputs = self.host_inputs
cuda_inputs = self.cuda_inputs
host_outputs = self.host_outputs
cuda_outputs = self.cuda_outputs
bindings = self.bindings
# Do image preprocess
batch_image_raw = []
batch_origin_h = []
batch_origin_w = []
batch_input_image = np.empty(shape=[self.batch_size, 3, self.input_h, self.input_w])
        input_image, image_raw, origin_h, origin_w = self.preprocess_image(input_image_path)
batch_origin_h.append(origin_h)
batch_origin_w.append(origin_w)
np.copyto(batch_input_image, input_image)
batch_input_image = np.ascontiguousarray(batch_input_image)
# Copy input image to host buffer
np.copyto(host_inputs[0], batch_input_image.ravel())
start = time.time()
# Transfer input data to the GPU.
cuda.memcpy_htod_async(cuda_inputs[0], host_inputs[0], stream)
# Run inference.
context.execute_async(batch_size=self.batch_size, bindings=bindings, stream_handle=stream.handle)
# Transfer predictions back from the GPU.
cuda.memcpy_dtoh_async(host_outputs[0], cuda_outputs[0], stream)
# Synchronize the stream
stream.synchronize()
end = time.time()
# Remove any context from the top of the context stack, deactivating it.
self.ctx.pop()
# Here we use the first row of output in that batch_size = 1
output = host_outputs[0]
# Do postprocess
result_boxes, result_scores, result_classid = self.post_process(
output, origin_h, origin_w)
# Draw rectangles and labels on the original image
for j in range(len(result_boxes)):
box = result_boxes[j]
plot_one_box(
box,
image_raw,
label="{}:{:.2f}".format(
categories[int(result_classid[j])], result_scores[j]
),
)
return image_raw, end - start
def destroy(self):
# Remove any context from the top of the context stack, deactivating it.
self.ctx.pop()
def get_raw_image(self, image_path_batch):
"""
        description: Read an image from each path in the batch (a generator)
"""
for img_path in image_path_batch:
yield cv2.imread(img_path)
def get_raw_image_zeros(self, image_path_batch=None):
"""
description: Ready data for warmup
"""
for _ in range(self.batch_size):
yield np.zeros([self.input_h, self.input_w, 3], dtype=np.uint8)
def preprocess_image(self, input_image_path):
"""
description: Convert BGR image to RGB,
resize and pad it to target size, normalize to [0,1],
transform to NCHW format.
param:
            input_image_path: in this modified version, a decoded BGR image (numpy array) rather than a path string
return:
image: the processed image
image_raw: the original image
h: original height
w: original width
"""
image_raw = input_image_path
h, w, c = image_raw.shape
image = cv2.cvtColor(image_raw, cv2.COLOR_BGR2RGB)
        # Calculate width and height and paddings
r_w = self.input_w / w
r_h = self.input_h / h
if r_h > r_w:
tw = self.input_w
th = int(r_w * h)
tx1 = tx2 = 0
ty1 = int((self.input_h - th) / 2)
ty2 = self.input_h - th - ty1
else:
tw = int(r_h * w)
th = self.input_h
tx1 = int((self.input_w - tw) / 2)
tx2 = self.input_w - tw - tx1
ty1 = ty2 = 0
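        # worked example (illustrative numbers, not from the original post):
        # with a 1280x720 frame and a 640x640 input, r_w = 0.5 and r_h ~= 0.889,
        # so the if-branch keeps width at 640, scales height to 360, and pads
        # (640 - 360) / 2 = 140 gray rows on both top and bottom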
# Resize the image with long side while maintaining ratio
image = cv2.resize(image, (tw, th))
# Pad the short side with (128,128,128)
        image = cv2.copyMakeBorder(
            image, ty1, ty2, tx1, tx2, cv2.BORDER_CONSTANT, value=(128, 128, 128)
        )
image = image.astype(np.float32)
# Normalize to [0,1]
image /= 255.0
# HWC to CHW format:
image = np.transpose(image, [2, 0, 1])
# CHW to NCHW format
image = np.expand_dims(image, axis=0)
# Convert the image to row-major order, also known as "C order":
image = np.ascontiguousarray(image)
return image, image_raw, h, w
def xywh2xyxy(self, origin_h, origin_w, x):
"""
description: Convert nx4 boxes from [x, y, w, h] to [x1, y1, x2, y2] where xy1=top-left, xy2=bottom-right
param:
origin_h: height of original image
origin_w: width of original image
x: A boxes tensor, each row is a box [center_x, center_y, w, h]
return:
y: A boxes tensor, each row is a box [x1, y1, x2, y2]
"""
y = torch.zeros_like(x) if isinstance(x, torch.Tensor) else np.zeros_like(x)
r_w = self.input_w / origin_w
r_h = self.input_h / origin_h
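        # subtracting the offsets below undoes the letterbox padding added in
        # preprocess_image; dividing by the resize ratio then maps the box back
        # to pixel coordinates in the original image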
if r_h > r_w:
y[:, 0] = x[:, 0] - x[:, 2] / 2
y[:, 2] = x[:, 0] + x[:, 2] / 2
y[:, 1] = x[:, 1] - x[:, 3] / 2 - (self.input_h - r_w * origin_h) / 2
y[:, 3] = x[:, 1] + x[:, 3] / 2 - (self.input_h - r_w * origin_h) / 2
y /= r_w
else:
y[:, 0] = x[:, 0] - x[:, 2] / 2 - (self.input_w - r_h * origin_w) / 2
y[:, 2] = x[:, 0] + x[:, 2] / 2 - (self.input_w - r_h * origin_w) / 2
y[:, 1] = x[:, 1] - x[:, 3] / 2
y[:, 3] = x[:, 1] + x[:, 3] / 2
y /= r_h
return y
def post_process(self, output, origin_h, origin_w):
"""
description: postprocess the prediction
param:
            output: a tensor like [num_boxes, cx, cy, w, h, conf, cls_id, cx, cy, w, h, conf, cls_id, ...]
origin_h: height of original image
origin_w: width of original image
return:
            result_boxes: final boxes, a boxes tensor; each row is a box [x1, y1, x2, y2]
            result_scores: final scores, a tensor; each element is the score corresponding to its box
            result_classid: final class ids, a tensor; each element is the class id corresponding to its box
"""
# Get the num of boxes detected
num = int(output[0])
        # Reshape to a two-dimensional ndarray
pred = np.reshape(output[1:], (-1, 6))[:num, :]
# to a torch Tensor
pred = torch.Tensor(pred).cuda()
# Get the boxes
boxes = pred[:, :4]
# Get the scores
scores = pred[:, 4]
# Get the classid
classid = pred[:, 5]
# Choose those boxes that score > CONF_THRESH
si = scores > CONF_THRESH
boxes = boxes[si, :]
scores = scores[si]
classid = classid[si]
        # Transform bbox from [center_x, center_y, w, h] to [x1, y1, x2, y2]
boxes = self.xywh2xyxy(origin_h, origin_w, boxes)
# Do nms
indices = torchvision.ops.nms(boxes, scores, iou_threshold=IOU_THRESHOLD).cpu()
result_boxes = boxes[indices, :].cpu()
result_scores = scores[indices].cpu()
result_classid = classid[indices].cpu()
return result_boxes, result_scores, result_classid
class inferThread(threading.Thread):
def __init__(self, yolov5_wrapper):
threading.Thread.__init__(self)
self.yolov5_wrapper = yolov5_wrapper
def infer(self , frame):
batch_image_raw, use_time = self.yolov5_wrapper.infer(frame)
# for i, img_path in enumerate(self.image_path_batch):
# parent, filename = os.path.split(img_path)
# save_name = os.path.join('output', filename)
# # Save image
# cv2.imwrite(save_name, batch_image_raw[i])
# print('input->{}, time->{:.2f}ms, saving into output/'.format(self.image_path_batch, use_time * 1000))
return batch_image_raw,use_time*1000
class warmUpThread(threading.Thread):
def __init__(self, yolov5_wrapper):
threading.Thread.__init__(self)
self.yolov5_wrapper = yolov5_wrapper
    def run(self):
        # NOTE: in this modified script infer() takes a single frame, so this
        # warm-up class is a leftover from the original batch version and is
        # never started in the main block below
        batch_image_raw, use_time = self.yolov5_wrapper.infer(self.yolov5_wrapper.get_raw_image_zeros())
        print('warm_up->{}, time->{:.2f}ms'.format(batch_image_raw[0].shape, use_time * 1000))
if __name__ == "__main__":
# load custom plugins
PLUGIN_LIBRARY = "/home/module/TR/yolov5-5.0/yolov5/build/libmyplugins.so"
engine_file_path = "/home/module/TR/yolov5-5.0/yolov5/build/yolov5s.engine"
if len(sys.argv) > 1:
engine_file_path = sys.argv[1]
if len(sys.argv) > 2:
PLUGIN_LIBRARY = sys.argv[2]
ctypes.CDLL(PLUGIN_LIBRARY)
    # load labels (a custom 3-class list here; the full COCO list is kept commented out below)
categories = ["bus", "truck", "car", ]
# categories = ["person", "bicycle", "car", "motorcycle", "airplane", "bus", "train", "truck", "boat", "traffic light",
# "fire hydrant", "stop sign", "parking meter", "bench", "bird", "cat", "dog", "horse", "sheep", "cow",
# "elephant", "bear", "zebra", "giraffe", "backpack", "umbrella", "handbag", "tie", "suitcase", "frisbee",
# "skis", "snowboard", "sports ball", "kite", "baseball bat", "baseball glove", "skateboard", "surfboard",
# "tennis racket", "bottle", "wine glass", "cup", "fork", "knife", "spoon", "bowl", "banana", "apple",
# "sandwich", "orange", "broccoli", "carrot", "hot dog", "pizza", "donut", "cake", "chair", "couch",
# "potted plant", "bed", "dining table", "toilet", "tv", "laptop", "mouse", "remote", "keyboard", "cell phone",
# "microwave", "oven", "toaster", "sink", "refrigerator", "book", "clock", "vase", "scissors", "teddy bear",
# "hair drier", "toothbrush"]
# a YoLov5TRT instance
yolov5_wrapper = YoLov5TRT(engine_file_path)
url="/opt/nvidia/deepstream/deepstream-5.1/samples/streams/test6.mp4"
cap = cv2.VideoCapture(url)
try:
        # inferThread does not override run(), so start()/join() here are
        # effectively a no-op; infer() is called directly in the loop below
        thread1 = inferThread(yolov5_wrapper)
        thread1.start()
        thread1.join()
        while 1:
            st = time.time()
            ret, frame = cap.read()
            if not ret:  # end of stream or failed read
                break
            img, t = thread1.infer(frame)
            cv2.imshow("result", img)
            print("time->{:.2f}ms".format(t))
            if cv2.waitKey(1) & 0xFF == ord('q'):  # 1 millisecond
                break
            if time.time() - st > 0:
                print("====fps", 1 / (time.time() - st))
finally:
# destroy the instance
        cap.release()             # release the capture device/file
        cv2.destroyAllWindows()   # close the display windows and free their resources
        yolov5_wrapper.destroy()  # pop the CUDA context
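To run the modified script, pass your own engine and plugin paths on the command line (the two paths hard-coded above are from my machine); for example:
$ python yolov5_trt.py /path/to/yolov5s.engine /path/to/libmyplugins.so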
IV. INT8 quantized deployment with TensorRT
Many readers ask how to push the frame rate and detection speed further, and the usual lever is deploying an INT8 model. The INT8 quantization recipes floating around online are all over the place, though, and it is hard to know where to start, so here I record the simplest workflow that survived my own trip through the pitfalls. The final detection quality was not great, probably because my calibration dataset was small, but the process itself is sound; if you hit problems during development, email me and I will reply when I see it.
Note: before this step, make absolutely sure the previous section's TensorRT .wts-to-.engine conversion succeeded. That confirms your environment is healthy, and then the INT8 quantization becomes very simple.
1. Open yolov5.cpp
The default is USE_FP16; all you have to do is manually replace USE_FP16 with USE_INT8.
2. Point INT8 calibration at your images
Step 1: create the build directory, then create a coco_calib folder inside build; for the exact folder name, check line 109 of your yolov5.cpp.
Step 2: copy your dataset images into coco_calib; these are the images used for calibration (a small helper sketch for this step appears after the commands below).
Step 3: compile and run the calibration:
cmake ..
make
./yolov5 -s yolov5s.wts yolov5s.engine s
In the end this generates an int8calib.table and a yolov5s.engine; swap the new engine into yolov5_trt.py and test it, and this post is complete.
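For step 2 above, a minimal helper to copy a random sample of images into coco_calib might look like this (the paths, sample size, and extensions are illustrative assumptions; adjust them to your dataset):

import os, random, shutil

src_dir = "/path/to/your/dataset/images"   # assumption: your training images
dst_dir = "build/coco_calib"               # must match the name in yolov5.cpp
os.makedirs(dst_dir, exist_ok=True)
imgs = [f for f in os.listdir(src_dir) if f.lower().endswith((".jpg", ".png"))]
for name in random.sample(imgs, min(1000, len(imgs))):
    shutil.copy(os.path.join(src_dir, name), os.path.join(dst_dir, name))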
If you find anything I wrote poorly, or any mistakes, please point them out and I will improve it.