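This article explains in detail how to use the TensorRT Python API to build an MLP network and run inference, to help beginners like me get started with the Python version more quickly. It covers: a brief introduction, configuring TensorRT on Linux, the steps for building the MLP network with detailed explanations, and the original and adapted code.
The companion article covering the C++ API is: https://www.cnblogs.com/tangjunjun/p/16127634.html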
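I. Brief introduction
TensorRT is an acceleration package that NVIDIA built for its own platform. It can be thought of as a deep learning framework that only performs forward propagation. It parses network models from frameworks such as Caffe and TensorFlow, maps each layer to the corresponding TensorRT layer, and so converts models from other frameworks into a single TensorRT representation. TensorRT then applies optimizations specific to NVIDIA GPUs and accelerates deployment. According to the official documentation, TensorRT can deliver speedups of 10x or even 100x over CPU or GPU execution, although the actual gain depends on the model and the hardware.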
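TensorRT does two main things to speed up model execution:
- TensorRT supports INT8 and FP16 computation. Deep learning networks are usually trained with 32-bit or 16-bit data. At inference time, TensorRT can use lower precision than that to speed up inference.
- TensorRT restructures the network, fusing operations that can be merged, and optimizes for the characteristics of the GPU.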
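To make the reduced-precision idea concrete, here is a minimal sketch in plain Python. It is not TensorRT's calibration algorithm: it assumes a simple symmetric max-abs scaling for INT8 and uses the standard library's half-precision `struct` format for FP16. The function names `to_fp16`, `quantize_int8` and `dequantize_int8` are hypothetical, and the weights are made up for illustration.

```python
import struct


def to_fp16(value):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", value))[0]


def quantize_int8(values):
    """Map floats to integers in [-127, 127] with one shared scale.

    Assumption: symmetric max-abs scaling, chosen for illustration only.
    """
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale


def dequantize_int8(quantized, scale):
    """Recover approximate floats from the INT8 codes."""
    return [q * scale for q in quantized]


# Made-up weights, not taken from any real model.
weights = [0.1, -0.5, 0.25, 1.0]
fp16_weights = [to_fp16(w) for w in weights]
codes, scale = quantize_int8(weights)
restored = dequantize_int8(codes, scale)

print("FP16:", fp16_weights)
print("INT8 codes:", codes, "scale:", scale)
print("restored:", restored)
```

Each restored value differs from the original by at most about half the scale, which is the precision traded away for faster arithmetic and smaller memory traffic.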
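II. Configuring the TensorRT environment on Linux / using the environment in PyCharm
This section briefly describes the environment configuration for Linux and PyCharm. It assumes the TensorRT libraries have already been placed in the corresponding CUDA folders: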
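① Download the TensorRT release that matches your CUDA version from the official site: https://developer.nvidia.com/nvidia-tensorrt-8x-download;
② Add the library path to the environment:
Run: vim ~/.bashrc
Add: export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/home/soft/TensorRT-8.2.1.8/lib
Run: source ~/.bashrc
③ Install the required packages in the virtual environment
Install pycuda: pip install pycuda
Install tensorrt: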
cd /home/soft/TensorRT-8.2.1.8/python
pip install tensorrt-8.2.1.8**omitted****.whl  # the wheel ships with the downloaded TensorRT package
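The python directory of the TensorRT package typically contains one wheel per supported Python version, distinguished by a CPython tag such as cp38 in the file name. The sketch below prints the tag of the interpreter in the active virtual environment, which can help pick the matching wheel. The helper name `cpython_tag` is hypothetical.

```python
import sys


def cpython_tag(version_info=sys.version_info):
    """Return the CPython wheel tag (for example 'cp38') for a Python version."""
    return f"cp{version_info[0]}{version_info[1]}"


print(f"Look for a wheel whose file name contains: {cpython_tag()}")
```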
④ If PyCharm reports errors such as libnvonnxparser.so.8 not being found, go to
Run -> Edit Configurations -> Environment variables and enter: LD_LIBRARY_PATH=/home/soft/TensorRT-8.2.1.8/lib
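A quick way to confirm that the process really sees the TensorRT library directory is to inspect `LD_LIBRARY_PATH` from Python. This is a minimal sketch; the helper name `dir_on_library_path` is hypothetical, and the directory is the one used in the steps above.

```python
import os


def dir_on_library_path(lib_dir, env=None):
    """Return True if lib_dir is listed in LD_LIBRARY_PATH."""
    env = os.environ if env is None else env
    # LD_LIBRARY_PATH is a colon-separated list of directories.
    entries = env.get("LD_LIBRARY_PATH", "").split(":")
    target = os.path.normpath(lib_dir)
    return any(os.path.normpath(entry) == target for entry in entries if entry)


# Path taken from the setup steps; adjust it to your install location.
TRT_LIB_DIR = "/home/soft/TensorRT-8.2.1.8/lib"
print("TensorRT lib dir on LD_LIBRARY_PATH:", dir_on_library_path(TRT_LIB_DIR))
```

The dynamic loader reads this variable when the process starts, which is why it has to be set in the shell or in the PyCharm run configuration rather than inside the script itself.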
III. Building an MLP network with the TensorRT Python API, step by step:
The following imports are required:
import os
import struct

import numpy as np
import tensorrt as trt

# required for the inference using TRT engine
import pycuda.autoinit  # initializes CUDA and creates a context on import
import pycuda.driver as cuda
Build the engine and save it to a file
① Create the logger, which is needed to create the builder. A minimal example:
# A logger provided by NVIDIA-TRT
gLogger = trt.Logger(trt.Logger.INFO)
② Create the builder, passing in gLogger
# Create Builder with logger provided by TRT
builder = trt.Builder(gLogger)
③ Build the network
# build an empty network using builder
network = builder.create_network()
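Once the empty network is created, layers must be added to it. The network can be populated with the ONNX, Caffe, or UFF parsers, but this post builds it with the Python API, as follows: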
# add an input to network using the *input-name
data = network.add_input('data', dt, (1, 1, INPUT_SIZE))
# add the layer with output-size (number of outputs)
linear = network.add_fully_connected(input=data,
                                     num_outputs=OUTPUT_SIZE,
                                     kernel=weight_map['linear.weight'],
                                     bias=weight_map['linear.bias'])
# set the name for output layer
linear.get_output(0).name = OUTPUT_BLOB_NAME
# mark this layer as final output layer
network.mark_output(linear.get_output(0))
Here weight_map is the variable that holds the weights. It is a dictionary that maps each weight name to an array of values.
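The weight file layout can be inferred from the load_weights function in the complete code: the first line holds the number of entries, and each following line is a name, a value count, and then one big-endian float32 per value written as hex. The sketch below round-trips that layout in plain Python. The helper names `encode_wts` and `decode_wts` are hypothetical, and the values are illustrative rather than the weights used in this article.

```python
import struct


def encode_wts(weights):
    """Serialize {name: [floats]} into the text layout read by load_weights."""
    lines = [str(len(weights))]
    for name, values in weights.items():
        hex_values = [struct.pack(">f", v).hex() for v in values]
        lines.append(" ".join([name, str(len(values))] + hex_values))
    return "\n".join(lines)


def decode_wts(text):
    """Parse the same layout back into {name: [floats]}."""
    lines = text.strip().split("\n")
    count = int(lines[0])
    assert count == len(lines) - 1
    weight_map = {}
    for line in lines[1:]:
        name, cur_count, *hex_values = line.split(" ")
        assert int(cur_count) == len(hex_values)
        weight_map[name] = [
            struct.unpack(">f", bytes.fromhex(h))[0] for h in hex_values
        ]
    return weight_map


# Illustrative values only, not the weights used in this article.
example = {"linear.weight": [2.0], "linear.bias": [0.5]}
text = encode_wts(example)
print(text)
```

Writing `text` to a file such as mlp.wts would give an input that the loader in the complete code can parse.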
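④ Configure the build parameters
The TensorRT builder is used to create an optimized runtime. One of the builder's jobs is to search its catalog of CUDA kernels for the fastest implementation, so the GPU used to build the optimized engine must be the same as the GPU that actually runs it. This is also why an engine cannot simply be moved to a different environment.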
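Because an engine is tied to the GPU it was built on, one simple safeguard is to record the build environment in the engine file name, so a mismatched file is never loaded by accident. This is only a sketch of that idea: the helper `engine_file_name` is hypothetical, the GPU name is an example, and the caller is assumed to supply the GPU name and TensorRT version as strings.

```python
import re


def engine_file_name(model_name, gpu_name, trt_version):
    """Build an engine file name that records where the engine was built."""
    def slug(text):
        # Keep letters, digits and dots; collapse everything else to '-'.
        return re.sub(r"[^A-Za-z0-9.]+", "-", text).strip("-").lower()

    return f"{slug(model_name)}_{slug(gpu_name)}_trt{slug(trt_version)}.engine"


# Example GPU name; substitute the device the engine is actually built on.
print(engine_file_name("mlp", "NVIDIA GeForce RTX 3080", "8.2.1.8"))
```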
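The builder has many properties that control the precision the network runs at and the auto-tuning parameters. The builder can also be queried to find out which reduced-precision types the hardware supports natively.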
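One particularly important property is the maximum batch size, which specifies the batch size TensorRT optimizes for. At run time, only batch sizes no larger than this value can be used.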
# Create configurations from Engine Builder
config = builder.create_builder_config()
# set the batch size of current builder
builder.max_batch_size = max_batch_size
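Since the run-time batch cannot exceed the maximum batch size set at build time, a larger set of inputs has to be fed in chunks. A minimal sketch of that chunking, using a hypothetical helper `split_into_batches` and made-up sample values:

```python
def split_into_batches(samples, max_batch_size):
    """Yield consecutive chunks of samples, each at most max_batch_size long."""
    if max_batch_size < 1:
        raise ValueError("max_batch_size must be at least 1")
    for start in range(0, len(samples), max_batch_size):
        yield samples[start:start + max_batch_size]


batches = list(split_into_batches([1.0, 2.0, 3.0, 4.0, 5.0], max_batch_size=2))
print(batches)  # [[1.0, 2.0], [3.0, 4.0], [5.0]]
```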
⑤ Create the engine
# create the engine with model and hardware configs
engine = builder.build_engine(network, config)
⑥ Serialize the engine and save it
# Write the engine into binary file
print("[INFO]: Writing engine into binary...")
with open(ENGINE_PATH, "wb") as f:
    # write serialized model in file
    f.write(engine.serialize())
Here ENGINE_PATH is the path where the engine is saved, for example: "/home/mlp/mlp.engine"
⑦ Free memory
# free the memory
del engine
del builder
# free captured memory
del network
del weight_map
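The above is the complete workflow for compiling a network into an engine with the TensorRT Python API and saving it. To switch to a different network later, the main change is in step ③, building the network.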
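The build-once, reload-later workflow can be sketched independently of TensorRT. In the example below, `load_or_build` and `build_fn` are hypothetical names: `build_fn` stands for any callable that returns the serialized engine as bytes, and the demonstration uses stand-in bytes rather than a real engine.

```python
import os
import tempfile


def load_or_build(engine_path, build_fn):
    """Return serialized engine bytes, building them only if no file exists.

    build_fn is a hypothetical callable that returns the serialized engine
    as bytes (for example, a wrapper around the build steps above).
    """
    if os.path.exists(engine_path):
        with open(engine_path, "rb") as f:
            return f.read()
    data = build_fn()
    with open(engine_path, "wb") as f:
        f.write(data)
    return data


# Demonstration with stand-in bytes instead of a real engine.
calls = []


def fake_build():
    calls.append(1)
    return b"serialized-engine"


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "mlp.engine")
    first = load_or_build(path, fake_build)   # builds and writes the file
    second = load_or_build(path, fake_build)  # reads the cached file

print("build calls:", len(calls))
```

Building is the slow part, so caching the serialized bytes on disk means later runs only pay for reading the file and deserializing it.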
Reload the engine file and run inference:
① Read the engine and deserialize it
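# create a runtime (required for deserialization of model) with NVIDIA's logger
runtime = trt.Runtime(gLogger)
assert runtime
# read and deserialize engine for inference
with open(ENGINE_PATH, "rb") as f:
    engine = runtime.deserialize_cuda_engine(f.read())
assert engine
# create execution context -- required for inference executions
context = engine.create_execution_context()
assert context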
Here ENGINE_PATH is the path of the saved engine, for example: ENGINE_PATH = "C:\\Users\\Administrator\\Desktop\\code\\tensorrt-code\\mlp\\mlp.engine"
gLogger is the logger created in the engine-building steps.
② Set up the input and output
# create input as array
data = np.array([input_data], dtype=np.float32)
# allocate page-locked host memory for the input
host_in = cuda.pagelocked_empty((INPUT_SIZE), dtype=np.float32)
# copy the flattened input array into the page-locked host buffer
np.copyto(host_in, data.ravel())
# allocate page-locked host memory for the output
host_out = cuda.pagelocked_empty(OUTPUT_SIZE, dtype=np.float32)
③ Run inference
# do inference using required parameters
do_inference(context, host_in, host_out)
The inference function is:
def do_inference(inf_context, inf_host_in, inf_host_out):
    """
    Perform inference using the CUDA context
    :param inf_context: context created by engine
    :param inf_host_in: input from the host
    :param inf_host_out: output to save on host
    :return:
    """
    inference_engine = inf_context.engine
    # Input and output bindings are required for inference
    assert inference_engine.num_bindings == 2

    # allocate memory in GPU using CUDA bindings
    device_in = cuda.mem_alloc(inf_host_in.nbytes)
    device_out = cuda.mem_alloc(inf_host_out.nbytes)

    # create bindings for input and output
    bindings = [int(device_in), int(device_out)]

    # create CUDA stream for simultaneous CUDA operations
    stream = cuda.Stream()

    # copy input from host (CPU) to device (GPU) in stream
    cuda.memcpy_htod_async(device_in, inf_host_in, stream)

    # execute inference using context provided by engine
    inf_context.execute_async(bindings=bindings, stream_handle=stream.handle)  # key step

    # copy output back from device (GPU) to host (CPU)
    cuda.memcpy_dtoh_async(inf_host_out, device_out, stream)

    # synchronize the stream to prevent issues
    # (block CUDA and wait for CUDA operations to be completed)
    stream.synchronize()
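The device allocations in the inference function are sized from the host buffers' `nbytes`, which is just the element count times the element size. The sketch below spells out that arithmetic in plain Python. The helper `buffer_nbytes` is hypothetical, and the second shape is a made-up larger layer for comparison.

```python
from math import prod

FLOAT32_BYTES = 4  # size of one np.float32 element


def buffer_nbytes(shape, itemsize=FLOAT32_BYTES):
    """Number of bytes needed for a dense buffer with the given shape."""
    return prod(shape) * itemsize


# With INPUT_SIZE = OUTPUT_SIZE = 1, as in the complete code.
print(buffer_nbytes((1,)))
# A hypothetical larger layer, for comparison.
print(buffer_nbytes((1, 1, 256)))
```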
The above is the TensorRT inference process.
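To sanity-check an engine's output, it helps to have a reference value computed without TensorRT. A fully connected layer computes y = W x + b, so a plain-Python version is only a few lines. This is a sketch with hypothetical weights, assuming one weight row per output; substitute the values stored in your own mlp.wts.

```python
def linear_reference(x, weight, bias):
    """Compute y = W x + b for a fully connected layer in plain Python.

    weight is a list of rows (one row per output); bias has one entry per output.
    """
    return [
        sum(w * xi for w, xi in zip(row, x)) + b
        for row, b in zip(weight, bias)
    ]


# Hypothetical weights; replace them with the values stored in mlp.wts.
weight = [[2.0]]
bias = [0.5]
print(linear_reference([4.0], weight, bias))  # [8.5]
```

If the engine output for the same input differs noticeably from this reference, the weight file or the layer definition is the first place to look.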
The execution result is as follows:
IV. Complete code

import os
import struct

import numpy as np
# required for the model creation
import tensorrt as trt
# required for the inference using TRT engine
import pycuda.autoinit
import pycuda.driver as cuda

# Sizes of input and output for TensorRT model
INPUT_SIZE = 1
OUTPUT_SIZE = 1

# input and output names are must for the TRT model
# INPUT_BLOB_NAME = 'data'
OUTPUT_BLOB_NAME = 'out'

# A logger provided by NVIDIA-TRT
gLogger = trt.Logger(trt.Logger.INFO)


################################
# DEPLOYMENT RELATED ###########
################################
def load_weights(file_path):
    """
    Parse the .wts file and store weights in dict format
    :param file_path: path of the .wts weight file
    :return weight_map: dictionary containing weights and their values
    """
    print(f"[INFO]: Loading weights: {file_path}")
    assert os.path.exists(file_path), '[ERROR]: Unable to load weight file.'

    weight_map = {}
    with open(file_path, "r") as f:
        lines = [line.strip() for line in f]

    # count for total # of weights
    count = int(lines[0])
    assert count == len(lines) - 1

    # Loop through counts and get the exact num of values against weights
    for i in range(1, count + 1):
        splits = lines[i].split(" ")
        name = splits[0]
        cur_count = int(splits[1])

        # each line must hold the name, the count, and exactly cur_count values
        assert cur_count + 2 == len(splits)

        # loop through all weights and unpack from the hexadecimal values
        values = []
        for j in range(2, len(splits)):
            # hex string to bytes to float (struct.unpack returns a 1-tuple)
            values.append(struct.unpack(">f", bytes.fromhex(splits[j])))

        # store in format of { 'weight.name': [weights_val0, weight_val1, ..] }
        weight_map[name] = np.array(values, dtype=np.float32)

    return weight_map


def create_mlp_engine(max_batch_size, builder, config, dt, WEIGHT_PATH):
    """
    Create Multi-Layer Perceptron using the TRT Builder and Configurations
    :param max_batch_size: batch size for built TRT model
    :param builder: to build engine and networks
    :param config: configuration related to Hardware
    :param dt: datatype for model layers
    :param WEIGHT_PATH: path of the .wts weight file
    :return engine: TRT model
    """
    print("[INFO]: Creating MLP using TensorRT...")
    # load weight maps from the file
    weight_map = load_weights(WEIGHT_PATH)

    # build an empty network using builder
    network = builder.create_network()

    # add an input to network using the *input-name
    data = network.add_input('data', dt, (1, 1, INPUT_SIZE))

    # add the layer with output-size (number of outputs)
    linear = network.add_fully_connected(input=data,
                                         num_outputs=OUTPUT_SIZE,
                                         kernel=weight_map['linear.weight'],
                                         bias=weight_map['linear.bias'])

    # set the name for output layer
    linear.get_output(0).name = OUTPUT_BLOB_NAME

    # mark this layer as final output layer
    network.mark_output(linear.get_output(0))

    # set the batch size of current builder
    builder.max_batch_size = max_batch_size

    # create the engine with model and hardware configs
    engine = builder.build_engine(network, config)

    # free captured memory
    del network
    del weight_map

    # return engine
    return engine


def api2model(max_batch_size, dt=trt.float32, WEIGHT_PATH=None, ENGINE_PATH=None):
    """
    Create engine using TensorRT APIs
    :param max_batch_size: for the deployed model configs
    :param dt: datatype for model layers
    :param WEIGHT_PATH: path of the .wts weight file
    :param ENGINE_PATH: path where the serialized engine is written
    :return:
    """
    # Create Builder with logger provided by TRT
    builder = trt.Builder(gLogger)

    # Create configurations from Engine Builder
    config = builder.create_builder_config()

    # Create MLP Engine
    engine = create_mlp_engine(max_batch_size, builder, config, dt, WEIGHT_PATH)
    assert engine

    # Write the engine into binary file
    print("[INFO]: Writing engine into binary...")
    with open(ENGINE_PATH, "wb") as f:
        # write serialized model in file
        f.write(engine.serialize())

    # free the memory
    del engine
    del builder


################################
# INFERENCE RELATED ############
################################
def inite_engine(ENGINE_PATH):
    # create a runtime (required for deserialization of model) with NVIDIA's logger
    runtime = trt.Runtime(gLogger)
    assert runtime

    # read and deserialize engine for inference
    with open(ENGINE_PATH, "rb") as f:
        engine = runtime.deserialize_cuda_engine(f.read())
    assert engine
    return engine


def do_inference(inf_context, inf_host_in, inf_host_out):
    """
    Perform inference using the CUDA context
    :param inf_context: context created by engine
    :param inf_host_in: input from the host
    :param inf_host_out: output to save on host
    :return:
    """
    inference_engine = inf_context.engine
    # Input and output bindings are required for inference
    assert inference_engine.num_bindings == 2

    # allocate memory in GPU using CUDA bindings
    device_in = cuda.mem_alloc(inf_host_in.nbytes)
    device_out = cuda.mem_alloc(inf_host_out.nbytes)

    # create bindings for input and output
    bindings = [int(device_in), int(device_out)]

    # create CUDA stream for simultaneous CUDA operations
    stream = cuda.Stream()

    # copy input from host (CPU) to device (GPU) in stream
    cuda.memcpy_htod_async(device_in, inf_host_in, stream)

    # execute inference using context provided by engine
    inf_context.execute_async(bindings=bindings, stream_handle=stream.handle)  # key step

    # copy output back from device (GPU) to host (CPU)
    cuda.memcpy_dtoh_async(inf_host_out, device_out, stream)

    # synchronize the stream to prevent issues
    # (block CUDA and wait for CUDA operations to be completed)
    stream.synchronize()


def perform_inference(input_data, ENGINE_PATH):
    """
    Get inference using the pre-trained model
    :param input_data: a number as an input
    :param ENGINE_PATH: path of the serialized engine file
    :return:
    """
    engine = inite_engine(ENGINE_PATH)

    # create execution context -- required for inference executions
    context = engine.create_execution_context()
    assert context

    # create input as array
    data = np.array([input_data], dtype=np.float32)

    # allocate page-locked host memory for the input
    host_in = cuda.pagelocked_empty((INPUT_SIZE), dtype=np.float32)

    # copy the flattened input array into the page-locked host buffer
    np.copyto(host_in, data.ravel())

    # allocate page-locked host memory for the output
    host_out = cuda.pagelocked_empty(OUTPUT_SIZE, dtype=np.float32)

    # do inference using required parameters
    do_inference(context, host_in, host_out)

    print(f'\n[INFO]: Predictions using pre-trained model..\n\tInput:\t{input_data}\n\tOutput:\t{host_out[0]:.4f}')


if __name__ == "__main__":
    # args == 1 builds and saves the engine; any other value runs inference
    args = 2
    weight_path = "./mlp.wts"
    output_engine_path = "./mlp.engine"

    if args == 1:
        api2model(max_batch_size=1, WEIGHT_PATH=weight_path, ENGINE_PATH=output_engine_path)
        print("[INFO]: Successfully created TensorRT engine...")
        print("\n\tSet `args` to 2 and run the script again to perform inference\n")
    else:
        data = 4.0
        perform_inference(input_data=data, ENGINE_PATH=output_engine_path)