Import From ONNX
ONNX versions change quickly. The ONNX parser in TensorRT 5.1.x supports ONNX IR (intermediate representation) version 0.0.3 and opset version 9. For ONNX version incompatibility issues, see the ONNX Model Opset Version Converter.
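If a model was exported with a newer opset, it can usually be down-converted with the ONNX version converter before handing it to TensorRT. A minimal sketch, assuming the onnx Python package is installed and the file names are placeholders:

import onnx
from onnx import version_converter

original_model = onnx.load("model.onnx")  # model exported with an unsupported opset
converted_model = version_converter.convert_version(original_model, 9)  # target opset 9 for TensorRT 5.1.x
onnx.save(converted_model, "model_opset9.onnx")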
Create the builder, network, and parser
import tensorrt as trt

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)

with trt.Builder(TRT_LOGGER) as builder, builder.create_network() as network, trt.OnnxParser(network, TRT_LOGGER) as parser:
    with open(model_path, 'rb') as model:
        parser.parse(model.read())
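parser.parse() returns False when parsing fails, and the errors can then be inspected on the parser object. A minimal sketch of an error-checking variant of the parse call above, using the num_errors and get_error members of the ONNX parser:

with open(model_path, 'rb') as model:
    if not parser.parse(model.read()):
        for i in range(parser.num_errors):
            print(parser.get_error(i))  # report each error found while parsing the ONNX file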
Building the engine
The builder object has many attributes that can be used to control things such as quantization precision and batch size (a sketch of precision control follows the list below).
The builder has two important attributes:
1. maximum batch size: the largest batch size that TensorRT will optimize for; at run time, the batch size actually used must be less than or equal to this value.
2. maximum workspace size: scratch space for layer algorithms. This value limits the maximum amount of temporary memory that any layer in the network can use; if it is set too small, TensorRT may not be able to find an implementation for a given layer.
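Besides these two attributes, precision can also be controlled through the builder. A minimal sketch of enabling FP16 mode, assuming the target GPU supports fast FP16 (platform_has_fast_fp16 and fp16_mode are builder members in the TensorRT 5.x Python API):

if builder.platform_has_fast_fp16:
    builder.fp16_mode = True  # allow TensorRT to choose FP16 kernels where they are faster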
Build the engine:
builder.max_batch_size = max_batch_size
builder.max_workspace_size = 1 << 20  # This determines the amount of memory available to the builder when building an optimized engine and should generally be set as high as possible.
with builder.build_cuda_engine(network) as engine:
    pass  # Do inference here.
While the engine is being built, TensorRT makes a copy of the weight data.
Serializing a model
Serializing means converting the engine into a format that can be stored and used later for inference.
At inference time, all you need to do is deserialize the stored engine.
The reason for doing this is that building an engine is fairly time-consuming; if an already-built engine can be saved and reloaded later, the preparation time for inference is greatly reduced.
Note: a saved engine cannot be used across platforms.
Serialize the model to a modelstream:
serialized_engine = engine.serialize()
Deserialize modelstream to perform inference. Deserializing requires creation of a runtime object:
with trt.Runtime(TRT_LOGGER) as runtime:
    engine = runtime.deserialize_cuda_engine(serialized_engine)
If you want to save the engine to a file instead:
Serialize the engine and write to a file:
with open("sample.engine", "wb") as f:
    f.write(engine.serialize())
Read the engine from the file and deserialize:
with open("sample.engine", "rb") as f, trt.Runtime(TRT_LOGGER) as runtime:
    engine = runtime.deserialize_cuda_engine(f.read())
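Putting the two snippets together, a common pattern is to deserialize a cached engine when the plan file already exists and only build (and save) it otherwise, since building is the slow step. A minimal sketch, where engine_path and build_engine() are hypothetical names:

import os

def get_engine(engine_path, runtime):
    # Reuse the cached engine if a plan file exists; otherwise build it and cache it for next time.
    if os.path.exists(engine_path):
        with open(engine_path, "rb") as f:
            return runtime.deserialize_cuda_engine(f.read())
    engine = build_engine()  # hypothetical helper wrapping the builder/parser steps shown earlier
    with open(engine_path, "wb") as f:
        f.write(engine.serialize())
    return engine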
Performing Inference
Allocate some host and device buffers for the inputs and outputs:
import numpy as np
import pycuda.driver as cuda
import pycuda.autoinit  # creates a CUDA context for the calls below

# Determine dimensions and create page-locked memory buffers (i.e. won't be swapped to disk) to hold host inputs/outputs.
h_input = cuda.pagelocked_empty(engine.get_binding_shape(0).volume(), dtype=np.float32)
h_output = cuda.pagelocked_empty(engine.get_binding_shape(1).volume(), dtype=np.float32)
# Allocate device memory for inputs and outputs.
d_input = cuda.mem_alloc(h_input.nbytes)
d_output = cuda.mem_alloc(h_output.nbytes)
# Create a stream in which to copy inputs/outputs and run inference.
stream = cuda.Stream()
Some additional space is needed to hold the intermediate activation values. The engine contains the network definition and the trained weights, but this extra space is still required. These are held in an execution context:
with engine.create_execution_context() as context:
    # Transfer input data to the GPU.
    cuda.memcpy_htod_async(d_input, h_input, stream)
    # Run inference.
    context.execute_async(bindings=[int(d_input), int(d_output)], stream_handle=stream.handle)
    # Transfer predictions back from the GPU.
    cuda.memcpy_dtoh_async(h_output, d_output, stream)
    # Synchronize the stream.
    stream.synchronize()
    # Return the host output (this snippet is assumed to run inside a function).
    return h_output
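The code above assumes the page-locked input buffer h_input has already been filled with the preprocessed input. A minimal sketch of that step, where input_image is a hypothetical numpy.float32 array whose flattened size matches input binding 0:

# input_image is a hypothetical preprocessed array; its flattened size must equal h_input.size
np.copyto(h_input, input_image.ravel())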