TensorRT acceleration: built for NVIDIA edge AI chips, it can take models produced by Caffe or TensorFlow and use them directly for prediction (inference)

Reposted article (see the original), 2018-01-18 17:36. Tags: tensorflow, machine learning

Official site: https://developer.nvidia.com/tensorrt

Purpose: NVIDIA TensorRT™ is a high-performance deep learning inference optimizer and runtime that delivers low latency, high-throughput inference for deep learning applications. TensorRT can be used to rapidly optimize, validate, and deploy trained neural networks for inference to hyperscale data centers, embedded, or automotive product platforms.

Developers can use TensorRT to deliver fast inference using INT8 or FP16 optimized precision that significantly reduces latency, as demanded by real-time services such as streaming video categorization on the cloud or object detection and segmentation on embedded and automotive platforms.

With TensorRT developers can focus on developing novel AI-powered applications rather than performance tuning for inference deployment. TensorRT runtime ensures optimal inference performance that can meet the most demanding latency and throughput requirements.

TensorRT can be deployed to Tesla GPUs in the datacenter, Jetson embedded platforms, and NVIDIA DRIVE autonomous driving platforms.

What's New in TensorRT 3?

TensorRT 3 is the key to unlocking optimal inference performance on Volta GPUs. It delivers up to 40x higher throughput in under 7ms real-time latency vs. CPU-Only inference.
Highlights from this release include:

Deliver up to 3.7x faster inference on Tesla V100 vs. Tesla P100 under 7ms real-time latency
Optimize and deploy TensorFlow models up to 18x faster compared to TensorFlow framework inference on Tesla V100
Improved productivity with easy to use Python API

See https://stackoverflow.com/questions/41142284/run-tensorflow-with-nvidia-tensorrt-inference-engine, which confirms that importing TensorFlow models is now supported:

TensorRT 3.0 supports import/conversion of TensorFlow graphs via its UFF (Universal Framework Format). Some layer implementations are missing and will require custom implementations via the IPlugin interface.

Previous versions didn't support native import of TensorFlow models/checkpoints.

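From: http://blog.csdn.net/jsa158/article/details/53944159

Introduction to TensorRT
At the time of writing, TensorRT is described as the inference engine with the best accuracy and the highest speed, and it is still being improved: speed keeps going up while accuracy is preserved. TensorRT can only be used for inference, not for training.

1. Files TensorRT needs
Basic files (not strictly required):
1> Network structure file (deploy.prototxt)
2> Trained weights (net.caffemodel)
sampleMNISTAPI in TensorRT 2.0 EA and sampleMNISTGIE in TensorRT 1.0 are almost identical; both are examples of building a network without a caffemodel file.
2. Layers supported by TensorRT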
Convolution: 2D
Activation: ReLU, tanh and sigmoid
Pooling: max and average
ElementWise: sum, product or max of two tensors
LRN: cross-channel only
Fully-connected: with or without bias
SoftMax: cross-channel only
Deconvolution
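For layers that TensorRT does not support, you can run the supported layers first, then pass the output to Caffe and let Caffe run the rest. V1 does not allow TensorRT and Caffe to work together; V2 does. (NVIDIA is preparing an example, which may be uploaded to GitHub later.)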
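The hand-off described above can be sketched in a few lines. This is only an illustration of the splitting idea, not TensorRT or Caffe code: the function name, the layer-type strings, and the sample network are all hypothetical, and the supported set simply mirrors the list above.

```python
# Layer types mirroring the supported-layer list above (labels are illustrative).
SUPPORTED = {
    "Convolution", "Activation", "Pooling", "ElementWise",
    "LRN", "FullyConnected", "SoftMax", "Deconvolution",
}

def split_at_first_unsupported(layer_types):
    """Return (prefix to run in TensorRT, remainder to run in Caffe)."""
    for i, kind in enumerate(layer_types):
        if kind not in SUPPORTED:
            return layer_types[:i], layer_types[i:]
    return layer_types, []

# Hypothetical network in which "PReLU" is the first unsupported layer.
net = ["Convolution", "Pooling", "Convolution", "PReLU", "FullyConnected", "SoftMax"]
trt_part, caffe_part = split_at_first_unsupported(net)
print("TensorRT:", trt_part)
print("Caffe:   ", caffe_part)
```

Everything before the first unsupported layer goes to TensorRT, and its output tensor would then be fed to Caffe as input for the remaining layers.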
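3. TensorRT processing flow
Basic steps: 1> convert the Caffe model into a GIE model, or load a GIE-ready model from disk or over the network; 2> run the GIE engine (with the data copied to the GPU beforehand); 3> retrieve the results.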

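4. TensorRT Optimization
TensorRT applies many techniques to optimize the network and the per-layer computation.
These include memory optimization, network optimization, layer fusion, layer removal, GPU assembly instructions and intrinsics, higher GPU utilization, reduced precision requirements, cuDNN optimizations, and choosing a compute mode or GPU clock according to the batch size. Convolution is optimized with algorithms such as Winograd (up to a 3x speedup) or with hardware-specific implementations.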
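To make the Winograd idea concrete, here is a minimal sketch of the smallest 1D case, usually written F(2,3): two outputs of a 3-tap filter computed with 4 multiplications instead of 6. It is a textbook illustration only. It says nothing about how TensorRT implements convolution internally, and the function names are made up for this example.

```python
def direct_f23(d, g):
    """Two outputs of a 3-tap sliding dot product: 6 multiplications."""
    return [d[0] * g[0] + d[1] * g[1] + d[2] * g[2],
            d[1] * g[0] + d[2] * g[1] + d[3] * g[2]]

def winograd_f23(d, g):
    """Same two outputs using 4 multiplications (Winograd F(2,3))."""
    # Filter transform: computed once per filter and reused for every input tile.
    u0 = g[0]
    u1 = (g[0] + g[1] + g[2]) / 2
    u2 = (g[0] - g[1] + g[2]) / 2
    u3 = g[2]
    # The four multiplications.
    m1 = (d[0] - d[2]) * u0
    m2 = (d[1] + d[2]) * u1
    m3 = (d[2] - d[1]) * u2
    m4 = (d[1] - d[3]) * u3
    # Output transform: additions only.
    return [m1 + m2 + m3, m2 - m3 - m4]

tile = [1, 2, 3, 4]      # four consecutive input samples
taps = [1, 2, 3]         # 3-tap filter
print(direct_f23(tile, taps))
print(winograd_f23(tile, taps))
```

The saving comes from trading multiplications for additions; the 2D variants used for 3x3 convolutions nest the same transform along both axes.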

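Batches are processed in parallel wherever possible. In CUDA, warp alignment is used to improve the GPU instruction hit rate. The CPU can be used as well, so that the CPU and the GPU each handle part of the work.

The data layout can use half-precision FP16 and texture memory (13% inference speedup).
In network optimization, layers are fused vertically and horizontally, and concatenation layers (concat) can be removed.
Sparse matrix encoding is used internally.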
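The cost of the FP16 option mentioned above is reduced precision. The sketch below uses only the Python standard library (`struct` format `e` is IEEE 754 half precision) to round a few values to FP16 and show the error. The sample values are arbitrary, and it demonstrates the number format only, not TensorRT behaviour.

```python
import struct

def to_fp16(x):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

# Arbitrary sample values: small fractions survive well, large values lose digits.
for value in (0.1, 1.0, 1000.123, 2049.0):
    half = to_fp16(value)
    print(f"{value!r:>10} -> {half!r:<14} abs error {abs(half - value):.6f}")
```

Half precision keeps roughly three significant decimal digits, which is often enough for trained network weights and activations but is worth checking per model.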

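Jetson TX1 Development Tutorial (4): A First Look at Accelerating Caffe with TensorRT

Reposted from: http://blog.csdn.net/amds123/article/details/72234167?locationnum=13&fps=1

Project page: NVIDIA TensorRT

Preface

TensorRT (GIE) is a C++ library for the Jetson TX1 and for discrete NVIDIA GPUs (Tesla P100, K80, M4, Titan X and so on). It supports fp16, that is, half-precision arithmetic. Because it trades precision for speed, it accelerates inference noticeably with no obvious loss of accuracy, often roughly doubling performance, and it can consume Caffe models. There is very little material about TensorRT online at the moment, so the blogger is writing some here and will add more when time permits.

TensorRT overview

TensorRT is currently built with gcc 4.8 and is independent of any deep learning framework. For Caffe, TensorRT converts the Caffe artifacts and then runs on its own. The tool that parses Caffe models is called NvCaffeParser; from the prototxt file and the caffemodel weights it produces a new model that supports half precision.

TensorRT currently supports most of the commonly used Caffe layers, including: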

Convolution, with or without bias. Currently only 2D convolutions (i.e. 4D input and output tensors) are supported. Note: The operation this layer performs is actually a correlation, which is a consideration if you are formatting weights to import via GIE's API rather than the caffe parser library.

Activation: ReLU, tanh and sigmoid.

Pooling: max and average.

Scale: per-tensor, per-channel or per-weight affine transformation and exponentiation by constant values. Batch Normalization can be implemented using the Scale layer.

ElementWise: sum, product or max of two tensors.

LRN (local response normalization): cross-channel only.

Fully-connected, with or without bias

SoftMax: cross-channel only

Deconvolution, with and without bias
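The note on the Convolution layer matters when weights are supplied by hand: a correlation slides the kernel as stored, whereas a mathematical convolution flips it first. The following 1D sketch shows the difference. The function names and data are illustrative and are not part of any TensorRT or Caffe API.

```python
def correlate1d(signal, kernel):
    """Sliding dot product with the kernel applied as stored."""
    k = len(kernel)
    return [sum(signal[i + j] * kernel[j] for j in range(k))
            for i in range(len(signal) - k + 1)]

def convolve1d(signal, kernel):
    """Mathematical convolution: flip the kernel, then correlate."""
    return correlate1d(signal, kernel[::-1])

signal = [1, 2, 3, 4]
kernel = [1, 0, -1]      # asymmetric, so the flip changes the result
print("correlation:", correlate1d(signal, kernel))
print("convolution:", convolve1d(signal, kernel))
```

If weights come from a source that assumes true convolution, flipping each kernel before import gives the same result under a correlation-based layer; symmetric kernels are unaffected.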

Unsupported layers include:

Deconvolution groups

PReLU

Scale, other than per-channel scaling

EltWise with more than two inputs
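Since per-channel scaling is the supported form of the Scale layer, and the layer list notes that Batch Normalization can be expressed through it, the folding can be sketched as follows. This is the standard inference-time algebra written in plain Python; the function name, parameter names, and numbers are assumptions made for the example.

```python
import math

def bn_to_scale(gamma, beta, mean, var, eps=1e-5):
    """Fold per-channel batch-norm statistics into (scale, shift) lists.

    y = gamma * (x - mean) / sqrt(var + eps) + beta  ==  scale * x + shift
    """
    scale = [g / math.sqrt(v + eps) for g, v in zip(gamma, var)]
    shift = [b - m * s for b, m, s in zip(beta, mean, scale)]
    return scale, shift

# Hypothetical statistics for a two-channel feature map.
gamma, beta = [2.0, 0.5], [1.0, 0.0]
mean, var = [3.0, -1.0], [4.0, 0.25]
scale, shift = bn_to_scale(gamma, beta, mean, var, eps=0.0)

x = [5.0, 1.0]   # one value per channel
folded = [s * xi + t for s, xi, t in zip(scale, x, shift)]
reference = [g * (xi - m) / math.sqrt(v) + b
             for g, xi, m, v, b in zip(gamma, x, mean, var, beta)]
print(folded, reference)
```

Because the statistics are fixed at inference time, the whole normalization collapses into one multiply and one add per channel.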

Using TensorRT involves two main steps (in C++ code):

In the build phase, the toolkit takes a network definition, performs optimizations, and generates the inference engine.

In the execution phase, the engine runs inference tasks using input and output buffers on the GPU.

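For details on how TensorRT works, see this official blog post:
Production Deep Learning with NVIDIA GPU Inference Engine

The principles are not covered in depth here. Below, MNIST handwritten digit recognition and the official samples are used to walk through the steps of using TensorRT.

Hands-on: running a Caffe model with TensorRT

Getting TensorRT

First, the Jetson TX1 gets TensorRT support automatically through a full install of JetPack 2.3.1 (see the blogger's earlier tutorial). After flashing, the TX1 already has a set of C++ runtime libraries for TensorRT, so once you know the API, a single C++ program is enough to use it.

If you have no TX1 but do have a Pascal-architecture card (such as a TITAN X), you can still try TensorRT by applying on the NVIDIA website for TensorRT test access. You need to describe your research purpose in detail, and approval usually comes after one or two rounds of email. The blogger currently has test access to TensorRT 1.0 and 2.0 and will test TensorRT on a TITAN X when the opportunity arises.

Running the official samples

Here the blogger uses the Jetson TX1 to see how the bundled official samples run. The samples live in /usr/src/gie_samples/samples; opening that folder shows the following files:

[Screenshot: contents of the samples folder]

The data folder holds the model description files and weights for LeNet and GoogLeNet, the giexec folder contains the source code of TensorRT's general-purpose interface, and the remaining folders contain the source code of network-specific interfaces. Makefile is the build configuration: open a terminal in the gie_samples folder and run sudo make to compile everything. This generates a set of executables in the bin folder, whose contents are:

[Screenshot: contents of the bin folder]

First, test the giexec binary: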

cd /usr/src/gie_samples/samples
./bin/giexec

# This prints the following usage text:
Mandatory params:
  --model=<file>      Caffe model file
  --deploy=<file>     Caffe deploy file
  --output=<name>     Output blob name (can be specified multiple times)
Optional params:
  --batch=N           Set batch size (default = 1)
  --device=N          Set cuda device to N (default = 0)
  --iterations=N      Run N iterations (default = 10)
  --avgRuns=N         Set avgRuns to N - perf is measured as an average of avgRuns (default=10)
  --workspace=N       Set workspace size in megabytes (default = 16)
  --half2             Run in paired fp16 mode - default = false
  --verbose           Use verbose logging - default = false
  --hostTime          Measure host time rather than GPU time - default = false
  --engine=<file>     Generate a serialized GIE engine
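A full-precision versus half-precision comparison could be set up along the following lines. The sketch only assembles command lines from flags that appear in the usage text above and prints them. The model path, deploy path, and output blob name are hypothetical placeholders, not values taken from the original post, and the helper function is made up for this example.

```python
import shlex

def giexec_command(model, deploy, outputs, half2=False, batch=1, iterations=10):
    """Build an argv list for giexec using only flags from its usage text."""
    argv = ["./bin/giexec", f"--model={model}", f"--deploy={deploy}"]
    argv += [f"--output={name}" for name in outputs]   # flag may repeat
    argv += [f"--batch={batch}", f"--iterations={iterations}"]
    if half2:
        argv.append("--half2")                         # paired fp16 mode
    return argv

# Hypothetical file locations and blob name; substitute the ones on your system.
MODEL = "data/mnist/mnist.caffemodel"
DEPLOY = "data/mnist/mnist.prototxt"
OUTPUTS = ["prob"]

for half2 in (False, True):
    print(shlex.join(giexec_command(MODEL, DEPLOY, OUTPUTS, half2=half2)))
```

Running the two printed commands from /usr/src/gie_samples/samples and comparing the reported times gives a like-for-like measurement, since only the --half2 flag differs.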

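Comparing the two runs, half precision is indeed faster, but the gain is modest, roughly 50%. The blogger feels this MNIST example is too simple to be very convincing; see the official speedup chart instead.

[Figure: official speedup chart]

So what do the remaining executables do? Trying them one by one gives the following results: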
./bin/sample_mnist
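[Screenshot: output of sample_mnist]

./bin/sample_mnist_gie
[Screenshot: output of sample_mnist_gie]

Neither of these two takes any arguments, and both appear to classify a randomly chosen image. The exact difference between them requires reading the source code.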

./bin/sample_googlenet
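[Screenshot: output of sample_googlenet]

This one has no options either; it simply runs once and reports a time at the end.

Closing remarks

The TensorRT samples that ship with the TX1 are not hard to run. In the blogger's view, their main value lies in the cpp source files: only by studying these official samples can we write our own C++ code to accelerate our own Caffe models. giexec.cpp is a little over 300 lines, which is fairly long, and although it has some comments, many of its functions are hard to use without studying the API carefully. The TensorRT API documentation is extensive, and the blogger has not yet had time to go through it.
Those who own a Jetson TX1 can open /usr/share/doc/gie/doc/API/index.html to read the official API documentation. The blogger has uploaded it to CSDN together with the sample source code for anyone interested. (The API documentation does not seem to render well outside the TX1.)

TensorRT sample source code
TensorRT official API documentation

Readers of the blogger's earlier posts asked whether TensorRT can speed up object detection frameworks such as SSD. As far as the blogger knows, not yet. TensorRT currently has significant limitations: it only supports forward inference on a converted Caffe model, and training is not supported. In the blogger's view, 99% of current use is image classification, for example converting AlexNet or GoogLeNet to half precision, which doubles classification speed. Because detection frameworks such as SSD contain special layers and have a complex structure, the existing TensorRT functions cannot convert or define them.

That said, NVIDIA has started to take object detection seriously. From email exchanges with NVIDIA engineers, the blogger learned that a future TensorRT release will support Faster R-CNN and SSD, which are already in development. Once that arrives, object detection on the Jetson TX1 at more than 10 fps should be within reach.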

You mentioned the SSD Single Shot Detector. We are working on SSD right now. I think you’ll find there are some layers needed for SSD that aren’t supported in the versions of TensorRT available through these early release programs. We are adding a custom layer capability to the next version and providing support using that mechanism for Faster R-CNN and SSD.
