TensorRT-8 Quantization Analysis
Earlier I covered asymmetric quantization, quantization schemes, and other such details. Having spent some time doing TensorRT-based quantization, it is worth taking a closer look at how TensorRT quantizes. This article leans toward practice: it mainly walks through TensorRT's explicit-quantization workflow and the general ideas behind quantization.
0x01 TensorRT Quantization
It is 2022 and quantization is a mature technique, with quantization frameworks[1] and quantization algorithms appearing one after another. I had tried several quantization frameworks before; most of them only simulate quantization at the algorithm level and cannot actually be deployed to real hardware, so they stay at the algorithm level. By now, though, there are plenty of mature quantization frameworks, many of them open source. Whether it is PyTorch, TVM, or TensorRT, GPU and CPU quantization based on these frameworks is already widely used. I also had a look at ppq, the quantization framework recently open-sourced by SenseTime; it is likewise quite mature, and at the very least it really can be deployed and deliver real performance gains to a model.
The previous article was heavy on theory, so this one focuses on the actual quantization workflow; things only matter once they actually run. Since I have been using TensorRT for a while, let's talk about TensorRT's quantization details and its practical quantization flow.
TensorRT's quantization tooling is fairly mature by now. It supports both PTQ and QAT, and NVIDIA also provides some tools[2] to help with quantization (both within TensorRT itself and as companion tools).
Besides TensorRT I have of course used some other quantization frameworks and written some code for them. Most quantization approaches are broadly similar: read the model, convert it to an IR, analyze the graph, apply some optimization strategies, and so on. How the graph is organized and how the structure is optimized may differ, as do the specific calibration algorithms, but overall the quantization pipeline is much the same.
So understanding TensorRT's quantization process is quite important, and it also helps in understanding how other frameworks quantize; the underlying ideas never change.
0x02 TensorRT's Quantization Modes
TensorRT has two quantization modes: implicit and explicit. The former, implicit quantization, was the common approach before TensorRT 7; the latter, explicit quantization, is only fully supported from version 8 onward, meaning TensorRT can load a model carrying QDQ information and generate the corresponding quantized engine.
Feature support of the two quantization modes:
The two quantization workflows in TensorRT
Post-training quantization is most closely tied to implicit quantization.
Post-Training Quantization
Post-training quantization is PTQ. TensorRT's PTQ algorithm was first made public in 2017, when NVIDIA released a slide deck on its entropy-based calibration that briefly explained the principle and workflow; the idea is built into TensorRT for users to call, but it is closed-source, so you can only quantize through the APIs TensorRT exposes.
No training is needed: you just provide some sample images, run calibration on the already-trained model, and collect the scale each layer needs, and the quantization is done. The flow looks roughly like this:
The PTQ workflow
In practice you export an ONNX model and, while converting it to TensorRT, use the calibration facilities TensorRT provides. This is quite easy: you can do it directly with the official trtexec command, or through TensorRT's Python or C++ APIs.
TensorRT by now offers quite a few post-training calibration algorithms, each suited to different kinds of tasks (a minimal calibrator sketch follows the list below):
- EntropyCalibratorV2
Entropy calibration chooses the tensor’s scale factor to optimize the quantized tensor’s information-theoretic content, and usually suppresses outliers in the distribution. This is the current and recommended entropy calibrator and is required for DLA. Calibration happens before Layer fusion by default. It is recommended for CNN-based networks.
- MinMaxCalibrator
This calibrator uses the entire range of the activation distribution to determine the scale factor. It seems to work better for NLP tasks. Calibration happens before Layer fusion by default. This is recommended for networks such as NVIDIA BERT (an optimized version of Google's official implementation).
- EntropyCalibrator
This is the original entropy calibrator. It is less complicated to use than the LegacyCalibrator and typically produces better results. Calibration happens after Layer fusion by default.
- LegacyCalibrator
This calibrator is for compatibility with TensorRT 2.0 EA. This calibrator requires user parameterization and is provided as a fallback option if the other calibrators yield poor results. Calibration happens after Layer fusion by default. You can customize this calibrator to implement percentile max, for example, 99.99% percentile max is observed to have best accuracy for NVIDIA BERT.
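To make the calibration step concrete, here is a minimal sketch of an entropy calibrator written against the TensorRT Python API plus pycuda. It is not the official sample; the class name MyEntropyCalibrator, the cache file name calib.cache, and the way batches are supplied (an iterable of preprocessed float32 NCHW arrays) are all placeholders of mine:

```python
import numpy as np
import pycuda.autoinit  # noqa: F401 -- creates a CUDA context
import pycuda.driver as cuda
import tensorrt as trt


class MyEntropyCalibrator(trt.IInt8EntropyCalibrator2):
    """Feeds preprocessed batches to TensorRT during PTQ calibration."""

    def __init__(self, batches, cache_file="calib.cache"):
        trt.IInt8EntropyCalibrator2.__init__(self)
        self.batches = iter(batches)        # iterable of np.float32 arrays, shape (N, C, H, W)
        self.cache_file = cache_file
        self.current = next(self.batches)   # first batch, also used to size the device buffer
        self.batch_size = self.current.shape[0]
        self.device_mem = cuda.mem_alloc(self.current.nbytes)

    def get_batch_size(self):
        return self.batch_size

    def get_batch(self, names):
        if self.current is None:
            return None                     # no more data -> calibration finishes
        cuda.memcpy_htod(self.device_mem, np.ascontiguousarray(self.current))
        self.current = next(self.batches, None)
        return [int(self.device_mem)]

    def read_calibration_cache(self):
        try:
            with open(self.cache_file, "rb") as f:
                return f.read()
        except FileNotFoundError:
            return None

    def write_calibration_cache(self, cache):
        with open(self.cache_file, "wb") as f:
            f.write(cache)
```

When building the engine you would then enable INT8 and attach the calibrator, roughly `config.set_flag(trt.BuilderFlag.INT8)` and `config.int8_calibrator = MyEntropyCalibrator(batches)`.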
When quantizing with these algorithms, TensorRT tries INT8 precision while optimizing the network; if a layer runs faster in INT8 than in the default precision (FP32 or FP16), INT8 is preferred. You cannot really control the precision of an individual layer here, because TensorRT optimizes for speed first (a layer you want in INT8 may very well end up in FP32). Even setting it through the API, e.g. with set_precision, does not guarantee anything: TensorRT also performs graph-level optimizations, and if it finds that an op you explicitly set to INT8 can be fused with another op, your INT8 setting gets ignored.
In short, it is hard to control. I have tried this route: for simple cases and simple models (the ResNet family) it is mostly fine, but for more complex ones (Transformers) the precision settings may simply not take effect. Who knows what optimizations TensorRT does internally; it is a black box after all.
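For reference, the per-layer precision hints discussed above look roughly like this through the Python API. This is only a sketch assuming a TensorRT 8.x build; as noted, TensorRT may still ignore the hints during fusion, and with implicit quantization you still need a calibrator or per-tensor dynamic ranges on top of this:

```python
import tensorrt as trt


def hint_all_layers_int8(network: trt.INetworkDefinition, config: trt.IBuilderConfig) -> None:
    """Mark every layer as INT8; TensorRT treats this as a preference, not a guarantee."""
    config.set_flag(trt.BuilderFlag.INT8)
    # STRICT_TYPES asks the builder to respect the per-layer settings below
    # (newer releases expose OBEY_PRECISION_CONSTRAINTS for the same purpose).
    config.set_flag(trt.BuilderFlag.STRICT_TYPES)
    for i in range(network.num_layers):
        layer = network.get_layer(i)
        layer.precision = trt.int8
        layer.set_output_type(0, trt.int8)
```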
Quantization-Aware Training
Quantization-aware training (QAT) is a "new feature" of TensorRT 8. What this feature really means is that TensorRT can load QAT models directly, where a QAT model is a quantized model containing QDQ operations. The QAT process itself has little to do with TensorRT: TensorRT is only an inference framework, and the actual quantization-aware training is normally done in a training framework such as the familiar PyTorch. (It is not impossible that some optimization frameworks will gain training capabilities later, in which case it could be done there too.)
TensorRT 8 can explicitly load an ONNX model that carries QAT quantization information and, after a series of optimizations, generate an INT8 engine.
An ONNX model with QAT quantization information looks like this:
Extra Quantize and Dequantize operators
You can see QuantizeLinear and DequantizeLinear modules, i.e. the corresponding QDQ modules, which hold the quantization scale and zero-point for the layer or activation. The QDQ modules take part in training: they quantize the incoming FP32 tensor to INT8 and then dequantize the INT8 tensor back to FP32. The precision actually used during training is still FP32; the point is that these quantization ops learn the quantize/dequantize scale information during training, so the model weights and quantization parameters can adapt to the quantization process (and yes, the scale can be a learnable parameter), and the accuracy after quantization ends up a bit higher.
The QDQ modules during quantization-aware training
The most important piece in QAT is the fake-quantization operator: it first quantizes and then dequantizes the parameters and inputs that pass through it, and records the scale, simulating exactly the process shown above.
For example, take a network whose precision is FP32, so its inputs and weights are FP32 as well:
Training a regular model
We can insert fake-quant operators:
Training a QAT model
The FQ (fake-quant) operator converts the FP32 inputs and weights to INT8 and back to FP32, remembering the scale used in the conversion.
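To illustrate what an FQ node computes, the small sketch below (plain PyTorch, not tied to any particular QAT toolkit; the max-based scale is just for illustration) quantizes a tensor to INT8 and immediately dequantizes it, which is exactly the FP32 -> INT8 -> FP32 round trip described above:

```python
import torch


def fake_quantize(x: torch.Tensor, scale: float, zero_point: int = 0) -> torch.Tensor:
    """Simulate INT8 quantization: quantize, clamp, then dequantize back to FP32."""
    q = torch.clamp(torch.round(x / scale) + zero_point, -128, 127)  # quantize (Q)
    return (q - zero_point) * scale                                   # dequantize (DQ)


x = torch.randn(4, 8)
scale = x.abs().max().item() / 127.0        # a simple max-based per-tensor scale
x_fq = fake_quantize(x, scale)

# PyTorch ships an equivalent built-in that QAT tooling uses under the hood:
x_fq_builtin = torch.fake_quantize_per_tensor_affine(x, scale, 0, -128, 127)
print(torch.allclose(x_fq, x_fq_builtin))   # should typically print True
```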
In ONNX, these fake-quant operators are represented as QDQ operators:
Fake quantization as QDQ in ONNX
So what is QDQ? QDQ is simply the pair of ops Q (quantize) and DQ (dequantize), which usually act as quantization-simulation ops inside a network, for example:
An example of QDQ operations
The input X is an FP32 op whose output is FP32. Before it feeds into op A it passes through a Q (quantize), at which point op A is treated as an INT8 op by default; after A, a DQ (dequantize) converts A's INT8 output back to FP32 and passes it to the next FP32 op.
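If you want to see what such a pattern looks like at the ONNX level, here is a minimal hand-built graph using onnx.helper. It only contains X -> QuantizeLinear -> DequantizeLinear -> output (in a real QAT export the INT8 op would sit between the pair), and the scale/zero-point values and tensor names are placeholders:

```python
import numpy as np
import onnx
from onnx import TensorProto, helper, numpy_helper

# scale and zero_point stored as initializers, just like in an exported QAT model
scale = numpy_helper.from_array(np.array(0.02, dtype=np.float32), name="x_scale")
zero_point = numpy_helper.from_array(np.array(0, dtype=np.int8), name="x_zp")

q = helper.make_node("QuantizeLinear", ["X", "x_scale", "x_zp"], ["X_q"], name="Q")
dq = helper.make_node("DequantizeLinear", ["X_q", "x_scale", "x_zp"], ["X_dq"], name="DQ")

graph = helper.make_graph(
    [q, dq],
    "qdq_demo",
    inputs=[helper.make_tensor_value_info("X", TensorProto.FLOAT, [1, 3, 224, 224])],
    outputs=[helper.make_tensor_value_info("X_dq", TensorProto.FLOAT, [1, 3, 224, 224])],
    initializer=[scale, zero_point],
)
model = helper.make_model(graph, opset_imports=[helper.make_opsetid("", 13)])
onnx.checker.check_model(model)
onnx.save(model, "qdq_demo.onnx")
```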
So what is QDQ actually for?
1. First, it stores the quantization information, such as the scale and zero_point; this information lives in the Q and DQ ops.
2. Second, it explicitly marks which layers are quantized: by default, the ops sandwiched between a Q and a DQ are taken to be INT8 ops, i.e. the ops to be quantized.
For example, in the figure below the placement of QDQ determines the precision of each op:
QDQ placement determines the quantization details
Compared with explicit quantization, TensorRT's implicit quantization is much less direct: before TensorRT 8 you generally relied on TensorRT's internal quantization algorithm, feeding in images for calibration while building the engine, i.e. performing post-training quantization.
With QDQ information, on the other hand, TensorRT can locate the (hinted) quantizable ops from the positions of the QDQ nodes while parsing the model, and then fuse them with the QDQ nodes (absorbing the scale information into the op):
Basic QDQ fusion strategy
After this fusion the operator is a genuine INT8 operator, and you can set the precision of every op in the network by adjusting where the QDQ goes (some ops must stay in higher precision, so the QDQ has to be placed accordingly):
QDQ placement determines the quantization details
You can also insert QDQ explicitly to tell TensorRT which layers are INT8 and which layers can be fused:
The QAT model and the model after TensorRT optimization
After a series of fusion optimizations, the final quantized engine is generated:
The final quantized network
Overall, the full workflow of TensorRT loading a QAT ONNX model and optimizing it looks like this:
Quantization workflow
Because TensorRT 8 can directly load a model quantized with QAT and exported to ONNX, and NVIDIA also provides a matching PyTorch quantization toolkit, it really is a one-stop solution.
TensorRT's quantized performance is very good. Some models or ops may already be beaten by other libraries (say OpenPPL or TVM), but TensorRT wins on breadth of support and its large user base: for most models someone has already stepped on the landmines, so there is plenty of accumulated experience, and it supports dynamic shapes, which covers more scenarios.
TensorRT has its downsides too: custom INT8 plugins are not easy to get right and come with plenty of pitfalls, i.e. adding new support yourself is a bit harder. For layers that are unsupported or buggy, there is not much you can do beyond nudging NVIDIA in the issues to update sooner.
Per-Layer INT8 Support
The "Layer specific restrictions" section of the official documentation explains this in detail; common layers such as convolution, deconvolution, BN, and matrix multiplication are all supported. For more, check it yourself:
Link:
- https://docs.nvidia.com/deeplearning/tensorrt/developer-guide/index.html
TensorRT Layers Involved in Explicit Quantization
The ops mainly involved in TensorRT explicit quantization are IQuantizeLayer and IDequantizeLayer, i.e. Q and DQ. When building a TensorRT network you can control the network's quantization details through these two ops.
IQuantizeLayer
This layer converts a floating-point tensor to an INT8 tensor and is added through the add_quantize API:
- Computes output = clamp(round(input / scale) + zeroPt)
- Clamping is in the range [-128, 127]
- API reference: https://docs.nvidia.com/deeplearning/tensorrt/api/python_api/infer/Graph/Layers.html#iquantizelayer
IDequantizeLayer
The inverse of IQuantizeLayer, added through add_dequantize:
- Computes output = (input - zeroPt) * scale
- Takes INT8 input and produces FP32 output
- API reference: https://docs.nvidia.com/deeplearning/tensorrt/api/python_api/infer/Graph/Layers.html#tensorrt.IDequantizeLayer
These two TensorRT layers correspond to QuantizeLinear and DequantizeLinear in ONNX: when converting ONNX to TensorRT, those two ONNX ops are parsed into IQuantizeLayer and IDequantizeLayer:
QDQ in ONNX
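As a sketch of what this looks like through the Python API (the helper name add_qdq_pair and the scale value are mine, not TensorRT's), a Q/DQ pair can be wrapped around a tensor like this:

```python
import numpy as np
import tensorrt as trt


def add_qdq_pair(network: trt.INetworkDefinition, tensor: trt.ITensor, scale_value: float) -> trt.ITensor:
    """Wrap `tensor` in IQuantizeLayer + IDequantizeLayer with a per-tensor scale (sketch)."""
    scale = network.add_constant(
        (1,), trt.Weights(np.array([scale_value], dtype=np.float32))
    ).get_output(0)
    q = network.add_quantize(tensor, scale)               # FP32 -> INT8 (IQuantizeLayer)
    dq = network.add_dequantize(q.get_output(0), scale)   # INT8 -> FP32 (IDequantizeLayer)
    return dq.get_output(0)
```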
0x03 TensorRT's Optimization Strategy for QDQ Models
When TensorRT detects QDQ operators in a model, explicit quantization is triggered. Below, the quantize operator is abbreviated Q and the dequantize operator DQ.
A Q op generally takes an FP32 input and carries a Q scale, and likewise a DQ op carries its own scale. This scale is the per-tensor or per-channel scale information; if that is unfamiliar, review the previous article.
As in the figure below:
An ONNX model with QDQ
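As a quick reminder of the per-tensor vs. per-channel distinction, here is a small NumPy sketch for conv weights in OIHW layout; the shapes and the max-based scale choice are just for illustration:

```python
import numpy as np

w = np.random.randn(64, 32, 3, 3).astype(np.float32)   # OIHW conv weights

# per-tensor: a single scale for the whole tensor
s_tensor = np.abs(w).max() / 127.0
w_q_tensor = np.clip(np.round(w / s_tensor), -128, 127).astype(np.int8)

# per-channel: one scale per output channel (axis O), broadcast over I, H, W
s_channel = np.abs(w).reshape(64, -1).max(axis=1) / 127.0        # shape (64,)
w_q_channel = np.clip(np.round(w / s_channel[:, None, None, None]), -128, 127).astype(np.int8)

print(w_q_tensor.shape, w_q_channel.shape, s_channel.shape)
```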
Optimization Principles
So how does TensorRT handle a model it loads with QDQ operators in it? First of all it must preserve the model's correctness, i.e. the order of computation cannot change. Rewrites like s*a + b*s -> (a+b)*s are allowed; they do not change the result much (there is a small effect: for floating-point arithmetic such a rewrite does perturb the result a tiny bit; try it if you don't believe it).
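A tiny demonstration of that "small effect": algebraically equivalent floating-point expressions are not guaranteed to be bitwise identical, which is why rewrites like the one above perturb results slightly:

```python
# Re-associating floating-point sums changes the result in the last bits:
a = (0.1 + 0.2) + 0.3
b = 0.1 + (0.2 + 0.3)
print(a, b, a == b)   # 0.6000000000000001 0.6 False
```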
As mentioned earlier, a model with QDQ operators is explicitly quantized, and "explicit" means just that: the Q op handles FP32->INT8, the DQ op handles INT8->FP32, and the operator wrapped between Q and DQ is naturally the quantized operator (or rather, the operator that is ready to be, or can be, quantized; this wording deserves some thought). In the end the QDQ scales are absorbed into the quantized operator:
Official docs: QDQ fusion
The green AvgPool in the figure above is now the quantized version of the operator.
When the QDQ ONNX network is fed into TensorRT, TensorRT propagates over the whole network and moves the Q/DQ operators around according to a set of rules (real networks are usually complex and not every structure conveniently forms a QDQ pair, so TensorRT tries to assemble as many QDQ structures as possible, turning as many ops as possible into quantized ops), and then runs the QDQ fusion passes.
Put simply, the rules are:
- Push DQ operators as late as possible, delaying dequantization
- Pull Q operators as early as possible, quantizing sooner
Words alone may not make this clear, so look at the figure:
Official docs: Q/DQ propagation
In the first case the DQ is moved behind the MaxPool, so the MaxPool goes from FP32 to INT8; in the second, the Q is moved from behind the MaxPool to in front of it, so the MaxPool again goes from FP32 to INT8. Doing this helps the next round of optimization.
Why can Q and DQ be moved around a MaxPool at all? Here is a short proof:
MaxPool
One thing to note: distinguish quantizable-layers from commuting-layers. Roughly, quantizable-layers are operators that actually compute and can be quantized, such as Conv and BN; commuting-layers involve no arithmetic and merely filter the incoming tensor according to some rule and pass part of it on, like the MaxPool above. Such filtering rules commute with the quantization operation.
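Here is a quick NumPy check of that commutation for MaxPool: because quantization (divide by a positive scale, round, clamp) is monotonically non-decreasing, applying Q before or after a max gives identical results. The toy 2x2 pooling and the symmetric per-tensor scale below are just for illustration:

```python
import numpy as np


def quantize(x, scale):
    return np.clip(np.round(x / scale), -128, 127).astype(np.int8)


def maxpool2x2(x):
    h, w = x.shape
    return x.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))


x = np.random.randn(8, 8).astype(np.float32)
scale = np.abs(x).max() / 127.0

print(np.array_equal(quantize(maxpool2x2(x), scale),
                     maxpool2x2(quantize(x, scale))))   # True
```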
Why move QDQ around at all? After all, the QDQ model is something you produce and the QDQ ops are inserted by hand, and where they are inserted actually matters. The QDQ model has to be parsed and optimized by TensorRT (or some other inference framework), and the parsing logic is written by people, so inevitably some cases are not covered, and those bad or hard cases are often related to where the QDQ was inserted.
QDQ Placement Recommendations
TensorRT therefore gives some recommendations on how its optimizer handles these details. These recommendations, or rules, feel fairly general: you run into the same ideas in other, similar quantization frameworks.
Let's go through them in detail.
Quantize all inputs of weighted-operations
Common operations such as Convolution, Transposed Convolution and GEMM all carry parameters, so when quantizing it is best to quantize both the inputs and the weights of these ops; that maximizes speed.
In the figure below TensorRT applies different optimizations depending on the QDQ layout: the fused conv on the left takes INT8 input but produces FP32 output, while the one on the right has both INT8 input and INT8 output (the only difference between the two is that the conv on the right is followed by a Q).
Conv fusion strategies in different situations
By default, don’t quantize the outputs of weighted-operations.
Usually the common weighted operations, i.e. convolution, matrix multiplication, deconvolution and so on, are followed by a BN layer or an activation layer. BN is a special case that matters in both the PTQ and QAT settings (not expanded on here). As for activations, apart from the common ReLU, some activations such as SiLU do not quantize well and are therefore kept in floating point (for example, sigmoid in TensorRT only supports FP16 quantization).
Fusing conv with an FP32 activation
Don’t simulate batch-normalization and ReLU fusions in the training framework
TensorRT fuses CONV+BN+RELU on its own while optimizing the network, so there is no need to fuse them yourself when exporting the ONNX model; in particular, during QAT you can keep the BN layers.
That said, it does no harm if they are already fused.
CONV+BN+RELU fusion
Input and Output Types of an Op Determine the Fusion Strategy
TensorRT's fusion strategy is also affected by the precision of the ops in the model.
conv+bn+add fusion under suitable QDQ placement
In the figure above, when one input branch is explicitly marked as INT8 by QDQ and the fused conv on the other branch also outputs INT8, the Q layer that follows is fused into the conv as well.
That is because TensorRT can fuse the element-wise addition that follows weighted layers (typically a skip connection, as in ResNet and EfficientNet). However, the output precision of this add layer is determined by the precision of its first input (how the "first" input is decided is open to question).
For example, in the figure below the add's input is FP32, so the output of the fused conv must also be FP32 (read this as: the fused conv's output is the add's second input, whose type must match the first), meaning both input and output are FP32, and therefore the last Q layer cannot be fused (unlike in the previous case).
conv+bn+add fusion
For extra performance, try quantizing layers that do not commute with Q/DQ
For operations like add, it is best if both inputs and the output are INT8; that maximizes performance.
Optimizing add with QDQ
After the fusion in the figure above, the Add's inputs and output are all INT8.
Some bad cases
In the vast majority of cases fusing QDQ improves performance, but there are situations where it does not; the optimizer is a fixed program, after all, so bad or hard cases are bound to exist.
Sub-optimal fusion vs. optimal fusion
There are also cases where a QDQ optimization compares the scales of two or more QDQ operators and recomputes a scale from them (for example the common add or concat, which requantize based on the scales of several inputs; details omitted here). If the engine is refittable (i.e. its weights can be modified after building), the QDQ scales can also be modified, but the previously recomputed scale may then no longer be valid, and the process reports an error.
For example, in the figure below, when TensorRT traverses the network it checks whether the scales of the two Qs around the concat are the same; if they are, the two Qs after the concat can be moved in front of it:
Conditions for fusing concat
Summary
That wraps up the discussion of QDQ. Given the examples above, it is reasonable to conclude that all of the ops inside the red circles below can run in INT8.
Explicitly specifying the quantized ops
Because QDQ is explicit quantization, where the QDQ ops are placed matters a great deal. A few rules:
- Recommend QDQ ops insertion at Inputs of quantizable ops
- Matches QLinear/QConv semantics i.e. low precision input, high precision output.
- No complexity in deciding whether to quantize output or not. Just Don't.
- Let the ops decide what precision input they want.
I won't paraphrase these; the original wording is more precise, and this content may well be updated later.
- Inserting QDQ ops at inputs (recommended)
1. Explicit quantization. No implicit rule eg. "Quantize operator input if output is quantized".
2. No special logic for Conv-BN or Conv-ReLU.
3. Just insert QDQ in front of quantizable ops. Leave the rest to the back end (TensorRT).
4. Makes life easy for frameworks quantization tool.
5. Makes life easy for back end optimizers (TensorRT).
- Inserting QDQ ops at outputs (not recommended, but supported)
1.Some frameworks quantization tools have this behavior by default.
2.Sub-optimal performance when network is "partial quantization" i.e. not all ops are quantized.
3.Optimal performance when network is "fully quantized" i.e. all ops in network are quantized.
Now for a concrete, real example.
0x04 A Quantization Example
Next, let's walk through how TensorRT converts an exported ONNX model containing QDQ nodes into an engine, step by step.
We will follow the quantization process through the verbose output produced by trtexec, TensorRT's official conversion tool, which regular TensorRT users will recognize. Verbose output is enabled with the --verbose flag and records what TensorRT does during the conversion (a Python-API sketch of the same build follows the list below):
- how the ONNX model is parsed
- how the ONNX ops are optimized
- how ONNX ops are converted into engine ops
- how the engine ops are optimized
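The same conversion can also be scripted with the Python API instead of trtexec (on the command line, trtexec --onnx=... --int8 --verbose does the equivalent). Below is a minimal TensorRT-8-era sketch for parsing a QDQ ONNX model and building an INT8 engine; the file names are placeholders:

```python
import tensorrt as trt

logger = trt.Logger(trt.Logger.VERBOSE)   # VERBOSE prints the optimization log discussed below
builder = trt.Builder(logger)
network = builder.create_network(1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
parser = trt.OnnxParser(network, logger)

with open("model_qdq.onnx", "rb") as f:   # an ONNX model that already contains Q/DQ nodes
    if not parser.parse(f.read()):
        for i in range(parser.num_errors):
            print(parser.get_error(i))
        raise RuntimeError("failed to parse ONNX")

config = builder.create_builder_config()
config.set_flag(trt.BuilderFlag.INT8)     # enable INT8; no calibrator is needed for QDQ models
config.max_workspace_size = 1 << 30

engine_bytes = builder.build_serialized_network(network, config)
with open("model_qdq.engine", "wb") as f:
    f.write(engine_bytes)
```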
Because the ONNX model used here already carries QDQ information, no calibrator is needed, and TensorRT prints the following:
[08/25/2021-17:30:06] [W] [TRT] Calibrator won't be used in explicit precision mode. Use quantization aware training to generate network with Quantize/Dequantize nodes.
Then the optimization begins.
First, useless nodes (empty ops and the like) are removed. A normal model (exported without bugs) usually has none of these, so the total layer count is the same before and after:
[08/25/2021-17:30:06] [V] [TRT] Applying generic optimizations to the graph for inference.
[08/25/2021-17:30:06] [V] [TRT] Original: 863 layers
[08/25/2021-17:30:06] [V] [TRT] After dead-layer removal: 863 layers
Removing all constants from the TensorRT network
[08/25/2021-17:30:06] [V] [TRT] Removing (Unnamed Layer* 1) [Constant]
...
[08/25/2021-17:30:06] [V] [TRT] Removing (Unnamed Layer* 853) [Constant]
[08/25/2021-17:30:06] [V] [TRT] Removing (Unnamed Layer* 852) [Constant]
[08/25/2021-17:30:06] [V] [TRT] After Myelin optimization: 415 layers
[08/25/2021-17:30:06] [V] [TRT] After scale fusion: 415 layers
The constants are the various parameters inside the model, for example some parameters of the BN layers:
Constant ops in the model
or the scale and zero_point of the QDQ nodes; these are typically of type Constant or initializers.
y_scale is 1/s
[08/25/2021-17:30:06] [V] [TRT] QDQ graph optimizer - constant folding of Q/DQ initializers
[08/25/2021-17:30:06] [V] [TRT] QDQ graph optimizer forward pass - DQ motions and fusions
Fusing Add + Relu
A very common fusion.
[08/25/2021-17:30:06] [V] [TRT] EltReluFusion: Fusing Add_42 with Relu_43
[08/25/2021-17:30:06] [V] [TRT] EltReluFusion: Fusing Add_73 with Relu_74
[08/25/2021-17:30:06] [V] [TRT] EltReluFusion: Fusing Add_104 with Relu_105
[08/25/2021-17:30:06] [V] [TRT] EltReluFusion: Fusing Add_146 with Relu_147
[08/25/2021-17:30:06] [V] [TRT] EltReluFusion: Fusing Add_177 with Relu_178
[08/25/2021-17:30:06] [V] [TRT] EltReluFusion: Fusing Add_208 with Relu_209
...
[08/25/2021-17:30:06] [V] [TRT] EltReluFusion: Fusing Add_540 with Relu_541
Fusing Add and Relu
Fusing the weights using the quantization information
FP32->INT8: the precision of the model weights is converted.
[08/25/2021-17:30:06] [V] [TRT] ConstWeightsQuantizeFusion: Fusing conv1.weight with QuantizeLinear_7_quantize_scale_node[08/25/2021-17:30:06] [V] [TRT] ConstWeightsQuantizeFusion: Fusing layer1.0.conv1.weight with QuantizeLinear_20_quantize_scale_node[08/25/2021-17:30:06] [V] [TRT] ConstWeightsQuantizeFusion: Fusing layer1.0.conv2.weight with QuantizeLinear_32_quantize_scale_node[08/25/2021-17:30:06] [V] [TRT] ConstWeightsQuantizeFusion: Fusing layer1.1.conv1.weight with QuantizeLinear_51_quantize_scale_node[08/25/2021-17:30:06] [V] [TRT] ConstWeightsQuantizeFusion: Fusing layer1.1.conv2.weight with QuantizeLinear_63_quantize_scale_node[08/25/2021-17:30:06] [V] [TRT] ConstWeightsQuantizeFusion: Fusing layer1.2.conv1.weight with QuantizeLinear_82_quantize_scale_node[08/25/2021-17:30:06] [V] [TRT] ConstWeightsQuantizeFusion: Fusing layer1.2.conv2.weight with QuantizeLinear_94_quantize_scale_node[08/25/2021-17:30:06] [V] [TRT] ConstWeightsQuantizeFusion: Fusing layer2.0.conv1.weight with QuantizeLinear_113_quantize_scale_node[08/25/2021-17:30:06] [V] [TRT] ConstWeightsQuantizeFusion: Fusing layer2.0.conv2.weight with QuantizeLinear_125_quantize_scale_node[08/25/2021-17:30:06] [V] [TRT] ConstWeightsQuantizeFusion: Fusing layer2.0.downsample.0.weight with QuantizeLinear_136_quantize_scale_node[08/25/2021-17:30:06] [V] [TRT] ConstWeightsQuantizeFusion: Fusing layer2.1.conv1.weight with QuantizeLinear_155_quantize_scale_node[08/25/2021-17:30:06] [V] [TRT] ConstWeightsQuantizeFusion: Fusing layer2.1.conv2.weight with QuantizeLinear_167_quantize_scale_node[08/25/2021-17:30:06] [V] [TRT] ConstWeightsQuantizeFusion: Fusing layer2.2.conv1.weight with QuantizeLinear_186_quantize_scale_node[08/25/2021-17:30:06] [V] [TRT] ConstWeightsQuantizeFusion: Fusing layer2.2.conv2.weight with QuantizeLinear_198_quantize_scale_node[08/25/2021-17:30:06] [V] [TRT] ConstWeightsQuantizeFusion: Fusing layer2.3.conv1.weight with QuantizeLinear_217_quantize_scale_node[08/25/2021-17:30:06] [V] [TRT] ConstWeightsQuantizeFusion: Fusing layer2.3.conv2.weight with QuantizeLinear_229_quantize_scale_node[08/25/2021-17:30:06] [V] [TRT] ConstWeightsQuantizeFusion: Fusing layer3.0.conv1.weight with QuantizeLinear_248_quantize_scale_node[08/25/2021-17:30:06] [V] [TRT] ConstWeightsQuantizeFusion: Fusing layer3.0.conv2.weight with QuantizeLinear_260_quantize_scale_node[08/25/2021-17:30:06] [V] [TRT] ConstWeightsQuantizeFusion: Fusing layer3.0.downsample.0.weight with QuantizeLinear_271_quantize_scale_node[08/25/2021-17:30:06] [V] [TRT] ConstWeightsQuantizeFusion: Fusing layer3.1.conv1.weight with QuantizeLinear_290_quantize_scale_node[08/25/2021-17:30:06] [V] [TRT] ConstWeightsQuantizeFusion: Fusing layer3.1.conv2.weight with QuantizeLinear_302_quantize_scale_node[08/25/2021-17:30:06] [V] [TRT] ConstWeightsQuantizeFusion: Fusing layer3.2.conv1.weight with QuantizeLinear_321_quantize_scale_node[08/25/2021-17:30:06] [V] [TRT] ConstWeightsQuantizeFusion: Fusing layer3.2.conv2.weight with QuantizeLinear_333_quantize_scale_node[08/25/2021-17:30:06] [V] [TRT] ConstWeightsQuantizeFusion: Fusing layer3.3.conv1.weight with QuantizeLinear_352_quantize_scale_node[08/25/2021-17:30:06] [V] [TRT] ConstWeightsQuantizeFusion: Fusing layer3.3.conv2.weight with QuantizeLinear_364_quantize_scale_node[08/25/2021-17:30:06] [V] [TRT] ConstWeightsQuantizeFusion: Fusing layer3.4.conv1.weight with QuantizeLinear_383_quantize_scale_node[08/25/2021-17:30:06] [V] [TRT] ConstWeightsQuantizeFusion: Fusing layer3.4.conv2.weight with 
QuantizeLinear_395_quantize_scale_node[08/25/2021-17:30:06] [V] [TRT] ConstWeightsQuantizeFusion: Fusing layer3.5.conv1.weight with QuantizeLinear_414_quantize_scale_node[08/25/2021-17:30:06] [V] [TRT] ConstWeightsQuantizeFusion: Fusing layer3.5.conv2.weight with QuantizeLinear_426_quantize_scale_node[08/25/2021-17:30:06] [V] [TRT] ConstWeightsQuantizeFusion: Fusing layer4.0.conv1.weight with QuantizeLinear_445_quantize_scale_node[08/25/2021-17:30:06] [V] [TRT] ConstWeightsQuantizeFusion: Fusing layer4.0.conv2.weight with QuantizeLinear_457_quantize_scale_node[08/25/2021-17:30:06] [V] [TRT] ConstWeightsQuantizeFusion: Fusing layer4.0.downsample.0.weight with QuantizeLinear_468_quantize_scale_node[08/25/2021-17:30:06] [V] [TRT] ConstWeightsQuantizeFusion: Fusing layer4.1.conv1.weight with QuantizeLinear_487_quantize_scale_node[08/25/2021-17:30:06] [V] [TRT] ConstWeightsQuantizeFusion: Fusing layer4.1.conv2.weight with QuantizeLinear_499_quantize_scale_node[08/25/2021-17:30:06] [V] [TRT] ConstWeightsQuantizeFusion: Fusing layer4.2.conv1.weight with QuantizeLinear_518_quantize_scale_node[08/25/2021-17:30:06] [V] [TRT] ConstWeightsQuantizeFusion: Fusing layer4.2.conv2.weight with QuantizeLinear_530_quantize_scale_node[08/25/2021-17:30:06] [V] [TRT] ConstWeightsQuantizeFusion: Fusing deconv_layers.0.weight with...
Fusing Conv + Relu
A routine fusion, nothing special.
[08/25/2021-17:30:06] [V] [TRT] ConvReluFusion: Fusing Conv_617 with Relu_618
[08/25/2021-17:30:06] [V] [TRT] ConvReluFusion: Fusing Conv_638 with Relu_639
[08/25/2021-17:30:06] [V] [TRT] ConvReluFusion: Fusing Conv_659 with Relu_660
Fusing conv and relu
Moving Q in front of Relu
Why move it? After the move, the Relu goes from FP32 to INT8, which enables the next round of optimization and follows the rules introduced in the previous section.
[08/25/2021-17:30:06] [V] [TRT] Swapping Relu_55 with QuantizeLinear_58_quantize_scale_node
[08/25/2021-17:30:06] [V] [TRT] Swapping Relu_86 with QuantizeLinear_89_quantize_scale_node
[08/25/2021-17:30:06] [V] [TRT] Swapping Relu_117 with QuantizeLinear_120_quantize_scale_node
[08/25/2021-17:30:06] [V] [TRT] Swapping Relu_159 with QuantizeLinear_162_quantize_scale_node
[08/25/2021-17:30:06] [V] [TRT] Swapping Relu_190 with QuantizeLinear_193_quantize_scale_node
[08/25/2021-17:30:06] [V] [TRT] Swapping Relu_221 with QuantizeLinear_224_quantize_scale_node
[08/25/2021-17:30:06] [V] [TRT] Swapping Relu_252 with QuantizeLinear_255_quantize_scale_node
[08/25/2021-17:30:06] [V] [TRT] Swapping Relu_294 with QuantizeLinear_297_quantize_scale_node
[08/25/2021-17:30:06] [V] [TRT] Swapping Relu_325 with QuantizeLinear_328_quantize_scale_node
[08/25/2021-17:30:06] [V] [TRT] Swapping Relu_356 with QuantizeLinear_359_quantize_scale_node
[08/25/2021-17:30:06] [V] [TRT] Swapping Relu_387 with QuantizeLinear_390_quantize_scale_node
[08/25/2021-17:30:06] [V] [TRT] Swapping Relu_418 with QuantizeLinear_421_quantize_scale_node
[08/25/2021-17:30:06] [V] [TRT] Swapping Relu_449 with QuantizeLinear_452_quantize_scale_node
[08/25/2021-17:30:06] [V] [TRT] Swapping Relu_491 with QuantizeLinear_494_quantize_scale_node
[08/25/2021-17:30:06] [V] [TRT] Swapping Relu_522 with QuantizeLinear_525_quantize_scale_node
[08/25/2021-17:30:06] [V] [TRT] Swapping Relu_563 with QuantizeLinear_566_quantize_scale_node
[08/25/2021-17:30:06] [V] [TRT] Swapping Relu_585 with QuantizeLinear_588_quantize_scale_node
Swapping these two nodes
Removing redundant Q ops
...
[08/25/2021-17:30:06] [V] [TRT] Eliminating QuantizeLinear_38_quantize_scale_node which duplicates (Q) QuantizeLinear_15_quantize_scale_node
[08/25/2021-17:30:06] [V] [TRT] Removing QuantizeLinear_38_quantize_scale_node
[08/25/2021-17:30:06] [V] [TRT] Eliminating QuantizeLinear_69_quantize_scale_node which duplicates (Q) QuantizeLinear_46_quantize_scale_node
[08/25/2021-17:30:06] [V] [TRT] Removing QuantizeLinear_69_quantize_scale_node
...
You can see that the Q on the right is identical to the Q on the left; scales produced from the same op must agree, so one of the two can be removed (the figure below removes the one on the right).
Removing one of two quantize nodes with the same scale
Moving Q ops, continued
Here the Q is moved from behind the MaxPool to in front of the Relu, again following the rules covered in the previous section.
[08/25/2021-17:30:06] [V] [TRT] Swapping MaxPool_12 with QuantizeLinear_15_quantize_scale_node
[08/25/2021-17:30:06] [V] [TRT] Swapping Relu_607 with QuantizeLinear_610_quantize_scale_node
[08/25/2021-17:30:06] [V] [TRT] Swapping Relu_11 with QuantizeLinear_15_quantize_scale_node
Moving the Q
[08/25/2021-17:30:06] [V] [TRT] QDQ graph optimizer quantization pass - Generate quantized ops
Removing BN
Folding BN, nothing special. If this is unfamiliar, see my earlier post on Batch Normalization pitfalls in both training and deployment.
[08/25/2021-17:30:06] [V] [TRT] Removing BatchNormalization_10
[08/25/2021-17:30:06] [V] [TRT] Removing BatchNormalization_23
[08/25/2021-17:30:06] [V] [TRT] Removing BatchNormalization_35
[08/25/2021-17:30:06] [V] [TRT] Removing BatchNormalization_54
[08/25/2021-17:30:06] [V] [TRT] Removing BatchNormalization_66
[08/25/2021-17:30:06] [V] [TRT] Removing BatchNormalization_85
[08/25/2021-17:30:06] [V] [TRT] Removing BatchNormalization_97
[08/25/2021-17:30:06] [V] [TRT] Removing BatchNormalization_116
...
Moving Q again
[08/25/2021-17:30:07] [V] [TRT] Swapping Add_42 + Relu_43 with QuantizeLinear_46_quantize_scale_node
Moving Q to the right place
Likewise, the Q is moved onto Add_42 + Relu_43, so that "quantization happens as early as possible".
Continuing to fuse conv + add + relu
[08/25/2021-17:30:07] [V] [TRT] QuantizeDoubleInputNodes: fusing QuantizeLinear_46_quantize_scale_node into Conv_34
[08/25/2021-17:30:07] [V] [TRT] QuantizeDoubleInputNodes: fusing (DequantizeLinear_30_quantize_scale_node and DequantizeLinear_33_quantize_scale_node) into Conv_34
This comes in two parts.
[08/25/2021-17:30:07] [V] [TRT] Removing QuantizeLinear_46_quantize_scale_node
[08/25/2021-17:30:07] [V] [TRT] Removing DequantizeLinear_30_quantize_scale_node
[08/25/2021-17:30:07] [V] [TRT] Removing DequantizeLinear_33_quantize_scale_node
[08/25/2021-17:30:07] [V] [TRT] ConstWeightsFusion: Fusing layer1.0.conv2.weight + QuantizeLinear_32_quantize_scale_node with Conv_34
[08/25/2021-17:30:07] [V] [TRT] ConvEltwiseSumFusion: Fusing layer1.0.conv2.weight + QuantizeLinear_32_quantize_scale_node + Conv_34 with Add_42 + Relu_43
[08/25/2021-17:30:07] [V] [TRT] Removing DequantizeLinear_41_quantize_scale_node
...
[08/25/2021-17:30:07] [V] [TRT] QuantizeDoubleInputNodes: fusing QuantizeLinear_27_quantize_scale_node into Conv_22
[08/25/2021-17:30:07] [V] [TRT] QuantizeDoubleInputNodes: fusing (DequantizeLinear_18_quantize_scale_node and DequantizeLinear_21_quantize_scale_node) into Conv_22
[08/25/2021-17:30:07] [V] [TRT] Removing QuantizeLinear_27_quantize_scale_node
...
Conv absorption fusion
As the figure shows, all of the ops inside the red circle are folded into Conv_34, and the blue Q is absorbed into the previous conv.
Wrapping up
Much the same operations as in the previous step.
[08/25/2021-17:30:07] [V] [TRT] ConstWeightsFusion: Fusing conv1.weight + QuantizeLinear_7_quantize_scale_node with Conv_9[08/25/2021-17:30:07] [V] [TRT] ConstWeightsFusion: Fusing layer1.0.conv1.weight + QuantizeLinear_20_quantize_scale_node with Conv_22[08/25/2021-17:30:07] [V] [TRT] ConstWeightsFusion: Fusing layer1.1.conv1.weight + QuantizeLinear_51_quantize_scale_node with Conv_53[08/25/2021-17:30:07] [V] [TRT] ConstWeightsFusion: Fusing layer1.2.conv1.weight + QuantizeLinear_82_quantize_scale_node with Conv_84[08/25/2021-17:30:07] [V] [TRT] ConstWeightsFusion: Fusing layer2.0.conv1.weight + QuantizeLinear_113_quantize_scale_node with Conv_115[08/25/2021-17:30:07] [V] [TRT] ConstWeightsFusion: Fusing layer2.0.downsample.0.weight + QuantizeLinear_136_quantize_scale_node with Conv_138[08/25/2021-17:30:07] [V] [TRT] ConstWeightsFusion: Fusing layer2.1.conv1.weight + QuantizeLinear_155_quantize_scale_node with Conv_157[08/25/2021-17:30:07] [V] [TRT] ConstWeightsFusion: Fusing layer2.2.conv1.weight + QuantizeLinear_186_quantize_scale_node with Conv_188[08/25/2021-17:30:07] [V] [TRT] ConstWeightsFusion: Fusing layer2.3.conv1.weight + QuantizeLinear_217_quantize_scale_node with Conv_219[08/25/2021-17:30:07] [V] [TRT] ConstWeightsFusion: Fusing layer3.0.conv1.weight + QuantizeLinear_248_quantize_scale_node with Conv_250[08/25/2021-17:30:07] [V] [TRT] ConstWeightsFusion: Fusing layer3.0.downsample.0.weight + QuantizeLinear_271_quantize_scale_node with Conv_273[08/25/2021-17:30:07] [V] [TRT] ConstWeightsFusion: Fusing layer3.1.conv1.weight + QuantizeLinear_290_quantize_scale_node with Conv_292[08/25/2021-17:30:07] [V] [TRT] ConstWeightsFusion: Fusing layer3.2.conv1.weight + QuantizeLinear_321_quantize_scale_node with Conv_323[08/25/2021-17:30:07] [V] [TRT] ConstWeightsFusion: Fusing layer3.3.conv1.weight + QuantizeLinear_352_quantize_scale_node with Conv_354[08/25/2021-17:30:07] [V] [TRT] ConstWeightsFusion: Fusing layer3.4.conv1.weight + QuantizeLinear_383_quantize_scale_node with Conv_385[08/25/2021-17:30:07] [V] [TRT] ConstWeightsFusion: Fusing layer3.5.conv1.weight + QuantizeLinear_414_quantize_scale_node with Conv_416[08/25/2021-17:30:07] [V] [TRT] ConstWeightsFusion: Fusing layer4.0.conv1.weight + QuantizeLinear_445_quantize_scale_node with Conv_447[08/25/2021-17:30:07] [V] [TRT] ConstWeightsFusion: Fusing layer4.0.downsample.0.weight + QuantizeLinear_468_quantize_scale_node with Conv_470[08/25/2021-17:30:07] [V] [TRT] ConstWeightsFusion: Fusing layer4.1.conv1.weight + QuantizeLinear_487_quantize_scale_node with Conv_489[08/25/2021-17:30:07] [V] [TRT] ConstWeightsFusion: Fusing layer4.2.conv1.weight + QuantizeLinear_518_quantize_scale_node with Conv_520[08/25/2021-17:30:07] [V] [TRT] ConstWeightsFusion: Fusing deconv_layers.0.weight + QuantizeLinear_549_quantize_scale_node with ConvTranspose_551[08/25/2021-17:30:07] [V] [TRT] ConstWeightsFusion: Fusing deconv_layers.1.weight + QuantizeLinear_559_quantize_scale_node with Conv_561[08/25/2021-17:30:07] [V] [TRT] ConstWeightsFusion: Fusing deconv_layers.4.weight + QuantizeLinear_571_quantize_scale_node with ConvTranspose_573[08/25/2021-17:30:07] [V] [TRT] ConstWeightsFusion: Fusing deconv_layers.5.weight + QuantizeLinear_581_quantize_scale_node with Conv_583[08/25/2021-17:30:07] [V] [TRT] ConstWeightsFusion: Fusing deconv_layers.8.weight + QuantizeLinear_593_quantize_scale_node with ConvTranspose_595[08/25/2021-17:30:07] [V] [TRT] ConstWeightsFusion: Fusing deconv_layers.9.weight + QuantizeLinear_603_quantize_scale_node with 
Conv_605[08/25/2021-17:30:07] [V] [TRT] ConstWeightsFusion: Fusing hm.0.weight + QuantizeLinear_615_quantize_scale_node with Conv_617 + Relu_618[08/25/2021-17:30:07] [V] [TRT] ConstWeightsFusion: Fusing hm.2.weight + QuantizeLinear_626_quantize_scale_node with Conv_628[08/25/2021-17:30:07] [V] [TRT] ConstWeightsFusion: Fusing wh.0.weight + QuantizeLinear_636_quantize_scale_node with Conv_638 + Relu_639[08/25/2021-17:30:07] [V] [TRT] ConstWeightsFusion: Fusing wh.2.weight + QuantizeLinear_647_quantize_scale_node with Conv_649[08/25/2021-17:30:07] [V] [TRT] ConstWeightsFusion: Fusing reg.0.weight + QuantizeLinear_657_quantize_scale_node with Conv_659 + Relu_660[08/25/2021-17:30:07] [V] [TRT] ConstWeightsFusion: Fusing reg.2.weight + QuantizeLinear_668_quantize_scale_node with Conv_670[08/25/2021-17:30:07] [V] [TRT] ConvReluFusion: Fusing conv1.weight + QuantizeLinear_7_quantize_scale_node + Conv_9 with Relu_11[08/25/2021-17:30:07] [V] [TRT] ConvReluFusion: Fusing layer1.0.conv1.weight + QuantizeLinear_20_quantize_scale_node + Conv_22 with Relu_24[08/25/2021-17:30:07] [V] [TRT] ConvReluFusion: Fusing layer1.1.conv1.weight + QuantizeLinear_51_quantize_scale_node + Conv_53 with Relu_55[08/25/2021-17:30:07] [V] [TRT] ConvReluFusion: Fusing layer1.2.conv1.weight + QuantizeLinear_82_quantize_scale_node + Conv_84 with Relu_86[08/25/2021-17:30:07] [V] [TRT] ConvReluFusion: Fusing layer2.0.conv1.weight + QuantizeLinear_113_quantize_scale_node + Conv_115 with Relu_117[08/25/2021-17:30:07] [V] [TRT] ConvReluFusion: Fusing layer2.1.conv1.weight + QuantizeLinear_155_quantize_scale_node + Conv_157 with Relu_159[08/25/2021-17:30:07] [V] [TRT] ConvReluFusion: Fusing layer2.2.conv1.weight + QuantizeLinear_186_quantize_scale_node + Conv_188 with Relu_190[08/25/2021-17:30:07] [V] [TRT] ConvReluFusion: Fusing layer2.3.conv1.weight + QuantizeLinear_217_quantize_scale_node + Conv_219 with Relu_221[08/25/2021-17:30:07] [V] [TRT] ConvReluFusion: Fusing layer3.0.conv1.weight + QuantizeLinear_248_quantize_scale_node + Conv_250 with Relu_252[08/25/2021-17:30:07] [V] [TRT] ConvReluFusion: Fusing layer3.1.conv1.weight + QuantizeLinear_290_quantize_scale_node + Conv_292 with Relu_294[08/25/2021-17:30:07] [V] [TRT] ConvReluFusion: Fusing layer3.2.conv1.weight + QuantizeLinear_321_quantize_scale_node + Conv_323 with Relu_325[08/25/2021-17:30:07] [V] [TRT] ConvReluFusion: Fusing layer3.3.conv1.weight + QuantizeLinear_352_quantize_scale_node + Conv_354 with Relu_356[08/25/2021-17:30:07] [V] [TRT] ConvReluFusion: Fusing layer3.4.conv1.weight + QuantizeLinear_383_quantize_scale_node + Conv_385 with Relu_387[08/25/2021-17:30:07] [V] [TRT] ConvReluFusion: Fusing layer3.5.conv1.weight + QuantizeLinear_414_quantize_scale_node + Conv_416 with Relu_418[08/25/2021-17:30:07] [V] [TRT] ConvReluFusion: Fusing layer4.0.conv1.weight + QuantizeLinear_445_quantize_scale_node + Conv_447 with Relu_449[08/25/2021-17:30:07] [V] [TRT] ConvReluFusion: Fusing layer4.1.conv1.weight + QuantizeLinear_487_quantize_scale_node + Conv_489 with Relu_491[08/25/2021-17:30:07] [V] [TRT] ConvReluFusion: Fusing layer4.2.conv1.weight + QuantizeLinear_518_quantize_scale_node + Conv_520 with Relu_522[08/25/2021-17:30:08] [V] [TRT] ConvReluFusion: Fusing deconv_layers.1.weight + QuantizeLinear_559_quantize_scale_node + Conv_561 with Relu_563[08/25/2021-17:30:08] [V] [TRT] ConvReluFusion: Fusing deconv_layers.5.weight + QuantizeLinear_581_quantize_scale_node + Conv_583 with Relu_585[08/25/2021-17:30:08] [V] [TRT] ConvReluFusion: Fusing deconv_layers.9.weight + 
QuantizeLinear_603_quantize_scale_node + Conv_605 with Relu_607
Again, the Q or DQ nodes are fused into the neighboring conv.
The Final Model Structure
The final structure is shown below; this information comes from the trt verbose output, under the keyword "Engine Layer Information". You can also draw the graph with graphviz:
(See the earlier post: "A handy tool: finally drawing the structure of a TensorRT engine!")
[08/25/2021-17:30:37] [V] [TRT] Engine Layer Information:
Layer(Scale): QuantizeLinear_2_quantize_scale_node, Tactic: 0, input[Float(1,3,-17,-18)] -> 255[Int8(1,3,-17,-18)]
Layer(CaskConvolution): conv1.weight + QuantizeLinear_7_quantize_scale_node + Conv_9 + Relu_11, Tactic: 4438325421691896755, 255[Int8(1,3,-17,-18)] -> 267[Int8(1,64,-40,-44)]
Layer(CudaPooling): MaxPool_12, Tactic: -3, 267[Int8(1,64,-40,-44)] -> Reformatted Output Tensor 0 to MaxPool_12[Int8(1,64,-21,-24)]
Layer(Reformat): Reformatting CopyNode for Output Tensor 0 to MaxPool_12, Tactic: 0, Reformatted Output Tensor 0 to MaxPool_12[Int8(1,64,-21,-24)] -> 270[Int8(1,64,-21,-24)]
Layer(CaskConvolution): layer1.0.conv1.weight + QuantizeLinear_20_quantize_scale_node + Conv_22 + Relu_24, Tactic: 4871133328510103657, 270[Int8(1,64,-21,-24)] -> 284[Int8(1,64,-21,-24)]
Layer(CaskConvolution): layer1.0.conv2.weight + QuantizeLinear_32_quantize_scale_node + Conv_34 + Add_42 + Relu_43, Tactic: 4871133328510103657, 284[Int8(1,64,-21,-24)], 270[Int8(1,64,-21,-24)] -> 305[Int8(1,64,-21,-24)]
Layer(CaskConvolution): layer1.1.conv1.weight + QuantizeLinear_51_quantize_scale_node + Conv_53 + Relu_55, Tactic: 4871133328510103657, 305[Int8(1,64,-21,-24)] -> 319[Int8(1,64,-21,-24)]
Layer(CaskConvolution): layer1.1.conv2.weight + QuantizeLinear_63_quantize_scale_node + Conv_65 + Add_73 + Relu_74, Tactic: 4871133328510103657, 319[Int8(1,64,-21,-24)], 305[Int8(1,64,-21,-24)] -> 340[Int8(1,64,-21,-24)]
Layer(CaskConvolution): layer1.2.conv1.weight + QuantizeLinear_82_quantize_scale_node + Conv_84 + Relu_86, Tactic: 4871133328510103657, 340[Int8(1,64,-21,-24)] -> 354[Int8(1,64,-21,-24)]
Layer(CaskConvolution): layer1.2.conv2.weight + QuantizeLinear_94_quantize_scale_node + Conv_96 + Add_104 + Relu_105, Tactic: 4871133328510103657, 354[Int8(1,64,-21,-24)], 340[Int8(1,64,-21,-24)] -> 375[Int8(1,64,-21,-24)]
Layer(CaskConvolution): layer2.0.conv1.weight + QuantizeLinear_113_quantize_scale_node + Conv_115 + Relu_117, Tactic: -1841683966837205309, 375[Int8(1,64,-21,-24)] -> 389[Int8(1,128,-52,-37)]
Layer(CaskConvolution): layer2.0.downsample.0.weight + QuantizeLinear_136_quantize_scale_node + Conv_138, Tactic: -1494157908358500249, 375[Int8(1,64,-21,-24)] -> 415[Int8(1,128,-52,-37)]
Layer(CaskConvolution): layer2.0.conv2.weight + QuantizeLinear_125_quantize_scale_node + Conv_127 + Add_146 + Relu_147, Tactic: -1841683966837205309, 389[Int8(1,128,-52,-37)], 415[Int8(1,128,-52,-37)] -> 423[Int8(1,128,-52,-37)]
Layer(CaskConvolution): layer2.1.conv1.weight + QuantizeLinear_155_quantize_scale_node + Conv_157 + Relu_159, Tactic: -1841683966837205309, 423[Int8(1,128,-52,-37)] -> 437[Int8(1,128,-52,-37)]
Layer(CaskConvolution): layer2.1.conv2.weight + QuantizeLinear_167_quantize_scale_node + Conv_169 + Add_177 + Relu_178, Tactic: -1841683966837205309, 437[Int8(1,128,-52,-37)], 423[Int8(1,128,-52,-37)] -> 458[Int8(1,128,-52,-37)]
Layer(CaskConvolution): layer2.2.conv1.weight + QuantizeLinear_186_quantize_scale_node + Conv_188 + Relu_190, Tactic: -1841683966837205309, 458[Int8(1,128,-52,-37)] -> 472[Int8(1,128,-52,-37)]
Layer(CaskConvolution): layer2.2.conv2.weight + QuantizeLinear_198_quantize_scale_node + Conv_200 + Add_208 + Relu_209, Tactic: -1841683966837205309, 472[Int8(1,128,-52,-37)], 458[Int8(1,128,-52,-37)] -> 493[Int8(1,128,-52,-37)]
Layer(CaskConvolution): layer2.3.conv1.weight + QuantizeLinear_217_quantize_scale_node + Conv_219 + Relu_221, Tactic: -1841683966837205309, 493[Int8(1,128,-52,-37)] -> 507[Int8(1,128,-52,-37)]
Layer(CaskConvolution): layer2.3.conv2.weight + QuantizeLinear_229_quantize_scale_node + Conv_231 + Add_239 + Relu_240, Tactic: -1841683966837205309, 507[Int8(1,128,-52,-37)], 493[Int8(1,128,-52,-37)] -> 528[Int8(1,128,-52,-37)]
Layer(CaskConvolution): layer3.0.conv1.weight + QuantizeLinear_248_quantize_scale_node + Conv_250 + Relu_252, Tactic: -8431788508843860955, 528[Int8(1,128,-52,-37)] -> 542[Int8(1,256,-59,-62)]
Layer(CaskConvolution): layer3.0.downsample.0.weight + QuantizeLinear_271_quantize_scale_node + Conv_273, Tactic: -5697614955743334137, 528[Int8(1,128,-52,-37)] -> 568[Int8(1,256,-59,-62)]
Layer(CaskConvolution): layer3.0.conv2.weight + QuantizeLinear_260_quantize_scale_node + Conv_262 + Add_281 + Relu_282, Tactic: -496455309852654971, 542[Int8(1,256,-59,-62)], 568[Int8(1,256,-59,-62)] -> 576[Int8(1,256,-59,-62)]
Layer(CaskConvolution): layer3.1.conv1.weight + QuantizeLinear_290_quantize_scale_node + Conv_292 + Relu_294, Tactic: -8431788508843860955, 576[Int8(1,256,-59,-62)] -> 590[Int8(1,256,-59,-62)]
Layer(CaskConvolution): layer3.1.conv2.weight + QuantizeLinear_302_quantize_scale_node + Conv_304 + Add_312 + Relu_313, Tactic: -496455309852654971, 590[Int8(1,256,-59,-62)], 576[Int8(1,256,-59,-62)] -> 611[Int8(1,256,-59,-62)]
Layer(CaskConvolution): layer3.2.conv1.weight + QuantizeLinear_321_quantize_scale_node + Conv_323 + Relu_325, Tactic: -8431788508843860955, 611[Int8(1,256,-59,-62)] -> 625[Int8(1,256,-59,-62)]
Layer(CaskConvolution): layer3.2.conv2.weight + QuantizeLinear_333_quantize_scale_node + Conv_335 + Add_343 + Relu_344, Tactic: -496455309852654971, 625[Int8(1,256,-59,-62)], 611[Int8(1,256,-59,-62)] -> 646[Int8(1,256,-59,-62)]
Layer(CaskConvolution): layer3.3.conv1.weight + QuantizeLinear_352_quantize_scale_node + Conv_354 + Relu_356, Tactic: -8431788508843860955, 646[Int8(1,256,-59,-62)] -> 660[Int8(1,256,-59,-62)]
Layer(CaskConvolution): layer3.3.conv2.weight + QuantizeLinear_364_quantize_scale_node + Conv_366 + Add_374 + Relu_375, Tactic: -496455309852654971, 660[Int8(1,256,-59,-62)], 646[Int8(1,256,-59,-62)] -> 681[Int8(1,256,-59,-62)]
Layer(CaskConvolution): layer3.4.conv1.weight + QuantizeLinear_383_quantize_scale_node + Conv_385 + Relu_387, Tactic: -8431788508843860955, 681[Int8(1,256,-59,-62)] -> 695[Int8(1,256,-59,-62)]
Layer(CaskConvolution): layer3.4.conv2.weight + QuantizeLinear_395_quantize_scale_node + Conv_397 + Add_405 + Relu_406, Tactic: -496455309852654971, 695[Int8(1,256,-59,-62)], 681[Int8(1,256,-59,-62)] -> 716[Int8(1,256,-59,-62)]
Layer(CaskConvolution): layer3.5.conv1.weight + QuantizeLinear_414_quantize_scale_node + Conv_416 + Relu_418, Tactic: -8431788508843860955, 716[Int8(1,256,-59,-62)] -> 730[Int8(1,256,-59,-62)]
Layer(CaskConvolution): layer3.5.conv2.weight + QuantizeLinear_426_quantize_scale_node + Conv_428 + Add_436 + Relu_437, Tactic: -496455309852654971, 730[Int8(1,256,-59,-62)], 716[Int8(1,256,-59,-62)] -> 751[Int8(1,256,-59,-62)]
Layer(CaskConvolution): layer4.0.conv1.weight + QuantizeLinear_445_quantize_scale_node + Conv_447 + Relu_449, Tactic: -6371781333659293809, 751[Int8(1,256,-59,-62)] -> 765[Int8(1,512,-71,-72)]
Layer(CaskConvolution): layer4.0.downsample.0.weight + QuantizeLinear_468_quantize_scale_node + Conv_470, Tactic: -1494157908358500249, 751[Int8(1,256,-59,-62)] -> 791[Int8(1,512,-71,-72)]
Layer(CaskConvolution): layer4.0.conv2.weight + QuantizeLinear_457_quantize_scale_node + Conv_459 + Add_478 + Relu_479, Tactic: -2328318099174473157, 765[Int8(1,512,-71,-72)], 791[Int8(1,512,-71,-72)] -> 799[Int8(1,512,-71,-72)]
Layer(CaskConvolution): layer4.1.conv1.weight + QuantizeLinear_487_quantize_scale_node + Conv_489 + Relu_491, Tactic: -2328318099174473157, 799[Int8(1,512,-71,-72)] -> 813[Int8(1,512,-71,-72)]
Layer(CaskConvolution): layer4.1.conv2.weight + QuantizeLinear_499_quantize_scale_node + Conv_501 + Add_509 + Relu_510, Tactic: -2328318099174473157, 813[Int8(1,512,-71,-72)], 799[Int8(1,512,-71,-72)] -> 834[Int8(1,512,-71,-72)]
Layer(CaskConvolution): layer4.2.conv1.weight + QuantizeLinear_518_quantize_scale_node + Conv_520 + Relu_522, Tactic: -2328318099174473157, 834[Int8(1,512,-71,-72)] -> 848[Int8(1,512,-71,-72)]
Layer(CaskConvolution): layer4.2.conv2.weight + QuantizeLinear_530_quantize_scale_node + Conv_532 + Add_540 + Relu_541, Tactic: -2328318099174473157, 848[Int8(1,512,-71,-72)], 834[Int8(1,512,-71,-72)] -> 869[Int8(1,512,-71,-72)]
Layer(CaskDeconvolution): deconv_layers.0.weight + QuantizeLinear_549_quantize_scale_node + ConvTranspose_551, Tactic: -3784829056659735491, 869[Int8(1,512,-71,-72)] -> 881[Int8(1,512,-46,-47)]
Layer(CaskConvolution): deconv_layers.1.weight + QuantizeLinear_559_quantize_scale_node + Conv_561 + Relu_563, Tactic: -496455309852654971, 881[Int8(1,512,-46,-47)] -> 895[Int8(1,256,-46,-47)]
Layer(CaskDeconvolution): deconv_layers.4.weight + QuantizeLinear_571_quantize_scale_node + ConvTranspose_573, Tactic: -3784829056659735491, 895[Int8(1,256,-46,-47)] -> 907[Int8(1,256,-68,-55)]
Layer(CaskConvolution): deconv_layers.5.weight + QuantizeLinear_581_quantize_scale_node + Conv_583 + Relu_585, Tactic: -8431788508843860955, 907[Int8(1,256,-68,-55)] -> 921[Int8(1,256,-68,-55)]
Layer(CaskDeconvolution): deconv_layers.8.weight + QuantizeLinear_593_quantize_scale_node + ConvTranspose_595, Tactic: -2621193268472024213, 921[Int8(1,256,-68,-55)] -> 933[Int8(1,256,-29,-32)]
Layer(CaskConvolution): deconv_layers.9.weight + QuantizeLinear_603_quantize_scale_node + Conv_605 + Relu_607, Tactic: -8431788508843860955, 933[Int8(1,256,-29,-32)] -> 947[Int8(1,256,-29,-32)]
Layer(CaskConvolution): hm.0.weight + QuantizeLinear_615_quantize_scale_node + Conv_617 + Relu_618, Tactic: 4871133328510103657, 947[Int8(1,256,-29,-32)] -> 960[Int8(1,64,-29,-32)]
Layer(CaskConvolution): wh.0.weight + QuantizeLinear_636_quantize_scale_node + Conv_638 + Relu_639, Tactic: 4871133328510103657, 947[Int8(1,256,-29,-32)] -> 985[Int8(1,64,-29,-32)]
Layer(CaskConvolution): reg.0.weight + QuantizeLinear_657_quantize_scale_node + Conv_659 + Relu_660, Tactic: 4871133328510103657, 947[Int8(1,256,-29,-32)] -> 1010[Int8(1,64,-29,-32)]
Layer(CaskConvolution): hm.2.weight + QuantizeLinear_626_quantize_scale_node + Conv_628, Tactic: -7185527339793611699, 960[Int8(1,64,-29,-32)] -> Reformatted Output Tensor 0 to hm.2.weight + QuantizeLinear_626_quantize_scale_node + Conv_628[Float(1,2,-29,-32)]
Layer(Reformat): Reformatting CopyNode for Output Tensor 0 to hm.2.weight + QuantizeLinear_626_quantize_scale_node + Conv_628, Tactic: 0, Reformatted Output Tensor 0 to hm.2.weight + QuantizeLinear_626_quantize_scale_node + Conv_628[Float(1,2,-29,-32)] -> hm[Float(1,2,-29,-32)]
Layer(CaskConvolution): wh.2.weight + QuantizeLinear_647_quantize_scale_node + Conv_649, Tactic: -7185527339793611699, 985[Int8(1,64,-29,-32)] -> Reformatted Output Tensor 0 to wh.2.weight + QuantizeLinear_647_quantize_scale_node + Conv_649[Float(1,2,-29,-32)]
Layer(Reformat): Reformatting CopyNode for Output Tensor 0 to wh.2.weight + QuantizeLinear_647_quantize_scale_node + Conv_649, Tactic: 0, Reformatted Output Tensor 0 to wh.2.weight + QuantizeLinear_647_quantize_scale_node + Conv_649[Float(1,2,-29,-32)] -> wh[Float(1,2,-29,-32)]
Layer(CaskConvolution): reg.2.weight + QuantizeLinear_668_quantize_scale_node + Conv_670, Tactic: -7185527339793611699, 1010[Int8(1,64,-29,-32)] -> Reformatted Output Tensor 0 to reg.2.weight + QuantizeLinear_668_quantize_scale_node + Conv_670[Float(1,2,-29,-32)]
Layer(Reformat): Reformatting CopyNode for Output Tensor 0 to reg.2.weight + QuantizeLinear_668_quantize_scale_node + Conv_670, Tactic: 0, Reformatted Output Tensor 0 to reg.2.weight + QuantizeLinear_668_quantize_scale_node + Conv_670[Float(1,2,-29,-32)] -> reg[Float(1,2,-29,-32)]
[08/25/2021-17:30:37] [I] [TRT] [MemUsageSnapshot] Builder end: CPU 1396 MiB, GPU 726 MiB
The Typical TensorRT Quantization Workflow
A quick summary of the usual steps for quantizing and deploying a model with TensorRT:
1. For most models, the PTQ tooling is enough: prepare a calibration dataset and run PTQ directly through the interfaces TensorRT provides (very little code) or through the Python API.
2. If the PTQ calibration methods built into TensorRT do not work well for your model, consider quantizing it your own way and exporting a model that carries the quantization information for TensorRT to load (this takes some code; for example, export an already-quantized model from a training framework such as PyTorch). A model with quantization information is exactly the QDQ ONNX model discussed above; a sketch follows below.
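For step 2, a rough sketch using NVIDIA's pytorch-quantization toolkit[2] might look like the following. The calibration / QAT fine-tuning step is elided, resnet50 and the file names are placeholders, and the exact API may differ between toolkit versions:

```python
import torch
from pytorch_quantization import quant_modules
from pytorch_quantization import nn as quant_nn
from torchvision.models import resnet50

quant_modules.initialize()             # monkey-patch nn.Conv2d/Linear/... with quantized versions
model = resnet50(pretrained=True).cuda().eval()

# ... calibrate the TensorQuantizer amax values and/or fine-tune (QAT) here ...

# export the fake-quant nodes as ONNX QuantizeLinear/DequantizeLinear (QDQ)
quant_nn.TensorQuantizer.use_fb_fake_quant = True
dummy = torch.randn(1, 3, 224, 224, device="cuda")
torch.onnx.export(model, dummy, "resnet50_qdq.onnx", opset_version=13,
                  input_names=["input"], output_names=["output"])
```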
Some Issues Encountered While Converting Quantized Models
Here are a few problems I ran into during TensorRT quantization. Most of them can be found in the official issues by searching for int8 or quant; these are just the ones I hit.
- If you use the PyTorch quantization library provided with TensorRT, you need to modify the ResNet-50 network code; see https://docs.nvidia.com/deeplearning/tensorrt/pytorch-quantization-toolkit/docs/tutorials/quant_resnet50.html:
from typing import Callable, Optional

import torch.nn as nn
from torch import Tensor
from pytorch_quantization import nn as quant_nn


class BasicBlock(nn.Module):  # the same change applies to Bottleneck
    def __init__(self,
                 inplanes: int,
                 planes: int,
                 stride: int = 1,
                 downsample: Optional[nn.Module] = None,
                 groups: int = 1,
                 base_width: int = 64,
                 dilation: int = 1,
                 norm_layer: Optional[Callable[..., nn.Module]] = None,
                 quantize: bool = False) -> None:
        super().__init__()
        # other code...
        self._quantize = quantize
        if self._quantize:
            # quantize the identity branch so the residual add sees matching Q/DQ inputs
            self.residual_quantizer = quant_nn.TensorQuantizer(
                quant_nn.QuantConv2d.default_quant_desc_input)

    def forward(self, x: Tensor) -> Tensor:
        # other code...
        if self._quantize:
            out += self.residual_quantizer(identity)
        else:
            out += identity
        out = self.relu(out)
        return out
- In a QDQ graph, having a QDQ pair right after a ReLU triggers an error (upgrading to TensorRT 8.2 fixes this):
[TensorRT] ERROR: 2: [graphOptimizer.cpp::sameExprValues::587] Error Code 2: Internal Error (Assertion lhs.expr failed.)
Traceback (most recent call last):
File "yolov3_trt.py", line 678, in <module>
test()
File "yolov3_trt.py", line 660, in test
create_engine(engine_file, 'int8', qat=True)
File "yolov3_trt.py", line 601, in create_engine
ctx.build_engine(engine_file)
- Regarding Deconvolution: for INT8 quantization, both the I and O channels of the deconvolution weights (OIHW layout) must be greater than 1
[optimizer.cpp::computeCosts::1981] Error Code 10: Internal Error (Could not find any implementation for node quantize_per_channel_110_input + [QUANTIZE]-[acc_ops.quantize_per_channel]-[(Unnamed Layer* 647) [Constant]_output_per_channel_quant] + [DECONVOLUTION]-[acc_ops.conv_transpose2d]-[conv_transpose2d_9].)
Quantizing this kind of deconvolution structure raises an error
- Another issue: before TensorRT 8.2 EA, a deconv whose input and output channel counts differ raises an error:
- And if the input and output channels of the ConvTranspose are related in a certain way, it also errors out:
A quick search through the issues shows quite a few people hitting the same problems:
- https://github.com/NVIDIA/TensorRT/issues/1556
- https://github.com/NVIDIA/TensorRT/issues/1519
For now this appears to be unresolved:
thanks for update, we will check, and the c%4 will not work for ConvTranspose , it is for depthConv.
- Some quantization results are parsed incorrectly, tactic: ampere_scudnn_128x64_relu_interior_nn_v1
Afterword
This article took quite a few days to put together, but it is finally done. The draft was actually ready back in October last year, but I kept putting it off until now, heh.
Besides TensorRT I have also used some other frameworks. Whether PPL or TVM, I found their INT8 performance on my models still falls short of TensorRT, or some cases are not covered as completely as TensorRT covers them. TensorRT's pain point, however, is that INT8 plugins are hard to debug and come with plenty of pitfalls.
Recently I have been doing some INT8 optimization with TVM, planning to take models already PTQ-quantized with torch.fx and accelerate them in TVM; related articles will follow.
I have also been migrating my notes (drafts, really) to github.io, built into a website with MkDocs; the articles there are updated continuously. Here is the link:
- https://ai.oldpan.me/
Unlike the blog, the categorization there is tidier, and the focus is still AI deployment, acceleration, and optimization. It may look messy right now because it is constantly being updated; I will find time to tidy it up. Some new articles will appear there first, so feel free to browse when you have a moment.
I will keep writing about quantization; after procrastinating for this long, the posting frequency will pick up. Thanks for your support!
Reference Links
- https://zhuanlan.zhihu.com/p/451105341
- https://github.com/NVIDIA/TensorRT/issues/1552
- https://github.com/NVIDIA/TensorRT/issues/1165
- https://mp.weixin.qq.com/s/-Vkvlxth-JzJApxWeAJz-A
References
[1] Quantization frameworks: https://zhuanlan.zhihu.com/p/355598250
[2] Tools: https://github.com/NVIDIA/TensorRT/tree/main/tools/pytorch-quantization