TVM Quantization Roadmap

Reposted article, 2021-07-19 06:17

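INT8 QUANTIZATION PROPOSAL

This article gives an overview of the principles behind the quantization process and proposes how to implement it in TVM.

- Background on quantization

- INT8 quantization: backend code generation

- This thread only ...

QUANTIZATION DEVELOPMENT

Search-based automated quantization

A new quantization framework is proposed that brings hardware and training methods together.

Borrowing ideas from several existing quantization frameworks, it adopts a three-stage design: annotation, calibration, and realization.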

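- Annotation:

The annotation pass rewrites the graph according to each operator's rewrite function and inserts simulated quantization operations.

A simulated quantization operation models the rounding error and saturation error incurred when quantizing from floating point to integers.

- Calibration:

The calibration pass adjusts the thresholds of the simulated quantization operations to reduce the loss of accuracy.

- Realization:

The realization pass converts the simulated graph, which is still computed in float32, into a real low-precision integer graph.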
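The three stages can be illustrated on a single scalar. The following is a minimal pure-Python sketch, assuming symmetric int8 quantization with one threshold per tensor and a mean-squared-error search for calibration; the function names, the sample values, and the candidate thresholds are all hypothetical and are not TVM's actual implementation.

```python
def simulated_quantize(x, threshold, bits=8):
    """Annotation stage: quantize then dequantize, staying in float.

    The result still is a float, but it carries the rounding and
    saturation error that a real integer representation would have.
    """
    qmax = 2 ** (bits - 1) - 1            # 127 for int8
    scale = threshold / qmax              # float value of one integer step
    q = round(x / scale)                  # rounding error is introduced here
    q = max(-qmax, min(qmax, q))          # saturation error is introduced here
    return q * scale


def calibrate(samples, candidates, bits=8):
    """Calibration stage: pick the threshold with the lowest mean squared error."""
    def mse(threshold):
        errors = [(x - simulated_quantize(x, threshold, bits)) ** 2 for x in samples]
        return sum(errors) / len(errors)
    return min(candidates, key=mse)


def realize(x, threshold, bits=8):
    """Realization stage: emit the actual integer plus the scale to recover a float."""
    qmax = 2 ** (bits - 1) - 1
    scale = threshold / qmax
    q = max(-qmax, min(qmax, round(x / scale)))
    return int(q), scale


# Hypothetical activation samples with one outlier.
samples = [0.02, -0.5, 0.9, 1.3, -1.1, 0.25, 6.0]
threshold = calibrate(samples, candidates=[0.5, 1.0, 2.0, 4.0, 8.0])
q, scale = realize(0.9, threshold)
print(threshold, q, q * scale)
```

A small threshold gives fine steps but clips large values, while a large threshold avoids clipping at the cost of coarser steps; the calibration search trades these two errors against each other.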

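QUANTIZATION FRAMEWORKS SUPPORTED BY TVM

TF QUANTIZATION RELATED

TVM supports all pre-quantized TFLite hosted models

- Performance was evaluated on a C5.12xlarge Cascade Lake machine with Intel VNNI support

- The models have not been auto-tuned yet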

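PYTORCH QUANTIZATION RELATED

How to convert the model to a quantized one through relay?

How to set qconfig for torch.quantization.get_default_qconfig('fbgemm')

Quantized model accuracy benchmark: PyTorch vs TVM

How to convert a quantized PyTorch model to a TVM model

Compares the accuracy and speed of resnet18, resnet50, mobilenet-v2, mobilenet-v3, inception_v3, and googlenet.

Covers static quantization and eager mode in PyTorch: PyTorch's quantization tutorial.

- GAP quantization

- GAP8 export and PyTorch quantization module for PyTorch

- Includes the quantization file for squeezenet-v1.1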

MXNET RELATED

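Model quantization for production-level neural network inference

- The CPU performance figures below come from an AWS EC2 C5.24xlarge instance with custom 2nd generation Intel Xeon Scalable Processors (Cascade Lake).

- Model quantization delivers a more stable speedup across all models, for example 3.66x for ResNet 50 v1, 3.82x for ResNet 101 v1, and 3.77x for SSD-VGG16, which is very close to the theoretical 4x speedup of INT8.

- The accuracy of the Apache/MXNet quantization solution is very close to that of the FP32 models, without requiring the model to be retrained. In Figure 8, MXNet shows only a small drop in accuracy, less than 0.5%.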

TENSOR CORE RELATED

[RFC][Tensor Core] Optimization of CNNs on Tensor Core
[Perf] Enhance cudnn and cublas backend and enable TensorCore

RELATED COMMIT

[OPT] Low-bit Quantization #2116

Benchmarking Quantization on Intel CPU

[RFC][Quantization] Support quantized models from TensorflowLite #2351

After initial investigation and effort, INT8 achieves a speedup of about 30% over FP32 on ARM CPU for the Mobilenet V1 model.

[TFLite] Support TFLite FP32 Relay frontend. #2365

This is the first PR of #2351 to support importing existing quantized int8 TFLite models. The base version of Tensorflow / TFLite is 1.12.

[Strategy] Support for Int8 schedules - CUDA/x86 #5031

The recently introduced op strategy currently has some issues with AutoTVM task extraction. This PR fixes them for x86/CUDA.

[Torch, QNN] Add support for quantized models via QNN #4977

[QNN][Legalize] Specialize for Platforms w/o fast Int8 support #4307
- QNN - Conv2D/Dense Legalize for platforms with no fast Int8 units
The inference time is longer after int8 quantization
- TVM-relay.quantize vs quantization of other Framework
- TVM FP32, TVM int8, TVM int8 quantization + AutoTVM, MXNet

SPEED UP

COMPARISON

AUTOMATIC INTEGER QUANTIZATION

Quantization int8 slower than int16 on skylake CPU

int8 is always slower than int16, both before and after auto-tuning
Target: llvm -mcpu=skylake-avx512
The problem was solved by creating the int8 task explicitly
- create the task topi_x86_conv2d_NCHWc_int8
- set output dtype to int32, input dtype=uint8, weight dtype=int8
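One way to see why the explicit task pairs uint8 inputs and int8 weights with an int32 output is to look at the value ranges involved. The sketch below is plain Python arithmetic, not TVM code; the helper names and the reduction length of 576 (a hypothetical 3x3 kernel over 64 input channels) are assumptions made for illustration.

```python
def dtype_range(bits, signed):
    """Return the (min, max) values representable by an integer type."""
    if signed:
        return -(2 ** (bits - 1)), 2 ** (bits - 1) - 1
    return 0, 2 ** bits - 1


def worst_case_dot(k):
    """Range of a sum of k products of a uint8 input and an int8 weight."""
    _, u_hi = dtype_range(8, signed=False)      # uint8 input: 0..255
    w_lo, w_hi = dtype_range(8, signed=True)    # int8 weight: -128..127
    return u_hi * w_lo * k, u_hi * w_hi * k     # (most negative, most positive)


def fits(value_range, bits):
    """Check whether a (min, max) range fits in a signed integer of `bits` bits."""
    lo, hi = dtype_range(bits, signed=True)
    return lo <= value_range[0] and value_range[1] <= hi


# Hypothetical reduction length: 3x3 kernel over 64 input channels.
k = 3 * 3 * 64
print(worst_case_dot(1), fits(worst_case_dot(1), 16))
print(worst_case_dot(k), fits(worst_case_dot(k), 16), fits(worst_case_dot(k), 32))
```

A single uint8 by int8 product still fits in 16 bits, but accumulating even two of them can overflow, so a convolution-sized reduction needs a 32-bit accumulator.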

TVM study notes: model quantization (int8) and its test data

TVM FP32, TVM int8, TVM int8 quantization, MXNet, TF1.13
Includes test code

8bit@Cuda: AutoTVM vs TensorRT vs MXNet

In this post, we show how to use TVM to automatically optimize quantized deep learning models on CUDA.

ACCEPTING PRE-QUANTIZED INTEGER MODELS

Is there any speed comparison of quantization on cpu
- an extensive discussion of speed comparisons among torch-fp32, torch-int8, tvm-fp32, tvm-int16, and tvm-int8

SPEED PROFILE TOOLS

How to profile speed in each layer with RPC?
- the debug runtime will give you some profiling information from the embedded device, e.g.:

Node Name               Ops                                                                  Time(us)   Time(%)  Start Time       End Time         Shape                Inputs  Outputs
---------               ---                                                                  --------   -------  ----------       --------         -----                ------  -------
1_NCHW1c                fuse___layout_transform___4                                          56.52      0.02     15:24:44.177475  15:24:44.177534  (1, 1, 224, 224)     1       1
_contrib_conv2d_nchwc0  fuse__contrib_conv2d_NCHWc                                           12436.11   3.4      15:24:44.177549  15:24:44.189993  (1, 1, 224, 224, 1)  2       1
relu0_NCHW8c            fuse___layout_transform___broadcast_add_relu___layout_transform__    4375.43    1.2      15:24:44.190027  15:24:44.194410  (8, 1, 5, 5, 1, 8)   2       1
_contrib_conv2d_nchwc1  fuse__contrib_conv2d_NCHWc_1                                         213108.6   58.28    15:24:44.194440  15:24:44.407558  (1, 8, 224, 224, 8)  2       1
relu1_NCHW8c            fuse___layout_transform___broadcast_add_relu___layout_transform__    2265.57    0.62     15:24:44.407600  15:24:44.409874  (64, 1, 1)           2       1
_contrib_conv2d_nchwc2  fuse__contrib_conv2d_NCHWc_2                                         104623.15  28.61    15:24:44.409905  15:24:44.514535  (1, 8, 224, 224, 8)  2       1
relu2_NCHW2c            fuse___layout_transform___broadcast_add_relu___layout_transform___1  2004.77    0.55     15:24:44.514567  15:24:44.516582  (8, 8, 3, 3, 8, 8)   2       1
_contrib_conv2d_nchwc3  fuse__contrib_conv2d_NCHWc_3                                         25218.4    6.9      15:24:44.516628  15:24:44.541856  (1, 8, 224, 224, 8)  2       1
reshape1                fuse___layout_transform___broadcast_add_reshape_transpose_reshape    1554.25    0.43     15:24:44.541893  15:24:44.543452  (64, 1, 1)           2       1
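To locate the bottleneck in a profile like the one above, the per-op times can be ranked and summed. The following is a small sketch in plain Python: the (op, time) pairs are copied from the Ops and Time(us) columns of the table, while the helper names `rank_by_time` and `conv_share` are hypothetical and not part of the TVM debug runtime.

```python
# (op name, time in microseconds) pairs copied from the profiling table above.
rows = [
    ("fuse___layout_transform___4", 56.52),
    ("fuse__contrib_conv2d_NCHWc", 12436.11),
    ("fuse___layout_transform___broadcast_add_relu___layout_transform__", 4375.43),
    ("fuse__contrib_conv2d_NCHWc_1", 213108.6),
    ("fuse___layout_transform___broadcast_add_relu___layout_transform__", 2265.57),
    ("fuse__contrib_conv2d_NCHWc_2", 104623.15),
    ("fuse___layout_transform___broadcast_add_relu___layout_transform___1", 2004.77),
    ("fuse__contrib_conv2d_NCHWc_3", 25218.4),
    ("fuse___layout_transform___broadcast_add_reshape_transpose_reshape", 1554.25),
]


def rank_by_time(rows):
    """Sort ops by time, slowest first, and attach each op's share of the total."""
    total = sum(t for _, t in rows)
    ranked = sorted(rows, key=lambda r: r[1], reverse=True)
    return [(name, t, 100.0 * t / total) for name, t in ranked]


def conv_share(rows):
    """Percentage of total time spent in ops whose name contains 'conv2d'."""
    total = sum(t for _, t in rows)
    conv = sum(t for name, t in rows if "conv2d" in name)
    return 100.0 * conv / total


ranked = rank_by_time(rows)
for name, t, pct in ranked[:3]:
    print(f"{name:32s} {t:12.2f} us {pct:6.2f} %")
print(f"conv2d share: {conv_share(rows):.2f} %")
```

Ranking this way makes it clear that the convolution layers dominate the run time in this profile, so they are the natural place to check first when comparing int8 against FP32.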

Reference link:

https://www.freesion.com/article/3155559638/
