TensorFlow Model Quantization: A Worked Example

Reposted article. 2019-12-18 12:09. Category: model compression

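1. Overview

　　Model quantization is probably the easiest model-compression technique to apply today, and it is essentially a necessary step for models deployed on mobile devices. Quantization falls into two broad categories: post-training quantization and quantization-aware training. Both PyTorch and TensorFlow provide interfaces for them.

　　The common min-max quantization scheme can be summarized by the formula: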

　　　　$r = S (q - Z)$

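　　Here q is the quantized value, r is the original floating-point value, S is the floating-point scale factor, and Z is the zero point: the value, of the same type as q, that represents r = 0. From: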

　　　　$\frac{q - q_{min}}{q_{max} - q_{min}} = \frac{r - r_{min}}{r_{max} - r_{min}}$

　　we can derive S and Z:

　　　　$S = \frac{r_{max} - r_{min}}{q_{max} - q_{min}}$

　　　　$Z = q_{min} - \frac{r_{min}}{S}$
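
　　As a minimal sketch of the formulas above, the following pure-Python snippet computes S and Z for an assumed float range and round-trips one value. The function names, the uint8 target range [0, 255], and the rounding of Z to the nearest integer are illustrative assumptions, not code from the original repository.

```python
def min_max_params(r_min, r_max, q_min=0, q_max=255):
    """Return (S, Z) for the min-max scheme r = S * (q - Z)."""
    scale = (r_max - r_min) / (q_max - q_min)
    # Z has the same (integer) type as q, so it is rounded here (assumption).
    zero_point = round(q_min - r_min / scale)
    return scale, zero_point


def quantize(r, scale, zero_point, q_min=0, q_max=255):
    """Map a float r to an integer q, clamped to [q_min, q_max]."""
    q = round(r / scale) + zero_point
    return max(q_min, min(q_max, q))


def dequantize(q, scale, zero_point):
    """Recover an approximate float from q using r = S * (q - Z)."""
    return scale * (q - zero_point)


# Hypothetical float range [-1.0, 3.0] quantized to uint8.
scale, zero_point = min_max_params(-1.0, 3.0)
q = quantize(0.5, scale, zero_point)
r_hat = dequantize(q, scale, zero_point)
print(scale, zero_point, q, r_hat)
```

　　The recovered value differs from the input by at most about one quantization step S, which is the precision lost by storing 8 bits instead of 32.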

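2. Experiments

　　Both quantization approaches were tried on LeNet with TensorFlow (code on GitHub: https://github.com/jiangxinyang227/model_quantization).

　　Post-training quantization

　　This is very easy to do in TensorFlow: a trained model saved in SavedModel format can be used as the input, quantized, and converted to tflite. We call this version v1.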

import tensorflow as tf

saved_model_dir = "./pb_model"

converter = tf.lite.TFLiteConverter.from_saved_model(saved_model_dir,
                                                     input_arrays=["inputs"],
                                                     input_shapes={"inputs": [1, 784]},
                                                     output_arrays=["predictions"])
converter.optimizations = ["DEFAULT"]
tflite_model = converter.convert()
open("tflite_model_v3/eval_graph.tflite", "wb").write(tflite_model)

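　　In practice, however, the tflite model produced by this code did not shrink to 1/4 of the original size, which is odd; the cause has not been confirmed. One possible explanation, not verified here, is that optimizations is meant to hold members of tf.lite.Optimize (for example tf.lite.Optimize.DEFAULT) rather than the string "DEFAULT", so the string may simply be ignored. Building on v1, we add one line and call the result v2: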

import tensorflow as tf

saved_model_dir = "./pb_model"

converter = tf.lite.TFLiteConverter.from_saved_model(saved_model_dir,
                                                     input_arrays=["inputs"],
                                                     input_shapes={"inputs": [1, 784]},
                                                     output_arrays=["predictions"])
converter.optimizations = ["DEFAULT"]  # used for both v1 and v2
converter.post_training_quantize = True  # used for v2 only
tflite_model = converter.convert()
open("tflite_model_v3/eval_graph.tflite", "wb").write(tflite_model)

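　　With this change the model shrinks to 1/4 of its original size.

　　Finally, we convert the model to tflite on its own, with no quantization options. We call this v3: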

import tensorflow as tf

saved_model_dir = "./pb_model"

converter = tf.lite.TFLiteConverter.from_saved_model(saved_model_dir,
                                                     input_arrays=["inputs"],
                                                     input_shapes={"inputs": [1, 784]},
                                                     output_arrays=["predictions"])
tflite_model = converter.convert()
open("tflite_model_v3/eval_graph.tflite", "wb").write(tflite_model)

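　　Clearly, converting directly to tflite does not compress the model. Now for inference speed: the inference code is on GitHub, and the results were as follows:

　　In the results above, "checkpoint" means loading the checkpoint directly on the CPU for prediction. Only v2 is compressed to 1/4 of the original size, yet its inference is slower than v1 and v3, and the tflite models are clearly faster than the checkpoint. My guesses as to why:

　　　　1. The tflite interpreter itself speeds up tflite models.

　　　　2. As for why the quantized model is slower: post-training quantization, as used here, essentially converts the int values back to float for computation, so the intermediate quantize and dequantize operations take up some time.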
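
　　The second guess can be illustrated with a small pure-Python sketch: weights are stored as one byte each (a quarter of the float32 size) but are expanded back to float before the dot product, which is extra work on every call. This only mirrors the hypothesis above; it is not how the TensorFlow Lite kernels are implemented, and all names are hypothetical.

```python
import struct


def quantize_weights(weights, q_max=255):
    """Store float weights as uint8 bytes plus a (scale, zero_point) pair."""
    r_min, r_max = min(weights), max(weights)
    scale = (r_max - r_min) / q_max
    zero_point = round(-r_min / scale)
    q = bytes(max(0, min(q_max, round(w / scale) + zero_point)) for w in weights)
    return q, scale, zero_point


def dense_float(x, q_weights, scale, zero_point):
    """Dequantize the stored weights to float, then do a float dot product."""
    weights = [scale * (q - zero_point) for q in q_weights]  # extra step per call
    return sum(xi * wi for xi, wi in zip(x, weights))


weights = [-0.5, 0.25, 0.1, 0.75]
x = [1.0, 2.0, 3.0, 4.0]

float_bytes = len(struct.pack(f"{len(weights)}f", *weights))  # float32 storage
q_weights, scale, zero_point = quantize_weights(weights)

exact = sum(xi * wi for xi, wi in zip(x, weights))
approx = dense_float(x, q_weights, scale, zero_point)
print(float_bytes, len(q_weights), exact, approx)
```

　　The storage drops from 16 bytes to 4, while the arithmetic is still done in float, so there is no speedup from integer math in this scheme.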

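　　Quantization-aware training

　　Adding quantization during training is considerably more involved. First, at training time, tf.contrib.quantize.create_training_graph() must be called after the loss is computed and before the optimizer is defined, as follows: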

self.loss = slim.losses.softmax_cross_entropy(self.train_digits, self.input_labels)

# Get the current graph, which is used for quantization below
self.g = tf.get_default_graph()

if self.is_train:
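    # Must come after the loss and before the optimizer. This automatically picks
    # some operations and activations in the graph to fake-quantize.
    tf.contrib.quantize.create_training_graph(input_graph=self.g, quant_delay=80000)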
    self.lr = cfg.LEARNING_RATE
    self.train_op = tf.train.AdamOptimizer(self.lr).minimize(self.loss)
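
　　To make "fake quantization" concrete, here is a simplified pure-Python sketch of the idea: each value is snapped to one of 2^8 levels and immediately converted back, so it stays a float but carries the rounding error the int8 model will see. This is an illustration under assumptions (fixed range, no adjustment to make zero exactly representable), not the implementation used by tf.contrib.quantize.

```python
def fake_quantize(values, r_min, r_max, num_bits=8):
    """Quantize then immediately dequantize; outputs remain floats."""
    q_max = 2 ** num_bits - 1
    scale = (r_max - r_min) / q_max
    out = []
    for v in values:
        clamped = max(r_min, min(r_max, v))    # values outside the range saturate
        q = round((clamped - r_min) / scale)   # integer level in [0, q_max]
        out.append(r_min + q * scale)          # back to float
    return out


# Hypothetical activations with an assumed range of [0.0, 1.0].
snapped = fake_quantize([-0.2, 0.1234, 0.8, 1.7], 0.0, 1.0)
print(snapped)
```

　　Because the forward pass already sees these rounded values during training, the weights can adapt to the quantization error before the model is converted.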

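　　After training, the model is saved as a checkpoint file that contains the fake-quantization information. Its variables are still float, so we need to convert it into a model file containing only int types. The steps are:

　　1. Save a frozen pb file, using tf.contrib.quantize.create_eval_graph() to switch the graph to inference mode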

with tf.Session() as sess:
    le_net = Lenet(False)
    # Do not import the train graph here. A new graph is built above and is
    # filled with the parameters from the train graph.
    saver = tf.train.Saver()
    saver.restore(sess, cfg.PARAMETER_FILE)

    frozen_graph_def = graph_util.convert_variables_to_constants(
        sess, sess.graph_def, ['predictions'])
    tf.io.write_graph(
        frozen_graph_def,
        "pb_model",
        "freeze_eval_graph.pb",
        as_text=False)

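　　Note the comment above: the saver here must not import the training graph with something like tf.train.import_meta_graph. Instead, instantiate the Lenet class again to build a new graph, then assign the parameter variables from the training graph to it.

　　2. Convert to a tflite file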

import tensorflow as tf

path_to_frozen_graphdef_pb = 'pb_model/freeze_eval_graph.pb'
converter = tf.contrib.lite.TFLiteConverter.from_frozen_graph(path_to_frozen_graphdef_pb,
                                                              ["inputs"],
                                                              ["predictions"])

converter.inference_type = tf.contrib.lite.constants.QUANTIZED_UINT8
converter.quantized_input_stats = {"inputs": (0., 1.)}
converter.allow_custom_ops = True
converter.default_ranges_stats = (0, 255)
converter.post_training_quantize = True
tflite_model = converter.convert()
open("tflite_model/eval_graph.tflite", "wb").write(tflite_model)

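　　A few points to note:

　　1) ["inputs"] and ["predictions"] are the input and output nodes of the frozen pb.

　　2) quantized_input_stats defines the mean and standard deviation of the input. The TensorFlow Lite documentation describes them as follows: mean is the integer value between 0 and 255 that maps to floating-point 0.0f, and std_dev = 255 / (float_max - float_min). That said, I found that using 0. and 1. here also works well.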

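　　3) default_ranges_stats is the range of values after quantization, where 255 is 2^8 - 1. (The converter documentation describes it more precisely as the default (min, max) range applied to arrays that have no recorded range.)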
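
　　The rule quoted in point 2 can be sketched as follows. The snippet assumes the relation real_value = (quantized_value - mean) / std_dev given in the TensorFlow Lite converter documentation; the helper names are hypothetical and the float range [0.0, 1.0] (raw MNIST pixel values) is only an example.

```python
def input_stats(float_min, float_max, q_max=255):
    """Return (mean, std_dev) for quantized_input_stats from a float range."""
    std_dev = q_max / (float_max - float_min)
    mean = round(-float_min * std_dev)  # the uint8 value that maps to 0.0
    return mean, std_dev


def to_uint8(real_value, mean, std_dev, q_max=255):
    """Invert real = (q - mean) / std_dev and clamp to the uint8 range."""
    q = round(real_value * std_dev + mean)
    return max(0, min(q_max, q))


def to_real(q, mean, std_dev):
    """Float value the converter associates with the uint8 input q."""
    return (q - mean) / std_dev


mean, std_dev = input_stats(0.0, 1.0)
pixel = to_uint8(0.2, mean, std_dev)

# With the (0., 1.) setting used above, a uint8 input q is read as the float q.
as_seen_by_model = to_real(pixel, 0.0, 1.0)
print(mean, std_dev, pixel, as_seen_by_model)
```

　　Under this reading, (0., 1.) means the model treats the raw uint8 numbers as the float inputs, so whatever preprocessing is applied at prediction time has to match that choice.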

　　3. Predict with tflite

import time
import tensorflow as tf
import numpy as np
import tensorflow.examples.tutorials.mnist.input_data as input_data


mnist = input_data.read_data_sets('MNIST_data/', one_hot=True)
labels = [label.index(1) for label in mnist.test.labels.tolist()]
images = mnist.test.images

# Standardize each image to zero mean and unit variance before prediction
means = np.mean(images, axis=1).reshape([10000, 1])
std = np.std(images, axis=1, ddof=1).reshape([10000, 1])
images = (images - means) / std
# The input values must be converted to uint8
images = np.array(images, dtype="uint8")

interpreter = tf.contrib.lite.Interpreter(model_path="tflite_model/eval_graph.tflite")
interpreter.allocate_tensors()
input_details = interpreter.get_input_details()
output_details = interpreter.get_output_details()

start_time = time.time()


predictions = []
for image in images:
    interpreter.set_tensor(input_details[0]['index'], [image])
    interpreter.invoke()
    score = interpreter.get_tensor(output_details[0]['index'])[0][0]
    predictions.append(score)

correct = 0
for prediction, label in zip(predictions, labels):
    if prediction == label:
        correct += 1
end_time = time.time()
print((end_time - start_time) / len(labels) * 1000)
print(correct / len(labels))

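　　Again, two points to note:

　　1) The input is standardized to zero mean and unit variance; I believe this keeps it consistent with the quantized_input_stats set earlier.

　　2) The input must be converted to uint8, otherwise an error is raised.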

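　　4. Performance comparison

　　The model size drops to 1/4 of the original, as expected. Accuracy drops by 2%, which is acceptable, and inference is about 3 times faster.

　　Comparing with post-training quantization above: the size is the same as v2, accuracy is 2% lower than v2, and inference is faster by 0.02. My guesses as to why:

　　1. LeNet is arguably a large model for the MNIST dataset, so post-training quantization costs little accuracy. It is therefore at no disadvantage to quantization-aware training, and even has a slight edge.

　　2. Inference with quantization-aware training is somewhat faster than v2 (note: this is not a fluke; I ran the test many times, the inference speed was stable, and the average difference was 0.02), but not by much, and it is still slower than v1 and v3. In a convolutional network the computational cost is dominated by the convolutions, and the convolutions here are small, so quantization has little effect on inference speed. In addition, the quantization operations themselves cost some time, and v2 also has dequantization operations, so it takes a bit longer. Finally, the hardware may not have dedicated support for int8 computation.

　　In short, the above only exercises the full quantization workflow in TensorFlow. Because the chosen network is simple, we do not see the more pronounced differences reported for networks such as Inception v3 or MobileNet. Separately, tflite does speed up inference.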
