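Neural Network Quantization: Per-Channel Quantization

Reposted article, 2022-03-18 09:49, filed under Deep Learning / Quantization and Compression

(This post was first published on my WeChat official account; drop by when you have time.)

My earlier posts on network quantization were all based on per-layer quantization. Recently some readers asked about per-channel quantization, and I noticed that quite a few people misunderstand it; I was once misled by the literal meaning of "per-channel" myself. So today let's briefly go over what per-channel quantization actually is.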

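A Recap of Per-Layer Quantization

Before introducing per-channel quantization, let's first recap how per-layer quantization works.

Suppose \(r_1\) and \(r_2\) denote the input feature and the convolution kernel, and \(r_3\) denotes the output. The convolution can then be written as: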

\[r_3^{i,j,oc}=\sum_{ic}\sum_{m}\sum_{n}r_1^{i-m,j-n,ic}r_2^{m,n,ic,oc} \tag{1} \]

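Here \(oc\) is the index of the output channel and \(ic\) the index of the input channel. To keep the formulas compact, I will omit position-related indices such as \(i\) and \(j\) below.

With per-layer quantization, the whole tensor shares a single scale and zero point.

The figure in the original post illustrates this:

The quantized convolution therefore becomes: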

\[S_3(q_3^{oc}-Z_3)=\sum_{ic}\sum_{m}\sum_{n}S_1(q_1^{ic}-Z_1)S_2(q_2^{ic}-Z_2) \tag{2} \]

from which we get:

\[q_3^{oc}=\frac{S_1S_2}{S_3}\sum_{ic}\sum_{m}\sum_{n}(q_1^{ic}-Z_1)(q_2^{ic}-Z_2)+Z_3 \tag{3} \]
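To make formula (3) concrete, here is a minimal C++ sketch for a single output element. It is only an illustration under assumptions: the `QParams` struct and the `per_layer_output` function are names invented for this example, the (ic, m, n) patch is assumed to be flattened already, the tensors are assumed to be uint8, and the multiplier is kept in float for readability.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical per-tensor (per-layer) affine parameters:
// real = scale * (q - zero_point).
struct QParams {
    float scale;
    int32_t zero_point;
};

// Formula (3) for one output element: accumulate in int32, requantize once.
// q1 and q2 hold the flattened (ic, m, n) input patch and kernel.
inline uint8_t per_layer_output(const std::vector<uint8_t>& q1, QParams p1,
                                const std::vector<uint8_t>& q2, QParams p2,
                                QParams p3) {
    assert(q1.size() == q2.size());
    int32_t acc = 0;
    for (std::size_t k = 0; k < q1.size(); ++k) {
        acc += (static_cast<int32_t>(q1[k]) - p1.zero_point) *
               (static_cast<int32_t>(q2[k]) - p2.zero_point);
    }
    // One multiplier S1*S2/S3 is applied after the whole sum.
    const float m = p1.scale * p2.scale / p3.scale;
    const long q3 = std::lround(m * static_cast<float>(acc)) + p3.zero_point;
    return static_cast<uint8_t>(std::min(255L, std::max(0L, q3)));
}
```

The point to notice is that the integer accumulation runs over all input channels and the scaling happens exactly once at the end.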

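What Is Per-Channel Quantization

When it comes to per-channel quantization, many people's first instinct is that every channel of the feature and of the kernel gets its own scale and zero point, as the figure in the original post shows:

Mathematically this is perfectly feasible, and accuracy would be higher, but it does not work as an engineering implementation. Let's see why.

Suppose we compute a scale and zero point for every channel of the feature and the kernel. Formula (2) then becomes: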

\[S_3(q_3^{oc}-Z_3)=\sum_{ic}\sum_{m}\sum_{n}S_1^{ic}(q_1^{ic}-Z_1^{ic})S_2^{ic}(q_2^{ic}-Z_2^{ic}) \tag{4} \]

and finally:

\[q_3^{oc}=\sum_{ic}\frac{S_1^{ic}S_2^{ic}}{S_3}\sum_{m}\sum_{n}(q_1^{ic}-Z_1^{ic})(q_2^{ic}-Z_2^{ic})+Z_3 \tag{5} \]

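The biggest difference from the per-layer case is that the factor \(\frac{S_1^{ic}S_2^{ic}}{S_3}\) can no longer be applied after the full summation. Each input channel has to be requantized with this factor first, and only then can the per-channel results be added up. Because the integer accumulation can no longer run across input channels, the convolution cannot be accelerated and the computational cost multiplies.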
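The cost described above can be seen in a small C++ sketch of the right-hand side of formula (4). This is an illustration only: the `Channel` struct and `per_input_channel_accumulate` are hypothetical names, and each channel's (m, n) window is assumed to be flattened.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// One input channel of a flattened (m, n) window with its own parameters.
struct Channel {
    std::vector<int32_t> q;  // quantized values
    float scale;             // S^{ic}
    int32_t zero_point;      // Z^{ic}
};

// Returns the real-valued sum on the right-hand side of formula (4),
// i.e. S3 * (q3 - Z3).
inline float per_input_channel_accumulate(const std::vector<Channel>& feat,
                                          const std::vector<Channel>& kern) {
    assert(feat.size() == kern.size());
    float real_sum = 0.0f;
    for (std::size_t ic = 0; ic < feat.size(); ++ic) {
        assert(feat[ic].q.size() == kern[ic].q.size());
        int32_t acc = 0;  // integer accumulation is only valid inside one channel
        for (std::size_t k = 0; k < feat[ic].q.size(); ++k) {
            acc += (feat[ic].q[k] - feat[ic].zero_point) *
                   (kern[ic].q[k] - kern[ic].zero_point);
        }
        // The rescale depends on ic, so it cannot be hoisted out of this loop.
        real_sum += feat[ic].scale * kern[ic].scale * static_cast<float>(acc);
    }
    return real_sum;
}
```

Compared with the per-layer case, there is now one floating-point (or fixed-point) rescale per input channel instead of one per output element.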

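In practice, therefore, per-channel quantization is done the way the next figure in the original post shows:

The difference is that the feature still shares one scale and zero point across the whole tensor, while each kernel gets its own scale and zero point (note: each kernel, that is, each output channel, not each channel of a kernel).

Google's whitepaper stresses the same point: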

Improved accuracy can be obtained by adapting the quantizer parameters to each kernel within the tensor....per-channel quantization has a different scale and offset for each convolutional kernel. We do not consider per-channel quantization for activations as this would complicate the inner product computations at the core of conv and matmul operations.
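As a rough sketch of what "a different scale and offset for each convolutional kernel" means in code, the C++ below derives one set of uint8 affine parameters per kernel from that kernel's min/max. The names `QParams`, `params_from_range` and `per_kernel_params` are made up for this example, the flat `[num_kernels * kernel_size]` weight layout is an assumption, and real toolchains may choose ranges differently (for example symmetric int8).

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

struct QParams {
    float scale;
    int32_t zero_point;
};

// Affine uint8 parameters from a value range. The range is widened to
// include 0 so that real 0 is exactly representable.
inline QParams params_from_range(float lo, float hi) {
    lo = std::min(lo, 0.0f);
    hi = std::max(hi, 0.0f);
    if (hi == lo) return {1.0f, 0};  // all-zero kernel: any scale works
    const float scale = (hi - lo) / 255.0f;
    const int32_t zp = static_cast<int32_t>(std::lround(-lo / scale));
    return {scale, zp};
}

// weights: flat [num_kernels * kernel_size]; returns one QParams per kernel.
inline std::vector<QParams> per_kernel_params(const std::vector<float>& weights,
                                              std::size_t num_kernels) {
    assert(!weights.empty() && num_kernels > 0);
    assert(weights.size() % num_kernels == 0);
    const std::ptrdiff_t kernel_size =
        static_cast<std::ptrdiff_t>(weights.size() / num_kernels);
    std::vector<QParams> out;
    out.reserve(num_kernels);
    for (std::size_t oc = 0; oc < num_kernels; ++oc) {
        const auto first =
            weights.begin() + static_cast<std::ptrdiff_t>(oc) * kernel_size;
        const auto mm = std::minmax_element(first, first + kernel_size);
        out.push_back(params_from_range(*mm.first, *mm.second));
    }
    return out;
}
```

The activation tensor would still get a single `QParams`, in line with the whitepaper's remark about activations.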

Under this definition, per-channel quantization looks very similar to per-layer:

\[S_3(q_3^{oc}-Z_3)=\sum_{ic}\sum_{m}\sum_{n}S_1(q_1^{ic}-Z_1)S_2(q_2^{ic}-Z_2) \tag{6} \]

Rearranging gives:

\[q_3^{oc}=\frac{S_1S_2}{S_3}\sum_{ic}\sum_{m}\sum_{n}(q_1^{ic}-Z_1)(q_2^{ic}-Z_2)+Z_3 \tag{7} \]

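Compare this with formula (3) and you will find that (7) and (3) look almost identical. The difference is that the \(S_2\) attached to \(q_3^{oc}\) varies with \(oc\), because every kernel has its own \(S_2\). So each \(oc\) has to be requantized separately with its own \(\frac{S_1S_2}{S_3}\), whereas in per-layer quantization we can finish computing the whole output feature and requantize it in one go.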
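The per-output-channel requant in formula (7) can be sketched as follows. This is a simplified illustration rather than any library's implementation: `requant_per_channel` is a hypothetical name, `acc[oc]` is assumed to hold the finished int32 sum for output channel `oc`, the output is assumed to be uint8, and the multiplier is kept in float.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// acc[oc]: int32 sum over (ic, m, n) for output channel oc.
// s2[oc]:  scale of kernel oc; s1, s3, z3 are shared by the whole tensor.
inline std::vector<uint8_t> requant_per_channel(const std::vector<int32_t>& acc,
                                                float s1,
                                                const std::vector<float>& s2,
                                                float s3, int32_t z3) {
    assert(acc.size() == s2.size());
    std::vector<uint8_t> q3(acc.size());
    for (std::size_t oc = 0; oc < acc.size(); ++oc) {
        const float m = s1 * s2[oc] / s3;  // differs per output channel
        const long v = std::lround(m * static_cast<float>(acc[oc])) + z3;
        q3[oc] = static_cast<uint8_t>(std::min(255L, std::max(0L, v)));
    }
    return q3;
}
```

The integer accumulation inside each output channel is untouched; only the final multiplier is looked up per channel, which is why this variant stays cheap.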

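Engineering Implementation

Because of limitations in pytorch, I cannot simulate per-channel quantization at the Python level; I can only do the fake quantize used in quantization-aware training, which is not hard. So let's look directly at how some low-level inference libraries implement it.

NCNN and tflite serve as examples here.

NCNN

First the NCNN implementation. NCNN's quantization scheme is in fact per-channel by design. The snippet below shows how NCNN quantizes the kernel and the feature (thanks to 田子宸 on Zhihu for the walkthrough):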

// ========= Quantize the kernels =========
for (int n=0; n<num_output; n++)    // quantize each kernel separately
{
  Layer* op = ncnn::create_layer(ncnn::LayerType::Quantize);

  ncnn::ParamDict pd;
  pd.set(0, weight_data_int8_scales[n]);// set the scale parameter

  op->load_param(pd);

  op->create_pipeline(opt_cpu);

  ncnn::Option opt;
  opt.blob_allocator = int8_weight_data.allocator;
  // split off this kernel's slice, quantize it, then reassemble
  const Mat weight_data_n = weight_data.range(weight_data_size_output * n, weight_data_size_output);
  Mat int8_weight_data_n = int8_weight_data.range(weight_data_size_output * n, weight_data_size_output);
  op->forward(weight_data_n, int8_weight_data_n, opt);    // compute the quantized values

  delete op;
}

weight_data = int8_weight_data; // replace the original weight_data


// =========== Quantize the input feature ============

// initialize the quantize/dequantize op layers for the input/output
if (use_int8_inference)
{
  // create the quantize op; it is not run here
  quantize = ncnn::create_layer(ncnn::LayerType::Quantize);
  {
    ncnn::ParamDict pd;
    pd.set(0, bottom_blob_int8_scale);// all inputs share the same scale

    quantize->load_param(pd);

    quantize->create_pipeline(opt_cpu);
  }

  // create the dequantize ops; they are not run here
  dequantize_ops.resize(num_output);  // each kernel's weight scale is different,
  for (int n=0; n<num_output; n++)    // so the dequantize scales differ as well
  {
    dequantize_ops[n] = ncnn::create_layer(ncnn::LayerType::Dequantize);

    float top_rescale = 1.f;

    if (weight_data_int8_scales[n] == 0)
      top_rescale = 0;
    else    // dequantize scale = 1/(input scale * weight scale), i.e. the inverse mapping
      top_rescale = 1.f / (bottom_blob_int8_scale * weight_data_int8_scales[n]);

    dequantize_ops[n]->load_param(pd);
    
    // (some code omitted)
    ....

    dequantize_scales.push_back(top_rescale);
  }
}

quantize->forward(bottom_blob, bottom_blob_int8, opt_g);    // run the quantization

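As you can see, kernel quantization sets a separate scale for each kernel, while the input feature is quantized with a single scale.

In addition, the dequantize step sets a separate dequantization scale for each output channel, computed in the code as 1 / (bottom_blob_int8_scale * weight_data_int8_scales[n]). Note that NCNN's scales multiply the float value during quantization, so they are the reciprocals of the \(S\) used in the formulas above; in this article's notation the factor is \(S_1S_2\). As mentioned earlier, every kernel has its own \(S_2\) in per-channel quantization, so each output channel has to be dequantized separately.

The code also shows that NCNN uses symmetric quantization (no zero point appears anywhere), and that after the integer computation the int32 result is dequantized back to float32 rather than requantized, so this is not full-integer quantization. The dequantize step therefore corresponds to formula (6) rather than formula (7).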
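The NCNN flow described here (symmetric quantization, int32 accumulation, then dequantization back to float with a per-kernel factor) can be sketched in a few lines of C++. This is not NCNN code: the function names are invented for illustration, and the convention q = round(v * scale) clamped to [-127, 127] is an assumption made for this sketch.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstdint>

// Assumed convention: q = round(v * scale), clamped to [-127, 127],
// with no zero point (symmetric quantization).
inline int8_t quantize_symmetric(float v, float scale) {
    const long q = std::lround(v * scale);
    return static_cast<int8_t>(std::min(127L, std::max(-127L, q)));
}

// Mirrors the top_rescale computation in the snippet above.
inline float dequantize_scale(float input_scale, float weight_scale) {
    return weight_scale == 0.f ? 0.f : 1.f / (input_scale * weight_scale);
}

// Dot product for one kernel: int8 inputs, int32 accumulator, float output.
// weight_scale belongs to this kernel only; input_scale is shared.
inline float int8_dot_to_float(const int8_t* x, const int8_t* w, int n,
                               float input_scale, float weight_scale) {
    assert(x != nullptr && w != nullptr && n >= 0);
    int32_t acc = 0;
    for (int k = 0; k < n; ++k) {
        acc += static_cast<int32_t>(x[k]) * static_cast<int32_t>(w[k]);
    }
    return static_cast<float>(acc) * dequantize_scale(input_scale, weight_scale);
}
```

Calling `int8_dot_to_float` once per kernel, each time with that kernel's own `weight_scale`, reproduces the per-output-channel dequantization discussed above.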

tflite

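Reading tflite code always makes me uncomfortable. From that angle NCNN is a remarkable framework: the code is neatly structured and the modules are cleanly separated, which makes it very suitable for beginners. The tflite code, on the other hand, is heavily optimized and the module layout of the whole project is messy, so you hardly know where to start reading.

The following snippet is taken from tflite (link: https://github.com/tensorflow/tensorflow/blob/v1.15.0/tensorflow/lite/kernels/internal/reference/conv.h#L101)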

inline void Conv(const ConvParams& params, const RuntimeShape& input_shape,
                 const uint8* input_data, const RuntimeShape& filter_shape,
                 const uint8* filter_data, const RuntimeShape& bias_shape,
                 const int32* bias_data, const RuntimeShape& output_shape,
                 uint8* output_data, const RuntimeShape& im2col_shape,
                 uint8* im2col_data, void* cpu_backend_context) {
  // (some code omitted)
  ....
   
  for (int batch = 0; batch < batches; ++batch) {
    for (int out_y = 0; out_y < output_height; ++out_y) {
      for (int out_x = 0; out_x < output_width; ++out_x) {
        // compute each output channel separately
        for (int out_channel = 0; out_channel < output_depth; ++out_channel) {  
          const int in_x_origin = (out_x * stride_width) - pad_width;
          const int in_y_origin = (out_y * stride_height) - pad_height;
          int32 acc = 0;
          for (int filter_y = 0; filter_y < filter_height; ++filter_y) {
            for (int filter_x = 0; filter_x < filter_width; ++filter_x) {
              for (int in_channel = 0; in_channel < input_depth; ++in_channel) {
                const int in_x = in_x_origin + dilation_width_factor * filter_x;
                const int in_y =
                    in_y_origin + dilation_height_factor * filter_y;
                // If the location is outside the bounds of the input image,
                // use zero as a default value.
                if ((in_x >= 0) && (in_x < input_width) && (in_y >= 0) &&
                    (in_y < input_height)) {
                  int32 input_val = input_data[Offset(input_shape, batch, in_y,
                                                      in_x, in_channel)];
                  int32 filter_val =
                      filter_data[Offset(filter_shape, out_channel, filter_y,
                                         filter_x, in_channel)];
                  acc +=
                      (filter_val + filter_offset) * (input_val + input_offset);
                }
              }
            }
          }
          if (bias_data) {
            acc += bias_data[out_channel];
          }
          
          // requantize as soon as each output channel is done,
          // using a fixed multiplier + bitshift, so there is no need to dequantize back to fp32
          acc = MultiplyByQuantizedMultiplier(acc, output_multiplier,
                                              output_shift);
          acc += output_offset;
          acc = std::max(acc, output_activation_min);
          acc = std::min(acc, output_activation_max);
          output_data[Offset(output_shape, batch, out_y, out_x, out_channel)] =
              static_cast<uint8>(acc);
        }
      }
    }
  }
}

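As the loop shows, tflite also requantizes every output channel separately as soon as its accumulation finishes (at least in version 1.15, the one linked above). One caveat: in this uint8 reference kernel, output_multiplier, output_shift and filter_offset are single values shared by all output channels, so the snippet shows the structure of the computation rather than per-kernel parameters. As far as I know, the int8 per-channel kernel (ConvPerChannel in reference/integer_ops/conv.h) indexes the multiplier and shift by out_channel. Compared with NCNN, the improvement is that nothing is dequantized back to float: a fixed multiplier plus bitshift produces the input of the next layer directly, which corresponds to formula (7).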
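For readers unfamiliar with the "fixed multiplier + bitshift" trick, here is a simplified C++ sketch of the idea: a positive real multiplier such as \(\frac{S_1S_2}{S_3}\) is split into a Q31 integer mantissa and a power-of-two exponent, so the requant needs only integer operations. The names `FixedMultiplier`, `to_fixed` and `apply_fixed` are hypothetical, and the rounding here (round half up via an arithmetic right shift) is a simplification; it is not claimed to match tflite's MultiplyByQuantizedMultiplier bit for bit.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

struct FixedMultiplier {
    int32_t mantissa;  // Q31 value representing a number in [0.5, 1)
    int exponent;      // real multiplier ~= mantissa / 2^31 * 2^exponent
};

// Decompose a positive real multiplier into mantissa and exponent.
inline FixedMultiplier to_fixed(double real) {
    assert(real > 0.0);
    int exponent = 0;
    // real = frac * 2^exponent with frac in [0.5, 1)
    const double frac = std::frexp(real, &exponent);
    int64_t q = std::llround(frac * static_cast<double>(1LL << 31));
    if (q == (1LL << 31)) {  // rounding pushed frac up to 1.0
        q /= 2;
        ++exponent;
    }
    return {static_cast<int32_t>(q), exponent};
}

// Computes round(acc * real) with integer arithmetic only.
// Assumes an arithmetic right shift for negative values.
inline int32_t apply_fixed(int32_t acc, FixedMultiplier m) {
    const int total_shift = 31 - m.exponent;
    assert(total_shift > 0 && total_shift < 63);
    const int64_t prod = static_cast<int64_t>(acc) * m.mantissa;
    const int64_t rounding = int64_t{1} << (total_shift - 1);
    return static_cast<int32_t>((prod + rounding) >> total_shift);
}
```

In a per-channel kernel one would precompute one `FixedMultiplier` per output channel offline and index it by the output channel inside the loop, then add the output zero point and clamp as in the snippet above.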

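Summary

This post covered how per-channel quantization works and why it is done this way. In short, per-channel quantization computes separate quantization parameters for each kernel, and everything else is the same as per-layer. The main motivation is computational performance. Once again we see that model quantization is tightly coupled with the underlying implementation.

References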

https://zhuanlan.zhihu.com/p/71881443
Quantizing deep convolutional networks for efficient inference: A whitepaper
https://github.com/tensorflow/tensorflow/blob/v1.15.0/tensorflow/lite/kernels/internal/reference/conv.h#L101

Feel free to follow my WeChat official account 大白話AI, where I try to explain AI in plain language.
