Paper notes: Decoders Matter for Semantic Segmentation: Data-Dependent Decoding Enables Flexible Feature Aggregation



2019-04-24 16:53:25

Paper: https://arxiv.org/pdf/1903.02120.pdf

Code (unofficial PyTorch implementation): https://github.com/LinZhuoChen/DUpsampling

1. Background and Motivation:

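In conventional encoder-decoder models, the decoder relies on bilinear interpolation to raise the resolution. But is such a crude approach really suited to segmentation? The authors propose a new module, data-dependent upsampling (DUpsampling), to replace bilinear interpolation. Its advantage is that it exploits the redundancy in the label space of semantic segmentation and can recover the pixel-wise prediction from low-resolution outputs. So how is this done? In DeepLabv3+, the decoder is defined as shown in the figure below: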

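This framework brings the following problems:

1). The overall stride of the encoder has to be reduced with multiple dilated (atrous) convolutions, which is computationally expensive.

2). The decoder usually needs to fuse low-level features. Because of the limitations of bilinear upsampling, the fineness of the final prediction is determined by the resolution of the fused low-level features. As a result, to obtain a high-resolution prediction, the decoder has to fuse high-resolution low-level features. This constraint narrows the design space of feature aggregation and leads to a suboptimal combination of features. In their experiments, the authors find that a better feature aggregation strategy can be designed once aggregation is freed from this resolution constraint.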

2. Our Approach:

2.1 Beyond Bilinear: Data-dependent Upsampling:

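Let F denote the output features produced by the encoder CNN for an input image, and let Y denote the corresponding ground truth. In a typical segmentation setup, the loss function is as follows: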
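The image holding this equation did not survive in these notes. Based on the description that follows (a cross-entropy loss computed on the bilinearly upsampled F), it can be written roughly as below. The explicit softmax is an assumption drawn from the usual cross-entropy setup and from Section 2.2; the number (1) matches the later reference to Eq. 1.

```latex
L(F, Y) = \mathrm{Loss}\big(\mathrm{softmax}(\mathrm{bilinear}(F)),\, Y\big) \tag{1}
```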

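Here the loss is usually the cross-entropy loss, and bilinear upsampling brings F to the same resolution as Y. The authors argue that bilinear interpolation is not the best choice for this step. Instead of computing the error between bilinear(F) and Y, they compute the error between F and a lower-resolution (compressed) version of Y, which has the same resolution as F. To compress Y, they look for a transformation that minimizes, under some metric, the reconstruction error between Y and its compressed version. Specifically, Y is first partitioned into a grid of sub-windows, and each sub-window S is reshaped into a {0, 1} vector v. Each v is then compressed into a low-dimensional vector x, and the x vectors are stacked horizontally and vertically to form the final compressed Y.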

The transformation can therefore be expressed with two matrices, P and W:
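The equation image is missing here. From the surrounding text (P compresses v into x, and W reconstructs v from x), it presumably reads as follows. The symbol \tilde{v} for the reconstructed vector is an assumed name; in terms of shapes, if v has length N (an r × r window with C classes gives N = r·r·C) and x has length \tilde{C}, then P is \tilde{C} × N and W is N × \tilde{C}.

```latex
x = P v, \qquad \tilde{v} = W x \tag{2}
```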

P and W can be learned on the training set by minimizing the reconstruction error:
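The objective itself was an image that is no longer present. A plausible reconstruction, assuming a squared L2 reconstruction error (which is what makes the PCA closed-form solution mentioned next applicable) and summing over all sub-window vectors v in the training set, is:

```latex
P^{*}, W^{*} = \arg\min_{P,\, W} \sum_{v} \lVert v - W P v \rVert^{2} \tag{3}
```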

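Using PCA, the authors obtain a closed-form solution to this objective, which yields a compressed version of the ground truth Y. With this compressed ground truth as the learning target, a network can be pre-trained with a regression loss, as follows: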
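The equation is missing from these notes. Writing \tilde{Y} (an assumed symbol) for the compressed ground truth, which has the same resolution as F, and taking the l2 loss mentioned in the next paragraph as the example regression loss, Eq. 4 would look like this:

```latex
L(F, Y) = \lVert F - \tilde{Y} \rVert^{2} \tag{4}
```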

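Any regression loss, such as the l2 loss, can be used in Eq. 4. The authors argue, however, that it is more direct to compute the loss in the space of Y. So instead of compressing Y, they upsample F with the learned reconstruction matrix W and compute the error between the decompressed F and Y: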
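The corresponding equation image is missing. By analogy with Eq. 1, with bilinear upsampling replaced by the learned upsampling, it can be sketched as below; the explicit softmax and the number (5) are inferred from context rather than copied from the notes.

```latex
L(F, Y) = \mathrm{Loss}\big(\mathrm{softmax}(\mathrm{DUpsample}(F)),\, Y\big) \tag{5}
```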

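The DUpsample(F) procedure is shown in the figure below:

With this linear transformation, DUpsample(F) applies the linear upsampling Wf to every feature vector f. Compared with Eq. 1, the bilinear upsampling has been replaced by a data-dependent upsampling whose transformation matrix is learned from the ground-truth labels. This upsampling is equivalent to applying a 1×1 convolution along the spatial dimensions, with the convolution kernels stored in W. The decompression process is the one illustrated in Figure 3 above.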
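The whole pipeline of this section (split Y into windows, fit P and W, then upsample with W) can be illustrated with a small NumPy sketch. This is only a toy under stated assumptions, not the authors' implementation: the helper names `labels_to_windows`, `fit_pca`, and `dupsample` are made up for illustration, the PCA is computed by SVD without mean-centering, and the fit uses a single tiny label map rather than a training set.

```python
import numpy as np

def labels_to_windows(Y, r):
    """Split a one-hot label map Y of shape (H, W, C) into non-overlapping
    r x r windows and flatten each window into a vector of length r*r*C."""
    H, W, C = Y.shape
    h, w = H // r, W // r
    blocks = Y.reshape(h, r, w, r, C).transpose(0, 2, 1, 3, 4)  # (h, w, r, r, C)
    return blocks.reshape(h, w, r * r * C)

def fit_pca(V, dim):
    """Closed-form minimizer of sum ||v - W P v||^2 via SVD.
    Assumption: no mean-centering. Returns P (dim, N) and W (N, dim)."""
    flat = V.reshape(-1, V.shape[-1])                 # (num_windows, N)
    _, _, Vt = np.linalg.svd(flat, full_matrices=False)
    P = Vt[:dim]                                      # compress: x = P v
    W = P.T                                           # reconstruct: v ~ W x
    return P, W

def dupsample(F, W, r, C):
    """Apply W to every feature vector (the same as a 1x1 convolution),
    then rearrange each length-N output into an r x r x C window."""
    h, w, _ = F.shape
    out = F @ W.T                                     # (h, w, N)
    out = out.reshape(h, w, r, r, C).transpose(0, 2, 1, 3, 4)
    return out.reshape(h * r, w * r, C)

# Toy example: a blocky 16x16 label map with 3 classes, window size r = 4.
r, C = 4, 3
rows, cols = np.indices((16, 16))
labels = (rows // 6 + cols // 5) % C
Y = np.eye(C)[labels]                                 # one-hot, (16, 16, 3)

V = labels_to_windows(Y, r)                           # (4, 4, 48)
P, W = fit_pca(V, dim=16)
X = V @ P.T                                           # compressed labels, (4, 4, 16)
Y_rec = dupsample(X, W, r, C)                         # back to (16, 16, 3)
print(Y_rec.shape, float(np.abs(Y_rec - Y).max()))
```

In this toy case each 48-dimensional window vector is compressed to 16 dimensions. In a real setting, P and W would be fitted over all windows of the training labels, and the compressed dimension would match the channel count of F.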

2.2 Incorporating DUpsampling with Adaptive-temperature Softmax:

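So far we have seen how DUpsampling can replace bilinear upsampling in the decoder; next comes how to incorporate it into the encoder-decoder framework. Because DUpsampling can be implemented as a 1×1 convolution, it can be plugged directly into the framework. Doing so, however, runs into an optimization problem: training converges very slowly. To address this, the authors use a softmax function with temperature, that is, they add a temperature T to the original softmax to soften or sharpen its activation: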
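The equation image is missing here. The standard form of a softmax with temperature, which matches the description above, is given below; z_i denotes the i-th input logit (an assumed symbol), and the number (6) is inferred from the order of the equations.

```latex
\mathrm{softmax}(z_i) = \frac{\exp(z_i / T)}{\sum_{j} \exp(z_j / T)} \tag{6}
```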

The authors find that T can be learned automatically through back-propagation, without any manual tuning.
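To see what the temperature does, here is a minimal pure-Python sketch. It evaluates the softmax at a few fixed values of T, whereas in the paper T is a learned parameter; the function name `softmax_with_temperature` and the example logits are illustrative only.

```python
import math

def softmax_with_temperature(logits, T=1.0):
    """Softmax over logits / T. T < 1 sharpens the output, T > 1 softens it."""
    scaled = [z / T for z in logits]
    m = max(scaled)                          # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]
p_plain = softmax_with_temperature(logits, T=1.0)
p_sharp = softmax_with_temperature(logits, T=0.5)
p_soft = softmax_with_temperature(logits, T=2.0)
print(p_plain, p_sharp, p_soft)
```

A smaller T concentrates more probability mass on the largest logit, which is the sharpening effect the section relies on.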

2.3 Flexible Aggregation of Convolutional Features:

The original feature aggregation scheme, however, has the following problems:

The main problems are:
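The formulation of the original aggregation was an image that is no longer present. Judging from the two problems listed below (a CNN f applied after upsampling, and low-level features F_i fused at the resolution of F), it can be written roughly as follows. F_last, standing for the last encoder feature map, is an assumed name, and the number (7) is inferred from the order of the equations.

```latex
F = f\big(\mathrm{concat}(F_i,\ \mathrm{upsample}(F_{\mathrm{last}}))\big), \qquad
Y = \mathrm{upsample}(F) \tag{7}
```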

1). f is applied after upsampling. Since f is a CNN, whose amount of computation depends on the spatial size of inputs, this arrangement would render the decoder inefficient computationally. Moreover, the computational overhead prevents the decoder from exploiting features at a very low level.

2). The resolution of fused low-level features Fi is equivalent to that of F, which is typically around 1/4 resolution of the final prediction due to the incapable bilinear used to produce the final pixel-wise prediction. In order to obtain high-resolution prediction, the decoder can only choose the feature aggregation with high-resolution low-level features.

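Another highlight of the paper is that multi-level feature fusion is no longer limited to low-level features. The authors downsample the original feature maps to a common spatial size and then concatenate them along the channel dimension.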
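The equation for the proposed aggregation is missing from these notes. Following the description above (downsample, concatenate along channels, then recover the full resolution with DUpsampling), it plausibly reads as below. F_i denotes a lower-level feature map, F_last the last encoder feature map, and f a CNN applied to the concatenated features; F_last and the number (8) are assumptions.

```latex
F = f\big(\mathrm{concat}(\mathrm{downsample}(F_i),\ F_{\mathrm{last}})\big), \qquad
Y = \mathrm{DUpsample}(F) \tag{8}
```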

The authors find that this downsample-then-concatenate scheme improves the final segmentation accuracy only when the proposed DUpsampling layer is used.
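The downsample-then-concatenate step can be sketched in NumPy as follows. This is a shape-level illustration only: average pooling is just one possible choice of downsampling, the helper names `avg_pool` and `aggregate` are hypothetical, feature sizes are assumed to be integer multiples of the target size, and the CNN that would process the fused features afterwards is left out.

```python
import numpy as np

def avg_pool(x, k):
    """Downsample a feature map of shape (H, W, C) by an integer factor k
    using average pooling (an assumed choice of downsampling)."""
    H, W, C = x.shape
    return x.reshape(H // k, k, W // k, k, C).mean(axis=(1, 3))

def aggregate(features, f_last):
    """Downsample every feature map to the spatial size of f_last and
    concatenate everything along the channel axis."""
    h = f_last.shape[0]
    pooled = [avg_pool(feat, feat.shape[0] // h) for feat in features]
    return np.concatenate(pooled + [f_last], axis=-1)

rng = np.random.default_rng(0)
low = rng.standard_normal((32, 32, 8))    # high-resolution, low-level features
mid = rng.standard_normal((16, 16, 16))   # intermediate features
last = rng.standard_normal((8, 8, 32))    # last encoder feature map

fused = aggregate([low, mid], last)
print(fused.shape)
```

Because all fusion happens at the lowest resolution, any convolution applied to `fused` operates on a small spatial grid, which is where the computational saving over upsample-then-fuse comes from.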

3. Experiment:
