CCNet: Criss-Cross Attention for Semantic Segmentation

Reposted article. 2019-12-25 21:16. Category: paper notes.

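I have wanted to write up reading notes on this paper for a long time, but I kept working through its code and put the notes off until now.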

《Abstract》

Long-range dependencies can capture useful contextual information to benefit visual understanding problems.

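That is, long-range dependencies capture useful contextual information that helps with visual understanding problems. This is the premise the authors start from.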

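The authors then present their contribution, CCNet, which gathers for each pixel the contextual information of the pixels on its criss-cross path. CCNet has the following properties: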

1、GPU memory friendly. Compared with the non-local block, the recurrent criss-cross attention module requires 11× less GPU memory usage.

2、High computational efficiency. The recurrent criss-cross attention significantly reduces FLOPs by about 85% of the nonlocal block in computing long-range dependencies, i.e. it needs about 85% fewer FLOPs than the non-local block.

3、The state-of-the-art performance. The method performs well in experiments on Cityscapes, ADE20K, and the instance segmentation benchmark COCO.

Open-source code: https://github.com/speedinghzl/CCNet

《Introduction》

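Semantic segmentation is a fundamental task in computer vision: it assigns a class label to every pixel of an image. It has been studied actively over the past several years and is now used in autonomous driving, virtual reality, and image editing.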

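The authors then survey prior work from the angle of CCNet's strengths. They first point out the weakness of frameworks built on the FCN (fully convolutional network): "they are inherently limited to local receptive fields and short-range contextual information". They then review methods that capture long-range dependencies, together with their shortcomings, which motivates CCNet. The methods are: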

1、atrous spatial pyramid pooling module with multiscale dilation convolutions for contextual information aggregation

2、PSPNet with pyramid pooling module to capture contextual information

Their shortcomings are:

1、dilated convolution based methods collect information from a few surrounding pixels and can not generate dense contextual information

2、pooling based methods aggregate contextual information in a non-adaptive manner and the homogeneous contextual information is adopted by all image pixels (every pixel receives the same context)

Methods that generate dense, pixel-wise contextual information include:

1、aggregate contextual information for each position via a predicted attention map

2、utilizes a self-attention mechanism [10, 29], which enable a single feature from any position to perceive features of all the other positions.

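Both of these approaches are poorly optimized for efficiency and complexity. Is there a way to get the same dense context at a fraction of the cost? The authors propose the criss-cross attention module, which reduces the complexity from O((H×W)×(H×W)) to O((H×W)×(H+W−1)). That is roughly a speedup by a factor of (H×W)/(H+W−1), and the module is also GPU-memory friendly.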
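To make the complexity comparison concrete, here is a minimal sketch that counts the pairwise affinities computed in one attention pass. The function name and the feature-map size are assumptions chosen for illustration; they do not come from the paper or its code.

```python
def attention_pair_counts(height, width):
    """Count pairwise affinities for one attention pass over a height x width map."""
    positions = height * width
    # Non-local block: every position attends to every position.
    non_local = positions * positions
    # Criss-cross attention: every position attends to its own row and column.
    criss_cross = positions * (height + width - 1)
    return non_local, criss_cross


height, width = 97, 97  # hypothetical feature-map size
non_local, criss_cross = attention_pair_counts(height, width)
print("non-local pairs:  ", non_local)
print("criss-cross pairs:", criss_cross)
print("ratio:            ", non_local / criss_cross)
```

The ratio equals (H×W)/(H+W−1) for a single pass. A recurrent module that runs two passes does twice the criss-cross work, so the saving per full module is about half of that ratio.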

The authors summarize the contributions of CCNet as:

1、We propose a novel criss-cross attention module in this work, which can be leveraged to capture contextual information from long-range dependencies in a more efficient and effective way

2、We propose a CCNet by taking advantages of two recurrent criss-cross attention modules, achieving leading performance on segmentation-based benchmarks, including Cityscapes, ADE20K and MSCOCO.

《Related Work》

Related work is covered from two angles: semantic segmentation and attention models.

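Semantic segmentation: the paper traces how methods have developed since FCN and which details changed along the way; see the original paper.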

Attention models: several methods are cited as examples of how attention is used:

1、enhanced the representational power of the network by modeling channel-wise relationships in an attention mechanism.

2、use several attention masks to fuse feature maps or predictions from different branches.

3、applied a self-attention model on machine translation.

4、proposed the non-local module to generate the huge attention map by calculating the correlation matrix between each spatial point in the feature map, then the attention guided dense contextual information aggregation.

5、utilized self-attention mechanism to harvest the contextual information.

6、learned an attention map to aggregate contextual information for each individual point adaptively and specifically.

CCNet differs from these in that it does not generate a huge attention map to record the relationship for every pixel pair in the feature map.

《Method》

This section first introduces the network architecture, then the CCA module (criss-cross attention module, which captures long-range contextual information in the horizontal and vertical directions), and finally the recurrent criss-cross attention module, which captures dense, global context.

Network architecture

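In the architecture diagram, X is the feature map, the reduction module reduces the channel dimension, and the criss-cross attention module aggregates long-range contextual information along the horizontal and vertical directions. Applying the module twice (two loops) yields contextual information from the whole image.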
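Why two loops are enough can be checked with a small reachability sketch. It tracks only which positions can contribute to a given position, not the feature values. The helper names and the 5×7 grid are assumptions made for illustration.

```python
def criss_cross_path(row, col, height, width):
    """Positions in the same row or column as (row, col); the centre appears once."""
    path = [(row, c) for c in range(width)]
    path += [(r, col) for r in range(height) if r != row]
    return path


def receptive_positions(row, col, height, width, loops):
    """Positions whose features can reach (row, col) after `loops` criss-cross passes."""
    reached = {(row, col)}
    for _ in range(loops):
        # Each pass lets every already-reached position pull in its own row and column.
        reached = {q for (r, c) in reached
                   for q in criss_cross_path(r, c, height, width)}
    return reached


height, width = 5, 7  # hypothetical feature-map size
for loops in (1, 2):
    count = len(receptive_positions(2, 3, height, width, loops))
    print(loops, "loop(s):", count, "of", height * width, "positions")
```

After one loop the position sees only its row and column (11 of 35 positions here). After the second loop it sees all 35, because any position (r, c) shares a row with (r, 3), and (r, 3) lies in the column of (2, 3).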

The figure shows how the criss-cross attention (CCA) module works.

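Given a feature map H ∈ R^(C×W×H), the CCA module first applies two convolutional layers with 1×1 filters to H to generate two feature maps Q and K, with {Q, K} ∈ R^(C′×W×H). C′ is the number of channels of Q and K; it is smaller than C, so this step reduces the dimension. From Q and K, the Affinity operation then produces the attention maps A ∈ R^((H+W−1)×W×H). The Affinity operation works as follows:

At each position u in the spatial dimension of Q we obtain a vector Q_u ∈ R^(C′). From K we collect the feature vectors that lie in the same row or column as u, which gives the set Ω_u ∈ R^((H+W−1)×C′). The row and column of u contain H+W−1 positions in total (u itself counted once), and the i-th element Ω_(i,u) of Ω_u has C′ channels. The Affinity operation is:

d_(i,u) = Q_u · Ω_(i,u)^T

where d_(i,u) ∈ D is the degree of correlation between Q_u and Ω_(i,u), i = 1, ..., H+W−1, and D ∈ R^((H+W−1)×W×H). A softmax over the channel dimension of D then gives the attention map A.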
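The Affinity step can be sketched in plain Python as follows. This is a minimal illustration, not the authors' implementation: it assumes Q and K are already computed (the 1×1 convolutions are left out), it stores tensors as nested lists indexed [row][col][channel] rather than in the paper's channel-first layout, and the ordering of positions along the criss-cross path is an arbitrary choice. The softmax over the H+W−1 affinities follows the paper.

```python
import math


def criss_cross_path(row, col, height, width):
    """Positions in the same row or column as (row, col); the centre appears once."""
    path = [(row, c) for c in range(width)]
    path += [(r, col) for r in range(height) if r != row]
    return path  # length is height + width - 1


def affinity_attention(Q, K):
    """Q, K: nested lists [row][col][channel]. Returns weights A[row][col][i]."""
    height, width = len(Q), len(Q[0])
    A = []
    for row in range(height):
        A_row = []
        for col in range(width):
            q_u = Q[row][col]
            # d_(i,u): dot product of Q_u with each K vector on the criss-cross path.
            d = [sum(q * k for q, k in zip(q_u, K[r][c]))
                 for r, c in criss_cross_path(row, col, height, width)]
            # Softmax over the path entries; subtracting the max keeps exp() stable.
            peak = max(d)
            exp_d = [math.exp(x - peak) for x in d]
            total = sum(exp_d)
            A_row.append([e / total for e in exp_d])
        A.append(A_row)
    return A


# Tiny hypothetical input: a 2x3 map with 2 channels.
Q = [[[0.1 * (r + c), 0.2] for c in range(3)] for r in range(2)]
K = [[[1.0, 0.5 * c] for c in range(3)] for r in range(2)]
A = affinity_attention(Q, K)
print(len(A[0][0]), "weights per position, summing to", sum(A[0][0]))
```

Each position ends up with H+W−1 non-negative weights that sum to 1 (up to floating-point rounding), one for every position on its criss-cross path.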

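Similarly, another convolutional layer with 1×1 filters on H generates the feature map V ∈ R^(C×W×H). At each position u of V we obtain the set Φ_u ∈ R^((H+W−1)×C) of feature vectors of V that lie in the same row or column as u. The Aggregation operation weights these vectors by the attention map A and adds the input feature:

H′_u = Σ_i A_(i,u) Φ_(i,u) + H_u,  i = 1, ..., H+W−1

where H′_u is the feature vector at position u of the output H′ ∈ R^(C×W×H), and A_(i,u) is the scalar at channel i and position u of A. The rest is easy to follow.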
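The Aggregation step can be sketched in the same style: each output vector is the attention-weighted sum of the V vectors on the criss-cross path plus the input feature at that position. As before, this is an illustrative sketch under assumptions: tensors are nested lists indexed [row][col][channel], the attention weights are taken as given, and the path ordering is an arbitrary choice that must match the one used to build the weights.

```python
def criss_cross_path(row, col, height, width):
    """Positions in the same row or column as (row, col); the centre appears once."""
    path = [(row, c) for c in range(width)]
    path += [(r, col) for r in range(height) if r != row]
    return path


def aggregate(A, V, H_in):
    """A[row][col][i]: attention weights; V, H_in: nested lists [row][col][channel]."""
    height, width = len(V), len(V[0])
    channels = len(V[0][0])
    out = []
    for row in range(height):
        out_row = []
        for col in range(width):
            feat = list(H_in[row][col])  # start from the residual term H_u
            path = criss_cross_path(row, col, height, width)
            for weight, (r, c) in zip(A[row][col], path):
                for ch in range(channels):
                    feat[ch] += weight * V[r][c][ch]
            out_row.append(feat)
        out.append(out_row)
    return out


# Hypothetical 2x2 single-channel example with uniform attention weights.
V = [[[1.0], [2.0]], [[3.0], [4.0]]]
H_in = [[[0.0], [0.0]], [[0.0], [0.0]]]
A = [[[1.0 / 3.0] * 3 for _ in range(2)] for _ in range(2)]
print(aggregate(A, V, H_in))
```

With uniform weights each output is simply the mean of V over the position's row and column. For example, position (0, 0) averages the values 1, 2, and 3.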

《Experiments》

Omitted.

《Conclusion》

Omitted.
