Paper Reading: Learning Visual Question Answering by Bootstrapping Hard Attention

Posted 2018-08-05 20:27. Tags: Paper Reading / Object Detection and Tracking / Deep Learning / Visual Tracking

Learning Visual Question Answering by Bootstrapping Hard Attention

Google DeepMind ECCV-2018

Updated on 2020-03-11 14:58:12

Paper: https://arxiv.org/abs/1808.00300

Code: https://github.com/gnouhp/PyTorch-AdaHAN

1. Background and Motivation:

This paper attempts to use hard attention alone to pick out the most useful features for learning the VQA task.

Soft Attention:

Existing attention models are predominantly based on soft attention, in which all information is adaptively re-weighted before being aggregated. This can improve accuracy by isolating important information and avoiding interference from unimportant information.

Hard Attention:

It has the potential to improve accuracy and learning efficiency by focusing computation on the important parts of an image. Beyond this, it offers better computational efficiency because it only fully processes the information deemed most relevant. However, hard attention has a serious drawback: because the choice of which information to process is discrete and thus non-differentiable, gradients cannot be backpropagated into the selection mechanism to support gradient-based optimization. Policy-gradient methods can work around the non-differentiability through sampling, but this remains a very active research area.

2. Approach Details:

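As shown in Figure 2 of the paper, the authors encode the given image with a CNN and encode the question into a fixed-length vector with an LSTM. The question feature is then tiled over every spatial location and concatenated with the image features. After the features are fused, attention is applied over the spatial locations, and finally the features are aggregated with either sum-pooling or relational modules. The framework can be trained end to end with a standard logistic regression loss.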

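2.1. Attention Mechanism:

The authors first review the standard form of soft attention and then introduce the hard attention method of this paper:

1). Soft Attention. Omitted here.

2). Hard Attention. The main contribution of this paper is a new hard attention mechanism. It produces a binary mask over spatial locations, and only the features at the selected locations are passed on for further processing. The authors call the resulting model the hard attention network (HAN). The key idea is to use the L2-norm of the activations at each spatial location as a proxy for the relevance of that location. The relationship between L2-norm and relevance is an emergent property of trained CNN features and requires no additional constraints or objectives. Earlier work found, for example, that in an ImageNet-pretrained representation of an image of a cat and a dog, the largest feature norms appear above the cat and dog face, even though the representation was trained purely for classification.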

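Let $x_{ij}$ denote the CNN cell at spatial location ij and q the feature representation of the question. The authors first map q and x into a feature space of the same dimensionality:

$$\hat{x} = \mathrm{CNN}^{1\times 1}(x), \qquad \hat{q} = \mathrm{MLP}(q)$$

where $\mathrm{CNN}^{1\times 1}$ denotes a 1×1 convolutional network and MLP a multi-layer perceptron. The question feature is then tiled over all spatial locations and added element-wise to the image feature map:

$$m_{ij} = \hat{x}_{ij} + \hat{q}$$

Element-wise addition keeps the dimensionality of each input unchanged. Next, the presence vector p is computed, which measures the relevance of each entity given the question:

$$p_{ij} = \|m_{ij}\|_2$$

where $\|\cdot\|_2$ denotes the L2-norm. To select k entities from m for further processing, the indices of the top-k entries of p, l = [l1, l2, ... , lk], are used to form $m_{out} = [m_{l_1}, m_{l_2}, \ldots, m_{l_k}]$. These features are passed to the decoder module, and gradients flow back through the selected features.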
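The fusion and top-k selection steps described above can be sketched in plain Python. This is only an illustration and not the authors' code or the linked PyTorch-AdaHAN repository: the function names `fuse`, `presence`, and `hard_attention_topk` are hypothetical, the inputs are assumed to be already embedded to the same dimensionality, and the toy numbers are made up.

```python
import math

def fuse(x_hat, q_hat):
    """Add the question embedding to every spatial cell (tile + element-wise add)."""
    return [[[xv + qv for xv, qv in zip(cell, q_hat)] for cell in row]
            for row in x_hat]

def presence(m):
    """L2-norm of the feature vector at every cell; m has shape [h][w][d]."""
    return [[math.sqrt(sum(v * v for v in cell)) for cell in row] for row in m]

def hard_attention_topk(m, k):
    """Return the indices and features of the k cells with the largest L2-norm."""
    p = presence(m)
    cells = [(i, j) for i in range(len(m)) for j in range(len(m[0]))]
    # Rank cell indices by presence, largest first, and keep the first k.
    top = sorted(cells, key=lambda ij: p[ij[0]][ij[1]], reverse=True)[:k]
    return top, [m[i][j] for i, j in top]

# Toy 2x2 feature map with d = 2 (illustrative values only).
x_hat = [[[0.0, 0.0], [3.0, 4.0]],
         [[1.0, 0.0], [0.0, 2.0]]]
q_hat = [1.0, 0.0]

m = fuse(x_hat, q_hat)
indices, m_out = hard_attention_topk(m, k=2)
print(indices)  # spatial positions of the selected cells
print(m_out)    # the features that would be passed to the decoder
```

Only the selected feature vectors are returned, so any later computation touches k cells instead of all w*h cells. The index selection itself carries no gradient, which matches the text: learning signal reaches the network only through the selected features.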

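The components above make up HAN. The authors also propose Adaptive-HAN (AdaHAN), which selects the number of entities adaptively instead of using a fixed k. The motivation is that in VQA different questions require attending to different regions, which also vary in size. The core idea is to make the presence vector p "compete" against a threshold. However, because the norm above is unbounded, the network could reach a trivial solution by producing very large presence values so that every entity is selected. To avoid this, the authors apply a softmax operator to p. All entities then enter the competition, and only those whose normalized presence exceeds the threshold $\tau$ are selected:

$$m_{out} = \{\, m_{ij} : \mathrm{softmax}(p)_{ij} > \tau \,\}$$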

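Although the threshold could be chosen by hyperparameter search, the authors set it to 1/(w*h), where w and h are the spatial dimensions of the input $x_{ij}$. Because the softmax outputs sum to 1 over the w*h cells, this value is exactly the uniform-attention level, so a cell is kept only if its share is above average. Compared with the popular soft-attention mechanisms, the proposed method requires no additional learned parameters. HAN requires one hyperparameter (the number of selected entities, k), while AdaHAN requires none.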
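The adaptive selection rule can be illustrated with a small pure-Python sketch. It assumes the presence values are already computed and stored as an [h][w] nested list. The name `adaptive_hard_attention` is hypothetical and the inputs are made-up numbers, so this is a reading aid and not the reference implementation.

```python
import math

def softmax(values):
    """Numerically stable softmax over a flat list of numbers."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def adaptive_hard_attention(p):
    """Select cells whose softmax-normalized presence exceeds 1 / (w * h)."""
    h, w = len(p), len(p[0])
    flat = [v for row in p for v in row]   # row-major flattening
    probs = softmax(flat)
    threshold = 1.0 / (w * h)              # uniform-attention level
    return [(n // w, n % w) for n, prob in enumerate(probs) if prob > threshold]

# One dominant cell: only that cell survives the threshold.
one_peak = adaptive_hard_attention([[0.0, 5.0],
                                    [1.0, 2.0]])
# Two equally strong cells: both are selected, without changing any setting.
two_peaks = adaptive_hard_attention([[3.0, 3.0],
                                     [0.0, 0.0]])
print(one_peak, two_peaks)
```

The two calls show the adaptive behaviour: the number of selected cells follows the shape of the presence map and is not fixed in advance.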

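2.2. Feature Aggregation:

Sum Pooling. After attention, a simple way to reduce the set of feature vectors is to pool them into a single fixed-length vector. In a soft-attention module with an attention weight vector w, the weights are simply multiplied with the inputs and the results summed, $\sum_{ij} w_{ij} x_{ij}$. Given the features selected by hard attention, an analogous pooling operation is:

$$\sum_{\kappa=1}^{k} m_{l_\kappa}$$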
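To make the contrast between the two pooling variants concrete, here is a minimal pure-Python sketch. The helper names `soft_sum_pool` and `hard_sum_pool` are hypothetical, and the feature map, weights, and selected indices are made-up toy values.

```python
def soft_sum_pool(x, w):
    """Soft attention pooling: weighted sum over all cells, sum_ij w_ij * x_ij."""
    d = len(x[0][0])
    pooled = [0.0] * d
    for x_row, w_row in zip(x, w):
        for cell, weight in zip(x_row, w_row):
            for c in range(d):
                pooled[c] += weight * cell[c]
    return pooled

def hard_sum_pool(m, indices):
    """Hard attention pooling: unweighted sum over the selected cells only."""
    d = len(m[0][0])
    pooled = [0.0] * d
    for i, j in indices:
        for c in range(d):
            pooled[c] += m[i][j][c]
    return pooled

# Toy 2x2 feature map with d = 2.
m = [[[1.0, 0.0], [4.0, 4.0]],
     [[2.0, 0.0], [1.0, 2.0]]]

soft = soft_sum_pool(m, [[0.0, 0.5],
                         [0.0, 0.5]])      # every cell is visited
hard = hard_sum_pool(m, [(0, 1), (1, 1)])  # only two cells are visited
print(soft, hard)
```

The soft variant iterates over every cell even when its weight is zero, whereas the hard variant reads only the selected cells.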

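Non-local Pairwise Operator. To improve on sum pooling, the authors explore reasoning through non-local, pairwise computations. An important aspect of such non-local pairwise methods is that the computation is quadratic in the number of features, so hard attention can reduce the computational cost considerably. Given a set of embedding vectors $x_{ij}$, three simple linear projections produce a matrix of queries, $q_{ij} = W_q x_{ij}$, keys, $k_{ij} = W_k x_{ij}$, and values, $v_{ij} = W_v x_{ij}$, at each spatial location.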

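Then, for each spatial location ij, the query $q_{ij}$ is compared with the keys at all other locations, and the values v are summed, weighted by similarity. Specifically,

$$\tilde{x}_{ij} = \sum_{a,b} \mathrm{softmax}_{a,b}\left(q_{ij}^{T} k_{ab}\right) v_{ab}$$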

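Here, the softmax is taken over all spatial locations. The final representation of the input is obtained by summing the outputs over all locations ij, i.e., by sum-pooling. The mechanism thus computes non-local pairwise relations between embeddings, independent of spatial or temporal proximity. The separation between keys, queries, and values allows semantic information about each object to remain separated from the information that binds objects together across space.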
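The non-local pairwise operator followed by sum-pooling can be sketched as below. The sketch assumes the features have already been selected and flattened into a list of n vectors. The projection matrices are plain nested lists standing in for learned weights, and the names `nonlocal_pairwise`, `W_q`, `W_k`, and `W_v` are illustrative choices, not taken from the authors' code.

```python
import math

def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(wv * xv for wv, xv in zip(row, x)) for row in W]

def softmax(values):
    """Numerically stable softmax over a flat list of numbers."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def nonlocal_pairwise(features, W_q, W_k, W_v):
    """Attend every query to all keys, then sum-pool the outputs.

    Returns the pooled vector and the number of query-key comparisons.
    """
    queries = [matvec(W_q, x) for x in features]
    keys = [matvec(W_k, x) for x in features]
    values = [matvec(W_v, x) for x in features]
    pooled = [0.0] * len(values[0])
    comparisons = 0
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) for k in keys]
        comparisons += len(scores)
        weights = softmax(scores)          # softmax over all locations
        for weight, v in zip(weights, values):
            for c in range(len(pooled)):
                pooled[c] += weight * v[c]  # sum-pooling over query outputs
    return pooled, comparisons

identity = [[1.0, 0.0], [0.0, 1.0]]        # stand-in for learned projections
pooled, comparisons = nonlocal_pairwise([[1.0, 0.0], [0.0, 1.0]],
                                        identity, identity, identity)
print(pooled, comparisons)
```

The comparison count equals n squared, which is why reducing n with hard attention before this step lowers the cost so sharply.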

3. Experimental Results:
