Hierarchical Question-Image Co-Attention for Visual Question Answering

Reposted article; original post dated 2018-05-17 17:58.

NIPS 2016

Paper: https://arxiv.org/pdf/1606.00061.pdf

Code: https://github.com/jiasenlu/HieCoAttenVQA

Introduction：

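　　This paper proposes a novel co-attention mechanism that reasons jointly over image and text features, so that the features of the two modalities can guide each other.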

　　In addition, the authors represent the input question at several levels of granularity and build image-question co-attention maps at each of them: word level, phrase level and question level.

　　Finally, at the phrase level, the authors propose a novel convolution-pooling strategy that adaptively selects the phrase size.

Methods：

1. Notation：

　　The question is Q = {q₁, ... , q_T}, where q_t is the feature vector of the t-th word. We write q_t^w, q_t^p and q_t^s for the word embedding, phrase embedding and question embedding at position t.

　　The image features are V = {v₁, ... , v_N}, where v_n is the feature vector at spatial location n.

　　At each level of the hierarchy, the co-attention features of the image and the question are denoted v̂^r and q̂^r, where r ∈ {w, p, s}.

　　The weights of the different modules and layers are denoted W, with sub- and superscripts as needed.

2. Question Hierarchy：

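　　Given the 1-hot encoding of the question words Q = {q₁, ... , q_T}, we first embed the words into a vector space to obtain Q^w = {q₁^w, ... , q_T^w}. To compute phrase features, we apply 1-D convolution on the word embedding vectors. Specifically, at each word location we compute the inner product of the word vectors with filters of three window sizes: unigram, bigram and trigram. For the t-th word, the convolution output with window size s is:

$$\hat{q}^p_{s,t} = \tanh\left(W_c^s\, q^w_{t:t+s-1}\right), \quad s \in \{1, 2, 3\}$$

　　where W_c^s are the weight parameters. The word-level features Q^w are appropriately 0-padded before being fed into the bigram and trigram convolutions, so that the sequence length is maintained after convolution. Given the convolution results, we then apply max-pooling across the different n-grams at each word location to obtain the phrase-level features:

$$q^p_t = \max\left(\hat{q}^p_{1,t},\, \hat{q}^p_{2,t},\, \hat{q}^p_{3,t}\right), \quad t \in \{1, 2, \ldots, T\}$$

　　Unlike previous methods, this pooling adaptively selects among the different gram features at each time step while preserving the original sequence length and order. An LSTM then encodes the max-pooled sequence q_t^p; the corresponding question-level feature q_t^s is the LSTM hidden vector at time step t.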
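　　The convolution-pooling step described above can be sketched in NumPy as follows. This is only an illustration under assumptions, not the authors' implementation: the function name `phrase_features`, the filter layout `(d, s*d)`, and the choice to zero-pad on the right (so that the window q_{t:t+s-1} always exists) are hypothetical, and the LSTM that follows is omitted.

```python
import numpy as np

def phrase_features(Qw, filters):
    """Phrase-level features from word embeddings.

    Qw:      (T, d) array, one word embedding per row.
    filters: dict {s: W} with W of shape (d, s*d) for window sizes s = 1, 2, 3
             (this layout is an assumption made for the sketch).
    Returns a (T, d) array, same length and order as the input sequence.
    """
    T, d = Qw.shape
    outputs = []
    for s, W in sorted(filters.items()):
        # Zero-pad so every position t has a full window of s word vectors.
        # Padding on the right is an assumption; the notes only say "0-padded".
        padded = np.vstack([Qw, np.zeros((s - 1, d))])
        # Flatten each window of s consecutive embeddings into one vector.
        windows = np.stack([padded[t:t + s].reshape(-1) for t in range(T)])
        # Inner product with the filter, followed by tanh.
        outputs.append(np.tanh(windows @ W.T))          # (T, d)
    # Element-wise max over unigram, bigram and trigram outputs per position.
    return np.max(np.stack(outputs), axis=0)
```

　　Because the maximum is taken per position and per feature dimension, different dimensions of q_t^p may come from different n-gram sizes, which is what lets the model pick the phrase size adaptively.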

3. Co-Attention：

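　　We propose two co-attention mechanisms. The first, parallel co-attention, generates the image and question attention simultaneously. The second, alternating co-attention, generates the image and question attention sequentially. As shown in Fig. 2, both mechanisms can be applied at every level of the question hierarchy.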

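　　【Parallel Co-Attention】 This mechanism attends to the image and the question simultaneously. We connect the two by computing the similarity between image and question features at all pairs of image locations and question locations. Specifically, given an image feature map V ∈ R^{d×N} and a question representation Q ∈ R^{d×T}, the affinity matrix C ∈ R^{T×N} is computed as:

$$C = \tanh\left(Q^{\top} W_b V\right)$$

　　where W_b ∈ R^{d×d} contains the weights. Once the affinity matrix is available, one possible way to compute the image (or question) attention is to simply maximize out the affinity over the locations of the other modality, i.e.

$$a^v[n] = \max_i\left(C_{i,n}\right), \qquad a^q[t] = \max_j\left(C_{t,j}\right)$$

　　Instead of taking the max activation, we find that performance improves if the affinity matrix is treated as a feature and the image and question attention maps are learned from it:

$$H^v = \tanh\left(W_v V + (W_q Q)\, C\right), \qquad H^q = \tanh\left(W_q Q + (W_v V)\, C^{\top}\right)$$

$$a^v = \mathrm{softmax}\left(w_{hv}^{\top} H^v\right), \qquad a^q = \mathrm{softmax}\left(w_{hq}^{\top} H^q\right)$$

　　where W_v, W_q ∈ R^{k×d} and w_hv, w_hq ∈ R^k are weight parameters, and a^v ∈ R^N and a^q ∈ R^T are the attention probabilities of each image region v_n and each word q_t. The affinity matrix C transforms the question attention space into the image attention space (and C^T does the reverse). Based on these attention weights, the image and question attention vectors are weighted sums of the image features and question features:

$$\hat{v} = \sum_{n=1}^{N} a^v_n\, v_n, \qquad \hat{q} = \sum_{t=1}^{T} a^q_t\, q_t$$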
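　　To make the shapes concrete, here is a minimal NumPy sketch of parallel co-attention as summarized above. It is an illustration under assumptions rather than the released code: the function names and argument order are hypothetical, features are stored column-wise (V is d×N, Q is d×T), and all weights would be learned in practice.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - np.max(x))
    return e / e.sum()

def parallel_co_attention(V, Q, W_b, W_v, W_q, w_hv, w_hq):
    """Parallel co-attention for one level of the hierarchy.

    V: (d, N) image features, Q: (d, T) question features,
    W_b: (d, d), W_v and W_q: (k, d), w_hv and w_hq: (k,).
    Returns attended vectors v_hat, q_hat (each (d,)) and the
    attention distributions a_v (N,) and a_q (T,).
    """
    C = np.tanh(Q.T @ W_b @ V)                    # affinity matrix, (T, N)
    H_v = np.tanh(W_v @ V + (W_q @ Q) @ C)        # (k, N)
    H_q = np.tanh(W_q @ Q + (W_v @ V) @ C.T)      # (k, T)
    a_v = softmax(w_hv @ H_v)                     # attention over image locations
    a_q = softmax(w_hq @ H_q)                     # attention over question positions
    v_hat = V @ a_v                               # weighted sum of image features
    q_hat = Q @ a_q                               # weighted sum of question features
    return v_hat, q_hat, a_v, a_q
```

　　Note how `(W_q @ Q) @ C` maps the T question positions onto the N image locations; this is the sense in which C converts between the two attention spaces.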

　　【Alternating Co-Attention】 This mechanism alternates between generating image attention and question attention. Briefly, it consists of three steps:

　　1）summarize the question into a single vector q;

　　2）attend to the image based on the question summary q ;

　　3）attend to the question based on the attended image feature.

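　　We define an attention operation x̂ = A(X; g), which takes the image (or question) features X and an attention guidance g derived from the question (or image) as inputs, and outputs the attended image (or question) vector. The operation can be written as:

$$H = \tanh\left(W_x X + (W_g\, g)\, \mathbb{1}^{\top}\right), \qquad a^x = \mathrm{softmax}\left(w_{hx}^{\top} H\right), \qquad \hat{x} = \sum_i a^x_i\, x_i$$

　　where 𝟙 is a vector whose elements are all 1, and W_x, W_g and w_hx are weight parameters.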
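　　The three steps and the attention operation A(X; g) can be sketched as follows. This is a hedged illustration, not the authors' code: the names `attend` and `alternating_co_attention` are hypothetical, using a separate parameter triple for each step is an assumption, and the zero guidance in the first step follows my reading of the paper (the notes above do not state it).

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - np.max(x))
    return e / e.sum()

def attend(X, g, W_x, W_g, w_hx):
    """Attention operation x_hat = A(X; g).

    X: (d, M) features, g: (d,) guidance,
    W_x and W_g: (k, d), w_hx: (k,).
    """
    # Broadcasting the (k, 1) column plays the role of (W_g g) 1^T.
    H = np.tanh(W_x @ X + (W_g @ g)[:, None])    # (k, M)
    a = softmax(w_hx @ H)                        # (M,)
    return X @ a, a                              # attended vector (d,), weights

def alternating_co_attention(V, Q, params_s, params_v, params_q):
    """Each params_* is a tuple (W_x, W_g, w_hx); keeping them separate
    per step is an assumption of this sketch."""
    d = Q.shape[0]
    # 1) Summarize the question; no guidance yet, so g = 0 (assumed).
    s_hat, _ = attend(Q, np.zeros(d), *params_s)
    # 2) Attend to the image, guided by the question summary.
    v_hat, a_v = attend(V, s_hat, *params_v)
    # 3) Attend to the question, guided by the attended image feature.
    q_hat, a_q = attend(Q, v_hat, *params_q)
    return v_hat, q_hat, a_v, a_q
```

　　Compared with the parallel variant, no T×N affinity matrix is formed; each step only attends over one modality at a time.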

4. Encoding for Predicting Answers :

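　　We treat VQA as a classification task and predict the answer from the attended image and question features of all three levels. An MLP recursively encodes the attention features:

$$h^w = \tanh\left(W_w(\hat{q}^w + \hat{v}^w)\right)$$

$$h^p = \tanh\left(W_p\left[(\hat{q}^p + \hat{v}^p),\, h^w\right]\right)$$

$$h^s = \tanh\left(W_s\left[(\hat{q}^s + \hat{v}^s),\, h^p\right]\right)$$

$$p = \mathrm{softmax}\left(W_h h^s\right)$$

　　where W_w, W_p, W_s and W_h are weight parameters, [·] denotes concatenation of two vectors, and p is the probability distribution over answers.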
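　　The recursive encoding can be sketched as below. Again this is only an illustration under assumptions: the function name `predict_answer`, the dictionary layout of the inputs, and the use of one shared hidden size h for all three levels are hypothetical choices, and bias terms are omitted.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - np.max(x))
    return e / e.sum()

def predict_answer(feats, W_w, W_p, W_s, W_h):
    """Encode word-, phrase- and question-level co-attention features.

    feats: dict with keys "w", "p", "s", each mapping to a pair
           (q_hat, v_hat) of (d,) vectors for that level.
    W_w: (h, d), W_p and W_s: (h, d + h), W_h: (num_answers, h).
    Returns a probability distribution over the candidate answers.
    """
    q_w, v_w = feats["w"]
    q_p, v_p = feats["p"]
    q_s, v_s = feats["s"]
    # Word level: sum the attended question and image vectors.
    h_w = np.tanh(W_w @ (q_w + v_w))
    # Phrase level: concatenate with the word-level encoding.
    h_p = np.tanh(W_p @ np.concatenate([q_p + v_p, h_w]))
    # Question level: concatenate with the phrase-level encoding.
    h_s = np.tanh(W_s @ np.concatenate([q_s + v_s, h_p]))
    return softmax(W_h @ h_s)
```

　　The predicted answer is then the index of the largest entry of the returned distribution, for example `int(np.argmax(p))`.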

Experiments：
