[Paper Reading] Show, Attend and Tell: Neural Image Caption Generation with Visual Attention

Reposted article, originally published 2019-06-06 23:24. Category: Paper Reading

Paper link: https://arxiv.org/pdf/1502.03044.pdf

Code links: https://github.com/kelvinxu/arctic-captions & https://github.com/yunjey/show-attend-and-tell & https://github.com/jazzsaxmafia/show_attend_and_tell.tensorflow

Main Contributions

In this paper, the authors introduce the attention mechanism into neural image captioning and propose two variants: 'Soft' Deterministic Attention Mechanism & 'Hard' Stochastic Attention Mechanism. A figure in the original post shows the overall framework of the "Show, Attend and Tell" model.

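The key question in the attention mechanism is how to compute the context vector z_t from the image feature vectors a_i. For each location i, the attention mechanism produces a score e_ti, which is normalized with a softmax over all locations to give the weight α_ti. In hard attention, α_ti is the probability that the region vector a_i is selected at time t as the information passed to the decoder. Exactly one region is selected, so an indicator variable s_t,i is introduced that equals 1 when region i is selected and 0 otherwise. In soft attention, α_ti is the proportion that the region vector a_i contributes to the information fed to the decoder at time t, so z_t is the α-weighted sum of the a_i. (See also: "Attention mechanism paper reading: Soft and Hard Attention" and "Multimodal: paper notes on the image captioning task (2), introducing the attention mechanism".)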
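The difference between the two mechanisms can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the scores e_t are drawn at random here, whereas in the paper they come from an attention model conditioned on the previous hidden state, and the function names `soft_context` and `hard_context` are hypothetical.

```python
import numpy as np

def softmax(x):
    # Subtract the max for numerical stability before exponentiating.
    e = np.exp(x - np.max(x))
    return e / e.sum()

def soft_context(a, e_t):
    """Soft attention: z_t is the alpha-weighted sum of all region vectors.
    a: (L, D) region vectors, e_t: (L,) unnormalized scores."""
    alpha = softmax(e_t)
    return alpha @ a, alpha

def hard_context(a, e_t, rng):
    """Hard attention: sample exactly one region with probability alpha_i.
    Returns the selected region vector and the one-hot indicator s_t."""
    alpha = softmax(e_t)
    i = rng.choice(len(alpha), p=alpha)
    s = np.zeros_like(alpha)
    s[i] = 1.0
    return s @ a, s

rng = np.random.default_rng(0)
a = rng.standard_normal((196, 512))   # L=196 regions, D=512 (placeholder features)
e_t = rng.standard_normal(196)        # placeholder scores (assumption)

z_soft, alpha = soft_context(a, e_t)
z_hard, s = hard_context(a, e_t, rng)
```

Both context vectors have dimension D. The soft version is differentiable with respect to the scores, while the hard version involves sampling, which is why the paper treats it as a stochastic mechanism.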

Experimental Details

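In the paper, the authors use a VGGNet pretrained on ImageNet, without fine-tuning, to extract image features. The feature map produced by block5_conv4 (Conv2D), of size 14×14×512, is reshaped to 196×512 (L×D with L=196 and D=512, i.e. 196 image regions, each with a 512-dimensional feature vector) to form the image region vectors a_i.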

To create the annotations a_i used by our decoder, we used the Oxford VGGnet pretrained on ImageNet without finetuning.

In our experiments we use the 14×14×512 feature map of the fourth convolutional layer before max pooling. This means our decoder operates on the flattened 196×512 (i.e L × D) encoding.
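The flattening step can be illustrated as follows. This is only a sketch: a random array stands in for the real VGG feature map, and the variable names are chosen for illustration.

```python
import numpy as np

# Placeholder for the 14x14x512 convolutional feature map of one image
# (assumption: random values stand in for real VGG activations).
feature_map = np.random.default_rng(0).standard_normal((14, 14, 512)).astype(np.float32)

L, D = 14 * 14, 512
# Flatten the two spatial axes; each row is one region vector a_i.
annotations = feature_map.reshape(L, D)

# With row-major reshaping, row i corresponds to spatial position (i // 14, i % 14).
row, col = divmod(15, 14)
```

Keeping track of the mapping from row index to spatial position is useful later, for example when the attention weights are reshaped back to 14×14 for visualization.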

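The authors state that the initial cell state (init_c) and hidden state (init_h) of the decoder LSTM are determined by the average of the feature vectors extracted from the image, passed through two separate multi-layer perceptrons (Multi-Layer Perceptron, MLP).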

The initial memory state and hidden state of the LSTM are predicted by an average of the annotation vectors fed through two separate MLPs (init,c and init,h).
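A rough sketch of this initialization is shown below. It makes several assumptions: each MLP is reduced to a single tanh layer, the weights are random rather than learned, and the LSTM size n=256 is arbitrary. None of these details come from the paper.

```python
import numpy as np

def mlp(x, W, b):
    # One-layer MLP with tanh activation (assumption: depth and activation are illustrative).
    return np.tanh(x @ W + b)

rng = np.random.default_rng(0)
L, D, n = 196, 512, 256              # n is a hypothetical LSTM state size

a = rng.standard_normal((L, D))      # placeholder annotation vectors a_i
a_mean = a.mean(axis=0)              # average over the L regions, shape (D,)

# Two separate sets of parameters, one per initial state.
W_c, b_c = rng.standard_normal((D, n)) * 0.01, np.zeros(n)
W_h, b_h = rng.standard_normal((D, n)) * 0.01, np.zeros(n)

init_c = mlp(a_mean, W_c, b_c)       # initial cell (memory) state
init_h = mlp(a_mean, W_h, b_h)       # initial hidden state
```

Because the two MLPs have independent parameters, the same averaged image feature yields different initial cell and hidden states.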
