Stylized Image Captioning: Paper Notes


Neural Storyteller (Kiros et al. 2015)

NST breaks the task into two steps: it first generates unstyled (factual) captions, then applies style-shift techniques to produce stylized descriptions.

SentiCap: Generating Image Descriptions with Sentiments  (AAAI 2016)

Both code and data are released. (The code uses a fairly old framework, so I did not read it.)

Supervised image captioning

Style: Positive, Negative

Datasets:

MSCOCO

SentiCap Dataset: a dataset collected by the authors (fairly small; Positive: 998 images / 2,873 captions for training and 673 images / 2,019 captions for testing; Negative: 997 images / 2,468 captions for training and 503 images / 1,509 captions for testing), with 3 positive and 3 negative captions per image.

The data were collected as a caption re-writing task based on the objective MSCOCO captions: AMT workers were asked to choose among ANPs (adjective-noun pairs) of the desired sentiment and to incorporate one or more of them into any one of the five existing captions.

Evaluation Metrics:

Automatic metrics: BLEU, ROUGE-L, METEOR, CIDEr

Human evaluation

Model

 

Shortcomings: it requires not only paired image-sentiment-caption data but also word-level supervision to emphasize the sentiment words (e.g., a sentiment strength for each word in the sentiment caption), which makes the approach expensive and difficult to scale up (as pointed out in the StyleNet paper).

 

StyleNet: Generating Attractive Visual Captions with Styles (CVPR2017)

The code is not released, though a third-party PyTorch implementation exists. The FlickrStyle dataset has been released, except for the 1K test split, which is not public.

Unsupervised (no style-specific image-caption pairs are used for supervision): factual image-caption pairs + a stylized language corpus (text only)

Produces attractive, styled visual captions using only a monolingual stylized language corpus (without paired images) together with standard factual image/video-caption pairs.

Style: Romantic, Humorous

Datasets:

FlickrStyle10K (built on the Flickr30K image-caption dataset: annotators are shown a standard factual caption for an image and asked to revise it to be romantic or humorous). Although image-stylized-caption pairs therefore exist, the authors do not use these pairs during training; they train on image-factual-caption pairs plus stylized text corpora. The image-stylized-caption pairs are used only at evaluation time, as ground truth.

Evaluation Metrics:

Automatic metrics: BLEU, METEOR, ROUGE, CIDEr

Human evaluation

Model

 

Key points:

1. The LSTM input weight matrix Wx is factored into three terms, Ux, Sx, Vx. All LSTM parameters other than S are shared across the model; S is the only part that memorizes a specific style.

2. Training resembles multi-task sequence-to-sequence learning. First task: generate factual captions from the paired images, updating all parameters. Second task: the factored LSTM is trained as a language model on the stylized corpus, updating only SR or SH (see the sketch below).
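A minimal sketch of the factored-LSTM idea under these notes' description (not the authors' unreleased code): the input-to-hidden projection is factored as U·S·V, where only the style-specific S differs per style and everything else is shared. Class and argument names are my own.

```python
# Sketch of StyleNet's Factored LSTM idea: W_x is factored as U @ S_style @ V,
# where only S_style is style-specific; all other parameters are shared.
import torch
import torch.nn as nn

class FactoredLSTMCell(nn.Module):
    def __init__(self, emb_dim, hid_dim, factor_dim,
                 styles=("factual", "romantic", "humorous")):
        super().__init__()
        self.U = nn.Linear(factor_dim, 4 * hid_dim, bias=False)   # shared
        self.V = nn.Linear(emb_dim, factor_dim, bias=False)       # shared
        # one style-specific matrix S per style
        self.S = nn.ModuleDict({s: nn.Linear(factor_dim, factor_dim, bias=False)
                                for s in styles})
        self.W_h = nn.Linear(hid_dim, 4 * hid_dim)                 # shared recurrent weights

    def forward(self, x, state, style="factual"):
        h, c = state
        gates = self.U(self.S[style](self.V(x))) + self.W_h(h)
        i, f, o, g = gates.chunk(4, dim=-1)
        c = torch.sigmoid(f) * c + torch.sigmoid(i) * torch.tanh(g)
        h = torch.sigmoid(o) * torch.tanh(c)
        return h, c
```

In stage one, all parameters (with style="factual") would be updated on image/factual-caption pairs; in stage two, the cell runs as a language model on the stylized corpus and only the corresponding S["romantic"] or S["humorous"] matrix is left unfrozen.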

 

“Factual” and “Emotional”: Stylized Image Captioning with Adaptive Learning and Attention  (ECCV 2018)

Style-factual LSTM block: style matrices Sx, Sh, combined with the factual weights via adaptive gates gxt, ght (sketched after this list)

Two-stage learning strategy

MLE loss + KL divergence
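The parameterization below is a rough, hedged sketch of how I read the style-factual block from the summary above: factual matrices (Wx, Wh) and style matrices (Sx, Sh) are blended by adaptive gates gxt, ght computed from the current input and previous hidden state. The exact gate parameterization, and how the two-stage training with the MLE + KL objective drives the gates, follow the paper rather than this code.

```python
# Hedged sketch of a style-factual LSTM cell: factual weights and style weights are
# mixed by adaptive gates g_xt, g_ht. Gate parameterization here is an assumption.
import torch
import torch.nn as nn

class StyleFactualCell(nn.Module):
    def __init__(self, emb_dim, hid_dim):
        super().__init__()
        self.W_x = nn.Linear(emb_dim, 4 * hid_dim)              # factual, input-to-hidden
        self.S_x = nn.Linear(emb_dim, 4 * hid_dim, bias=False)  # style,   input-to-hidden
        self.W_h = nn.Linear(hid_dim, 4 * hid_dim)              # factual, hidden-to-hidden
        self.S_h = nn.Linear(hid_dim, 4 * hid_dim, bias=False)  # style,   hidden-to-hidden
        self.gate_x = nn.Linear(emb_dim, 1)                     # produces g_xt in (0, 1)
        self.gate_h = nn.Linear(hid_dim, 1)                     # produces g_ht in (0, 1)

    def forward(self, x, state):
        h, c = state
        g_xt = torch.sigmoid(self.gate_x(x))
        g_ht = torch.sigmoid(self.gate_h(h))
        pre = ((1 - g_xt) * self.W_x(x) + g_xt * self.S_x(x)
               + (1 - g_ht) * self.W_h(h) + g_ht * self.S_h(h))
        i, f, o, g = pre.chunk(4, dim=-1)
        c = torch.sigmoid(f) * c + torch.sigmoid(i) * torch.tanh(g)
        h = torch.sigmoid(o) * torch.tanh(c)
        return h, c
```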

 

Image Captioning at Will: A Versatile Scheme for Effectively Injecting Sentiments into Image Descriptions   (Preprint 30 Jan 2018)

 

 

 

SENTI-ATTEND: Image Captioning using Sentiment and Attention  (Preprint 24 Nov 2018)

This paper can be seen as a follow-up to SentiCap; it uses a supervised approach.

Datasets

MS COCO: used for generating generic image captions

SentiCap dataset:

Evaluation Metrics

standard image caption evaluation metrics: BLEU, ROUGE-L, METEOR, CIDEr, SPICE

Entropy

Model

 

Loss function:

The paper does not release code; the experiments compare against SentiCap and Image Captioning at Will.

Question: the SentiCap dataset is very small; is training with a cross-entropy loss on its image-caption pairs really effective?

The LSTM takes two additional inputs, E1 and E2, and at every step h_t is used to predict the sentiment s, an operation that already appears in SentiCap. The paper has remained a preprint.
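Based only on the description above (two extra sentiment embeddings E1 and E2 as inputs, and h_t used at every step to predict the sentiment s), a hedged sketch of such a decoder step might look like the following; all names, dimensions, and the vocabulary size are placeholders, not the paper's actual architecture.

```python
# Hedged sketch of a decoder step with sentiment inputs and a per-step sentiment head.
import torch
import torch.nn as nn

class SentimentDecoderStep(nn.Module):
    def __init__(self, emb_dim, feat_dim, hid_dim, n_sentiments=2, vocab_size=10000):
        super().__init__()
        self.E1 = nn.Embedding(n_sentiments, emb_dim)        # first sentiment embedding
        self.E2 = nn.Embedding(n_sentiments, emb_dim)        # second sentiment embedding
        self.lstm = nn.LSTMCell(emb_dim * 3 + feat_dim, hid_dim)
        self.word_head = nn.Linear(hid_dim, vocab_size)      # next-word logits
        self.senti_head = nn.Linear(hid_dim, n_sentiments)   # predicts s from h_t

    def forward(self, word_emb, attended_feat, sentiment, state):
        s1, s2 = self.E1(sentiment), self.E2(sentiment)
        h, c = self.lstm(torch.cat([word_emb, s1, s2, attended_feat], dim=-1), state)
        return self.word_head(h), self.senti_head(h), (h, c)
```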

 

SemStyle: Learning to Generate Stylised Image Captions using Unaligned Text  (CVPR 2018)

Part of the code and data has been released.

Style: Story

Learns from existing image-caption datasets containing only factual descriptions, plus a large set of styled texts without aligned images.

Two-stage training strategy for the term generator and language generator
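The part that makes this two-stage setup work is that the language generator's training pairs can be built from text alone: any sentence is mapped to a sequence of semantic terms, and the generator learns to reconstruct the sentence (plus a style indicator token) from those terms, so no paired images are needed. A much-simplified sketch of the term-extraction step, using a crude stopword filter in place of the paper's POS-tagging and lemmatization pipeline:

```python
# Simplified stand-in for SemStyle's term extraction: reduce a sentence to its
# ordered content words, which become the source sequence for the language generator.
import re

STOPWORDS = {"a", "an", "the", "is", "are", "was", "were", "along", "on", "in",
             "at", "of", "to", "and", "with", "his", "her", "their"}

def sentence_to_terms(sentence: str) -> list[str]:
    tokens = re.findall(r"[a-z']+", sentence.lower())
    return [t for t in tokens if t not in STOPWORDS]

# Training pairs for the language generator need no images:
#   source = sentence_to_terms(styled_sentence) + ["<STORY>"]   (style indicator token)
#   target = styled_sentence
print(sentence_to_terms("A man is riding a brown horse along the beach."))
# -> ['man', 'riding', 'brown', 'horse', 'beach']
```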

Dataset:

Descriptive Image Captions: MSCOCO

Styled text: BookCorpus

Evaluation:

Automatic relevance metrics: Widely-used captioning metrics (BLEU, METEOR, CIDEr, SPICE)

Automatic style metrics proposed by the authors: LM (a 4-gram language model), GRULM (a GRU language model), and CLF (a binary style classifier); see the sketch after this list.

Human evaluations of relevance and style
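As a rough illustration of the CLF idea (not the authors' classifier), one could train a binary style classifier on styled vs. descriptive sentences and report the fraction of generated captions it labels as styled; the bag-of-words logistic-regression setup below is only a stand-in.

```python
# Stand-in for a CLF-style metric: fraction of generated captions judged "styled"
# by a binary classifier trained on styled vs. descriptive sentences.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def style_clf_score(styled_texts, descriptive_texts, generated_captions):
    clf = make_pipeline(CountVectorizer(ngram_range=(1, 2)),
                        LogisticRegression(max_iter=1000))
    texts = styled_texts + descriptive_texts
    labels = [1] * len(styled_texts) + [0] * len(descriptive_texts)
    clf.fit(texts, labels)
    preds = clf.predict(generated_captions)
    return sum(preds) / len(preds)   # fraction classified as styled
```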

 

 

Unsupervised Stylish Image Description Generation via Domain Layer Norm (AAAI 2019)

Unsupervised image captioning

Four different styles: fairy tale, romance, humor, and country song lyrics (lyrics)

The model is jointly trained with a paired unstyled image-description corpus (source domain) and a monolingual corpus of the specific style (target domain).

Neither the code nor the datasets have been released.

Datasets:

Source domain: VG-Para (Krause et al. 2017)

Target domain: BookCorpus (humor and romance), plus country song lyrics and fairy-tale corpora collected by the authors.

Evaluation Metrics:

Metrics of semantic relevance: the authors' own p and r metrics, plus SPICE

Metrics of Stylishness: transfer accuracy

Human evaluation

Key points of the approach

EI and ET map the image and the target-style description, respectively, into the same latent space. GS generates unstyled descriptions, i.e. sentences from the source domain, so EI combined with GS is the conventional encoder-decoder image-captioning model, trained on supervised image-caption pairs. GT generates stylized descriptions: ET encodes a stylized sentence into the latent space Z, and GT regenerates the stylized sentence from the latent code zT (reconstruction), trained only on stylized sentences. Once training is finished, combining EI with GT yields stylized image descriptions.

Key point 1: the authors assume there exists a latent space Z into which the image, the unstyled source description, and the styled target description can all be mapped.

Key point 2: GS and GT differ only in their layer-normalization parameters; all other parameters are shared. That is, GS and GT share the same LN-LSTM, in which only the parameters {gS, bS} and {gT, bT} differ; the authors call this mechanism Domain Layer Norm (DLN). The layer-norm operation is applied to each LSTM gate (input gate, forget gate, output gate).
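A minimal sketch of the DLN mechanism as described above: all LSTM weights are shared between GS and GT, and only the per-domain layer-norm gain/bias pairs differ. For brevity this sketch normalizes the concatenated gate pre-activations jointly, whereas the paper normalizes each gate; it is an illustrative reconstruction, not the authors' (unreleased) code.

```python
# Sketch of a Domain Layer Norm LSTM cell: shared weights, per-domain LN gain/bias.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DLNLSTMCell(nn.Module):
    def __init__(self, in_dim, hid_dim, domains=("source", "target")):
        super().__init__()
        self.W_x = nn.Linear(in_dim, 4 * hid_dim, bias=False)   # shared across domains
        self.W_h = nn.Linear(hid_dim, 4 * hid_dim, bias=False)  # shared across domains
        # domain-specific layer-norm gain/bias, e.g. {g_S, b_S} vs. {g_T, b_T}
        self.gain = nn.ParameterDict({d: nn.Parameter(torch.ones(4 * hid_dim)) for d in domains})
        self.bias = nn.ParameterDict({d: nn.Parameter(torch.zeros(4 * hid_dim)) for d in domains})

    def forward(self, x, state, domain="source"):
        h, c = state
        pre = self.W_x(x) + self.W_h(h)
        pre = F.layer_norm(pre, pre.shape[-1:], self.gain[domain], self.bias[domain])
        i, f, o, g = pre.chunk(4, dim=-1)
        c = torch.sigmoid(f) * c + torch.sigmoid(i) * torch.tanh(g)
        h = torch.sigmoid(o) * torch.tanh(c)
        return h, c
```

Training the source-domain decoder on image-caption pairs and the target-domain decoder on stylized-text reconstruction, then decoding image features with domain="target", mirrors the EI + GT combination described above.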

