[Paper Notes] Learning a Text-Video Embedding from Incomplete and Heterogeneous Data

Reposted from the original post, 2020-09-20 23:16. Category: ML/DL papers and principles.

Paper: https://arxiv.org/abs/1804.02516

Research area

Text-video retrieval.

Problem

Large-scale annotated video-caption data for training is scarce.

One difficulty with this approach, however, is the lack of large-scale annotated video-caption datasets for training. To address this issue, we aim at learning text-video embeddings from heterogeneous data sources.

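Contribution

The paper proposes the Mixture-of-Embedding-Experts (MEE) model, which can handle "videos" that are missing part of their information and still match them against text as usual, which enlarges the usable training set.

The paper's overview figure makes the motivation clear: map data of different forms into a common feature space so that the data is used as fully as possible (even when some components are missing, as with an image compared with a video, the model can still learn from whatever remains).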

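Concretely, the video data is split into several sources. Each source is compared with its own feature representation of the sentence, and the final similarity is the weighted average of these per-source similarities.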
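The weighted average described above can be written roughly as follows. The notation is an assumption made for these notes (X for the sentence, Y_i for the i-th video source, f_i and g_i for the text-side and video-side embeddings of source i, w_i for the weight), not a verbatim copy of the paper's equation:

```latex
s(X, Y) = \sum_{i=1}^{N} w_i(X)\, s_i(X, Y),
\qquad
s_i(X, Y) = \langle f_i(X),\, g_i(Y_i) \rangle
```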

The text is first aggregated into a feature vector with NetVLAD (This is motivated by the recent results [34] demonstrating superior performance of NetVLAD aggregation over other common aggregation architectures such as long short-term memory (LSTM) [48] or gated recurrent units (GRU) [49].), and then passes through a three-layer mapping, numbered (1) to (3) in the paper.
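Going by the quoted description of layers (1) to (3) in these notes, the mapping can be reconstructed roughly as below. The input name Z_0 and the elementwise product symbol are assumptions; the gating term and the L2 normalization follow the paper's wording:

```latex
Z_1 = W_1 Z_0 + b_1 \tag{1}
Z_2 = Z_1 \circ \sigma(W_2 Z_1 + b_2) \tag{2}
Z = \frac{Z_2}{\lVert Z_2 \rVert_2} \tag{3}
```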

(1) is simply a linear projection to the target dimension, while for (2) the paper gives the following explanation:

The second layer, given by (2), performs context gating [34], where individual dimensions of Z1 are reweighted using learnt gating weights σ(W2Z1 + b2) with values between 0 and 1, where W2 and b2 are learnt parameters.

The motivation for such gating is two-fold: (i) we wish to introduce nonlinear interactions among dimensions of Z1 and (ii) we wish to recalibrate the strengths of different activations of Z1 through a self-gating mechanism. Finally, the last layer, given by (3), performs L2 normalization to obtain the final output Z.

[34] Miech, A., Laptev, I., Sivic, J.: Learnable pooling with context gating for video classification. arXiv preprint arXiv:1706.06905 (2017)
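To make the three layers concrete, here is a minimal pure-Python sketch of one such gated embedding unit. It follows the quoted description (linear projection, context gating, L2 normalization); the function names, the list-based matrix format, and the zero-norm fallback are assumptions made for illustration, not the authors' code.

```python
import math

def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def sigmoid(v):
    """Numerically stable logistic function."""
    if v >= 0:
        return 1.0 / (1.0 + math.exp(-v))
    e = math.exp(v)
    return e / (1.0 + e)

def gated_embedding_unit(z0, W1, b1, W2, b2):
    """Hypothetical helper: project, gate, then L2-normalize a feature vector."""
    # (1) Z1 = W1 Z0 + b1: linear projection to the embedding dimension
    z1 = [a + b for a, b in zip(matvec(W1, z0), b1)]
    # (2) Z2 = Z1 * sigmoid(W2 Z1 + b2): each dimension is rescaled by a gate in (0, 1)
    gates = [sigmoid(a + b) for a, b in zip(matvec(W2, z1), b2)]
    z2 = [g * v for g, v in zip(gates, z1)]
    # (3) Z = Z2 / ||Z2||_2: L2 normalization (returned unchanged if the norm is zero)
    norm = math.sqrt(sum(v * v for v in z2))
    return [v / norm for v in z2] if norm > 0 else z2
```

Because the output is unit length, a dot product between two such embeddings is a cosine similarity, which is convenient for the per-source similarity scores.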

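Several copies of this structure run in parallel, one for each video source. The authors explain that this is because different video sources attend to different parts of the text. In effect, (2) also uses the text itself to modulate its own feature distribution.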

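The authors also describe the similarity computation, which is simple. One point worth noting: the weights of the different video source descriptors (different streams of input descriptors) are computed entirely from the sentence. The authors argue that the sentence can act as a prior that decides which aspects of the described video matter more, which also makes this a kind of self-attention mechanism.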
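The sketch below illustrates one way sentence-driven weights can be combined with per-source similarities when some sources are missing. It assumes the weights are a softmax over linear scores of the sentence vector, renormalized over only the sources actually present. That matches my reading of the paper, but the function names, the dictionary layout, and the source names in the usage note are hypothetical.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def mixture_weights(text_vec, expert_params, available):
    """Softmax over per-source scores that depend on the sentence only.

    expert_params maps a source name to a learnt vector (hypothetical layout).
    Sources not in `available` are left out, so the weights still sum to 1.
    """
    scores = {n: dot(text_vec, a) for n, a in expert_params.items() if n in available}
    top = max(scores.values())  # subtract the max for numerical stability
    exps = {n: math.exp(s - top) for n, s in scores.items()}
    total = sum(exps.values())
    return {n: e / total for n, e in exps.items()}

def mee_similarity(text_vec, text_embs, video_embs, expert_params):
    """Weighted sum of per-source dot products over the sources the video has."""
    available = set(video_embs)  # e.g. a still image has appearance but no motion
    weights = mixture_weights(text_vec, expert_params, available)
    return sum(weights[n] * dot(text_embs[n], video_embs[n]) for n in available)
```

For example, calling `mee_similarity` with `video_embs` containing only an `"appearance"` entry gives that source a weight of 1, so an image can be scored against a sentence with the same model that scores full videos. The sketch assumes at least one source is present.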

Finally, the model is trained with the margin loss commonly used in retrieval tasks.
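As an illustration of such a margin loss, here is a minimal sketch of a bidirectional max-margin ranking loss over a batch of sentence-video pairs. The notes do not reproduce the exact formula, so the bidirectional form, the averaging over the batch, and the margin value of 0.2 are assumptions chosen for the example.

```python
def max_margin_ranking_loss(sim, margin=0.2):
    """Bidirectional max-margin ranking loss (illustrative sketch).

    sim[i][j] is the similarity between sentence i and video j.
    The diagonal entries sim[i][i] are the matching pairs.
    """
    n = len(sim)
    total = 0.0
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            # video j is a wrong match for sentence i
            total += max(0.0, margin + sim[i][j] - sim[i][i])
            # sentence j is a wrong match for video i
            total += max(0.0, margin + sim[j][i] - sim[i][i])
    return total / n
```

The loss is zero once every matching pair scores higher than every mismatched pair in its row and column by at least the margin.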

Experiments
