Applications of BERT in the Multimodal Domain
Since its introduction, BERT (Bidirectional Encoder Representations from Transformers) has substantially raised benchmark performance across NLP tasks, thanks to the Transformer's strong feature-learning capacity and the bidirectional encoding enabled by its masked language modeling objective. Given this strength, BERT began to be adopted in the multimodal domain in 2019. Its multimodal applications fall into two main camps: single-stream models, which fuse textual and visual information from the very first layer, and two-stream models, which first pass textual and visual information through two independent encoders and then fuse the modalities via cross-attention. This article introduces and compares five BERT models for image-text tasks: VisualBERT, Unicoder-VL, VL-BERT, ViLBERT, and LXMERT.
Single-Stream Models
1. VisualBERT
Paper: VisualBERT: A Simple and Performant Baseline for Vision and Language
Link: https://arxiv.org/abs/1908.03557
Code: https://github.com/uclanlp/visualbert
2. Unicoder-VL
Paper: Unicoder-VL: A Universal Encoder for Vision and Language by Cross-modal Pre-training
Link: https://arxiv.org/abs/1908.06066
3. VL-BERT
Paper: VL-BERT: Pre-training of Generic Visual-Linguistic Representations
Link: https://arxiv.org/abs/1908.08530
Code: https://github.com/jackroos/VL-BERT
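The single-stream idea shared by the three models above can be sketched in a few lines: text token embeddings and projected visual region features are tagged with segment embeddings and concatenated into one sequence, which a single Transformer encoder processes jointly from the first layer. This is a minimal illustration with made-up dimensions, not any model's actual implementation.

```python
import torch
import torch.nn as nn

class SingleStreamFusion(nn.Module):
    """Minimal single-stream sketch (VisualBERT-style; names and sizes are
    illustrative assumptions): one Transformer jointly encodes both modalities."""

    def __init__(self, vocab_size=1000, hidden=64, region_dim=2048, heads=4, layers=2):
        super().__init__()
        self.tok = nn.Embedding(vocab_size, hidden)
        self.vis = nn.Linear(region_dim, hidden)   # project region features into text space
        self.seg = nn.Embedding(2, hidden)         # segment 0 = text, 1 = visual
        enc_layer = nn.TransformerEncoderLayer(hidden, heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, layers)

    def forward(self, token_ids, region_feats):
        # Tag each modality with its segment embedding, then fuse by concatenation.
        t = self.tok(token_ids) + self.seg(torch.zeros_like(token_ids))
        seg_v = torch.ones(region_feats.shape[:2], dtype=torch.long)
        v = self.vis(region_feats) + self.seg(seg_v)
        # A single encoder sees the joint sequence, so cross-modal interaction
        # happens in every self-attention layer from the start.
        return self.encoder(torch.cat([t, v], dim=1))

model = SingleStreamFusion()
out = model(torch.randint(0, 1000, (2, 5)), torch.randn(2, 3, 2048))
print(out.shape)  # (batch, 5 text tokens + 3 regions, hidden)
```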
Two-Stream Models
1. ViLBERT
Paper: ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks
Link: https://arxiv.org/abs/1908.02265
Code: https://github.com/facebookresearch/vilbert-multi-task
2. LXMERT
Paper: LXMERT: Learning Cross-Modality Encoder Representations from Transformers
Link: https://arxiv.org/abs/1908.07490
Code: https://github.com/airsplay/lxmert
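In contrast, the two-stream models above keep a separate encoder per modality and fuse them with cross-attention: one stream's hidden states act as queries against the other stream's keys and values, in both directions. The block below is a minimal sketch of that co-attention step (ViLBERT-style; names and sizes are illustrative assumptions, not either model's actual architecture).

```python
import torch
import torch.nn as nn

class CoAttentionBlock(nn.Module):
    """Minimal two-stream co-attention sketch: each modality attends over
    the other, so fusion happens between streams rather than within one."""

    def __init__(self, hidden=64, heads=4):
        super().__init__()
        self.txt_attends_img = nn.MultiheadAttention(hidden, heads, batch_first=True)
        self.img_attends_txt = nn.MultiheadAttention(hidden, heads, batch_first=True)

    def forward(self, text_states, image_states):
        # Text queries read from image keys/values, and vice versa;
        # each stream keeps its own length but absorbs the other modality.
        t, _ = self.txt_attends_img(text_states, image_states, image_states)
        v, _ = self.img_attends_txt(image_states, text_states, text_states)
        return t, v

block = CoAttentionBlock()
txt, img = block(torch.randn(2, 5, 64), torch.randn(2, 3, 64))
print(txt.shape, img.shape)  # each stream retains its own sequence length
```

In the real models this exchange is interleaved with ordinary self-attention layers inside each stream; only the cross-attention step is shown here.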
Video-Based BERT
1. VideoBERT
Paper: VideoBERT: A Joint Model for Video and Language Representation Learning
Venue: ICCV 2019
