Applications of BERT in the Multimodal Domain



Since its introduction, BERT (Bidirectional Encoder Representations from Transformers) has substantially raised benchmark performance across NLP tasks, thanks to the Transformer's strong feature-learning capacity and the bidirectional encoding enabled by its masked language modeling objective. Given this learning power, BERT began to be adopted in the multimodal domain starting in 2019. Its multimodal applications fall into two main schools. One is the single-stream model, in which textual and visual information are fused from the very beginning. The other is the two-stream model, in which textual and visual information first pass through two independent encoding modules and are then fused through mutual attention mechanisms between the modalities. This article introduces and compares five BERT models applied to image-text interaction: VisualBERT, Unicoder-VL, VL-BERT, ViLBERT, and LXMERT.
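The masked language modeling objective mentioned above can be sketched in a few lines. This is a minimal illustration, not any particular implementation: the `[MASK]` id of 103, the 15% masking rate, and the `-100` ignore label are conventions borrowed from common BERT setups.

```python
import random

MASK_ID = 103  # hypothetical [MASK] token id, as in common BERT vocabularies

def mask_tokens(token_ids, mask_prob=0.15, seed=1):
    """Randomly replace a fraction of tokens with [MASK]; the model is
    trained to predict the originals using both left and right context."""
    rng = random.Random(seed)
    inputs, labels = [], []
    for tid in token_ids:
        if rng.random() < mask_prob:
            inputs.append(MASK_ID)  # masked position: model must recover `tid`
            labels.append(tid)      # supervision target at this position
        else:
            inputs.append(tid)
            labels.append(-100)     # conventionally ignored by the loss
    return inputs, labels

inputs, labels = mask_tokens(list(range(1000, 1020)))
```

Because only masked positions carry a label, the loss is computed exactly where the model had to infer a token from bidirectional context.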

 

Single-Stream Models

1. VisualBERT

Paper: VisualBERT: A Simple and Performant Baseline for Vision and Language

Link: https://arxiv.org/abs/1908.03557

Code: https://github.com/uclanlp/visualbert

2. Unicoder-VL

Paper: Unicoder-VL: A Universal Encoder for Vision and Language by Cross-modal Pre-training

Link: https://arxiv.org/abs/1908.06066

3. VL-BERT

Paper: VL-BERT: Pre-training of Generic Visual-Linguistic Representations

Link: https://arxiv.org/abs/1908.08530

Code: https://github.com/jackroos/VL-BERT
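The single-stream idea shared by the three models above can be illustrated with a minimal NumPy sketch. All dimensions, projections, and the single attention head are illustrative assumptions; the point is that text embeddings and image-region features enter one joint sequence, so every layer attends across modalities from the start.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64                                 # hidden size (illustrative)
text = rng.normal(size=(8, d))         # 8 text token embeddings
regions = rng.normal(size=(4, 2048))   # 4 detected region features (e.g. from a detector)

# Project visual features into the shared hidden space.
W_v = rng.normal(size=(2048, d)) / np.sqrt(2048)
visual = regions @ W_v

# Single stream: one joint sequence, with a segment embedding marking modality.
seg_text = rng.normal(size=(d,))
seg_img = rng.normal(size=(d,))
seq = np.concatenate([text + seg_text, visual + seg_img], axis=0)  # (12, d)

def self_attention(x):
    """One attention head over the fused sequence: every position, textual
    or visual, attends to every other position."""
    scores = x @ x.T / np.sqrt(x.shape[1])
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ x

fused = self_attention(seq)  # (12, d): jointly contextualized text + visual states
```

The individual single-stream models differ mainly in how the visual inputs are produced and which pretraining tasks are used, not in this basic fusion scheme.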

 

 

Two-Stream Models

1. ViLBERT

Paper: ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks

Link: https://arxiv.org/abs/1908.02265

Code: https://github.com/facebookresearch/vilbert-multi-task

2. LXMERT

Paper: LXMERT: Learning Cross-Modality Encoder Representations from Transformers

Link: https://arxiv.org/abs/1908.07490

Code: https://github.com/airsplay/lxmert
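The cross-modal fusion step used by two-stream models can likewise be sketched in NumPy. This is a hedged simplification of the co-attention blocks in ViLBERT/LXMERT-style architectures (single head, no projections or residuals): each modality is first encoded independently, then one modality supplies the queries while the other supplies the keys and values.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64                               # shared hidden size (illustrative)
text = rng.normal(size=(8, d))       # outputs of an independent text encoder
visual = rng.normal(size=(4, d))     # outputs of an independent visual encoder

def cross_attention(queries, keys_values):
    """Co-attention: positions in one modality attend over the other
    modality's states, exchanging information between the two streams."""
    scores = queries @ keys_values.T / np.sqrt(queries.shape[1])
    w = np.exp(scores - scores.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)
    return w @ keys_values

text_attended = cross_attention(text, visual)    # text attends to image regions
visual_attended = cross_attention(visual, text)  # image regions attend to text
```

Compared with the single-stream sketch, fusion happens only at these cross-attention layers, which lets each stream keep a modality-specific depth before the modalities interact.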

 

 

 

Video-Based BERT:

1. VideoBERT

Paper: VideoBERT: A Joint Model for Video and Language Representation Learning

Published at: ICCV 2019

 

 


 

