VL-BERT: PRE-TRAINING OF GENERIC VISUAL-LINGUISTIC REPRESENTATIONS
We introduce a new pre-trainable generic representation for visual-linguistic tasks,
called Visual-Linguistic BERT (VL-BERT for short). VL-BERT adopts the simple yet powerful Transformer model as the backbone, and extends it to take both
visual and linguistic embedded features as input. In it, each element of the input is either a word from the input sentence or a region-of-interest (RoI) from
the input image. It is designed to fit most visual-linguistic downstream
tasks. To better exploit the generic representation, we pre-train VL-BERT on the
massive-scale Conceptual Captions dataset, together with a text-only corpus. Extensive empirical analysis demonstrates that the pre-training procedure can better
align the visual-linguistic clues and benefit the downstream tasks, such as visual
commonsense reasoning, visual question answering, and referring expression comprehension. It is worth noting that VL-BERT achieved first place among single
models on the leaderboard of the VCR benchmark.
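To make the architecture concrete, the sketch below shows one way a joint input sequence of word tokens and image RoIs can be assembled and fed to a standard Transformer encoder. It is a minimal illustration with hypothetical module and parameter names (e.g. `VisualLinguisticInput`, `visual_dim`); the released VL-BERT code at https://github.com/jackroos/VL-BERT differs in its embedding details and pre-training heads.

```python
import torch
import torch.nn as nn

class VisualLinguisticInput(nn.Module):
    """Illustrative sketch: embed words and RoI features into one sequence.

    Hypothetical module for explanation only; not the released implementation.
    """

    def __init__(self, vocab_size=30522, visual_dim=2048, hidden=768, max_len=512):
        super().__init__()
        self.token_embed = nn.Embedding(vocab_size, hidden)
        self.visual_proj = nn.Linear(visual_dim, hidden)   # project RoI features to hidden size
        self.segment_embed = nn.Embedding(2, hidden)       # 0 = linguistic element, 1 = visual element
        self.position_embed = nn.Embedding(max_len, hidden)

    def forward(self, token_ids, roi_features):
        # token_ids: (B, T) word-piece ids; roi_features: (B, R, visual_dim) from a detector
        text = self.token_embed(token_ids)                 # (B, T, hidden)
        vis = self.visual_proj(roi_features)               # (B, R, hidden)
        seq = torch.cat([text, vis], dim=1)                # words followed by RoIs

        B, L, _ = seq.shape
        segment = torch.cat([
            torch.zeros(B, text.size(1), dtype=torch.long),
            torch.ones(B, vis.size(1), dtype=torch.long)], dim=1)
        position = torch.arange(L).unsqueeze(0).expand(B, L)

        return seq + self.segment_embed(segment) + self.position_embed(position)


# Usage: the mixed sequence is processed by an ordinary Transformer encoder,
# so attention operates jointly over linguistic and visual elements.
inputs = VisualLinguisticInput()(torch.randint(0, 30522, (2, 16)),
                                 torch.randn(2, 10, 2048))
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=768, nhead=12, batch_first=True), num_layers=2)
out = encoder(inputs)  # (2, 26, 768) contextualized visual-linguistic features
```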
Pre-trainable generic representations for visual-linguistic tasks are becoming increasingly important.
We have released the code for VL-BERT: https://github.com/jackroos/VL-BERT.