Data Efficient Masked Language Modeling for Vision and Language
Repository for the paper "Data Efficient Masked Language Modeling for Vision and Language", accepted to Findings of EMNLP 2021.
The baseline MLM masks tokens at random with 15% probability (where ~50% of the masked tokens are stop-words or punctuation). Our method instead masks words that require the image in order to be predicted (e.g., physical objects). Our experiments show that our pretraining masking strategy consistently improves over the baseline strategy in two evaluation setups.
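The difference between the two strategies can be illustrated with a minimal sketch. This is not the code from this repository: the function names, the `OBJECT_WORDS` set, and the fallback to baseline masking when a sentence contains no object word are all assumptions made for illustration. It also always substitutes the mask token and omits any random or unchanged replacements.

```python
import random

MASK = "[MASK]"

# Hypothetical object vocabulary, for illustration only.
OBJECT_WORDS = {"man", "motorcycle", "street"}


def baseline_mask(tokens, mask_prob=0.15, rng=None):
    """Mask each token independently with probability mask_prob."""
    if rng is None:
        rng = random.Random()
    return [MASK if rng.random() < mask_prob else tok for tok in tokens]


def object_mask(tokens, object_words=OBJECT_WORDS, rng=None):
    """Mask one randomly chosen object word (assumed variant of the idea).

    Falls back to baseline masking when no object word is present;
    that fallback is an assumption of this sketch.
    """
    if rng is None:
        rng = random.Random()
    candidates = [i for i, tok in enumerate(tokens) if tok.lower() in object_words]
    if not candidates:
        return baseline_mask(tokens, rng=rng)
    chosen = rng.choice(candidates)
    return [MASK if i == chosen else tok for i, tok in enumerate(tokens)]


if __name__ == "__main__":
    sentence = "a man rides a motorcycle down the street".split()
    print(baseline_mask(sentence, rng=random.Random(0)))
    print(object_mask(sentence, rng=random.Random(0)))
```

With the baseline, the masked positions often land on words such as "a" or "the", which can be guessed from the text alone. The object-focused variant only masks words that are likely to need the image.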
The code for pretraining is based on the great LXMERT repository: https://github.com/airsplay/lxmert
This repository includes:
data_directory is available here: https://drive.google.com/drive/folders/1smFCIwNbIm4QhNHf4gn5RKRfcvGh4_Vl?usp=sharing
Pretrained models and fine-tuned models are available here:
Sets of annotated Objects, Attributes, Relationships from GQA and Visual Genome
Aggregated data, where we extracted the Δ Validation loss (the loss without the image minus the loss with the image) for the LXMERT validation set. This is used to define how necessary the image is for predicting a masked word during MLM. Available in
The structure of the csv is as follows:
We can see the sentence, the image, and the masked token (motorcycle).
ind_loss_with_img is the loss with the image,
ind_loss_false_img is the loss without the image, and
loss_gap is the Δ Validation loss.
conf_gap_of_label_false_img and conf_gap give the confidence of the model (logits at the location of the masked word).
top_5_preds_token_false_img - predictions of the model, with and without the image.
tagged_pos - the spaCy POS tags for the sentence.
label_in_top_5_false_img - a boolean value for whether the label is among the top 5 predictions. In this example, without the image, the label is not among the top 5 predictions, but with the image, it is.
Code for the alternative masking strategies, available in
Semantic classes information, including functions to detect Objects, Attributes, and Relationships, available in
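As a rough illustration of how the Δ Validation loss columns described above relate to each other, here is a small sketch that recomputes loss_gap from ind_loss_false_img and ind_loss_with_img and ranks tokens by it. The sample rows, their values, and the `masked_token` column name are made up for the example. The real csv has more columns, and its exact header may differ.

```python
import csv
import io

# Made-up sample rows; the values are not taken from the released data.
SAMPLE_CSV = """masked_token,ind_loss_with_img,ind_loss_false_img
motorcycle,0.5,4.0
the,0.25,0.25
"""


def add_loss_gap(rows):
    """Return copies of the rows with loss_gap = loss without image - loss with image."""
    out = []
    for row in rows:
        with_img = float(row["ind_loss_with_img"])
        false_img = float(row["ind_loss_false_img"])
        out.append({**row, "loss_gap": false_img - with_img})
    return out


rows = add_loss_gap(csv.DictReader(io.StringIO(SAMPLE_CSV)))

# A larger gap means the image mattered more for predicting the token.
ranked = sorted(rows, key=lambda r: r["loss_gap"], reverse=True)

if __name__ == "__main__":
    for row in ranked:
        print(row["masked_token"], row["loss_gap"])
```

A token with a large positive gap is much harder to predict without the image, while a gap near zero suggests the text alone was enough.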