Transformers: Training masked language model with Tensorflow

Created on 29 Nov 2019 · 8 Comments · Source: huggingface/transformers

❓ Questions & Help

I'm trying to fine-tune a masked language model starting from bert-base-multilingual-cased with TensorFlow, using the PyTorch-based example _examples/run_lm_finetuning_ as a starting point. I'd like to take the multilingual model and adapt it to Italian.
Unfortunately, I can't find any examples online of the TFBertForMaskedLM model in training mode, so I hope this is the appropriate place for this question.

System and libraries

Platform Linux-5.0.0-36-generic-x86_64-with-debian-buster-sid
Python 3.7.5 (default, Oct 25 2019, 15:51:11)
[GCC 7.3.0]
PyTorch 1.3.1
Tensorflow 2.0.0
Transformers 2.2.0

I first convert my training sentences into four arrays:
1) train_ids_masked: token ids with special tokens and masking applied, padded up to max_seq_length = 10
2) train_attnmasks: attention masks (padding masks)
3) train_segments: sentence (segment) masks, a constant array since the sentences are independent
4) train_labels: the original ids of the masked tokens, with the UNK token id everywhere else

Every array has shape (num sentences, max_seq_length) = (72,10)
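
For readers who want to reproduce the setup, here is a minimal NumPy sketch of how four such arrays could be built. It is not the code used in this question: the special-token ids are assumed values that should be checked against the tokenizer, `build_arrays` is a hypothetical helper, and unmasked label positions are filled with an assumed sentinel of -1 rather than the UNK id described above. It also always substitutes the mask token, which is simpler than BERT's original masking scheme.

```python
import numpy as np

# Assumed special-token ids; check them against your tokenizer before use.
PAD_ID, CLS_ID, SEP_ID, MASK_ID = 0, 101, 102, 103
IGNORE_ID = -1        # hypothetical sentinel for "do not score this position"
MAX_SEQ_LENGTH = 10


def build_arrays(sentences, mask_prob=0.15, seed=0):
    """sentences: list of lists of token ids, without special tokens."""
    rng = np.random.default_rng(seed)
    n = len(sentences)
    ids = np.full((n, MAX_SEQ_LENGTH), PAD_ID, dtype=np.int32)
    attn = np.zeros((n, MAX_SEQ_LENGTH), dtype=np.int32)
    segments = np.zeros((n, MAX_SEQ_LENGTH), dtype=np.int32)  # single segment
    labels = np.full((n, MAX_SEQ_LENGTH), IGNORE_ID, dtype=np.int32)
    for row, tokens in enumerate(sentences):
        # Truncate so that [CLS] and [SEP] still fit.
        tokens = [CLS_ID] + list(tokens)[: MAX_SEQ_LENGTH - 2] + [SEP_ID]
        ids[row, : len(tokens)] = tokens
        attn[row, : len(tokens)] = 1
        # Candidate positions exclude [CLS] and [SEP].
        candidates = np.arange(1, len(tokens) - 1)
        chosen = candidates[rng.random(len(candidates)) < mask_prob]
        labels[row, chosen] = ids[row, chosen]  # remember the original token
        ids[row, chosen] = MASK_ID              # hide it from the model
    return ids, attn, segments, labels
```

All four returned arrays have shape (number of sentences, MAX_SEQ_LENGTH), matching the layout described above.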

Then I define the model and print the summary

pre_trained_model = 'bert-base-multilingual-cased'

config = transformers.BertConfig.from_pretrained(pre_trained_model)

model = transformers.TFBertForMaskedLM.from_pretrained(pre_trained_model, config=config)

model.compile(optimizer=tf.optimizers.Adam(lr=params['learning_rate']), loss='binary_crossentropy')

print(model.summary())

which outputs

Model: "tf_bert_for_masked_lm_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
bert (TFBertMainLayer)       multiple                  177853440
_________________________________________________________________
mlm___cls (TFBertMLMHead)    multiple                  92920059
=================================================================
Total params: 178,565,115
Trainable params: 178,565,115
Non-trainable params: 0

Then I try to train the model

model.fit([train_ids_masked, train_attnmasks, train_segments], train_labels, epochs=1, batch_size=20)

The model trains on the first batch but then raises the following error

Train on 72 samples
20/72 [=======>......................] - ETA: 7sTraceback (most recent call last):
  File "/home/andrea/anaconda3/envs/tf2/lib/python3.7/site-packages/tensorflow_core/python/framework/ops.py", line 1610, in _create_c_op
    c_op = c_api.TF_FinishOperation(op_desc)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Dimensions must be equal, but are 10 and 119547 for 'loss/output_1_loss/mul' (op: 'Mul') with input shapes: [?,10], [?,?,119547].

The error occurs when calculating the loss: TensorFlow tries to match the sequence length max_seq_length (= 10) against the vocabulary size (= 119547).
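
The same kind of mismatch can be reproduced outside TensorFlow. The sketch below uses NumPy with made-up sizes (the vocabulary is shrunk to 7 for illustration) to show that an elementwise product between integer labels of shape (batch, seq_len) and scores of shape (batch, seq_len, vocab) cannot be broadcast, because the trailing dimensions 10 and 7 differ.

```python
import numpy as np

batch, seq_len, vocab = 20, 10, 7  # vocab shrunk from 119547 for illustration

labels = np.zeros((batch, seq_len))         # one integer id per token
scores = np.zeros((batch, seq_len, vocab))  # one score per vocabulary entry

try:
    labels * scores  # trailing dimensions 10 and 7 do not match
    mismatch = False
except ValueError:
    mismatch = True

print(mismatch)
```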

I've also tried to define the model in the following way

inp_ids = tf.keras.layers.Input(shape=(max_seq_length, ), dtype='int32', name="bert_input_ids")
inp_attnmasks = tf.keras.layers.Input(shape=(max_seq_length, ), dtype='int32', name="bert_input_attention_masks")
inp_segments = tf.keras.layers.Input(shape=(max_seq_length, ), dtype='int32', name="bert_input_segment_ids")
inputs = [inp_ids, inp_attnmasks, inp_segments]
outputs = transformers.TFBertForMaskedLM.from_pretrained(pre_trained_model)(inputs)
model = tf.keras.Model(inputs=inputs, outputs=outputs)
model.compile(optimizer=tf.optimizers.Adam(lr=params['learning_rate']), loss='binary_crossentropy')

but I get the same error.
My input and label arrays have the same shape as the ones in the _run_lm_finetuning_ example, and my model is simply the TensorFlow equivalent of the model used there.

What am I doing wrong?
Is it possible that this is related to the loss calculation rather than the definition of the model?
I've noticed that in the _run_lm_finetuning_ example the model has an additional argument masked_lm_labels

outputs = model(inputs, masked_lm_labels=labels) if args.mlm else model(inputs, labels=labels)

that makes it possible to compute the loss only on masked tokens in PyTorch, but this option is not present in TFBertForMaskedLM. How can I achieve that?

wontfix

blackcat84

Most helpful comment

I made an attempt on kaggle: https://www.kaggle.com/riblidezso/finetune-xlm-roberta-on-jigsaw-test-data-with-mlm

riblidezso on 30 Jun 2020

All 8 comments

I've noticed that in the run_lm_finetuning example the model has an additional argument masked_lm_labels

Yes, I have the same issue here. Did you manage to port the example code to TF?

In the torch models the argument is interpreted as follows:

        if masked_lm_labels is not None:
            loss_fct = CrossEntropyLoss(ignore_index=-1)  # -1 index = padding token
            masked_lm_loss = loss_fct(prediction_scores.view(-1, self.config.vocab_size), masked_lm_labels.view(-1))
            outputs = (masked_lm_loss,) + outputs

which means that one has to define a custom cross-entropy loss in Tensorflow.
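
To make the intended behaviour concrete, here is a framework-agnostic NumPy sketch of a cross-entropy that skips positions labelled with the ignore index. It illustrates the idea only and is not the library's implementation; `cross_entropy_ignore_index` is a hypothetical name, and the result is NaN when every position is ignored.

```python
import numpy as np


def cross_entropy_ignore_index(scores, labels, ignore_index=-1):
    """scores: (batch, seq_len, vocab) logits; labels: (batch, seq_len) ints."""
    scores = scores.reshape(-1, scores.shape[-1])
    labels = labels.reshape(-1)
    keep = labels != ignore_index            # positions that count
    scores, labels = scores[keep], labels[keep]
    # Numerically stable log-softmax over the vocabulary axis.
    shifted = scores - scores.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    # Mean negative log-likelihood of the true token at each kept position.
    return -log_probs[np.arange(len(labels)), labels].mean()
```

A function like this can serve as a reference value when checking a TensorFlow port on small inputs.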

btel on 16 Dec 2019

Unfortunately not. I looked into implementing the custom cross-entropy we are talking about, but I switched to PyTorch because it wasn't clear to me whether the custom loss would solve all the problems I had.

blackcat84 on 16 Dec 2019

I see. I guess I will take the same road ;-) At least I can do the finetuning in torch and later convert the model to TF. Thanks for sharing the info!

BTW, I found an implementation of the custom loss we are talking about in the Google BERT repo:

    # The `positions` tensor might be zero-padded (if the sequence is too
    # short to have the maximum number of predictions). The `label_weights`
    # tensor has a value of 1.0 for every real prediction and 0.0 for the
    # padding predictions.
    per_example_loss = -tf.reduce_sum(log_probs * one_hot_labels, axis=[-1])
    numerator = tf.reduce_sum(label_weights * per_example_loss)
    denominator = tf.reduce_sum(label_weights) + 1e-5
    loss = numerator / denominator

Here is the link to the original code:

https://github.com/google-research/bert/blob/cc7051dc592802f501e8a6f71f8fb3cf9de95dc9/run_pretraining.py#L273-L280

btel on 16 Dec 2019

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale[bot] on 14 Feb 2020

While everyone correctly pointed out that you need a loss function which handles masks, the original error message posted here is actually unrelated to that.

model.compile(optimizer=tf.optimizers.Adam(lr=params['learning_rate']), loss='binary_crossentropy')

Your model is compiled with binary cross-entropy, which expects labels with the same shape as the predictions, i.e. one-hot encoded labels of shape (batch size x len x len(dict)), while you provide the labels as integer token ids (5673 etc.) with shape (batch size x len). This leads to a shape mismatch.
The error message comes from comparing the last dimension of each shape: len(dict) vs. the text length.

tensorflow.python.framework.errors_impl.InvalidArgumentError: Dimensions must be equal, but are 10 and 119547 for 'loss/output_1_loss/mul' (op: 'Mul') with input shapes: [?,10], [?,?,119547].

Using tf.keras.losses.SparseCategoricalCrossentropy resolves this error, but of course you will still need to implement a masked loss function to train properly.
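
One possible shape for such a masked loss is sketched below. It is a sketch under assumptions, not a tested recipe for TFBertForMaskedLM: it assumes the labels use -1 at every position that should not be scored, that the model output passed to the loss is the raw prediction scores (logits) of shape (batch, seq_len, vocab), and `masked_sparse_crossentropy` is a hypothetical name.

```python
import tensorflow as tf

IGNORE_INDEX = -1  # assumed sentinel for positions that should not be scored


def masked_sparse_crossentropy(y_true, y_pred):
    """y_true: (batch, seq_len) token ids; y_pred: (batch, seq_len, vocab) logits."""
    y_true = tf.cast(y_true, tf.int32)
    ignored = tf.equal(y_true, IGNORE_INDEX)
    weights = tf.cast(tf.logical_not(ignored), y_pred.dtype)
    # Swap the sentinel for a valid id so the lookup cannot fail;
    # those positions get zero weight below.
    safe_labels = tf.where(ignored, tf.zeros_like(y_true), y_true)
    per_token = tf.keras.losses.sparse_categorical_crossentropy(
        safe_labels, y_pred, from_logits=True
    )
    # Average over scored positions only; the epsilon guards against
    # a batch with nothing to score.
    return tf.reduce_sum(per_token * weights) / (tf.reduce_sum(weights) + 1e-5)
```

In principle the function can be passed as `loss=masked_sparse_crossentropy` to `model.compile`, but whether Keras applies it to the right output of the Transformers model should be verified on a small batch first.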

riblidezso on 6 May 2020

Has anyone continued with TensorFlow? I don't want to switch to PyTorch, so I will try to implement a masked loss function. If anyone has already done this, I would be happy to hear about it.