model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased")
returns this warning message:
Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForSequenceClassification: ['cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.decoder.weight', 'cls.seq_relationship.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPretraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
This just started popping up with v3.0, so I'm not sure what the recommended action is here. Please advise if you can. Basically, any of my code using the AutoModelFor<X> classes is throwing up this warning now.
Thanks.
Not sure what's happening with the multiple duplicate opened issues, @ohmeow?
Is GitHub flaky again? :)
I am also encountering the same warning.
When loading the model:
Some weights of the model checkpoint at bert-base-uncased were not used when initializing TFBertModel: ['nsp___cls', 'mlm___cls']
- This IS expected if you are initializing TFBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPretraining model).
- This IS NOT expected if you are initializing TFBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
All the weights of TFBertModel were initialized from the model checkpoint at bert-base-uncased.
If your task is similar to the task the model of the ckeckpoint was trained on, you can already use TFBertModel for predictions without further training.
When attempting to fine-tune it:
WARNING:tensorflow:Gradients do not exist for variables ['tf_bert_model_1/bert/pooler/dense/kernel:0', 'tf_bert_model_1/bert/pooler/dense/bias:0'] when minimizing the loss.
WARNING:tensorflow:Gradients do not exist for variables ['tf_bert_model_1/bert/pooler/dense/kernel:0', 'tf_bert_model_1/bert/pooler/dense/bias:0'] when minimizing the loss.
WARNING:tensorflow:Gradients do not exist for variables ['tf_bert_model_1/bert/pooler/dense/kernel:0', 'tf_bert_model_1/bert/pooler/dense/bias:0'] when minimizing the loss.
WARNING:tensorflow:Gradients do not exist for variables ['tf_bert_model_1/bert/pooler/dense/kernel:0', 'tf_bert_model_1/bert/pooler/dense/bias:0'] when minimizing the loss.
Is the model fine-tuning correctly? Are the pre-trained model weights also getting updated (fine-tuned), or are only the layers outside (above) the pre-trained model changing their weights during training?
Not sure what's happening with the multiple duplicate opened issues, @ohmeow?
Is GitHub flaky again? :)
I noticed the same thing. Not sure what is going on ... but I swear I only opened this one :)
@ohmeow you're loading the bert-base-uncased checkpoint (which is a checkpoint that was trained using a similar architecture to BertForPreTraining) in a BertForSequenceClassification model.
This means that:
- the layers that BertForPreTraining has, but BertForSequenceClassification does not have, will be discarded
- the layers that BertForSequenceClassification has, but BertForPreTraining does not have, will be randomly initialized.
This is expected, and tells you that you won't have good performance with your BertForSequenceClassification model before you fine-tune it :slightly_smiling_face:.
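A quick way to see this concretely (a minimal PyTorch sketch; it only assumes the standard from_pretrained behavior, with no fixed random seed):

import torch
from transformers import BertForSequenceClassification

# Load the same checkpoint twice.
model_a = BertForSequenceClassification.from_pretrained("bert-base-uncased")
model_b = BertForSequenceClassification.from_pretrained("bert-base-uncased")

# Encoder weights come from the checkpoint, so they match across loads.
print(torch.equal(model_a.bert.embeddings.word_embeddings.weight,
                  model_b.bert.embeddings.word_embeddings.weight))  # True

# The classification head is newly (randomly) initialized, so it differs on every load.
print(torch.equal(model_a.classifier.weight, model_b.classifier.weight))  # False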
@fliptrail this warning means that during your training, you're not using the pooler in order to compute the loss. I don't know how you're fine-tuning your model, but if you're not using the pooler layer then there's no need to worry about that warning.
@LysandreJik Thank you for your response.
I am using the code:
def main_model():
    # ppd is presumably an alias for the transformers package (e.g. `import transformers as ppd`)
    encoder = ppd.TFBertModel.from_pretrained("bert-base-uncased")

    input_ids = tf.keras.layers.Input(shape=(max_seq_len,), dtype=tf.int32)
    token_type_ids = tf.keras.layers.Input(shape=(max_seq_len,), dtype=tf.int32)
    attention_mask = tf.keras.layers.Input(shape=(max_seq_len,), dtype=tf.int32)

    # Only the first output (the sequence of hidden states) is used; the pooler output is ignored.
    embedding = encoder(input_ids, token_type_ids=token_type_ids, attention_mask=attention_mask)[0]
    pooling = tf.keras.layers.GlobalAveragePooling1D()(embedding)
    normalization = tf.keras.layers.BatchNormalization()(pooling)
    dropout = tf.keras.layers.Dropout(0.1)(normalization)
    out = tf.keras.layers.Dense(1, activation="sigmoid", name="final_output_bert")(dropout)

    model = tf.keras.Model(inputs=[input_ids, token_type_ids, attention_mask], outputs=out)

    # Note: the final Dense layer already applies a sigmoid, so `from_logits=True` is inconsistent here.
    loss = tf.keras.losses.BinaryCrossentropy(from_logits=True)
    optimizer = tf.keras.optimizers.Adam(lr=2e-5)
    metrics = ['accuracy', tf.keras.metrics.FalseNegatives(), tf.keras.metrics.FalsePositives()]
    model.compile(optimizer=optimizer, loss=loss, metrics=metrics)
    return model
model = main_model()
model.summary()
I am only using the pre-built TFBertModel.from_pretrained("bert-base-uncased") class. I am not initializing it from any other class. Still, I am encountering the warning. From what I can understand, this should only appear when initializing the given pre-trained model inside another architecture class.
Am I fine-tuning correctly? Are the BERT layer weights also getting updated?
Warning while loading model:
Some weights of the model checkpoint at bert-base-uncased were not used when initializing TFBertModel: ['nsp___cls', 'mlm___cls']
- This IS expected if you are initializing TFBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPretraining model).
- This IS NOT expected if you are initializing TFBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
All the weights of TFBertModel were initialized from the model checkpoint at bert-base-uncased.
If your task is similar to the task the model of the ckeckpoint was trained on, you can already use TFBertModel for predictions without further training.
While attempting to train:
WARNING:tensorflow:Gradients do not exist for variables ['tf_bert_model_1/bert/pooler/dense/kernel:0', 'tf_bert_model_1/bert/pooler/dense/bias:0'] when minimizing the loss.
This warning only started appearing yesterday, in all of my code and in other sample code as well.
Hello everyone,
I also started getting this warning today; before today it was working fine. Were there any changes in Colab?
This is the code I am using:
!pip install transformers
import tensorflow as tf
import transformers
from transformers import TFBertForSequenceClassification, BertConfig
tokenizer = transformers.BertTokenizer('gdrive/My Drive/Colab Notebooks/vocab.txt', do_lower_case=True)
max_seq_length = 128
bert = 'bert-large-uncased'
config = BertConfig.from_pretrained('bert-large-uncased', output_hidden_states=True, hidden_dropout_prob=0.2,
                                    attention_probs_dropout_prob=0.2)
transformer_model = TFBertForSequenceClassification.from_pretrained(bert, config=config)
input_ids_in = tf.keras.layers.Input(shape=(max_seq_length,), name='input_token', dtype='int32')
input_masks_in = tf.keras.layers.Input(shape=(max_seq_length,), name='masked_token', dtype='int32')
input_segments_in = tf.keras.layers.Input(shape=(max_seq_length,), name='segment_ids', dtype='int32')
embedding_layer = transformer_model(input_ids_in, attention_mask=input_masks_in, token_type_ids=input_segments_in)
I have been using this same code for more than two weeks with no problems until yesterday.
Please if anyone finds the solution, share it.
Thank you
Thanks @LysandreJik
This is expected, and tells you that you won't have good performance with your BertForSequenceClassification model before you fine-tune it
Makes sense.
Now, how do we know what checkpoints are available that were trained with BertForSequenceClassification?
@fliptrail in your code you have the following:
embedding = encoder(input_ids, token_type_ids=token_type_ids, attention_mask=attention_mask)[0]
which means you're only getting the first output of the model, and using that to compute the loss. The first output of the model is the hidden states:
https://github.com/huggingface/transformers/blob/master/src/transformers/modeling_tf_bert.py#L716-L738
Returns:
:obj:`tuple(tf.Tensor)` comprising various elements depending on the configuration (:class:`~transformers.BertConfig`) and inputs:
last_hidden_state (:obj:`tf.Tensor` of shape :obj:`(batch_size, sequence_length, hidden_size)`):
Sequence of hidden-states at the output of the last layer of the model.
pooler_output (:obj:`tf.Tensor` of shape :obj:`(batch_size, hidden_size)`):
Last layer hidden-state of the first token of the sequence (classification token)
further processed by a Linear layer and a Tanh activation function. The Linear
layer weights are trained from the next sentence prediction (classification)
objective during Bert pretraining. This output is usually *not* a good summary
of the semantic content of the input, you're often better with averaging or pooling
the sequence of hidden-states for the whole input sequence.
hidden_states (:obj:`tuple(tf.Tensor)`, `optional`, returned when ``output_hidden_states=True`` is passed or when ``config.output_hidden_states=True``):
tuple of :obj:`tf.Tensor` (one for the output of the embeddings + one for the output of each layer)
of shape :obj:`(batch_size, sequence_length, hidden_size)`.
Hidden-states of the model at the output of each layer plus the initial embedding outputs.
attentions (:obj:`tuple(tf.Tensor)`, `optional`, returned when ``output_attentions=True`` is passed or when ``config.output_attentions=True``):
tuple of :obj:`tf.Tensor` (one for each layer) of shape
:obj:`(batch_size, num_heads, sequence_length, sequence_length)`:
Attentions weights after the attention softmax, used to compute the weighted average in the self-attention
heads.
"""
You're ignoring the second value, which is the pooler output. The warnings are normal in your case.
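For reference, here is a rough illustration of those two outputs (assuming TF2 and the transformers 3.x tuple-style return). Since only outputs[0] feeds the Keras model above, the pooler variables never receive gradients, which is exactly what the TensorFlow warning reports:

import tensorflow as tf
from transformers import BertTokenizer, TFBertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = TFBertModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("hello world", return_tensors="tf")
outputs = model(inputs)

last_hidden_state = outputs[0]  # (batch, seq_len, hidden) -- what the Keras model above consumes
pooler_output = outputs[1]      # (batch, hidden)          -- unused, hence the gradient warning
print(last_hidden_state.shape, pooler_output.shape)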
@VaibhavBhatnagar17, these are warnings, not errors. What exact warning are you not understanding?
@ohmeow that really depends on what you want to do! Sequence classification is a large subject, with many different tasks. Here's a list of all available checkpoints fine-tuned on sequence classification (not all are for BERT, though!)
Please be aware that if you have a specific task in mind, you should fine-tune your model to that task.
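For example (the checkpoint name here is just one assumption of what's available on the model hub; any checkpoint fine-tuned for sequence classification behaves the same way), loading an already fine-tuned classification checkpoint does not trigger the "newly initialized" warning, because its classifier weights are stored in the checkpoint:

from transformers import AutoTokenizer, AutoModelForSequenceClassification

name = "distilbert-base-uncased-finetuned-sst-2-english"  # an SST-2 sentiment checkpoint
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForSequenceClassification.from_pretrained(name)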
@LysandreJik Hey, what I'm not able to understand is that I was using this code for more than two weeks and no warning came up until yesterday. I haven't changed anything, so the warning suddenly appearing is confusing.
I am also not getting the same output dimension as before and am not able to complete my project.
The warning came up yesterday because version 3.0.0 was released yesterday. It's weird that you're seeing a changed output dimension since yesterday, though. What's the error you get?
I see this same warning when initializing BertForMaskedLM, pasted in below for good measure. As other posters have mentioned, this warning began appearing only after upgrading to v3.0.0.
Some weights of the model checkpoint at bert-large-uncased-whole-word-masking were not used when initializing BertForMaskedLM: ['cls.seq_relationship.weight', 'cls.seq_relationship.bias']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPretraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForMaskedLM were not initialized from the model checkpoint at bert-large-uncased-whole-word-masking and are newly initialized: ['cls.predictions.decoder.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Note that my module imports/initializations essentially duplicate the snippet demonstrating cloze task usage at https://huggingface.co/bert-large-uncased-whole-word-masking?text=Paris+is+the+%5BMASK%5D+of+France.
from transformers import BertTokenizer, BertForMaskedLM

_tokenizer = BertTokenizer.from_pretrained(
    'bert-large-uncased-whole-word-masking')
_model = BertForMaskedLM.from_pretrained(
    'bert-large-uncased-whole-word-masking')
Am I correct in assuming that nothing has changed in the behavior of the relevant model, but that perhaps this warning should have been being printed all along?
You're right, this has always been the behavior of the models. It wasn't clear enough before, so we've clarified it with this warning.
Thanks, @LysandreJik .
Does anyone know how to suppress this warning? I am aware that the model needs fine-tuning, and I am fine-tuning it, so it becomes annoying to see this over and over again.
You can manage the warnings with the logging utility introduced in version 3.1.0:
from transformers import logging
logging.set_verbosity_warning()
@LysandreJik Thanks for the rapid response; I set it with set_verbosity_error().
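For anyone landing here later, a full snippet to silence these messages (requires transformers >= 3.1.0) looks like this:

from transformers import logging, AutoModelForSequenceClassification

logging.set_verbosity_error()  # only print errors; hides the weight-initialization warnings
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased")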
@LysandreJik - So, by default, bert-base-uncased loaded through TFBertModel has 199 variables [3 embeddings + 2 layer norms + (16 x 12 layers) + 2 (pooler kernel and bias)].
But when loaded through TFBertForMaskedLM, it has 204 variables. Below are the 5 extra variables:
tf_bert_for_masked_lm_1/mlm___cls/predictions/bias:0
tf_bert_for_masked_lm_1/mlm___cls/predictions/transform/dense/kernel:0
tf_bert_for_masked_lm_1/mlm___cls/predictions/transform/dense/bias:0
tf_bert_for_masked_lm_1/mlm___cls/predictions/transform/LayerNorm/gamma:0
tf_bert_for_masked_lm_1/mlm___cls/predictions/transform/LayerNorm/beta:0
So that means these 5 variables are being randomly initialised, right?
Are these 5 variables required for MLM (is this how it is in the official TensorFlow models), OR can we take the output token embeddings (before passing them to mlm___cls, of shape batch x sequence x embedding_dimension), multiply them with the word_embedding matrix to produce (batch x sequence x vocab_size), and then use that for the MLM loss?
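A rough sketch of that alternative (the attribute path to the tied embedding matrix is an assumption about the TF implementation and may differ between versions):

import tensorflow as tf
from transformers import BertTokenizer, TFBertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
encoder = TFBertModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("The capital of France is [MASK].", return_tensors="tf")
hidden_states = encoder(inputs)[0]                    # (batch, seq, hidden)

# Assumed attribute path to the word-embedding matrix, shape (vocab, hidden).
embedding_matrix = encoder.bert.embeddings.word_embeddings

# Project the hidden states back onto the vocabulary: (batch, seq, vocab).
logits = tf.matmul(hidden_states, embedding_matrix, transpose_b=True)
print(logits.shape)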
@LysandreJik I'm having a slightly different issue here - I'm loading a sequence classification checkpoint into an AutoModelForSequenceClassification model, but I still get the warning. Here's my code:
from transformers import AutoTokenizer, AutoModelForSequenceClassification
model = AutoModelForSequenceClassification.from_pretrained('roberta-large-mnli')
Output:
Some weights of the model checkpoint at roberta-large-mnli were not used when initializing RobertaForSequenceClassification: ['roberta.pooler.dense.weight', 'roberta.pooler.dense.bias']
- This IS expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPretraining model).
- This IS NOT expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
I believe it's NOT expected because I'm indeed initializing from a model that I expect to be exactly identical.
I'm only starting to get this warning after upgrading to transformers v3 as well. I'm using 3.3.1 currently. Could you please help? Thanks!
@s4sarath I'm not sure I understand your question.
@veronica320, the pooler layer is not used when doing sequence classification, so there's nothing to be worried about.
The pooler is the second output of the RobertaModel:
https://github.com/huggingface/transformers/blob/v3.4.0/src/transformers/modeling_roberta.py#L691
But only the first output is used in the sequence classification model:
https://github.com/huggingface/transformers/blob/v3.4.0/src/transformers/modeling_roberta.py#L1002
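In other words, inference with roberta-large-mnli works as expected despite the warning, since the classification head itself is loaded from the checkpoint. A small sanity check (a sketch; read the class order from model.config.id2label rather than assuming it):

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("roberta-large-mnli")
model = AutoModelForSequenceClassification.from_pretrained("roberta-large-mnli")

inputs = tokenizer("A man is playing guitar.", "A person is making music.", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs)[0]           # tuple output in v3.x; logits come first
print(torch.softmax(logits, dim=-1))      # map indices to labels via model.config.id2label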
Thanks a lot!
@LysandreJik - Sorry for the confusion.
tf_bert_for_masked_lm_1/mlm___cls/predictions/bias:0
tf_bert_for_masked_lm_1/mlm___cls/predictions/transform/dense/kernel:0
tf_bert_for_masked_lm_1/mlm___cls/predictions/transform/dense/bias:0
tf_bert_for_masked_lm_1/mlm___cls/predictions/transform/LayerNorm/gamma:0
tf_bert_for_masked_lm_1/mlm___cls/predictions/transform/LayerNorm/beta:0
The above five variables are being randomly initialised, right? That means they were not part of the official BERT.
Am I right?
Thank you for your explanation.
Actually, these five variables shouldn't be initialized randomly, as they're part of BERT. The official BERT checkpoints contain two heads: the MLM head and the NSP head.
You can see it here:
>>> from transformers import TFBertForMaskedLM
>>> model = TFBertForMaskedLM.from_pretrained("bert-base-cased")
Among the logging, you should find this:
Some layers from the model checkpoint at bert-base-cased were not used when initializing TFBertForMaskedLM: ['nsp___cls']
- This IS expected if you are initializing TFBertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPretraining model).
- This IS NOT expected if you are initializing TFBertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
All the layers of TFBertForMaskedLM were initialized from the model checkpoint at bert-base-cased.
This tells you two things:
- Some layers of the checkpoint are not used. These are ['nsp___cls'], corresponding to the CLS head. Since we're using a *ForMaskedLM, it makes sense not to use the CLS head.
- All the layers of the model were initialized from the model checkpoint, as both the transformer layers and the MLM head were present in the checkpoint.

If you're getting those variables randomly initialized:
tf_bert_for_masked_lm_1/mlm___cls/predictions/bias:0
tf_bert_for_masked_lm_1/mlm___cls/predictions/transform/dense/kernel:0
tf_bert_for_masked_lm_1/mlm___cls/predictions/transform/dense/bias:0
tf_bert_for_masked_lm_1/mlm___cls/predictions/transform/LayerNorm/gamma:0
tf_bert_for_masked_lm_1/mlm___cls/predictions/transform/LayerNorm/beta:0
then it means you're using a checkpoint that does not contain these variables. These are the MLM layers, so you're probably loading a checkpoint that was saved using an architecture that does not contain these layers. This can happen if you do the following:
>>> from transformers import TFBertModel, TFBertForMaskedLM
>>> model = TFBertModel.from_pretrained("bert-base-cased")
>>> model.save_pretrained(directory)
>>> mlm_model = TFBertForMaskedLM.from_pretrained(directory)
I hope this answers your question!
Oh okay. Thank you so much for the clarification. When I looked at BERT models from tf-hub, these variables were not present. That was the reason for the confusion.
Hi @LysandreJik. I had a look at the official BERT repo. There are only 199 variables in the official model checkpoints, which means that of the 204 variables, the last 5 (the MLM-layer variables) are initialised randomly. These variables are not part of the official checkpoints, I think.
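A quick way to compare the counts being discussed (a sketch; it lists the extra variable names rather than asserting specific numbers):

from transformers import TFBertModel, TFBertForMaskedLM

base = TFBertModel.from_pretrained("bert-base-uncased")
mlm = TFBertForMaskedLM.from_pretrained("bert-base-uncased")

print(len(base.weights), len(mlm.weights))

# Compare variable names with the outermost model scope stripped off.
base_names = {w.name.split("/", 1)[1] for w in base.weights}
mlm_names = {w.name.split("/", 1)[1] for w in mlm.weights}
print(sorted(mlm_names - base_names))  # the extra MLM-head variables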