Transformers: Why are there no 'cls/' layers in RoBERTa PyTorch checkpoints?

Created on 24 Nov 2020 · 7 comments · Source: huggingface/transformers

In the PyTorch checkpoints of RoBERTa in Hugging Face Transformers, the last two layers are the pooler layers:
pooler.dense.weight
pooler.dense.bias

However, in the original RoBERTa TensorFlow checkpoints, the last few layers are not the pooler layers; instead, they are:
cls/predictions/output_bias (DT_FLOAT) [21128]
cls/predictions/transform/LayerNorm/beta (DT_FLOAT) [768]
cls/predictions/transform/LayerNorm/gamma (DT_FLOAT) [768]
cls/predictions/transform/dense/bias (DT_FLOAT) [768]
cls/predictions/transform/dense/kernel (DT_FLOAT) [768,768]
cls/seq_relationship/output_bias (DT_FLOAT) [2]
cls/seq_relationship/output_weights (DT_FLOAT) [2,768]

These 'cls/' layers come after the pooler layers.
I converted the PyTorch checkpoint into a TensorFlow checkpoint, but when I tried to load the weights, all I got was:
tensorflow.python.framework.errors_impl.NotFoundError: Key cls/predictions/transform/dense/kernel not found in checkpoint

which means the 'cls/' layers do not exist at all! So why are these layers missing from the PyTorch checkpoints provided by Hugging Face Transformers, and what should I do to get the weights of these 'cls/' layers? I am trying to use a RoBERTa checkpoint that was trained by someone else with Hugging Face Transformers, but I have to convert it to a TensorFlow checkpoint because my code is written in TensorFlow, and this problem occurs. How can I convert the checkpoint correctly?
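
For reference, here is a minimal sketch (not part of the original report) for listing which head weights a PyTorch checkpoint actually carries, assuming the checkpoint directory contains the usual pytorch_model.bin written by save_pretrained(); the path is a placeholder:

```python
import torch

# Load only the raw state dict; map_location avoids needing a GPU.
state_dict = torch.load("path/to/checkpoint/pytorch_model.bin", map_location="cpu")

# Show only the keys that belong to a prediction head (pooler, lm_head, cls, ...).
for name, tensor in state_dict.items():
    if any(tag in name for tag in ("pooler", "lm_head", "cls")):
        print(name, tuple(tensor.shape))
```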

All 7 comments

The authors of RoBERTa removed the next sentence prediction task during pre-training, as it didn't help much. See section 1 of the paper.


Really appreciate your reply! However, only the two 'cls/seq_relationship/' layers are responsible for the NSP task; the rest should be responsible for the MLM task. What's more, these are the exact layers that I extracted from the original RoBERTa TensorFlow checkpoint published by the authors of the paper... This is confusing. I am just wondering why the Hugging Face PyTorch checkpoints don't keep the weights of the MLM task; in UNILM, these weights are precious. Of course NSP is not that important.

Yes you're right, sorry. I think that the masked language modeling head has a different name in Huggingface Transformers. It is simply called lm_head. See here for the PyTorch implementation of RoBERTa. Note that you should use RobertaForMaskedLM rather than RobertaModel, since the latter does not have a masked language modeling head on top.

I think that the masked language modeling head has a different name in Huggingface Transformers. It is simply called lm_head. See here: https://huggingface.co/transformers/_modules/transformers/modeling_tf_roberta.html#TFRobertaForMaskedLM
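
A minimal sketch of that suggestion, assuming a checkpoint saved with Hugging Face Transformers ("roberta-base" is just an example identifier): loading it with RobertaForMaskedLM keeps the MLM head, whose weights live under lm_head rather than cls/predictions:

```python
from transformers import RobertaForMaskedLM

# RobertaForMaskedLM keeps the MLM head; RobertaModel would drop it.
model = RobertaForMaskedLM.from_pretrained("roberta-base")

# The head parameters live under "lm_head" (dense, layer_norm, decoder, bias).
for name, param in model.lm_head.named_parameters():
    print(name, tuple(param.shape))
```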

Appreciate it again! I will have a look tomorrow; in fact it is 2 a.m. in my city right now and I am already in bed hahah


That really makes sense to me, even though I am in bed. Thanks a lot!

You're welcome! Good night

Your approach solved my problem perfectly; I have now successfully converted the PyTorch weights into TensorFlow weights. Time to close the issue. ^_^
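
For anyone landing here later, a minimal sketch of one possible conversion route (not necessarily the exact steps used above); the paths are placeholders, and the result is a Hugging Face TF2 checkpoint (tf_model.h5) rather than an original BERT-style checkpoint with cls/ variable names:

```python
from transformers import TFRobertaForMaskedLM

# from_pt=True loads the PyTorch weights and maps them into the TF model,
# including the lm_head (the MLM head discussed above).
tf_model = TFRobertaForMaskedLM.from_pretrained("path/to/pytorch_checkpoint", from_pt=True)
tf_model.save_pretrained("path/to/tf_checkpoint")  # writes tf_model.h5
```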
