In PyTorch checkpoints of RoBERTa in Hugging Face Transformers, the last two layers are the pooler layers:
pooler.dense.weight
pooler.dense.bias
However, in the original RoBERTa TensorFlow checkpoints, the last few layers are not the pooler layers; instead, they are:
cls/predictions/output_bias (DT_FLOAT) [21128]
cls/predictions/transform/LayerNorm/beta (DT_FLOAT) [768]
cls/predictions/transform/LayerNorm/gamma (DT_FLOAT) [768]
cls/predictions/transform/dense/bias (DT_FLOAT) [768]
cls/predictions/transform/dense/kernel (DT_FLOAT) [768,768]
cls/seq_relationship/output_bias (DT_FLOAT) [2]
cls/seq_relationship/output_weights (DT_FLOAT) [2,768]
These 'cls/' layers come after the pooler layers.
I converted the PyTorch checkpoint into a TensorFlow checkpoint. Then, when I tried to load the weights, all I got was:
tensorflow.python.framework.errors_impl.NotFoundError: Key cls/predictions/transform/dense/kernel not found in checkpoint
which means the 'cls/' layers do not exist at all! So why are these layers missing from the PyTorch checkpoints provided by Hugging Face Transformers? What should I do to get the weights of these 'cls/' layers? I am trying to use a RoBERTa checkpoint trained by someone else with Hugging Face Transformers, but I have to convert it to a TensorFlow checkpoint because my code is written in TensorFlow, and this problem occurs. How can I convert the checkpoint correctly?
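Here is how I am checking which weights the PyTorch checkpoint actually contains (a minimal sketch, assuming a standard pytorch_model.bin file saved by Hugging Face Transformers):

```python
import torch

# Load the raw state dict of the Hugging Face checkpoint (the path is just an example).
state_dict = torch.load("pytorch_model.bin", map_location="cpu")

# Print every parameter name and shape, so we can see whether any
# MLM/NSP head weights survived alongside the encoder and pooler.
for name, tensor in state_dict.items():
    print(name, tuple(tensor.shape))
```

In my case I only see encoder and pooler keys, nothing that corresponds to the 'cls/...' variables from the TF checkpoint.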
The authors of RoBERTa removed the next sentence prediction task during pre-training, as it didn't help much. See section 1 of the paper.
Really appreciate your reply! However, the two 'cls/seq_relationship/' layers are responsible for the NSP task; the rest should be responsible for the MLM task. What is more, these are exactly the layers that I extracted from the original RoBERTa TensorFlow checkpoint published by the authors of the paper... This is confusing. I am just wondering why the Hugging Face PyTorch checkpoints don't keep the weights of the MLM task; in UniLM, these weights are precious. Of course, NSP is not that important.
Yes you're right, sorry. I think that the masked language modeling head has a different name in Huggingface Transformers. It is simply called lm_head. See here for the PyTorch implementation of RoBERTa. Note that you should use RobertaForMaskedLM rather than RobertaModel, since the latter does not have a masked language modeling head on top.
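For example, something like this (a minimal sketch, with "roberta-base" standing in for your checkpoint) shows that the MLM head weights are present once you load the model with RobertaForMaskedLM:

```python
from transformers import RobertaForMaskedLM

# Load the checkpoint with the masked language modeling head on top
# (the model name is just a placeholder for your own checkpoint).
model = RobertaForMaskedLM.from_pretrained("roberta-base")

# The prediction head lives under the "lm_head" prefix.
for name, param in model.named_parameters():
    if name.startswith("lm_head"):
        print(name, tuple(param.shape))

# Roughly expected: lm_head.bias, lm_head.dense.weight, lm_head.dense.bias,
# lm_head.layer_norm.weight, lm_head.layer_norm.bias. The decoder weight is
# tied to the input embeddings, so it may not be listed separately here.
```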
I think that the masked language modeling head has a different name in Huggingface Transformers. It is simply called lm_head. See here: https://huggingface.co/transformers/_modules/transformers/modeling_tf_roberta.html#TFRobertaForMaskedLM
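If it helps, the correspondence between the lm_head parameters and the 'cls/predictions' variables in the original TF checkpoints should roughly be the following (a sketch based on the names above, not an official conversion script; note that PyTorch Linear weights must be transposed to match TF kernels):

```python
# Rough name mapping between the Hugging Face RoBERTa MLM head and the
# BERT/RoBERTa-style TF checkpoint variables (an assumption, not an official mapping):
LM_HEAD_TO_TF = {
    "lm_head.dense.weight":      "cls/predictions/transform/dense/kernel",   # transpose needed
    "lm_head.dense.bias":        "cls/predictions/transform/dense/bias",
    "lm_head.layer_norm.weight": "cls/predictions/transform/LayerNorm/gamma",
    "lm_head.layer_norm.bias":   "cls/predictions/transform/LayerNorm/beta",
    "lm_head.bias":              "cls/predictions/output_bias",
}
# There is no counterpart for the "cls/seq_relationship/*" variables,
# because RoBERTa drops the next sentence prediction head.
```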
Appreciate again! I will have a look tomorrow; in fact it is 2 a.m. in my city right now and I am totally in bed hahahh
That really makes sense to me, even though I am in bed. Thanks a lot!
You're welcome! Good night
Your approach solved my problem perfectly; I have now successfully converted the PyTorch weights into TensorFlow weights. Time to close the issue now. ^_^