Running this code:
from transformers import BartModel

# Compare the shared (encoder/decoder) token embeddings of the two checkpoints
x = BartModel.from_pretrained('facebook/bart-large')
x2 = BartModel.from_pretrained('facebook/bart-large-cnn')
print(x.shared)
print(x2.shared)
gives:
Embedding(50265, 1024, padding_idx=1)
Embedding(50264, 1024, padding_idx=1)
Why is the vocabulary size different? Isn't it supposed to be the same? Does it just come from the original authors' checkpoint?
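For what it's worth, the number is read from each checkpoint's saved config, so it can be inspected without loading the weights; a minimal sketch (assuming the facebook/bart-large and facebook/bart-large-cnn Hub identifiers):

from transformers import AutoConfig

# The vocab size reported by each checkpoint's config (values from the output above)
print(AutoConfig.from_pretrained('facebook/bart-large').vocab_size)      # 50265
print(AutoConfig.from_pretrained('facebook/bart-large-cnn').vocab_size)  # 50264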
@sshleifer
Good catch. There is no mask token in the second checkpoint. I believe that is the same as the authors' implementation.
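A minimal way to check this yourself, as a sketch (assuming the facebook/bart-large and facebook/bart-large-cnn Hub checkpoints; the tokenizer files are stored separately from the model weights, so the two views can disagree):

from transformers import BartModel, BartTokenizer

# For each checkpoint, print the size of the shared embedding table
# and whether '<mask>' appears in the tokenizer's vocabulary.
for name in ('facebook/bart-large', 'facebook/bart-large-cnn'):
    tokenizer = BartTokenizer.from_pretrained(name)
    model = BartModel.from_pretrained(name)
    print(name)
    print('  embedding rows:', model.shared.num_embeddings)
    print("  '<mask>' in tokenizer vocab:", '<mask>' in tokenizer.get_vocab())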
Completely off topic: if you still have the xsum data you used I would love a copy. I'm sam [at] huggingface.co.
Thanks for your fast answer!
Do you know why there is no mask token in the second checkpoint, and whether it has any impact on the score?
I have a hunch that there is no <mask> token because of fairseq's --find-unused-parameters command-line argument, but I'm not certain.
I would guess there is no impact on the score, because <mask> does not show up in the fine-tuning data.
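If <mask> were ever needed downstream, a hypothetical workaround (a sketch, not something the checkpoint requires) would be to register the token with the tokenizer and resize the embedding matrix to match:

from transformers import BartForConditionalGeneration, BartTokenizer

tokenizer = BartTokenizer.from_pretrained('facebook/bart-large-cnn')
model = BartForConditionalGeneration.from_pretrained('facebook/bart-large-cnn')

# Register '<mask>' as the mask token (a no-op if the tokenizer already has it),
# then grow the embedding table so the model covers every tokenizer id.
# Any new rows are randomly initialized and only become meaningful after further training.
tokenizer.add_special_tokens({'mask_token': '<mask>'})
model.resize_token_embeddings(len(tokenizer))
print(model.get_input_embeddings().num_embeddings)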