Running this code:
from transformers import BartModel

# Compare the shared (encoder/decoder) token embeddings of the two checkpoints
x = BartModel.from_pretrained('facebook/bart-large')
x2 = BartModel.from_pretrained('facebook/bart-large-cnn')
print(x.shared)
print(x2.shared)
gives:
Embedding(50265, 1024, padding_idx=1)
Embedding(50264, 1024, padding_idx=1)
Why is the vocabulary size different? Isn't it supposed to be the same? Does it just come from the original authors' checkpoint?
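For what it's worth, the number is read from each checkpoint's saved config, so it can be inspected without loading the weights; a minimal sketch (assuming the facebook/bart-large and facebook/bart-large-cnn Hub identifiers):

from transformers import AutoConfig

# The vocab size reported by each checkpoint's config (values from the output above)
print(AutoConfig.from_pretrained('facebook/bart-large').vocab_size)      # 50265
print(AutoConfig.from_pretrained('facebook/bart-large-cnn').vocab_size)  # 50264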
@sshleifer
Good catch. There is no mask token in the second checkpoint. I believe that is the same as the authors' implementation.
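A minimal way to check this yourself, as a sketch (assuming the facebook/bart-large and facebook/bart-large-cnn Hub checkpoints; the tokenizer files are stored separately from the model weights, so the two views can disagree):

from transformers import BartModel, BartTokenizer

# For each checkpoint, print the size of the shared embedding table
# and whether '<mask>' appears in the tokenizer's vocabulary.
for name in ('facebook/bart-large', 'facebook/bart-large-cnn'):
    tokenizer = BartTokenizer.from_pretrained(name)
    model = BartModel.from_pretrained(name)
    print(name)
    print('  embedding rows:', model.shared.num_embeddings)
    print("  '<mask>' in tokenizer vocab:", '<mask>' in tokenizer.get_vocab())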
Completely off topic: if you still have the xsum data you used I would love a copy. I'm sam [at] huggingface.co.
Thanks for your fast answer!
Do you know why there is no mask token in the second checkpoint, and whether it has any impact on the score?
I have a hunch that there is no <mask> token because of fairseq's --find-unused-parameters command-line argument, but I'm not certain.
I would guess there is no impact on the score, because <mask> does not show up in the fine-tuning data.
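If <mask> were ever needed downstream, a hypothetical workaround (a sketch, not something the checkpoint requires) would be to register the token with the tokenizer and resize the embedding matrix to match:

from transformers import BartForConditionalGeneration, BartTokenizer

tokenizer = BartTokenizer.from_pretrained('facebook/bart-large-cnn')
model = BartForConditionalGeneration.from_pretrained('facebook/bart-large-cnn')

# Register '<mask>' as the mask token (a no-op if the tokenizer already has it),
# then grow the embedding table so the model covers every tokenizer id.
# Any new rows are randomly initialized and only become meaningful after further training.
tokenizer.add_special_tokens({'mask_token': '<mask>'})
model.resize_token_embeddings(len(tokenizer))
print(model.get_input_embeddings().num_embeddings)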