Hi! We're scratching our heads over the way RoBERTa handles its inputs.
The following matrix is of size 514x768:
from fairseq.models.roberta import RobertaModel
model = RobertaModel.from_pretrained("../roberta.base")
print(model.model.decoder.sentence_encoder.embed_positions.weight.size())
# torch.Size([514, 768])
Why is it different from the maximum sequence length, which is 512? Furthermore, we observe that the second row of this matrix (index 1) is full of zeros:
print(model.model.decoder.sentence_encoder.embed_positions.weight[1, :])
# tensor([0., 0., 0., 0., 0., 0., 0., 0. ...
Why is that? Thank you.
Yes, padding_idx is 1 here (as it usually is in fairseq), so the row at that index is always a vector of all zeros. The positional embeddings for actual positions then start at padding_idx + 1, i.e. indices 2 through 513, which is why the table has 512 + 2 = 514 rows. Hope this clears it up!
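(For anyone reading along: the same behaviour can be reproduced with a plain torch.nn.Embedding, which is essentially what fairseq's learned positional embedding builds on; the sizes below just mirror roberta.base.)
import torch.nn as nn
# num_embeddings = max_positions + padding_idx + 1 = 512 + 1 + 1 = 514
pos_emb = nn.Embedding(514, 768, padding_idx=1)
print(pos_emb.weight.size())
# torch.Size([514, 768])
print(pos_emb.weight[1].abs().sum().item())
# 0.0  <- the row at padding_idx is initialized to zeros and its gradient is always zeroed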
Ah okay, that makes sense. In that case, what is the first row (index 0)? Is it full of randomly initialized values? If so, why not use padding_idx = 0? Thank you for your answer.
Yeah, the first vector contains randomly initialized values that never get used. There's no particular reason why padding_idx is 1 instead of 0, other than that being the index the padding token gets in the dictionary. We need to use the same padding_idx value for both embed_tokens and embed_positions.
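In case it helps, here is a rough sketch of how the position numbers end up starting at padding_idx + 1 (roughly what fairseq does internally; the token ids are just an illustrative, already-BPE-encoded batch, not taken from a real run):
import torch
padding_idx = 1
# a right-padded batch of token ids; 1 is the padding symbol
tokens = torch.tensor([[0, 31414, 232, 2, 1, 1],
                       [0, 31414, 232, 328, 2, 1]])
# non-pad tokens get positions padding_idx + 1, padding_idx + 2, ...;
# pad tokens keep position padding_idx, whose embedding row is all zeros
mask = tokens.ne(padding_idx).long()
positions = torch.cumsum(mask, dim=1) * mask + padding_idx
print(positions)
# tensor([[2, 3, 4, 5, 1, 1],
#         [2, 3, 4, 5, 6, 1]])
Index 0 never appears among the computed positions, which is why the first row of embed_positions.weight is never looked up.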
Okay, I understand. Thank you very much for your help!