I was not able to reproduce the fill-mask results shown here with the Hugging Face interface, and I'm wondering whether they are using an outdated model (if a new one has been released) :)
import torch
roberta = torch.hub.load('pytorch/fairseq', 'roberta.large')
roberta.eval()
roberta.fill_mask('Anything with a hint of swollen bud <mask>')  # sentence truncated in the original post; fill_mask expects a <mask> token in the input
Sorry, could you provide more details? I'm confused about what the issue is.
Thanks so much for the follow-up. I re-implemented the mask_predict function with the Hugging Face release of RoBERTa, and the predictions of the model are slightly different; the ones I see here look better. I wonder whether there were multiple releases of the model and they are using an older one. I understand this is a strange problem, though.
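For context, a re-implementation along these lines would look roughly like the sketch below. This is a hedged reconstruction of the setup, not the exact code from the report; the model name, the example sentence, and the tuple-style output indexing are assumptions.

import torch
from transformers import RobertaTokenizer, RobertaForMaskedLM

# Load the Hugging Face release of RoBERTa-large (assumed setup, not the exact code).
tokenizer = RobertaTokenizer.from_pretrained('roberta-large')
model = RobertaForMaskedLM.from_pretrained('roberta-large')
model.eval()

# Example input; any sentence containing a single <mask> token works.
text = 'The capital of France is <mask>.'
input_ids = tokenizer.encode(text, return_tensors='pt')
mask_pos = (input_ids[0] == tokenizer.mask_token_id).nonzero()[0].item()

with torch.no_grad():
    logits = model(input_ids)[0]  # index 0 of the output holds (1, seq_len, vocab) logits

# Top-5 predictions for the masked position.
top = logits[0, mask_pos].topk(5)
print([tokenizer.decode([int(i)]) for i in top.indices])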
There was only one release of the model, so it's concerning if you're seeing differences. Can you share a bit more about your implementation? Can you confirm that the Tensor input fed to the model is the same in both cases?
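For anyone comparing the two pipelines, a minimal sketch of such a check is below. The hub and model identifiers are the standard public ones, and it assumes both releases map text to the same vocabulary IDs; if the printed tensors differ, the divergence is in preprocessing rather than in the weights.

import torch
from transformers import RobertaTokenizer

# Load both tokenization paths side by side.
fairseq_roberta = torch.hub.load('pytorch/fairseq', 'roberta.large')
hf_tokenizer = RobertaTokenizer.from_pretrained('roberta-large')

text = 'Anything with a hint of swollen bud'
fairseq_ids = fairseq_roberta.encode(text)        # LongTensor including <s> and </s>
hf_ids = torch.tensor(hf_tokenizer.encode(text))  # encode() also adds special tokens

print(fairseq_ids.tolist())
print(hf_ids.tolist())
print('identical input:', torch.equal(fairseq_ids, hf_ids))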
Hi @myleott. I tracked the difference and describe it over at transformers (https://github.com/huggingface/transformers/issues/1874#issuecomment-588359143), and added a PR to make transformers RoBERTa identical to fairseq's implementation.
I am a bit confused about the fairseq implementation, though. Why is the weight of the LMHead the same as the weight of the token embeddings? What's the reasoning behind this?
I need to look at the code again, but I wonder if it is the same as the weight tying trick in LMs.
Also, thanks so much for following up on this, guys.
I was wondering about that, too, but why would you tie the weights of the embeddings and those of the LMHead? What would be the reasoning behind that?
EDIT: it's been pointed out to me that this has been common practice for a while now. Cf. "Using the Output Embedding to Improve Language Models" (Press & Wolf, 2016).
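For illustration, here is a minimal sketch of that weight tying trick in PyTorch. The dimensions are RoBERTa-large-like placeholders and the variable names are made up; this is not the actual fairseq code.

import torch
import torch.nn as nn

vocab_size, hidden = 50265, 1024

embed = nn.Embedding(vocab_size, hidden)
lm_head = nn.Linear(hidden, vocab_size, bias=True)
lm_head.weight = embed.weight  # tie: one shared Parameter, not a copy

tokens = torch.tensor([[0, 5, 7]])
hidden_states = embed(tokens)    # (1, 3, hidden)
logits = lm_head(hidden_states)  # (1, 3, vocab_size)

assert lm_head.weight is embed.weight  # gradients from both uses accumulate in one tensor

The shapes line up because nn.Linear stores its weight as (out_features, in_features) = (vocab_size, hidden), which is exactly the shape of the embedding table; tying saves a full vocab-sized projection matrix, and the Press & Wolf paper argues it also acts as a regularizer.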