Fairseq: Disparity with the Hugging Face implementation when reproducing fill_mask

Created on 19 Nov 2019 · 8 comments · Source: pytorch/fairseq

I was not able to reproduce the fill_mask results shown here with the Hugging Face interface, and I'm wondering whether it uses an outdated model (i.e., whether a new one has been released) :)

bug help wanted

All 8 comments

import torch

# Load pretrained RoBERTa-large from the PyTorch hub; eval() disables dropout.
roberta = torch.hub.load('pytorch/fairseq', 'roberta.large')
roberta.eval()
# fill_mask needs a <mask> token in the input (the literal tag was likely
# stripped when this issue was rendered; its position here is a guess).
# The second argument is topk, the number of candidate fills to return.
roberta.fill_mask('Anything with a hint of swollen bud <mask> my prey.', topk=40)
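For reference, fill_mask returns its top-k candidates as (filled sentence, probability, token) tuples, so the ranked predictions can be inspected directly. A minimal sketch, assuming the model loaded above (tuple order as in the hub interface at the time of this issue; verify against your fairseq version):

for sentence, prob, token in roberta.fill_mask(
        'Anything with a hint of swollen bud <mask> my prey.', topk=5):
    print(f'{prob:.4f}\t{token}\t{sentence}')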

Sorry, could you provide more details? I'm confused about what the issue is.

Thanks so much for the follow-up. I re-implemented the mask_predict function with the Hugging Face release of RoBERTa, and what I observe is that the predictions of the two models are slightly different; the ones here look better. I wonder whether there were multiple releases of the model and Hugging Face is using an older one. I understand this is a weird problem, though.
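For comparison, here is a minimal sketch of the transformers side of such a re-implementation (not the exact code from this report), assuming the 'roberta-large' checkpoint on the Hugging Face hub and a transformers version recent enough that model outputs expose .logits (older versions return a tuple instead):

import torch
from transformers import RobertaTokenizer, RobertaForMaskedLM

tokenizer = RobertaTokenizer.from_pretrained('roberta-large')
model = RobertaForMaskedLM.from_pretrained('roberta-large')
model.eval()

inputs = tokenizer('Anything with a hint of swollen bud <mask> my prey.',
                   return_tensors='pt')
with torch.no_grad():
    logits = model(**inputs).logits

# Rank the top candidate tokens at the masked position.
mask_pos = (inputs['input_ids'][0] == tokenizer.mask_token_id).nonzero(as_tuple=True)[0]
top = logits[0, mask_pos].softmax(dim=-1).topk(40)
print([tokenizer.decode([i]).strip() for i in top.indices[0].tolist()])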

There was only one release of the model, so it's concerning if you're seeing differences. Can you share a bit more about your implementation? Can you confirm that the Tensor input fed to the model is the same in both cases?
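One quick sanity check along these lines, assuming the fairseq and transformers models from the snippets above are loaded: compare the token IDs the two libraries produce for the same input. (Plain text is used here because fairseq's fill_mask handles the literal '<mask>' tag specially rather than passing it through BPE.)

# The two ID sequences should match element for element.
fairseq_ids = roberta.encode('Hello world!')
hf_ids = tokenizer('Hello world!')['input_ids']
print(fairseq_ids.tolist())
print(hf_ids)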

Hi @myleott. I tracked down the difference and describe it over at transformers (https://github.com/huggingface/transformers/issues/1874#issuecomment-588359143), and I opened a PR to make the transformers RoBERTa identical to fairseq's implementation.

I am a bit confused about the fairseq implementation, though. Why is the weight of the LMHead the same as the weight of the token embeddings? What's the reasoning behind this?

I need to look at the code again, but I wonder if it is the same as the weight-tying trick in LMs.

Also, thanks so much for following up on this, guys.

I was wondering about that, too, but why would you tie the weights of the embeddings and those of the LMHead? What would be the reasoning behind that?

EDIT: it's been pointed out to me that this has been common practice for a while now. Cf. Using the Output Embedding to Improve Language Models (Press & Wolf, 2016).
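For illustration, a minimal sketch of the weight-tying trick in PyTorch (not fairseq's actual code; the dimensions are RoBERTa-large's):

import torch.nn as nn

vocab_size, hidden = 50265, 1024  # RoBERTa-large vocabulary and hidden size
embed = nn.Embedding(vocab_size, hidden)
lm_head = nn.Linear(hidden, vocab_size, bias=False)

# Tie the weights: both layers now share one Parameter, so gradients from
# the input embedding and the output projection update the same tensor.
lm_head.weight = embed.weight
assert lm_head.weight.data_ptr() == embed.weight.data_ptr()

Besides the regularization effect reported in that paper, tying also saves vocab_size × hidden parameters, which is substantial for large vocabularies.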
