Fairseq: Disparity with the Hugging Face implementation when reproducing fill_mask

Created on 19 Nov 2019 · 8 comments · Source: pytorch/fairseq

I was not able to reproduce the fill_mask results shown here with the Hugging Face interface, and I'm wondering whether it uses an outdated model (i.e., whether a new one has been released) :)

bug help wanted

All 8 comments

import torch

# Load pretrained RoBERTa-large from the PyTorch hub; eval() disables dropout.
roberta = torch.hub.load('pytorch/fairseq', 'roberta.large')
roberta.eval()
# fill_mask needs a <mask> token in the input (the literal tag was likely
# stripped when this issue was rendered; its position here is a guess).
# The second argument is topk, the number of candidate fills to return.
roberta.fill_mask('Anything with a hint of swollen bud <mask> my prey.', topk=40)
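For reference, fill_mask returns its top-k candidates as (filled sentence, probability, token) tuples, so the ranked predictions can be inspected directly. A minimal sketch, assuming the model loaded above (tuple order as in the hub interface at the time of this issue; verify against your fairseq version):

for sentence, prob, token in roberta.fill_mask(
        'Anything with a hint of swollen bud <mask> my prey.', topk=5):
    print(f'{prob:.4f}\t{token}\t{sentence}')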

Sorry, could you provide more details? I'm confused about what the issue is.

Thanks so much for the follow-up. I re-implemented the mask_predict function with the Hugging Face release of RoBERTa, and what I observe is that the predictions of the two models are slightly different; the ones here look better. I wonder whether there were multiple releases of the model and Hugging Face is using an older one. I understand this is a weird problem, though.
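For comparison, here is a minimal sketch of the transformers side of such a re-implementation (not the exact code from this report), assuming the 'roberta-large' checkpoint on the Hugging Face hub and a transformers version recent enough that model outputs expose .logits (older versions return a tuple instead):

import torch
from transformers import RobertaTokenizer, RobertaForMaskedLM

tokenizer = RobertaTokenizer.from_pretrained('roberta-large')
model = RobertaForMaskedLM.from_pretrained('roberta-large')
model.eval()

inputs = tokenizer('Anything with a hint of swollen bud <mask> my prey.',
                   return_tensors='pt')
with torch.no_grad():
    logits = model(**inputs).logits

# Rank the top candidate tokens at the masked position.
mask_pos = (inputs['input_ids'][0] == tokenizer.mask_token_id).nonzero(as_tuple=True)[0]
top = logits[0, mask_pos].softmax(dim=-1).topk(40)
print([tokenizer.decode([i]).strip() for i in top.indices[0].tolist()])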

There was only one release of the model, so it's concerning if you're seeing differences. Can you share a bit more about your implementation? Can you confirm that the Tensor input fed to the model is the same in both cases?
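One quick sanity check along these lines, assuming the fairseq and transformers models from the snippets above are loaded: compare the token IDs the two libraries produce for the same input. (Plain text is used here because fairseq's fill_mask handles the literal '<mask>' tag specially rather than passing it through BPE.)

# The two ID sequences should match element for element.
fairseq_ids = roberta.encode('Hello world!')
hf_ids = tokenizer('Hello world!')['input_ids']
print(fairseq_ids.tolist())
print(hf_ids)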

Hi @myleott. I tracked down the difference and describe it over at transformers (https://github.com/huggingface/transformers/issues/1874#issuecomment-588359143), and I opened a PR to make the transformers RoBERTa identical to fairseq's implementation.

I am a bit confused about the fairseq implementation, though. Why is the weight of the LMHead the same as the weight of the token embeddings? What's the reasoning behind this?

I need to look at the code again, but I wonder if it is the same as the weight-tying trick in LMs.

Also, thanks so much for following up on this, guys.

I was wondering about that, too, but why would you tie the weights of the embeddings and those of the LMHead? What would be the reasoning behind that?

EDIT: it's been pointed out to me that this has been common practice for a while now. Cf. Using the Output Embedding to Improve Language Models (Press & Wolf, 2016).
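For illustration, a minimal sketch of the weight-tying trick in PyTorch (not fairseq's actual code; the dimensions are RoBERTa-large's):

import torch.nn as nn

vocab_size, hidden = 50265, 1024  # RoBERTa-large vocabulary and hidden size
embed = nn.Embedding(vocab_size, hidden)
lm_head = nn.Linear(hidden, vocab_size, bias=False)

# Tie the weights: both layers now share one Parameter, so gradients from
# the input embedding and the output projection update the same tensor.
lm_head.weight = embed.weight
assert lm_head.weight.data_ptr() == embed.weight.data_ptr()

Besides the regularization effect reported in that paper, tying also saves vocab_size × hidden parameters, which is substantial for large vocabularies.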
