Transformers: BERT model for Machine Translation

Created on 18 Nov 2018 · 12 comments · Source: huggingface/transformers

Is there a way to use any of the provided pre-trained models in the repository for machine translation task?

Thanks

All 12 comments

Hi Kerem, I don't think so. Have a look at the fairseq repo, maybe.

@thomwolf Hi there, I couldn't find anything about the fairseq repo. Could you post a link? Thanks!

Hi, I am talking about this repo: https://github.com/pytorch/fairseq.
Have a look at their Transformer models for machine translation.
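
For reference, fairseq's pre-trained Transformer translation models can be loaded through torch.hub. A minimal sketch, assuming the WMT'19 English-German single-model checkpoint is available and the sacremoses / fastBPE dependencies are installed:

```python
import torch

# Download and load a pre-trained Transformer NMT model from the fairseq hub.
en2de = torch.hub.load(
    'pytorch/fairseq',
    'transformer.wmt19.en-de.single_model',
    tokenizer='moses',
    bpe='fastbpe',
)
en2de.eval()

print(en2de.translate('Machine translation with Transformers is fun!'))
```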

I have conducted several MT experiments that used fixed BERT embeddings; unfortunately, I found it made performance worse. @JasonVann @thomwolf
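
For anyone trying to reproduce this kind of setup, the feature-extraction side usually looks roughly like the sketch below, written against a recent version of the transformers API; the downstream NMT encoder/decoder and training loop are assumed to exist elsewhere, and the checkpoint name is just an example:

```python
import torch
from transformers import BertModel, BertTokenizer

# Frozen BERT used purely as a feature extractor for the source sentences.
tokenizer = BertTokenizer.from_pretrained('bert-base-multilingual-cased')
bert = BertModel.from_pretrained('bert-base-multilingual-cased')
bert.eval()
for p in bert.parameters():
    p.requires_grad = False   # "fixed" embeddings: BERT is not fine-tuned

def encode_source(sentences):
    batch = tokenizer(sentences, padding=True, return_tensors='pt')
    with torch.no_grad():
        out = bert(**batch)
    # (batch, src_len, 768) contextual embeddings to feed into the NMT model,
    # plus the attention mask so padding can be ignored downstream.
    return out.last_hidden_state, batch['attention_mask']
```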

Hey!

FAIR has demonstrated that BERT-style (MLM) pre-training greatly improves BLEU for unsupervised translation.

Paper: https://arxiv.org/abs/1901.07291

Repo: https://github.com/facebookresearch/XLM

An older paper showing that pre-training with a standard LM (not MLM) helps Seq2Seq: https://arxiv.org/abs/1611.02683

Hope this helps!

These links are useful.

Does anyone know if BERT also improves results for supervised translation?

Thanks.

Does anyone know if BERT also improves results for supervised translation?

Also interested

Because BERT is an encoder, I guess we need a decoder. I looked here: https://jalammar.github.io/
and it seems the OpenAI Transformer is a decoder, but I cannot find a repo for it.
There is also this tutorial: https://www.tensorflow.org/alpha/tutorials/text/transformer
I think BERT outputs a vector of size 768 per token. Can we just do a reshape and use the decoder from that Transformer notebook? In general, can I just reshape and try out a bunch of decoders?
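
To make the shapes concrete: BERT produces one 768-dimensional vector per source token, so rather than a reshape you would typically learn a linear projection from 768 down to the decoder's model dimension and feed the projected sequence to a standard Transformer decoder as its memory. A rough sketch with PyTorch's built-in decoder modules (the dimensions, the omitted positional encodings, and the surrounding training code are all assumptions, not a tested recipe):

```python
import torch
import torch.nn as nn

d_bert, d_model, vocab_size = 768, 512, 32000   # assumed sizes

proj = nn.Linear(d_bert, d_model)               # map BERT features to decoder width
decoder_layer = nn.TransformerDecoderLayer(d_model=d_model, nhead=8)
decoder = nn.TransformerDecoder(decoder_layer, num_layers=6)
tgt_embed = nn.Embedding(vocab_size, d_model)   # target embeddings, trained from scratch
generator = nn.Linear(d_model, vocab_size)      # decoder states -> target vocabulary logits

def decode(bert_hidden_states, tgt_tokens):
    """bert_hidden_states: (batch, src_len, 768); tgt_tokens: (batch, tgt_len)."""
    memory = proj(bert_hidden_states).transpose(0, 1)   # (src_len, batch, d_model)
    tgt = tgt_embed(tgt_tokens).transpose(0, 1)         # (tgt_len, batch, d_model)
    # (target positional encodings omitted for brevity)
    tgt_len = tgt.size(0)
    causal_mask = torch.triu(                            # block attention to future tokens
        torch.full((tgt_len, tgt_len), float('-inf')), diagonal=1)
    out = decoder(tgt, memory, tgt_mask=causal_mask)
    return generator(out.transpose(0, 1))               # (batch, tgt_len, vocab_size)
```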

These links are useful.

Does anyone know if BERT also improves results for supervised translation?

Thanks.

https://arxiv.org/pdf/1901.07291.pdf seems to suggest that it does improve the results for supervised translation as well. However, this paper is not about using BERT embeddings; rather, it is about pre-training the encoder and decoder on a Masked Language Modelling objective. The biggest benefit comes from initializing the encoder with the weights from BERT. Surprisingly, using those weights to initialize the decoder also brings small benefits, even though, if I understand correctly, you still have to randomly initialize the weights of the encoder-decoder (cross-)attention module, since it is not present in the pre-trained network.

EDIT: of course the pre-trained network needs to have been trained on multi-lingual data, as stated in the paper
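
As a side note, later versions of the transformers library added an EncoderDecoderModel that performs exactly this kind of warm start: both encoder and decoder are initialized from BERT, while the decoder's cross-attention weights (which have no counterpart in the pre-trained network) are newly initialized. A sketch, assuming a multilingual BERT checkpoint:

```python
from transformers import BertTokenizer, EncoderDecoderModel

tokenizer = BertTokenizer.from_pretrained('bert-base-multilingual-cased')

# Warm-start both sides from multilingual BERT; the decoder's cross-attention
# layers are randomly initialized and must be learned during MT fine-tuning.
model = EncoderDecoderModel.from_encoder_decoder_pretrained(
    'bert-base-multilingual-cased',
    'bert-base-multilingual-cased',
)
model.config.decoder_start_token_id = tokenizer.cls_token_id
model.config.pad_token_id = tokenizer.pad_token_id
```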

I have managed to replace the Transformer's encoder with a pre-trained BERT encoder; however, the experimental results were very poor. It dropped the BLEU score by about 4 points.

The source code is available here: https://github.com/torshie/bert-nmt , implemented as a fairseq user model. It may not work out of the box; some minor tweaks may be needed.

Also have a look at MASS and XLM.
