Could you please release the training scripts of the paper "Understanding Back-Translation at Scale"? I Can only find the prediction scripts~ Thanks~~
I don't have a script to do the entire pipeline and even if I write one, it will be tailored to our environment. It's easier to train these models in multiple steps:
1) Train a reverse model using only the parallel data. For the paper we used the standard setup; you can find the details here: https://github.com/pytorch/fairseq/tree/master/examples/translation#replicating-results-from-scaling-neural-machine-translation — you only need to swap the language direction.
2) Translate your target-side monolingual data with the model you trained in step 1. I used interactive.py and manually sharded the data to run it in parallel on hundreds of GPUs. Something like this would work:
cat monolingual.de | python $BPEROOT/apply_bpe.py -c code | python interactive.py $DATA --path model.pt --buffer-size 1024 --sampling --beam 1 --nbest 1 --batch-size 16 | grep -P '^H' | cut -f3- | sed 's/@@\s*//g' > translation.en
3) Finally, I combined the available bitext and the generated data, preprocessed them with preprocess.py, and trained the final model. The training setup here is very similar to step 1, except I trained for more updates.
Hope that makes sense.
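For the manual sharding in step 2, here is a minimal sketch of how one might split the monolingual file into per-GPU shards before running interactive.py on each in parallel. `shard_file` is a hypothetical helper, not part of fairseq:

```python
import os

def shard_file(path, num_shards, out_dir):
    """Split a text corpus into line-interleaved shards shard.0 .. shard.(N-1),
    so each shard can be translated independently on its own GPU."""
    os.makedirs(out_dir, exist_ok=True)
    outs = [open(os.path.join(out_dir, f"shard.{i}"), "w")
            for i in range(num_shards)]
    with open(path) as f:
        for i, line in enumerate(f):
            # round-robin assignment keeps shard sizes balanced
            outs[i % num_shards].write(line)
    for o in outs:
        o.close()
```

Each shard then goes through the same apply_bpe.py | interactive.py pipeline above, and the resulting translations are concatenated in shard order.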
@edunov Thanks for the interesting paper and the documentation.
Regarding the settings at https://github.com/pytorch/fairseq/tree/master/examples/translation#replicating-results-from-scaling-neural-machine-translation — what hardware environment were you training on (how many GPUs, which GPU, and how much memory per GPU)? Knowing that will help us understand how to tune --max-tokens and the appropriate --update-freq.
Thanks in advance!
@edunov Thanks!
Is the command you wrote above the "unrestricted sampling" method from the paper? And could you please list the steps of the "beam+noise" method?
Five methods are presented in your paper, and sampling and beam+noise reached the best performance. The command you mentioned above can be used to reproduce sampling, but what about the other one?
@alvations the settings are very similar to those described in this paper: https://arxiv.org/abs/1806.00187. For the final model we used 128 Volta GPUs with 16GB of memory, --max-tokens 3584, and --update-freq 1.
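Since the effective batch size is roughly num_gpus × max-tokens × update-freq, you can approximate the 128-GPU setup on fewer GPUs by raising --update-freq to accumulate gradients. A small sketch of that arithmetic (`matching_update_freq` is my own helper name, not a fairseq option):

```python
def matching_update_freq(ref_gpus, ref_max_tokens, ref_update_freq,
                         my_gpus, my_max_tokens):
    """Pick an --update-freq so that my_gpus * my_max_tokens * update_freq
    roughly matches the reference setup's tokens per optimizer update."""
    target_tokens = ref_gpus * ref_max_tokens * ref_update_freq
    return max(1, round(target_tokens / (my_gpus * my_max_tokens)))

# Matching the paper's 128-GPU run on 8 GPUs with the same --max-tokens 3584:
print(matching_update_freq(128, 3584, 1, 8, 3584))  # -> 16
```

Note this only matches the batch size; wall-clock time and learning-rate warmup behavior may still need adjusting.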
@maydaygmail @KelleyYin the noising script I used: https://gist.github.com/edunov/d67d09a38e75409b8408ed86489645dd — it's a bit raw; I should probably clean it up and make it available in the repository.
It works on top of beam search, so you'll do steps 1 and 3 as usual, but in step 2 you need to generate with beam 5:
cat monolingual.de | python $BPEROOT/apply_bpe.py -c code | python interactive.py $DATA --path model.pt --buffer-size 1024 --beam 5 --batch-size 16 | grep -P '^H' | cut -f3- | sed 's/@@\s*//g' | python addnoise.py > translation.en
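Per the paper, the noise applied on top of beam output drops words, replaces words with a filler token, and lightly shuffles word order. A rough reimplementation sketch, not the official gist — `add_noise`, the token name, and the default probabilities are my assumptions based on the paper's description:

```python
import random

def add_noise(line, drop_prob=0.1, blank_prob=0.1, max_shuffle_distance=3,
              blank_token="<BLANK>", rng=random):
    """Apply beam+noise style corruption to one sentence:
    1) delete each word with drop_prob,
    2) replace each surviving word with a filler token with blank_prob,
    3) shuffle by sorting on (index + uniform noise), so no word moves
       more than ~max_shuffle_distance positions."""
    words = line.split()
    words = [w for w in words if rng.random() >= drop_prob]
    words = [blank_token if rng.random() < blank_prob else w for w in words]
    keys = [i + rng.uniform(0, max_shuffle_distance)
            for i in range(len(words))]
    words = [w for _, w in sorted(zip(keys, words), key=lambda t: t[0])]
    return " ".join(words)
```

You could pipe beam output through a script that applies this line by line, in place of addnoise.py above.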
@edunov Thanks a lot for your reply.
@edunov Thanks~~
@edunov just to confirm, did you use the fconv_de_en architecture to train the model that back translates monolingual data using unrestricted sampling?
@ykl7, no, I used transformer_wmt_en_de_big