Could you please release the training scripts of the paper "Understanding Back-Translation at Scale"? I Can only find the prediction scripts~ Thanks~~
I don't have a script to do the entire pipeline and even if I write one, it will be tailored to our environment. It's easier to train these models in multiple steps:
1) Train a reverse model using only the parallel data. For the paper we used the standard setup; you can find the details here: https://github.com/pytorch/fairseq/tree/master/examples/translation#replicating-results-from-scaling-neural-machine-translation — you only need to swap the language direction.
2) Translate your target-side monolingual data with the model you trained in step 1. I used interactive.py and manually sharded the data to run it in parallel on hundreds of GPUs. Something like this would work:
cat monolingual.de | python $BPEROOT/apply_bpe.py -c code | python interactive.py $DATA --path model.pt --buffer-size 1024 --sampling --beam 1 --nbest 1 --batch-size 16 | grep -P '^H' | cut -f3- | sed 's/@@\s*//g' > translation.en
3) Finally, I combined the available bitext and the generated data, preprocessed them with preprocess.py, and trained the final model. The training setup here is very similar to step 1, except I trained for more updates.
Hope that makes sense.
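For the manual sharding in step 2, here is a minimal sketch of how one might split the monolingual file into per-GPU shards before running interactive.py on each in parallel. `shard_file` is a hypothetical helper, not part of fairseq:

```python
import os

def shard_file(path, num_shards, out_dir):
    """Split a text corpus into line-interleaved shards shard.0 .. shard.(N-1),
    so each shard can be translated independently on its own GPU."""
    os.makedirs(out_dir, exist_ok=True)
    outs = [open(os.path.join(out_dir, f"shard.{i}"), "w")
            for i in range(num_shards)]
    with open(path) as f:
        for i, line in enumerate(f):
            # round-robin assignment keeps shard sizes balanced
            outs[i % num_shards].write(line)
    for o in outs:
        o.close()
```

Each shard then goes through the same apply_bpe.py | interactive.py pipeline above, and the resulting translations are concatenated in shard order.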
@edunov Thanks for the interesting paper and the documentation.
Regarding the settings at https://github.com/pytorch/fairseq/tree/master/examples/translation#replicating-results-from-scaling-neural-machine-translation — what hardware environment were you training on (how many GPUs, which GPU, and how much memory per GPU)? Knowing that will help us understand how to tune --max-tokens and the appropriate --update-freq.
Thanks in advance!
@edunov Thanks!
Is the command you wrote above the "unrestricted sampling" method from the paper? And could you please list the steps of the "beam+noise" method?
Five methods are presented in your paper, and sampling and beam+noise reached the best performance. The command you mentioned above can be used to reproduce sampling, but what about the other one?
@alvations the settings are very similar to those described in this paper: https://arxiv.org/abs/1806.00187. For the final model we used 128 Volta GPUs with 16GB of memory, --max-tokens 3584, and --update-freq 1.
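Since the effective batch size is roughly num_gpus × max-tokens × update-freq, you can approximate the 128-GPU setup on fewer GPUs by raising --update-freq to accumulate gradients. A small sketch of that arithmetic (`matching_update_freq` is my own helper name, not a fairseq option):

```python
def matching_update_freq(ref_gpus, ref_max_tokens, ref_update_freq,
                         my_gpus, my_max_tokens):
    """Pick an --update-freq so that my_gpus * my_max_tokens * update_freq
    roughly matches the reference setup's tokens per optimizer update."""
    target_tokens = ref_gpus * ref_max_tokens * ref_update_freq
    return max(1, round(target_tokens / (my_gpus * my_max_tokens)))

# Matching the paper's 128-GPU run on 8 GPUs with the same --max-tokens 3584:
print(matching_update_freq(128, 3584, 1, 8, 3584))  # -> 16
```

Note this only matches the batch size; wall-clock time and learning-rate warmup behavior may still need adjusting.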
@maydaygmail @KelleyYin the noising script I used: https://gist.github.com/edunov/d67d09a38e75409b8408ed86489645dd — it's a bit raw; I should probably clean it up and make it available in the repository.
It works on top of beam search, so you'll do steps 1 and 3 as usual, but in step 2 you need to generate with beam 5:
cat monolingual.de | python $BPEROOT/apply_bpe.py -c code | python interactive.py $DATA --path model.pt --buffer-size 1024 --beam 5 --batch-size 16 | grep -P '^H' | cut -f3- | sed 's/@@\s*//g' | python addnoise.py > translation.en
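Per the paper, the noise applied on top of beam output drops words, replaces words with a filler token, and lightly shuffles word order. A rough reimplementation sketch, not the official gist — `add_noise`, the token name, and the default probabilities are my assumptions based on the paper's description:

```python
import random

def add_noise(line, drop_prob=0.1, blank_prob=0.1, max_shuffle_distance=3,
              blank_token="<BLANK>", rng=random):
    """Apply beam+noise style corruption to one sentence:
    1) delete each word with drop_prob,
    2) replace each surviving word with a filler token with blank_prob,
    3) shuffle by sorting on (index + uniform noise), so no word moves
       more than ~max_shuffle_distance positions."""
    words = line.split()
    words = [w for w in words if rng.random() >= drop_prob]
    words = [blank_token if rng.random() < blank_prob else w for w in words]
    keys = [i + rng.uniform(0, max_shuffle_distance)
            for i in range(len(words))]
    words = [w for _, w in sorted(zip(keys, words), key=lambda t: t[0])]
    return " ".join(words)
```

You could pipe beam output through a script that applies this line by line, in place of addnoise.py above.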
@edunov Thanks a lot for your reply.
@edunov Thanks~~
@edunov just to confirm, did you use the fconv_de_en architecture to train the model that back translates monolingual data using unrestricted sampling?
@ykl7, no, I used transformer_wmt_en_de_big