Fairseq: How to reproduce the WMT14 en-de result of the transformer BASE model?

Created on 4 Nov 2018 · 20 comments · Source: pytorch/fairseq

Hi

I want to replicate the WMT14 en-de translation result of the transformer BASE model from the paper "Attention Is All You Need". Following the instructions here, I downloaded and preprocessed the data. Then I trained the model with this:

CUDA_VISIBLE_DEVICES=0,1,2,3 python train.py data-bin/wmt16_en_de_bpe32k \
    --arch transformer_wmt_en_de --share-all-embeddings \
    --optimizer adam --adam-betas '(0.9, 0.98)' --clip-norm 0.0 \
    --lr-scheduler inverse_sqrt --warmup-init-lr 1e-07 --warmup-updates 4000 \
    --lr 0.0005 --min-lr 1e-09 \
    --criterion label_smoothed_cross_entropy --label-smoothing 0.1 --weight-decay 0.0 \
    --max-tokens 4096 --save-dir checkpoints/en-de \
    --update-freq 2 --no-progress-bar --log-format json --log-interval 50 \
    --save-interval-updates 1000 --keep-interval-updates 20

I averaged the last 5 checkpoints and generated the translations with this:
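(For reference, the averaging step can be done with fairseq's bundled scripts/average_checkpoints.py; the exact flags below are a sketch based on that script, not taken from the thread, and the paths are placeholders:)

    # average the 5 most recent --save-interval-updates checkpoints into model.pt
    python scripts/average_checkpoints.py \
        --inputs checkpoints/en-de \
        --num-update-checkpoints 5 \
        --output checkpoints/en-de/model.pt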

model=model.pt
subset="test"

CUDA_VISIBLE_DEVICES=0 python generate.py data-bin/wmt16_en_de_bpe32k \
    --path checkpoints/$model --gen-subset $subset \
    --beam 4 --batch-size 128 --remove-bpe --lenpen 0.6

However, after about 120k updates, I got:
| Generate test with beam=4: BLEU4 = 26.38, 57.8/32.0/20.0/13.1 (BP=1.000, ratio=1.020, syslen=64352, reflen=63078)

After about 250k updates, I got:
| Generate test with beam=4: BLEU4 = 26.39, 57.8/32.0/20.0/13.1 (BP=1.000, ratio=1.017, syslen=64123, reflen=63078)

This is far from the result in "Attention Is All You Need" (27.3). Can you think of any reasons for this?
Thanks a lot!

Most helpful comment

Can you share the training log?

A couple other things to note:

All 20 comments

Please try LR 0.0007 for the base model to match Vaswani et al. We parameterize the LR differently from Vaswani et al.: in particular, they adjust the learning rate automatically based on embed_dim, whereas we require the peak LR to be given explicitly.
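(Editorial sanity check on that number: Vaswani et al.'s schedule, lr = d_model^-0.5 * min(step^-0.5, step * warmup^-1.5), peaks at step = warmup, i.e. at d_model^-0.5 * warmup^-0.5. With d_model = 512 and 4000 warmup steps:)

    python -c "print((512 * 4000) ** -0.5)"   # ~0.000699, i.e. the suggested peak LR of 0.0007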

Thanks for your help!
I have tried this new LR and after about 260k updates, I got:
Generate test with beam=4: BLEU4 = 27.11, 58.3/32.8/20.7/13.6 (BP=1.000, ratio=1.015, syslen=64048, reflen=63078)
Are there any other changes that could be made to further improve the translation performance?

Great! The last step to reproduce results from Vaswani et al. is to split compound words. This step gives a moderate increase in BLEU but is somewhat hacky. In general it's preferable to report detokenized BLEU via tools like sacrebleu, although detok. BLEU is usually lower than tokenized BLEU. See this paper: https://arxiv.org/abs/1804.08771

Here is the script: https://github.com/tensorflow/tensor2tensor/blob/master/tensor2tensor/utils/get_ende_bleu.sh
The compound splitting is near the bottom of the script.
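(The compound-splitting step in that script is roughly the following, applied to both the hypothesis and the tokenized reference before multi-bleu.perl; the file names here are placeholders:)

    # rewrite hyphenated compounds into the ##AT## format used by GNMT-style evals
    perl -ple 's{(\S)-(\S)}{$1 ##AT##-##AT## $2}g' < gen.out.sys > gen.out.sys.atat
    perl -ple 's{(\S)-(\S)}{$1 ##AT##-##AT## $2}g' < gen.out.ref > gen.out.ref.atat
    perl mosesdecoder/scripts/generic/multi-bleu.perl gen.out.ref.atat < gen.out.sys.atat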

That's so interesting!
After using this script, I got:
BLEU = 27.70, 58.9/33.4/21.2/14.1 (BP=1.000, ratio=1.015, hyp_len=65442, ref_len=64496)
Meanwhile, I find that the averaged model at about 180k updates had already achieved:
BLEU = 27.37, 58.6/33.0/21.0/13.8 (BP=1.000, ratio=1.016, hyp_len=65500, ref_len=64496)
Thanks again for your help! 👍

Hi, I am trying to replicate the same result, @myleott, but I am using a single-GPU setup. What are the ideal values for update-freq, dropout, and max-tokens? Initially I plan on using update-freq = 16, no dropout, and max-tokens = 4096. Is this a good idea for a single GPU with the transformer base setting? Please suggest.

To match the original Vaswani paper, with 1 GPU, you should use --update-freq 8 --max-tokens 4096 --dropout 0.1.

Since you have more GPU memory available, you can probably improve training speed by increasing --max-tokens and decreasing --update-freq. Just try to keep the reported words-per-batch (wpb in the training log) to around 25k.
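(Combining that advice with the training command from the top of the thread, a single-GPU sketch would look like this; every flag is either from the original command or from the comment above:)

    CUDA_VISIBLE_DEVICES=0 python train.py data-bin/wmt16_en_de_bpe32k \
        --arch transformer_wmt_en_de --share-all-embeddings \
        --optimizer adam --adam-betas '(0.9, 0.98)' --clip-norm 0.0 \
        --lr-scheduler inverse_sqrt --warmup-init-lr 1e-07 --warmup-updates 4000 \
        --lr 0.0007 --min-lr 1e-09 --dropout 0.1 \
        --criterion label_smoothed_cross_entropy --label-smoothing 0.1 \
        --max-tokens 4096 --update-freq 8 --save-dir checkpoints/en-de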

Thanks much!!

Hi @myleott, I followed this command to generate translations:
python generate.py data-bin/wmt16_en_de_bpe32k/ --path checkpoints/en-de/checkpoint_best.pt --beam 4 --remove-bpe | tee /tmp/gen.out

and then

$ grep ^H /tmp/gen.out | cut -f3- > /tmp/gen.out.sys
$ grep ^T /tmp/gen.out | cut -f2- > /tmp/gen.out.ref

$ python score.py --sys /tmp/gen.out.sys --ref /tmp/gen.out.ref

I downloaded the data from https://github.com/pytorch/fairseq/tree/master/examples/translation#replicating-results-from-scaling-neural-machine-translation and preprocessed it the same way.
I got BLEU4 = 24.66, but when I looked at the translations I saw tokens like &quot; and &apos;, whereas in the reference file they are proper quotation marks and apostrophes. So I think that's the reason why I got a lower BLEU score. Did I miss something? How do I test and score properly so that I can report the BLEU score mentioned above?
Thank you.

Can you share the training log?

A couple other things to note:
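(An editorial aside on the &quot;/&apos; issue raised above: these are HTML-entity escapes produced by the Moses tokenizer, and the Moses scripts include a de-escaping step that can be applied to the system output before scoring, e.g.:)

    # undo Moses special-character escaping (&quot; -> ", &apos; -> ', ...)
    perl mosesdecoder/scripts/tokenizer/deescape-special-chars.perl \
        < /tmp/gen.out.sys > /tmp/gen.out.sys.deesc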

Thanks a lot for the insights, I will implement them.

Hi @myleott
I have trained an en-de transformer_base model and achieved a good BLEU score, but during inference there were some <unk> tokens, many of them around hyphens (-), as in "<unk> -@ ", for some raw-text inputs.

According to this https://github.com/pytorch/fairseq/tree/master/examples/translation#replicating-results-from-scaling-neural-machine-translation, I downloaded the data (train, valid, test) from the Google Drive link, preprocessed it, and trained on it.

But during inference I pass raw text through: Moses tokenizer --> apply_bpe (subword-nmt) --> model --> de-BPE (subword-nmt) --> Moses detokenizer.

Is this pipeline different from the one used for the data provided by Google? If so, please state how the pre- and post-processing is done so that unknown tokens do not occur during interactive inference.
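(A runnable sketch of that pipeline, with placeholder paths for the Moses scripts and the BPE codes file:)

    cat input.en \
      | perl mosesdecoder/scripts/tokenizer/tokenizer.perl -l en \
      | subword-nmt apply-bpe -c bpecodes.32k \
      | python interactive.py data-bin/wmt16_en_de_bpe32k \
            --path checkpoints/en-de/checkpoint_best.pt --beam 4 --remove-bpe \
      | grep ^H | cut -f3- \
      | perl mosesdecoder/scripts/tokenizer/detokenizer.perl -l de > output.de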

Re: pre/post-processing of the Google data, I don't know unfortunately, since they did not share this. If it's important to be able to reverse the tokenization, you can instead follow the preprocessing here (adding --joined-dictionary): https://github.com/pytorch/fairseq/tree/master/examples/translation#prepare-wmt14en2desh. This will download the raw WMT'17 training data and preprocess it using the Moses tokenizer. It won't be directly comparable to the WMT'16 data that Google uses, but it typically results in slightly better BLEU.
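(Roughly, the flow from those linked instructions looks like this; the wmt17_en_de output directory is my reading of what the prepare script produces, so treat the paths as assumptions:)

    cd examples/translation/
    bash prepare-wmt14en2de.sh
    cd ../..
    TEXT=examples/translation/wmt17_en_de
    python preprocess.py --source-lang en --target-lang de \
        --trainpref $TEXT/train --validpref $TEXT/valid --testpref $TEXT/test \
        --destdir data-bin/wmt17_en_de --joined-dictionary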

Re: unknown words, this can always happen. You can penalize this using the --unkpen option to generate.py.

While using the Google data, if tokenisation during inference is done using Moses (without aggressive hyphen splitting) followed by apply_bpe.py, the issue doesn't occur. I think Google did the same when tokenising, i.e., Moses without aggressive hyphen splitting followed by apply_bpe.py.

Hi, I am wondering: if I don't set max-update or max-epoch, when will training stop? Since I have already got the result reported in the paper, I am not sure when I should stop, and will the result get better if training continues? @myleott

The automatic stopping conditions are not very robust; they depend on either the learning rate schedule or the number of updates/epochs. Usually it's best to pick a stopping condition yourself, based on when the validation loss plateaus.
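(Concretely, one way to do that is to cap training explicitly; --max-update is a standard fairseq flag, appended here to the training command from earlier in the thread, and the 300000 is an illustrative value, not a recommendation from the thread:)

    # stop after a fixed number of updates rather than relying on the LR schedule
    python train.py data-bin/wmt16_en_de_bpe32k ... --max-update 300000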

Hey @myleott, sorry for bothering you again. I now have 8 V100s with 32 GB RAM each, and I want to make the best use of that memory to reproduce the Transformer base model on the WMT16 EN-DE dataset, so I am trying to increase the mini-batch size. Could you please give some example command lines from your big-batch experiments? The paper "Scaling Neural Machine Translation" doesn't seem to give a configuration for the Transformer base model. And can you share any tricks to make training as fast as possible? Thanks!
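(There was no direct reply in the thread; the following is an editorial sketch only, extrapolating from the ~25k words-per-batch guidance above: on 8 GPUs, --max-tokens 4096 with --update-freq 1 already yields roughly 8 x 4096 ≈ 32k wpb, and 32 GB cards leave headroom for --fp16, which fairseq supports:)

    CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python train.py data-bin/wmt16_en_de_bpe32k \
        --arch transformer_wmt_en_de --share-all-embeddings --fp16 \
        --optimizer adam --adam-betas '(0.9, 0.98)' --clip-norm 0.0 \
        --lr-scheduler inverse_sqrt --warmup-init-lr 1e-07 --warmup-updates 4000 \
        --lr 0.0007 --min-lr 1e-09 --dropout 0.1 \
        --criterion label_smoothed_cross_entropy --label-smoothing 0.1 \
        --max-tokens 4096 --update-freq 1 --save-dir checkpoints/en-de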

@Raghava14, just curious if you were able to reproduce the results on a single GPU using the comments given by @myleott on 1/9. Thank you.

Hi @vman049, yes, with the comments given by @myleott I could reach a decent BLEU score of 26.5 on transformer_base.

@Raghava14, how did you preprocess the WMT14 data? As far as I can see, there are two different ways to do it:
one is the way described in the sample script here for en-fr, and the other is here, where --joined-dictionary and some other additional arguments are used.

Hi @ereday, sorry for the late reply. I used --joined-dictionary and did not limit the vocabulary size via --nwordssrc or --nwordstgt. The exact command I used is:

python preprocess.py --source-lang en --target-lang de \
    --trainpref $TEXT/train --validpref $TEXT/valid --testpref $TEXT/test \
    --destdir data-bin/wmt14_en_de --thresholdtgt 0 --thresholdsrc 0 \
    --joined-dictionary --workers 16

Hope it is helpful.
