Fairseq: Facebook FAIR's WMT19 News Translation Task Submission

Created on 29 Jul 2019 · 8 Comments · Source: pytorch/fairseq

Hello,
Would you by chance share the hyper-parameters and training logs of these results?

En→Ru

| System | news2017 | news2018 |
|---|---|---|
| baseline | 35.42 | 31.53 |
| + langid filtering | 35.69 | 31.77 |

thanks.

All 8 comments

Also, the paper says sacreBLEU is used throughout, but sacreBLEU usually prints only one decimal place.

Hi,
Here is the command:

```shell
python train.py /private/home/edunov/wmt19/data/wmt19_en_ru_sep_langid/processed \
    -a transformer_wmt_en_de_big --clip-norm 0 --share-decoder-input-output-embed \
    --fp16 --optimizer adam --lr 0.0007 --source-lang en --target-lang ru \
    --label-smoothing 0.1 --dropout 0.2 --max-tokens 3584 --no-progress-bar \
    --log-interval 100 --seed 1 --min-lr '1e-09' --lr-scheduler inverse_sqrt \
    --weight-decay 0.0 --criterion label_smoothed_cross_entropy --max-update 100000 \
    --encoder-ffn-embed-dim 8192 --warmup-updates 4000 --warmup-init-lr '1e-07' \
    --adam-betas '(0.9, 0.98)' --distributed-port 12597 --distributed-world-size 128
```
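For readers unfamiliar with the `inverse_sqrt` scheduler referenced in the command, here is a minimal sketch of how that schedule is commonly defined (linear warmup to the peak rate, then decay proportional to the inverse square root of the step count), using the values from the command above. This is an illustration, not fairseq's exact code:

```python
import math

def inverse_sqrt_lr(step, peak_lr=7e-4, warmup_updates=4000, warmup_init_lr=1e-7):
    """Sketch of an inverse-sqrt LR schedule with linear warmup.

    Warms up linearly from warmup_init_lr to peak_lr over warmup_updates,
    then decays proportionally to 1/sqrt(step).
    """
    if step < warmup_updates:
        # Linear warmup phase.
        return warmup_init_lr + step * (peak_lr - warmup_init_lr) / warmup_updates
    # Inverse-sqrt decay phase: equals peak_lr exactly at step == warmup_updates.
    return peak_lr * math.sqrt(warmup_updates) / math.sqrt(step)

print(inverse_sqrt_lr(0))       # warmup_init_lr
print(inverse_sqrt_lr(4000))    # peak_lr
print(inverse_sqrt_lr(16000))   # peak_lr / 2
```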

And here is the training log: https://gist.github.com/edunov/70f392ef79c70701cade1c335dcdbadf

Regarding the second question about decimals: it varies across sacrebleu versions; the one we used (1.2.11) shows two decimal places.

Do you confirm that you used the exact same command line with and without back-translation?

No, for training with back-translation we used more updates (200k instead of 100k); we also used an --upsample-primary ratio that we fine-tuned on the dev sets.
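For intuition about what upsampling the primary data means: the genuine bitext is repeated some number of times relative to the back-translated data before shuffling, so the model sees it more often. A minimal sketch, analogous in spirit to fairseq's --upsample-primary but not its actual implementation:

```python
import random

def mix_datasets(bitext, backtranslated, upsample_primary=2, seed=1):
    """Sketch: upsample the primary (genuine) bitext relative to
    back-translated data, then shuffle the combined corpus.

    `upsample_primary` repeats the bitext that many times; the ratio
    would be tuned on dev sets, as described above.
    """
    combined = bitext * upsample_primary + backtranslated
    random.Random(seed).shuffle(combined)
    return combined

mixed = mix_datasets(["bi1", "bi2"], ["bt1", "bt2", "bt3"], upsample_primary=2)
print(len(mixed))  # 7: the two bitext pairs twice, plus three BT pairs
```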

Okay, but what about the learning rate, warm-up steps, and more specifically the dropout?
In the paper you refer to the previous paper, which said "same as Vaswani", but for the big model the dropout used to be 0.3.
I was wondering whether, with more data (e.g. from back-translation), you reduced the dropout even further.

OMG~ 128 GPUs 😂 I have the same question as @vince62s: why set the dropout rate to 0.2 rather than the 0.3 used in Vaswani's paper? @edunov

@vince62s @alphadl Dropout is one of the parameters most sensitive to dataset size; we always sweep over the most common values (0.1, 0.2, 0.3) and pick the one that works best on dev.
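The sweep described above amounts to training one model per dropout value and keeping the one with the best dev score. A minimal sketch, where `train_and_eval` is a hypothetical callable standing in for a full fairseq training run plus dev-set evaluation:

```python
def sweep_dropout(train_and_eval, values=(0.1, 0.2, 0.3)):
    """Sketch of the hyper-parameter sweep described above.

    `train_and_eval` is a hypothetical function mapping a dropout value
    to a dev-set BLEU score (in practice, one training run per value).
    Returns the dropout value with the highest dev BLEU.
    """
    return max(values, key=train_and_eval)

# Illustration with made-up dev BLEU scores:
fake_dev_bleu = {0.1: 30.2, 0.2: 31.5, 0.3: 31.1}
print(sweep_dropout(fake_dev_bleu.get))  # 0.2
```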

@edunov
I got another question on the same paper, same language pair.
In Table 2, at the top of page 3, it says NT17: 36.77 and NT18: 34.72.
That result seems to include back-translated data.
How does this compare to Table 7 on page 5, on the "BT NewsCrawl" line, where you get NT17 40.09 and NT18 37.07?
(where does the gain come from?)
Cheers.
