Fairseq: Facebook FAIR's WMT19 News Translation Task Submission

Created on 29 Jul 2019 · 8 Comments · Source: pytorch/fairseq

Hello,
Would you by chance share the hyper-parameters and training logs of these results?

En→Ru

| System | news2017 | news2018 |
|---|---|---|
| baseline | 35.42 | 31.53 |
| + langid filtering | 35.69 | 31.77 |

thanks.

All 8 comments

Also, the paper says sacreBLEU is used throughout, but sacreBLEU usually prints only one decimal place.

Hi,
Here is the command:

```shell
python train.py /private/home/edunov/wmt19/data/wmt19_en_ru_sep_langid/processed \
    -a transformer_wmt_en_de_big --clip-norm 0 --share-decoder-input-output-embed \
    --fp16 --optimizer adam --lr 0.0007 --source-lang en --target-lang ru \
    --label-smoothing 0.1 --dropout 0.2 --max-tokens 3584 --no-progress-bar \
    --log-interval 100 --seed 1 --min-lr '1e-09' --lr-scheduler inverse_sqrt \
    --weight-decay 0.0 --criterion label_smoothed_cross_entropy --max-update 100000 \
    --encoder-ffn-embed-dim 8192 --warmup-updates 4000 --warmup-init-lr '1e-07' \
    --adam-betas '(0.9, 0.98)' --distributed-port 12597 --distributed-world-size 128
```
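For readers unfamiliar with the `inverse_sqrt` scheduler referenced in the command, here is a minimal sketch of how that schedule is commonly defined (linear warmup to the peak rate, then decay proportional to the inverse square root of the step count), using the values from the command above. This is an illustration, not fairseq's exact code:

```python
import math

def inverse_sqrt_lr(step, peak_lr=7e-4, warmup_updates=4000, warmup_init_lr=1e-7):
    """Sketch of an inverse-sqrt LR schedule with linear warmup.

    Warms up linearly from warmup_init_lr to peak_lr over warmup_updates,
    then decays proportionally to 1/sqrt(step).
    """
    if step < warmup_updates:
        # Linear warmup phase.
        return warmup_init_lr + step * (peak_lr - warmup_init_lr) / warmup_updates
    # Inverse-sqrt decay phase: equals peak_lr exactly at step == warmup_updates.
    return peak_lr * math.sqrt(warmup_updates) / math.sqrt(step)

print(inverse_sqrt_lr(0))       # warmup_init_lr
print(inverse_sqrt_lr(4000))    # peak_lr
print(inverse_sqrt_lr(16000))   # peak_lr / 2
```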

And here is the training log: https://gist.github.com/edunov/70f392ef79c70701cade1c335dcdbadf

Regarding the second question about decimals: it varies across sacrebleu versions; the one we used (1.2.11) shows two decimal places.

Do you confirm that you used the exact same command line with and without back-translation?

No, for training with back-translation we used more updates (200k instead of 100k); we also used an --upsample-primary ratio that we fine-tuned on the dev sets.
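For intuition about what upsampling the primary data means: the genuine bitext is repeated some number of times relative to the back-translated data before shuffling, so the model sees it more often. A minimal sketch, analogous in spirit to fairseq's --upsample-primary but not its actual implementation:

```python
import random

def mix_datasets(bitext, backtranslated, upsample_primary=2, seed=1):
    """Sketch: upsample the primary (genuine) bitext relative to
    back-translated data, then shuffle the combined corpus.

    `upsample_primary` repeats the bitext that many times; the ratio
    would be tuned on dev sets, as described above.
    """
    combined = bitext * upsample_primary + backtranslated
    random.Random(seed).shuffle(combined)
    return combined

mixed = mix_datasets(["bi1", "bi2"], ["bt1", "bt2", "bt3"], upsample_primary=2)
print(len(mixed))  # 7: the two bitext pairs twice, plus three BT pairs
```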

Okay, but what about the learning rate, warm-up steps, and more specifically the dropout?
In the paper you refer to the previous paper, which said "same as Vaswani", but for the big model the dropout used to be 0.3.
I was wondering whether, with more data (e.g. from back-translation), you reduced the dropout even further.

OMG~ 128 GPUs 😂 I have the same question as @vince62s: why set the dropout rate to 0.2 rather than the 0.3 used in Vaswani's paper? @edunov

@vince62s @alphadl Dropout is one of the parameters most sensitive to dataset size; we always sweep over the most common values (0.1, 0.2, 0.3) and pick the one that works best on dev.
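The sweep described above amounts to training one model per dropout value and keeping the one with the best dev score. A minimal sketch, where `train_and_eval` is a hypothetical callable standing in for a full fairseq training run plus dev-set evaluation:

```python
def sweep_dropout(train_and_eval, values=(0.1, 0.2, 0.3)):
    """Sketch of the hyper-parameter sweep described above.

    `train_and_eval` is a hypothetical function mapping a dropout value
    to a dev-set BLEU score (in practice, one training run per value).
    Returns the dropout value with the highest dev BLEU.
    """
    return max(values, key=train_and_eval)

# Illustration with made-up dev BLEU scores:
fake_dev_bleu = {0.1: 30.2, 0.2: 31.5, 0.3: 31.1}
print(sweep_dropout(fake_dev_bleu.get))  # 0.2
```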

@edunov
I got another question on the same paper, same language pair.
In Table 2, at the top of page 3, it says NT17: 36.77 and NT18: 34.72.
That result seems to include back-translated data.
How does this compare to Table 7 on page 5, on the "BT NewsCrawl" line, where you get NT17 40.09 and NT18 37.07?
(where does the gain come from?)
Cheers.
