Hi,
I tried to reproduce the IWSLT de-en results of Non-Autoregressive Neural Machine Translation (Gu et al., 2017). How should I set the hyperparameters when training the IWSLT model?
Thank you in advance!
What have you tried so far? Did you follow the instructions here? If so, what BLEU score did you achieve, and what command did you use to compute that BLEU score?
Thanks for your reply!
I trained NAT on the distilled IWSLT de-en dataset with the following script:
fairseq-train \
data-bin \
--ddp-backend=no_c10d \
--task translation_lev \
--criterion nat_loss \
--arch nonautoregressive_transformer \
--noise full_mask \
--src-embedding-copy \
--share-all-embeddings \
--optimizer adam --adam-betas '(0.9,0.98)' \
--lr 0.0005 --lr-scheduler inverse_sqrt \
--label-smoothing 0.1 \
--dropout 0.3 --weight-decay 0.01 \
--decoder-learned-pos \
--encoder-learned-pos \
--pred-length-offset \
--length-loss-factor 0.1 \
--apply-bert-init \
--log-format 'simple' --log-interval 100 \
--fixed-validation-seed 7 \
--max-tokens 4096 \
--max-update 50000 \
--warmup-updates 10000 \
--encoder-embed-dim 512 \
--decoder-embed-dim 512 \
--encoder-layers 6 \
--decoder-layers 6 \
--encoder-attention-heads 4 \
--decoder-attention-heads 4 \
--encoder-ffn-embed-dim 1024 \
--decoder-ffn-embed-dim 1024 \
--save-dir output_dir
At inference time, I use the following script:
fairseq-generate \
data-bin \
--path checkpoint_best.pt \
--gen-subset test \
--task translation_lev \
--iter-decode-max-iter 0 \
--iter-decode-eos-penalty 0 \
--beam 1 \
--batch-size 128 \
--remove-bpe
But the BLEU score on the test set is only 15.94.
CC @kahne
Hi @fengkaineu, I haven't tried the IWSLT en-de dataset yet. How did you get your distilled dataset?
I think in the original paper we tried a smaller architecture for this dataset.
Hi @MultiPath The BLEU score is based on the distilled dataset (teacher model with 6 layers, 4 heads, 512 embed-dim, 1024 hidden-dim; BLEU score 34.86).
During training, the validation loss starts increasing after 12 epochs. Could it be that some hyperparameters are not suitable for the IWSLT dataset?
Hi @fengkaineu, did you distill the validation set too, or are you using the real dataset?
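For anyone else reproducing this: a common recipe for distilling a split is to decode it with the autoregressive teacher and keep the hypotheses as the new target side. This is a sketch, not a command from this thread; `teacher.pt` and the file names are placeholders.

```shell
# Decode the split (here: valid) with the autoregressive teacher model.
fairseq-generate data-bin \
    --path teacher.pt \
    --gen-subset valid \
    --beam 5 --remove-bpe > valid.gen.out

# fairseq-generate prints hypotheses as "H-<id>\t<score>\t<text>" in
# shuffled order; restore sentence order and keep only the text column.
grep ^H valid.gen.out | sed 's/^H-//' | sort -n -k1,1 | cut -f3 > valid.distilled.tgt
```

The distilled text then replaces the target side of that split before re-running `fairseq-preprocess`.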
@fengkaineu Can you try the last checkpoint instead of checkpoint_best.pt?
Thanks for the kind explanation. 👍 It is helpful!
I hadn't distilled the validation set before. When I used the last checkpoint, I got 22.33 BLEU.
Hi @MultiPath After distilling the validation set, I get 16.59 BLEU with checkpoint_best.
Why does this happen?
@fengkaineu It happens when the validation loss is not strongly correlated with the NAT performance. I also found that sometimes I can get better scores with checkpoint_last.
An alternative is to implement validation with the BLEU score directly.
Thanks so much for your help!
Hi @fengkaineu, have you implemented validation with the BLEU score? I don't know how to use it.
I tried adding flags like
--best-checkpoint-metric bleu --maximize-best-checkpoint-metric
but it didn't work; the error is something like "the stats has no attribute 'bleu'".
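For reference, newer fairseq releases expose BLEU-based validation through the `--eval-bleu` family of flags on the translation task; without `--eval-bleu`, no `bleu` key is ever added to the validation stats, which matches the error above. A sketch of the extra flags (whether they are wired up for `translation_lev` may depend on your fairseq version):

```shell
# Config fragment: append these to the existing fairseq-train flags.
# --eval-bleu computes BLEU on the validation set each epoch, so the
# checkpoint metric "bleu" exists for --best-checkpoint-metric.
fairseq-train data-bin \
    --task translation_lev \
    --eval-bleu \
    --eval-bleu-detok moses \
    --eval-bleu-remove-bpe \
    --eval-bleu-print-samples \
    --best-checkpoint-metric bleu \
    --maximize-best-checkpoint-metric
```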
Hi @fengkaineu,
How many GPUs do you use to train the model? Do you train with a batch size of only 4k tokens? And how many updates did it take to reach 22 BLEU? Many thanks!