Hi,
I tried to reproduce the IWSLT de-en results of Non-Autoregressive Neural Machine Translation (Gu et al., 2017). How should I set the hyperparameters when training the IWSLT model?
Thank you in advance!
What have you tried so far? Did you follow the instructions here? If so, what BLEU score did you achieve, and what command did you use to compute that BLEU score?
Thanks for your reply!
I trained NAT on the distilled IWSLT de-en dataset with the following script:
fairseq-train \
data-bin \
--ddp-backend=no_c10d \
--task translation_lev \
--criterion nat_loss \
--arch nonautoregressive_transformer \
--noise full_mask \
--src-embedding-copy \
--share-all-embeddings \
--optimizer adam --adam-betas '(0.9,0.98)' \
--lr 0.0005 --lr-scheduler inverse_sqrt \
--label-smoothing 0.1 \
--dropout 0.3 --weight-decay 0.01 \
--decoder-learned-pos \
--encoder-learned-pos \
--pred-length-offset \
--length-loss-factor 0.1 \
--apply-bert-init \
--log-format 'simple' --log-interval 100 \
--fixed-validation-seed 7 \
--max-tokens 4096 \
--max-update 50000 \
--warmup-updates 10000 \
--encoder-embed-dim 512 \
--decoder-embed-dim 512 \
--encoder-layers 6 \
--decoder-layers 6 \
--encoder-attention-heads 4 \
--decoder-attention-heads 4 \
--encoder-ffn-embed-dim 1024 \
--decoder-ffn-embed-dim 1024 \
--save-dir output_dir
At inference time, I use the following script:
fairseq-generate \
data-bin \
--path checkpoint_best.pt \
--gen-subset test \
--task translation_lev \
--iter-decode-max-iter 0 \
--iter-decode-eos-penalty 0 \
--beam 1 \
--batch-size 128 \
--remove-bpe
But the BLEU score on the test set is only 15.94.
CC @kahne
Hi @fengkaineu, I haven't tried the IWSLT en-de dataset yet. How did you get your distilled dataset?
I think in the original paper we tried a smaller architecture for this dataset.
Hi @MultiPath The BLEU score is based on the distilled dataset (teacher model with 6 layers, 4 heads, 512 embed-dim, 1024 hidden-dim; BLEU score 34.86).
During training, the validation loss starts increasing after 12 epochs. Could it be that some hyperparameters are not suitable for the IWSLT dataset?
Hi @fengkaineu, did you distill the validation set too, or are you using the real dataset?
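For anyone else reproducing this: a common recipe for distilling a split is to decode it with the autoregressive teacher and keep the hypotheses as the new target side. This is a sketch, not a command from this thread; `teacher.pt` and the file names are placeholders.

```shell
# Decode the split (here: valid) with the autoregressive teacher model.
fairseq-generate data-bin \
    --path teacher.pt \
    --gen-subset valid \
    --beam 5 --remove-bpe > valid.gen.out

# fairseq-generate prints hypotheses as "H-<id>\t<score>\t<text>" in
# shuffled order; restore sentence order and keep only the text column.
grep ^H valid.gen.out | sed 's/^H-//' | sort -n -k1,1 | cut -f3 > valid.distilled.tgt
```

The distilled text then replaces the target side of that split before re-running `fairseq-preprocess`.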
@fengkaineu Can you try the last checkpoint instead of checkpoint_best.pt?
Thanks for the kind explanation. 👍 It is helpful!
I hadn't distilled the validation set before. When I used the last checkpoint, I got 22.33 BLEU.
Hi @MultiPath After distilling the validation set, I get 16.59 BLEU with checkpoint_best.
Why does this happen?
@fengkaineu It happens when the validation loss is not strongly correlated with the NAT performance. I also found that sometimes I can get better scores with checkpoint_last.
An alternative is to implement validation with the BLEU score directly.
Thanks so much for your help!
Hi @fengkaineu, have you implemented validation with the BLEU score? I don't know how to use it.
I tried adding flags like
--best-checkpoint-metric bleu --maximize-best-checkpoint-metric
but it didn't work; the error is something like "the stats has no attribute 'bleu'".
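For reference, newer fairseq releases expose BLEU-based validation through the `--eval-bleu` family of flags on the translation task; without `--eval-bleu`, no `bleu` key is ever added to the validation stats, which matches the error above. A sketch of the extra flags (whether they are wired up for `translation_lev` may depend on your fairseq version):

```shell
# Config fragment: append these to the existing fairseq-train flags.
# --eval-bleu computes BLEU on the validation set each epoch, so the
# checkpoint metric "bleu" exists for --best-checkpoint-metric.
fairseq-train data-bin \
    --task translation_lev \
    --eval-bleu \
    --eval-bleu-detok moses \
    --eval-bleu-remove-bpe \
    --eval-bleu-print-samples \
    --best-checkpoint-metric bleu \
    --maximize-best-checkpoint-metric
```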
Hi @fengkaineu,
How many GPUs do you use to train the model? Do you train with a batch size of only 4k tokens? And how many updates did it take to reach 22 BLEU? Many thanks!