Fairseq: LayerDrop reduces BLEU?

Created on 12 Feb 2020  ·  14 Comments  ·  Source: pytorch/fairseq

Has anyone found that using LayerDrop reduces BLEU? My bash script is:

python3 -u $BIN/train.py $ALLDATA --save-dir $saveDir --log-format 'simple' --ddp-backend 'no_c10d' \
                    --source-lang 'en' --target-lang 'zh' --left-pad-source False --left-pad-target False \
                    --max-source-positions 512 --max-target-positions 512 \
                    --arch $ARCH \
                    --encoder-layerdrop 0.2 --decoder-layerdrop 0.3 \
                    --max-tokens 1024 --max-sentences 2000 --max-epoch 1000 --max-update 1000000 --save-interval-updates 1000 --save-interval 10000 --log-interval 50 --update-freq 8 \
                    --lr-scheduler 'inverse_sqrt' --learning-rate 0.001 --min-lr 1e-10 \
                    --warmup-updates 4000 --warmup-init-lr 1e-7 \
                    --criterion 'label_smoothed_cross_entropy' --label-smoothing 0.1 \
                    --optimizer 'adam' --adam-betas '(0.9, 0.997)' --fp16

My model's valid BLEU is 44 (decoding with the full model parameters), but it is 46 when I don't use LayerDrop.

question

Most helpful comment

How large is the training dataset? At training time, layerdrop has a strong regularization effect, so if you apply it on a small dataset where there is not much overfitting, I would expect reduced performance. We tested it only on large-scale datasets.

I recommend training with smaller values (0.1 or 0.2) in the encoder and decoder - larger values may over-regularize. Also, you can turn down the amount of normal dropout, since layerdrop provides regularization as well; I recommend reducing it by at least 0.1.

All 14 comments

Did you try other values of layerdrop? The tips section says smaller values (0.1 or 0.2) may work better. What's the model architecture you're using? Can you describe the data more? How are you launching the evaluation script and computing BLEU? In the future, please follow the issue templates.

CC @huihuifan

I didn't try any other values of layerdrop because I want to prune the model to half its size or even smaller, so I think I need to set layerdrop a little bigger than 0.2. The model architecture is transformer_wmt_en_de_big. I wonder whether LayerDrop can reduce performance? The following is my evaluation script:

python3 generate.py data/train_data --log-format 'simple' --source-lang 'en' --target-lang 'zh' --left-pad-source False --left-pad-target False  --gen-subset valid --path checkpoints_best.pt --beam 5 --remove-bpe --quiet

What's more, I always use mteval-v13a.pl to compute BLEU.
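As a side note on decode-time pruning (which comes up later in this thread): a sketch of how a pruned sub-model could be scored with the same generation setup, assuming the fairseq checkout in use exposes --model-overrides and the LayerDrop arguments encoder_layers_to_keep / decoder_layers_to_keep. The layer sets below are only illustrative:

# Sketch: evaluate while keeping only layers 0,2,3,5 on each side
# (assumes a fairseq version with the LayerDrop pruning arguments).
python3 generate.py data/train_data --log-format 'simple' --source-lang 'en' --target-lang 'zh' \
    --left-pad-source False --left-pad-target False --gen-subset valid \
    --path checkpoints_best.pt --beam 5 --remove-bpe --quiet \
    --model-overrides "{'encoder_layers_to_keep': '0,2,3,5', 'decoder_layers_to_keep': '0,2,3,5'}"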

I also encountered this problem. The parameters are as follows: layerdrop=0.2, encoder_layers=12, decoder_layers=6.
BLEU decreased after using layerdrop. Do you have any suggestions?

I sincerely look forward to your reply.

I'm sorry to say I can't either. When I used LayerDrop, the performance decreased dramatically.

How large is the training dataset? At training time, layerdrop has a strong regularization effect, so if you apply it on a small dataset where there is not much overfitting, I would expect reduced performance. We tested it only on large-scale datasets.

I recommend training with smaller values (0.1 or 0.2) in the encoder and decoder - larger values may over-regularize. Also, you can turn down the amount of normal dropout, since layerdrop provides regularization as well; I recommend reducing it by at least 0.1.
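In terms of the training command posted at the top of the thread, that advice maps onto flags roughly as follows. This is only a sketch; the --dropout value of 0.2 is an illustrative "architecture default minus 0.1" rather than a number given in this thread:

# Sketch: lower layerdrop to 0.1 on both sides and explicitly reduce
# standard dropout by 0.1 (0.2 here is illustrative; check the default
# for your architecture before copying this).
python3 -u $BIN/train.py $ALLDATA --save-dir $saveDir --log-format 'simple' --ddp-backend 'no_c10d' \
                    --source-lang 'en' --target-lang 'zh' --left-pad-source False --left-pad-target False \
                    --arch $ARCH --dropout 0.2 \
                    --encoder-layerdrop 0.1 --decoder-layerdrop 0.1 \
                    --max-tokens 1024 --update-freq 8 \
                    --lr-scheduler 'inverse_sqrt' --learning-rate 0.001 --min-lr 1e-10 \
                    --warmup-updates 4000 --warmup-init-lr 1e-7 \
                    --criterion 'label_smoothed_cross_entropy' --label-smoothing 0.1 \
                    --optimizer 'adam' --adam-betas '(0.9, 0.997)' --fp16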

My dataset is large enough (more than 20 billion), and I am now training another model with layerdrop=0.2, dropout=0.1, and the transformer_wmt_ende_big arch. I'll report the result soon.

Hi, have you tested the most recent result using layerdrop? Thanks.

Yes. The result of my experiment is that LayerDrop does not affect the quality of the full model; that is, when the original BLEU of the model is 43.52, the BLEU is also 43.52 after training with LayerDrop. However, if some layers are dropped during decoding, the quality decreases significantly.

BLEU | Layers kept
-- | --
43.11428571 | encode{full}; decode{0,2,3,5}
42.35571429 | encode{0,2,3,5}; decode{0,2,3,5}
42.23285714 | encode{0,1,4,5}; decode{0,2,3,5}
42.05714286 | encode{0,1,2,5}; decode{0,2,3,5}
42.96142857 | encode{full}; decode{0,1,4,5}
42.6342857 | encode{full}; decode{0,3,5}
40.97428571 | encode{full}; decode{0,5}
41.86142857 | encode{0,2,3,5}; decode{0,3,5}

For NMT tasks, try to keep the encoder full size and shrink the decoder - in NMT, encoder size has been found to be very important. The experiment where you shrink the decoder by two layers and only lose like 0.1 BLEU (from 43.1 to 42.9) seems encouraging. If you want better results but are OK with a larger model, try doing something like doubling the encoder size while keeping the decoder size fixed.
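As a concrete illustration of that suggestion (a sketch only; --encoder-layers and --decoder-layers are standard fairseq transformer flags, and the 12/6 split and layerdrop values are examples rather than numbers recommended in this thread):

# Sketch: asymmetric model - deeper encoder, shallow decoder - with
# moderate layerdrop; other hyperparameters as in the training command
# near the top of the thread.
python3 -u $BIN/train.py $ALLDATA --save-dir $saveDir --arch transformer_wmt_en_de_big \
                    --source-lang 'en' --target-lang 'zh' --left-pad-source False --left-pad-target False \
                    --encoder-layers 12 --decoder-layers 6 \
                    --encoder-layerdrop 0.1 --decoder-layerdrop 0.1 \
                    --max-tokens 1024 --update-freq 8 \
                    --lr-scheduler 'inverse_sqrt' --learning-rate 0.001 \
                    --warmup-updates 4000 --warmup-init-lr 1e-7 \
                    --criterion 'label_smoothed_cross_entropy' --label-smoothing 0.1 \
                    --optimizer 'adam' --adam-betas '(0.9, 0.997)' --fp16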

> where you shrink the decoder size by 2 layers and only lose like 0.1 BLEU (from 43.1 to 42.9)

Thank you for your reply and suggestions. The aim of my experiment was to compress the model to a quarter of its original size. This was done by adjusting the model parameters and then training the model from scratch; for now, however, the effect of layerdrop cannot exceed this method.

The full version of the results of the above experiment is as follows
(encoder layerdrop rate = decoder layerdrop rate = 0.2):

BLEU | Layers kept
-- | --
43.52428571 | encode{full}; decode{full}
43.11428571 | encode{full}; decode{0,2,3,5}
42.35571429 | encode{0,2,3,5}; decode{0,2,3,5}
42.23285714 | encode{0,1,4,5}; decode{0,2,3,5}
42.05714286 | encode{0,1,2,5}; decode{0,2,3,5}
42.96142857 | encode{full}; decode{0,1,4,5}
42.6342857 | encode{full}; decode{0,3,5}
40.97428571 | encode{full}; decode{0,5}
41.86142857 | encode{0,2,3,5}; decode{0,3,5}

Thanks for sharing your results @jiezhangGt. A few questions:

1) Could you share the final command for the second row in the most recent table, and roughly how long it took to train?
2) What data were you using?
3) Do you know the score of a model with the same shape as row 2 trained from scratch?
