Hi!
If we share decoder parameters in a multilingual transformer, we need to tell the shared decoder which language to decode into.
This might be done by (embedding and) passing the target language id directly to the decoder.
Alternatively, one might append a language tag to the actual sentence (so that it becomes, e.g., the first word of the sentence).
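The second option can be sketched like this. This is a hypothetical illustration of the "language tag as first token" idea, not fairseq's actual implementation; the `__fr__` tag spelling and the helper name are assumptions:

```python
# Illustrative sketch only: prepend a target-language tag so it becomes
# the first token of the sentence. The "__lang__" spelling is assumed,
# not fairseq's exact token format.
def prepend_langtok(tokens, tgt_lang):
    """Prepend a target-language tag as the first token."""
    return [f"__{tgt_lang}__"] + tokens

print(prepend_langtok(["Hello", "world"], "fr"))
# -> ['__fr__', 'Hello', 'world']
```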
How is it done in fairseq's multilingual transformer?
Thank you,
Maksym
How do you use it for multiple target languages?
The example only covers multiple sources and one target language.
We added `--decoder-langtok` support in #620. You can specify `--decoder-langtok` in both training and inference. It feeds the target language token as the first token to the decoder.
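Conceptually, the effect is that the decoder's first input token identifies the target language rather than being the usual start symbol. This is a rough sketch under assumptions (the token ids and helper are hypothetical, not fairseq's code):

```python
# Rough illustration of a decoder-side language token (hypothetical ids,
# not fairseq's actual implementation): the first token fed to the
# decoder is the target-language token instead of the usual start symbol.
def decoder_first_token(start_token_id, tgt_langtok_id, use_decoder_langtok):
    return tgt_langtok_id if use_decoder_langtok else start_token_id

# Assumed ids for illustration: start symbol = 2, '__fr__' token = 250004.
print(decoder_first_token(2, 250004, use_decoder_langtok=True))   # -> 250004
print(decoder_first_token(2, 250004, use_decoder_langtok=False))  # -> 2
```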
@pipibjc can you please add an example of the many-to-many multilingual translation case? Right now the example only covers the many-to-one scenario.
@madaanpulkit sure, I have a draft example of how to train a many-to-many multilingual translation model, but I need to clean it up a bit. I will update the example page shortly.
A draft would work for the time being (pulkit.[email protected]). Thanks for the quick replies.
Here is an example that uses the binarized data from the multilingual example. It just demonstrates how to specify the command line correctly, without tuning the hyper-parameters:
Training:

```
CUDA_VISIBLE_DEVICES=0 fairseq-train data-bin/iwslt17.de_fr.en.bpe16k/ \
    --max-epoch 50 --ddp-backend=no_c10d \
    --task multilingual_translation --arch multilingual_transformer_iwslt_de_en \
    --share-decoders --share-decoder-input-output-embed \
    --optimizer adam --adam-betas '(0.9, 0.98)' \
    --lr 0.0005 --lr-scheduler inverse_sqrt --min-lr '1e-09' \
    --warmup-updates 4000 --warmup-init-lr '1e-07' \
    --label-smoothing 0.1 --criterion label_smoothed_cross_entropy \
    --dropout 0.3 --weight-decay 0.0001 \
    --save-dir checkpoints/multilingual_transformer \
    --max-tokens 4000 --update-freq 8 --max-update 20 --log-format json \
    --lang-pairs de-en,fr-en,en-fr,en-de --encoder-langtok tgt
```
Inference:

```
CUDA_VISIBLE_DEVICES=0 python generate.py data-bin/iwslt17.de_fr.en.bpe16k/ \
    --task multilingual_translation \
    --path checkpoints/multilingual_transformer/checkpoint_best.pt \
    --source-lang en --target-lang fr --gen-subset valid \
    --lang-pairs de-en,fr-en,en-fr,en-de --encoder-langtok tgt
```
@pipibjc thanks for the help.
Any particular reason behind using `--encoder-langtok` rather than `--decoder-langtok`?
I have experimented with both `--encoder-langtok tgt` and `--decoder-langtok` in the many-to-many case, but I didn't find any difference. I use `--encoder-langtok tgt` in the example just because the original paper suggested doing so.
I tried both, and `--encoder-langtok tgt` worked better for me.
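For reference, `--encoder-langtok` controls which language's tag is prepended to the source sentence on the encoder side: `src` prepends the source-language tag, `tgt` the target-language tag. A minimal sketch of this choice, with an assumed tag spelling (not fairseq's exact token format):

```python
# Sketch of the --encoder-langtok src/tgt choice (tag spelling assumed):
# the flag selects WHICH language's tag is prepended to the source tokens.
def encoder_input(src_tokens, src_lang, tgt_lang, mode):
    lang = tgt_lang if mode == "tgt" else src_lang
    return [f"__{lang}__"] + src_tokens

print(encoder_input(["Guten", "Tag"], "de", "en", mode="tgt"))
# -> ['__en__', 'Guten', 'Tag']
print(encoder_input(["Guten", "Tag"], "de", "en", mode="src"))
# -> ['__de__', 'Guten', 'Tag']
```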