Attempting to run training with XLM-R large using transformer_from_pretrained_xlm for the task
translation_from_pretrained_xlm.
Not sure if "bug" is the right term here, as this isn't documented and I have been trying to piece together what to do via the Fairseq and XLM-R repos.
This eventually generates the following stack trace:
Traceback (most recent call last):
File "/home/ec2-user/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 19, in _wrap
fn(i, *args)
File "/home/ec2-user/fairseq/fairseq_cli/train.py", line 286, in distributed_main
main(args, init_distributed=True)
File "/home/ec2-user/fairseq/fairseq_cli/train.py", line 62, in main
model = task.build_model(args)
File "/home/ec2-user/fairseq/fairseq/tasks/translation.py", line 278, in build_model
return super().build_model(args)
File "/home/ec2-user/fairseq/fairseq/tasks/fairseq_task.py", line 211, in build_model
return models.build_model(args, self)
File "/home/ec2-user/fairseq/fairseq/models/__init__.py", line 48, in build_model
return ARCH_MODEL_REGISTRY[args.arch].build_model(args, task)
File "/home/ec2-user/fairseq/fairseq/models/transformer_from_pretrained_xlm.py", line 63, in build_model
return super().build_model(args, task)
File "/home/ec2-user/fairseq/fairseq/models/transformer.py", line 222, in build_model
encoder = cls.build_encoder(args, src_dict, encoder_embed_tokens)
File "/home/ec2-user/fairseq/fairseq/models/transformer_from_pretrained_xlm.py", line 67, in build_encoder
return TransformerEncoderFromPretrainedXLM(args, src_dict, embed_tokens)
File "/home/ec2-user/fairseq/fairseq/models/transformer_from_pretrained_xlm.py", line 127, in __init__
pretrained_xlm_checkpoint=args.pretrained_xlm_checkpoint,
File "/home/ec2-user/fairseq/fairseq/models/transformer_from_pretrained_xlm.py", line 106, in upgrade_state_dict_with_xlm_weights
subkey, key, pretrained_xlm_checkpoint)
AssertionError: odict_keys(['version', 'embed_tokens.weight', 'embed_positions._float_tensor', 'layers.0.self_attn.k_proj.weight', 'layers.0.self_attn.k_proj.bias', 'layers.0.self_attn.v_proj.weight', 'layers.0.self_attn.v_proj.bias', 'layers.0.self_attn.q_proj.weight', 'layers.0.self_attn.q_proj.bias', 'layers.0.self_attn.out_proj.weight', 'layers.0.self_attn.out_proj.bias', 'layers.0.self_attn_layer_norm.weight', 'layers.0.self_attn_layer_norm.bias', 'layers.0.fc1.weight', 'layers.0.fc1.bias', 'layers.0.fc2.weight', 'layers.0.fc2.bias', 'layers.0.final_layer_norm.weight', 'layers.0.final_layer_norm.bias', 'layers.1.self_attn.k_proj.weight', 'layers.1.self_attn.k_proj.bias', 'layers.1.self_attn.v_proj.weight', 'layers.1.self_attn.v_proj.bias', 'layers.1.self_attn.q_proj.weight', 'layers.1.self_attn.q_proj.bias', 'layers.1.self_attn.out_proj.weight', 'layers.1.self_attn.out_proj.bias', 'layers.1.self_attn_layer_norm.weight', 'layers.1.self_attn_layer_norm.bias', 'layers.1.fc1.weight', 'layers.1.fc1.bias', 'layers.1.fc2.weight', 'layers.1.fc2.bias', 'layers.1.final_layer_norm.weight', 'layers.1.final_layer_norm.bias', 'layers.2.self_attn.k_proj.weight', 'layers.2.self_attn.k_proj.bias', 'layers.2.self_attn.v_proj.weight', 'layers.2.self_attn.v_proj.bias', 'layers.2.self_attn.q_proj.weight', 'layers.2.self_attn.q_proj.bias', 'layers.2.self_attn.out_proj.weight', 'layers.2.self_attn.out_proj.bias', 'layers.2.self_attn_layer_norm.weight', 'layers.2.self_attn_layer_norm.bias', 'layers.2.fc1.weight', 'layers.2.fc1.bias', 'layers.2.fc2.weight', 'layers.2.fc2.bias', 'layers.2.final_layer_norm.weight', 'layers.2.final_layer_norm.bias', 'layers.3.self_attn.k_proj.weight', 'layers.3.self_attn.k_proj.bias', 'layers.3.self_attn.v_proj.weight', 'layers.3.self_attn.v_proj.bias', 'layers.3.self_attn.q_proj.weight', 'layers.3.self_attn.q_proj.bias', 'layers.3.self_attn.out_proj.weight', 'layers.3.self_attn.out_proj.bias', 'layers.3.self_attn_layer_norm.weight', 
'layers.3.self_attn_layer_norm.bias', 'layers.3.fc1.weight', 'layers.3.fc1.bias', 'layers.3.fc2.weight', 'layers.3.fc2.bias', 'layers.3.final_layer_norm.weight', 'layers.3.final_layer_norm.bias', 'layers.4.self_attn.k_proj.weight', 'layers.4.self_attn.k_proj.bias', 'layers.4.self_attn.v_proj.weight', 'layers.4.self_attn.v_proj.bias', 'layers.4.self_attn.q_proj.weight', 'layers.4.self_attn.q_proj.bias', 'layers.4.self_attn.out_proj.weight', 'layers.4.self_attn.out_proj.bias', 'layers.4.self_attn_layer_norm.weight', 'layers.4.self_attn_layer_norm.bias', 'layers.4.fc1.weight', 'layers.4.fc1.bias', 'layers.4.fc2.weight', 'layers.4.fc2.bias', 'layers.4.final_layer_norm.weight', 'layers.4.final_layer_norm.bias', 'layers.5.self_attn.k_proj.weight', 'layers.5.self_attn.k_proj.bias', 'layers.5.self_attn.v_proj.weight', 'layers.5.self_attn.v_proj.bias', 'layers.5.self_attn.q_proj.weight', 'layers.5.self_attn.q_proj.bias', 'layers.5.self_attn.out_proj.weight', 'layers.5.self_attn.out_proj.bias', 'layers.5.self_attn_layer_norm.weight', 'layers.5.self_attn_layer_norm.bias', 'layers.5.fc1.weight', 'layers.5.fc1.bias', 'layers.5.fc2.weight', 'layers.5.fc2.bias', 'layers.5.final_layer_norm.weight', 'layers.5.final_layer_norm.bias']) Transformer encoder / decoder state_dict does not contain embed_positions.weight. Cannot load decoder.sentence_encoder.embed_positions.weight from pretrained XLM checkpoint /xlmr.large/model.pt into Transformer.
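For context, here is a minimal sketch of the key check that trips the assertion (the helper name is hypothetical; the real logic lives in upgrade_state_dict_with_xlm_weights): every checkpoint key under the sentence-encoder prefix must have a matching suffix in the Transformer encoder/decoder state_dict, and a model built with sinusoidal positions only carries an embed_positions._float_tensor buffer, never embed_positions.weight.

```python
def check_transferable(xlm_keys, encoder_keys, prefix="encoder.sentence_encoder."):
    """Return the checkpoint subkeys that have no match in the target
    encoder's state_dict (a sketch of the failing assertion)."""
    missing = []
    for key in xlm_keys:
        if key.startswith(prefix):
            subkey = key[len(prefix):]
            if subkey not in encoder_keys:
                missing.append(subkey)
    return missing

# Sinusoidal position embeddings register only a dummy buffer,
# "embed_positions._float_tensor", so a checkpoint trained with
# learned positions cannot be transferred into them:
encoder_keys = {"embed_tokens.weight", "embed_positions._float_tensor"}
xlm_keys = {"encoder.sentence_encoder.embed_positions.weight"}
print(check_transferable(xlm_keys, encoder_keys))  # ['embed_positions.weight']
```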
The data was preprocessed so that everything points to a single dictionary, by running:
SRCS=("ar" "de" "en" "hi" "fr")
TGT="en"
TEXT="BPE_DIR"  # directory containing the BPE'd data
for SRC in "${SRCS[@]}"; do
  echo "$SRC"
  fairseq-preprocess --source-lang "$SRC" --target-lang "$TGT" \
    --task translation_from_pretrained_xlm \
    --srcdict "$ROOT/xlmr.large/dict.txt" \
    --trainpref "$TEXT/train.bpe.$SRC-en" --testpref "$TEXT/test.bpe.$SRC-en" --validpref "$TEXT/valid.bpe.$SRC-en" \
    --tgtdict ./dict.en.txt \
    --destdir ./data-bin/
done
Training
mkdir -p checkpoints/mlm
fairseq-train /data-bin \
--max-epoch 50 \
--task translation_from_pretrained_xlm \
--save-dir checkpoints/mlm \
--max-update 2400000 --save-interval 10 --no-epoch-checkpoints \
--arch transformer_from_pretrained_xlm \
--optimizer adam --lr-scheduler reduce_lr_on_plateau \
--lr-shrink 0.5 --lr 0.0001 --min-lr 1e-09 \
--dropout 0.3 \
--criterion label_smoothed_cross_entropy \
--max-tokens 2000 \
--source-lang en --target-lang de \
--activation-fn gelu \
--pretrained-xlm-checkpoint xlmr.large/model.pt
Ideally, I would like to use the weights from the 100-language model to fine-tune NMT for monolingual or multilingual models.
How you installed fairseq (pip, source): source. Yes, I referenced issues #907 and #787 before opening.
Would be willing to help here as it will save some arctic ice sheets if the model can start pretrained for translation tasks.
@smart-patrol There are some small differences between the original XLM and XLM-R models. The translation_from_pretrained_xlm task was not updated to work with the newer XLM-R model because we didn't evaluate it on translation tasks.
However, please feel free to submit a PR; I am happy to review and merge.
Hi @ngoyal2707, for the translation_from_pretrained_xlm task, I trained an XLM model according to here on a wiki corpus (downloaded and tokenized according to the XLM GitHub repository), but it didn't work.
It reported: Transformer encoder / decoder state_dict does not contain embed_positions.weight.
Could you please give me a hint as to where I can obtain an XLM model that can be loaded in the translation_from_pretrained_xlm task? I tried all the models I could find in the XLM GitHub repository, but none of them worked either. I don't know what I can do now. :(
Thank you in advance.
The data used for translation was preprocessed like:
fairseq-preprocess --source-lang $src --target-lang $tgt \
--srcdict $SRCDICT \
--trainpref $TEXT/train --validpref $TEXT/valid --testpref $TEXT/test \
--destdir $DESTDIR --workers 20
My training script looks like:
CUDA_VISIBLE_DEVICES=0,1,2 fairseq-train \
$DATADIR \
--criterion label_smoothed_cross_entropy \
--pretrained-xlm-checkpoint ./checkpoints/mlm_wiki/checkpoint_best.pt \
--init-encoder-only --save-dir checkpoints/trans_xlm_new_g2p \
--optimizer adam --dropout 0.3 --weight-decay 0.0001 \
--max-tokens 500 --lr 5e-4 --activation-fn gelu \
--arch transformer_from_pretrained_xlm \
--task translation_from_pretrained_xlm
When I run the training script, it reported the following trace:
Traceback (most recent call last):
File "/home/zhangjiawen/anconda/envs/py3/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 19, in _wrap
fn(i, *args)
File "/home/zhangjiawen/code/fairseq/fairseq_cli/train.py", line 270, in distributed_main
main(args, init_distributed=True)
File "/home/zhangjiawen/code/fairseq/fairseq_cli/train.py", line 64, in main
model = task.build_model(args)
File "/home/zhangjiawen/code/fairseq/fairseq/tasks/translation.py", line 264, in build_model
return super().build_model(args)
File "/home/zhangjiawen/code/fairseq/fairseq/tasks/fairseq_task.py", line 187, in build_model
return models.build_model(args, self)
File "/home/zhangjiawen/code/fairseq/fairseq/models/__init__.py", line 48, in build_model
return ARCH_MODEL_REGISTRY[args.arch].build_model(args, task)
File "/home/zhangjiawen/code/fairseq/fairseq/models/transformer_from_pretrained_xlm.py", line 63, in build_model
return super().build_model(args, task)
File "/home/zhangjiawen/code/fairseq/fairseq/models/transformer.py", line 221, in build_model
encoder = cls.build_encoder(args, src_dict, encoder_embed_tokens)
File "/home/zhangjiawen/code/fairseq/fairseq/models/transformer_from_pretrained_xlm.py", line 67, in build_encoder
return TransformerEncoderFromPretrainedXLM(args, src_dict, embed_tokens)
File "/home/zhangjiawen/code/fairseq/fairseq/models/transformer_from_pretrained_xlm.py", line 128, in __init__
pretrained_xlm_checkpoint=args.pretrained_xlm_checkpoint,
File "/home/zhangjiawen/code/fairseq/fairseq/models/transformer_from_pretrained_xlm.py", line 107, in upgrade_state_dict_with_xlm_weights
subkey, key, pretrained_xlm_checkpoint)
AssertionError: odict_keys(['version', 'embed_tokens.weight', 'embed_positions._float_tensor', 'layers.0.self_attn.k_proj.weight', 'layers.0.self_attn.k_proj.bias', 'layers.0.self_attn.v_proj.weight', 'layers.0.self_attn.v_proj.bias', 'layers.0.self_attn.q_proj.weight', 'layers.0.self_attn.q_proj.bias', 'layers.0.self_attn.out_proj.weight', 'layers.0.self_attn.out_proj.bias', 'layers.0.self_attn_layer_norm.weight', 'layers.0.self_attn_layer_norm.bias', 'layers.0.fc1.weight', 'layers.0.fc1.bias', 'layers.0.fc2.weight', 'layers.0.fc2.bias', 'layers.0.final_layer_norm.weight', 'layers.0.final_layer_norm.bias', 'layers.1.self_attn.k_proj.weight', 'layers.1.self_attn.k_proj.bias', 'layers.1.self_attn.v_proj.weight', 'layers.1.self_attn.v_proj.bias', 'layers.1.self_attn.q_proj.weight', 'layers.1.self_attn.q_proj.bias', 'layers.1.self_attn.out_proj.weight', 'layers.1.self_attn.out_proj.bias', 'layers.1.self_attn_layer_norm.weight', 'layers.1.self_attn_layer_norm.bias', 'layers.1.fc1.weight', 'layers.1.fc1.bias', 'layers.1.fc2.weight', 'layers.1.fc2.bias', 'layers.1.final_layer_norm.weight', 'layers.1.final_layer_norm.bias', 'layers.2.self_attn.k_proj.weight', 'layers.2.self_attn.k_proj.bias', 'layers.2.self_attn.v_proj.weight', 'layers.2.self_attn.v_proj.bias', 'layers.2.self_attn.q_proj.weight', 'layers.2.self_attn.q_proj.bias', 'layers.2.self_attn.out_proj.weight', 'layers.2.self_attn.out_proj.bias', 'layers.2.self_attn_layer_norm.weight', 'layers.2.self_attn_layer_norm.bias', 'layers.2.fc1.weight', 'layers.2.fc1.bias', 'layers.2.fc2.weight', 'layers.2.fc2.bias', 'layers.2.final_layer_norm.weight', 'layers.2.final_layer_norm.bias', 'layers.3.self_attn.k_proj.weight', 'layers.3.self_attn.k_proj.bias', 'layers.3.self_attn.v_proj.weight', 'layers.3.self_attn.v_proj.bias', 'layers.3.self_attn.q_proj.weight', 'layers.3.self_attn.q_proj.bias', 'layers.3.self_attn.out_proj.weight', 'layers.3.self_attn.out_proj.bias', 'layers.3.self_attn_layer_norm.weight', 
'layers.3.self_attn_layer_norm.bias', 'layers.3.fc1.weight', 'layers.3.fc1.bias', 'layers.3.fc2.weight', 'layers.3.fc2.bias', 'layers.3.final_layer_norm.weight', 'layers.3.final_layer_norm.bias', 'layers.4.self_attn.k_proj.weight', 'layers.4.self_attn.k_proj.bias', 'layers.4.self_attn.v_proj.weight', 'layers.4.self_attn.v_proj.bias', 'layers.4.self_attn.q_proj.weight', 'layers.4.self_attn.q_proj.bias', 'layers.4.self_attn.out_proj.weight', 'layers.4.self_attn.out_proj.bias', 'layers.4.self_attn_layer_norm.weight', 'layers.4.self_attn_layer_norm.bias', 'layers.4.fc1.weight', 'layers.4.fc1.bias', 'layers.4.fc2.weight', 'layers.4.fc2.bias', 'layers.4.final_layer_norm.weight', 'layers.4.final_layer_norm.bias', 'layers.5.self_attn.k_proj.weight', 'layers.5.self_attn.k_proj.bias', 'layers.5.self_attn.v_proj.weight', 'layers.5.self_attn.v_proj.bias', 'layers.5.self_attn.q_proj.weight', 'layers.5.self_attn.q_proj.bias', 'layers.5.self_attn.out_proj.weight', 'layers.5.self_attn.out_proj.bias', 'layers.5.self_attn_layer_norm.weight', 'layers.5.self_attn_layer_norm.bias', 'layers.5.fc1.weight', 'layers.5.fc1.bias', 'layers.5.fc2.weight', 'layers.5.fc2.bias', 'layers.5.final_layer_norm.weight', 'layers.5.final_layer_norm.bias']) Transformer encoder / decoder state_dict does not contain embed_positions.weight. Cannot load encoder.sentence_encoder.embed_positions.weight from pretrained XLM checkpoint ./checkpoints/mlm_wiki/checkpoint_best.pt into Transformer.
I've encountered the same problem after I trained an XLM using fairseq code.
I get the same exception.
Any conclusion?
I've figured it out: you need to add the flags --encoder-learned-pos and --decoder-learned-pos.
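To illustrate why those flags matter (a sketch, not fairseq's actual classes): with learned positional embeddings the module is a plain nn.Embedding whose weight appears in the state_dict, whereas fairseq's sinusoidal embeddings recompute the table on the fly and only register a dummy buffer, so there is no embed_positions.weight to load the pretrained weights into.

```python
import torch
import torch.nn as nn

class SinusoidalPositionsSketch(nn.Module):
    """Stand-in for fairseq's SinusoidalPositionalEmbedding: the table is
    recomputed on the fly, so the only persistent entry is a dummy buffer."""
    def __init__(self):
        super().__init__()
        self.register_buffer("_float_tensor", torch.zeros(1))

learned = nn.Embedding(512, 1024)         # what --encoder-learned-pos gives you
sinusoidal = SinusoidalPositionsSketch()  # the default

print(sorted(learned.state_dict()))     # ['weight']
print(sorted(sinusoidal.state_dict()))  # ['_float_tensor']
```

Since the XLM/XLM-R checkpoints were trained with learned positions, the fine-tuned Transformer must use learned positions too for the key names to line up.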
@ngoyal2707 Does the translation_from_pretrained_xlm task now support XLM-R models?
@tonylekhtman, were you able to finetune your trained XLM for NMT in fairseq?
@ajesujoba
I was able to pretrain the XLM model and then finetune it for NMT.
Both the pretraining and fine-tuning were done using fairseq.
Hi @tonylekhtman, that's great!! can you please share your training script for the pretraining and fine-tuning using fairseq? Thanks!
The pretraining code is taken from here:
https://github.com/pytorch/fairseq/tree/master/examples/cross_lingual_language_model
Then you need to preprocess the bilingual data you are interested in fine-tuning on, using fairseq-preprocess.
The fine-tuning command is as follows:
fairseq-train /path/to/preprocessed_bilingual_data \
  --task translation_from_pretrained_xlm \
  -a transformer_from_pretrained_xlm \
  --pretrained-xlm-checkpoint /path/to/pretrained_model_checkpoint \
  --max-tokens 4000 \
  --encoder-embed-dim 1024 --decoder-embed-dim 1024 \
  --encoder-ffn-embed-dim 4096 \
  --encoder-learned-pos --decoder-learned-pos \
  --max-source-positions 256 --max-target-positions 256 \
  --num-workers 6
Cool! Thanks @tonylekhtman . Had the same command, just wanted to be sure. Thanks once again!
@tonylekhtman Hi! Does XLM pretraining in fairseq only support MLM? I found that the original XLM repository can pretrain with MLM+TLM, but fairseq's example says only MLM is supported.