Thanks for releasing the mbart models! I am trying to decode with the pretrained model for use as a baseline, but am running into a few problems:
According to the paper, a language code such as <en> is appended on the encoder side and (at training time) prefixed to the target side. Is this correct? The released dictionary does not seem to contain any such token ([en_US] or similar), so it isn't clear how the codes are supposed to be added. Perhaps the target language code (-t) is applied implicitly by the task?

Furthermore, I cannot run the model to test this. When running with the latest fairseq, I get the following error:
RuntimeError: Error(s) in loading state_dict for BARTModel:
Unexpected key(s) in state_dict: "encoder.layernorm_embedding.weight", "encoder.layernorm_embedding.bias", "decoder.layernorm_embedding.weight", "decoder.layernorm_embedding.bias".
This suggests to me that I am doing something wrong or that some code was not committed.
Steps to reproduce the behavior (always include the command you ran):
infile=wmt19.en
reffile=wmt19.de
outfile=out.wmt19.de
sacrebleu -t wmt19 -l en-de --echo src | head -n 10 > $infile
sacrebleu -t wmt19 -l en-de --echo ref | head -n 10 > $reffile
# constants
langs=ar_AR,cs_CZ,de_DE,en_XX,es_XX,et_EE,fi_FI,fr_XX,gu_IN,hi_IN,it_IT,ja_XX,kk_KZ,ko_KR,lt_LT,lv_LV,my_MM,ne_NP,nl_XX,ro_RO,ru_RU,si_LK,tr_TR,vi_VN,zh_CN
MODELDIR=./cc25_pretrain
DICT=$MODELDIR/dict.txt
export FAIRSEQ=~/code/fairseq
# end constants
tmpdir=$(mktemp -d --tmpdir=/expscratch/$USER)
SRC=en_XX
TRG=de_DE
cat $infile | spm_encode --model $MODELDIR/sentence.bpe.model > $tmpdir/data.spm.$SRC
cat $reffile | spm_encode --model $MODELDIR/sentence.bpe.model > $tmpdir/data.spm.$TRG
python3 $FAIRSEQ/preprocess.py \
--source-lang $SRC \
--target-lang $TRG \
--testpref $tmpdir/data.spm \
--destdir $tmpdir \
--thresholdtgt 0 \
--thresholdsrc 0 \
--srcdict ${DICT} \
--tgtdict ${DICT} \
--workers 70
python3 $FAIRSEQ/generate.py $tmpdir \
--path $MODELDIR/model.pt \
--task translation_from_pretrained_bart \
--gen-subset test \
-s $SRC \
-t $TRG \
--remove-bpe 'sentencepiece' \
--max-sentences 32 \
--langs $langs > $outfile
This dies with the above-reported error.
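For what it's worth, one way to see whether the mismatch comes from the checkpoint or from the model definition is to list the parameter names stored in the checkpoint directly. A minimal sketch, assuming the usual fairseq checkpoint layout with the weights under the 'model' key, and using the same MODELDIR path as the script above:

import torch

# Load the pretrained mBART checkpoint on CPU and print the parameter names
# that mention layernorm_embedding -- the keys BARTModel reportedly rejects.
# Assumes the standard fairseq layout, with weights stored under 'model'.
ckpt = torch.load("./cc25_pretrain/model.pt", map_location="cpu")
for name in ckpt["model"]:
    if "layernorm_embedding" in name:
        print(name)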
How fairseq was installed (pip, source): from source, via pip install --editable (from within a conda env).

Update: I do see how the language codes are handled in fairseq/tasks/translation_from_pretrained_bart.py, but I have not been able to figure out the model key error.
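For anyone else reading along, here is a rough sketch of my reading of that language-code handling: each code in --langs is added to the dictionary as a bracketed symbol, and the source/target codes are appended to the corresponding sentences. This is illustrative only; the toy vocab and helper names are made up for the example and are not the real fairseq API.

# Illustrative only -- approximates my reading of
# fairseq/tasks/translation_from_pretrained_bart.py, not the actual code.

def extend_vocab(vocab, langs):
    """Add one bracketed symbol per language code (e.g. '[en_XX]'), plus '<mask>'."""
    for lang in langs.split(","):
        vocab.setdefault("[{}]".format(lang), len(vocab))
    vocab.setdefault("<mask>", len(vocab))

def append_lang_id(token_ids, vocab, lang):
    """Append the language-code id to an already spm-encoded, indexed sentence."""
    return token_ids + [vocab["[{}]".format(lang)]]

vocab = {"<s>": 0, "<pad>": 1, "</s>": 2, "<unk>": 3}   # stand-in for dict.txt
extend_vocab(vocab, "en_XX,de_DE")
# toy sentence ids; in reality these come from spm_encode plus the dictionary
src = append_lang_id([100, 200, 2], vocab, "en_XX")     # ... </s> [en_XX]
tgt = append_lang_id([300, 400, 2], vocab, "de_DE")     # ... </s> [de_DE]
print(src, tgt)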
Looks like this task only works with the fine-tuned model. Switching to that one (for EN-RO) solved the problem.
Can someone verify that this is the correct preprocessing output for en-ro?
> tokenizer.prepare_translation_batch([' UN Chief Says There Is No Military Solution in Syria'], ['Şeful ONU declară că nu există o soluţie militară în Siria'])
=> {'input_ids':[ 8274, ..., 51712, 2, 250004],
'decoder_input_ids': [ 47711, ..., 2, 250020]}
(This was produced with prepend_bos=False.) For decoding, should decoder_input_ids start with the target language code, even though in the batch above decoder_input_ids end with the target language code?
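In case it helps whoever picks this up: the ids in the batch above can be mapped back to tokens to see exactly where the special tokens sit. A minimal sketch, assuming the MBartTokenizer from transformers and the facebook/mbart-large-cc25 checkpoint; the id-to-language-code correspondence is inferred from the example above, not checked against the source.

from transformers import MBartTokenizer

tok = MBartTokenizer.from_pretrained("facebook/mbart-large-cc25")

# Map the ids from the batch above back to surface tokens, so the positions
# of </s> (id 2) and the language codes (250004 and 250020 here, which appear
# to be en_XX and ro_RO) are visible at a glance.
print(tok.convert_ids_to_tokens([8274, 51712, 2, 250004]))
print(tok.convert_ids_to_tokens([47711, 2, 250020]))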
I have read both threads and am not sure what the verdict is. The legendary @mjpost may have some insight.