Fairseq: errors trying to decode with mbart model

Created on 28 Feb 2020 · 3Comments · Source: pytorch/fairseq

🐛 Bug

Thanks for releasing the mbart models! I am trying to decode with the pretrained model for use as a baseline, but am running into a few problems:

Can you clarify where the language codes go?
The paper suggests that something like <en> is appended to the encoder and (at training time) prefixed to the target side. Is this correct?
However, the README implies the code is more like [en_US] or something
How do I translate to a specific language? I would have expected to have to force-decode to the prefix (via --prefix 1 or something) but this isn't clear. Perhaps the language code (-t) is used implicitly in the task?

Furthermore, I cannot run the model to test this. When running with the latest fairseq, I get the following error:

RuntimeError: Error(s) in loading state_dict for BARTModel:
        Unexpected key(s) in state_dict: "encoder.layernorm_embedding.weight", "encoder.layernorm_embedding.bias", "decoder.layernorm_embedding.weight", "decoder.layernorm_embedding.bias".

This suggests to me that I am doing something wrong or that some code was not committed.

To Reproduce

Steps to reproduce the behavior (always include the command you ran):

infile=wmt19.en
reffile=wmt19.de
outfile=out.wmt19.de

sacrebleu -t wmt19 -l en-de --echo src | head -n 10 > $infile
sacrebleu -t wmt19 -l en-de --echo ref | head -n 10 > $reffile

# constants
langs=ar_AR,cs_CZ,de_DE,en_XX,es_XX,et_EE,fi_FI,fr_XX,gu_IN,hi_IN,it_IT,ja_XX,kk_KZ,ko_KR,lt_LT,lv_LV,my_MM,ne_NP,nl_XX,ro_RO,ru_RU,si_LK,tr_TR,vi_VN,zh_CN
MODELDIR=./cc25_pretrain
DICT=$MODELDIR/dict.txt
export FAIRSEQ=~/code/fairseq
# end constants

tmpdir=$(mktemp -d --tmpdir=/expscratch/$USER)

SRC=en_XX
TRG=de_DE

cat $infile | spm_encode --model $MODELDIR/sentence.bpe.model > $tmpdir/data.spm.$SRC
cat $reffile | spm_encode --model $MODELDIR/sentence.bpe.model > $tmpdir/data.spm.$SRC

cp $tmpdir/data.spm.$SRC $tmpdir/data.spm.$TRG

python3 $FAIRSEQ/preprocess.py \
  --source-lang $SRC \
  --target-lang $TRG \
  --testpref $tmpdir/data.spm  \
  --destdir $tmpdir \
  --thresholdtgt 0 \
  --thresholdsrc 0 \
  --srcdict ${DICT} \
  --tgtdict ${DICT} \
  --workers 70

python3 $FAIRSEQ/generate.py $tmpdir \
  --path $MODELDIR/model.pt \
  --task translation_from_pretrained_bart \
  --gen-subset test \
  -s $SRC \
  -t $TRG \
  --remove-bpe 'sentencepiece' \
  --max-sentences 32 \
  --langs $langs > $outfile

This dies with the above-reported error.

Environment

fairseq Version (e.g., 1.0 or master): latest github
PyTorch Version (e.g., 1.0): 1.4.0
OS (e.g., Linux): CentOS 7.5
How you installed fairseq (pip, source): source
Build command you used (if compiling from source): pip install --editable (from within a conda env)
Python version: 3.7.6
CUDA/cuDNN version: 10.1 / 7.6.3
GPU models and configuration: Titan RTX

bug needs triage

Source

mjpost

Most helpful comment

Looks like this task only works with the fine-tuned model. Switching to that one (for EN-RO) solved the problem.

mjpost on 28 Feb 2020

👍2

All 3 comments

Update: I do see how the language codes are handled in fairseq/tasks/translation_from_pretrained_bart.py, but I have not been able to figure out the model key error.

mjpost on 28 Feb 2020

Looks like this task only works with the fine-tuned model. Switching to that one (for EN-RO) solved the problem.

mjpost on 28 Feb 2020

👍2

Can someone verify that this is the correct preprocessing output for en-ro?

> tokenizer.prepare_translation_batch([' UN Chief Says There Is No Military Solution in Syria'], ['Şeful ONU declară că nu există o soluţie militară în Siria'])
=> {'input_ids':[  8274, ...,  51712,      2, 250004], 
 'decoder_input_ids': [ 47711,   ...,  2, 250020]}

Should the source language code be after eos_id?
No bos_token on either side (prepend_bos=False).
(250004 is the language code for en_EN, afaict.)

For decoding, decoder_input_ids should start with the target language code? even though the decoder_input_ids end with the target language code?

I have read both threads and am not sure what the verdict is. The legendary @mjpost may have some insight.

sshleifer on 9 Jun 2020

Was this page helpful?

0 / 5 - 0 ratings

Related issues

UnicodeEncodeError: 'ascii' codec can't encode character '\xe4' in position 8: ordinal not in range(128)

zhaoxv · 3Comments

Any performance comparison between pre-norm and post-norm for Transformer on Machine Translation

gaopengcuhk · 3Comments

Error during inference of model trained on fp16

Raghava14 · 3Comments

Want to construct a STT from wav2vec

tyoc213 · 3Comments

(wav2vec 2.0)Can you provide detailed hyperparameters for finetune?

zqs01 · 3Comments