Hi,
I'm following the story generation example, but I'm unable to train the model that uses a pretrained checkpoint, following the steps provided in the example. This is what I'm doing:
Cloning the fairseq repo and installing the requirements.
Downloading the dataset:
curl https://s3.amazonaws.com/fairseq-py/data/writingPrompts.tar.gz | tar xvzf -
# Trim each target story to its first 1000 words, as in the example.
data = ["train", "test", "valid"]
for name in data:
    with open(name + ".wp_target") as f:
        stories = f.readlines()
    stories = [" ".join(i.split()[0:1000]) for i in stories]
    with open(name + ".wp_target", "w") as o:
        for line in stories:
            o.write(line.strip() + "\n")
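As a quick sanity check (this snippet is mine, not part of the example), I verify afterwards that no target story exceeds 1000 whitespace-separated tokens:

# Sanity check (not from the example): confirm the truncation above worked.
for name in ["train", "test", "valid"]:
    with open(name + ".wp_target") as f:
        longest = max(len(line.split()) for line in f)
    print(name, "longest target:", longest, "tokens")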
TEXT=examples/stories/writingPrompts
python preprocess.py --source-lang wp_source --target-lang wp_target \
--trainpref $TEXT/train --validpref $TEXT/valid --testpref $TEXT/test \
--destdir data-bin/writingPrompts --padding-factor 1 --thresholdtgt 10 --thresholdsrc 10
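Preprocessing finishes without errors. To confirm the binarized data is in place I just list the destination directory (the exact file names are my assumption of what preprocess.py writes; I expect the two dict files plus per-split .bin/.idx files):

import os
# List whatever preprocess.py produced in the destination directory.
for fname in sorted(os.listdir("data-bin/writingPrompts")):
    print(fname)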
First, training the base model from scratch (--pretrained False):
python train.py data-bin/writingPrompts -a fconv_self_att_wp --lr 0.25 \
--clip-norm 0.1 --max-tokens 1500 --lr-scheduler reduce_lr_on_plateau \
--decoder-attention True --encoder-attention False \
--criterion label_smoothed_cross_entropy --weight-decay .0000001 \
--label-smoothing 0 --source-lang wp_source --target-lang wp_target \
--gated-attention True --self-attention True --project-input True \
--pretrained False --save-interval-updates 50000
Then, training the full model using the checkpoint from the first run as the pretrained checkpoint (--pretrained True):
python train.py data-bin/writingPrompts -a fconv_self_att_wp --lr 0.25 \
--clip-norm 0.1 --max-tokens 1500 --lr-scheduler reduce_lr_on_plateau \
--decoder-attention True --encoder-attention False \
--criterion label_smoothed_cross_entropy --weight-decay .0000001 \
--label-smoothing 0 --source-lang wp_source --target-lang wp_target \
--gated-attention True --self-attention True --project-input True \
--pretrained True --save-interval-updates 50000 \
--pretrained-checkpoint ./checkpoints/checkpoint_best.pt
The second command fails with the following error:
Traceback (most recent call last):
File "/home/nacho/git/fairseq/fairseq/utils.py", line 73, in load_model_state
model.load_state_dict(state['model'], strict=True)
File "/home/nacho/git/fairseq/fairseq/models/fairseq_model.py", line 64, in load_state_dict
super().load_state_dict(state_dict, strict)
File "/home/nacho/environments/deeplearning/lib/python3.6/site-packages/torch/nn/modules/module.py", line 719, in load_state_dict
self.__class__.__name__, "\n\t".join(error_msgs)))
RuntimeError: Error(s) in loading state_dict for FConvModelSelfAtt:
Missing key(s) in state_dict: "encoder.pretrained.encoder.embed_tokens.weight", "encoder.pretrained.encoder.embed_positions.weight", "encoder.pretrained.encoder.fc1.weight", "encoder.pretrained.encoder.fc1.bias", "encoder.pretrained.encoder.projections.2.weight", ..., "decoder.pretrained_decoder.version", "decoder.pretrained_decoder.embed_tokens.weight", "decoder.pretrained_decoder.embed_positions.weight", "decoder.pretrained_decoder.convolutions.0.weight", "decoder.pretrained_decoder.attention.0.attention_module.in_proj_q.bias", "decoder.pretrained_decoder.selfattention.0.attention.0.in_proj_q.0.bias", ...
(The missing-key list continues like this for every parameter under the "encoder.pretrained." and "decoder.pretrained_decoder." prefixes, i.e. the entire pretrained encoder/decoder sub-model; truncated here for readability.)
"decoder.pretrained_decoder.selfattention.5.attention.1.out_proj.weight_g", "decoder.pretrained_decoder.selfattention.5.attention.1.out_proj.weight_v", "decoder.pretrained_decoder.selfattention.5.attention.2.in_proj_q.0.bias", "decoder.pretrained_decoder.selfattention.5.attention.2.in_proj_q.0.weight_g", "decoder.pretrained_decoder.selfattention.5.attention.2.in_proj_q.0.weight_v", "decoder.pretrained_decoder.selfattention.5.attention.2.in_proj_q.2.bias", "decoder.pretrained_decoder.selfattention.5.attention.2.in_proj_q.2.weight_g", "decoder.pretrained_decoder.selfattention.5.attention.2.in_proj_q.2.weight_v", "decoder.pretrained_decoder.selfattention.5.attention.2.in_proj_q.4.bias", "decoder.pretrained_decoder.selfattention.5.attention.2.in_proj_q.4.weight_g", "decoder.pretrained_decoder.selfattention.5.attention.2.in_proj_q.4.weight_v", "decoder.pretrained_decoder.selfattention.5.attention.2.in_proj_k.1.0.bias", "decoder.pretrained_decoder.selfattention.5.attention.2.in_proj_k.1.0.weight_g", "decoder.pretrained_decoder.selfattention.5.attention.2.in_proj_k.1.0.weight_v", "decoder.pretrained_decoder.selfattention.5.attention.2.in_proj_k.1.2.bias", "decoder.pretrained_decoder.selfattention.5.attention.2.in_proj_k.1.2.weight_g", "decoder.pretrained_decoder.selfattention.5.attention.2.in_proj_k.1.2.weight_v", "decoder.pretrained_decoder.selfattention.5.attention.2.in_proj_k.1.4.bias", "decoder.pretrained_decoder.selfattention.5.attention.2.in_proj_k.1.4.weight_g", "decoder.pretrained_decoder.selfattention.5.attention.2.in_proj_k.1.4.weight_v", "decoder.pretrained_decoder.selfattention.5.attention.2.in_proj_v.1.0.bias", "decoder.pretrained_decoder.selfattention.5.attention.2.in_proj_v.1.0.weight_g", "decoder.pretrained_decoder.selfattention.5.attention.2.in_proj_v.1.0.weight_v", "decoder.pretrained_decoder.selfattention.5.attention.2.in_proj_v.1.2.bias", "decoder.pretrained_decoder.selfattention.5.attention.2.in_proj_v.1.2.weight_g", "decoder.pretrained_decoder.selfattention.5.attention.2.in_proj_v.1.2.weight_v", "decoder.pretrained_decoder.selfattention.5.attention.2.in_proj_v.1.4.bias", "decoder.pretrained_decoder.selfattention.5.attention.2.in_proj_v.1.4.weight_g", "decoder.pretrained_decoder.selfattention.5.attention.2.in_proj_v.1.4.weight_v", "decoder.pretrained_decoder.selfattention.5.attention.2.out_proj.bias", "decoder.pretrained_decoder.selfattention.5.attention.2.out_proj.weight_g", "decoder.pretrained_decoder.selfattention.5.attention.2.out_proj.weight_v", "decoder.pretrained_decoder.selfattention.5.attention.3.in_proj_q.0.bias", "decoder.pretrained_decoder.selfattention.5.attention.3.in_proj_q.0.weight_g", "decoder.pretrained_decoder.selfattention.5.attention.3.in_proj_q.0.weight_v", "decoder.pretrained_decoder.selfattention.5.attention.3.in_proj_q.2.bias", "decoder.pretrained_decoder.selfattention.5.attention.3.in_proj_q.2.weight_g", "decoder.pretrained_decoder.selfattention.5.attention.3.in_proj_q.2.weight_v", "decoder.pretrained_decoder.selfattention.5.attention.3.in_proj_q.4.bias", "decoder.pretrained_decoder.selfattention.5.attention.3.in_proj_q.4.weight_g", "decoder.pretrained_decoder.selfattention.5.attention.3.in_proj_q.4.weight_v", "decoder.pretrained_decoder.selfattention.5.attention.3.in_proj_k.1.0.bias", "decoder.pretrained_decoder.selfattention.5.attention.3.in_proj_k.1.0.weight_g", "decoder.pretrained_decoder.selfattention.5.attention.3.in_proj_k.1.0.weight_v", "decoder.pretrained_decoder.selfattention.5.attention.3.in_proj_k.1.2.bias", 
"decoder.pretrained_decoder.selfattention.5.attention.3.in_proj_k.1.2.weight_g", "decoder.pretrained_decoder.selfattention.5.attention.3.in_proj_k.1.2.weight_v", "decoder.pretrained_decoder.selfattention.5.attention.3.in_proj_k.1.4.bias", "decoder.pretrained_decoder.selfattention.5.attention.3.in_proj_k.1.4.weight_g", "decoder.pretrained_decoder.selfattention.5.attention.3.in_proj_k.1.4.weight_v", "decoder.pretrained_decoder.selfattention.5.attention.3.in_proj_v.1.0.bias", "decoder.pretrained_decoder.selfattention.5.attention.3.in_proj_v.1.0.weight_g", "decoder.pretrained_decoder.selfattention.5.attention.3.in_proj_v.1.0.weight_v", "decoder.pretrained_decoder.selfattention.5.attention.3.in_proj_v.1.2.bias", "decoder.pretrained_decoder.selfattention.5.attention.3.in_proj_v.1.2.weight_g", "decoder.pretrained_decoder.selfattention.5.attention.3.in_proj_v.1.2.weight_v", "decoder.pretrained_decoder.selfattention.5.attention.3.in_proj_v.1.4.bias", "decoder.pretrained_decoder.selfattention.5.attention.3.in_proj_v.1.4.weight_g", "decoder.pretrained_decoder.selfattention.5.attention.3.in_proj_v.1.4.weight_v", "decoder.pretrained_decoder.selfattention.5.attention.3.out_proj.bias", "decoder.pretrained_decoder.selfattention.5.attention.3.out_proj.weight_g", "decoder.pretrained_decoder.selfattention.5.attention.3.out_proj.weight_v", "decoder.pretrained_decoder.selfattention.5.attention.out_proj.bias", "decoder.pretrained_decoder.selfattention.5.attention.out_proj.weight_g", "decoder.pretrained_decoder.selfattention.5.attention.out_proj.weight_v", "decoder.pretrained_decoder.selfattention.5.in_proj_q.weight", "decoder.pretrained_decoder.selfattention.5.in_proj_q.bias", "decoder.pretrained_decoder.selfattention.5.in_proj_k.weight", "decoder.pretrained_decoder.selfattention.5.in_proj_k.bias", "decoder.pretrained_decoder.selfattention.5.in_proj_v.weight", "decoder.pretrained_decoder.selfattention.5.in_proj_v.bias", "decoder.pretrained_decoder.selfattention.5.ln.weight", "decoder.pretrained_decoder.selfattention.5.ln.bias", "decoder.pretrained_decoder.selfattention.6.attention.0.in_proj_q.0.bias", "decoder.pretrained_decoder.selfattention.6.attention.0.in_proj_q.0.weight_g", "decoder.pretrained_decoder.selfattention.6.attention.0.in_proj_q.0.weight_v", "decoder.pretrained_decoder.selfattention.6.attention.0.in_proj_q.2.bias", "decoder.pretrained_decoder.selfattention.6.attention.0.in_proj_q.2.weight_g", "decoder.pretrained_decoder.selfattention.6.attention.0.in_proj_q.2.weight_v", "decoder.pretrained_decoder.selfattention.6.attention.0.in_proj_q.4.bias", "decoder.pretrained_decoder.selfattention.6.attention.0.in_proj_q.4.weight_g", "decoder.pretrained_decoder.selfattention.6.attention.0.in_proj_q.4.weight_v", "decoder.pretrained_decoder.selfattention.6.attention.0.in_proj_k.1.0.bias", "decoder.pretrained_decoder.selfattention.6.attention.0.in_proj_k.1.0.weight_g", "decoder.pretrained_decoder.selfattention.6.attention.0.in_proj_k.1.0.weight_v", "decoder.pretrained_decoder.selfattention.6.attention.0.in_proj_k.1.2.bias", "decoder.pretrained_decoder.selfattention.6.attention.0.in_proj_k.1.2.weight_g", "decoder.pretrained_decoder.selfattention.6.attention.0.in_proj_k.1.2.weight_v", "decoder.pretrained_decoder.selfattention.6.attention.0.in_proj_k.1.4.bias", "decoder.pretrained_decoder.selfattention.6.attention.0.in_proj_k.1.4.weight_g", "decoder.pretrained_decoder.selfattention.6.attention.0.in_proj_k.1.4.weight_v", "decoder.pretrained_decoder.selfattention.6.attention.0.in_proj_v.1.0.bias", 
"decoder.pretrained_decoder.selfattention.6.attention.0.in_proj_v.1.0.weight_g", "decoder.pretrained_decoder.selfattention.6.attention.0.in_proj_v.1.0.weight_v", "decoder.pretrained_decoder.selfattention.6.attention.0.in_proj_v.1.2.bias", "decoder.pretrained_decoder.selfattention.6.attention.0.in_proj_v.1.2.weight_g", "decoder.pretrained_decoder.selfattention.6.attention.0.in_proj_v.1.2.weight_v", "decoder.pretrained_decoder.selfattention.6.attention.0.in_proj_v.1.4.bias", "decoder.pretrained_decoder.selfattention.6.attention.0.in_proj_v.1.4.weight_g", "decoder.pretrained_decoder.selfattention.6.attention.0.in_proj_v.1.4.weight_v", "decoder.pretrained_decoder.selfattention.6.attention.0.out_proj.bias", "decoder.pretrained_decoder.selfattention.6.attention.0.out_proj.weight_g", "decoder.pretrained_decoder.selfattention.6.attention.0.out_proj.weight_v", "decoder.pretrained_decoder.selfattention.6.attention.1.in_proj_q.0.bias", "decoder.pretrained_decoder.selfattention.6.attention.1.in_proj_q.0.weight_g", "decoder.pretrained_decoder.selfattention.6.attention.1.in_proj_q.0.weight_v", "decoder.pretrained_decoder.selfattention.6.attention.1.in_proj_q.2.bias", "decoder.pretrained_decoder.selfattention.6.attention.1.in_proj_q.2.weight_g", "decoder.pretrained_decoder.selfattention.6.attention.1.in_proj_q.2.weight_v", "decoder.pretrained_decoder.selfattention.6.attention.1.in_proj_q.4.bias", "decoder.pretrained_decoder.selfattention.6.attention.1.in_proj_q.4.weight_g", "decoder.pretrained_decoder.selfattention.6.attention.1.in_proj_q.4.weight_v", "decoder.pretrained_decoder.selfattention.6.attention.1.in_proj_k.1.0.bias", "decoder.pretrained_decoder.selfattention.6.attention.1.in_proj_k.1.0.weight_g", "decoder.pretrained_decoder.selfattention.6.attention.1.in_proj_k.1.0.weight_v", "decoder.pretrained_decoder.selfattention.6.attention.1.in_proj_k.1.2.bias", "decoder.pretrained_decoder.selfattention.6.attention.1.in_proj_k.1.2.weight_g", "decoder.pretrained_decoder.selfattention.6.attention.1.in_proj_k.1.2.weight_v", "decoder.pretrained_decoder.selfattention.6.attention.1.in_proj_k.1.4.bias", "decoder.pretrained_decoder.selfattention.6.attention.1.in_proj_k.1.4.weight_g", "decoder.pretrained_decoder.selfattention.6.attention.1.in_proj_k.1.4.weight_v", "decoder.pretrained_decoder.selfattention.6.attention.1.in_proj_v.1.0.bias", "decoder.pretrained_decoder.selfattention.6.attention.1.in_proj_v.1.0.weight_g", "decoder.pretrained_decoder.selfattention.6.attention.1.in_proj_v.1.0.weight_v", "decoder.pretrained_decoder.selfattention.6.attention.1.in_proj_v.1.2.bias", "decoder.pretrained_decoder.selfattention.6.attention.1.in_proj_v.1.2.weight_g", "decoder.pretrained_decoder.selfattention.6.attention.1.in_proj_v.1.2.weight_v", "decoder.pretrained_decoder.selfattention.6.attention.1.in_proj_v.1.4.bias", "decoder.pretrained_decoder.selfattention.6.attention.1.in_proj_v.1.4.weight_g", "decoder.pretrained_decoder.selfattention.6.attention.1.in_proj_v.1.4.weight_v", "decoder.pretrained_decoder.selfattention.6.attention.1.out_proj.bias", "decoder.pretrained_decoder.selfattention.6.attention.1.out_proj.weight_g", "decoder.pretrained_decoder.selfattention.6.attention.1.out_proj.weight_v", "decoder.pretrained_decoder.selfattention.6.attention.2.in_proj_q.0.bias", "decoder.pretrained_decoder.selfattention.6.attention.2.in_proj_q.0.weight_g", "decoder.pretrained_decoder.selfattention.6.attention.2.in_proj_q.0.weight_v", "decoder.pretrained_decoder.selfattention.6.attention.2.in_proj_q.2.bias", 
"decoder.pretrained_decoder.selfattention.6.attention.2.in_proj_q.2.weight_g", "decoder.pretrained_decoder.selfattention.6.attention.2.in_proj_q.2.weight_v", "decoder.pretrained_decoder.selfattention.6.attention.2.in_proj_q.4.bias", "decoder.pretrained_decoder.selfattention.6.attention.2.in_proj_q.4.weight_g", "decoder.pretrained_decoder.selfattention.6.attention.2.in_proj_q.4.weight_v", "decoder.pretrained_decoder.selfattention.6.attention.2.in_proj_k.1.0.bias", "decoder.pretrained_decoder.selfattention.6.attention.2.in_proj_k.1.0.weight_g", "decoder.pretrained_decoder.selfattention.6.attention.2.in_proj_k.1.0.weight_v", "decoder.pretrained_decoder.selfattention.6.attention.2.in_proj_k.1.2.bias", "decoder.pretrained_decoder.selfattention.6.attention.2.in_proj_k.1.2.weight_g", "decoder.pretrained_decoder.selfattention.6.attention.2.in_proj_k.1.2.weight_v", "decoder.pretrained_decoder.selfattention.6.attention.2.in_proj_k.1.4.bias", "decoder.pretrained_decoder.selfattention.6.attention.2.in_proj_k.1.4.weight_g", "decoder.pretrained_decoder.selfattention.6.attention.2.in_proj_k.1.4.weight_v", "decoder.pretrained_decoder.selfattention.6.attention.2.in_proj_v.1.0.bias", "decoder.pretrained_decoder.selfattention.6.attention.2.in_proj_v.1.0.weight_g", "decoder.pretrained_decoder.selfattention.6.attention.2.in_proj_v.1.0.weight_v", "decoder.pretrained_decoder.selfattention.6.attention.2.in_proj_v.1.2.bias", "decoder.pretrained_decoder.selfattention.6.attention.2.in_proj_v.1.2.weight_g", "decoder.pretrained_decoder.selfattention.6.attention.2.in_proj_v.1.2.weight_v", "decoder.pretrained_decoder.selfattention.6.attention.2.in_proj_v.1.4.bias", "decoder.pretrained_decoder.selfattention.6.attention.2.in_proj_v.1.4.weight_g", "decoder.pretrained_decoder.selfattention.6.attention.2.in_proj_v.1.4.weight_v", "decoder.pretrained_decoder.selfattention.6.attention.2.out_proj.bias", "decoder.pretrained_decoder.selfattention.6.attention.2.out_proj.weight_g", "decoder.pretrained_decoder.selfattention.6.attention.2.out_proj.weight_v", "decoder.pretrained_decoder.selfattention.6.attention.3.in_proj_q.0.bias", "decoder.pretrained_decoder.selfattention.6.attention.3.in_proj_q.0.weight_g", "decoder.pretrained_decoder.selfattention.6.attention.3.in_proj_q.0.weight_v", "decoder.pretrained_decoder.selfattention.6.attention.3.in_proj_q.2.bias", "decoder.pretrained_decoder.selfattention.6.attention.3.in_proj_q.2.weight_g", "decoder.pretrained_decoder.selfattention.6.attention.3.in_proj_q.2.weight_v", "decoder.pretrained_decoder.selfattention.6.attention.3.in_proj_q.4.bias", "decoder.pretrained_decoder.selfattention.6.attention.3.in_proj_q.4.weight_g", "decoder.pretrained_decoder.selfattention.6.attention.3.in_proj_q.4.weight_v", "decoder.pretrained_decoder.selfattention.6.attention.3.in_proj_k.1.0.bias", "decoder.pretrained_decoder.selfattention.6.attention.3.in_proj_k.1.0.weight_g", "decoder.pretrained_decoder.selfattention.6.attention.3.in_proj_k.1.0.weight_v", "decoder.pretrained_decoder.selfattention.6.attention.3.in_proj_k.1.2.bias", "decoder.pretrained_decoder.selfattention.6.attention.3.in_proj_k.1.2.weight_g", "decoder.pretrained_decoder.selfattention.6.attention.3.in_proj_k.1.2.weight_v", "decoder.pretrained_decoder.selfattention.6.attention.3.in_proj_k.1.4.bias", "decoder.pretrained_decoder.selfattention.6.attention.3.in_proj_k.1.4.weight_g", "decoder.pretrained_decoder.selfattention.6.attention.3.in_proj_k.1.4.weight_v", "decoder.pretrained_decoder.selfattention.6.attention.3.in_proj_v.1.0.bias", 
"decoder.pretrained_decoder.selfattention.6.attention.3.in_proj_v.1.0.weight_g", "decoder.pretrained_decoder.selfattention.6.attention.3.in_proj_v.1.0.weight_v", "decoder.pretrained_decoder.selfattention.6.attention.3.in_proj_v.1.2.bias", "decoder.pretrained_decoder.selfattention.6.attention.3.in_proj_v.1.2.weight_g", "decoder.pretrained_decoder.selfattention.6.attention.3.in_proj_v.1.2.weight_v", "decoder.pretrained_decoder.selfattention.6.attention.3.in_proj_v.1.4.bias", "decoder.pretrained_decoder.selfattention.6.attention.3.in_proj_v.1.4.weight_g", "decoder.pretrained_decoder.selfattention.6.attention.3.in_proj_v.1.4.weight_v", "decoder.pretrained_decoder.selfattention.6.attention.3.out_proj.bias", "decoder.pretrained_decoder.selfattention.6.attention.3.out_proj.weight_g", "decoder.pretrained_decoder.selfattention.6.attention.3.out_proj.weight_v", "decoder.pretrained_decoder.selfattention.6.attention.out_proj.bias", "decoder.pretrained_decoder.selfattention.6.attention.out_proj.weight_g", "decoder.pretrained_decoder.selfattention.6.attention.out_proj.weight_v", "decoder.pretrained_decoder.selfattention.6.in_proj_q.weight", "decoder.pretrained_decoder.selfattention.6.in_proj_q.bias", "decoder.pretrained_decoder.selfattention.6.in_proj_k.weight", "decoder.pretrained_decoder.selfattention.6.in_proj_k.bias", "decoder.pretrained_decoder.selfattention.6.in_proj_v.weight", "decoder.pretrained_decoder.selfattention.6.in_proj_v.bias", "decoder.pretrained_decoder.selfattention.6.ln.weight", "decoder.pretrained_decoder.selfattention.6.ln.bias", "decoder.pretrained_decoder.attproj.0.weight", "decoder.pretrained_decoder.attproj.0.bias", "decoder.pretrained_decoder.attproj.1.weight", "decoder.pretrained_decoder.attproj.1.bias", "decoder.pretrained_decoder.attproj.2.weight", "decoder.pretrained_decoder.attproj.2.bias", "decoder.pretrained_decoder.attproj.3.weight", "decoder.pretrained_decoder.attproj.3.bias", "decoder.pretrained_decoder.attproj.4.weight", "decoder.pretrained_decoder.attproj.4.bias", "decoder.pretrained_decoder.attproj.5.weight", "decoder.pretrained_decoder.attproj.5.bias", "decoder.pretrained_decoder.attproj.6.weight", "decoder.pretrained_decoder.attproj.6.bias", "decoder.pretrained_decoder.fc2.weight", "decoder.pretrained_decoder.fc2.bias", "decoder.pretrained_decoder.fc3.weight", "decoder.pretrained_decoder.fc3.bias", "decoder.gate1.0.weight", "decoder.gate1.0.bias", "decoder.gate2.0.weight", "decoder.gate2.0.bias", "decoder.joining.0.weight", "decoder.joining.0.bias", "decoder.joining.1.weight", "decoder.joining.1.bias", "decoder.joining.3.weight", "decoder.joining.3.bias", "decoder.joining.4.weight", "decoder.joining.4.bias", "decoder.joining.6.weight", "decoder.joining.6.bias", "decoder.joining.7.weight", "decoder.joining.7.bias", "pretrained_encoder.encoder.embed_tokens.weight", "pretrained_encoder.encoder.embed_positions.weight", "pretrained_encoder.encoder.fc1.weight", "pretrained_encoder.encoder.fc1.bias", "pretrained_encoder.encoder.projections.2.weight", "pretrained_encoder.encoder.projections.2.bias", "pretrained_encoder.encoder.convolutions.0.weight", "pretrained_encoder.encoder.convolutions.0.bias", "pretrained_encoder.encoder.convolutions.1.weight", "pretrained_encoder.encoder.convolutions.1.bias", "pretrained_encoder.encoder.convolutions.2.weight", "pretrained_encoder.encoder.convolutions.2.bias", "pretrained_encoder.encoder.fc2.weight", "pretrained_encoder.encoder.fc2.bias".
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "train.py", line 359, in <module>
main(args)
File "train.py", line 77, in main
if not load_checkpoint(args, trainer, epoch_itr):
File "train.py", line 315, in load_checkpoint
eval(args.optimizer_overrides))
File "/home/nacho/git/fairseq/fairseq/trainer.py", line 118, in load_checkpoint
utils.load_model_state(filename, self.get_model())
File "/home/nacho/git/fairseq/fairseq/utils.py", line 75, in load_model_state
raise Exception('Cannot load model parameters from checkpoint, '
Exception: Cannot load model parameters from checkpoint, please ensure that the architectures match
As additional information, I'm using Ubuntu 18.04 and CUDA 10.0. I also tried on Windows (which I know isn't supported), but I was still getting the exact same error.
By the way, I'm able to successfully generate prompts and stories from the pre-trained model, so that part is working fine:
python generate.py data-bin/writingPrompts --path checkpoints/checkpoint_best.pt --batch-size 1 --beam 1 --sampling --sampling-topk 10 --sampling-temperature 0.8 --nbest 1 --model-overrides "{'pretrained_checkpoint':'checkpoints/checkpoint_best.pt'}"
I saw a thread here with a similar issue, but I'm not sure how the OP solved it.
If you have any suggestions, please let me know!
Thanks in advance.
It looks like you're missing the "pretrained" part of the state dict. Can you take a look at the state dict to see what is in it? Also, please double-check that the path for the pretrained model points at the fconv_self_att_wp model you trained.
Thanks a lot for your reply.
If I print the keys in the state_dict, it's true that I don't see any with "pretrained" in them.
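For reference, this is roughly how I'm checking it (a minimal sketch; I'm assuming the checkpoint is a plain torch-serialized dict with the parameters stored under 'model', which is what the traceback above suggests):
import torch
# Load the checkpoint produced by train.py (path from my run; adjust as needed).
state = torch.load('checkpoints/checkpoint_best.pt', map_location='cpu')
# Print every parameter name in the model state dict.
for key in sorted(state['model'].keys()):
    print(key)
# Count the names that reference the pretrained sub-model; this prints 0 for me.
print(sum('pretrained' in k for k in state['model']))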
Would you mind clarifying a couple of points?
When you say to double-check the path for the pretrained model, do you mean the checkpoint that I'm passing in --pretrained-checkpoint (where I'm indeed passing a checkpoint from a fconv_self_att_wp model that has been pretrained), or is there some other parameter that I need to check?
Do I perhaps need to pretrain the model for a specific number of epochs before the pretraining is considered complete? So far I've pretrained it for only 1 epoch and then stopped, since I wanted to check that everything was working before training for more epochs.
Thanks again, I'll keep debugging this on my side. I feel like I'm missing something obvious, since I'm just following the steps outlined in the sample usage for this particular model.
Pinging also @bepierre in case he has further info on this, as he was having the exact same issue in #183
Try setting the --save-dir flag to where your checkpoints folder is located (maybe even without the --pretrained-checkpoint flag at first).
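For example, something along these lines (a sketch only; the flags are the same as in your fine-tuning command, just dropping --pretrained-checkpoint and adding --save-dir, which should point at the folder that actually contains checkpoint_best.pt):
python train.py data-bin/writingPrompts -a fconv_self_att_wp --lr 0.25 --clip-norm 0.1 --max-tokens 1500 --lr-scheduler reduce_lr_on_plateau --decoder-attention True --encoder-attention False --criterion label_smoothed_cross_entropy --weight-decay .0000001 --label-smoothing 0 --source-lang wp_source --target-lang wp_target --gated-attention True --self-attention True --project-input True --pretrained True --save-interval-updates 50000 --save-dir ./checkpoints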
Thanks for your reply. If I pass --save-dir without --pretrained-checkpoint, I receive this:
| [wp_source] dictionary: 19025 types
| [wp_target] dictionary: 104960 types
| data-bin/writingPrompts train 272600 examples
| data-bin/writingPrompts valid 15620 examples
| loading pretrained model
Traceback (most recent call last):
File "train.py", line 359, in <module>
main(args)
File "train.py", line 41, in main
model = task.build_model(args)
File "/home/nacho/git/fairseq/fairseq/tasks/fairseq_task.py", line 130, in build_model
return models.build_model(args, self)
File "/home/nacho/git/fairseq/fairseq/models/__init__.py", line 28, in build_model
return ARCH_MODEL_REGISTRY[args.arch].build_model(args, task)
File "/home/nacho/git/fairseq/fairseq/models/fconv_self_att.py", line 88, in build_model
task=task,
File "/home/nacho/git/fairseq/fairseq/utils.py", line 145, in load_ensemble_for_inference
raise IOError('Model file not found: {}'.format(filename))
OSError: Model file not found:
If I pass both of them I get the same error as before, with the missing "pretrained" keys in the state dictionary.
I'll continue troubleshooting.
OK, I think I got it working now. I had to pass both "--restore-file" and "--pretrained-checkpoint" pointing to the same pretrained checkpoint:
python train.py data-bin/writingPrompts -a fconv_self_att_wp --lr 0.25 --clip-norm 0.1 --max-tokens 1500 --lr-scheduler reduce_lr_on_plateau --decoder-attention True --encoder-attention False --criterion label_smoothed_cross_entropy --weight-decay .0000001 --label-smoothing 0 --source-lang wp_source --target-lang wp_target --gated-attention True --self-attention True --project-input True --pretrained True --save-interval-updates 50000 --restore-file checkpoints/checkpoint_best.pt --pretrained-checkpoint checkpoints/checkpoint_best.pt
@huihuifan, can you confirm whether "--restore-file" is needed when using "--pretrained True"?
Thanks again both for your suggestions!
Just saw this. Yes, I think that's correct. I believe I didn't have this issue because my --restore-file was by default pointing to the same location. I will update the documentation. Thank you for finding this, @Shuukaido and @bepierre!
No worries! Thanks a lot for sharing your research. It has been a great help for my own undergraduate project.
@Shuukaido Hi, I am having the same issue.
Is adding --restore-file and pointing it to the same checkpoint the only change that solved your issue?
I can't understand how that solves the problem of the pretrained model not having the "pretrained" keys in its state dict.
Any suggestions?
Hi @Hannabrahman, it's been a while since I used Fairseq, but yes, as far as I remember, passing "--restore-file" was the only change I had to make for the "Missing key(s) in state_dict (...)" errors to go away. I believe --restore-file now points by default to {save_dir}/checkpoint_last.pt:
https://github.com/pytorch/fairseq/blob/master/fairseq/options.py#L427
https://github.com/pytorch/fairseq/blob/master/fairseq/checkpoint_utils.py#L126
so if your pretrained model has a different name or path, you need to pass it explicitly.
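For example, something like this (a sketch with placeholder paths; the other flags are the same as in the command I posted earlier, and note that depending on the fairseq version --restore-file may be resolved relative to --save-dir, so double-check the final path):
python train.py data-bin/writingPrompts -a fconv_self_att_wp --lr 0.25 --clip-norm 0.1 --max-tokens 1500 --lr-scheduler reduce_lr_on_plateau --decoder-attention True --encoder-attention False --criterion label_smoothed_cross_entropy --weight-decay .0000001 --label-smoothing 0 --source-lang wp_source --target-lang wp_target --gated-attention True --self-attention True --project-input True --pretrained True --save-interval-updates 50000 --restore-file /path/to/pretrained/checkpoint_best.pt --pretrained-checkpoint /path/to/pretrained/checkpoint_best.pt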
@Shuukaido thanks. Yeah, I understand that both --restore-file and --pretrained-checkpoint should point to the same path to avoid the "architectures match" error. However, none of my saved checkpoints have a "pretrained" part in their keys, and that's why I am still getting the "Missing key(s) in state_dict" error.
Thanks, I'll continue troubleshooting.