I've been trying to load the pretrained RoBERTa weights, but the checkpoint does not match the model that fairseq creates:
Namespace(activation_dropout=0.0, activation_fn='gelu', arch='roberta', attention_dropout=0.1, best_checkpoint_metric='loss', bpe=None, bucket_cap_mb=25, clip_norm=25, cpu=False, criterion='cross_entropy', curriculum=0, data='/Donnees/corpus_mini/bin', dataset_impl=None, ddp_backend='c10d', device_id=0, disable_validation=False, distributed_backend='nccl', distributed_init_method=None, distributed_no_spawn=False, distributed_port=-1, distributed_rank=0, distributed_world_size=1, dropout=0.1, encoder_attention_heads=12, encoder_embed_dim=768, encoder_ffn_embed_dim=3072, encoder_layers=12, find_unused_parameters=False, fix_batches_to_gpus=False, force_anneal=None, fp16=False, fp16_init_scale=128, fp16_scale_tolerance=0.0, fp16_scale_window=None, keep_interval_updates=-1, keep_last_epochs=-1, lazy_load=False, left_pad_source='True', left_pad_target='False', log_format=None, log_interval=1000, lr=[0.25], lr_scheduler='fixed', lr_shrink=0.1, max_epoch=0, max_positions=512, max_sentences=None, max_sentences_valid=None, max_source_positions=1024, max_target_positions=1024, max_tokens=4000, max_tokens_valid=4000, max_update=0, maximize_best_checkpoint_metric=False, memory_efficient_fp16=False, min_loss_scale=0.0001, min_lr=-1, momentum=0.99, no_epoch_checkpoints=False, no_last_checkpoints=False, no_progress_bar=False, no_save=False, no_save_optimizer_state=False, num_workers=1, optimizer='nag', optimizer_overrides='{}', pooler_activation_fn='tanh', pooler_dropout=0.0, raw_text=False, required_batch_size_multiple=8, reset_dataloader=False, reset_lr_scheduler=False, reset_meters=False, reset_optimizer=False, restore_file='/Donnees/roberta.base/model.pt', save_dir='checkpoints', save_interval=1, save_interval_updates=0, seed=1, sentence_avg=False, skip_invalid_size_inputs_valid_test=False, source_lang=None, target_lang=None, task='translation', tbmf_wrapper=False, tensorboard_logdir='', threshold_loss_scale=None, tokenizer=None, train_subset='train', update_freq=[1], upsample_primary=1, use_bmuf=False, user_dir=None, valid_subset='valid', validate_interval=1, warmup_updates=0, weight_decay=0.0)
| [src] dictionary: 50264 types
| [trg] dictionary: 50264 types
| loaded 1000 examples from: /Donnees/corpus_mini/bin/valid.src-trg.src
| loaded 1000 examples from: /Donnees/corpus_mini/bin/valid.src-trg.trg
| /Donnees/corpus_mini/bin valid src-trg 1000 examples
RobertaModel(
(decoder): RobertaEncoder(
(sentence_encoder): TransformerSentenceEncoder(
(embed_tokens): Embedding(50264, 768, padding_idx=1)
(embed_positions): LearnedPositionalEmbedding(514, 768, padding_idx=1)
(layers): ModuleList(
(0): TransformerSentenceEncoderLayer(
(self_attn): MultiheadAttention(
(out_proj): Linear(in_features=768, out_features=768, bias=True)
)
(self_attn_layer_norm): LayerNorm(torch.Size([768]), eps=1e-05, elementwise_affine=True)
(fc1): Linear(in_features=768, out_features=3072, bias=True)
(fc2): Linear(in_features=3072, out_features=768, bias=True)
(final_layer_norm): LayerNorm(torch.Size([768]), eps=1e-05, elementwise_affine=True)
)
(1): TransformerSentenceEncoderLayer(
(self_attn): MultiheadAttention(
(out_proj): Linear(in_features=768, out_features=768, bias=True)
)
(self_attn_layer_norm): LayerNorm(torch.Size([768]), eps=1e-05, elementwise_affine=True)
(fc1): Linear(in_features=768, out_features=3072, bias=True)
(fc2): Linear(in_features=3072, out_features=768, bias=True)
(final_layer_norm): LayerNorm(torch.Size([768]), eps=1e-05, elementwise_affine=True)
)
(2): TransformerSentenceEncoderLayer(
(self_attn): MultiheadAttention(
(out_proj): Linear(in_features=768, out_features=768, bias=True)
)
(self_attn_layer_norm): LayerNorm(torch.Size([768]), eps=1e-05, elementwise_affine=True)
(fc1): Linear(in_features=768, out_features=3072, bias=True)
(fc2): Linear(in_features=3072, out_features=768, bias=True)
(final_layer_norm): LayerNorm(torch.Size([768]), eps=1e-05, elementwise_affine=True)
)
(3): TransformerSentenceEncoderLayer(
(self_attn): MultiheadAttention(
(out_proj): Linear(in_features=768, out_features=768, bias=True)
)
(self_attn_layer_norm): LayerNorm(torch.Size([768]), eps=1e-05, elementwise_affine=True)
(fc1): Linear(in_features=768, out_features=3072, bias=True)
(fc2): Linear(in_features=3072, out_features=768, bias=True)
(final_layer_norm): LayerNorm(torch.Size([768]), eps=1e-05, elementwise_affine=True)
)
(4): TransformerSentenceEncoderLayer(
(self_attn): MultiheadAttention(
(out_proj): Linear(in_features=768, out_features=768, bias=True)
)
(self_attn_layer_norm): LayerNorm(torch.Size([768]), eps=1e-05, elementwise_affine=True)
(fc1): Linear(in_features=768, out_features=3072, bias=True)
(fc2): Linear(in_features=3072, out_features=768, bias=True)
(final_layer_norm): LayerNorm(torch.Size([768]), eps=1e-05, elementwise_affine=True)
)
(5): TransformerSentenceEncoderLayer(
(self_attn): MultiheadAttention(
(out_proj): Linear(in_features=768, out_features=768, bias=True)
)
(self_attn_layer_norm): LayerNorm(torch.Size([768]), eps=1e-05, elementwise_affine=True)
(fc1): Linear(in_features=768, out_features=3072, bias=True)
(fc2): Linear(in_features=3072, out_features=768, bias=True)
(final_layer_norm): LayerNorm(torch.Size([768]), eps=1e-05, elementwise_affine=True)
)
(6): TransformerSentenceEncoderLayer(
(self_attn): MultiheadAttention(
(out_proj): Linear(in_features=768, out_features=768, bias=True)
)
(self_attn_layer_norm): LayerNorm(torch.Size([768]), eps=1e-05, elementwise_affine=True)
(fc1): Linear(in_features=768, out_features=3072, bias=True)
(fc2): Linear(in_features=3072, out_features=768, bias=True)
(final_layer_norm): LayerNorm(torch.Size([768]), eps=1e-05, elementwise_affine=True)
)
(7): TransformerSentenceEncoderLayer(
(self_attn): MultiheadAttention(
(out_proj): Linear(in_features=768, out_features=768, bias=True)
)
(self_attn_layer_norm): LayerNorm(torch.Size([768]), eps=1e-05, elementwise_affine=True)
(fc1): Linear(in_features=768, out_features=3072, bias=True)
(fc2): Linear(in_features=3072, out_features=768, bias=True)
(final_layer_norm): LayerNorm(torch.Size([768]), eps=1e-05, elementwise_affine=True)
)
(8): TransformerSentenceEncoderLayer(
(self_attn): MultiheadAttention(
(out_proj): Linear(in_features=768, out_features=768, bias=True)
)
(self_attn_layer_norm): LayerNorm(torch.Size([768]), eps=1e-05, elementwise_affine=True)
(fc1): Linear(in_features=768, out_features=3072, bias=True)
(fc2): Linear(in_features=3072, out_features=768, bias=True)
(final_layer_norm): LayerNorm(torch.Size([768]), eps=1e-05, elementwise_affine=True)
)
(9): TransformerSentenceEncoderLayer(
(self_attn): MultiheadAttention(
(out_proj): Linear(in_features=768, out_features=768, bias=True)
)
(self_attn_layer_norm): LayerNorm(torch.Size([768]), eps=1e-05, elementwise_affine=True)
(fc1): Linear(in_features=768, out_features=3072, bias=True)
(fc2): Linear(in_features=3072, out_features=768, bias=True)
(final_layer_norm): LayerNorm(torch.Size([768]), eps=1e-05, elementwise_affine=True)
)
(10): TransformerSentenceEncoderLayer(
(self_attn): MultiheadAttention(
(out_proj): Linear(in_features=768, out_features=768, bias=True)
)
(self_attn_layer_norm): LayerNorm(torch.Size([768]), eps=1e-05, elementwise_affine=True)
(fc1): Linear(in_features=768, out_features=3072, bias=True)
(fc2): Linear(in_features=3072, out_features=768, bias=True)
(final_layer_norm): LayerNorm(torch.Size([768]), eps=1e-05, elementwise_affine=True)
)
(11): TransformerSentenceEncoderLayer(
(self_attn): MultiheadAttention(
(out_proj): Linear(in_features=768, out_features=768, bias=True)
)
(self_attn_layer_norm): LayerNorm(torch.Size([768]), eps=1e-05, elementwise_affine=True)
(fc1): Linear(in_features=768, out_features=3072, bias=True)
(fc2): Linear(in_features=3072, out_features=768, bias=True)
(final_layer_norm): LayerNorm(torch.Size([768]), eps=1e-05, elementwise_affine=True)
)
)
(emb_layer_norm): LayerNorm(torch.Size([768]), eps=1e-05, elementwise_affine=True)
)
(lm_head): RobertaLMHead(
(dense): Linear(in_features=768, out_features=768, bias=True)
(layer_norm): LayerNorm(torch.Size([768]), eps=1e-05, elementwise_affine=True)
)
)
(classification_heads): ModuleDict()
)
| model roberta, criterion CrossEntropyCriterion
| num. model params: 124695896 (num. trained: 124695896)
| training on 1 GPUs
| max tokens per GPU = 4000 and max sentences per GPU = None
Traceback (most recent call last):
File "/Users/user/git/fairseq/fairseq/trainer.py", line 150, in load_checkpoint
self.get_model().load_state_dict(state['model'], strict=True)
File "/Users/user/git/fairseq/fairseq/models/fairseq_model.py", line 69, in load_state_dict
return super().load_state_dict(state_dict, strict)
File "/Users/user/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 777, in load_state_dict
self.__class__.__name__, "\n\t".join(error_msgs)))
RuntimeError: Error(s) in loading state_dict for RobertaModel:
size mismatch for decoder.sentence_encoder.embed_tokens.weight: copying a param with shape torch.Size([50265, 768]) from checkpoint, the shape in current model is torch.Size([50264, 768]).
size mismatch for decoder.lm_head.weight: copying a param with shape torch.Size([50265, 768]) from checkpoint, the shape in current model is torch.Size([50264, 768]).
size mismatch for decoder.lm_head.bias: copying a param with shape torch.Size([50265]) from checkpoint, the shape in current model is torch.Size([50264]).
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/Applications/PyCharm.app/Contents/helpers/pydev/pydevd.py", line 1758, in <module>
main()
File "/Applications/PyCharm.app/Contents/helpers/pydev/pydevd.py", line 1752, in main
globals = debugger.run(setup['file'], None, None, is_module)
File "/Applications/PyCharm.app/Contents/helpers/pydev/pydevd.py", line 1147, in run
pydev_imports.execfile(file, globals, locals) # execute the script
File "/Applications/PyCharm.app/Contents/helpers/pydev/_pydev_imps/_pydev_execfile.py", line 18, in execfile
exec(compile(contents+"\n", file, 'exec'), glob, loc)
File "/Users/user/git/fairseq/train.py", line 325, in <module>
cli_main()
File "/Users/user/git/fairseq/train.py", line 321, in cli_main
main(args)
File "/Users/user/git/fairseq/train.py", line 68, in main
extra_state, epoch_itr = checkpoint_utils.load_checkpoint(args, trainer)
File "/Users/user/git/fairseq/fairseq/checkpoint_utils.py", line 109, in load_checkpoint
reset_meters=args.reset_meters,
File "/Users/user/git/fairseq/fairseq/trainer.py", line 153, in load_checkpoint
'Cannot load model parameters from checkpoint, '
Exception: Cannot load model parameters from checkpoint, please ensure that the architectures match.
For some reason the dictionary contains 50264 types, which means the input embedding layer expects a vocabulary of 50264 entries. However, the checkpoint's input layer expects 50265. I'm using the same dictionary as the checkpoint. Here are the preprocessing steps that I use:
for SPLIT in train dev; do
python -m examples.roberta.multiprocessing_bpe_encoder \
--encoder-json encoder.json \
--vocab-bpe vocab.bpe \
--inputs "/Donnees/corpus_mini/$SPLIT.src" \
--outputs "/Donnees/corpus_mini/$SPLIT.bpe.src" \
--workers 60 \
--keep-empty
python -m examples.roberta.multiprocessing_bpe_encoder \
--encoder-json encoder.json \
--vocab-bpe vocab.bpe \
--inputs "/Donnees/corpus_mini/$SPLIT.trg" \
--outputs "/Donnees/corpus_mini/$SPLIT.bpe.trg" \
--workers 60 \
--keep-empty
done
fairseq-preprocess \
--source-lang src \
--target-lang trg \
--nwordssrc 50265 \
--nwordstgt 50265 \
--trainpref "/Donnees/corpus_mini/train.bpe" \
--validpref "/Donnees/corpus_mini/dev.bpe" \
--destdir "/Donnees/corpus_mini/bin" \
--workers 12 \
--srcdict dict.txt \
--tgtdict dict.txt
I'm using the encoder.json and vocab.bpe suggested by the documentation. I'm also using the dict.txt that came with the checkpoint.
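To see where the extra entry comes from, I compared the dictionary written by fairseq-preprocess with the embedding stored in the checkpoint. This is just a quick sanity check I put together (I'm assuming the preprocessed dictionary ends up as dict.src.txt in the destdir):
import torch
from fairseq.data import Dictionary

# dictionary written by fairseq-preprocess from dict.txt
d = Dictionary.load('/Donnees/corpus_mini/bin/dict.src.txt')
print(len(d))  # 50264 -- no <mask> symbol

# embedding matrix stored in the pretrained checkpoint
state = torch.load('/Donnees/roberta.base/model.pt', map_location='cpu')
print(state['model']['decoder.sentence_encoder.embed_tokens.weight'].shape)
# torch.Size([50265, 768]) -- one row more than my dictionary has
So the mismatch is exactly one vocabulary entry.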
I'm probably doing something wrong but I don't see it.
I've built fairseq from the v0.8.0 tag.
You need to specify --task=masked_lm. That task adds an extra <mask> element to the dictionary: https://github.com/pytorch/fairseq/blob/master/fairseq/tasks/masked_lm.py#L64
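For illustration, a minimal sketch of what that linked line does, assuming the dict.txt shipped with roberta.base (add_symbol here mirrors the call in masked_lm.py):
from fairseq.data import Dictionary

d = Dictionary.load('dict.txt')   # dictionary shipped with roberta.base
print(len(d))                     # 50264 without <mask>
d.add_symbol('<mask>')            # what the masked_lm task does at setup
print(len(d))                     # 50265 -- matches the checkpoint embedding
With --task=masked_lm the dictionary is extended the same way before the model is built, so the embedding sizes line up with the checkpoint.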