Transformers: Strange bug when Finetuning own pretrained model (with an even stranger solution)

Created on 23 Feb 2020 · 5 comments · Source: huggingface/transformers

🐛 Bug

Information

Roberta

Language I am using the model on (English, Chinese ...): Latin script (might be a mix of languages)

The problem arises when using:
run_glue on a model obtained from run_language_modeling

The task I am working on is:
Sequence Classification (single)

Steps to reproduce the behavior:

  1. Train model using run_language_modeling
  2. Use trained model in run_glue script

Error:
File "run_glue.py", line 148, in train
optimizer.load_state_dict(torch.load(os.path.join(args.model_name_or_path, "optimizer.pt")))
File "/usr/local/lib/python3.6/dist-packages/torch/optim/optimizer.py", line 116, in load_state_dict
raise ValueError("loaded state dict contains a parameter group "
ValueError: loaded state dict contains a parameter group that doesn't match the size of optimizer's group
Note: I searched for what might cause this error (e.g. freezing some layers or passing an incorrect params group), but I have not done anything like that, so this error should not occur.
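
For what it's worth, the mismatch can be reproduced with a toy PyTorch sketch (hypothetical stand-in models, not the actual RoBERTa classes): an optimizer state saved for one set of parameters cannot be loaded into an optimizer built over a differently sized set, which is presumably what happens when the optimizer.pt saved next to the LM-head checkpoint meets the classification-head model.

import torch
from torch import nn

# Toy stand-ins: the two models expose different numbers of parameters,
# so the saved optimizer's parameter group no longer matches the new one.
lm_model = nn.Linear(10, 10)                    # stand-in for the pretraining model
clf_model = nn.Sequential(nn.Linear(10, 10),    # stand-in for the fine-tuning model
                          nn.Linear(10, 2))     # extra classification head

saved_optimizer = torch.optim.AdamW(lm_model.parameters(), lr=5e-5)
new_optimizer = torch.optim.AdamW(clf_model.parameters(), lr=5e-5)

try:
    # Same kind of call run_glue.py makes with the restored optimizer.pt
    new_optimizer.load_state_dict(saved_optimizer.state_dict())
except ValueError as err:
    print(err)  # "loaded state dict contains a parameter group that doesn't match ..."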

Quick Hack/Solution:

There is a strange workaround: simply delete optimizer.pt and set the number of epochs to an arbitrarily large value. Without a very high epoch count, the script proceeds directly to evaluation and does no training at all.
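
For reference, the file-deletion half of the hack is just this (a minimal sketch, assuming a hypothetical output directory ./lm_output from run_language_modeling):

import os

checkpoint_dir = "./lm_output"  # hypothetical path to the pretrained checkpoint

# Drop the stale optimizer state so run_glue builds a fresh optimizer
stale_path = os.path.join(checkpoint_dir, "optimizer.pt")
if os.path.exists(stale_path):
    os.remove(stale_path)

# run_glue still needs a very large number of epochs on top of this,
# otherwise training is skipped entirely (see the explanation in the comments below).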

Environment info

Google Colab
Tokenizers 0.5
Transformers 2.5
GPU: P4

Labels: Should Fix, wontfix

All 5 comments

Hi! This is an interesting use case. I think the error stems from the run_glue script trying to re-use the various attributes the run_language_modeling script had saved.

That includes:

  • the optimizer state
  • the scheduler state
  • the current global step, which is inferred from the checkpoint folder name

Your patch works because:
1) The optimizer state shouldn't be kept across different trainings, so deleting the optimizer file makes sense.
2) The script believes you're already at a very high global step, inferred from the name of your checkpoint folder. Setting a very high number of epochs makes the total number of training steps exceed that global step, so there are still steps left to run (see the sketch below).
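
To illustrate, here is a rough, paraphrased sketch of that resume logic (assumed behaviour, not the exact run_glue code): a trailing number in the model path is read as the number of steps already trained, and only the remaining steps are run.

def infer_resume_state(model_name_or_path, steps_per_epoch, num_train_epochs):
    # A trailing "-<number>" in the path is interpreted as the global step
    # of a previous run; anything else means starting from step 0.
    try:
        global_step = int(model_name_or_path.split("-")[-1].split("/")[0])
    except ValueError:
        global_step = 0
    total_steps = steps_per_epoch * num_train_epochs
    remaining_steps = max(total_steps - global_step, 0)
    return global_step, remaining_steps

# A hypothetical folder named "roberta-pretrained-50000" implies 50000 steps done,
# so with a normal epoch count there is nothing left to train:
print(infer_resume_state("roberta-pretrained-50000", steps_per_epoch=1000, num_train_epochs=3))    # (50000, 0)
print(infer_resume_state("roberta-pretrained-50000", steps_per_epoch=1000, num_train_epochs=500))  # (50000, 450000)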

We should work to fix the issue, but for now I would recommend deleting the files you don't need (optimizer.pt and scheduler.pt) and renaming the folder containing your model/config/tokenizer files so that it doesn't end with a number.
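
A minimal sketch of that cleanup, assuming the hypothetical folder name ./roberta-pretrained-50000:

import os
import shutil

old_dir = "./roberta-pretrained-50000"   # ends in a number, so it looks like a checkpoint step
new_dir = "./roberta-pretrained-final"   # no trailing number, so no resume logic kicks in

# Remove the stale training state
for stale_file in ("optimizer.pt", "scheduler.pt"):
    stale_path = os.path.join(old_dir, stale_file)
    if os.path.exists(stale_path):
        os.remove(stale_path)

# Rename the folder before passing it to run_glue as --model_name_or_path
shutil.move(old_dir, new_dir)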

Maybe we could raise a warning once pretraining is over. Ideally, this should be handled by the script itself, and such manual deletion should not be required.

Yes, I was also stuck on this issue. @LysandreJik, kudos for your hack.

Stuck on the same issue too. Thanks for your suggestion, @LysandreJik.

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
