Roberta
Language I am using the model on (English, Chinese ...): Latin script (might have a mix of languages)
The problem arises when using:
run_glue on a model obtained from run_language_modeling
The tasks I am working on is:
Sequence Classification (single)
Steps to reproduce the behavior:
Error:
File "run_glue.py", line 148, in train
optimizer.load_state_dict(torch.load(os.path.join(args.model_name_or_path, "optimizer.pt")))
File "/usr/local/lib/python3.6/dist-packages/torch/optim/optimizer.py", line 116, in load_state_dict
raise ValueError("loaded state dict contains a parameter group "
ValueError: loaded state dict contains a parameter group that doesn't match the size of optimizer's group
Note: I searched for what might cause this error (e.g. freezing some layers or passing an incorrect param group), but I have not done anything like that, so this error should not occur.
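For what it's worth, the same ValueError can be reproduced with plain PyTorch whenever the saved parameter groups hold a different number of parameters than the new optimizer's groups. The tiny models below are just stand-ins for the LM head vs. classification head, not the actual Roberta modules:

```python
import torch

# Hypothetical stand-ins: an "LM" model and a "classification" model whose
# parameter groups do not have the same number of tensors.
lm_model = torch.nn.Linear(10, 10)            # pretend this is the masked-LM setup
clf_model = torch.nn.Sequential(              # pretend this is the sequence classifier
    torch.nn.Linear(10, 10),
    torch.nn.Linear(10, 2),
)

# Optimizer saved during run_language_modeling (one param group, 2 tensors).
lm_optimizer = torch.optim.AdamW(lm_model.parameters())
saved_state = lm_optimizer.state_dict()       # what ends up in optimizer.pt

# Optimizer rebuilt by run_glue over the classification model (one group, 4 tensors).
clf_optimizer = torch.optim.AdamW(clf_model.parameters())

# Raises: ValueError: loaded state dict contains a parameter group
# that doesn't match the size of optimizer's group
clf_optimizer.load_state_dict(saved_state)
```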
There is a strange workaround: simply delete optimizer.pt and set the number of epochs to an arbitrarily large value. If the number of epochs is not set very high, the script proceeds directly to evaluation and does no training.
Google Colab
Tokenizers 0.5
Transformers 2.5
GPU: P4
Hi! This is an interesting use-case; I think the error stems from the run_glue script trying to re-use the different attributes the run_language_modeling script had saved.
That includes the optimizer state (optimizer.pt), the scheduler state (scheduler.pt), and the global step inferred from the name of the checkpoint folder.
Your patch works because
1) The optimizer state shouldn't be kept across different trainings, so deleting the optimizer file makes sense.
2) The script believes you're already at a very high global step, as inferred from the name of your model folder (sketched below). Setting a very high number of epochs means a very high total number of training steps, hence some steps remain to be completed.
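A rough paraphrase of that resume logic, not an exact copy of the script, just to show why a folder name ending in a number trips it up (the folder name and step counts here are made up):

```python
# When --model_name_or_path points to an existing folder, the trailing number of
# its name is read as the global step that has already been completed.
model_name_or_path = "roberta-lm-output-500000"   # hypothetical folder name ending in a number

global_step = int(model_name_or_path.split("-")[-1].split("/")[0])   # -> 500000

# With e.g. 3 epochs of ~1000 steps each, the total (3000) is far below the
# "already completed" 500000 steps, so the script thinks training is finished
# and jumps straight to evaluation.
num_train_epochs = 3
steps_per_epoch = 1000                             # illustrative value
remaining_steps = num_train_epochs * steps_per_epoch - global_step
print(remaining_steps)                             # negative -> nothing left to train
```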
We should work to fix the issue, but for now I would recommend deleting the files you don't need (optimizer.pt and scheduler.pt) and renaming the folder containing your model/config/tokenizer files so that it doesn't end with a number.
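If it helps, a small cleanup snippet along those lines (the paths are hypothetical, adjust them to your setup):

```python
import os
import shutil

model_dir = "output/roberta-lm-500000"   # hypothetical path to the pretrained model folder

# Drop the training state that run_language_modeling saved; run_glue should not reuse it.
for leftover in ("optimizer.pt", "scheduler.pt"):
    path = os.path.join(model_dir, leftover)
    if os.path.exists(path):
        os.remove(path)

# Rename the folder so it no longer ends with a number, otherwise the trailing
# number is interpreted as an already-completed global step.
clean_dir = "output/roberta-lm-pretrained"
if os.path.isdir(model_dir):
    shutil.move(model_dir, clean_dir)
# Then pass --model_name_or_path output/roberta-lm-pretrained to run_glue.py
```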
Maybe we could raise a warning after pretraining is over. Ideally, this should be handled by the script itself, and such manual deletion should not be required.
Yes, I was also stuck on this issue. @LysandreJik , kudos to your hack.
Stuck in the same issue too. Thanks for your suggestion @LysandreJik
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.