Thanks for the transformers library!
I am trying to fine-tune a pre-trained LongformerForQuestionAnswering model on a custom QA dataset using a custom script adapted from run_squad.py. The pre-trained model is allenai/longformer-large-4096-finetuned-triviaqa.
While saving the tokenizer during training, I run into the following error:
Traceback (most recent call last):
File "examples/question-answering/run_nq.py", line 809, in <module>
main()
File "examples/question-answering/run_nq.py", line 752, in main
global_step, tr_loss = train(args, train_dataset, model, tokenizer)
File "examples/question-answering/run_nq.py", line 248, in train
tokenizer.save_pretrained(output_dir)
File "/home/danishp/git/explain-qa/src/third_party/transformers/src/transformers/tokenization_utils_base.py", line 1368, in save_pretrained
write_dict[key] = value.__getstate__()
AttributeError: 'AddedToken' object has no attribute '__getstate__'
+1, I got the same error.
Hi, do you mind pasting your environment information? Especially related to your transformers and tokenizers versions.
Hi @LysandreJik, thanks for checking in. I am using the version 2.11.0 of the transformers library, and tokenizers==0.7.0.
Following is the associated config file. It doesn't say much about the tokenizer version, but I think the tokenizer is likewise loaded with from_pretrained("allenai/longformer-large-4096-finetuned-triviaqa").
{
"architectures": [
"LongformerForQuestionAnswering"
],
"attention_mode": "longformer",
"attention_probs_dropout_prob": 0.1,
"attention_window": [
512,
512,
512,
512,
512,
512,
512,
512,
512,
512,
512,
512,
512,
512,
512,
512,
512,
512,
512,
512,
512,
512,
512,
512
],
"bos_token_id": 0,
"eos_token_id": 2,
"hidden_act": "gelu",
"hidden_dropout_prob": 0.1,
"hidden_size": 1024,
"ignore_attention_mask": false,
"initializer_range": 0.02,
"intermediate_size": 4096,
"layer_norm_eps": 1e-05,
"max_position_embeddings": 4098,
"model_type": "longformer",
"num_attention_heads": 16,
"num_hidden_layers": 24,
"pad_token_id": 1,
"sep_token_id": 2,
"type_vocab_size": 1,
"vocab_size": 50265
}
A simple way to reproduce the problem is the following:
from transformers import LongformerTokenizer

tokenizer = LongformerTokenizer.from_pretrained("allenai/longformer-base-4096")
# Fails with: AttributeError: 'AddedToken' object has no attribute '__getstate__'
tokenizer.save_pretrained(".")
I think I found out where the problem lies:
tokenizer.special_tokens_map_extended.items()
The special tokens in this map are instances of AddedToken, which has no __getstate__ method, yet save_pretrained calls __getstate__() on each of them at line 1368 of tokenization_utils_base.py:
dict_items([
    ('bos_token', AddedToken("<s>", rstrip=False, lstrip=False, single_word=False)),
    ('eos_token', AddedToken("</s>", rstrip=False, lstrip=False, single_word=False)),
    ('unk_token', AddedToken("<unk>", rstrip=False, lstrip=False, single_word=False)),
    ('sep_token', AddedToken("</s>", rstrip=False, lstrip=False, single_word=False)),
    ('pad_token', AddedToken("<pad>", rstrip=False, lstrip=False, single_word=False)),
    ('cls_token', AddedToken("<s>", rstrip=False, lstrip=False, single_word=False)),
    ('mask_token', AddedToken("<mask>", rstrip=False, lstrip=True, single_word=False))
])
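For anyone hitting this before a fix lands, a possible user-side workaround is to cast the AddedToken special tokens back to plain strings before saving, so the serialization path never reaches the __getstate__() call. This is only a sketch, assuming the failure occurs exclusively on the branch that handles AddedToken instances; note that casting drops per-token flags such as lstrip=True on the mask token.

from transformers import LongformerTokenizer
from tokenizers import AddedToken

tokenizer = LongformerTokenizer.from_pretrained("allenai/longformer-base-4096")

# Replace each AddedToken special token with its raw string content so that
# save_pretrained() serializes plain strings instead of AddedToken objects.
# Caveat: this drops flags like lstrip=True on the mask token.
for name, token in tokenizer.special_tokens_map_extended.items():
    if isinstance(token, AddedToken):
        setattr(tokenizer, name, token.content)

tokenizer.save_pretrained(".")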
Hmmm, I can't reproduce on my end with your versions. A few suggestions:
- pip install -U transformers==2.11.0 and pip install -U tokenizers==0.8.0
- Alternatively, pip install -U transformers should take care of it.
Let me know if any of these fix your issue.
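As a quick sanity check for which installation Python is actually importing (the pip package vs. a local source checkout), you can print the versions and module paths:

import tokenizers
import transformers

# Shows the reported version and the file each package is imported from.
print("transformers", transformers.__version__, transformers.__file__)
print("tokenizers", tokenizers.__version__, tokenizers.__file__)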
Actually, I never pip-installed the transformers library; I am just running the GitHub code cloned a few days ago (I had to edit some parts of it for my use case).
However, when I pip-installed those versions, surprisingly, I don't see this error. As you suggest, it is possible that some tokenizer changes intended for version 3.0.0 crept in.
In the cloned code that I am using, the problem is fixed if I change the offending line to:
write_dict[key] = value.content  # instead of value.__getstate__()
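For context, the surrounding serialization loop in save_pretrained() then looks roughly like the sketch below (paraphrased from the 2.11-era tokenization_utils_base.py, so the exact surrounding code may differ):

write_dict = {}
for key, value in self.special_tokens_map_extended.items():
    if isinstance(value, AddedToken):
        # The Rust-backed AddedToken has no __getstate__, so serialize its
        # raw string content instead.
        write_dict[key] = value.content
    else:
        write_dict[key] = value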
Just had the same issue with version 3.0.2 while fine-tuning the RoBERTa-base model. I guess it would be the same with other BERT-based models.
Changing this line solved the issue.