Transformers: Error while saving Longformer pre-trained model

Created on 2 Jul 2020 · 9 Comments · Source: huggingface/transformers

Thanks for the transformers library!

Information

I am trying to fine-tune a pre-trained model of type LongformerForQuestionAnswering on a custom QA dataset using a custom script adapted from run_squad.py. The pre-trained model is allenai/longformer-large-4096-finetuned-triviaqa.

While saving the model during training (specifically, when tokenizer.save_pretrained is called at a checkpoint), I run into the following error:

Traceback (most recent call last):
  File "examples/question-answering/run_nq.py", line 809, in <module>
    main()
  File "examples/question-answering/run_nq.py", line 752, in main
    global_step, tr_loss = train(args, train_dataset, model, tokenizer)
  File "examples/question-answering/run_nq.py", line 248, in train
    tokenizer.save_pretrained(output_dir)
  File "/home/danishp/git/explain-qa/src/third_party/transformers/src/transformers/tokenization_utils_base.py", line 1368, in save_pretrained
    write_dict[key] = value.__getstate__()
AttributeError: 'AddedToken' object has no attribute '__getstate__'

Most helpful comment

Actually, I never pip-installed the transformers library; I am just running code cloned from GitHub a few days ago (because I had to edit some parts of it for my use case).

However, when I pip-installed these versions, surprisingly, I don't see this error. As you suggest, it is possible that some tokenizer changes intended for version 3.0.0 crept in.

In the cloned code that I am using, if I change the following line

https://github.com/huggingface/transformers/blob/ef0e9d806c51059b07b98cb0279a20d3ba3cbc1d/src/transformers/tokenization_utils_base.py#L1368

to

write_dict[key] = value.content  # instead of value.__getstate__()

the problem is fixed.

Just had the same issue with version 3.0.2 while fine-tuning the RoBERTa-base model. I guess it would have been the same with other BERT-based models.
Changing this line solved the issue.

All 9 comments

+1, I got the same error.

Hi, do you mind pasting your environment information? Especially related to your transformers and tokenizers versions.

Hi @LysandreJik, thanks for checking in. I am using version 2.11.0 of the transformers library, and tokenizers==0.7.0.

The associated config file follows. It doesn't say much about the tokenizer version, but I think the tokenizer is also loaded from LongformerForQuestionAnswering.from_pretrained("allenai/longformer-large-4096-finetuned-triviaqa").

{
  "architectures": [
    "LongformerForQuestionAnswering"
  ],
  "attention_mode": "longformer",
  "attention_probs_dropout_prob": 0.1,
  "attention_window": [
    512,
    512,
    512,
    512,
    512,
    512,
    512,
    512,
    512,
    512,
    512,
    512,
    512,
    512,
    512,
    512,
    512,
    512,
    512,
    512,
    512,
    512,
    512,
    512
  ],
  "bos_token_id": 0,
  "eos_token_id": 2,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 1024,
  "ignore_attention_mask": false,
  "initializer_range": 0.02,
  "intermediate_size": 4096,
  "layer_norm_eps": 1e-05,
  "max_position_embeddings": 4098,
  "model_type": "longformer",
  "num_attention_heads": 16,
  "num_hidden_layers": 24,
  "pad_token_id": 1,
  "sep_token_id": 2,
  "type_vocab_size": 1,
  "vocab_size": 50265
}

A simple way to reproduce the problem is the following:

from transformers import LongformerTokenizer

tokenizer = LongformerTokenizer.from_pretrained("allenai/longformer-base-4096")
tokenizer.save_pretrained(".")  # use an existing directory; Python does not expand "~/"

I think I found where the problem lies:

tokenizer.special_tokens_map_extended.items()

The special tokens in this map are instances of AddedToken, which does not have a __getstate__ method; that method is called on line 1368 of tokenization_utils_base.py:

dict_items([('bos_token', AddedToken("<s>", rstrip=False, lstrip=False, single_word=False)), ('eos_token', AddedToken("</s>", rstrip=False, lstrip=False, single_word=False)), ('unk_token', AddedToken("<unk>", rstrip=False, lstrip=False, single_word=False)), ('sep_token', AddedToken("</s>", rstrip=False, lstrip=False, single_word=False)), ('pad_token', AddedToken("<pad>", rstrip=False, lstrip=False, single_word=False)), ('cls_token', AddedToken("<s>", rstrip=False, lstrip=False, single_word=False)), ('mask_token', AddedToken("<mask>", rstrip=False, lstrip=True, single_word=False))])
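
To make the diagnosis concrete, here is a small inspection sketch (assuming the same transformers==2.11.0 / tokenizers==0.7.0 environment as above):

from transformers import LongformerTokenizer

tokenizer = LongformerTokenizer.from_pretrained("allenai/longformer-base-4096")
for name, token in tokenizer.special_tokens_map_extended.items():
    # AddedToken comes from the Rust-backed `tokenizers` package; in these
    # versions it exposes a .content string but no __getstate__ method,
    # which is exactly what save_pretrained tries to call.
    print(name, repr(token), getattr(token, "content", None))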

Hmmm, I can't reproduce on my end with your versions. Three questions:

  • Did you install from source? If you did, it's possible that you have some tokenizer changes that were intended for version 3.0.0. In that case, could you try installing tokenizers==0.8.0, that has the necessary changes to handle that?
  • Is it possible for you to reinstall both transformers and tokenizers to check? pip install -U transformers==2.11.0 and pip install -U tokenizers==0.8.0
  • If all else fails, is it a possibility for you to install the latest versions? A simple pip install -U transformers should take care of it.

Let me know if any of these fix your issue.
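
As a quick sanity check of which versions the interpreter actually picks up (useful when a cloned checkout shadows a pip install), something like this works:

import transformers
import tokenizers

# If transformers.__file__ points into a cloned repo rather than site-packages,
# the cloned sources are being used instead of the pip-installed release.
print("transformers", transformers.__version__, transformers.__file__)
print("tokenizers", tokenizers.__version__)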

Actually, I never pip-installed the transformers library; I am just running code cloned from GitHub a few days ago (because I had to edit some parts of it for my use case).

However, when I pip-installed these versions, surprisingly, I don't see this error. As you suggest, it is possible that some tokenizer changes intended for version 3.0.0 crept in.

In the cloned code that I am using, if I change the following line

https://github.com/huggingface/transformers/blob/ef0e9d806c51059b07b98cb0279a20d3ba3cbc1d/src/transformers/tokenization_utils_base.py#L1368

to

write_dict[key] = value.content  # instead of value.__getstate__()

the problem is fixed.
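
To spell the workaround out, here is a minimal sketch of the idea (a hypothetical helper, not the actual patched transformers code): replace any AddedToken values with their plain-string .content before the special-tokens map is written to JSON.

import json

def to_serializable(write_dict):
    # Hypothetical helper mirroring the one-line fix above: AddedToken objects
    # (which lack __getstate__ in these versions and are not JSON-serializable)
    # are replaced by their .content strings.
    out = {}
    for key, value in write_dict.items():
        if hasattr(value, "content"):  # AddedToken from the `tokenizers` package
            out[key] = value.content
        elif isinstance(value, (list, tuple)):
            out[key] = [v.content if hasattr(v, "content") else v for v in value]
        else:
            out[key] = value
    return out

# e.g. when writing special_tokens_map.json:
# json.dump(to_serializable(tokenizer.special_tokens_map_extended), f, ensure_ascii=False)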

Just had the same issue with version 3.0.2 while fine-tuning the RoBERTa-base model. I guess it would have been the same with other BERT-based models.
Changing this line solved the issue.

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
