Transformers: Error while saving Longformer pre-trained model

Created on 2 Jul 2020 · 9 Comments · Source: huggingface/transformers

Thanks for the transformers library!

Information

I am trying to fine-tune a pre-trained model of type LongformerForQuestionAnswering on a custom QA dataset using a custom script adapted from run_squad.py. The pre-trained model is allenai/longformer-large-4096-finetuned-triviaqa.

While saving the model during training (specifically, when tokenizer.save_pretrained is called at a checkpoint), I run into the following error:

Traceback (most recent call last):
  File "examples/question-answering/run_nq.py", line 809, in <module>
    main()
  File "examples/question-answering/run_nq.py", line 752, in main
    global_step, tr_loss = train(args, train_dataset, model, tokenizer)
  File "examples/question-answering/run_nq.py", line 248, in train
    tokenizer.save_pretrained(output_dir)
  File "/home/danishp/git/explain-qa/src/third_party/transformers/src/transformers/tokenization_utils_base.py", line 1368, in save_pretrained
    write_dict[key] = value.__getstate__()
AttributeError: 'AddedToken' object has no attribute '__getstate__'

Most helpful comment

Actually, I never pip-installed the transformers library; I am just running code cloned from GitHub a few days ago (because I had to edit some parts of it for my use case).

However, when I pip-installed these versions, surprisingly, I don't see this error. As you suggest, it is possible that some tokenizer changes intended for version 3.0.0 crept in.

In the cloned code that I am using, if I change the following line

https://github.com/huggingface/transformers/blob/ef0e9d806c51059b07b98cb0279a20d3ba3cbc1d/src/transformers/tokenization_utils_base.py#L1368

to

write_dict[key] = value.content  # instead of value.__getstate__()

the problem is fixed.

Just had the same issue with version 3.0.2 while fine-tuning the RoBERTa-base model. I guess it would have been the same with other BERT-based models.
Changing this line solved the issue.

All 9 comments

+1, I got the same error.

Hi, do you mind pasting your environment information? Especially related to your transformers and tokenizers versions.

Hi @LysandreJik, thanks for checking in. I am using version 2.11.0 of the transformers library, and tokenizers==0.7.0.

The associated config file follows. It doesn't say much about the tokenizer version, but I think the tokenizer is also loaded from LongformerForQuestionAnswering.from_pretrained("allenai/longformer-large-4096-finetuned-triviaqa").

{
  "architectures": [
    "LongformerForQuestionAnswering"
  ],
  "attention_mode": "longformer",
  "attention_probs_dropout_prob": 0.1,
  "attention_window": [
    512,
    512,
    512,
    512,
    512,
    512,
    512,
    512,
    512,
    512,
    512,
    512,
    512,
    512,
    512,
    512,
    512,
    512,
    512,
    512,
    512,
    512,
    512,
    512
  ],
  "bos_token_id": 0,
  "eos_token_id": 2,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 1024,
  "ignore_attention_mask": false,
  "initializer_range": 0.02,
  "intermediate_size": 4096,
  "layer_norm_eps": 1e-05,
  "max_position_embeddings": 4098,
  "model_type": "longformer",
  "num_attention_heads": 16,
  "num_hidden_layers": 24,
  "pad_token_id": 1,
  "sep_token_id": 2,
  "type_vocab_size": 1,
  "vocab_size": 50265
}

A simple way to reproduce the problem is the following:

from transformers import LongformerTokenizer

tokenizer = LongformerTokenizer.from_pretrained("allenai/longformer-base-4096")
tokenizer.save_pretrained(".")  # use an existing directory; Python does not expand "~/"

I think I found where the problem lies:

tokenizer.special_tokens_map_extended.items()

The special tokens in this map are instances of AddedToken, which does not have a __getstate__ method; that method is called on line 1368 of tokenization_utils_base.py:

dict_items([('bos_token', AddedToken("<s>", rstrip=False, lstrip=False, single_word=False)), ('eos_token', AddedToken("</s>", rstrip=False, lstrip=False, single_word=False)), ('unk_token', AddedToken("<unk>", rstrip=False, lstrip=False, single_word=False)), ('sep_token', AddedToken("</s>", rstrip=False, lstrip=False, single_word=False)), ('pad_token', AddedToken("<pad>", rstrip=False, lstrip=False, single_word=False)), ('cls_token', AddedToken("<s>", rstrip=False, lstrip=False, single_word=False)), ('mask_token', AddedToken("<mask>", rstrip=False, lstrip=True, single_word=False))])
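
To make the diagnosis concrete, here is a small inspection sketch (assuming the same transformers==2.11.0 / tokenizers==0.7.0 environment as above):

from transformers import LongformerTokenizer

tokenizer = LongformerTokenizer.from_pretrained("allenai/longformer-base-4096")
for name, token in tokenizer.special_tokens_map_extended.items():
    # AddedToken comes from the Rust-backed `tokenizers` package; in these
    # versions it exposes a .content string but no __getstate__ method,
    # which is exactly what save_pretrained tries to call.
    print(name, repr(token), getattr(token, "content", None))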

Hmmm, I can't reproduce on my end with your versions. Three questions:

  • Did you install from source? If you did, it's possible that you have some tokenizer changes that were intended for version 3.0.0. In that case, could you try installing tokenizers==0.8.0, that has the necessary changes to handle that?
  • Is it possible for you to reinstall both transformers and tokenizers to check? pip install -U transformers==2.11.0 and pip install -U tokenizers==0.8.0
  • If all else fails, is it a possibility for you to install the latest versions? A simple pip install -U transformers should take care of it.

Let me know if any of these fix your issue.
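
As a quick sanity check of which versions the interpreter actually picks up (useful when a cloned checkout shadows a pip install), something like this works:

import transformers
import tokenizers

# If transformers.__file__ points into a cloned repo rather than site-packages,
# the cloned sources are being used instead of the pip-installed release.
print("transformers", transformers.__version__, transformers.__file__)
print("tokenizers", tokenizers.__version__)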

Actually, I never pip-installed the transformers library; I am just running code cloned from GitHub a few days ago (because I had to edit some parts of it for my use case).

However, when I pip-installed these versions, surprisingly, I don't see this error. As you suggest, it is possible that some tokenizer changes intended for version 3.0.0 crept in.

In the cloned code that I am using, if I change the following line

https://github.com/huggingface/transformers/blob/ef0e9d806c51059b07b98cb0279a20d3ba3cbc1d/src/transformers/tokenization_utils_base.py#L1368

to

write_dict[key] = value.content  # instead of value.__getstate__()

the problem is fixed.
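
To spell the workaround out, here is a minimal sketch of the idea (a hypothetical helper, not the actual patched transformers code): replace any AddedToken values with their plain-string .content before the special-tokens map is written to JSON.

import json

def to_serializable(write_dict):
    # Hypothetical helper mirroring the one-line fix above: AddedToken objects
    # (which lack __getstate__ in these versions and are not JSON-serializable)
    # are replaced by their .content strings.
    out = {}
    for key, value in write_dict.items():
        if hasattr(value, "content"):  # AddedToken from the `tokenizers` package
            out[key] = value.content
        elif isinstance(value, (list, tuple)):
            out[key] = [v.content if hasattr(v, "content") else v for v in value]
        else:
            out[key] = value
    return out

# e.g. when writing special_tokens_map.json:
# json.dump(to_serializable(tokenizer.special_tokens_map_extended), f, ensure_ascii=False)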

Just had the same issue with version 3.0.2 while fine-tuning the RoBERTa-base model. I guess it would have been the same with other BERT-based models.
Changing this line solved the issue.

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
