Flair: OSError when trying to load model trained on another commputer

Created on 8 Jul 2020  路  18Comments  路  Source: flairNLP/flair

Describe the bug
I trained a sequence tagger model using CamemBERT with a AWS EC2 instance (Ubuntu). When I try to load the model from my computer I have this error :

---------------------------------------------------------------------------
OSError                                   Traceback (most recent call last)
<ipython-input-5-2f413379aedd> in <module>
      6 
      7 # load the NER tagger
----> 8 tagger = SequenceTagger.load('best-model.pt')
      9 
     10 

c:\users\nicod\miniconda3\lib\site-packages\flair\nn.py in load(cls, model)
     86             # see https://github.com/zalandoresearch/flair/issues/351
     87             f = file_utils.load_big_file(str(model_file))
---> 88             state = torch.load(f, map_location='cpu')
     89 
     90         model = cls._init_model_with_state_dict(state)

c:\users\nicod\miniconda3\lib\site-packages\torch\serialization.py in load(f, map_location, pickle_module, **pickle_load_args)
    527             with _open_zipfile_reader(f) as opened_zipfile:
    528                 return _load(opened_zipfile, map_location, pickle_module, **pickle_load_args)
--> 529         return _legacy_load(opened_file, map_location, pickle_module, **pickle_load_args)
    530 
    531 

c:\users\nicod\miniconda3\lib\site-packages\torch\serialization.py in _legacy_load(f, map_location, pickle_module, **pickle_load_args)
    700     unpickler = pickle_module.Unpickler(f, **pickle_load_args)
    701     unpickler.persistent_load = persistent_load
--> 702     result = unpickler.load()
    703 
    704     deserialized_storage_keys = pickle_module.load(f, **pickle_load_args)

c:\users\nicod\miniconda3\lib\site-packages\transformers\tokenization_camembert.py in __setstate__(self, d)
    259             raise
    260         self.sp_model = spm.SentencePieceProcessor()
--> 261         self.sp_model.Load(self.vocab_file)
    262 
    263     def convert_tokens_to_string(self, tokens):

c:\users\nicod\miniconda3\lib\site-packages\sentencepiece.py in Load(self, model_file, model_proto)
    365       if model_proto:
    366         return self.LoadFromSerializedProto(model_proto)
--> 367       return self.LoadFromFile(model_file)
    368 
    369 

c:\users\nicod\miniconda3\lib\site-packages\sentencepiece.py in LoadFromFile(self, arg)
    175 
    176     def LoadFromFile(self, arg):
--> 177         return _sentencepiece.SentencePieceProcessor_LoadFromFile(self, arg)
    178 
    179     def Init(self,

OSError: Not found: "/home/ubuntu/.cache/torch/transformers/3715e3a4a2de48834619b2a6f48979e13ddff5cabfb1f3409db689f9ce3bb98f.28d30f926f545047fc59da64289371eef0fbdc0764ce9ec56f808a646fcfec59": No such file or directory Error #2

Flair is searching a file in my computer that was visibly in the Ubuntu instance so I can't load the model. If I try to load the model from the Ubuntu instance, everything works.

To Reproduce

Here is my code to train the model :

from flair.embeddings import TransformerWordEmbeddings
from flair.data import Corpus
from flair.datasets import ColumnCorpus
import torch

def trainCamembert():
    torch.multiprocessing.freeze_support()
    # define columns
    columns = {0: 'text', 1: 'pos', 2: 'ner'}

    # this is the folder in which train, test and dev files reside
    data_folder = 'wikiner'

    # init a corpus using column format, data folder and the names of the train, dev and test files
    corpus = ColumnCorpus(data_folder =data_folder,
                          column_format=columns,
                          train_file='train.txt',
                          test_file='test.txt',
                          in_memory=True)


    # 1. get the corpus
    print(corpus)

    # 2. what tag do we want to predict?
    tag_type = 'ner'

    # 3. make the tag dictionary from the corpus
    tag_dictionary = corpus.make_tag_dictionary(tag_type=tag_type)
    print(tag_dictionary)

    # 4. initialize embeddings
    embeddings = TransformerWordEmbeddings('camembert-base',
                                           layers='all',
                                           use_scalar_mix=True,
                                           pooling_operation='mean')

    # 5. initialize sequence tagger
    from flair.models import SequenceTagger

    tagger: SequenceTagger = SequenceTagger(hidden_size=256,
                                            embeddings=embeddings,
                                            tag_dictionary=tag_dictionary,
                                            tag_type=tag_type,
                                            use_crf=True)

    # 6. initialize trainer
    from flair.trainers import ModelTrainer

    trainer: ModelTrainer = ModelTrainer(tagger, corpus)

    # 7. start training
    trainer.train('example-ner-camtest',
                  learning_rate=0.1,
                  mini_batch_size=32,
                  max_epochs=150,
                  embeddings_storage_mode="gpu",
                  monitor_test=True,
                  checkpoint=True)

if __name__=="__main__":
    trainCamembert()

and to load the model (on Windows) :

from flair.models import SequenceTagger

# load the NER tagger
tagger = SequenceTagger.load('best-model.pt')

Expected behavior
The model is loaded and I can make prediction with it from another computer than the one used to train.

Environment (On my computer):

  • OS [e.g. iOS, Linux]: Windows 10 Family
  • Version [e.g. flair-0.3.2]: flair 0.5.1 (updated with master branch)

Environment (On the AWS Instance):

  • OS [e.g. iOS, Linux]: Ubuntu 16.04
  • Version [e.g. flair-0.3.2]: flair 0.5.1 (updated with master branch)
bug wontfix

All 18 comments

Any help ? Is this just me or is this a real issue ? I've tried to serialize the tagger with pickle, but I still get the same issue when loading it. I've also dived in the code of Flair to understand how this behaviour can be changed, however it seems like tokenizer (the CamembertTokenizer) is often called with the from_pretrained method so I don't really understand why when loading the model, the tokenizer isn't downloaded instead of searching the vocab_file. Does that mean the vocab_file of the tokenizer is extended during the training so even if we were able to download the original vocab_file, it wouldn't work either ?

Also the issue is not related to this specific case (training on the Ubuntu machine and trying to load on a Windows machine) : I tried different scenarios with different computers and I still get the issue. Also I thought maybe choosing embeddings_storage_mode='none'may be a solution but the issue is related to the vocab_file, not the embeddings stored during the training process.

I think this might be the same issue we had in #1422 with the old CamembertEmbedding class. Could you try reloading the tokenizer after loading:

import flair
from flair.models import SequenceTagger
from transformers import AutoConfig, AutoModel

# load your tagger
tagger = SequenceTagger.load('best-model.pt')

# get name of embedding
transformer_model_name = '-'.join(tagger.embeddings.name.split('-')[2:])
print(transformer_model_name)

# reload transformer embedding
config = AutoConfig.from_pretrained(transformer_model_name, output_hidden_states=True)
tagger.embeddings.model = AutoModel.from_pretrained(transformer_model_name, config=config)

Hello, I tried your code like this :

import flair
from flair.models import SequenceTagger
from transformers import AutoConfig, AutoModel

# load your tagger
tagger = SequenceTagger.load('example-ner-camembert2/best-model.pt')

# get name of embedding
transformer_model_name = '-'.join(tagger.embeddings.name.split('-')[2:])
print(transformer_model_name)

# reload transformer embedding
config = AutoConfig.from_pretrained(transformer_model_name, output_hidden_states=True)
tagger.embeddings.model = AutoModel.from_pretrained(transformer_model_name, config=config)

tagger.save("mymodel.pt")

This is the output :

2020-07-14 09:02:31,256 loading file example-ner-camembert2/best-model.pt
camembert-base

So the model name is correct. However when I try to load mymodel.pt I still get the same issue :

2020-07-14 11:09:21,902 loading file mymodel.pt

---------------------------------------------------------------------------
OSError                                   Traceback (most recent call last)
<ipython-input-4-b4e9d1576b10> in <module>
     11 #embedding.embed(sentence)
     12 # load the NER tagger
---> 13 tagger = SequenceTagger.load("mymodel.pt")

c:\users\nicod\miniconda3\lib\site-packages\flair\nn.py in load(cls, model)
     84             # see https://github.com/zalandoresearch/flair/issues/351
     85             f = file_utils.load_big_file(str(model_file))
---> 86             state = torch.load(f, map_location=flair.device)
     87 
     88         model = cls._init_model_with_state_dict(state)

c:\users\nicod\miniconda3\lib\site-packages\torch\serialization.py in load(f, map_location, pickle_module, **pickle_load_args)
    527             with _open_zipfile_reader(f) as opened_zipfile:
    528                 return _load(opened_zipfile, map_location, pickle_module, **pickle_load_args)
--> 529         return _legacy_load(opened_file, map_location, pickle_module, **pickle_load_args)
    530 
    531 

c:\users\nicod\miniconda3\lib\site-packages\torch\serialization.py in _legacy_load(f, map_location, pickle_module, **pickle_load_args)
    700     unpickler = pickle_module.Unpickler(f, **pickle_load_args)
    701     unpickler.persistent_load = persistent_load
--> 702     result = unpickler.load()
    703 
    704     deserialized_storage_keys = pickle_module.load(f, **pickle_load_args)

c:\users\nicod\miniconda3\lib\site-packages\transformers\tokenization_camembert.py in __setstate__(self, d)
    259             raise
    260         self.sp_model = spm.SentencePieceProcessor()
--> 261         self.sp_model.Load(self.vocab_file)
    262 
    263     def convert_tokens_to_string(self, tokens):

c:\users\nicod\miniconda3\lib\site-packages\sentencepiece.py in Load(self, model_file, model_proto)
    365       if model_proto:
    366         return self.LoadFromSerializedProto(model_proto)
--> 367       return self.LoadFromFile(model_file)
    368 
    369 

c:\users\nicod\miniconda3\lib\site-packages\sentencepiece.py in LoadFromFile(self, arg)
    175 
    176     def LoadFromFile(self, arg):
--> 177         return _sentencepiece.SentencePieceProcessor_LoadFromFile(self, arg)
    178 
    179     def Init(self,

OSError: Not found: "/root/.cache/torch/transformers/3715e3a4a2de48834619b2a6f48979e13ddff5cabfb1f3409db689f9ce3bb98f.28d30f926f545047fc59da64289371eef0fbdc0764ce9ec56f808a646fcfec59": No such file or directory Error #2

It's not exactly identical because this time I made the model in a docker container (debian) inside another Windows machine thus the path not found is different.

Hm strange. So when you load it the first time it works, but if you save the model again and re-load it, it throws this error? Or are you switching machines in between?

I can only load the model on the computer used to create it. If I train my model on computer A then I am only able to load it from A. If I try to load it on say computer B then I have the error. I can load it on computer A as many times as I want so yes I am switching machines in between.

This issue seems to be quite old actually : #1712 reported a similar behaviour with another transformer model. It sounds like it's related to the transformer package and the way it uses a function called save_pretrained but it's note sure. I think this is a good track : I will dive into it and report here anything useful I can find to solve this issue

So I think the problem comes from the use of torch.save function to save models coming from tranformers. As stated here https://github.com/huggingface/transformers/issues/5292#issuecomment-650677235, it's not the correct way to save transformers' objects most of the time. When torch.save is called to save the model with its state_dict, the path to the vocab_file of the tokenizer is written in the generated .pt file as you can see here with my camemBERT.pt file trained on my Ubuntu instance :

image

This is the path were the vocab_file was cached when the model was launched for the first time so the computer needed to download CamemBERT's vocab_file. Unfortunately it's not possible (or I didn't find how to) load the .pt file, change the location for the vocab_file and I don't know where in the state dict of the model this path for the vocab_file is written.

It's possible to save the vocab_file at another location with the save_vocabulary method of all PreTrainedTokenizer so I try to add this method in your TransformerWordEmbedding class :

def save(self, model_directory: Union[str, Path]):
        """
        Saves the current model to the provided directory.
        :param model_directory: The directory to saved model's information
        """
        self.model.save_pretrained(model_directory)
        #Save all tokenizer's information in the directory model_directory
        self.tokenizer.save_pretrained(model_directory)

        #Then overwrite the location of the vocab_file in the tokenizer with an environment variable
        path = self.tokenizer.save_vocabulary(model_directory)
        os.environ['VOCAB_FILE'] = os.path.abspath(path[0])
        self.tokenizer.vocab_file = (lambda x : f"{os.getenv('VOCAB_FILE')}")("")

and I modified the save method of the SequenceTagger too :

def save(self, model_file: Union[str, Path]):
        """
        Saves the current model to the provided file.
        :param model_file: the model file
        """
        if isinstance(self.embeddings,TransformerWordEmbeddings):
            print("You need to save the information about the tokenizer too")
            self.embeddings.save("test")
        super().save(model_file)

I wanted to decide a location for the vocab_file that will in a directory called "test". Then I tried to overwrite the location of the vocab_file in the tokenizer with a lambda function that output the content of a environement variable called "VOCAB_FILE". I hoped that when the .pt file would be created instead of a path for vocab_file, I would have instead the lambda function so that when loading the model from another computer where we download the vocab_file, we would juste have to change the VOCAB_FILE variable so that the correct path is put. However it didn't work and I still get the path in the .cache folder.

I think that the easiest way to solve this issue would be to write a completely different save method for SequenceTagger in the case the embedding is a TransformerWordEmbedding. And this method will have to use save_pretrained method (from Transformers) to save the information regarding the tokenizer. So a different load method would be also necessary to load the model using from_pretrained for the tokenizer as recommended by Hugging Face

@Nighthyst Did you solve the issue ? I'm also getting the same issue when I train my "xlm-roberta-base" model on google colab and use it on my local machine using TextClassifier.load(mymodel.pt).

No I didn't : this is pretty hard to solve with my limited knowledge of Flair's codebase. I hope @alanakbik or some experienced contributor of Flair can check this out

@Nighthyst you can use a workaround as mentioned here https://github.com/flairNLP/flair/issues/1712
Just patch the cambert_tokenizer instead of albert_tokenizer in the issue above

Hi @mittalsuraj18 , so I tried your workaround with CamemBERT like this :

from types import MethodType
import transformers
vocab_file = transformers.tokenization_camembert.CamembertTokenizer.from_pretrained("camembert-base").vocab_file

def _setstate(self, d):  # Method to patch with
    self.__dict__ = d
    try:
        import sentencepiece as spm
    except ImportError:
        logger.warning(
            "You need to install SentencePiece to use AlbertTokenizer: https://github.com/google/sentencepiece"
            "pip install sentencepiece"
        )
        raise
    self.sp_model = spm.SentencePieceProcessor()
    self.sp_model.Load(vocab_file)

# Actual Patching being done here
transformers.tokenization_camembert.CamembertTokenizer.__setstate__ = MethodType(
            _setstate, transformers.tokenization_camembert.CamembertTokenizer(vocab_file )
)

Using it, I can load the model :

tagger = SequenceTagger.load("super_camembert/best-model.pt")

This above works now. However when I try to use the tagger to predict entities with the following code snippet:

# The sentence objects holds a sentence that we may want to embed or tag
from flair.data import Sentence

# text with English and German sentences
sentence = Sentence("Je me demande pourquoi Paris n'est pas plus appr茅ci茅e pour ses monuments")

# predict NER tags
tagger.predict(sentence)

# print sentence with predicted tags
print(sentence.to_tagged_string())

I have this error :

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-19-3a0824de8da8> in <module>
      6 
      7 # predict PoS tags
----> 8 tagger.predict(sentence)
      9 
     10 # print sentence with predicted tags

/opt/conda/envs/asd/lib/python3.7/site-packages/flair/models/sequence_tagger_model.py in predict(self, sentences, mini_batch_size, embedding_storage_mode, all_tag_prob, verbose, use_tokenizer)
    370                     continue
    371 
--> 372                 feature: torch.Tensor = self.forward(batch)
    373                 tags, all_tags = self._obtain_labels(
    374                     feature=feature,

/opt/conda/envs/asd/lib/python3.7/site-packages/flair/models/sequence_tagger_model.py in forward(self, sentences)
    511     def forward(self, sentences: List[Sentence]):
    512 
--> 513         self.embeddings.embed(sentences)
    514 
    515         lengths: List[int] = [len(sentence.tokens) for sentence in sentences]

/opt/conda/envs/asd/lib/python3.7/site-packages/flair/embeddings/base.py in embed(self, sentences)
     57 
     58         if not everything_embedded or not self.static_embeddings:
---> 59             self._add_embeddings_internal(sentences)
     60 
     61         return sentences

/opt/conda/envs/asd/lib/python3.7/site-packages/flair/embeddings/token.py in _add_embeddings_internal(self, sentences)
    869         # embed each micro-batch
    870         for batch in sentence_batches:
--> 871             self._add_embeddings_to_sentences(batch)
    872 
    873         return sentences

/opt/conda/envs/asd/lib/python3.7/site-packages/flair/embeddings/token.py in _add_embeddings_to_sentences(self, sentences)
    983 
    984         # put encoded batch through transformer model to get all hidden states of all encoder layers
--> 985         hidden_states = self.model(input_ids, attention_mask=mask)[-1]
    986 
    987         # gradients are enabled if fine-tuning is enabled

/opt/conda/envs/asd/lib/python3.7/site-packages/torch/nn/modules/module.py in __call__(self, *input, **kwargs)
    530             result = self._slow_forward(*input, **kwargs)
    531         else:
--> 532             result = self.forward(*input, **kwargs)
    533         for hook in self._forward_hooks.values():
    534             hook_result = hook(self, input, result)

/opt/conda/envs/asd/lib/python3.7/site-packages/transformers/modeling_bert.py in forward(self, input_ids, attention_mask, token_type_ids, position_ids, head_mask, inputs_embeds, encoder_hidden_states, encoder_attention_mask, output_attentions, output_hidden_states, return_tuple)
    746             output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
    747         )
--> 748         return_tuple = return_tuple if return_tuple is not None else self.config.use_return_tuple
    749 
    750         if input_ids is not None and inputs_embeds is not None:

/opt/conda/envs/asd/lib/python3.7/site-packages/transformers/configuration_utils.py in use_return_tuple(self)
    197     def use_return_tuple(self):
    198         # If torchscript is set, force return_tuple to avoid jit errors
--> 199         return self.return_tuple or self.torchscript
    200 
    201     @property

AttributeError: 'CamembertConfig' object has no attribute 'return_tuple'

Did you encountered this error before with Albert and do you know how to solve it ?

@Nighthyst I didn't encounter this issue in albert model. You can try setting the return tuple manually using the code

tagger.embeddings.return_tuple = False

Weird : still not working even with tagger.emebeddings.return_tuple set to False. I have the same error as before.

Maybe it's because of Flair's version : you were using version 0.3.2 with your fix. I guess TransformerWordEmbedding didn't exist yet so you were using the equivalent of CamembertEmbedding for Albert.

With version 0.5.1 this has maybe change a lot so the workaround may not work anymore. Are you using the latest version of Flair with your workaround @mittalsuraj18 ?

I encounter this issue as well trying to load a fine-tuned "xlm-roberta" model with
TextClassifier.load('data/best-model_all_noweight.pt') on a different machine.

There is no issue at all if the path for the config file and of the sentencepiece.bpe.model is identical to the path on the machine used for training.
However, if it is different, then there is an OSError .

flair==0.5
transformers==3.0.2
torch==1.5.1

---------------------------------------------------------------------------
OSError                                   Traceback (most recent call last)
<ipython-input-24-8c9d5babdbe6> in <module>
----> 1 pred_model = TextClassifier.load('data/best-model_all_noweight.pt')

/opt/conda/lib/python3.7/site-packages/flair/nn.py in load(cls, model)
     84             # see https://github.com/zalandoresearch/flair/issues/351
     85             f = file_utils.load_big_file(str(model_file))
---> 86             state = torch.load(f, map_location=flair.device)
     87 
     88         model = cls._init_model_with_state_dict(state)

/opt/conda/lib/python3.7/site-packages/torch/serialization.py in load(f, map_location, pickle_module, **pickle_load_args)
    591                     return torch.jit.load(f)
    592                 return _load(opened_zipfile, map_location, pickle_module, **pickle_load_args)
--> 593         return _legacy_load(opened_file, map_location, pickle_module, **pickle_load_args)
    594 
    595 

/opt/conda/lib/python3.7/site-packages/torch/serialization.py in _legacy_load(f, map_location, pickle_module, **pickle_load_args)
    771     unpickler = pickle_module.Unpickler(f, **pickle_load_args)
    772     unpickler.persistent_load = persistent_load
--> 773     result = unpickler.load()
    774 
    775     deserialized_storage_keys = pickle_module.load(f, **pickle_load_args)

/opt/conda/lib/python3.7/site-packages/transformers/tokenization_xlm_roberta.py in __setstate__(self, d)
    173             raise
    174         self.sp_model = spm.SentencePieceProcessor()
--> 175         self.sp_model.Load(self.vocab_file)
    176 
    177     def build_inputs_with_special_tokens(

/opt/conda/lib/python3.7/site-packages/sentencepiece.py in Load(self, model_file, model_proto)
    365       if model_proto:
    366         return self.LoadFromSerializedProto(model_proto)
--> 367       return self.LoadFromFile(model_file)
    368 
    369 

/opt/conda/lib/python3.7/site-packages/sentencepiece.py in LoadFromFile(self, arg)
    175 
    176     def LoadFromFile(self, arg):
--> 177         return _sentencepiece.SentencePieceProcessor_LoadFromFile(self, arg)
    178 
    179     def Init(self,

OSError: Not found: "/home/jovyan/data/pretrained/XLM/sentencepiece.bpe.model": No such file or directory Error #2

I might have found a workaround (pls let me know if there is some sort of a logical mistake :) ).
Saving only the state_dict on the training machine and then using load_state_dict on the TextClassifier seems to solve the issue with different paths.

# Training machine (pytorch_model.bin & config-json & sentencepiece.bpe.model in data/pretrained/XLM/) # 

import torch
# load fine-tuned model on the training machine
model_to_save = TextClassifier.load("/home/jovyan/work/trade_fin_ho/notebooks/data/best-model_all_noweight.pt")  
# save state_dict of the fine-tuned model
torch.save(model_to_save.state_dict(), "/home/jovyan/data/pretrained/try/best-model_all_noweight_state_dict.pt")  
# save the label_dict
label_dict = corpus.make_label_dictionary()                                                                     
with open("/home/jovyan/data/pretrained/try/label-dict.pkl", 'wb') as f: pickle.dump(label_dict, f)   
# Inference machine (pytorch_model.bin & config-json & sentencepiece.bpe.model in data/pretrained/XLM_infer/) # 

import torch
# load document_embeddings on the inference machine (same code as on the training machine, BUT different path)
document_embeddings = TransformerDocumentEmbeddings('/home/jovyan/data/pretrained/XLM_infer/', fine_tune=True, use_scalar_mix=True, layers='-1')  
# load label_dict on the inference machine
with open("/home/jovyan/data/pretrained/try/label-dict.pkl", 'rb') as f: label_dict = pickle.load(f)                                              
# setup TextClassifier the same way it was done on the training machine
classifier = TextClassifier(document_embeddings,                                                                                                  
                            label_dictionary=label_dict,
                            multi_label=False)
# load state_dict of the fine-tuned model
state_dict = torch.load("/home/jovyan/data/pretrained/try/best-model_all_noweight_state_dict.pt")                                                 
classifier.load_state_dict(state_dict)  

@CourtVision, your approach looks fine. It should work in my opinion.
One thing you can optimize more is to save everything in a single file instead of multiple files. That should allow you to keep track of models and dictionaries easier.

import torch
model_to_save = TextClassifier.load("/home/jovyan/work/trade_fin_ho/notebooks/data/best-model_all_noweight.pt")  
label_dict = corpus.make_label_dictionary()          
torch.save({
    "state_dict":model_to_save.state_dict(),
    "label_dict":label_dict 
}, "/home/jovyan/data/pretrained/try/best-model_all_noweight_state_dict.pt")  

Thanks @mittalsuraj18 !!

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

Was this page helpful?
0 / 5 - 0 ratings

Related issues

jewl123 picture jewl123  路  3Comments

Aditya715 picture Aditya715  路  3Comments

aschmu picture aschmu  路  3Comments

alanakbik picture alanakbik  路  3Comments

ChessMateK picture ChessMateK  路  3Comments