Describe the bug
I trained a sequence tagger model using CamemBERT on an AWS EC2 instance (Ubuntu). When I try to load the model on my own computer, I get this error:
---------------------------------------------------------------------------
OSError Traceback (most recent call last)
<ipython-input-5-2f413379aedd> in <module>
6
7 # load the NER tagger
----> 8 tagger = SequenceTagger.load('best-model.pt')
9
10
c:\users\nicod\miniconda3\lib\site-packages\flair\nn.py in load(cls, model)
86 # see https://github.com/zalandoresearch/flair/issues/351
87 f = file_utils.load_big_file(str(model_file))
---> 88 state = torch.load(f, map_location='cpu')
89
90 model = cls._init_model_with_state_dict(state)
c:\users\nicod\miniconda3\lib\site-packages\torch\serialization.py in load(f, map_location, pickle_module, **pickle_load_args)
527 with _open_zipfile_reader(f) as opened_zipfile:
528 return _load(opened_zipfile, map_location, pickle_module, **pickle_load_args)
--> 529 return _legacy_load(opened_file, map_location, pickle_module, **pickle_load_args)
530
531
c:\users\nicod\miniconda3\lib\site-packages\torch\serialization.py in _legacy_load(f, map_location, pickle_module, **pickle_load_args)
700 unpickler = pickle_module.Unpickler(f, **pickle_load_args)
701 unpickler.persistent_load = persistent_load
--> 702 result = unpickler.load()
703
704 deserialized_storage_keys = pickle_module.load(f, **pickle_load_args)
c:\users\nicod\miniconda3\lib\site-packages\transformers\tokenization_camembert.py in __setstate__(self, d)
259 raise
260 self.sp_model = spm.SentencePieceProcessor()
--> 261 self.sp_model.Load(self.vocab_file)
262
263 def convert_tokens_to_string(self, tokens):
c:\users\nicod\miniconda3\lib\site-packages\sentencepiece.py in Load(self, model_file, model_proto)
365 if model_proto:
366 return self.LoadFromSerializedProto(model_proto)
--> 367 return self.LoadFromFile(model_file)
368
369
c:\users\nicod\miniconda3\lib\site-packages\sentencepiece.py in LoadFromFile(self, arg)
175
176 def LoadFromFile(self, arg):
--> 177 return _sentencepiece.SentencePieceProcessor_LoadFromFile(self, arg)
178
179 def Init(self,
OSError: Not found: "/home/ubuntu/.cache/torch/transformers/3715e3a4a2de48834619b2a6f48979e13ddff5cabfb1f3409db689f9ce3bb98f.28d30f926f545047fc59da64289371eef0fbdc0764ce9ec56f808a646fcfec59": No such file or directory Error #2
Flair is looking on my computer for a file that evidently existed only on the Ubuntu instance, so I can't load the model. If I load the model on the Ubuntu instance, everything works.
To Reproduce
Here is my code to train the model :
from flair.embeddings import TransformerWordEmbeddings
from flair.data import Corpus
from flair.datasets import ColumnCorpus
import torch

def trainCamembert():
    torch.multiprocessing.freeze_support()
    # define columns
    columns = {0: 'text', 1: 'pos', 2: 'ner'}
    # this is the folder in which train, test and dev files reside
    data_folder = 'wikiner'
    # init a corpus using column format, data folder and the names of the train, dev and test files
    corpus = ColumnCorpus(data_folder=data_folder,
                          column_format=columns,
                          train_file='train.txt',
                          test_file='test.txt',
                          in_memory=True)
    # 1. get the corpus
    print(corpus)
    # 2. what tag do we want to predict?
    tag_type = 'ner'
    # 3. make the tag dictionary from the corpus
    tag_dictionary = corpus.make_tag_dictionary(tag_type=tag_type)
    print(tag_dictionary)
    # 4. initialize embeddings
    embeddings = TransformerWordEmbeddings('camembert-base',
                                           layers='all',
                                           use_scalar_mix=True,
                                           pooling_operation='mean')
    # 5. initialize sequence tagger
    from flair.models import SequenceTagger
    tagger: SequenceTagger = SequenceTagger(hidden_size=256,
                                            embeddings=embeddings,
                                            tag_dictionary=tag_dictionary,
                                            tag_type=tag_type,
                                            use_crf=True)
    # 6. initialize trainer
    from flair.trainers import ModelTrainer
    trainer: ModelTrainer = ModelTrainer(tagger, corpus)
    # 7. start training
    trainer.train('example-ner-camtest',
                  learning_rate=0.1,
                  mini_batch_size=32,
                  max_epochs=150,
                  embeddings_storage_mode="gpu",
                  monitor_test=True,
                  checkpoint=True)

if __name__ == "__main__":
    trainCamembert()
and to load the model (on Windows) :
from flair.models import SequenceTagger
# load the NER tagger
tagger = SequenceTagger.load('best-model.pt')
Expected behavior
The model loads and I can make predictions with it on a computer other than the one used for training.
Environment (On my computer):
Environment (On the AWS Instance):
Any help? Is this just me, or is this a real issue? I tried serializing the tagger with pickle, but I get the same error when loading it. I also dug into Flair's code to see how this behaviour could be changed. The tokenizer (the CamembertTokenizer) is usually created with the from_pretrained method, so I don't understand why, when the model is loaded, the tokenizer isn't downloaded again instead of being looked up at the stored vocab_file path. Does that mean the tokenizer's vocab_file is extended during training, so that even downloading the original vocab_file wouldn't work?
Also, the issue is not specific to this case (training on the Ubuntu machine and loading on a Windows machine): I tried several scenarios with different computers and still get the error. I also thought that setting embeddings_storage_mode='none' might help, but the issue is related to the vocab_file, not to the embeddings stored during training.
I think this might be the same issue we had in #1422 with the old CamembertEmbedding class. Could you try reloading the tokenizer after loading:
import flair
from flair.models import SequenceTagger
from transformers import AutoConfig, AutoModel
# load your tagger
tagger = SequenceTagger.load('best-model.pt')
# get name of embedding
transformer_model_name = '-'.join(tagger.embeddings.name.split('-')[2:])
print(transformer_model_name)
# reload transformer embedding
config = AutoConfig.from_pretrained(transformer_model_name, output_hidden_states=True)
tagger.embeddings.model = AutoModel.from_pretrained(transformer_model_name, config=config)
Hello, I tried your code like this :
import flair
from flair.models import SequenceTagger
from transformers import AutoConfig, AutoModel
# load your tagger
tagger = SequenceTagger.load('example-ner-camembert2/best-model.pt')
# get name of embedding
transformer_model_name = '-'.join(tagger.embeddings.name.split('-')[2:])
print(transformer_model_name)
# reload transformer embedding
config = AutoConfig.from_pretrained(transformer_model_name, output_hidden_states=True)
tagger.embeddings.model = AutoModel.from_pretrained(transformer_model_name, config=config)
tagger.save("mymodel.pt")
This is the output :
2020-07-14 09:02:31,256 loading file example-ner-camembert2/best-model.pt
camembert-base
So the model name is correct. However when I try to load mymodel.pt I still get the same issue :
2020-07-14 11:09:21,902 loading file mymodel.pt
---------------------------------------------------------------------------
OSError Traceback (most recent call last)
<ipython-input-4-b4e9d1576b10> in <module>
11 #embedding.embed(sentence)
12 # load the NER tagger
---> 13 tagger = SequenceTagger.load("mymodel.pt")
c:\users\nicod\miniconda3\lib\site-packages\flair\nn.py in load(cls, model)
84 # see https://github.com/zalandoresearch/flair/issues/351
85 f = file_utils.load_big_file(str(model_file))
---> 86 state = torch.load(f, map_location=flair.device)
87
88 model = cls._init_model_with_state_dict(state)
c:\users\nicod\miniconda3\lib\site-packages\torch\serialization.py in load(f, map_location, pickle_module, **pickle_load_args)
527 with _open_zipfile_reader(f) as opened_zipfile:
528 return _load(opened_zipfile, map_location, pickle_module, **pickle_load_args)
--> 529 return _legacy_load(opened_file, map_location, pickle_module, **pickle_load_args)
530
531
c:\users\nicod\miniconda3\lib\site-packages\torch\serialization.py in _legacy_load(f, map_location, pickle_module, **pickle_load_args)
700 unpickler = pickle_module.Unpickler(f, **pickle_load_args)
701 unpickler.persistent_load = persistent_load
--> 702 result = unpickler.load()
703
704 deserialized_storage_keys = pickle_module.load(f, **pickle_load_args)
c:\users\nicod\miniconda3\lib\site-packages\transformers\tokenization_camembert.py in __setstate__(self, d)
259 raise
260 self.sp_model = spm.SentencePieceProcessor()
--> 261 self.sp_model.Load(self.vocab_file)
262
263 def convert_tokens_to_string(self, tokens):
c:\users\nicod\miniconda3\lib\site-packages\sentencepiece.py in Load(self, model_file, model_proto)
365 if model_proto:
366 return self.LoadFromSerializedProto(model_proto)
--> 367 return self.LoadFromFile(model_file)
368
369
c:\users\nicod\miniconda3\lib\site-packages\sentencepiece.py in LoadFromFile(self, arg)
175
176 def LoadFromFile(self, arg):
--> 177 return _sentencepiece.SentencePieceProcessor_LoadFromFile(self, arg)
178
179 def Init(self,
OSError: Not found: "/root/.cache/torch/transformers/3715e3a4a2de48834619b2a6f48979e13ddff5cabfb1f3409db689f9ce3bb98f.28d30f926f545047fc59da64289371eef0fbdc0764ce9ec56f808a646fcfec59": No such file or directory Error #2
The error is not exactly identical because this time I built the model in a Docker container (Debian) on another Windows machine, so the missing path is different.
Hm strange. So when you load it the first time it works, but if you save the model again and re-load it, it throws this error? Or are you switching machines in between?
I can only load the model on the computer used to create it. If I train my model on computer A then I am only able to load it from A. If I try to load it on say computer B then I have the error. I can load it on computer A as many times as I want so yes I am switching machines in between.
This issue actually seems to be quite old: #1712 reported similar behaviour with another transformer model. It seems to be related to the transformers package and how it uses a function called save_pretrained, but that's not certain. I think this is a good lead: I will dig into it and report here anything useful I find.
So I think the problem comes from using the torch.save function to save models that come from transformers. As stated here https://github.com/huggingface/transformers/issues/5292#issuecomment-650677235, that is usually not the correct way to save transformers objects. When torch.save is called to save the model with its state_dict, the path to the tokenizer's vocab_file is written into the generated .pt file, as my camemBERT.pt file trained on my Ubuntu instance shows.
This is the path where the vocab_file was cached when the model was first launched and the computer had to download CamemBERT's vocab_file. Unfortunately it's not possible (or I didn't find how) to load the .pt file and change the location of the vocab_file, and I don't know where in the model's state dict this path is written.
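The mechanism can be reproduced in isolation. The following is a minimal sketch, using a hypothetical ToyTokenizer class rather than the real CamembertTokenizer, with a plain text file standing in for the SentencePiece model. It mimics the pattern visible in the traceback: the serialized state carries vocab_file as an absolute path, and __setstate__ reopens that path on load. For simplicity the sketch calls __getstate__ and __setstate__ directly; these are the hooks pickle uses when saving and restoring an object.

```python
import os
import pickle
import tempfile


class ToyTokenizer:
    """Hypothetical stand-in for a SentencePiece-backed tokenizer."""

    def __init__(self, vocab_file):
        self.vocab_file = vocab_file
        self.pieces = self._read(vocab_file)

    @staticmethod
    def _read(path):
        with open(path, encoding="utf-8") as handle:
            return handle.read().split()

    def __getstate__(self):
        # Drop the loaded resource; only the path survives serialization.
        state = self.__dict__.copy()
        state.pop("pieces")
        return state

    def __setstate__(self, state):
        # Re-open the file at the stored path, wherever that pointed.
        self.__dict__ = state
        self.pieces = self._read(self.vocab_file)


def simulate_transfer():
    """Serialize on 'machine A', then restore once the cache file is gone."""
    with tempfile.TemporaryDirectory() as cache_dir:
        vocab_path = os.path.join(cache_dir, "vocab.txt")
        with open(vocab_path, "w", encoding="utf-8") as handle:
            handle.write("je me demande")
        blob = pickle.dumps(ToyTokenizer(vocab_path).__getstate__())
        path_in_blob = vocab_path.encode("utf-8") in blob
    # The temporary directory no longer exists, like the cache on 'machine B'.
    restored = ToyTokenizer.__new__(ToyTokenizer)
    try:
        restored.__setstate__(pickle.loads(blob))
        error = None
    except OSError as exc:
        error = exc
    return path_in_blob, error


path_in_blob, error = simulate_transfer()
print("path stored in serialized bytes:", path_in_blob)
print("error on restore:", type(error).__name__)
```

The state_dict tensors are not involved here: the path travels as an ordinary attribute of the pickled tokenizer object, which is why it appears in the .pt file.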
It's possible to save the vocab_file to another location with the save_vocabulary method available on every PreTrainedTokenizer, so I tried adding this method to your TransformerWordEmbeddings class :
def save(self, model_directory: Union[str, Path]):
    """
    Saves the current model to the provided directory.
    :param model_directory: The directory in which to save the model's information
    """
    self.model.save_pretrained(model_directory)
    # Save all of the tokenizer's information in the directory model_directory
    self.tokenizer.save_pretrained(model_directory)
    # Then overwrite the location of the vocab_file in the tokenizer with an environment variable
    path = self.tokenizer.save_vocabulary(model_directory)
    os.environ['VOCAB_FILE'] = os.path.abspath(path[0])
    self.tokenizer.vocab_file = (lambda x : f"{os.getenv('VOCAB_FILE')}")("")
and I modified the save method of the SequenceTagger too :
def save(self, model_file: Union[str, Path]):
    """
    Saves the current model to the provided file.
    :param model_file: the model file
    """
    if isinstance(self.embeddings, TransformerWordEmbeddings):
        print("You need to save the information about the tokenizer too")
        self.embeddings.save("test")
    super().save(model_file)
I wanted to choose a location for the vocab_file, in a directory called "test". Then I tried to overwrite the vocab_file location in the tokenizer with a lambda function that returns the content of an environment variable called "VOCAB_FILE". I hoped that the .pt file would then contain the lambda function instead of a path, so that when loading the model on another computer where the vocab_file has been downloaded, we would just have to set the VOCAB_FILE variable to the correct path. However, it didn't work and I still get the path in the .cache folder.
I think the easiest way to solve this would be to write a completely different save method for SequenceTagger for the case where the embedding is a TransformerWordEmbeddings. This method would have to use the save_pretrained method (from Transformers) to save the tokenizer's information. A different load method would then also be needed, to load the tokenizer with from_pretrained as recommended by Hugging Face.
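That proposal could look roughly like the sketch below. It relies only on two methods that Transformers tokenizers provide, save_pretrained and the from_pretrained classmethod. The helper names export_tokenizer and import_tokenizer, and the idea of keeping the tokenizer files in a directory next to the .pt file, are assumptions for illustration and not part of Flair.

```python
from pathlib import Path


def export_tokenizer(tokenizer, directory):
    """Write the tokenizer's files into `directory` on the training machine.

    `tokenizer` is expected to be a Transformers tokenizer; only its
    save_pretrained method is used here.
    """
    target = Path(directory)
    target.mkdir(parents=True, exist_ok=True)
    tokenizer.save_pretrained(str(target))
    return target


def import_tokenizer(tokenizer_cls, directory):
    """Rebuild the tokenizer from the files in `directory` on any machine.

    `tokenizer_cls` is the tokenizer class (for example CamembertTokenizer);
    only its from_pretrained classmethod is used here.
    """
    return tokenizer_cls.from_pretrained(str(directory))
```

With the real library this might be called as export_tokenizer(tagger.embeddings.tokenizer, "tokenizer_files") after training and import_tokenizer(CamembertTokenizer, "tokenizer_files") on the other machine, where "tokenizer_files" is a hypothetical directory copied along with the model. Because the files are addressed by a directory chosen at load time, no cache path from the training machine is involved.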
@Nighthyst Did you solve the issue ? I'm also getting the same issue when I train my "xlm-roberta-base" model on google colab and use it on my local machine using TextClassifier.load(mymodel.pt).
No I didn't : this is pretty hard to solve with my limited knowledge of Flair's codebase. I hope @alanakbik or some experienced contributor of Flair can check this out
@Nighthyst you can use a workaround as mentioned here https://github.com/flairNLP/flair/issues/1712
Just patch the CamemBERT tokenizer instead of the ALBERT tokenizer used in the issue above.
Hi @mittalsuraj18 , so I tried your workaround with CamemBERT like this :
from types import MethodType
import logging
import transformers

logger = logging.getLogger(__name__)

vocab_file = transformers.tokenization_camembert.CamembertTokenizer.from_pretrained("camembert-base").vocab_file

def _setstate(self, d):  # Method to patch with
    self.__dict__ = d
    try:
        import sentencepiece as spm
    except ImportError:
        logger.warning(
            "You need to install SentencePiece to use CamembertTokenizer: https://github.com/google/sentencepiece "
            "pip install sentencepiece"
        )
        raise
    self.sp_model = spm.SentencePieceProcessor()
    # load from the locally cached vocab_file instead of the path stored in the pickle
    self.sp_model.Load(vocab_file)

# Actual patching being done here
transformers.tokenization_camembert.CamembertTokenizer.__setstate__ = MethodType(
    _setstate, transformers.tokenization_camembert.CamembertTokenizer(vocab_file)
)
Using it, I can load the model :
tagger = SequenceTagger.load("super_camembert/best-model.pt")
This now works. However, when I try to use the tagger to predict entities with the following code snippet:
# The sentence objects holds a sentence that we may want to embed or tag
from flair.data import Sentence
# a French sentence
sentence = Sentence("Je me demande pourquoi Paris n'est pas plus appréciée pour ses monuments")
# predict NER tags
tagger.predict(sentence)
# print sentence with predicted tags
print(sentence.to_tagged_string())
I have this error :
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-19-3a0824de8da8> in <module>
6
7 # predict PoS tags
----> 8 tagger.predict(sentence)
9
10 # print sentence with predicted tags
/opt/conda/envs/asd/lib/python3.7/site-packages/flair/models/sequence_tagger_model.py in predict(self, sentences, mini_batch_size, embedding_storage_mode, all_tag_prob, verbose, use_tokenizer)
370 continue
371
--> 372 feature: torch.Tensor = self.forward(batch)
373 tags, all_tags = self._obtain_labels(
374 feature=feature,
/opt/conda/envs/asd/lib/python3.7/site-packages/flair/models/sequence_tagger_model.py in forward(self, sentences)
511 def forward(self, sentences: List[Sentence]):
512
--> 513 self.embeddings.embed(sentences)
514
515 lengths: List[int] = [len(sentence.tokens) for sentence in sentences]
/opt/conda/envs/asd/lib/python3.7/site-packages/flair/embeddings/base.py in embed(self, sentences)
57
58 if not everything_embedded or not self.static_embeddings:
---> 59 self._add_embeddings_internal(sentences)
60
61 return sentences
/opt/conda/envs/asd/lib/python3.7/site-packages/flair/embeddings/token.py in _add_embeddings_internal(self, sentences)
869 # embed each micro-batch
870 for batch in sentence_batches:
--> 871 self._add_embeddings_to_sentences(batch)
872
873 return sentences
/opt/conda/envs/asd/lib/python3.7/site-packages/flair/embeddings/token.py in _add_embeddings_to_sentences(self, sentences)
983
984 # put encoded batch through transformer model to get all hidden states of all encoder layers
--> 985 hidden_states = self.model(input_ids, attention_mask=mask)[-1]
986
987 # gradients are enabled if fine-tuning is enabled
/opt/conda/envs/asd/lib/python3.7/site-packages/torch/nn/modules/module.py in __call__(self, *input, **kwargs)
530 result = self._slow_forward(*input, **kwargs)
531 else:
--> 532 result = self.forward(*input, **kwargs)
533 for hook in self._forward_hooks.values():
534 hook_result = hook(self, input, result)
/opt/conda/envs/asd/lib/python3.7/site-packages/transformers/modeling_bert.py in forward(self, input_ids, attention_mask, token_type_ids, position_ids, head_mask, inputs_embeds, encoder_hidden_states, encoder_attention_mask, output_attentions, output_hidden_states, return_tuple)
746 output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
747 )
--> 748 return_tuple = return_tuple if return_tuple is not None else self.config.use_return_tuple
749
750 if input_ids is not None and inputs_embeds is not None:
/opt/conda/envs/asd/lib/python3.7/site-packages/transformers/configuration_utils.py in use_return_tuple(self)
197 def use_return_tuple(self):
198 # If torchscript is set, force return_tuple to avoid jit errors
--> 199 return self.return_tuple or self.torchscript
200
201 @property
AttributeError: 'CamembertConfig' object has no attribute 'return_tuple'
Did you encounter this error before with Albert, and do you know how to solve it?
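One detail of the patch may be worth double-checking: MethodType(_setstate, some_instance) binds the function to that one instance, so in plain Python the state handed over on load would land on that instance rather than on the object being restored. Whether this matters here depends on how Flair uses the restored tokenizer, which this sketch does not cover. The toy example below uses a hypothetical ToyTokenizer class and a restore helper that imitates the create-then-__setstate__ step pickle performs on load.

```python
from types import MethodType


class ToyTokenizer:
    """Hypothetical stand-in; not the transformers tokenizer class."""

    def __init__(self, vocab_file):
        self.vocab_file = vocab_file


def _setstate(self, state):
    # Restore attributes on whichever object arrives as `self`.
    self.__dict__ = state


def restore(state):
    """Imitate pickle on load: create a bare object, then call __setstate__."""
    obj = ToyTokenizer.__new__(ToyTokenizer)
    obj.__setstate__(state)
    return obj


saved_state = {"vocab_file": "/path/on/training/machine"}

# Variant 1: bound to a fixed instance, as in the patch above.
placeholder = ToyTokenizer("placeholder")
ToyTokenizer.__setstate__ = MethodType(_setstate, placeholder)
restored_bound = restore(dict(saved_state))
bound_has_state = hasattr(restored_bound, "vocab_file")
placeholder_vocab = placeholder.vocab_file

# Variant 2: plain function on the class, so `self` is the restored object.
ToyTokenizer.__setstate__ = _setstate
restored_plain = restore(dict(saved_state))

print("bound patch, restored object has vocab_file:", bound_has_state)
print("bound patch, placeholder now holds:", placeholder_vocab)
print("plain patch, restored object holds:", restored_plain.vocab_file)
```

If the same applies to the real tokenizer, assigning the plain function (CamembertTokenizer.__setstate__ = _setstate) would be the variant to try; this is an untested suggestion.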
@Nighthyst I didn't encounter this issue with the ALBERT model. You can try setting return_tuple manually with this code:
tagger.embeddings.return_tuple = False
Weird: it still doesn't work, even with tagger.embeddings.return_tuple set to False. I get the same error as before.
Maybe it's because of Flair's version: you were using version 0.3.2 with your fix. I guess TransformerWordEmbeddings didn't exist yet, so you were using the equivalent of CamembertEmbedding for Albert.
With version 0.5.1 this may have changed a lot, so the workaround may not work anymore. Are you using the latest version of Flair with your workaround, @mittalsuraj18?
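The traceback names 'CamembertConfig' as the object missing return_tuple, whereas the suggested line sets the attribute on tagger.embeddings. A possible explanation, offered as an assumption rather than a confirmed diagnosis, is that the pickled config was created by a Transformers version that did not yet have this attribute, so it would have to be set on the config object itself. The toy sketch below, with hypothetical ToyConfig and ToyEmbeddings classes, shows why setting the attribute one level too high leaves the error in place.

```python
class ToyConfig:
    """Hypothetical config whose property expects a newer attribute."""

    @property
    def use_return_tuple(self):
        return self.return_tuple or self.torchscript


class ToyEmbeddings:
    """Hypothetical wrapper holding a config, like embeddings holding a model."""

    def __init__(self, config):
        self.config = config


def restore_old_config():
    # Imitate unpickling a config saved before `return_tuple` existed:
    # __init__ is skipped and only the old attributes are restored.
    config = ToyConfig.__new__(ToyConfig)
    config.__dict__.update({"torchscript": False})
    return config


def probe(embeddings):
    """Return the property value, or the name of the error it raises."""
    try:
        return embeddings.config.use_return_tuple
    except AttributeError as exc:
        return type(exc).__name__


embeddings = ToyEmbeddings(restore_old_config())

embeddings.return_tuple = False          # set on the wrapper: config unchanged
after_wrapper_fix = probe(embeddings)

embeddings.config.return_tuple = False   # set on the config itself
after_config_fix = probe(embeddings)

print(after_wrapper_fix, after_config_fix)
```

On the real objects the corresponding attempt would be tagger.embeddings.model.config.return_tuple = False, assuming the embeddings hold a regular Transformers model in their model attribute; this has not been verified against the versions in this thread.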
I encounter this issue as well when trying to load a fine-tuned "xlm-roberta" model with
TextClassifier.load('data/best-model_all_noweight.pt') on a different machine.
There is no issue at all if the paths of the config file and of sentencepiece.bpe.model are identical to the paths on the machine used for training.
However, if they differ, an OSError is raised.
flair==0.5
transformers==3.0.2
torch==1.5.1
---------------------------------------------------------------------------
OSError Traceback (most recent call last)
<ipython-input-24-8c9d5babdbe6> in <module>
----> 1 pred_model = TextClassifier.load('data/best-model_all_noweight.pt')
/opt/conda/lib/python3.7/site-packages/flair/nn.py in load(cls, model)
84 # see https://github.com/zalandoresearch/flair/issues/351
85 f = file_utils.load_big_file(str(model_file))
---> 86 state = torch.load(f, map_location=flair.device)
87
88 model = cls._init_model_with_state_dict(state)
/opt/conda/lib/python3.7/site-packages/torch/serialization.py in load(f, map_location, pickle_module, **pickle_load_args)
591 return torch.jit.load(f)
592 return _load(opened_zipfile, map_location, pickle_module, **pickle_load_args)
--> 593 return _legacy_load(opened_file, map_location, pickle_module, **pickle_load_args)
594
595
/opt/conda/lib/python3.7/site-packages/torch/serialization.py in _legacy_load(f, map_location, pickle_module, **pickle_load_args)
771 unpickler = pickle_module.Unpickler(f, **pickle_load_args)
772 unpickler.persistent_load = persistent_load
--> 773 result = unpickler.load()
774
775 deserialized_storage_keys = pickle_module.load(f, **pickle_load_args)
/opt/conda/lib/python3.7/site-packages/transformers/tokenization_xlm_roberta.py in __setstate__(self, d)
173 raise
174 self.sp_model = spm.SentencePieceProcessor()
--> 175 self.sp_model.Load(self.vocab_file)
176
177 def build_inputs_with_special_tokens(
/opt/conda/lib/python3.7/site-packages/sentencepiece.py in Load(self, model_file, model_proto)
365 if model_proto:
366 return self.LoadFromSerializedProto(model_proto)
--> 367 return self.LoadFromFile(model_file)
368
369
/opt/conda/lib/python3.7/site-packages/sentencepiece.py in LoadFromFile(self, arg)
175
176 def LoadFromFile(self, arg):
--> 177 return _sentencepiece.SentencePieceProcessor_LoadFromFile(self, arg)
178
179 def Init(self,
OSError: Not found: "/home/jovyan/data/pretrained/XLM/sentencepiece.bpe.model": No such file or directory Error #2
I might have found a workaround (pls let me know if there is some sort of a logical mistake :) ).
Saving only the state_dict on the training machine and then using load_state_dict on the TextClassifier seems to solve the issue with different paths.
# Training machine (pytorch_model.bin & config.json & sentencepiece.bpe.model in data/pretrained/XLM/) #
import pickle
import torch
from flair.models import TextClassifier
# load fine-tuned model on the training machine
model_to_save = TextClassifier.load("/home/jovyan/work/trade_fin_ho/notebooks/data/best-model_all_noweight.pt")
# save state_dict of the fine-tuned model
torch.save(model_to_save.state_dict(), "/home/jovyan/data/pretrained/try/best-model_all_noweight_state_dict.pt")
# save the label_dict (corpus is the Corpus object that was used for training)
label_dict = corpus.make_label_dictionary()
with open("/home/jovyan/data/pretrained/try/label-dict.pkl", 'wb') as f:
    pickle.dump(label_dict, f)

# Inference machine (pytorch_model.bin & config.json & sentencepiece.bpe.model in data/pretrained/XLM_infer/) #
import pickle
import torch
from flair.embeddings import TransformerDocumentEmbeddings
from flair.models import TextClassifier
# load document_embeddings on the inference machine (same code as on the training machine, BUT different path)
document_embeddings = TransformerDocumentEmbeddings('/home/jovyan/data/pretrained/XLM_infer/', fine_tune=True, use_scalar_mix=True, layers='-1')
# load label_dict on the inference machine
with open("/home/jovyan/data/pretrained/try/label-dict.pkl", 'rb') as f:
    label_dict = pickle.load(f)
# set up TextClassifier the same way it was done on the training machine
classifier = TextClassifier(document_embeddings,
                            label_dictionary=label_dict,
                            multi_label=False)
# load state_dict of the fine-tuned model
state_dict = torch.load("/home/jovyan/data/pretrained/try/best-model_all_noweight_state_dict.pt")
classifier.load_state_dict(state_dict)
@CourtVision, your approach looks fine. It should work in my opinion.
One further improvement is to save everything in a single file instead of several. That should make it easier to keep track of models and dictionaries.
import torch
from flair.models import TextClassifier

model_to_save = TextClassifier.load("/home/jovyan/work/trade_fin_ho/notebooks/data/best-model_all_noweight.pt")
# corpus is the Corpus object that was used for training
label_dict = corpus.make_label_dictionary()
# bundle the weights and the label dictionary in one file
torch.save({
    "state_dict": model_to_save.state_dict(),
    "label_dict": label_dict
}, "/home/jovyan/data/pretrained/try/best-model_all_noweight_state_dict.pt")
Thanks @mittalsuraj18 !!