I have been using the nlp.from_bytes method with the large English model, version 1. We upgraded to the 2.1.0 model, and now I'm getting an error stating that it can't find en_model.vectors. See below how I load the model. When I do the to_bytes, am I supposed to also save the en_model.vectors separately?
```
import os
import spacy

with open(os.path.join(filepath, "large_2.1.0.txt"), "rb") as file:
    modelContent = file.read()

# Rebuild a blank pipeline from the saved meta, then restore the weights
nlp = spacy.blank(meta["lang"])
for pipe_name in meta["pipeline"]:
    pipe = nlp.create_pipe(pipe_name)
    nlp.add_pipe(pipe)
nlp.from_bytes(modelContent)
```
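For completeness, the meta dict above is just what I read back from the meta file saved next to the binary, roughly like this (the directory and file name are only placeholders):
```
import json
import os

filepath = "/path/to/model/dir"  # placeholder directory

# Placeholder file name for the meta extract saved alongside the model binary
with open(os.path.join(filepath, "meta_2.1.0.json"), "r", encoding="utf8") as f:
    meta = json.load(f)
```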
It looks like thinc.load_nlp assumes the model is on disk: https://github.com/explosion/thinc/blob/master/thinc/extra/load_nlp.py
I am loading the whole model from memory.
Here is my stack trace:
```
Traceback (most recent call last):
  File "c:\ws\repos\quality\src\main\python\ins\quality\nltk_helper.py", line 89, in get_spacy_model
    nlp.from_bytes(modelContent)
  File "C:\Programs\Python36\lib\site-packages\spacy\language.py", line 893, in from_bytes
    util.from_bytes(bytes_data, deserializers, exclude)
  File "C:\Programs\Python36\lib\site-packages\spacy\util.py", line 616, in from_bytes
    setter(msg[key])
  File "C:\Programs\Python36\lib\site-packages\spacy\language.py", line 890, in <lambda>
    b, exclude=["vocab"]
  File "pipes.pyx", line 600, in spacy.pipeline.pipes.Tagger.from_bytes
  File "C:\Programs\Python36\lib\site-packages\spacy\util.py", line 616, in from_bytes
    setter(msg[key])
  File "pipes.pyx", line 597, in spacy.pipeline.pipes.Tagger.from_bytes.lambda16
  File "pipes.pyx", line 580, in spacy.pipeline.pipes.Tagger.from_bytes.load_model
  File "pipes.pyx", line 530, in spacy.pipeline.pipes.Tagger.Model
  File "C:\Programs\Python36\lib\site-packages\spacy\_ml.py", line 523, in build_tagger_model
    pretrained_vectors=pretrained_vectors,
  File "C:\Programs\Python36\lib\site-packages\spacy\_ml.py", line 346, in Tok2Vec
    glove = StaticVectors(pretrained_vectors, width, column=cols.index(ID))
  File "C:\Programs\Python36\lib\site-packages\thinc\neural\_classes\static_vectors.py", line 43, in __init__
    vectors = self.get_vectors()
  File "C:\Programs\Python36\lib\site-packages\thinc\neural\_classes\static_vectors.py", line 55, in get_vectors
    return get_vectors(self.ops, self.lang)
  File "C:\Programs\Python36\lib\site-packages\thinc\extra\load_nlp.py", line 26, in get_vectors
    nlp = get_spacy(lang)
  File "C:\Programs\Python36\lib\site-packages\thinc\extra\load_nlp.py", line 14, in get_spacy
    SPACY_MODELS[lang] = spacy.load(lang, **kwargs)
  File "C:\Programs\Python36\lib\site-packages\spacy\__init__.py", line 27, in load
    return util.load_model(name, **overrides)
  File "C:\Programs\Python36\lib\site-packages\spacy\util.py", line 139, in load_model
    raise IOError(Errors.E050.format(name=name))
OSError: [E050] Can't find model 'en_model.vectors'. It doesn't seem to be a shortcut link, a Python package or a valid path to a data directory.
```
@smiles3983 The nlp.from_bytes() method should definitely work without touching disk. The code in Thinc that you've pointed to is a bit misleading in that respect, as there are some backwards-compatibility fixes there.
I think the real problem is actually pretty simple: the model format changed between v2.0 and v2.1, so you need to retrain the model when moving between versions. The problem runs deeper than a mere formatting change: the models themselves are different, so the same weights won't work in both versions.
I'm not using the 2.0 model at all anymore. I'm installing the 2.1 model on disk via pip and loading it with spacy.load() (no error), then calling to_bytes and saving the binary along with the meta file extract. I then start over: I take the binary and the meta, create a blank model from the meta, and load it using from_bytes. That's when I get the error that it can't find en_model.vectors. I didn't have to train the model with version 2.0, unless spaCy was doing that for me. All I did was switch to the 2.1 model to create a new binary and a new meta file, and I'm loading it the same way as before. How do I retrain it?
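Roughly, the serialization side looks like this (the package name and file names are placeholders for what we actually use):
```
import json
import os

import spacy

filepath = "/path/to/model/dir"  # placeholder directory

# Load the pip-installed 2.1 model from disk (this step works fine).
# "en_core_web_lg" is assumed here as the large English package name.
nlp = spacy.load("en_core_web_lg")

# Serialize the pipeline and save the binary plus the meta extract.
with open(os.path.join(filepath, "large_2.1.0.txt"), "wb") as f:
    f.write(nlp.to_bytes())
with open(os.path.join(filepath, "meta_2.1.0.json"), "w", encoding="utf8") as f:
    json.dump(nlp.meta, f)
```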
Hmm. Okay, thanks, I'll look into this.
Can I just remove the en_model.vectors section from the meta? Do you think that would still work, or would I be missing the point of having the large model?
That would cause a different error, as the models will be expecting the vectors. I've verified the error; we should be able to get a fix into the next version.
The issue is that the model has a previous name listed for its vectors, and we don't have the fixup logic in the from_bytes code. You should be able to work around the error with this prior to serialization:
```
for name, pipe in nlp.pipeline:
    if hasattr(pipe, "cfg") and pipe.cfg.get("pretrained_vectors"):
        pipe.cfg["pretrained_vectors"] = nlp.vocab.vectors.name
```
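In context, that fixup would sit just before to_bytes, something like this (the package name is just an example of the installed v2.1 model):
```
import spacy

# Load the installed v2.1 model (package name is illustrative).
nlp = spacy.load("en_core_web_lg")

# Point each component's config at the vocab's current vectors name,
# so the serialized config no longer references the old "en_model.vectors".
for name, pipe in nlp.pipeline:
    if hasattr(pipe, "cfg") and pipe.cfg.get("pretrained_vectors"):
        pipe.cfg["pretrained_vectors"] = nlp.vocab.vectors.name

# Serialize after the fixup.
model_bytes = nlp.to_bytes()
```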
OK, I'll test it Monday. Should the code below work?
```
nlp = spacy.blank(meta["lang"])
# 2. Initialize the pipeline
for pipe_name in meta["pipeline"]:
    pipe = nlp.create_pipe(pipe_name)
    nlp.add_pipe(pipe)
for name, pipe in nlp.pipeline:
    if hasattr(pipe, "cfg") and pipe.cfg.get("pretrained_vectors"):
        pipe.cfg["pretrained_vectors"] = nlp.vocab.vectors.name
nlp.from_bytes(modelContent)
```
Yes, I think that should work. Regardless, it's definitely fixed in v2.2, which we expect to be out Monday (just having some trouble with the wheel building system).
Awesome. Yeah, we dropped back to 2.0 for now, so we'll have to test the new version next week. If we weren't limited to 3GB of disk usage in our enterprise container, we wouldn't have to use the from_bytes function. We have our own ML models as well, which take up a lot of space, so we have to stay in memory.