I need to handle German and English with a single application. It worked fine with spaCy 1.8.2, 1.9.0, and 1.10.0, but breaks with spaCy 2.0.3.
To reproduce the issue:
>>> import spacy
>>> nlpEN = spacy.load('en')
>>> nlpDE = spacy.load('de')
>>> doc = nlpEN('Hello world!')
The error message:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Projects/foobar/.env/lib/python3.6/site-packages/spacy/language.py", line 333, in __call__
doc = proc(doc)
File "pipeline.pyx", line 390, in spacy.pipeline.Tagger.__call__
File "pipeline.pyx", line 402, in spacy.pipeline.Tagger.predict
File "/Projects/foobar/.env/lib/python3.6/site-packages/thinc/neural/_classes/model.py", line 161, in __call__
return self.predict(x)
File "/Projects/foobar/.env/lib/python3.6/site-packages/thinc/api.py", line 55, in predict
X = layer(X)
File "/Projects/foobar/.env/lib/python3.6/site-packages/thinc/neural/_classes/model.py", line 161, in __call__
return self.predict(x)
File "/Projects/foobar/.env/lib/python3.6/site-packages/thinc/api.py", line 293, in predict
X = layer(layer.ops.flatten(seqs_in, pad=pad))
File "/Projects/foobar/.env/lib/python3.6/site-packages/thinc/neural/_classes/model.py", line 161, in __call__
return self.predict(x)
File "/Projects/foobar/.env/lib/python3.6/site-packages/thinc/api.py", line 55, in predict
X = layer(X)
File "/Projects/foobar/.env/lib/python3.6/site-packages/thinc/neural/_classes/model.py", line 161, in __call__
return self.predict(x)
File "/Projects/foobar/.env/lib/python3.6/site-packages/thinc/neural/_classes/model.py", line 125, in predict
y, _ = self.begin_update(X)
File "/Projects/foobar/.env/lib/python3.6/site-packages/thinc/api.py", line 372, in uniqued_fwd
Y_uniq, bp_Y_uniq = layer.begin_update(X[ind], drop=drop)
File "/Projects/foobar/.env/lib/python3.6/site-packages/thinc/api.py", line 61, in begin_update
X, inc_layer_grad = layer.begin_update(X, drop=drop)
File "/Projects/foobar/.env/lib/python3.6/site-packages/thinc/api.py", line 176, in begin_update
values = [fwd(X, *a, **k) for fwd in forward]
File "/Projects/foobar/.env/lib/python3.6/site-packages/thinc/api.py", line 176, in <listcomp>
values = [fwd(X, *a, **k) for fwd in forward]
File "/Projects/foobar/.env/lib/python3.6/site-packages/thinc/api.py", line 258, in wrap
output = func(*args, **kwargs)
File "/Projects/foobar/.env/lib/python3.6/site-packages/thinc/api.py", line 176, in begin_update
values = [fwd(X, *a, **k) for fwd in forward]
File "/Projects/foobar/.env/lib/python3.6/site-packages/thinc/api.py", line 176, in <listcomp>
values = [fwd(X, *a, **k) for fwd in forward]
File "/Projects/foobar/.env/lib/python3.6/site-packages/thinc/api.py", line 258, in wrap
output = func(*args, **kwargs)
File "/Projects/foobar/.env/lib/python3.6/site-packages/thinc/api.py", line 176, in begin_update
values = [fwd(X, *a, **k) for fwd in forward]
File "/Projects/foobar/.env/lib/python3.6/site-packages/thinc/api.py", line 176, in <listcomp>
values = [fwd(X, *a, **k) for fwd in forward]
File "/Projects/foobar/.env/lib/python3.6/site-packages/thinc/api.py", line 258, in wrap
output = func(*args, **kwargs)
File "/Projects/foobar/.env/lib/python3.6/site-packages/thinc/api.py", line 176, in begin_update
values = [fwd(X, *a, **k) for fwd in forward]
File "/Projects/foobar/.env/lib/python3.6/site-packages/thinc/api.py", line 176, in <listcomp>
values = [fwd(X, *a, **k) for fwd in forward]
File "/Projects/foobar/.env/lib/python3.6/site-packages/thinc/api.py", line 258, in wrap
output = func(*args, **kwargs)
File "/Projects/foobar/.env/lib/python3.6/site-packages/thinc/neural/_classes/static_vectors.py", line 67, in begin_update
dotted = self.ops.batch_dot(vectors, self.W)
File "ops.pyx", line 299, in thinc.neural.ops.NumpyOps.batch_dot
ValueError: shapes (4,0) and (300,128) not aligned: 0 (dim 1) != 300 (dim 0)
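For readers puzzling over the shape error: `(4, 0)` means the lookup returned 4 token rows of width 0, i.e. the vector table the layer indexed into was not the 300-dim table the weights were trained against. A minimal NumPy sketch (not spaCy's actual code) reproduces the same ValueError:

```python
import numpy as np

# A batch of 4 tokens whose vector lookup came back empty: the wrong
# (overwritten) vector table had no columns, giving shape (4, 0).
vectors = np.zeros((4, 0))
# The model's projection weights still expect 300-dim pre-trained vectors.
W = np.zeros((300, 128))

try:
    vectors.dot(W)
except ValueError as err:
    print(err)  # shapes (4,0) and (300,128) not aligned
```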
I have a similar error working with Dutch and English in the same application. The Dutch model does not give the error; only the English model does.
from spacy import displacy
import en_core_web_md
import nl_core_news_sm
nlp_english = en_core_web_md.load()
nlp_dutch = nl_core_news_sm.load()
nlp_english("I Am an englis barclay bank")
Version
Python 3.6.4
Installed models (spaCy v2.0.5)
\AppData\Local\Programs\Python\Python36\lib\site-packages\spacy
TYPE      NAME              MODEL             VERSION
package   xx-ent-wiki-sm    xx_ent_wiki_sm    2.0.0
package   nl-core-news-sm   nl_core_news_sm   2.0.0
package   en-core-web-md    en_core_web_md    2.0.0
File "C:\Users\mike\AppData\Local\Programs\Python\Python36\lib\site-packages\spacy\language.py", line 333, in __call__
doc = proc(doc)
File "pipeline.pyx", line 390, in spacy.pipeline.Tagger.__call__
File "pipeline.pyx", line 402, in spacy.pipeline.Tagger.predict
File "C:\Users\mike\AppData\Local\Programs\Python\Python36\lib\site-packages\thinc\neural\_classes\model.py", line 161, in __call__
return self.predict(x)
File "C:\Users\mike\AppData\Local\Programs\Python\Python36\lib\site-packages\thinc\api.py", line 55, in predict
X = layer(X)
File "C:\Users\mike\AppData\Local\Programs\Python\Python36\lib\site-packages\thinc\neural\_classes\model.py", line 161, in __call__
return self.predict(x)
File "C:\Users\mike\AppData\Local\Programs\Python\Python36\lib\site-packages\thinc\api.py", line 293, in predict
X = layer(layer.ops.flatten(seqs_in, pad=pad))
File "C:\Users\mike\AppData\Local\Programs\Python\Python36\lib\site-packages\thinc\neural\_classes\model.py", line 161, in __call__
return self.predict(x)
File "C:\Users\mike\AppData\Local\Programs\Python\Python36\lib\site-packages\thinc\api.py", line 55, in predict
X = layer(X)
File "C:\Users\mike\AppData\Local\Programs\Python\Python36\lib\site-packages\thinc\neural\_classes\model.py", line 161, in __call__
return self.predict(x)
File "C:\Users\mike\AppData\Local\Programs\Python\Python36\lib\site-packages\thinc\neural\_classes\model.py", line 125, in predict
y, _ = self.begin_update(X)
File "C:\Users\mike\AppData\Local\Programs\Python\Python36\lib\site-packages\thinc\api.py", line 374, in uniqued_fwd
Y_uniq, bp_Y_uniq = layer.begin_update(X_uniq, drop=drop)
File "C:\Users\mike\AppData\Local\Programs\Python\Python36\lib\site-packages\thinc\api.py", line 61, in begin_update
X, inc_layer_grad = layer.begin_update(X, drop=drop)
File "C:\Users\mike\AppData\Local\Programs\Python\Python36\lib\site-packages\thinc\api.py", line 176, in begin_update
values = [fwd(X, *a, **k) for fwd in forward]
File "C:\Users\mike\AppData\Local\Programs\Python\Python36\lib\site-packages\thinc\api.py", line 176, in <listcomp>
values = [fwd(X, *a, **k) for fwd in forward]
File "C:\Users\mike\AppData\Local\Programs\Python\Python36\lib\site-packages\thinc\api.py", line 258, in wrap
output = func(*args, **kwargs)
File "C:\Users\mike\AppData\Local\Programs\Python\Python36\lib\site-packages\thinc\api.py", line 176, in begin_update
values = [fwd(X, *a, **k) for fwd in forward]
File "C:\Users\mike\AppData\Local\Programs\Python\Python36\lib\site-packages\thinc\api.py", line 176, in <listcomp>
values = [fwd(X, *a, **k) for fwd in forward]
File "C:\Users\mike\AppData\Local\Programs\Python\Python36\lib\site-packages\thinc\api.py", line 258, in wrap
output = func(*args, **kwargs)
File "C:\Users\mike\AppData\Local\Programs\Python\Python36\lib\site-packages\thinc\api.py", line 176, in begin_update
values = [fwd(X, *a, **k) for fwd in forward]
File "C:\Users\mike\AppData\Local\Programs\Python\Python36\lib\site-packages\thinc\api.py", line 176, in <listcomp>
values = [fwd(X, *a, **k) for fwd in forward]
File "C:\Users\mike\AppData\Local\Programs\Python\Python36\lib\site-packages\thinc\api.py", line 258, in wrap
output = func(*args, **kwargs)
File "C:\Users\mike\AppData\Local\Programs\Python\Python36\lib\site-packages\thinc\api.py", line 176, in begin_update
values = [fwd(X, *a, **k) for fwd in forward]
File "C:\Users\mike\AppData\Local\Programs\Python\Python36\lib\site-packages\thinc\api.py", line 176, in <listcomp>
values = [fwd(X, *a, **k) for fwd in forward]
File "C:\Users\mike\AppData\Local\Programs\Python\Python36\lib\site-packages\thinc\api.py", line 258, in wrap
output = func(*args, **kwargs)
File "C:\Users\mike\AppData\Local\Programs\Python\Python36\lib\site-packages\thinc\neural\_classes\static_vectors.py", line 67, in begin_update
dotted = self.ops.batch_dot(vectors, self.W)
File "ops.pyx", line 338, in thinc.neural.ops.NumpyOps.batch_dot
ValueError: shapes (7,0) and (300,128) not aligned: 0 (dim 1) != 300 (dim 0)
I am experiencing the same problem. Whenever there are multiple spaCy models in memory, one of them tends to fail (usually English) with the stack trace @zhaow-de shows in his post.
Has anyone found a fix for this?
I think this comes down to an ill-considered use of a global variable when using pre-trained models in Thinc. The global variable is used to avoid storing extra copies of the vectors data. However, I think it's not keyed correctly by the spaCy model --- causing this error when there are multiple language models in memory.
I expect to get to this bug before the end of the week -- thanks for your patience; and thanks for reporting.
Has anybody found even a temporary fix to this?
This does not only happen with pre-trained models; it is happening for me with all custom models as well.
It works if you use a smaller model.
For example, with en_core_web_md, Spanish, and Dutch loaded, English will fail,
but with en_core_web_sm, Spanish, and Dutch loaded, everything seems to work.
Still waiting on a fix though :)
I'm guessing that's because the smaller model only has the context vectors and not the full set of pre-trained vectors (which is what's getting overwritten?). Can anybody point to where this is 'keyed'?
Tom, your workaround works, but I am skeptical about what is going on behind the scenes. For example, if I load like this, everything is fine:
spanish = spacy.load('es_core_news_sm')
english = spacy.load('large_custom_english_model')
but if I load in the reverse order, with English first,
english = spacy.load('large_custom_english_model')
spanish = spacy.load('es_core_news_sm')
I get the error. This leads me not to trust the results. Does the first ordering only work because the English model overwrites the context vectors of the Spanish model, which happen to share the same dimensions, so it works coincidentally?
I also get this for running two English models (I wanted to compare them)
md_nlp = spacy.load('en_core_web_md')
sm_nlp = spacy.load('en_core_web_sm')
causes uses of md_nlp to fail.
Same situation here.
Error message:
ValueError: shapes (12,50) and (300,128) not aligned: 50 (dim 1) != 300 (dim 0)
Models:
Installed models (spaCy v2.0.8)
C:\ProgramData\Anaconda3\envs\dflt3\lib\site-packages\spacy
TYPE NAME MODEL VERSION
package en-core-web-md en_core_web_md 2.0.0
package en-core-web-sm en_core_web_sm 2.0.0
package es-core-news-sm es_core_news_sm 2.0.0
package es-core-news-md es_core_news_md 2.0.0
package xx-ent-wiki-sm xx_ent_wiki_sm 2.0.0
link es_core_news_md es_core_news_md 2.0.0
link xx xx_ent_wiki_sm 2.0.0
link es es_core_news_md 2.0.0
link en en_core_web_sm 2.0.0
link en_core_web_md en_core_web_md 2.0.0
@honnibal could you please give an update on this bug? It is a real dealbreaker for my use case, since I want to serve multiple languages. I would like to know the timeframe, if possible :), before I set up separate instances for each language.
@honnibal This is a huge problem for us; we need Spanish and English. I don't really want to change my microservices to need separate Spanish and English instances.
+1 -- We are preparing to move off of spaCy entirely in a large production environment because of this critical issue. It would have been nice to even just get a pointer to where in Thinc this is happening, so that maybe the community could investigate and help...
@liuzzi Really sorry for losing track of this issue.
First, here's some background on the life-cycle of pre-trained vectors, and how they get used within the models.
Most of the models build context-sensitive word representations using the spacy._ml.Tok2Vec function. This function builds some learned vector representations, and then optionally also concatenates on the pre-trained vectors, if a parameter is passed telling it the dimensions of the pre-trained vectors to pass in.
The pre-trained vectors are loaded within the thinc.neural._classes.static_vectors module. This maintains a dictionary to cache multiple vector tables, so that they don't need to be reloaded. To actually assign vectors to a batch of words, the StaticVectors class is passed in an array of integer IDs, and it's told ahead of time which column to read. It gets this column, and then uses this as indices into the vectors table.
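The caching scheme described above can be sketched roughly as follows. This is illustrative only (`VECTOR_TABLES`, `get_vectors`, and `embed_batch` are made-up names, not thinc's actual API): a module-level dict caches each vector table under a key, and embedding a batch means reading the row-ID column out of the feature array and using it to index the table.

```python
import numpy as np

# Hypothetical module-level cache of vector tables, keyed per model,
# so each table is only loaded once.
VECTOR_TABLES = {}

def get_vectors(key, loader):
    """Load a vector table once and cache it under `key`."""
    if key not in VECTOR_TABLES:
        VECTOR_TABLES[key] = loader()
    return VECTOR_TABLES[key]

def embed_batch(key, features, column):
    """Read the row-ID column from the feature array, then use those
    IDs as indices into the cached vector table."""
    table = VECTOR_TABLES[key]
    rows = features[:, column]
    return table[rows]

# Toy table: 10 words with 300-dim vectors, keyed per model.
get_vectors("en_md", lambda: np.random.rand(10, 300))
feats = np.array([[0, 3], [0, 7]])   # column 1 holds the vector-row IDs
out = embed_batch("en_md", feats, column=1)
print(out.shape)  # (2, 300)
```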
The StaticVectors class is where the error you're seeing occurs, and the only mention of the class should be within the spacy._ml file, within that Tok2Vec function. Having a look at the Tok2Vec function, we can already see the problem:
if pretrained_dims is not None and pretrained_dims >= 1:
    glove = StaticVectors(VECTORS_KEY, width, column=cols.index(ID))
    embed = uniqued(
        (glove | norm | prefix | suffix | shape)
        >> LN(Maxout(width, width*5, pieces=3)), column=5)
else:
    embed = uniqued(
        (norm | prefix | suffix | shape)
        >> LN(Maxout(width, width*4, pieces=3)), column=5)
We're passing in a single key VECTORS_KEY to the StaticVectors class, which means the vectors from different models are stepping on each other. For models which have different pre-trained dimensions, this then causes the models to fail to load correctly.
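To make the collision concrete, here is a toy reproduction (again with made-up names, not thinc's actual code): when every model registers its table under the one shared key, the last `load` wins, and the first model's weights then see vectors of the wrong width. Keying the cache per model keeps both tables intact.

```python
import numpy as np

VECTOR_TABLES = {}

def register(key, table):
    VECTOR_TABLES[key] = table                   # last writer wins

# Both models use the single shared key, as in the snippet above.
SHARED_KEY = "spacy_pretrained_vectors"
register(SHARED_KEY, np.random.rand(100, 300))   # English: 300-dim vectors
register(SHARED_KEY, np.random.rand(100, 0))     # small model: no vectors

W = np.random.rand(300, 128)                     # English tagger weights
rows = VECTOR_TABLES[SHARED_KEY][[1, 2, 3, 4]]   # now shape (4, 0)!
try:
    rows.dot(W)
except ValueError as err:
    print("collision:", err)

# Keying the cache per model avoids the overwrite.
register("en_core_web_md", np.random.rand(100, 300))
register("nl_core_news_sm", np.random.rand(100, 0))
ok = VECTOR_TABLES["en_core_web_md"][[1, 2, 3, 4]].dot(W)
print(ok.shape)  # (4, 128)
```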
One solution would be to avoid using the StaticVectors class altogether. That class has trouble because it expects to be passed in the array of IDs, which means it has to somehow load the vectors in the background. This is difficult.
We also have another class within the _ml module, used in the text classifier. The SpacyVectors class gets passed in a batch of Doc objects, rather than the arrays. This makes it very easy to fetch the vectors. The downside is it's harder to cache the whole vector computation per word type: if we first extract the array, we can have a convenient column to unique by. It's also much easier to keep the GPU efficient this way.
Edit: It's tough to give up on extracting the array and indexing into it, so I've been looking at passing the vectors into the Tok2Vec function. It's still difficult though. The keying issue also comes up in the link_vectors_to_models helper function, which can also be found within spacy._ml. In this function, we set the lex.rank attribute for the words in the vocab to the row in the vector table. We do this once, and it saves us a hash lookup on each token.
However, we need a unique ID for the vocab and vectors -- which currently we don't have. The vocab knows its language, but that's not unique enough. We also don't want to do id(vocab), as we can't persist that.
We also can't simply pass in the data to the StaticVectors class. If we did that, we'd have to save the vectors within each model (parser, tagger, entity recognizer, etc) --- because once we deserialized the models, we'd have the same problem.
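One direction for the missing unique ID (purely a sketch, not what spaCy shipped) would be to derive the cache key from the vector data itself: unlike `id(vocab)`, a content hash is stable across serialization and deserialization, and distinct tables hash to distinct keys. The `vectors_key` helper below is hypothetical.

```python
import hashlib
import numpy as np

def vectors_key(table):
    """Derive a persistable cache key from the vector data itself.
    A content hash survives serialize/deserialize, unlike id(vocab)."""
    h = hashlib.sha1()
    h.update(np.ascontiguousarray(table).tobytes())
    h.update(str(table.shape).encode("utf8"))  # disambiguate reshapes
    return h.hexdigest()

en = np.arange(300.0).reshape(1, 300)
nl = np.arange(128.0).reshape(1, 128)
print(vectors_key(en) == vectors_key(en.copy()))  # True: key is stable
print(vectors_key(en) == vectors_key(nl))         # False: tables differ
```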
Still working on this. In the meantime, this package should mitigate the issue: https://github.com/kootenpv/spacy_api
Same issue here, when loaded like this:
nlp_en = spacy.load('en_core_web_md')
nlp_es = spacy.load('es_core_news_md')
This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.