For a specific (confidential) dataset, I ran the textcat pipeline successfully with the default ngram_size of 1. Setting it explicitly to 1 works equally well (obviously), but it crashes when I set it to 2 while keeping all other parameters the same:
textcat = nlp.create_pipe(
    "textcat",
    config={
        "exclusive_classes": False,
        "architecture": "ensemble",
        "ngram_size": 2,
    },
)
I'm guessing that perhaps I have texts in there that are too short to extract 2-grams from? The thinc error is a little cryptic...
Any idea what's happening here?
Traceback (most recent call last):
...
File "...\spacy_model.py", line 79, in train_model
nlp.update(docs=texts, golds=annotations, sgd=optimizer, drop=setting.DROP_OUT, losses=losses)
File "C:\...\lib\site-packages\spacy\language.py", line 452, in update
proc.update(docs, golds, sgd=get_grads, losses=losses, **kwargs)
File "pipes.pyx", line 931, in spacy.pipeline.pipes.TextCategorizer.update
File "C:\...\lib\site-packages\thinc\neural\_classes\feed_forward.py", line 46, in begin_update
X, inc_layer_grad = layer.begin_update(X, drop=drop)
File "C:\...\lib\site-packages\thinc\api.py", line 132, in begin_update
values = [fwd(X, *a, **k) for fwd in forward]
File "C:\...\lib\site-packages\thinc\api.py", line 132, in <listcomp>
values = [fwd(X, *a, **k) for fwd in forward]
File "C:\...\lib\site-packages\thinc\api.py", line 225, in wrap
output = func(*args, **kwargs)
File "C:\...\lib\site-packages\thinc\neural\_classes\feed_forward.py", line 46, in begin_update
X, inc_layer_grad = layer.begin_update(X, drop=drop)
File "C:\...\lib\site-packages\spacy\_ml.py", line 137, in begin_update
ngrams.append(self.ops.ngrams(n, unigrams))
File "ops.pyx", line 727, in thinc.neural.ops.NumpyOps.ngrams
File "ops.pyx", line 398, in thinc.neural.ops.NumpyOps.allocate
ValueError: negative dimensions are not allowed
Yeah there's probably a one-word sentence in there that's messing things up. Could you print [len(doc) for doc in batch] for the failing batch? I'm guessing there'll be a 1 or a 0 in there.
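Something like this, just before the update call (a sketch reusing the names from your snippet above; nlp.make_doc keeps the token counts consistent with what the model will see):

# Sketch: print token counts for the failing batch right before the update.
docs = [nlp.make_doc(text) for text in texts]  # texts = the failing batch
print([len(doc) for doc in docs])              # expect a 0 or a 1 in here
nlp.update(docs=docs, golds=annotations, sgd=optimizer, losses=losses)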
This issue has been automatically closed because there has been no response to a request for more information from the original author. With only the information that is currently in the issue, there's not enough information to take action. If you're the original author, feel free to reopen the issue if you have or find the answers needed to investigate further.
@honnibal: Sorry for the delay.
I created a minimal unit test that exhibits this behaviour: https://github.com/svlandeg/spaCy/commit/ac30d6311002aeb64a7793b2ac8578465033b6c2 (I can put it in a PR if you like).
This test crashes with the same error as quoted above:
ValueError: negative dimensions are not allowed
ops.pyx:398: ValueError
If you change ngram_size to 1, or if you edit the 3rd training text to contain more than 1 word, the error goes away and the training works.
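In case it helps, the gist of the test is roughly this (a sketch along the lines of the commit above, using the spaCy v2 API; the label and texts here are illustrative, not the exact test data):

import spacy

nlp = spacy.blank("en")
textcat = nlp.create_pipe(
    "textcat",
    config={"exclusive_classes": False, "architecture": "ensemble", "ngram_size": 2},
)
textcat.add_label("POSITIVE")
nlp.add_pipe(textcat)

train_data = [
    ("this text is long enough", {"cats": {"POSITIVE": 1.0}}),
    ("and so is this one here", {"cats": {"POSITIVE": 0.0}}),
    ("short", {"cats": {"POSITIVE": 1.0}}),  # a single token: triggers the crash
]

optimizer = nlp.begin_training()
texts, annotations = zip(*train_data)
# With ngram_size=2 this raises "ValueError: negative dimensions are not allowed";
# with ngram_size=1, or with the one-word text lengthened, training runs fine.
nlp.update(docs=list(texts), golds=list(annotations), sgd=optimizer)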
Thanks! Definitely a bug.
@svlandeg - Did you have a workaround for ngram_size>1 in the meantime?
@tomstelk: yep, if you want to use n-grams of length k, you'll have to make sure that each input text has at least k tokens. So for now you'll have to pre-filter "manually".
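For example (a sketch; MIN_TOKENS and the parallel texts/annotations lists are assumptions about your setup):

# Hypothetical pre-filter: drop examples with fewer than MIN_TOKENS tokens,
# where MIN_TOKENS should be at least your ngram_size. Tokenize with
# nlp.make_doc so the count matches what the model sees during training.
MIN_TOKENS = 2

pairs = [
    (text, annot)
    for text, annot in zip(texts, annotations)
    if len(nlp.make_doc(text)) >= MIN_TOKENS
]
texts, annotations = zip(*pairs) if pairs else ((), ())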
Thanks @svlandeg - was hoping there might be some quick hack in thinc somewhere to handle texts with fewer than k tokens. Oh well.
@tomstelk: we found the issue - it should be fixed in the next version.
This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.