For a specific (confidential) dataset, I ran the textcat pipeline successfully with the default ngram_size of 1. Setting it explicitly to 1 works equally well (obviously), but it crashes when I set it to 2 while keeping all other parameters the same:
textcat = nlp.create_pipe(
    "textcat",
    config={
        "exclusive_classes": False,
        "architecture": "ensemble",
        "ngram_size": 2,
    },
)
I'm guessing that perhaps I have texts in there that are too short to extract 2-grams from? The thinc error is a little cryptic...
Any idea what's happening here?
Traceback (most recent call last):
...
File "...\spacy_model.py", line 79, in train_model
nlp.update(docs=texts, golds=annotations, sgd=optimizer, drop=setting.DROP_OUT, losses=losses)
File "C:\...\lib\site-packages\spacy\language.py", line 452, in update
proc.update(docs, golds, sgd=get_grads, losses=losses, **kwargs)
File "pipes.pyx", line 931, in spacy.pipeline.pipes.TextCategorizer.update
File "C:\...\lib\site-packages\thinc\neural\_classes\feed_forward.py", line 46, in begin_update
X, inc_layer_grad = layer.begin_update(X, drop=drop)
File "C:\...\lib\site-packages\thinc\api.py", line 132, in begin_update
values = [fwd(X, *a, **k) for fwd in forward]
File "C:\...\lib\site-packages\thinc\api.py", line 132, in <listcomp>
values = [fwd(X, *a, **k) for fwd in forward]
File "C:\...\lib\site-packages\thinc\api.py", line 225, in wrap
output = func(*args, **kwargs)
File "C:\...\lib\site-packages\thinc\neural\_classes\feed_forward.py", line 46, in begin_update
X, inc_layer_grad = layer.begin_update(X, drop=drop)
File "C:\...\lib\site-packages\spacy\_ml.py", line 137, in begin_update
ngrams.append(self.ops.ngrams(n, unigrams))
File "ops.pyx", line 727, in thinc.neural.ops.NumpyOps.ngrams
File "ops.pyx", line 398, in thinc.neural.ops.NumpyOps.allocate
ValueError: negative dimensions are not allowed
Yeah there's probably a one-word sentence in there that's messing things up. Could you print [len(doc) for doc in batch] for the failing batch? I'm guessing there'll be a 1 or a 0 in there.
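Something like this, just before the update call (a sketch reusing the names from your snippet above; nlp.make_doc keeps the token counts consistent with what the model will see):

# Sketch: print token counts for the failing batch right before the update.
docs = [nlp.make_doc(text) for text in texts]  # texts = the failing batch
print([len(doc) for doc in docs])              # expect a 0 or a 1 in here
nlp.update(docs=docs, golds=annotations, sgd=optimizer, losses=losses)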
This issue has been automatically closed because there has been no response to a request for more information from the original author. With only the information that is currently in the issue, there's not enough information to take action. If you're the original author, feel free to reopen the issue if you have or find the answers needed to investigate further.
@honnibal: Sorry for the delay.
I created a minimal unit test that exhibits this behaviour: https://github.com/svlandeg/spaCy/commit/ac30d6311002aeb64a7793b2ac8578465033b6c2 (I can put it in a PR if you like).
This test crashes with the same error as quoted above:
ValueError: negative dimensions are not allowed
ops.pyx:398: ValueError
If you change ngram_size to 1, or if you edit the 3rd training text to contain more than 1 word, the error goes away and the training works.
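In case it helps, the gist of the test is roughly this (a sketch along the lines of the commit above, using the spaCy v2 API; the label and texts here are illustrative, not the exact test data):

import spacy

nlp = spacy.blank("en")
textcat = nlp.create_pipe(
    "textcat",
    config={"exclusive_classes": False, "architecture": "ensemble", "ngram_size": 2},
)
textcat.add_label("POSITIVE")
nlp.add_pipe(textcat)

train_data = [
    ("this text is long enough", {"cats": {"POSITIVE": 1.0}}),
    ("and so is this one here", {"cats": {"POSITIVE": 0.0}}),
    ("short", {"cats": {"POSITIVE": 1.0}}),  # a single token: triggers the crash
]

optimizer = nlp.begin_training()
texts, annotations = zip(*train_data)
# With ngram_size=2 this raises "ValueError: negative dimensions are not allowed";
# with ngram_size=1, or with the one-word text lengthened, training runs fine.
nlp.update(docs=list(texts), golds=list(annotations), sgd=optimizer)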
Thanks! Definitely a bug.
@svlandeg - Did you have a workaround for ngram_size>1 in the meantime?
@tomstelk: yep, if you want to use n-grams of length k, you'll have to make sure that each input text has at least k tokens. So for now you'll have to pre-filter "manually".
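For example (a sketch; MIN_TOKENS and the parallel texts/annotations lists are assumptions about your setup):

# Hypothetical pre-filter: drop examples with fewer than MIN_TOKENS tokens,
# where MIN_TOKENS should be at least your ngram_size. Tokenize with
# nlp.make_doc so the count matches what the model sees during training.
MIN_TOKENS = 2

pairs = [
    (text, annot)
    for text, annot in zip(texts, annotations)
    if len(nlp.make_doc(text)) >= MIN_TOKENS
]
texts, annotations = zip(*pairs) if pairs else ((), ())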
Thanks @svlandeg - was hoping there might be some quick hack in thinc somewhere to handle texts with fewer than k tokens. Oh well.
@tomstelk: we found the issue - it should be fixed in the next version.
This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.