Spacy: Textcat with setting ngram parameter

Created on 17 Apr 2019 · 9 Comments · Source: explosion/spaCy

How to reproduce the behaviour

For a specific (confidential) dataset, I ran the textcat pipeline successfully with ngram_size left at its default of 1. Setting it explicitly to 1 works equally well (obviously), but training crashes when I set it to 2 while keeping all other parameters the same:

textcat = nlp.create_pipe(
    "textcat",
    config={
        "exclusive_classes": False,
        "architecture": "ensemble",
        "ngram_size": 2,
    },
)

I'm guessing that perhaps I have texts in there too short to get 2-grams from? The thinc error is a little cryptic...

Any idea what happens here?

Traceback (most recent call last):
  ...
  File "...\spacy_model.py", line 79, in train_model
    nlp.update(docs=texts, golds=annotations, sgd=optimizer, drop=setting.DROP_OUT, losses=losses)   
  File "C:\...\lib\site-packages\spacy\language.py", line 452, in update
    proc.update(docs, golds, sgd=get_grads, losses=losses, **kwargs)
  File "pipes.pyx", line 931, in spacy.pipeline.pipes.TextCategorizer.update
  File "C:\...\lib\site-packages\thinc\neural\_classes\feed_forward.py", line 46, in begin_update
    X, inc_layer_grad = layer.begin_update(X, drop=drop)
  File "C:\...\lib\site-packages\thinc\api.py", line 132, in begin_update
    values = [fwd(X, *a, **k) for fwd in forward]
  File "C:\...\lib\site-packages\thinc\api.py", line 132, in <listcomp>
    values = [fwd(X, *a, **k) for fwd in forward]
  File "C:\...\lib\site-packages\thinc\api.py", line 225, in wrap
    output = func(*args, **kwargs)
  File "C:\...\lib\site-packages\thinc\neural\_classes\feed_forward.py", line 46, in begin_update
    X, inc_layer_grad = layer.begin_update(X, drop=drop)
  File "C:\...\lib\site-packages\spacy\_ml.py", line 137, in begin_update
    ngrams.append(self.ops.ngrams(n, unigrams))
  File "ops.pyx", line 727, in thinc.neural.ops.NumpyOps.ngrams
  File "ops.pyx", line 398, in thinc.neural.ops.NumpyOps.allocate
ValueError: negative dimensions are not allowed

Your Environment

  • spaCy version: 2.1.3
  • Platform: Windows-10-10.0.17134-SP0
  • Python version: 3.6.8
bug feat / textcat

All 9 comments

Yeah there's probably a one-word sentence in there that's messing things up. Could you print [len(doc) for doc in batch] for the failing batch? I'm guessing there'll be a 1 or a 0 in there.
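A minimal sketch of that check, assuming the failing batch is available as a sequence of objects that support `len()` (spaCy `Doc` objects do; plain token lists stand in for them here). The helper name `find_short_docs` is hypothetical and not part of spaCy or thinc.

```python
def find_short_docs(batch, ngram_size):
    """Return (index, length) for every item with fewer than ngram_size tokens."""
    return [
        (i, len(doc))
        for i, doc in enumerate(batch)
        if len(doc) < ngram_size
    ]


# Stand-in batch: lists of tokens instead of spaCy Doc objects (assumption).
batch = [["a", "longer", "text"], ["short"], []]

# The check suggested above: token counts for every item in the batch.
print([len(doc) for doc in batch])

# Positions and lengths of the items too short for 2-grams.
print(find_short_docs(batch, ngram_size=2))
```

Any index reported by the helper points at a text that cannot yield a single n-gram of the requested size.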

This issue has been automatically closed because there has been no response to a request for more information from the original author. With only the information that is currently in the issue, there's not enough information to take action. If you're the original author, feel free to reopen the issue if you have or find the answers needed to investigate further.

@honnibal : Sorry for the delay.
I created a minimal unit test that exhibits this behaviour: https://github.com/svlandeg/spaCy/commit/ac30d6311002aeb64a7793b2ac8578465033b6c2 (I can put it in a PR if you like).

This test crashes with the same error as quoted above:

ValueError: negative dimensions are not allowed
ops.pyx:398: ValueError

If you change ngram_size to 1, or if you edit the 3rd training text to contain more than 1 word, the error goes away and the training works.

Thanks! Definitely a bug.

@svlandeg - Did you have a workaround for ngram_size>1 in the meantime?

@tomstelk : yep, if you want to use n-grams of length k, you'll have to make sure that each input text has at least k tokens. So for now you'll have to pre-filter "manually".
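One way the manual pre-filter could look, as a rough sketch rather than an official fix. The helper `filter_short_texts` and the label names are hypothetical. Whitespace splitting is used only so the example runs without spaCy; it can count tokens differently from spaCy's tokenizer, so in a real pipeline it is probably safer to pass `tokenize=nlp.make_doc` and count the tokens of the resulting `Doc`.

```python
def filter_short_texts(texts, annotations, min_tokens, tokenize=str.split):
    """Keep only (text, annotation) pairs whose text has at least min_tokens tokens.

    tokenize is any callable returning something with a length; the default
    whitespace split is an assumption made to keep the sketch self-contained.
    """
    kept = [
        (text, annotation)
        for text, annotation in zip(texts, annotations)
        if len(tokenize(text)) >= min_tokens
    ]
    if not kept:
        return [], []
    kept_texts, kept_annotations = zip(*kept)
    return list(kept_texts), list(kept_annotations)


# Hypothetical training data in the spaCy 2.x textcat annotation format.
texts = ["this text is fine", "short", "two words"]
annotations = [
    {"cats": {"POSITIVE": True}},
    {"cats": {"POSITIVE": False}},
    {"cats": {"POSITIVE": True}},
]

# With ngram_size=2, drop every text that has fewer than 2 tokens.
texts, annotations = filter_short_texts(texts, annotations, min_tokens=2)
print(texts)
```

Filtering texts and annotations together keeps the two lists aligned, which matters because `nlp.update` pairs them by position.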

Thanks @svlandeg - was hoping there might be some quick hack in thinc somewhere to handle text with fewer than k tokens, oh well.

@tomstelk : we found the issue - should be fixed in the next version

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.
