spaCy: stop words missing for en_core_web_md and en_core_web_lg in spaCy v2.0

Created on 14 Nov 2017 · 13 comments · Source: explosion/spaCy

The _en_core_web_md_ and _en_core_web_lg_ models return False for the "is_stop" attribute of every word in a sentence, including obvious stop words.

PS: _en_core_web_sm_ works fine.
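
A minimal reproduction sketch (assuming the affected model is installed):

import spacy

nlp = spacy.load('en_core_web_md')
# Every token reports False, even obvious stop words like 'is' and 'the'.
print([(t.text, t.is_stop) for t in nlp(u'this is the test')])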

System Information :

  • Python version: 2.7.12
  • Platform: Linux-4.10.0-38-generic-x86_64-with-Ubuntu-16.04-xenial
  • spaCy version: 2.0.2
  • Models: en
Labels: bug, lang / en, models


All 13 comments

Thanks for the report and sorry about that – this should be fixed in the next update to the models.

In the meantime, here's a workaround:

import spacy

nlp = spacy.load('en_core_web_lg')

# Flag each default stop word on the shared vocab, since the packaged
# model shipped with the is_stop flag unset.
for word in nlp.Defaults.stop_words:
    lex = nlp.vocab[word]
    lex.is_stop = True
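
A quick check that the workaround took effect (note it only covers the lowercase forms):

# 'is' and 'the' should now report True; capitalized 'This' will still
# be False, since only the lowercase forms were flagged.
print([(t.text, t.is_stop) for t in nlp(u'This is the test')])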

With en_core_web_sm (spaCy 2.0.4), is_stop depends on casing:

>>> import spacy
>>> nlp = spacy.load('en_core_web_sm')
>>> [nlp(s)[0].is_stop for s in 'this This THIS tHIS the The THE tHE'.split()]
[True, False, False, False, True, False, False, False]
# Expected [True, True, False/True, False/True, True, True, False/True, False/True].

Info about spaCy

  • spaCy version: 2.0.4
  • Platform: Linux-3.16.0-4-amd64-x86_64-with-debian-8.8
  • Python version: 3.6.0
  • Models: en_core_web_lg, en_vectors_web_lg, en_core_web_sm

en_core_web_sm for spaCy 2.0.0a10 correctly returned t.is_stop == True for both this and This.

Bump. I'm facing the same issue using en_core_web_sm.
Is this the expected output for is_stop? Or should we be using a different approach?

I know this has been an issue for a long time. The delay comes down to some infrastructure problems, which have made getting all the models retrained a hassle.

The following fix to the spacy train CLI command should make sure the issue doesn't recur: https://github.com/explosion/spaCy/commit/262d0a3148c9840b6df58ee955181c1cd486f8b1

The models for v2.1.0 are currently training, so fingers crossed updated models should be deployed soon.

Hi,

Is this still the way to go?

nlp = spacy.load('en_core_web_lg')

for word in nlp.Defaults.stop_words:
    lex = nlp.vocab[word]
    lex.is_stop = True

This still leaves the is_stop property sensitive to case (i.e. "What" vs. "what"). It sounds like the fix needs to be applied upstream; as it is, this is simple enough to handle outside of the token attribute system, as shown in the sketch below.
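
For example, a minimal sketch that lowercases before the lookup, using the default English stop list shipped with spaCy (the helper name is_stop_word is just illustrative):

from spacy.lang.en.stop_words import STOP_WORDS

def is_stop_word(token):
    # Case-insensitive membership test against the default English stop
    # list; sidesteps the per-lexeme is_stop flag entirely.
    return token.lower_ in STOP_WORDS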

Any updates or ETA on this?

Slightly better stopwords workaround (but still not a good solution):

for word in nlp.Defaults.stop_words:
    # Flag the lowercase, capitalized, and all-uppercase forms.
    for w in (word, word[0].upper() + word[1:], word.upper()):
        lex = nlp.vocab[w]
        lex.is_stop = True

This sets is_stop on the lowercase, capitalized (first letter uppercase), and all-uppercase form of each stop word.
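
A quick check of what this covers and what it still misses (a sketch, reusing the nlp object from above):

print(nlp(u'This THE what')[0].is_stop)  # True: capitalized form is covered
print(nlp(u'tHIS')[0].is_stop)           # still False: mixed case isn't covered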

Any progress?

I'm using the above snippet at the moment.

Why does a stop word need to be in the vocabulary? (general question)

We're currently training new models for the upcoming nightly release of the develop branch (spaCy v2.1.0). You can watch the spacy-models repo for updates and progress, but it's all currently pre-alpha. Sorry this is taking so long – it really did come down to getting the infrastructure right to be able to train our current model family reliably (and be able to add more languages in the future).

@adrianog The is_stop attribute is an attribute on the lexeme, i.e. the context-independent entry in the vocabulary.
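
To illustrate: the token reads the flag from its underlying lexeme, so both views agree (a minimal sketch):

import spacy

nlp = spacy.load('en_core_web_sm')
token = nlp(u'the')[0]
lexeme = nlp.vocab[u'the']
# Token.is_stop is backed by the same lexeme flag as Lexeme.is_stop.
print(token.is_stop, lexeme.is_stop)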

@ines I see. So "vocabulary" here means the spaCy vocabulary, i.e. lexeme.is_oov could still return False?

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.
