spaCy: stop words missing for en_core_web_md and en_core_web_lg in spaCy v2.0

Created on 14 Nov 2017 · 13 comments · Source: explosion/spaCy

The _en_core_web_md_ and _en_core_web_lg_ models return False for the "is_stop" attribute of every word in a sentence, including obvious stop words.

PS: _en_core_web_sm_ works fine.
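
A minimal reproduction sketch (assuming the affected model is installed):

import spacy

nlp = spacy.load('en_core_web_md')
# Every token reports False, even obvious stop words like 'is' and 'the'.
print([(t.text, t.is_stop) for t in nlp(u'this is the test')])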

System Information :

  • Python version: 2.7.12
  • Platform: Linux-4.10.0-38-generic-x86_64-with-Ubuntu-16.04-xenial
  • spaCy version: 2.0.2
  • Models: en
Labels: bug, lang / en, models


All 13 comments

Thanks for the report and sorry about that – this should be fixed in the next update to the models.

In the meantime, here's a workaround:

import spacy

nlp = spacy.load('en_core_web_lg')

# Flag each default stop word on the shared vocab, since the packaged
# model shipped with the is_stop flag unset.
for word in nlp.Defaults.stop_words:
    lex = nlp.vocab[word]
    lex.is_stop = True
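
A quick check that the workaround took effect (note it only covers the lowercase forms):

# 'is' and 'the' should now report True; capitalized 'This' will still
# be False, since only the lowercase forms were flagged.
print([(t.text, t.is_stop) for t in nlp(u'This is the test')])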

With en_core_web_sm (spaCy 2.0.4), is_stop depends on casing:

>>> import spacy
>>> nlp = spacy.load('en_core_web_sm')
>>> [nlp(s)[0].is_stop for s in 'this This THIS tHIS the The THE tHE'.split()]
[True, False, False, False, True, False, False, False]
# Expected [True, True, False/True, False/True, True, True, False/True, False/True].

Info about spaCy

  • spaCy version: 2.0.4
  • Platform: Linux-3.16.0-4-amd64-x86_64-with-debian-8.8
  • Python version: 3.6.0
  • Models: en_core_web_lg, en_vectors_web_lg, en_core_web_sm

en_core_web_sm for spaCy 2.0.0a10 correctly returned t.is_stop == True for both this and This.

Bump. I'm facing the same issue using en_core_web_sm.
Is this the expected output for is_stop? Or should we be using a different approach?

I know this has been an issue for a long time. The delay comes down to some infrastructure problems, which have made getting all the models retrained a hassle.

The following fix to the spacy train CLI command should make sure the issue doesn't recur: https://github.com/explosion/spaCy/commit/262d0a3148c9840b6df58ee955181c1cd486f8b1

The models for v2.1.0 are currently training, so fingers crossed updated models should be deployed soon.

Hi,

Is this still the way to go?

nlp = spacy.load('en_core_web_lg')

for word in nlp.Defaults.stop_words:
    lex = nlp.vocab[word]
    lex.is_stop = True

This still leaves the is_stop property sensitive to case (i.e. "What" vs. "what"). It sounds like the fix needs to be applied upstream; as it is, this is simple enough to handle outside of the token attribute system, as shown in the sketch below.
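
For example, a minimal sketch that lowercases before the lookup, using the default English stop list shipped with spaCy (the helper name is_stop_word is just illustrative):

from spacy.lang.en.stop_words import STOP_WORDS

def is_stop_word(token):
    # Case-insensitive membership test against the default English stop
    # list; sidesteps the per-lexeme is_stop flag entirely.
    return token.lower_ in STOP_WORDS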

Any updates or ETA on this?

Slightly better stopwords workaround (but still not a good solution):

for word in nlp.Defaults.stop_words:
    # Flag the lowercase, capitalized, and all-uppercase forms.
    for w in (word, word[0].upper() + word[1:], word.upper()):
        lex = nlp.vocab[w]
        lex.is_stop = True

This sets is_stop on the lowercase, capitalized (first letter uppercase), and all-uppercase form of each stop word.
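
A quick check of what this covers and what it still misses (a sketch, reusing the nlp object from above):

print(nlp(u'This THE what')[0].is_stop)  # True: capitalized form is covered
print(nlp(u'tHIS')[0].is_stop)           # still False: mixed case isn't covered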

Any progress?

I'm using the above snippet at the moment.

Why does a stop word need to be in the vocabulary? (general question)

We're currently training new models for the upcoming nightly release of the develop branch (spaCy v2.1.0). You can watch the spacy-models repo for updates and progress, but it's all currently pre-alpha. Sorry this is taking so long – it really did come down to getting the infrastructure right to be able to train our current model family reliably (and be able to add more languages in the future).

@adrianog The is_stop attribute is an attribute on the lexeme, i.e. the context-independent entry in the vocabulary.
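
To illustrate: the token reads the flag from its underlying lexeme, so both views agree (a minimal sketch):

import spacy

nlp = spacy.load('en_core_web_sm')
token = nlp(u'the')[0]
lexeme = nlp.vocab[u'the']
# Token.is_stop is backed by the same lexeme flag as Lexeme.is_stop.
print(token.is_stop, lexeme.is_stop)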

@ines I see. So "vocabulary" here means the spaCy vocabulary, i.e. lexeme.is_oov could still return False?

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.
