Spacy: stop words missing for en_core_web_md

Created on 25 Mar 2017 · 10 Comments · Source: explosion/spaCy

I'm new to spaCy and want to configure stop words.
The regular spacy.en.STOP_WORDS do not seem to apply when loading the larger en_core_web_md model. How can I configure the larger model to use the regular stop words?

models

geoHeil

👍2

Most helpful comment

@ines, I'm using en_core_web_md v2.0.0 and this continues to be an issue. It works just fine with the small model.

jmidyet on 24 Dec 2017

👍2

All 10 comments

This sounds like a bug in the model, thanks.

The general-purpose answer is that flags like IS_STOP are computed per type, so they're cached in the lexicon. You can add your own lexical flags or change how they're computed with the nlp.vocab.add_flag() method. You give this the flag ID and a function to compute the values, like this:

from spacy.attrs import IS_STOP
nlp.vocab.add_flag(IS_STOP, lambda string: string in my_stop_words)

This should be a good workaround for you until the model is updated.

honnibal on 25 Mar 2017
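
To illustrate what "computed per type and cached in the lexicon" means, here is a minimal pure-Python sketch. TinyVocab, its methods, the calls counter, and my_stop_words are hypothetical stand-ins invented for this illustration; they are not spaCy's Vocab, and spaCy's real implementation differs. The sketch only shows the idea that a flag function runs once per distinct word type and the result is reused for every later occurrence.

```python
from typing import Callable, Dict


class TinyVocab:
    """Hypothetical stand-in for a lexicon; NOT spaCy's Vocab class."""

    def __init__(self) -> None:
        self._getters: Dict[str, Callable[[str], bool]] = {}
        # word type -> flag name -> cached value
        self._cache: Dict[str, Dict[str, bool]] = {}
        self.calls = 0  # how many times a flag function actually ran

    def add_flag(self, getter: Callable[[str], bool], name: str) -> None:
        """Register (or replace) the function that computes a flag."""
        self._getters[name] = getter
        # Drop stale cached values so the new function takes effect.
        for flags in self._cache.values():
            flags.pop(name, None)

    def check_flag(self, word: str, name: str) -> bool:
        """Return the flag for a word type, computing it at most once."""
        flags = self._cache.setdefault(word, {})
        if name not in flags:
            self.calls += 1
            flags[name] = self._getters[name](word)
        return flags[name]


my_stop_words = {"the", "a", "of"}  # assumed example list
vocab = TinyVocab()
vocab.add_flag(lambda s: s in my_stop_words, "IS_STOP")

tokens = "the cat of the house".split()
flags = [vocab.check_flag(t, "IS_STOP") for t in tokens]
```

Here "the" occurs twice, but the flag function runs only once for it. This is why a model shipped with wrong cached values keeps reporting them until the flag is re-registered.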

Btw could you run:

python -m spacy info --markdown
python -m spacy info en_core_web_md --markdown

And paste the results here?

Thanks,
Matt

honnibal on 25 Mar 2017

Info about spaCy

spaCy version: 1.7.2
Platform: Darwin-16.4.0-x86_64-i386-64bit
Python version: 3.6.0
Installed models: en, en_core_web_md

and

Info about model en_core_web_md

lang: en
name: core_web_md
license: CC BY-SA 3.0
author: Explosion AI
url: https://explosion.ai
version: 1.2.1
spacy_version: >=1.7.0,<2.0.0
email: [email protected]
description: General-purpose English model, with tagging, parsing, entities and word vectors
source: /Users/geoheil/anaconda3/lib/python3.6/site-packages/en_core_web_md/en_core_web_md-1.2.1

geoHeil on 26 Mar 2017

Same here. The correct workaround is:

nlp.vocab.add_flag(lambda s: s in spacy.en.word_sets.STOP_WORDS, spacy.attrs.IS_STOP)

(function first, ID later).

sadovnychyi on 16 Apr 2017

To include lower-, upper-, and title-cased words (him/HIM/Him), I had to use:

nlp.vocab.add_flag(lambda s: s.lower() in spacy.en.word_sets.STOP_WORDS, spacy.attrs.IS_STOP)

pavlin99th on 23 Jun 2017
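
The difference between the two predicates above can be checked without loading a model. This is a minimal sketch: STOP_WORDS here is a tiny assumed stand-in, not spaCy's list, and both function names are made up for the example.

```python
# Assumed stand-in for a lowercase stop-word list (not spaCy's actual set).
STOP_WORDS = {"him", "the", "and"}


def is_stop_exact(s: str) -> bool:
    """Matches only the exact casing stored in the list."""
    return s in STOP_WORDS


def is_stop_any_case(s: str) -> bool:
    """Lowercases first, so HIM and Him match as well."""
    return s.lower() in STOP_WORDS


words = ["him", "HIM", "Him", "cat"]
exact = [is_stop_exact(w) for w in words]
any_case = [is_stop_any_case(w) for w in words]
```

With a lowercase-only list, the exact-match predicate misses "HIM" and "Him", while the lowercasing one catches all three casings.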

The new en_core_web_md model for v2.0 is now available and the problem should be fixed in the new version: https://spacy.io/models/en#en_core_web_md 🎉