Spacy: stop words missing for en_core_web_md

Created on 25 Mar 2017 · 10 Comments · Source: explosion/spaCy

I'm new to spaCy and want to configure stop words.
The regular spacy.en.STOP_WORDS do not seem to apply when loading the larger en_core_web_md model. How can I configure the larger model to use the regular stop words?

models

All 10 comments

This sounds like a bug in the model, thanks.

The general-purpose answer is that flags like IS_STOP are computed per type, so they're cached in the lexicon. You can add your own lexical flags or change how they're computed with the nlp.vocab.add_flag() method. You give this the flag ID and a function to compute the values, like this:

from spacy.attrs import IS_STOP
nlp.vocab.add_flag(IS_STOP, lambda string: string in my_stop_words)

This should be a good workaround for you until the model is updated.
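
To illustrate why a stale value sticks around: since the flag is computed once per word type and then cached, a lexicon built with the wrong stop list keeps returning the wrong answer until the flag is recomputed. The toy class below is only a sketch of that caching idea in plain Python. ToyVocab, check_flag, the flag ID and my_stop_words are all made up for illustration, and this is not how spaCy implements it internally.

```python
# Toy model of a per-type flag cache; NOT spaCy's actual implementation.
IS_STOP = 0  # hypothetical flag ID used only in this sketch


class ToyVocab:
    def __init__(self):
        self._flags = {}    # word string -> {flag_id: bool}, computed once per type
        self._getters = {}  # flag_id -> function(string) -> bool

    def add_flag(self, flag_getter, flag_id):
        """Register a getter and recompute the flag for every cached word."""
        self._getters[flag_id] = flag_getter
        for string, flags in self._flags.items():
            flags[flag_id] = flag_getter(string)

    def check_flag(self, string, flag_id):
        """Return the cached flag, computing it on first lookup."""
        flags = self._flags.setdefault(string, {})
        if flag_id not in flags:
            getter = self._getters.get(flag_id)
            flags[flag_id] = getter(string) if getter else False
        return flags[flag_id]


vocab = ToyVocab()
# Simulate a lexicon built with an empty stop list: every word caches False.
vocab.add_flag(flag_getter=lambda s: False, flag_id=IS_STOP)
before = vocab.check_flag("the", IS_STOP)

my_stop_words = {"the", "a", "him"}  # hypothetical stop list
# Re-registering the flag overwrites the cached values.
vocab.add_flag(flag_getter=lambda s: s in my_stop_words, flag_id=IS_STOP)
after = vocab.check_flag("the", IS_STOP)
print(before, after)  # False True
```

Keyword arguments are used in the calls so the sketch does not depend on argument order.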

Btw could you run:

python -m spacy info --markdown
python -m spacy info en_core_web_md --markdown

And paste the results here?

Thanks,
Matt

Info about spaCy

  • spaCy version: 1.7.2
  • Platform: Darwin-16.4.0-x86_64-i386-64bit
  • Python version: 3.6.0
  • Installed models: en, en_core_web_md

and

Info about model en_core_web_md

  • lang: en
  • name: core_web_md
  • license: CC BY-SA 3.0
  • author: Explosion AI
  • url: https://explosion.ai
  • version: 1.2.1
  • spacy_version: >=1.7.0,<2.0.0
  • email: [email protected]
  • description: General-purpose English model, with tagging, parsing, entities and word vectors
  • source: /Users/geoheil/anaconda3/lib/python3.6/site-packages/en_core_web_md/en_core_web_md-1.2.1

Same here. The correct workaround is:

nlp.vocab.add_flag(lambda s: s in spacy.en.word_sets.STOP_WORDS, spacy.attrs.IS_STOP)

(The function comes first and the flag ID second; Matt's snippet above has them the other way around.)

To include lower-, upper-, and title-cased words (him/HIM/Him), I had to use:

nlp.vocab.add_flag(lambda s: s.lower() in spacy.en.word_sets.STOP_WORDS, spacy.attrs.IS_STOP)
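
A quick way to see the difference between the two lambdas without loading a model, as a plain-Python sketch. The small STOP_WORDS set here is a hypothetical stand-in for the real list, and it assumes the real list stores its entries in lowercase.

```python
# Hypothetical stand-in for the real stop list, assumed to be all lowercase.
STOP_WORDS = {"him", "the", "a"}


def is_stop_exact(s):
    # Mirrors `lambda s: s in STOP_WORDS`: only matches the exact spelling.
    return s in STOP_WORDS


def is_stop_caseless(s):
    # Mirrors `lambda s: s.lower() in STOP_WORDS`: ignores capitalisation.
    return s.lower() in STOP_WORDS


for word in ("him", "HIM", "Him"):
    print(word, is_stop_exact(word), is_stop_caseless(word))
# him True True
# HIM False True
# Him False True
```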

The new en_core_web_md model for v2.0 is now available, and the problem should be fixed in the new version: https://spacy.io/models/en#en_core_web_md 🎉

@ines, I'm using en_core_web_md v2.0.0 and this continues to be an issue. It works just fine with the small model.

I'm also having this problem with en_core_web_md v2.0.0. I had to use the following as a workaround:

nlp.vocab.add_flag(lambda s: s.lower() in spacy.lang.en.stop_words.STOP_WORDS, spacy.attrs.IS_STOP)

Same problem, but with en_core_web_lg v2.0.0. @georgek's suggested workaround did the trick.

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.
