I'm new to spaCy and want to configure stop words.
The regular spacy.en.STOP_WORDS do not seem to apply when loading the larger en_core_web_md model. How can I configure the larger model to use the regular stop words?
This sounds like a bug in the model, thanks.
The general-purpose answer is that flags like IS_STOP are computed per-type, so they're cached in the lexicon. You can add your own lexical flags or change how they're computed with the nlp.vocab.add_flag() method. You give this the flag ID and a function to compute the values, like this:
from spacy.attrs import IS_STOP
nlp.vocab.add_flag(IS_STOP, lambda string: string in my_stop_words)
This should be a good workaround for you until the model is updated.
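To make the "computed per-type and cached" point concrete, here is a rough sketch in plain Python. It is a toy model only, not spaCy's implementation: the ToyLexicon class, its is_stop() method, and the my_stop_words set are all hypothetical names invented for illustration. It shows why a flag that was cached as False stays False until the getter is replaced and the cached entries are recomputed.

```python
# Toy model only: this is NOT spaCy's implementation.
# ToyLexicon, is_stop and my_stop_words are hypothetical names.
from typing import Callable, Dict


class ToyLexicon:
    """Caches one boolean flag per word type (distinct string)."""

    def __init__(self, getter: Callable[[str], bool]) -> None:
        self._getter = getter
        self._cache: Dict[str, bool] = {}

    def is_stop(self, word: str) -> bool:
        # Compute the flag once per type, then reuse the cached value.
        if word not in self._cache:
            self._cache[word] = self._getter(word)
        return self._cache[word]

    def add_flag(self, getter: Callable[[str], bool]) -> None:
        # Install a new getter and recompute every cached entry.
        self._getter = getter
        for word in self._cache:
            self._cache[word] = getter(word)


my_stop_words = {"the", "and"}

# Simulate a lexicon that shipped without any stop-word flags set.
lexicon = ToyLexicon(lambda s: False)
before = lexicon.is_stop("the")  # cached with the original getter

# Replace the getter, as the workaround does for IS_STOP.
lexicon.add_flag(lambda s: s in my_stop_words)
after = lexicon.is_stop("the")

print(before, after)
```

In this toy version, editing a stop list alone would not touch entries that are already cached; only replacing the getter and recomputing does, which is the role the add_flag() workaround plays here.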
Btw could you run:
python -m spacy info --markdown
python -m spacy info en_core_web_md --markdown
And paste the results here?
Thanks,
Matt
Same here. Correct workaround is:
nlp.vocab.add_flag(lambda s: s in spacy.en.word_sets.STOP_WORDS, spacy.attrs.IS_STOP)
(function first, ID later).
To include lower-, upper-, and title-cased words (him/HIM/Him), I had to use:
nlp.vocab.add_flag(lambda s: s.lower() in spacy.en.word_sets.STOP_WORDS, spacy.attrs.IS_STOP)
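For anyone wondering why the .lower() call matters: the stop list holds lowercase strings, so an exact membership test misses HIM and Him. A small sketch in plain Python, assuming a made-up three-word STOP_WORDS set (the real list is much longer) and two hypothetical helper functions:

```python
# Hypothetical miniature stop list, for illustration only.
STOP_WORDS = {"him", "the", "and"}


def is_stop_exact(s: str) -> bool:
    # Matches only the exact (lowercase) spelling in the set.
    return s in STOP_WORDS


def is_stop_any_case(s: str) -> bool:
    # Normalises case first, so HIM and Him match as well.
    return s.lower() in STOP_WORDS


# Map each word to (exact match, case-insensitive match).
results = {
    word: (is_stop_exact(word), is_stop_any_case(word))
    for word in ["him", "HIM", "Him", "cat"]
}

for word, flags in results.items():
    print(word, flags)
```

The second predicate is the one to pass as the getter if the flag should be set regardless of casing.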
The new en_core_web_md model for v2.0 is now available and the problem should be fixed in the new version: https://spacy.io/models/en#en_core_web_md 🎉
@ines, I'm using en_core_web_md v2.0.0 and this continues to be an issue. It works just fine with the small model.
I'm also having this problem with en_core_web_md v2.0.0. I had to use the following as a workaround:
nlp.vocab.add_flag(lambda s: s.lower() in spacy.lang.en.stop_words.STOP_WORDS, spacy.attrs.IS_STOP)
Same problem, but with en_core_web_lg v2.0.0. @georgek's suggested workaround did the trick.