Is there any built-in stemmer in spaCy? If not, is this something you plan to add? An open-source Python implementation of Porter's algorithm might be a good start.
We have lemmatization, but no stemming. I think lemmatization is generally better, and Porter's algorithm is usually only useful to replicate another system exactly. We deprioritise replication in spaCy, to avoid duplicating functionality.
What are you trying to do? Is there a reason you think Porter stemming will be significantly better than lemmatization?
I agree that lemmatization is generally better than stemming. However, lemmatization depends on POS (so broken/ADJ gives broken but broken/VERB gives break) and thus only targets inflection, not derivation, as you clearly state in your docs (so meeting/NOUN gives meeting while meeting/VERB gives meet).
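A quick check with spaCy shows this POS dependence directly (a minimal sketch; it assumes the en_core_web_sm model is installed, and the exact tags will vary by model):

import spacy

nlp = spacy.load("en_core_web_sm")

for text in ("He has broken the window.", "The broken window was replaced."):
    tok = next(t for t in nlp(text) if t.text == "broken")
    # the lemma follows the predicted POS: VERB -> "break", ADJ -> "broken"
    print(text, "->", tok.pos_, tok.lemma_)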
If we wanted to match words based on their origin/derivation (i.e. whether they share the same root), stemming would be better despite its limitations: for example, for spelling correction, clustering, finding word families... even basic information retrieval.
Of course I could use an external stemmer, but I think it would be a nice addition to a linguistic processor such as spaCy, especially since it would require little effort (implementations of Porter's algorithm are out there for many languages). In the end, users can decide whether it's useful for them or not, but at least it would be there if they needed it. NLTK and CoreNLP, for example, include stemmers.
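To make it concrete, NLTK's Porter implementation already collapses a whole derivational family onto a single stem:

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ("connect", "connected", "connection", "connecting", "connections"):
    print(word, "->", stemmer.stem(word))  # all five collapse to "connect"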
My interpretation of past results is that stemming isn't particularly good for the tasks you describe. Actually for anything where you're looking to do particularly linguistic stuff, Porter's algorithm is really bad. It's unapologetically arbitrary. For derivation identification, I think Porter's algorithm will be much worse than lemmatization and a few additional rules.
I know that other libraries have stemming functions, but I think they probably shouldn't. Currently my answer to "when should I use stemming?" is "Never". I think you should want meeting/NOUN to be "stemmed" to meeting, not meet. It really is a different word, with very different distribution.
Stemming is far from ideal, as you rightly point out, but for this particular purpose it seems slightly better suited. Lemmatization will never tell us if two words derive from the same word, e.g. that meeting/NOUN and met/VERB both derive from meet/VERB. I guess I'm looking for a sort of "derivational lemmatization", so to speak.
I currently see two options to achieve this: (1) integrate an existing stemmer (implementations of Porter's algorithm exist for many languages), or (2) extend the lemmatizer with derivational mappings, e.g. via WordNet's derivationally related forms.
I understand that you have a clear roadmap for spaCy; I just wondered if anything like this was available or could make a useful addition.
Thank you.
Support for derivationally related words would be nice. I'll think about the best strategy for that.
In the meantime, you might be best off building this on top of the library, using the lemmas as keys (they're indeed derived from WordNet).
You'll hit some edge cases around multi word expressions and word sense ambiguity. But if you follow a first-sense heuristic, you should be able to get good results.
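Here's a rough sketch of what I mean, using NLTK's WordNet interface and the derivationally_related_forms() pointers with a first-sense heuristic (derivational_root is just an illustrative helper):

from nltk.corpus import wordnet as wn  # requires nltk.download("wordnet")

def derivational_root(word, pos=wn.NOUN):
    # first-sense heuristic: only look at the first WordNet lemma for this POS
    lemmas = wn.lemmas(word, pos=pos)
    if not lemmas:
        return word
    related = lemmas[0].derivationally_related_forms()
    # fall back to the WordNet lemma itself if there is no derivational pointer
    return related[0].name() if related else lemmas[0].name()

print(derivational_root("meeting"))  # e.g. "meet", via the first noun sense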
I second the OP's opinion that, despite its obvious limitations, stemming is very useful for pattern matching. While one may indeed argue that "meeting" as a noun has a different meaning than "meeting" as a verb, gerunds in general are a blend between nouns and verbs and might be thought of as an example of non-categorial phenomena in language. Quite often, discerning between a noun and a verb reading of a gerund is hard even for humans, so expecting a statistical parser to do this correctly and even consistently is asking for trouble.
A similar case is the spectrum between adjectives and gerunds. A practical problem appeared when I was trying to match "pulsing headache" in sentences like "this pulsing headache is killing me": "pulsing" is sometimes recognised as a verb (lemma "pulse"), sometimes as an adjective (lemma "pulsing"). So there you go: here is an example situation where you'd be better off using stemming.
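For what it's worth, here is roughly what that looks like in code; stem_match is just a hypothetical helper built on NLTK's PorterStemmer, not a spaCy API, and it assumes an English model is installed:

import spacy
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
nlp = spacy.load("en_core_web_sm")

def stem_match(doc, phrase):
    # compare stems token by token, so "pulsing" matches regardless of
    # whether it was tagged ADJ (lemma "pulsing") or VERB (lemma "pulse")
    target = [stemmer.stem(w) for w in phrase.lower().split()]
    stems = [stemmer.stem(t.lower_) for t in doc]
    for i in range(len(stems) - len(target) + 1):
        if stems[i:i + len(target)] == target:
            return doc[i:i + len(target)]
    return None

print(stem_match(nlp("This pulsing headache is killing me"), "pulsing headache"))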
My 2 cents: a stemmer may be better than the provided lemmatizer for highly specialized domains (for instance, in my case, biomedical domain).
For example, I was expecting spaCy to lemmatize endosomes to something like endosome (dropping the plural) or endosom, but it does not. In this particular case and similar ones, the NLTK SnowballStemmer (English) does a better job:
>>> from nltk.stem import SnowballStemmer
>>> SnowballStemmer("english").stem("endosomes")
'endosom'
I guess ideally one should train the spaCy models on the target domain, but that's sometimes overkill, or training data may be lacking.
This is really confusing me.
>>> nlp.vocab.morphology.lemmatizer(u'endosomes', 'noun', morphology={'number': 'plur'})
set([u'endosomes'])
>>> nlp.vocab.morphology.lemmatizer(u'chromosomes', 'noun', morphology={'number': 'plur'})
set([u'chromosome'])
Neither is in any exceptions data I can find. Trying to understand what's going on.
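If I had to guess, the lemmatizer only keeps a rule's output when the resulting form is in its known-word index ("chromosome" is in WordNet, while "endosome" presumably is not), falling back to the surface form otherwise. A toy re-creation of that suspected logic, with a made-up index and rule set (not spaCy's actual data):

index = {"chromosome"}   # suppose "endosome" is absent from the noun index
rules = [("s", "")]      # strip plural -s

def toy_lemmatize(word):
    forms = [word[:-len(old)] + new for old, new in rules if word.endswith(old)]
    known = [f for f in forms if f in index]
    return set(known) or {word}  # fall back to the surface form if nothing is known

print(toy_lemmatize("chromosomes"))  # -> {'chromosome'}
print(toy_lemmatize("endosomes"))    # -> {'endosomes'}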
@honnibal any news regarding derivationally related forms?