Is there any built-in stemmer in spaCy? If not, is this something you plan to add? An open-source Python implementation of Porter's algorithm might be a good start.
We have lemmatization, but no stemming. I think lemmatization is generally better, and Porter's algorithm is usually only useful to replicate another system exactly. We deprioritise replication in spaCy, to avoid duplicating functionality.
What are you trying to do? Is there a reason you think Porter stemming will be significantly better than lemmatization?
I agree that lemmatization is generally better than stemming. However, lemmatization depends on POS (so broken/ADJ gives broken but broken/VERB gives break) and thus only targets inflection, not derivation, as you clearly state in your docs (so meeting/NOUN gives meeting while meeting/VERB gives meet).
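A quick check with spaCy shows this POS dependence directly (a minimal sketch; it assumes the en_core_web_sm model is installed, and the exact tags will vary by model):

import spacy

nlp = spacy.load("en_core_web_sm")

for text in ("He has broken the window.", "The broken window was replaced."):
    tok = next(t for t in nlp(text) if t.text == "broken")
    # the lemma follows the predicted POS: VERB -> "break", ADJ -> "broken"
    print(text, "->", tok.pos_, tok.lemma_)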
If we wanted to match words based on their origin/derivation (i.e. whether they share the same root), stemming would be better despite its limitations: for example, for spelling correction, clustering, finding word families... even basic information retrieval.
Of course I could use an external stemmer, but I think it would be a nice addition to a linguistic processor such as spaCy, especially since it would require little effort (implementations of Porter's algorithm are out there for many languages). In the end, users can decide whether it's useful for them or not, but at least it would be there if they needed it. NLTK and CoreNLP, for example, include stemmers.
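To make it concrete, NLTK's Porter implementation already collapses a whole derivational family onto a single stem:

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ("connect", "connected", "connection", "connecting", "connections"):
    print(word, "->", stemmer.stem(word))  # all five collapse to "connect"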
My interpretation of past results is that stemming isn't particularly good for the tasks you describe. Actually for anything where you're looking to do particularly linguistic stuff, Porter's algorithm is really bad. It's unapologetically arbitrary. For derivation identification, I think Porter's algorithm will be much worse than lemmatization and a few additional rules.
I know that other libraries have stemming functions, but I think they probably shouldn't. Currently my answer to "when should I use stemming?" is "Never". I think you should want meeting/NOUN to be "stemmed" to meeting, not meet. It really is a different word, with very different distribution.
Stemming is far from ideal, as you rightly point out, but for this particular purpose it seems slightly better suited. Lemmatization will never tell us if two words derive from the same word, e.g. that meeting/NOUN and met/VERB both derive from meet/VERB. I guess I'm looking for a sort of "derivational lemmatization", so to speak.
I currently see two options to achieve this: (1) integrate an existing stemmer (implementations of Porter's algorithm exist for many languages), or (2) extend the lemmatizer with derivational mappings, e.g. via WordNet's derivationally related forms.
I understand that you have a clear roadmap for spaCy; I just wondered if anything like this was available or could make a useful addition.
Thank you.
Support for derivationally related words would be nice. I'll think about the best strategy for that.
In the meantime, you might be best off building this on top of the library, using the lemmas as keys (they're indeed derived from WordNet).
You'll hit some edge cases around multi word expressions and word sense ambiguity. But if you follow a first-sense heuristic, you should be able to get good results.
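Here's a rough sketch of what I mean, using NLTK's WordNet interface and the derivationally_related_forms() pointers with a first-sense heuristic (derivational_root is just an illustrative helper):

from nltk.corpus import wordnet as wn  # requires nltk.download("wordnet")

def derivational_root(word, pos=wn.NOUN):
    # first-sense heuristic: only look at the first WordNet lemma for this POS
    lemmas = wn.lemmas(word, pos=pos)
    if not lemmas:
        return word
    related = lemmas[0].derivationally_related_forms()
    # fall back to the WordNet lemma itself if there is no derivational pointer
    return related[0].name() if related else lemmas[0].name()

print(derivational_root("meeting"))  # e.g. "meet", via the first noun sense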
I second the OP's opinion that, despite its obvious limitations, stemming is very useful for pattern matching. While one may indeed argue that "meeting" as a noun has a different meaning than "meeting" as a verb, gerunds in general are a blend between nouns and verbs and might be thought of as an example of non-categorial phenomena in language. Quite often, discerning between a noun and a verb reading of a gerund is hard even for humans, so expecting a statistical parser to do this correctly and even consistently is asking for trouble.
A similar case is the spectrum between adjectives and gerunds. A practical problem appeared when I was trying to match "pulsing headache" in sentences like "this pulsing headache is killing me": "pulsing" is sometimes recognised as a verb (lemma "pulse"), sometimes as an adjective (lemma "pulsing"). So there you go: here is an example situation where you'd be better off using stemming.
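For what it's worth, here is roughly what that looks like in code; stem_match is just a hypothetical helper built on NLTK's PorterStemmer, not a spaCy API, and it assumes an English model is installed:

import spacy
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
nlp = spacy.load("en_core_web_sm")

def stem_match(doc, phrase):
    # compare stems token by token, so "pulsing" matches regardless of
    # whether it was tagged ADJ (lemma "pulsing") or VERB (lemma "pulse")
    target = [stemmer.stem(w) for w in phrase.lower().split()]
    stems = [stemmer.stem(t.lower_) for t in doc]
    for i in range(len(stems) - len(target) + 1):
        if stems[i:i + len(target)] == target:
            return doc[i:i + len(target)]
    return None

print(stem_match(nlp("This pulsing headache is killing me"), "pulsing headache"))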
My 2 cents: a stemmer may be better than the provided lemmatizer for highly specialized domains (for instance, in my case, biomedical domain).
For example, I was expecting spaCy to lemmatize endosomes to something like endosome (dropping the plural) or endosom, but it does not. In this particular case and similar ones, the NLTK SnowballStemmer (English) does a better job:
>>> from nltk.stem import SnowballStemmer
>>> SnowballStemmer("english").stem("endosomes")
'endosom'
I guess ideally one should train the spaCy models on the target domain, but that's sometimes overkill, or training data may be lacking.
This is really confusing me.
>>> nlp.vocab.morphology.lemmatizer(u'endosomes', 'noun', morphology={'number': 'plur'})
set([u'endosomes'])
>>> nlp.vocab.morphology.lemmatizer(u'chromosomes', 'noun', morphology={'number': 'plur'})
set([u'chromosome'])
Neither is in any exceptions data I can find. Trying to understand what's going on.
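If I had to guess, the lemmatizer only keeps a rule's output when the resulting form is in its known-word index ("chromosome" is in WordNet, while "endosome" presumably is not), falling back to the surface form otherwise. A toy re-creation of that suspected logic, with a made-up index and rule set (not spaCy's actual data):

index = {"chromosome"}   # suppose "endosome" is absent from the noun index
rules = [("s", "")]      # strip plural -s

def toy_lemmatize(word):
    forms = [word[:-len(old)] + new for old, new in rules if word.endswith(old)]
    known = [f for f in forms if f in index]
    return set(known) or {word}  # fall back to the surface form if nothing is known

print(toy_lemmatize("chromosomes"))  # -> {'chromosome'}
print(toy_lemmatize("endosomes"))    # -> {'endosomes'}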
@honnibal any news regarding derivationally related forms?