spaCy: German lemmatization and noun gender

Created on 20 May 2016 · 11 Comments · Source: explosion/spaCy

As far as I can tell the nascent German support doesn't yet do lemmatization or detect the gender of nouns.

For unknown nouns it's not always possible to guess what gender they are even if you do have an article or adjective — e.g. in ‘beim X’ X could be masculine or neuter. It should ideally have separate tags meaning 'could be masculine or neuter' (as before), 'could be feminine or plural' ('wegen der X'), 'could be masculine or plural' ('in den X', X could be accusative masculine or dative plural), and 'could be any gender or number' …
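
A minimal sketch of what such underspecified tags could look like, assuming each cue (preposition plus article) simply maps to the set of readings it leaves open. The tag names, the `CUE_READINGS` table, and both helper functions are hypothetical and are not part of spaCy; only the three cues from the examples above are covered.

```python
# Hypothetical sketch: the tag names and this table are illustrative only,
# not spaCy's tag set or API.
MASC, FEM, NEUT, PLUR = "Masc", "Fem", "Neut", "Plur"
ALL_READINGS = frozenset({MASC, FEM, NEUT, PLUR})

# Cue (lowercased) -> readings it leaves open for the following unknown noun.
CUE_READINGS = {
    "beim": frozenset({MASC, NEUT}),       # dative singular, masculine or neuter
    "wegen der": frozenset({FEM, PLUR}),   # feminine singular or plural
    "in den": frozenset({MASC, PLUR}),     # accusative masculine or dative plural
}


def possible_readings(cue: str) -> frozenset:
    """Return the readings a cue allows; unknown cues allow everything."""
    return CUE_READINGS.get(cue.lower().strip(), ALL_READINGS)


def combine(*cues: str) -> frozenset:
    """Intersect the readings from several sightings of the same noun."""
    readings = ALL_READINGS
    for cue in cues:
        readings &= possible_readings(cue)
    return readings
```

Representing the tag as a set means evidence from several contexts can be narrowed by intersection: a noun seen after both 'beim' and 'in den' can only be masculine under this table.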

enhancement help wanted lang / de

dpk

Most helpful comment

Hi, I just wanted to point you to one of my older projects called IWNLP, which produces a list of form -> lemma mappings (e.g., Schwimmbäder -> Schwimmbad) for German words based on Wiktionary. Check out https://github.com/Liebeck/IWNLP and www.iwnlp.com or contact me for more information.
The produced mappings are in the form of (form, POS) -> lemma, are under a CC license, and I'd love to see them implemented ;) I might be able to dump another format (or a generic format, if specified) if that's desired. The latest evaluation results are listed here: http://www.iwnlp.com/iwnlp_results.html

Liebeck on 11 Apr 2017

🎉1 👍1

All 11 comments

Hi, I just wanted to ask whether German lemmatization will be available in one of the future (alpha) releases, which would make spaCy even more awesome.

cschwem2er on 6 Nov 2016

Hello, we'll need German lemmatization too. Is there a roadmap or any plans for this feature?

schwaen on 10 Nov 2016

Copying over my reply from #789:

Yes, agreed! We'd love to add at least a lookup-table-based lemmatizer for German. It shouldn't even be too much work, but unfortunately, we haven't gotten around to it yet. We're currently busy getting spaCy 2.0 ready, but if you want to play around with this (maybe porting it over from textblob could be an option?) and make a pull request, we'd definitely be happy to help and support!

Here's the info on textblob from @schlichtanders's comment:

After some research I found an open German lemmatizer as part of textblob-de

In the respective source code, the decisive code snippet is this one:
https://github.com/markuskiller/textblob-de/blob/dev/textblob_de/ext/_pattern/text/de/__init__.py#L186-L199

ines on 30 Jan 2017

@honnibal @ines Is this something you'd like to see implemented?

Liebeck on 15 Apr 2017

This is really nice! We'd definitely like to see this implemented. Sorry I missed this comment before!

This should be quite easy actually. If you check out the morph_rules.py file, you can see that we already have a mechanism that does the (form, POS) -> lemma mapping. This is how we do lemmatization for many English words.
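
For illustration, a (form, POS) -> lemma lookup can be sketched as a plain dictionary keyed on the pair. This is only an assumption about the general shape of such a table; it does not reproduce the contents or format of morph_rules.py or of the IWNLP dump. `LEMMA_TABLE` and `lemmatize` are hypothetical names, and the single entry is the Schwimmbäder -> Schwimmbad example mentioned earlier in the thread.

```python
from typing import Dict, Tuple

# Hypothetical table: keys are (lowercased form, coarse POS), values are lemmas.
LEMMA_TABLE: Dict[Tuple[str, str], str] = {
    ("schwimmbäder", "NOUN"): "Schwimmbad",
}


def lemmatize(form: str, pos: str,
              table: Dict[Tuple[str, str], str] = LEMMA_TABLE) -> str:
    """Look up the lemma for (form, POS); fall back to the form unchanged."""
    return table.get((form.lower(), pos), form)
```

Keying on the POS as well as the form lets the same surface string map to different lemmas depending on its part of speech, and the fallback keeps unknown words intact rather than guessing.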

honnibal on 16 Apr 2017

I'm merging all issues related to this and making #846 the "master issue", so I'm closing this one (a little frustrating that there's no good way to do this on GH, so just copying over the most relevant comments).

@Liebeck Thanks a lot for offering to take this on – much appreciated! 👍

ines on 16 Apr 2017

Any news on German?
I have a new library based on DAFSA lookup. It's called DEMorphy :) : https://github.com/DuyguA/DEMorphy

If anyone is interested in using an external library, or in making a similar implementation for de, you're welcome to :+1:

DuyguA on 7 Mar 2018

❤1

@DuyguA I'm currently implementing IWNLP as a spaCy extension. I do have 1-2 bugs that I need to fix though. With the release of v2.0, spaCy now supports lemmatization for German. I'm not sure how useful this extension will be now or how often it will be used. I manually compared my extension with some of the lemmas from spaCy and, subjectively and based on a couple of test sentences, think that the lemmatization of nouns is comparable but that the lemmatization of verbs works better with IWNLP.

I will probably benchmark the lemmatization of spaCy in late April, after my thesis defense.

@DuyguA I just skimmed through your paper but I could not find any evaluation results in terms of accuracy or F1 to back up your claim of a state-of-the-art system. Did I miss it?
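
As a rough sketch of the kind of accuracy figure meant here: given gold (form, POS, lemma) triples, accuracy is the share of forms for which a lemmatizer returns the gold lemma. The function name, the toy triples, and the identity baseline below are illustrative assumptions, not taken from any of the systems or corpora discussed in this thread.

```python
from typing import Callable, Iterable, Tuple

GoldTriple = Tuple[str, str, str]  # (form, POS, gold lemma)


def lemma_accuracy(lemmatize: Callable[[str, str], str],
                   gold: Iterable[GoldTriple]) -> float:
    """Fraction of gold triples whose lemma the lemmatizer reproduces."""
    total = correct = 0
    for form, pos, lemma in gold:
        total += 1
        if lemmatize(form, pos) == lemma:
            correct += 1
    return correct / total if total else 0.0


# Toy data for illustration only; a real benchmark would use an annotated corpus.
TOY_GOLD = [
    ("Schwimmbäder", "NOUN", "Schwimmbad"),
    ("Häuser", "NOUN", "Haus"),
    ("ging", "VERB", "gehen"),
    ("Haus", "NOUN", "Haus"),
]


def identity_baseline(form: str, pos: str) -> str:
    """Trivial baseline: return the form itself as the lemma."""
    return form
```

Passing the lemmatizer in as a callable means the same gold set can be used to compare several systems side by side.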

Liebeck on 7 Mar 2018

❤1

Great then!

The extended abstract is now available; a benchmarks and evaluation chapter plus one more chapter will be released soon. We did the evaluation on the TIGER corpus; however, my colleagues have more ideas for evaluation. We will complete those sections and then release the additional chapters.

DuyguA on 7 Mar 2018

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.