spaCy: Question: German model

Created on 14 Apr 2016 · 8 comments · Source: explosion/spaCy

Hi,

So I noticed there are a lot of German language related parts in the code. But when I try to initialize

from spacy.de import German
nlp = German()

I get
model 'de>=1.0.0,<1.1.0' not installed. Please run 'python -m spacy.de.download' to install latest compatible model.

presumably because the model is not listed at https://index.spacy.io/models. So I was wondering: what is the status of German support?

Most helpful comment

Well spotted. We are going to support German soon.

All 8 comments

Well spotted. We are going to support German soon.

I was wondering how far along you are with this. Which parts are missing? Can one contribute somehow?

Here's the list of features and their status as I understand it:

Linguistic models

  • [x] POS tags
  • [x] Dependency parse
  • [x] Sentence boundaries (via dependency parse)
  • [x] Noun chunks (via dependency parse)
  • [x] Word vectors
  • [ ] NER
  • [ ] Lemmas
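
Two of the ticked items above, sentence boundaries and noun chunks, are listed as derived "via dependency parse". A minimal sketch of how that derivation can work (illustrative only, not spaCy's actual implementation): if each token stores the index of its syntactic head, with root tokens pointing at themselves, then tokens that resolve to the same root form one sentence.

```python
# Sketch: deriving sentence boundaries from a dependency parse.
# heads[i] is the index of token i's head; a root token is its own head.

def root_of(heads, i):
    """Follow head links from token i up to its sentence root."""
    while heads[i] != i:
        i = heads[i]
    return i

def sentence_spans(heads):
    """Group contiguous tokens into (start, end) spans, one per root."""
    spans = []
    start = 0
    for i in range(1, len(heads)):
        if root_of(heads, i) != root_of(heads, start):
            spans.append((start, i))
            start = i
    spans.append((start, len(heads)))
    return spans

# Two clauses, e.g. "Ich bin ein Berliner ." followed by "Das stimmt ."
heads = [1, 1, 3, 1, 1, 6, 6, 6]
print(sentence_spans(heads))  # [(0, 5), (5, 8)]
```

Noun chunks work the same way in spirit: once the parse exists, language-specific rules walk the tree to pick out spans, so no extra statistical model is needed for either feature.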

Engineering integrations

  • [x] Load languages by ID string (e.g. nlp = spacy.load('de'))
  • [x] Inspect language for data and models via token.lang, span.lang, doc.lang, nlp.lang etc.
  • [x] Select correct noun chunk rules depending on language after parse
  • [ ] Select correct noun chunk rules depending on language after deserialise
  • [ ] Select correct lemmatisation depending on language after parse
  • [ ] Select correct lemmatisation depending on language after deserialise

We're happy to launch without all of the annotations, so the NER and lemmatisation aren't holding things up. But we've gone through a few iterations of how to get the engineering integrations done nicely. The crux of the issue is that we want to avoid a design where we commit to having different language subclasses of the different parts of the library, e.g. we don't want a GermanDoc, or a GermanParser, etc. I think that gets very messy very quickly. We've got a design that works now, so I'm just finishing up a small refactor to keep everything neat.
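
The design described above, one generic pipeline with per-language data looked up by ID string rather than GermanDoc/GermanParser subclasses, can be sketched roughly like this. All names here are hypothetical, for illustration only; this is not spaCy's actual API.

```python
# Hypothetical sketch of dispatch-by-ID-string instead of subclassing.
# Language-specific behaviour lives in registered *data*, so one Language
# class serves every language.

LANGUAGES = {}

def register(lang_id, **data):
    """Associate language-specific rules/data with an ID string."""
    LANGUAGES[lang_id] = data

class Language:
    def __init__(self, lang_id):
        self.lang = lang_id
        # Pull in per-language rules instead of overriding methods.
        self.noun_chunk_rules = LANGUAGES[lang_id]["noun_chunk_rules"]

def load(lang_id):
    """Mimics the spirit of spacy.load('de'): look up by ID string."""
    return Language(lang_id)

register("de", noun_chunk_rules=["chunks around nominal heads"])
register("en", noun_chunk_rules=["chunks from base noun phrases"])

nlp = load("de")
print(nlp.lang)  # de
```

The payoff is that containers like Doc and Span never need language-specific subclasses; they just carry the language ID and consult the registered data when a rule set is needed.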

We'll try to get a development model hosted so you can play with something. But mostly it's just finishing this small refactor, and writing the documentation.

I can't wait to play with the German model! 💃

Sneak peek:

git clone https://github.com/spacy-io/spaCy
cd spaCy
pip install -r requirements.txt
pip install -e .
python -m spacy.de.download all
python

>>> import spacy
>>> nlp = spacy.load('de')
>>> doc = nlp(u'Ich bin ein Berliner.')

Pretty much everything should be working except lemmatization. Word vectors are trained on a combination of the OpenSubtitles corpus and Wikipedia, so they should have reasonable domain independence. The NER and syntax are trained on newspaper text, so they might be a bit brittle on conversational text.

Full announcement to follow :)

You can now find the German-compatible code on PyPI and conda:

pip install --upgrade spacy
python -m spacy.de.download all
python

>>> import spacy
>>> nlp = spacy.load('de')
>>> doc = nlp(u'Ich bin ein Berliner.')

@honnibal, you're the best, man! Thanks! One beer on me next time I'm in Berlin.

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.
