spaCy: Question: German model

Created on 14 Apr 2016 · 8 comments · Source: explosion/spaCy

Hi,

So I noticed there are a lot of German language related parts in the code. But when I try to initialize

from spacy.de import German
nlp = German()

I get
model 'de>=1.0.0,<1.1.0' not installed. Please run 'python -m spacy.de.download' to install latest compatible model.

presumably because the model is not listed at https://index.spacy.io/models. So I was wondering: what is the status of German support?

Most helpful comment

Well spotted. We are going to support German soon.

All 8 comments

Well spotted. We are going to support German soon.

I was wondering how far along you are with this. Which parts are missing? Can one contribute somehow?

Here's the list of features and their status as I understand it:

Linguistic models

  • [x] POS tags
  • [x] Dependency parse
  • [x] Sentence boundaries (via dependency parse)
  • [x] Noun chunks (via dependency parse)
  • [x] Word vectors
  • [ ] NER
  • [ ] Lemmas
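
Two of the ticked items above, sentence boundaries and noun chunks, are listed as derived "via dependency parse". A minimal sketch of how that derivation can work (illustrative only, not spaCy's actual implementation): if each token stores the index of its syntactic head, with root tokens pointing at themselves, then tokens that resolve to the same root form one sentence.

```python
# Sketch: deriving sentence boundaries from a dependency parse.
# heads[i] is the index of token i's head; a root token is its own head.

def root_of(heads, i):
    """Follow head links from token i up to its sentence root."""
    while heads[i] != i:
        i = heads[i]
    return i

def sentence_spans(heads):
    """Group contiguous tokens into (start, end) spans, one per root."""
    spans = []
    start = 0
    for i in range(1, len(heads)):
        if root_of(heads, i) != root_of(heads, start):
            spans.append((start, i))
            start = i
    spans.append((start, len(heads)))
    return spans

# Two clauses, e.g. "Ich bin ein Berliner ." followed by "Das stimmt ."
heads = [1, 1, 3, 1, 1, 6, 6, 6]
print(sentence_spans(heads))  # [(0, 5), (5, 8)]
```

Noun chunks work the same way in spirit: once the parse exists, language-specific rules walk the tree to pick out spans, so no extra statistical model is needed for either feature.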

Engineering integrations

  • [x] Load languages by ID string (e.g. nlp = spacy.load('de'))
  • [x] Inspect language for data and models via token.lang, span.lang, doc.lang, nlp.lang etc.
  • [x] Select correct noun chunk rules depending on language after parse
  • [ ] Select correct noun chunk rules depending on language after deserialise
  • [ ] Select correct lemmatisation depending on language after parse
  • [ ] Select correct lemmatisation depending on language after deserialise

We're happy to launch without all of the annotations, so the NER and lemmatisation aren't holding things up. But we've gone through a few iterations of how to get the engineering integrations done nicely. The crux of the issue is that we want to avoid a design where we commit to having different language subclasses of the different parts of the library, e.g. we don't want a GermanDoc, or a GermanParser, etc. I think that gets very messy very quickly. We've got a design that works now, so I'm just finishing up a small refactor to keep everything neat.
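
The design described above, one generic pipeline with per-language data looked up by ID string rather than GermanDoc/GermanParser subclasses, can be sketched roughly like this. All names here are hypothetical, for illustration only; this is not spaCy's actual API.

```python
# Hypothetical sketch of dispatch-by-ID-string instead of subclassing.
# Language-specific behaviour lives in registered *data*, so one Language
# class serves every language.

LANGUAGES = {}

def register(lang_id, **data):
    """Associate language-specific rules/data with an ID string."""
    LANGUAGES[lang_id] = data

class Language:
    def __init__(self, lang_id):
        self.lang = lang_id
        # Pull in per-language rules instead of overriding methods.
        self.noun_chunk_rules = LANGUAGES[lang_id]["noun_chunk_rules"]

def load(lang_id):
    """Mimics the spirit of spacy.load('de'): look up by ID string."""
    return Language(lang_id)

register("de", noun_chunk_rules=["chunks around nominal heads"])
register("en", noun_chunk_rules=["chunks from base noun phrases"])

nlp = load("de")
print(nlp.lang)  # de
```

The payoff is that containers like Doc and Span never need language-specific subclasses; they just carry the language ID and consult the registered data when a rule set is needed.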

We'll try to get a development model hosted so you can play with something. But mostly it's just finishing this small refactor, and writing the documentation.

I can't wait to play with the German model! 💃

Sneak peek:

git clone https://github.com/spacy-io/spaCy
cd spaCy
pip install -r requirements.txt
pip install -e .
python -m spacy.de.download all
python

>>> import spacy
>>> nlp = spacy.load('de')
>>> doc = nlp(u'Ich bin ein Berliner.')

Pretty much everything should be working except lemmatization. Word vectors are trained on a combination of the OpenSubtitles corpus and Wikipedia, so they should have reasonable domain independence. The NER and syntax are trained on newspaper text, so they might be a bit brittle on conversational text.

Full announcement to follow :)

You can now find the German-compatible code on PyPI and conda:

pip install --upgrade spacy
python -m spacy.de.download all
python

>>> import spacy
>>> nlp = spacy.load('de')
>>> doc = nlp(u'Ich bin ein Berliner.')

@honnibal, you're the best, man! Thanks! One beer on me next time I'm in Berlin.

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.
