The existing German model seems to perform reasonably well as long as the correct letter case is present. In German, nouns are capitalised, and apparently the tagger and parser have learnt to treat this as a very important feature. In chat/web language everything is often written in lower case, and the performance of the tagger and parser degrades dramatically when presented with all-lowercase input.
It would be very useful to have a second model for German — one trained on the same input corpora, but converted to lower case before training.
EDIT: perhaps a letter-case changer could be sewn into the corresponding pipeline, but this would just be the icing on the cake.
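To illustrate the idea, here's a minimal sketch of lowercasing input before it ever reaches the pipeline, rather than changing the model itself. The `analyse_caseless` wrapper and the stand-in pipeline are hypothetical; in practice `nlp` would be a loaded spaCy pipeline trained on lowercased text:

```python
# Sketch: normalise case as a preprocessing step, so a (hypothetical)
# lowercase-trained model always sees input in the case it was trained on.
def analyse_caseless(nlp, text):
    # `nlp` stands in for a loaded pipeline; only the input casing changes.
    return nlp(text.lower())

# Stand-in "pipeline" for illustration: just splits on whitespace.
fake_nlp = lambda text: text.split()
print(analyse_caseless(fake_nlp, "Der Hund bellt"))
# ['der', 'hund', 'bellt']
```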
It would be very easy to at least do a case-insensitive stop word check.
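Such a check could look like the sketch below. The stop word set here is a tiny illustrative subset, not spaCy's actual German list (which lives in `spacy.lang.de.stop_words.STOP_WORDS`):

```python
# Minimal sketch of a case-insensitive stop word check.
# Illustrative subset only; the real list is much longer.
STOP_WORDS = {"der", "die", "das", "und", "aber"}

def is_stop(word: str) -> bool:
    # Lowercase before the membership test, so a sentence-initial
    # "Aber" still matches the lowercase entry "aber".
    return word.lower() in STOP_WORDS

print(is_stop("Aber"))  # True, despite the capital letter
print(is_stop("Hund"))  # False
```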
Does this approach work? Has anyone tested it? What about mixing lower and upper case? @ines Why do you need to experiment with that? If you train the tagger on lowercased text, it should work the same as the other way around, right?
The current German tagging and parsing data is sourced from newspaper text, which I'm sure also differs in other ways from conversational text.
It's not possible to train a model that is case-insensitive while also retaining high accuracy on formal language --- after all, the model is paying attention to features that are very helpful in that domain. This is why the German models are named de_core_news_sm and de_core_news_md (in v2.1.0) --- they're optimised for news text.
In order to produce German models that work well on social media text, we'll need training and evaluation data from those domains. In the meantime, you could approximate the data by trying to take well cased text, parse it, and lower case it while retaining the annotations produced by the model. You could then use the annotations from the correctly cased text as gold-standard examples, to update the model. Effectively what you'd be doing here is trying to teach the model to parse the same way whether or not the text is cased correctly.
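The data-generation step described above could be sketched as follows. Annotations are shown here as simple `(token, tag)` pairs for illustration; in practice they would be the tags, heads and dependency labels produced by the parser on the well-cased text:

```python
# Sketch of turning model output on well-cased text into pseudo-gold
# lowercased training examples: keep the annotations, lowercase only
# the surface forms, then update the model on the result.
def lowercase_example(annotated):
    return [(tok.lower(), tag) for tok, tag in annotated]

# Hypothetical model output on correctly cased input.
cased = [("Der", "DET"), ("Hund", "NOUN"), ("bellt", "VERB")]
print(lowercase_example(cased))
# [('der', 'DET'), ('hund', 'NOUN'), ('bellt', 'VERB')]
```

The point is that both versions carry identical annotations, which is what teaches the model to parse the same way regardless of casing.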
You'll want to have a good evaluation on your problem domain set up before you start these experiments, as it'll take some fiddling to get the right strategy for your problem.
Thanks. Why couldn't you just lowercase the training data and teach the same algorithm the proper tags? I think this would be enough.
And where is the new German model (md)?
If a stop word list exists, it could simply contain a lowercase and an uppercase version of the same word, if it can't be made case-insensitive. As is, the stop word tagging does not work correctly: as soon as a stop word is at the beginning of a sentence, it isn't recognised.
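Expanding the list that way is a one-liner. A sketch, using a tiny illustrative stop word set rather than spaCy's real one:

```python
# Workaround sketch: add title-cased variants to a stop word list so
# sentence-initial stop words ("Aber ...") are matched as well.
STOP_WORDS = {"der", "aber"}
expanded = STOP_WORDS | {w.title() for w in STOP_WORDS}
print(sorted(expanded))
# ['Aber', 'Der', 'aber', 'der']
```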
The stop word issue should be fixed in the new models for v2.1.0. The new version will also include a fix that makes stop words case-insensitive. You can already try the new models (and the other new features) via spacy-nightly – see here for the release details.
@bbrinx Have you already tried lowercasing and training a new model? What are your results?
I tried the new spacy-nightly version and it works like a charm. The bigger German corpus is also great news! Thanks guys! After having built my own tokenizer, I'm gonna switch back to spaCy now.
Merging this with the master thread in #3056!
This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.