The existing German model seems to perform reasonably well as long as the correct letter case is present. In German, nouns are capitalised, and apparently the tagger and parser have learnt to treat this as a very important feature. In chat/web language everything is often written in lower case, and the performance of the tagger and parser degrades dramatically when presented with all-lowercase input.
It would be very useful to have a second model for German — one trained on the same input corpora, but converted to lower case before training.
EDIT: perhaps a letter-case changer could be sewn into the corresponding pipeline, but this would just be the icing on the cake.
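To illustrate the idea, here's a minimal sketch of lowercasing input before it ever reaches the pipeline, rather than changing the model itself. The `analyse_caseless` wrapper and the stand-in pipeline are hypothetical; in practice `nlp` would be a loaded spaCy pipeline trained on lowercased text:

```python
# Sketch: normalise case as a preprocessing step, so a (hypothetical)
# lowercase-trained model always sees input in the case it was trained on.
def analyse_caseless(nlp, text):
    # `nlp` stands in for a loaded pipeline; only the input casing changes.
    return nlp(text.lower())

# Stand-in "pipeline" for illustration: just splits on whitespace.
fake_nlp = lambda text: text.split()
print(analyse_caseless(fake_nlp, "Der Hund bellt"))
# ['der', 'hund', 'bellt']
```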
It would be very easy to at least do a case-insensitive stop word check.
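Such a check could look like the sketch below. The stop word set here is a tiny illustrative subset, not spaCy's actual German list (which lives in `spacy.lang.de.stop_words.STOP_WORDS`):

```python
# Minimal sketch of a case-insensitive stop word check.
# Illustrative subset only; the real list is much longer.
STOP_WORDS = {"der", "die", "das", "und", "aber"}

def is_stop(word: str) -> bool:
    # Lowercase before the membership test, so a sentence-initial
    # "Aber" still matches the lowercase entry "aber".
    return word.lower() in STOP_WORDS

print(is_stop("Aber"))  # True, despite the capital letter
print(is_stop("Hund"))  # False
```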
Does this approach work? Has anyone tested it? What about mixing lower and upper case? @ines Why do you need to experiment with that? If you train the tagger on lowercased text, it should work the same as the other way around, right?
The current German tagging and parsing data is sourced from newspaper text, which I'm sure also differs in other ways from conversational text.
It's not possible to train a model that is case-insensitive while also retaining high accuracy on formal language --- after all, the model is paying attention to features that are very helpful in that domain. This is why the German models are named de_core_news_sm and de_core_news_md (in v2.1.0) --- they're optimised for news text.
In order to produce German models that work well on social media text, we'll need training and evaluation data from those domains. In the meantime, you could approximate the data by trying to take well cased text, parse it, and lower case it while retaining the annotations produced by the model. You could then use the annotations from the correctly cased text as gold-standard examples, to update the model. Effectively what you'd be doing here is trying to teach the model to parse the same way whether or not the text is cased correctly.
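The data-generation step described above could be sketched as follows. Annotations are shown here as simple `(token, tag)` pairs for illustration; in practice they would be the tags, heads and dependency labels produced by the parser on the well-cased text:

```python
# Sketch of turning model output on well-cased text into pseudo-gold
# lowercased training examples: keep the annotations, lowercase only
# the surface forms, then update the model on the result.
def lowercase_example(annotated):
    return [(tok.lower(), tag) for tok, tag in annotated]

# Hypothetical model output on correctly cased input.
cased = [("Der", "DET"), ("Hund", "NOUN"), ("bellt", "VERB")]
print(lowercase_example(cased))
# [('der', 'DET'), ('hund', 'NOUN'), ('bellt', 'VERB')]
```

The point is that both versions carry identical annotations, which is what teaches the model to parse the same way regardless of casing.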
You'll want to have a good evaluation on your problem domain set up before you start these experiments, as it'll take some fiddling to get the right strategy for your problem.
Thanks. Why couldn't you just lowercase the training data and teach the same algorithm the proper tags? I think this would be enough.
And where is the new German model (md)?
If a stop word list exists, it could simply contain a lowercase and an uppercase version of the same word, if it can't be made case-insensitive. As is, the stop word tagging does not work correctly: as soon as a stop word is at the beginning of a sentence, it isn't recognised.
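Expanding the list that way is a one-liner. A sketch, using a tiny illustrative stop word set rather than spaCy's real one:

```python
# Workaround sketch: add title-cased variants to a stop word list so
# sentence-initial stop words ("Aber ...") are matched as well.
STOP_WORDS = {"der", "aber"}
expanded = STOP_WORDS | {w.title() for w in STOP_WORDS}
print(sorted(expanded))
# ['Aber', 'Der', 'aber', 'der']
```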
The stop word issue should be fixed in the new models for v2.1.0. The new version will also include a fix that makes stop words case-insensitive. You can already try the new models (and the other new features) via spacy-nightly – see here for the release details.
@bbrinx Have you already tried lowercasing and training a new model? What are your results?
I tried the new spacy-nightly version and it works like a charm. The bigger German corpus is also great news! Thanks guys! After having built my own tokenizer, I'm gonna switch back to spaCy now.
Merging this with the master thread in #3056!
This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.