Hi,
the UD German-HDT dataset (/cc @akoehn) has just been released today :heart:
The Hamburg Dependency Treebank consists of 261,821 sentences (4.8M tokens), all sourced from the German news site heise.de, from articles published between 1996 and 2001. The articles range from formulaic periodic updates on new BIOS revisions, processor models, and quarterly earnings of tech companies, through features on general trends in the hardware and software market, to general coverage of social, legal, and political issues in cyberspace, sometimes in the form of extensive weekly editorial comments. The treebank was created through manual annotation, largely interleaved with the development of a standard for morphological and syntactic annotation as well as a constraint-based parser.
More information:
I am thrilled to train a PoS tagging model on it, so adding native support for that dataset in flair would be the first step.
Wow, this looks great - are the morphological annotations all manual? I.e. gender, number, etc.?
See the HDT paper for a description of the annotation process. The morphological annotations are semi-manual: the annotator had to select the correct lexical entry, but if a feature has no discriminative power regarding syntax, it can be left underspecified (e.g. a determiner might carry `not_fem` as a feature instead of `masc` or `neut`).
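To make underspecification concrete, here is a tiny illustrative sketch (my own set-based representation, not the actual HDT tooling) of how an underspecified label like `not_fem` can still be checked for agreement:

```python
# Illustrative sketch: an underspecified morphological feature as the
# set of concrete values it leaves open (not the actual HDT scheme).

GENDERS = {"masc", "fem", "neut"}

def feature_values(label):
    """Expand a (possibly underspecified) gender label to the set of
    concrete values it is compatible with."""
    if label.startswith("not_"):
        return GENDERS - {label[len("not_"):]}
    return {label}

def compatible(a, b):
    """Two feature labels agree if their value sets overlap."""
    return bool(feature_values(a) & feature_values(b))

# A determiner marked not_fem agrees with a masculine or neuter
# noun, but not with a feminine one.
print(compatible("not_fem", "masc"))  # True
print(compatible("not_fem", "fem"))   # False
```

The point is that the annotator never has to commit to `masc` vs. `neut` when syntax cannot tell them apart.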
Thanks for clarifying - this looks really interesting, and we should add support for this dataset to Flair. A while back I trained Flair on the German UD corpus to predict case (dative, accusative, etc.); it would be interesting to train such a model on this dataset and see how well it works.
I trained a PoS tagging model (for Universal Dependencies tags). Training took ~4 days, so I highly recommend decreasing patience from 3 to 2. `embeddings_in_memory` should be set to `False` unless you have more than 128 GB of RAM :sweat_smile:
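For readers unfamiliar with the `patience` parameter: in flair's trainer it controls how many epochs without a dev-score improvement are tolerated before the learning rate is annealed (halved by default). A simplified stdlib sketch of that scheduling idea (not flair's actual implementation):

```python
# Simplified sketch of patience-based learning-rate annealing,
# the idea behind flair's scheduler (not its actual code).

def anneal_schedule(dev_scores, lr=0.1, patience=2, factor=0.5):
    """Halve the learning rate whenever `patience` consecutive epochs
    fail to beat the best dev score so far; return the lr per epoch."""
    best, bad, lrs = float("-inf"), 0, []
    for score in dev_scores:
        if score > best:
            best, bad = score, 0
        else:
            bad += 1
            if bad >= patience:
                lr *= factor
                bad = 0
        lrs.append(lr)
    return lrs

# With patience=2, two non-improving epochs trigger one halving.
print(anneal_schedule([0.90, 0.95, 0.94, 0.95], patience=2))
# [0.1, 0.1, 0.1, 0.05]
```

A lower patience anneals the learning rate sooner, which is why it shortens long training runs like this one.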
Accuracy was then 98.5% (using both training datasets).
@alanakbik I could upload the trained model (size: 4.5GB), so we can integrate it into flair if you want :)
@stefan-it yes that would be great! What embeddings are you using, i.e. why is the model so large?
I used the following embedding types (imports added for completeness):

```python
from typing import List

from flair.embeddings import (
    TokenEmbeddings,
    WordEmbeddings,
    PooledFlairEmbeddings,
)

embedding_types: List[TokenEmbeddings] = [
    WordEmbeddings('de-crawl'),
    PooledFlairEmbeddings('german-forward'),
    PooledFlairEmbeddings('german-backward'),
]
```
Hidden size was 256, batch size was 32.
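For anyone wanting to reproduce this, here is a rough end-to-end training sketch against the flair 0.4.2 API; the `UD_GERMAN_HDT` corpus class name, the output path, and the exact parameter names are assumptions pieced together from this thread and may differ in your flair version:

```python
from flair.datasets import UD_GERMAN_HDT  # assumed corpus class from the new dataset support
from flair.embeddings import WordEmbeddings, PooledFlairEmbeddings, StackedEmbeddings
from flair.models import SequenceTagger
from flair.trainers import ModelTrainer

# 1. load the corpus and build the UPOS tag dictionary
corpus = UD_GERMAN_HDT()
tag_dictionary = corpus.make_tag_dictionary(tag_type='upos')

# 2. stack the embeddings listed above
embeddings = StackedEmbeddings([
    WordEmbeddings('de-crawl'),
    PooledFlairEmbeddings('german-forward'),
    PooledFlairEmbeddings('german-backward'),
])

# 3. sequence tagger with the hidden size mentioned in this thread
tagger = SequenceTagger(hidden_size=256,
                        embeddings=embeddings,
                        tag_dictionary=tag_dictionary,
                        tag_type='upos')

# 4. train with the recommended settings (patience=2, embeddings kept out of memory)
trainer = ModelTrainer(tagger, corpus)
trainer.train('resources/taggers/upos-german-hdt',  # hypothetical output path
              mini_batch_size=32,
              patience=2,
              embeddings_in_memory=False)
```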
I'll do more experiments with different pooling operations: in another PoS tagging experiment, the `fade` pooling operation achieved worse results than normal Flair embeddings.
Ah interesting - generally, I'd expect the impact of pooling to be less pronounced on PoS tagging than on NER. It is probably OK to use normal `FlairEmbeddings` for PoS. That also has the advantage that models are smaller and training/inference is faster.
Closed by #696: Support is in master branch and will be part of 0.4.2. Thanks @stefan-it!