Hi,
the UD German-HDT dataset (/cc @akoehn) has just been released today :heart:
The Hamburg Dependency Treebank consists of 261,821 sentences (4.8M tokens), all sourced from the German news site heise.de, from articles published between 1996 and 2001. The articles range from formulaic periodic updates on new BIOS revisions, processor models, and quarterly earnings of tech companies, through features on general trends in the hardware and software market, to general coverage of social, legal, and political issues in cyberspace, sometimes in the form of extensive weekly editorial comments. The treebank was created through manual annotation, largely interleaved with the development of a standard for morphological and syntactic annotation as well as a constraint-based parser.
More information:
I am thrilled to train a PoS tagging model on it, so adding native support for that dataset in flair would be the first step.
Wow, this looks great - are the morphological annotations all manual? I.e. gender, number, etc.?
See the HDT paper for a description of the annotation process. The morphological annotations are semi-manual: the annotator had to select the correct lexical entry, but if a feature has no discriminative power regarding syntax, it can be left underspecified (e.g. a determiner might carry `not_fem` as a feature instead of `masc` or `neut`).
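To make underspecification concrete, here is a tiny illustrative sketch (my own set-based representation, not the actual HDT tooling) of how an underspecified label like `not_fem` can still be checked for agreement:

```python
# Illustrative sketch: an underspecified morphological feature as the
# set of concrete values it leaves open (not the actual HDT scheme).

GENDERS = {"masc", "fem", "neut"}

def feature_values(label):
    """Expand a (possibly underspecified) gender label to the set of
    concrete values it is compatible with."""
    if label.startswith("not_"):
        return GENDERS - {label[len("not_"):]}
    return {label}

def compatible(a, b):
    """Two feature labels agree if their value sets overlap."""
    return bool(feature_values(a) & feature_values(b))

# A determiner marked not_fem agrees with a masculine or neuter
# noun, but not with a feminine one.
print(compatible("not_fem", "masc"))  # True
print(compatible("not_fem", "fem"))   # False
```

The point is that the annotator never has to commit to `masc` vs. `neut` when syntax cannot tell them apart.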
Thanks for clarifying - this looks really interesting, and we should add support for this dataset to Flair. A while back I trained Flair on the German UD corpus to predict case (dative, accusative, etc.); it would be interesting to train such a model on this dataset and see how well it works.
I trained a PoS tagging model (for Universal Dependencies tags). Training took ~4 days, so I highly recommend decreasing patience from 3 to 2. `embeddings_in_memory` should be set to `False` unless you have more than 128 GB of RAM :sweat_smile:
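For readers unfamiliar with the `patience` parameter: in flair's trainer it controls how many epochs without a dev-score improvement are tolerated before the learning rate is annealed (halved by default). A simplified stdlib sketch of that scheduling idea (not flair's actual implementation):

```python
# Simplified sketch of patience-based learning-rate annealing,
# the idea behind flair's scheduler (not its actual code).

def anneal_schedule(dev_scores, lr=0.1, patience=2, factor=0.5):
    """Halve the learning rate whenever `patience` consecutive epochs
    fail to beat the best dev score so far; return the lr per epoch."""
    best, bad, lrs = float("-inf"), 0, []
    for score in dev_scores:
        if score > best:
            best, bad = score, 0
        else:
            bad += 1
            if bad >= patience:
                lr *= factor
                bad = 0
        lrs.append(lr)
    return lrs

# With patience=2, two non-improving epochs trigger one halving.
print(anneal_schedule([0.90, 0.95, 0.94, 0.95], patience=2))
# [0.1, 0.1, 0.1, 0.05]
```

A lower patience anneals the learning rate sooner, which is why it shortens long training runs like this one.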
Accuracy was then 98.5% (using both training datasets).
@alanakbik I could upload the trained model (size: 4.5GB), so we can integrate it into flair if you want :)
@stefan-it yes that would be great! What embeddings are you using, i.e. why is the model so large?
I used the following embedding types (imports added for completeness):

```python
from typing import List

from flair.embeddings import (
    TokenEmbeddings,
    WordEmbeddings,
    PooledFlairEmbeddings,
)

embedding_types: List[TokenEmbeddings] = [
    WordEmbeddings('de-crawl'),
    PooledFlairEmbeddings('german-forward'),
    PooledFlairEmbeddings('german-backward'),
]
```
Hidden size was 256, batch size was 32.
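For anyone wanting to reproduce this, here is a rough end-to-end training sketch against the flair 0.4.2 API; the `UD_GERMAN_HDT` corpus class name, the output path, and the exact parameter names are assumptions pieced together from this thread and may differ in your flair version:

```python
from flair.datasets import UD_GERMAN_HDT  # assumed corpus class from the new dataset support
from flair.embeddings import WordEmbeddings, PooledFlairEmbeddings, StackedEmbeddings
from flair.models import SequenceTagger
from flair.trainers import ModelTrainer

# 1. load the corpus and build the UPOS tag dictionary
corpus = UD_GERMAN_HDT()
tag_dictionary = corpus.make_tag_dictionary(tag_type='upos')

# 2. stack the embeddings listed above
embeddings = StackedEmbeddings([
    WordEmbeddings('de-crawl'),
    PooledFlairEmbeddings('german-forward'),
    PooledFlairEmbeddings('german-backward'),
])

# 3. sequence tagger with the hidden size mentioned in this thread
tagger = SequenceTagger(hidden_size=256,
                        embeddings=embeddings,
                        tag_dictionary=tag_dictionary,
                        tag_type='upos')

# 4. train with the recommended settings (patience=2, embeddings kept out of memory)
trainer = ModelTrainer(tagger, corpus)
trainer.train('resources/taggers/upos-german-hdt',  # hypothetical output path
              mini_batch_size=32,
              patience=2,
              embeddings_in_memory=False)
```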
I'll do more experiments with different pooling operations: in another PoS tagging experiment, the `fade` pooling operation achieved worse results than normal Flair embeddings.
Ah interesting - generally, I'd expect the impact of pooling to be less pronounced on PoS tagging than on NER. It is probably OK to use normal `FlairEmbeddings` for PoS. That also has the advantage that models are smaller and training/inference is faster.
Closed by #696: Support is in master branch and will be part of 0.4.2. Thanks @stefan-it!