Flair: Classic Word Embedding baseline in POS far below that reported in the paper

Created on 27 Feb 2019 · 4 comments · Source: flairNLP/flair

Hi,

First thank you for the great work on this library! :)

I'm trying to replicate the "Classic Word Embedding + BiLSTM-CRF" result from http://aclweb.org/anthology/C18-1139 on the PTB POS dataset: 96.94 ± 0.02 accuracy.

I followed the instructions at https://github.com/zalandoresearch/flair/blob/master/resources/docs/EXPERIMENTS.md#penn-treebank-part-of-speech-tagging-english

My code and corpus statistics are available at https://gist.github.com/alexandres/a54506e31d038cce75f31d09c60c9df8

My corpus statistics exactly match those reported in https://nlp.stanford.edu/pubs/CICLing2011-manning-tagging.pdf (screenshot of the statistics table omitted).

Unfortunately my POS accuracy is around 94% with the "Classic Word Embedding + BiLSTM-CRF" using the Komninos embeddings.

Any idea what I'm doing wrong?

Note: I notice that the embeddings are not fine-tuned during training. There is no mention of this in the paper. Perhaps this is the cause?

Thanks!

Labels: question, wontfix


All 4 comments

Hi @alexandres, that is strange - your code looks good.

Could you try going back to Flair version 0.2 and running the experiment again with the instructions in the 0.2 EXPERIMENTS.md?
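For the rollback, a pip version pin along these lines should work (a sketch, assuming the 0.2.0 release is published on PyPI under the name `flair`):

```shell
# Roll back to the 0.2.0 release to reproduce the old training behaviour
pip install flair==0.2.0
```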

Thanks a bunch @alanakbik . That did it!

On the latest (pip install flair) release, the scores at the end of training:

2019-02-27 23:49:38,087 loading file resources/taggers/pos-extvec/best-model.pt     
2019-02-27 23:50:44,474 MICRO_AVG: acc 0.9358 - f1-score 0.9668                     
2019-02-27 23:50:44,475 MACRO_AVG: acc 0.876 - f1-score 0.9218586956521738          

On v0.2.0, after a single epoch:

0       (11:45:56)      11.641517       0       0.100000        DEV   7082      0.9462540222208731      TEST    7095 0.9452774307001712

So 0.9358 micro-average accuracy after 150 epochs on the latest release vs. 0.94625 on DEV after a single epoch on v0.2.0.
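For anyone comparing the two log formats, the v0.2.0 epoch line above can be pulled apart with a small stdlib-only parser (a sketch; the field layout is assumed from the single line shown here, not from flair documentation):

```python
def parse_v02_log_line(line):
    """Extract DEV and TEST accuracies from a flair v0.2.0 epoch log line.

    Assumes the whitespace-separated layout shown above:
    epoch (time) loss bad_epochs lr DEV <count> <acc> TEST <count> <acc>
    """
    fields = line.split()
    dev_acc = float(fields[fields.index("DEV") + 2])
    test_acc = float(fields[fields.index("TEST") + 2])
    return dev_acc, test_acc


line = ("0       (11:45:56)      11.641517       0       0.100000        "
        "DEV   7082      0.9462540222208731      TEST    7095 0.9452774307001712")
print(parse_v02_log_line(line))  # (0.9462540222208731, 0.9452774307001712)
```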

Thanks! You saved me a lot of time.

Cool, thanks for checking this out!

For us, this means we have to take a closer look at what changed between the versions. Generally, quality should get better with newer versions, not the other way around :) We'll take a look!

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

