Flair: Classic Word Embedding baseline in POS far below that reported in the paper

Created on 27 Feb 2019 · 4 comments · Source: flairNLP/flair

Hi,

First thank you for the great work on this library! :)

I'm trying to replicate the "Classic Word Embedding + BiLSTM-CRF" result from http://aclweb.org/anthology/C18-1139 on the PTB POS dataset: 96.94 ± 0.02 accuracy.

I followed the instructions at https://github.com/zalandoresearch/flair/blob/master/resources/docs/EXPERIMENTS.md#penn-treebank-part-of-speech-tagging-english

My code and corpus statistics are available at https://gist.github.com/alexandres/a54506e31d038cce75f31d09c60c9df8

My corpus statistics exactly match those reported in https://nlp.stanford.edu/pubs/CICLing2011-manning-tagging.pdf (screenshot of the statistics table omitted).

Unfortunately my POS accuracy is around 94% with the "Classic Word Embedding + BiLSTM-CRF" using the Komninos embeddings.

Any idea what I'm doing wrong?

Note: I notice that the embeddings are not fine-tuned during training. There is no mention of this in the paper. Perhaps this is the cause?

Thanks!

Labels: question, wontfix


All 4 comments

Hi @alexandres, that is strange - your code looks good.

Could you try going back to Flair version 0.2 and running the experiment again with the instructions in the 0.2 EXPERIMENTS.md?
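For the rollback, a pip version pin along these lines should work (a sketch, assuming the 0.2.0 release is published on PyPI under the name `flair`):

```shell
# Roll back to the 0.2.0 release to reproduce the old training behaviour
pip install flair==0.2.0
```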

Thanks a bunch @alanakbik . That did it!

On the latest (pip install flair) release, the scores at the end of training:

2019-02-27 23:49:38,087 loading file resources/taggers/pos-extvec/best-model.pt     
2019-02-27 23:50:44,474 MICRO_AVG: acc 0.9358 - f1-score 0.9668                     
2019-02-27 23:50:44,475 MACRO_AVG: acc 0.876 - f1-score 0.9218586956521738          

On v0.2.0, after a single epoch:

0       (11:45:56)      11.641517       0       0.100000        DEV   7082      0.9462540222208731      TEST    7095 0.9452774307001712

So 0.9358 micro-average accuracy after 150 epochs on the latest release vs. 0.94625 on DEV after a single epoch on v0.2.0.
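For anyone comparing the two log formats, the v0.2.0 epoch line above can be pulled apart with a small stdlib-only parser (a sketch; the field layout is assumed from the single line shown here, not from flair documentation):

```python
def parse_v02_log_line(line):
    """Extract DEV and TEST accuracies from a flair v0.2.0 epoch log line.

    Assumes the whitespace-separated layout shown above:
    epoch (time) loss bad_epochs lr DEV <count> <acc> TEST <count> <acc>
    """
    fields = line.split()
    dev_acc = float(fields[fields.index("DEV") + 2])
    test_acc = float(fields[fields.index("TEST") + 2])
    return dev_acc, test_acc


line = ("0       (11:45:56)      11.641517       0       0.100000        "
        "DEV   7082      0.9462540222208731      TEST    7095 0.9452774307001712")
print(parse_v02_log_line(line))  # (0.9462540222208731, 0.9452774307001712)
```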

Thanks! You saved me a lot of time.

Cool, thanks for checking this out!

For us, this means we have to take a closer look at what changed between the versions. Generally, quality should get better with newer versions, not the other way around :) We'll take a look!

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

