Flair: Train NER for Swedish

Created on 9 Jul 2018 · 16 comments · Source: flairNLP/flair

Train a simple NER tagger for Swedish, for instance on this dataset.

For this task, we need to adapt the NLPTaskDataFetcher to the appropriate Swedish dataset and train a simple model using Swedish word embeddings. How to train a model is illustrated here (and sketched below).

Swedish word embeddings can now be loaded with

embeddings = WordEmbeddings('sv-fasttext')
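
A minimal training sketch, assuming the data has already been split into IOB files. It uses the ColumnCorpus loader from later flair releases (this thread predates it, when NLPTaskDataFetcher was the entry point), and the directory, file names, and column layout are assumptions:

from flair.datasets import ColumnCorpus
from flair.embeddings import WordEmbeddings
from flair.models import SequenceTagger
from flair.trainers import ModelTrainer

# assumed column layout: token in column 0, NER tag in column 1
columns = {0: 'text', 1: 'ner'}

# assumed folder and file names for the split Swedish data
corpus = ColumnCorpus('data/swedish-ner', columns,
                      train_file='train.txt', dev_file='dev.txt', test_file='test.txt')

tagger = SequenceTagger(hidden_size=256,
                        embeddings=WordEmbeddings('sv-fasttext'),
                        tag_dictionary=corpus.make_tag_dictionary(tag_type='ner'),
                        tag_type='ner',
                        use_crf=True)

ModelTrainer(tagger, corpus).train('resources/taggers/swedish-ner', max_epochs=150)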

For issue #2

Labels: good first issue, help wanted, new language, wontfix

All 16 comments

I'm on vacation right now, so just for reference, if someone else wants to start on this: here is code to parse SUC 3.0 into IOB format. It's licensed so you can just copy parts of it into flair. https://github.com/EmilStenstrom/suc_to_iob

If no one else is working on this, I might give it a go!

@roshammar I'm not working on it, but am very interested in anything you can get working!

Great! I'm currently training a backward LM for Dutch (the forward one is already complete), so just let me know if you need an LM for Swedish :)

@stefan-it Sure, I'd be very interested in that!

I'm currently training the forward LM and will post back when the training has finished :)

LMs for Swedish are uploaded now:

wget https://schweter.eu/cloud/flair-lms/lm-sv-large-forward-v0.1.pt
wget https://schweter.eu/cloud/flair-lms/lm-sv-large-backward-v0.1.pt

On Universal Dependencies (v1.2), an accuracy of 96.59 % can be achieved using only fastText embeddings. Using the forward + backward language model, an accuracy of 98.32 % can be achieved. The current state of the art is Yasunaga et al. (2017), who achieve an accuracy of 96.70 % with adversarial training.

Feel free to integrate the language models in flair!
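
One way to plug the downloaded checkpoints into flair: FlairEmbeddings accepts a local path to an LM checkpoint (in the flair version of this thread the class was still called CharLMEmbeddings). Whether the 98.32 % run stacked the LMs with the fastText vectors is an assumption here:

from flair.embeddings import WordEmbeddings, FlairEmbeddings, StackedEmbeddings

# the two local paths point at the checkpoints downloaded above
embeddings = StackedEmbeddings([
    WordEmbeddings('sv-fasttext'),
    FlairEmbeddings('lm-sv-large-forward-v0.1.pt'),
    FlairEmbeddings('lm-sv-large-backward-v0.1.pt'),
])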

For the NER task: the dataset (suc_3.0_iob.txt) mentioned in the first two posts is not split into training, dev, and test sets, so we need to split the original dataset manually (one possible script is sketched below); maybe @alanakbik could do the splitting :)
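
A possible split script, assuming blank lines separate sentences in the IOB file and an 80/10/10 ratio; both are assumptions, not something the thread fixes:

import random

# blank lines separate sentences in the IOB file (assumed format)
with open('suc_3.0_iob.txt', encoding='utf-8') as f:
    sentences = [s for s in f.read().split('\n\n') if s.strip()]

random.seed(42)  # fixed seed so the split is reproducible
random.shuffle(sentences)

n = len(sentences)
splits = {'train.txt': sentences[:int(0.8 * n)],
          'dev.txt': sentences[int(0.8 * n):int(0.9 * n)],
          'test.txt': sentences[int(0.9 * n):]}

for name, part in splits.items():
    with open(name, 'w', encoding='utf-8') as out:
        out.write('\n\n'.join(part) + '\n')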

Hey this is great! We will absolutely include this in the 0.4 release - looks like we're getting serious about multilinguality :)

For 0.4 we just pushed a PR that randomly samples dev data from the training set if no dev split exists. I think we can add a similar mechanism for test data in this case!

So, finally I had the time to look at this.

I have now trained a first model on Swedish (dataset SUC 3.0), using only PRS, LOC, and ORG as entities.

Overall test score is 0.9121 (LOC 0.8575, ORG 0.6383, PRS 0.9298).

I have observed many errors in the training data, both TP and FP, so I will try to improve the data and run more experiments to get even better results.
I will also train other models with more NER tags than PRS, ORG, and LOC.

A thought: currently this is trained at the sentence level. Would it not be beneficial to train at the document level, since the same entity might then be mentioned several times, increasing our confidence? How long can the sequences be?

And, of course, a big thank you to @stefan-it for the models!

Hey this is great - thanks for sharing the results!

Yes, we've been thinking a lot about how to get better document-level information into the classifier. The current release-0.4 branch has a simple prototype embeddings class for one way to do this, called FlairEmbeddings. It embeds and averages over all sentence words in a batch and also keeps a memory of previously embedded words. It looks like this gives us an F1-score boost, but we are still tinkering, so the class might still change.
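
For reference, the memory-keeping prototype described above later shipped in flair as PooledFlairEmbeddings; a minimal sketch, assuming the "sv-forward" model name confirmed later in this thread and an example sentence of my own:

from flair.data import Sentence
from flair.embeddings import PooledFlairEmbeddings

# pools each word's current embedding with a memory of its previously
# computed contextual embeddings (pooling can be 'min', 'max' or 'mean')
embedding = PooledFlairEmbeddings('sv-forward', pooling='min')

sentence = Sentence('Volvo grundades i Göteborg .')
embedding.embed(sentence)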

Any ideas / contributions in this space will be very welcome :)

Is there anything else that's required to train an NER model for Swedish? Can I help somehow? Any updates, @roshammar @stefan-it?

What is the location of the model?

Hi @thak123, you can use the Flair embeddings with:

from flair.embeddings import FlairEmbeddings
forward_embeddings = FlairEmbeddings("sv-forward")
backward_embeddings = FlairEmbeddings("sv-backward")

Details about the amount of data for training these embeddings can be found here :)
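
A quick smoke test that the embeddings load and produce vectors (the example sentence is my own):

from flair.data import Sentence
from flair.embeddings import FlairEmbeddings

embedding = FlairEmbeddings("sv-forward")

sentence = Sentence("Stockholm är Sveriges huvudstad .")
embedding.embed(sentence)

for token in sentence:
    print(token.text, token.embedding.shape)  # one vector per token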

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stefan-it no news I guess? :/
