I would like to train my own contextual string embeddings from scratch, since my target corpus is very idiosyncratic and likely won't work very well with the pre-trained ones (historically diverse OCRed scanned books/newspapers).
Unfortunately I could not find a way to train these in the code. Are there plans to add something like the TagTrainer for the CharLmEmbeddings?
Thanks for your interest!
We are currently working on the next release, which will also contain a trainer for contextual string embeddings (see https://github.com/zalandoresearch/flair/issues/17). If everything goes well, we will publish it at the end of this week. So please just wait a couple more days :)
Thanks tabergma! My issue goes along the same lines as what jbaiter was saying.
I wanted to write a script to do what jbaiter mentioned earlier.
Could you elaborate on how to do that?
As far as I could understand from the paper, you would have to do a forward pass and a backward pass over the corpus to acquire the fLM and bLM.
Would it be equivalent to getting char embeddings of the whole corpus sentence by sentence and then passing them through a BiLSTM?
Hi Oushesh,
we will include a tutorial on how to train your own CharLM embeddings with the upcoming release, probably in a few days! For best results, you need to train a forward and a backward character language model over a large corpus, preferably from the same domain as your downstream task(s). So it is two separate LSTMs rather than one BiLSTM: one forward, trained to predict the next char, and one backward, trained to predict the previous char.
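To make that split concrete, here is a minimal PyTorch sketch of such a character LM. All names and dimensions here are illustrative, not the ones we actually ship:

```python
import torch.nn as nn

class CharLM(nn.Module):
    """Character-level LM: embed chars, run an LSTM, predict a char distribution."""
    def __init__(self, n_chars, emb_dim=100, hidden_size=1024):
        super().__init__()
        self.embedding = nn.Embedding(n_chars, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden_size, batch_first=True)
        self.decoder = nn.Linear(hidden_size, n_chars)

    def forward(self, char_ids):  # char_ids: (batch, seq_len)
        hidden_states, _ = self.lstm(self.embedding(char_ids))
        return self.decoder(hidden_states)  # logits over the char vocabulary

# forward LM: the target at position t is the char at t+1
# backward LM: the same architecture trained on the reversed char stream,
# so its target at position t is the previous char
forward_lm = CharLM(n_chars=275)
backward_lm = CharLM(n_chars=275)
```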
Are you planning on training a language model for a particular language? If you're interested in English or German, the current release already packages pre-trained language models for training your own sequence labelers in these languages.
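For example, with the current release you can already stack the packaged English models along these lines (a sketch; check the embeddings tutorial for the exact model identifiers):

```python
from flair.data import Sentence
from flair.embeddings import CharLMEmbeddings, StackedEmbeddings

# pre-trained forward and backward contextual string embeddings for English
embeddings = StackedEmbeddings([
    CharLMEmbeddings('news-forward'),
    CharLMEmbeddings('news-backward'),
])

sentence = Sentence('The grass is green .')
embeddings.embed(sentence)
```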
Hi Alan,
Thanks for your response. I am working with Keras with a TensorFlow backend and implementing the paper. The current model loads the .pt file (a PyTorch model, I guess).
I will train the character model on the 1 Billion Word benchmark corpus from Chelba et al., with the settings from the paper: batch_size=100, gradient clipping=0.25, dropout=0.25, optimizer=SGD. My LM layer currently looks like this:
```python
output = LSTM(200, return_sequences=True, dropout=dropout)(input)
```
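For context, here is roughly how that layer could slot into a full forward char LM in Keras. The vocab_size, seq_len, and learning rate are placeholders I'm assuming, not values from the paper:

```python
from keras.layers import Dense, Embedding, Input, LSTM, TimeDistributed
from keras.models import Model
from keras.optimizers import SGD

vocab_size = 256   # placeholder: size of the character vocabulary
seq_len = 100      # placeholder: truncated-BPTT window
dropout = 0.25

char_ids = Input(shape=(seq_len,))
x = Embedding(vocab_size, 100)(char_ids)                  # char embeddings
x = LSTM(200, return_sequences=True, dropout=dropout)(x)  # the layer above
next_char = TimeDistributed(Dense(vocab_size, activation='softmax'))(x)

model = Model(char_ids, next_char)
model.compile(loss='sparse_categorical_crossentropy',
              optimizer=SGD(lr=0.1, clipnorm=0.25))       # SGD with gradient clipping
```

The backward LM would then be the same model trained on reversed character sequences.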
Thanks for the implementation and the documentation :heart: I will definitely try this out in the upcoming days :)
Release 0.2 adds a trainer class and a tutorial for training your own language models. `git pull` or `pip install flair --upgrade` to get the newest version!
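For anyone landing here later, training a model now looks roughly like this (following the new tutorial; the corpus path and hyperparameters are placeholders, and import paths may differ slightly between versions):

```python
from flair.data import Dictionary
from flair.models import LanguageModel
from flair.trainers.language_model_trainer import LanguageModelTrainer, TextCorpus

is_forward_lm = True  # set to False to train the backward model on the same corpus

# default character dictionary shipped with flair
dictionary = Dictionary.load('chars')

# corpus folder containing a train/ directory of split files plus valid.txt and test.txt
corpus = TextCorpus('/path/to/your/corpus', dictionary, is_forward_lm,
                    character_level=True)

language_model = LanguageModel(dictionary, is_forward_lm,
                               hidden_size=1024, nlayers=1)

trainer = LanguageModelTrainer(language_model, corpus)
trainer.train('resources/language_models/my_forward', sequence_length=250,
              mini_batch_size=100, max_epochs=10)
```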