Flair: Fine tuning of embeddings in training classification model

Created on 26 Apr 2019 · 2 Comments · Source: flairNLP/flair

Hi,

I'm currently trying out various word embeddings with DocumentRNNEmbeddings to train a text classification model. In this context I'm trying to understand the behaviour of fine-tuning of the underlying word embeddings.

Does using pre-trained BERT embeddings (say 'bert-large-uncased') for this task also fine-tune the underlying BERT model weights? Or does the BERT model stay 'fixed' during training? It would also be good to understand the fine-tuning behaviour of the other word embeddings when training a text classification model.

Thanks for your help.

question


All 2 comments

Hello @bhavikm - currently, most word embeddings (BERT, ELMo, Flair, GloVe) stay fixed in our implementation, but there is the option to add a linear layer on top of each embedding that learns an updated representation of each word before it gets passed into the document RNN. You can enable this layer by setting reproject_words to True. The weights of this linear layer and of the BiLSTM are then updated during training. We do a similar thing for word-level sequence labeling tasks (see #632) to achieve some of the effect of fine-tuning.

However, this is only a "reprojection" of the word embedding into another embedding space, not a fine-tuning of the process that generates the embedding. In previous versions of Flair, we had the option to fine-tune FlairEmbeddings, but we took it out because we thought no one was using it. Now, with many people asking about fine-tuning, I think we will put it back and look into supporting this feature for other embeddings such as BERT in a future version.

Hope this clarifies!
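For anyone curious about the mechanics, the reprojection idea can be sketched in plain PyTorch: the pre-trained embedding table is frozen, while a trainable linear layer remaps each word vector before it enters the BiLSTM. This is a simplified illustration of the concept, not Flair's actual internals; all dimensions and names below are made up for the example.

```python
import torch
import torch.nn as nn

vocab_size, emb_dim, hidden = 100, 32, 16

# Pre-trained embedding table stays fixed (no gradient updates).
embedding = nn.Embedding(vocab_size, emb_dim)
embedding.weight.requires_grad = False

# Trainable reprojection layer (the effect of reproject_words=True).
reproject = nn.Linear(emb_dim, emb_dim)

# BiLSTM that consumes the reprojected word vectors.
rnn = nn.LSTM(emb_dim, hidden, bidirectional=True, batch_first=True)

tokens = torch.randint(0, vocab_size, (1, 5))   # one sentence, 5 token ids
reprojected = reproject(embedding(tokens))      # (1, 5, emb_dim)
output, (h_n, _) = rnn(reprojected)

# Document vector from the final hidden states of both directions.
doc_vector = torch.cat([h_n[0], h_n[1]], dim=1)  # (1, 2 * hidden)

# A dummy loss: only the reprojection and LSTM weights receive gradients.
doc_vector.sum().backward()
assert embedding.weight.grad is None      # embeddings stay fixed
assert reproject.weight.grad is not None  # reprojection is trained
```

In Flair itself, this corresponds to passing reproject_words=True (and optionally reproject_words_dimension) to DocumentRNNEmbeddings.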

Thanks @alanakbik for the reply. In my experience with text classification, I have almost always had improved results from fine-tuning the embeddings. The BERT paper also recommends fine-tuning on downstream tasks, so I think it would be a good feature to include in the future.
