Flair: How to make BERT trainable?

Created on 29 Nov 2019 · 8 comments · Source: flairNLP/flair

Can you please explain how I can make the BERT model used in BertEmbeddings trainable when training a sequence tagging model?

question

All 8 comments

Any news on this? As soon as I start using BertEmbeddings I get CUDA OOM errors and I'm unable to find much on how to manage GPU memory in Flair.

Hi @PradyumnaGupta ,

fine-tuning a BERT model is currently not possible in Flair. But you can use the fine-tuning example from the Hugging Face Transformers library: https://github.com/huggingface/transformers/tree/master/examples/ner.

After you've fine-tuned your model, you can load it with Flair :)
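For reference, a minimal sketch of that last step, assuming the fine-tuned model was saved to a local directory (the path below is a placeholder; BertEmbeddings is the class name in Flair at the time of this thread):

    # Minimal sketch: load a locally fine-tuned BERT into Flair's BertEmbeddings.
    # "./my-finetuned-bert" is a placeholder for the output directory of the
    # Hugging Face fine-tuning script (config, vocab and model weights).
    from flair.data import Sentence
    from flair.embeddings import BertEmbeddings

    embeddings = BertEmbeddings("./my-finetuned-bert")

    sentence = Sentence("George Washington went to Washington .")
    embeddings.embed(sentence)

    for token in sentence:
        print(token.text, token.embedding.shape)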

@stefan-it Is there a difference in what "fine-tuning" means here? Fine-tuning in the context of downstream tasks (NER, ...) is possible with Flair, judging by the tutorial. Or am I mistaking the term "fine-tuning"?

@pascalhuszar See my answer in #1527 - on the master branch, you can now fine-tune BERT and other transformer embeddings during task training.

Once we're done testing this, we'll do a release of Flair that adds support for fine-tuning transformers. This seems to work especially well for text classification - for sequence labeling we still get the best results with a feature-based approach (i.e. no fine-tuning).
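For anyone reading along, here is a rough sketch of what that fine-tuning setup looks like (exact class names and arguments may differ across Flair versions; the learning rate, batch size and epoch count are typical fine-tuning values, not official recommendations):

    # Sketch: fine-tune transformer embeddings inside a SequenceTagger.
    # fine_tune=True unfreezes the transformer weights so they are updated
    # together with the tagger during task training.
    import torch
    from flair.datasets import CONLL_03
    from flair.embeddings import TransformerWordEmbeddings
    from flair.models import SequenceTagger
    from flair.trainers import ModelTrainer

    corpus = CONLL_03()  # assumes the CoNLL-03 files are available locally
    tag_dictionary = corpus.make_tag_dictionary(tag_type="ner")

    embeddings = TransformerWordEmbeddings("bert-base-cased", fine_tune=True)

    tagger = SequenceTagger(
        hidden_size=256,
        embeddings=embeddings,
        tag_dictionary=tag_dictionary,
        tag_type="ner",
    )

    # Adam-style optimizer with a very low learning rate and few epochs,
    # as is typical when fine-tuning transformers.
    trainer = ModelTrainer(tagger, corpus, optimizer=torch.optim.AdamW)
    trainer.train(
        "resources/taggers/bert-finetuned-ner",
        learning_rate=5e-6,
        mini_batch_size=16,
        max_epochs=5,
    )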

Thanks for the fast reply! @alanakbik I'm a bit confused by the terms "training" and "fine-tuning": do they mean the same thing in the context of tutorial 7?
Or is "training" the term for the downstream task (e.g. NER, ...)? And what is "fine-tuning" then?

Fine-tuning is a special case of training where we start with an existing (i.e. already trained) model and just "fine-tune" (make slight modifications to) the weights for a new task. When not fine-tuning (i.e. normal "training"), we start with a randomly initialized model and train it from scratch.

Generally, for NER there are two broad ways of creating taggers using language models (LMs):

  1. Train an RNN+CRF from scratch on top of an LM. In this case, the LM weights are frozen, so only the RNN+CRF is trained. Here one typically uses SGD with annealing and early stopping. This is how we have been doing it in Flair for NER.
  2. Place a linear layer on top of the LM. In this case, the LM weights are not frozen but fine-tuned during NER training. Here one typically uses Adam with a very low learning rate and hard-codes a very small number of epochs. This is how the transformers examples do NER.
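A minimal PyTorch sketch contrasting the two setups (illustrative only, not Flair internals; the model name, tag count and learning rates are assumptions):

    import torch
    from transformers import BertModel

    num_tags = 9  # e.g. the CoNLL-03 BIO tag set
    lm = BertModel.from_pretrained("bert-base-cased")
    head = torch.nn.Linear(lm.config.hidden_size, num_tags)

    # 1. Feature-based: freeze the LM and train only the layers on top
    #    (in Flair's case an RNN+CRF), typically with SGD + annealing.
    for p in lm.parameters():
        p.requires_grad = False
    optimizer = torch.optim.SGD(head.parameters(), lr=0.1)

    # 2. Fine-tuning: update the LM weights too, with a very low learning
    #    rate and a small, hard-coded number of epochs.
    for p in lm.parameters():
        p.requires_grad = True
    optimizer = torch.optim.AdamW(
        list(lm.parameters()) + list(head.parameters()), lr=5e-6
    )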

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
