Flair: How to make BERT trainable?

Created on 29 Nov 2019 · 8 comments · Source: flairNLP/flair

Can you please explain how I can make the BERT model used in BertEmbeddings trainable when training a sequence tagging model?

question

All 8 comments

Any news on this? As soon as I start using BertEmbeddings I get CUDA OOM errors and I'm unable to find much on how to manage GPU memory in Flair.

Hi @PradyumnaGupta ,

fine-tuning a BERT model is currently not possible in Flair. But you can use the fine-tuning example from the Hugging Face Transformers library: https://github.com/huggingface/transformers/tree/master/examples/ner.

After you've fine-tuned your model, you can load it with Flair :)
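For reference, a minimal sketch of that last step, assuming the fine-tuned model was saved to a local directory (the path below is a placeholder; BertEmbeddings is the class name in Flair at the time of this thread):

    # Minimal sketch: load a locally fine-tuned BERT into Flair's BertEmbeddings.
    # "./my-finetuned-bert" is a placeholder for the output directory of the
    # Hugging Face fine-tuning script (config, vocab and model weights).
    from flair.data import Sentence
    from flair.embeddings import BertEmbeddings

    embeddings = BertEmbeddings("./my-finetuned-bert")

    sentence = Sentence("George Washington went to Washington .")
    embeddings.embed(sentence)

    for token in sentence:
        print(token.text, token.embedding.shape)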

@stefan-it Is there a difference in what "fine-tuning" means here? Fine-tuning in the context of downstream tasks (NER, ...) is possible with Flair, judging by the tutorial. Or am I mistaking the term "fine-tuning"?

@pascalhuszar See my answer in #1527 - on the master branch, you can now fine-tune BERT and other transformer embeddings during task training.

Once we're done testing this, we'll do a release of Flair that adds support for fine-tuning transformers. This seems to work especially well for text classification - for sequence labeling we still get the best results with a feature-based approach (i.e. no fine-tuning).
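For anyone reading along, here is a rough sketch of what that fine-tuning setup looks like (exact class names and arguments may differ across Flair versions; the learning rate, batch size and epoch count are typical fine-tuning values, not official recommendations):

    # Sketch: fine-tune transformer embeddings inside a SequenceTagger.
    # fine_tune=True unfreezes the transformer weights so they are updated
    # together with the tagger during task training.
    import torch
    from flair.datasets import CONLL_03
    from flair.embeddings import TransformerWordEmbeddings
    from flair.models import SequenceTagger
    from flair.trainers import ModelTrainer

    corpus = CONLL_03()  # assumes the CoNLL-03 files are available locally
    tag_dictionary = corpus.make_tag_dictionary(tag_type="ner")

    embeddings = TransformerWordEmbeddings("bert-base-cased", fine_tune=True)

    tagger = SequenceTagger(
        hidden_size=256,
        embeddings=embeddings,
        tag_dictionary=tag_dictionary,
        tag_type="ner",
    )

    # Adam-style optimizer with a very low learning rate and few epochs,
    # as is typical when fine-tuning transformers.
    trainer = ModelTrainer(tagger, corpus, optimizer=torch.optim.AdamW)
    trainer.train(
        "resources/taggers/bert-finetuned-ner",
        learning_rate=5e-6,
        mini_batch_size=16,
        max_epochs=5,
    )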

Thanks for the fast reply! @alanakbik I'm a bit confused by the terms "training" and "fine-tuning": do they mean the same thing in the context of tutorial 7?
Or is "training" the term for the downstream task (e.g. NER, ...)? And what is "fine-tuning" then?

Fine-tuning is a special case of training where we start with an existing (i.e. already trained) model and just "fine-tune" (make slight modifications to) the weights for a new task. When not fine-tuning (i.e. normal "training"), we start with a randomly initialized model and train it from scratch.

Generally, for NER there are two broad ways of creating taggers using language models (LMs):

  1. Train an RNN+CRF from scratch on top of an LM. In this case, the LM weights are frozen, so only the RNN+CRF is trained. Here one typically uses SGD with annealing and early stopping. This is how we have been doing it in Flair for NER.
  2. Place a linear layer on top of the LM. In this case, the LM weights are not frozen but fine-tuned during NER training. Here one typically uses Adam with a very low learning rate and hard-codes a very small number of epochs. This is how the transformers examples do NER.
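A minimal PyTorch sketch contrasting the two setups (illustrative only, not Flair internals; the model name, tag count and learning rates are assumptions):

    import torch
    from transformers import BertModel

    num_tags = 9  # e.g. the CoNLL-03 BIO tag set
    lm = BertModel.from_pretrained("bert-base-cased")
    head = torch.nn.Linear(lm.config.hidden_size, num_tags)

    # 1. Feature-based: freeze the LM and train only the layers on top
    #    (in Flair's case an RNN+CRF), typically with SGD + annealing.
    for p in lm.parameters():
        p.requires_grad = False
    optimizer = torch.optim.SGD(head.parameters(), lr=0.1)

    # 2. Fine-tuning: update the LM weights too, with a very low learning
    #    rate and a small, hard-coded number of epochs.
    for p in lm.parameters():
        p.requires_grad = True
    optimizer = torch.optim.AdamW(
        list(lm.parameters()) + list(head.parameters()), lr=5e-6
    )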

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
