Hey everyone.
I'm trying to understand the differences between using various word embeddings (BERT, XLM, ...) with this framework and using the same embeddings in another framework, e.g. Hugging Face.
To be more precise:
When using BertEmbeddings('bert-base-cased') here and fine-tuning for an NER task (CoNLL-03), what are the key differences compared to using 'bert-base-cased' with Hugging Face?
Is the performance similar?
For now, a BERT model is not fine-tuned in Flair.
Please correct me if I'm wrong: with BertEmbeddings('bert-base-cased') I'm fine-tuning a model under Flair, which will be a "custom" model, whereas in Hugging Face 'bert-base-cased' is an already fine-tuned model?
But what happens when I use them in a downstream task like NER? Will they yield nearly the same results?
Flair does not support fine-tuning BERT: the BERT weights are frozen.
We've actually added this functionality to the master branch and it's currently being tested. You can now instantiate any transformer embedding like this:
from flair.embeddings import TransformerWordEmbeddings

embeddings = TransformerWordEmbeddings(
    'distilbert-base-uncased',       # which transformer model
    layers='-1',                     # which layers (here: only the last layer when fine-tuning)
    pooling_operation='first_last',  # how to pool over subword-split tokens
    fine_tune=True,                  # whether or not to fine-tune
)
By setting fine_tune to either True or False, you can select whether to fine-tune the embeddings during training. For instance, to fine-tune a transformer model for sequence labeling, you could use code like this:
from flair.models import SequenceTagger

# sequence tagger with fine-tuneable transformer embeddings and no RNN or CRF
tagger: SequenceTagger = SequenceTagger(
    hidden_size=256,
    embeddings=embeddings,
    tag_dictionary=tag_dictionary,
    tag_type='ner',
    use_crf=False,
    use_rnn=False,
)
import torch
from flair.trainers import ModelTrainer

# use the Adam optimizer when fine-tuning
trainer = ModelTrainer(tagger, corpus, optimizer=torch.optim.Adam)

# fine-tune with settings from the BERT paper
trainer.train(f'resources/taggers/ner-distilbert-base-uncased-{run}-prestarts-256',
              learning_rate=3e-5,       # very low learning rate
              mini_batch_chunk_size=2,  # set this if you get OOM errors
              max_epochs=4,             # very few epochs of fine-tuning
              )
This is only in the master branch, so if you pip install flair you won't yet be able to do this.
This sounds interesting. I've just upgraded to the master branch and I'm trying to follow your instructions, but with a TextClassifier I run into some errors. Is there any chance of a similar code snippet that would work for text classification?
For text classification, you should use the TransformerDocumentEmbeddings variant, i.e.
import torch
from flair.embeddings import TransformerDocumentEmbeddings
from flair.models import TextClassifier
from flair.trainers import ModelTrainer

transformer_model = 'bert-base-cased'
embeddings = TransformerDocumentEmbeddings(model=transformer_model, fine_tune=True)

corpus = [....]  # load your corpus here

# make a label dictionary for your corpus
label_dict = corpus.make_label_dictionary()
print(label_dict)

# instantiate the classifier with the embeddings and label dictionary
model: TextClassifier = TextClassifier(embeddings, label_dict)

# use the Adam optimizer
trainer = ModelTrainer(model, corpus, optimizer=torch.optim.Adam)

trainer.train(
    'path/to/output/folder',
    learning_rate=3e-5,       # low learning rate, as per the BERT paper
    mini_batch_size=256,      # set this high if you have lots of data, otherwise low
    mini_batch_chunk_size=2,  # set this low if you experience memory errors
    max_epochs=4,             # very few epochs of fine-tuning
    min_learning_rate=3e-6,   # lower the minimum learning rate
)
Thank you very much for such a quick response!
The code works, but now I'm starting to wonder what it actually does.
Before, when I used the DocumentRNNEmbeddings class, I could specify e.g. the type of RNN and its parameters, and that RNN was trained to create the document-level embeddings. Now I'm fine-tuning the whole model (the last layer, as I understand it), for instance DistilBERT. But how are the document-level embeddings created with TransformerDocumentEmbeddings? Is there any way to feed this fine-tuned DistilBERT to a similar RNN as in DocumentRNNEmbeddings?
The whole model is trained (fine-tuned), and by default the last-layer embeddings are taken for each token. In the case of text classification with BERT, the embedding of the CLS token represents the text.
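Concretely, for a sequence of last-layer token vectors, "use the CLS embedding" just means taking the vector at position 0, rather than e.g. averaging all token vectors. A dependency-free toy sketch with made-up numbers (no real model involved):

```python
# Toy illustration of CLS pooling vs. mean pooling over token vectors.
# The numbers are invented; a real model would produce them from text.

# Last-layer hidden states for a 4-token sequence: [CLS], "good", "movie", [SEP]
hidden_states = [
    [0.1, 0.2, 0.3],   # position 0: [CLS]
    [0.4, 0.5, 0.6],
    [0.7, 0.8, 0.9],
    [0.2, 0.1, 0.0],
]

def cls_pool(states):
    """Document embedding = hidden state of the [CLS] token (position 0)."""
    return states[0]

def mean_pool(states):
    """Document embedding = element-wise mean over all token states."""
    n = len(states)
    return [sum(col) / n for col in zip(*states)]

doc_cls = cls_pool(hidden_states)    # [0.1, 0.2, 0.3]
doc_mean = mean_pool(hidden_states)  # approximately [0.35, 0.4, 0.45]
```

Mean pooling here corresponds to what DocumentPoolEmbeddings does by default, while CLS pooling is the idea behind TransformerDocumentEmbeddings with BERT-style models.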
Thank you again.
So by using the CLS token, I'm effectively using BERT's built-in way of creating a document-level embedding, right?
And if I wanted to avoid using the CLS output, could I also, theoretically, take this whole saved fine-tuned model and feed it again, this time to an LSTM in DocumentRNNEmbeddings? Are there any tips for feeding the fine-tuned model path to DocumentRNNEmbeddings instead of, for instance, BertEmbeddings('bert-base-cased')?
I have not seen such an approach in papers, but it is doable.
If we have more token features (e.g. OneHotEmbeddings) then this sounds fine, but maybe without separate fine-tuning; I would try fine-tuning within DocumentRNNEmbeddings.
OK, forgive my ignorance, but I got confused and I'm not sure anymore that I understand everything correctly.
To create document-level embeddings in Flair, I can use a selected LM (e.g. BERT in its pre-trained version) via:
1) DocumentPoolEmbeddings with pooling: str = "mean" as the default option; this involves no training at all;
2) DocumentRNNEmbeddings with an RNN and fine_tune=True as default. I understand that this trains the RNN (fed with token embeddings provided by the selected LM) on my corpus, but now I have to ask: does it also train (fine-tune) the LM as well? So are the RNN and the LM trained together?
3) TransformerDocumentEmbeddings, which uses neither "mean" pooling nor an RNN; it uses the LM's own architecture and the CLS output to provide the document-level embedding. I see the LM can be fine-tuned with this option. My previous question, which @djstrong already answered, was about the possibility of using TransformerDocumentEmbeddings to fine-tune the LM on its own, save the model, and later feed it to DocumentRNNEmbeddings to compare whether that changes anything.
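To make the contrast with option 2 concrete, here is a dependency-free toy sketch of what an RNN document encoder does on top of per-token embeddings: a single Elman-style recurrence whose final hidden state serves as the document embedding. The weights below are made up for illustration; in DocumentRNNEmbeddings they would be learned during training on your corpus, and the actual implementation (LSTM, batching, etc.) is of course more involved:

```python
import math

def rnn_document_embedding(token_vectors, w_in, w_rec, hidden_size):
    """Run a minimal Elman-style RNN over token vectors; the final
    hidden state serves as the document-level embedding (option 2)."""
    h = [0.0] * hidden_size
    for x in token_vectors:
        # h_t = tanh(W_in @ x_t + W_rec @ h_{t-1})
        h = [
            math.tanh(
                sum(w_in[i][j] * x[j] for j in range(len(x)))
                + sum(w_rec[i][k] * h[k] for k in range(hidden_size))
            )
            for i in range(hidden_size)
        ]
    return h

# Tiny example: 2-dim token embeddings, 2-dim hidden state
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
w_in = [[0.5, -0.5], [0.25, 0.25]]   # input weights (hand-picked)
w_rec = [[0.1, 0.0], [0.0, 0.1]]     # recurrent weights (hand-picked)
doc_vec = rnn_document_embedding(tokens, w_in, w_rec, hidden_size=2)
```

Unlike the pooling of option 1, this encoder is order-sensitive: reversing the token sequence generally produces a different document vector.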
Ad. 2: It depends on the embeddings used; some are fine-tunable and some are frozen.
Ad. 3: You can do it. Train TransformerDocumentEmbeddings, then extract the BERT model and use it in DocumentRNNEmbeddings.
Thank you again for answering.
Ad. 2: Is there a list of models that support fine-tuning in Flair?
Ad. 3. I will give it a try.
Ad. 3: How can I "extract the BERT model" as @djstrong mentioned? I tried to follow https://github.com/flairNLP/flair/blob/a49bc42eb38de21d62c50532ae3d91430aab7213/resources/docs/TUTORIAL_9_TRAINING_LM_EMBEDDINGS.md so I literally did:
word_embeddings = [RoBERTaEmbeddings("./data/model_sentiment_0/roblftadam_best-model.pt")] (because I have fine-tuned a RoBERTa Large model, and as I understand it this is the same architecture and should work as well) and get:
ValueError: Calling RobertaTokenizer.from_pretrained() with the path to a single file or url is not supported. Use a model identifier or the path to a directory instead.
I will share some results that confirm that using the CLS output from a fine-tuned transformer model for text classification seems to be the best option (better than an LSTM). I work on a 2000-tweet corpus with 5 sentiment classes and extract features in various ways, namely:
1) LIWC (Linguistic Inquiry and Word Count);
2) a simple Term Frequency (TF) model with selection of the 100 most important features (by means of the Mutual Information method);
3) Deep Learning fastText embeddings computed for each token separately (100-dim vector) and simply averaged to create a tweet-level representation (also 100-dim);
4) Deep Learning RoBERTa Large embeddings computed for each token separately (1024-dim vector) and simply averaged to create a tweet-level representation (also 1024-dim);
5) Deep Learning fastText embeddings computed for each token separately (100-dim vector) + an LSTM creating a 512-dim tweet-level vector (so this means training an LSTM to create the tweet-level vector);
6) Deep Learning RoBERTa Large embeddings computed for each token separately (1024-dim vector) + an LSTM creating a 512-dim tweet-level vector (so this means training an LSTM + fine-tuning the RoBERTa Large model);
7) Deep Learning Universal Sentence Encoder (USE), a pre-trained model outputting not token-level embeddings but tweet-level embeddings directly (512-dim vector);
8) Deep Learning RoBERTa Large embeddings from the CLS output after fine-tuning with the SGD optimizer;
9) Deep Learning RoBERTa Large embeddings from the CLS output after fine-tuning with the Adam optimizer.
All extracted features are fed separately to a Gradient Boosting (GB) model (n_estimators=250). One exception: I also train a Naive Bayes (NB) classifier on the features derived by the simple Term Frequency (TF) model, as this combination is known to provide higher-quality results.
Results (Matthews Correlation Coefficient scores; everything is 5-fold cross-validated, and the fine-tuning of LMs and LSTMs is carried out on the same train/val/test splits as used in the final ML stage):
1) LIWC GB 0.375 +/- 0.046
2) TF NB 0.405 +/- 0.023 and TF GB 0.375 +/- 0.047
3) FT GB 0.398 +/- 0.036 (pooled)
4) RL GB 0.465 +/- 0.033 (pooled)
5) FT LSTM GB 0.426 +/- 0.02
6) RL LSTM GB 0.528 +/- 0.028
7) USE GB 0.439 +/- 0.018
8) RLFT GB 0.407 +/- 0.029 (fine-tuned with default SGD optimizer)
9) RLFTA GB 0.578 +/- 0.068 (fine-tuned with Adam optimizer)
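For readers unfamiliar with the metric: the Matthews Correlation Coefficient used for the scores above ranges from -1 to 1 (0 being chance level). A minimal sketch of the binary case (the 5-class results above use the multiclass generalization; in practice sklearn.metrics.matthews_corrcoef handles both):

```python
import math

def mcc_binary(y_true, y_pred):
    """Matthews Correlation Coefficient for binary labels (0/1).

    MCC = (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN))
    Returns 0.0 when any marginal sum is zero (degenerate case).
    """
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

print(mcc_binary([1, 1, 0, 0], [1, 1, 0, 0]))  # perfect prediction -> 1.0
print(mcc_binary([1, 1, 0, 0], [0, 0, 1, 1]))  # inverted prediction -> -1.0
```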
So the fine-tuning regime proposed here by @alanakbik allowed me to outperform everything else that I tried here.
@krzysztoffiok thanks for sharing these results! Yes, for document-level classification tasks the fine-tuning transformer approach is currently the state-of-the-art and in many cases significantly outperforms previous approaches such as LSTM over words. If you have the computational power, this is currently the go-to approach.
Thanks for your answer. If I were looking for a knowledge source that describes the CLS output and how it is actually trained, what would you recommend?
The CLS token is described in the original BERT paper, and there are many blog posts about BERT that also describe the CLS token; those are probably helpful.
Thank you very much for all these answers. I did of course read that paper before, but now, thanks to this discussion and a better understanding, it was a whole new experience. So I followed the proposal from the original BERT paper and took a mix of features from the 4 last layers of RoBERTa, fed to a 2-layer biLSTM (only with 512 rather than 768 hidden states), and the results are close to those from fine-tuning with the CLS representation:
mix of 4 last layers + 2-layer biLSTM (512) gave: 0.553 +/- 0.083
fine-tuning and CLS gave: 0.578 +/- 0.068