Hey everyone.
I'm trying to understand the differences between using various word embeddings (BERT, XLM, ...) with this framework and using the same embeddings in another framework, e.g. Hugging Face.
To be more precise:
When using BertEmbeddings('bert-base-cased') here and fine-tuning for an NER task (CoNLL-03), what are the key differences compared to using 'bert-base-cased' with Hugging Face?
Is the performance similar?
For now, a BERT model is not fine-tuned in Flair.
Please correct me if I'm wrong: with BertEmbeddings('bert-base-cased') I'm fine-tuning a model under Flair, which will be a "custom" model, whereas in Hugging Face 'bert-base-cased' is an already fine-tuned model?
But what happens when I use them in a downstream task like NER? Will they yield nearly the same results?
Flair does not support fine-tuning BERT: the BERT weights are frozen.
We've actually added this functionality to the master branch and it's currently being tested. You can now instantiate any transformer embedding like this:
from flair.embeddings import TransformerWordEmbeddings

embeddings = TransformerWordEmbeddings(
    'distilbert-base-uncased',       # which transformer model
    layers='-1',                     # which layers (here: only the last layer when fine-tuning)
    pooling_operation='first_last',  # how to pool over subword-split tokens
    fine_tune=True,                  # whether or not to fine-tune
)
By setting fine_tune to either True or False, you can select whether to fine-tune the embeddings during training. For instance, to fine-tune a transformer model for sequence labeling, you could use code like this:
from flair.models import SequenceTagger

# sequence tagger with fine-tuneable transformer embeddings and no RNN or CRF
tagger: SequenceTagger = SequenceTagger(
    hidden_size=256,
    embeddings=embeddings,
    tag_dictionary=tag_dictionary,
    tag_type='ner',
    use_crf=False,
    use_rnn=False,
)
import torch
from flair.trainers import ModelTrainer

# use the Adam optimizer when fine-tuning
trainer = ModelTrainer(tagger, corpus, optimizer=torch.optim.Adam)

# fine-tune with settings from the BERT paper
trainer.train(f'resources/taggers/ner-distilbert-base-uncased-{run}-prestarts-256',
              learning_rate=3e-5,       # very low learning rate
              mini_batch_chunk_size=2,  # set this if you get OOM errors
              max_epochs=4,             # very few epochs of fine-tuning
              )
This is only in the master branch, so if you pip install flair you won't yet be able to do this.
This sounds interesting. I've just upgraded to the master branch and I'm trying to follow your instructions, but with a TextClassifier I run into some errors. Is there any chance of a similar code snippet that would work for text classification?
For text classification, you should use the TransformerDocumentEmbeddings variant, i.e.
import torch
from flair.embeddings import TransformerDocumentEmbeddings
from flair.models import TextClassifier
from flair.trainers import ModelTrainer

transformer_model = 'bert-base-cased'
embeddings = TransformerDocumentEmbeddings(model=transformer_model, fine_tune=True)

corpus = [....]  # load your corpus here

# make a label dictionary for your corpus
label_dict = corpus.make_label_dictionary()
print(label_dict)

# instantiate the classifier with the embeddings and label dictionary
model: TextClassifier = TextClassifier(embeddings, label_dict)

# use the Adam optimizer
trainer = ModelTrainer(model, corpus, optimizer=torch.optim.Adam)

trainer.train(
    'path/to/output/folder',
    learning_rate=3e-5,       # low learning rate, as per the BERT paper
    mini_batch_size=256,      # set this high if you have lots of data, otherwise low
    mini_batch_chunk_size=2,  # set this low if you experience memory errors
    max_epochs=4,             # very few epochs of fine-tuning
    min_learning_rate=3e-6,   # lower the minimum learning rate
)
Thank you very much for such a quick response!
The code works, but now I'm starting to wonder what it actually does.
Before, when I used the DocumentRNNEmbeddings class, I could specify e.g. the type of RNN and its parameters, and that RNN was trained to create the document-level embeddings. Now I'm fine-tuning the whole model (the last layer, as I understand it), for instance DistilBERT. But how are the document-level embeddings created with TransformerDocumentEmbeddings? Is there any way to feed this fine-tuned DistilBERT to a similar RNN as in DocumentRNNEmbeddings?
The whole model is trained (fine-tuned), and by default the last-layer embeddings are taken for each token. In the case of text classification with BERT, the embedding of the CLS token represents the text.
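Concretely, for a sequence of last-layer token vectors, "use the CLS embedding" just means taking the vector at position 0, rather than e.g. averaging all token vectors. A dependency-free toy sketch with made-up numbers (no real model involved):

```python
# Toy illustration of CLS pooling vs. mean pooling over token vectors.
# The numbers are invented; a real model would produce them from text.

# Last-layer hidden states for a 4-token sequence: [CLS], "good", "movie", [SEP]
hidden_states = [
    [0.1, 0.2, 0.3],   # position 0: [CLS]
    [0.4, 0.5, 0.6],
    [0.7, 0.8, 0.9],
    [0.2, 0.1, 0.0],
]

def cls_pool(states):
    """Document embedding = hidden state of the [CLS] token (position 0)."""
    return states[0]

def mean_pool(states):
    """Document embedding = element-wise mean over all token states."""
    n = len(states)
    return [sum(col) / n for col in zip(*states)]

doc_cls = cls_pool(hidden_states)    # [0.1, 0.2, 0.3]
doc_mean = mean_pool(hidden_states)  # approximately [0.35, 0.4, 0.45]
```

Mean pooling here corresponds to what DocumentPoolEmbeddings does by default, while CLS pooling is the idea behind TransformerDocumentEmbeddings with BERT-style models.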
Thank you again.
So by using the CLS token, I'm effectively using BERT's built-in way of creating a document-level embedding, right?
And if I wanted to avoid using the CLS output, could I also, theoretically, take this whole saved fine-tuned model and feed it again, this time to an LSTM in DocumentRNNEmbeddings? Are there any tips for feeding the fine-tuned model path to DocumentRNNEmbeddings instead of, for instance, BertEmbeddings('bert-base-cased')?
I have not seen such an approach in papers, but it is doable.
If we have more token features (e.g. OneHotEmbeddings) then this sounds fine, but maybe without separate fine-tuning; I would try fine-tuning within DocumentRNNEmbeddings.
OK, forgive my ignorance, but I got confused and I'm not sure anymore that I understand everything correctly.
To create document-level embeddings in Flair, I can use a selected LM (e.g. BERT in its pre-trained version) via:
1) DocumentPoolEmbeddings with pooling: str = "mean" as the default option; this involves no training at all;
2) DocumentRNNEmbeddings with an RNN and fine_tune=True as default. I understand that this trains the RNN (fed with token embeddings provided by the selected LM) on my corpus, but now I have to ask: does it also train (fine-tune) the LM as well? So are the RNN and the LM trained together?
3) TransformerDocumentEmbeddings, which uses neither "mean" pooling nor an RNN; it uses the LM's own architecture and the CLS output to provide the document-level embedding. I see the LM can be fine-tuned with this option. My previous question, which @djstrong already answered, was about the possibility of using TransformerDocumentEmbeddings to fine-tune the LM on its own, save the model, and later feed it to DocumentRNNEmbeddings to compare whether that changes anything.
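To make the contrast with option 2 concrete, here is a dependency-free toy sketch of what an RNN document encoder does on top of per-token embeddings: a single Elman-style recurrence whose final hidden state serves as the document embedding. The weights below are made up for illustration; in DocumentRNNEmbeddings they would be learned during training on your corpus, and the actual implementation (LSTM, batching, etc.) is of course more involved:

```python
import math

def rnn_document_embedding(token_vectors, w_in, w_rec, hidden_size):
    """Run a minimal Elman-style RNN over token vectors; the final
    hidden state serves as the document-level embedding (option 2)."""
    h = [0.0] * hidden_size
    for x in token_vectors:
        # h_t = tanh(W_in @ x_t + W_rec @ h_{t-1})
        h = [
            math.tanh(
                sum(w_in[i][j] * x[j] for j in range(len(x)))
                + sum(w_rec[i][k] * h[k] for k in range(hidden_size))
            )
            for i in range(hidden_size)
        ]
    return h

# Tiny example: 2-dim token embeddings, 2-dim hidden state
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
w_in = [[0.5, -0.5], [0.25, 0.25]]   # input weights (hand-picked)
w_rec = [[0.1, 0.0], [0.0, 0.1]]     # recurrent weights (hand-picked)
doc_vec = rnn_document_embedding(tokens, w_in, w_rec, hidden_size=2)
```

Unlike the pooling of option 1, this encoder is order-sensitive: reversing the token sequence generally produces a different document vector.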
Ad. 2: It depends on the embeddings used; some are fine-tunable and some are frozen.
Ad. 3: You can do it. Train TransformerDocumentEmbeddings, then extract the BERT model and use it in DocumentRNNEmbeddings.
Thank you again for answering.
Ad. 2: Is there a list of models that support fine-tuning in Flair?
Ad. 3. I will give it a try.
Ad. 3: How can I "extract the BERT model" as @djstrong mentioned? I tried to follow https://github.com/flairNLP/flair/blob/a49bc42eb38de21d62c50532ae3d91430aab7213/resources/docs/TUTORIAL_9_TRAINING_LM_EMBEDDINGS.md so I literally did:
word_embeddings = [RoBERTaEmbeddings("./data/model_sentiment_0/roblftadam_best-model.pt")] (because I have fine-tuned a RoBERTa Large model, and as I understand it this is the same architecture and should work as well) and get:
ValueError: Calling RobertaTokenizer.from_pretrained() with the path to a single file or url is not supported. Use a model identifier or the path to a directory instead.
I will share some results that confirm that using the CLS output from a fine-tuned transformer model for text classification seems to be the best option (better than an LSTM). I work on a 2000-tweet corpus with 5 sentiment classes and extract features in various ways, namely:
1) LIWC (Linguistic Inquiry and Word Count);
2) a simple Term Frequency (TF) model with selection of the 100 most important features (by means of the Mutual Information method);
3) Deep Learning fastText embeddings computed for each token separately (100-dim vector) and simply averaged to create a tweet-level representation (also 100-dim);
4) Deep Learning RoBERTa Large embeddings computed for each token separately (1024-dim vector) and simply averaged to create a tweet-level representation (also 1024-dim);
5) Deep Learning fastText embeddings computed for each token separately (100-dim vector) + an LSTM creating a 512-dim tweet-level vector (so this means training an LSTM to create the tweet-level vector);
6) Deep Learning RoBERTa Large embeddings computed for each token separately (1024-dim vector) + an LSTM creating a 512-dim tweet-level vector (so this means training an LSTM + fine-tuning the RoBERTa Large model);
7) Deep Learning Universal Sentence Encoder (USE), a pre-trained model outputting not token-level embeddings but tweet-level embeddings directly (512-dim vector);
8) Deep Learning RoBERTa Large embeddings from the CLS output after fine-tuning with the SGD optimizer;
9) Deep Learning RoBERTa Large embeddings from the CLS output after fine-tuning with the Adam optimizer.
All extracted features are fed separately to a Gradient Boosting (GB) model (n_estimators=250). One exception: I also train a Naive Bayes (NB) classifier on the features derived by the simple Term Frequency (TF) model, as this combination is known to provide higher-quality results.
Results (Matthews Correlation Coefficient scores; everything is 5-fold cross-validated, and the fine-tuning of LMs and LSTMs is carried out on the same train/val/test splits as used in the final ML stage):
1) LIWC GB 0.375 +/- 0.046
2) TF NB 0.405 +/- 0.023 and TF GB 0.375 +/- 0.047
3) FT GB 0.398 +/- 0.036 (pooled)
4) RL GB 0.465 +/- 0.033 (pooled)
5) FT LSTM GB 0.426 +/- 0.02
6) RL LSTM GB 0.528 +/- 0.028
7) USE GB 0.439 +/- 0.018
8) RLFT GB 0.407 +/- 0.029 (fine-tuned with default SGD optimizer)
9) RLFTA GB 0.578 +/- 0.068 (fine-tuned with Adam optimizer)
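For readers unfamiliar with the metric: the Matthews Correlation Coefficient used for the scores above ranges from -1 to 1 (0 being chance level). A minimal sketch of the binary case (the 5-class results above use the multiclass generalization; in practice sklearn.metrics.matthews_corrcoef handles both):

```python
import math

def mcc_binary(y_true, y_pred):
    """Matthews Correlation Coefficient for binary labels (0/1).

    MCC = (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN))
    Returns 0.0 when any marginal sum is zero (degenerate case).
    """
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

print(mcc_binary([1, 1, 0, 0], [1, 1, 0, 0]))  # perfect prediction -> 1.0
print(mcc_binary([1, 1, 0, 0], [0, 0, 1, 1]))  # inverted prediction -> -1.0
```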
So the fine-tuning regime proposed here by @alanakbik allowed me to outperform everything else that I tried here.
@krzysztoffiok thanks for sharing these results! Yes, for document-level classification tasks the fine-tuning transformer approach is currently the state-of-the-art and in many cases significantly outperforms previous approaches such as LSTM over words. If you have the computational power, this is currently the go-to approach.
Thanks for your answer. If I were looking for a knowledge source that describes the CLS output and how it is actually trained, what would you recommend?
The CLS token is described in the original BERT paper, and there are many blog posts about BERT that also describe the CLS token; those are probably helpful.
Thank you very much for all these answers. I did of course read that paper before, but now, thanks to this discussion and a better understanding, it was a whole new experience. So I followed the proposal from the original BERT paper and took a mix of features from the 4 last layers of RoBERTa, fed to a 2-layer biLSTM (only with 512 rather than 768 hidden states), and the results are close to those from fine-tuning with the CLS representation:
mix of 4 last layers + 2-layer biLSTM (512) gave: 0.553 +/- 0.083
fine-tuning and CLS gave: 0.578 +/- 0.068