I have been trying for several days now to solve an issue I have...
I need to build a feature extractor for a project I am doing, so that I can translate a given sentence, e.g. "My hat is blue", into a vector of a given length, e.g. 768. That vector will later be combined with several other values for the final prediction, e.g. in a random forest algorithm.
My dataset contains a text column + a label column (with 0 and 1 values) + several other columns that are not of interest for this problem.
I know how to build that feature extractor using word2vec, GloVe, FastText and pre-trained BERT/ELMo models. That works okay.
Now I want to improve the text-to-feature extractor by using a FINE-TUNED BERT model instead of a PRE-TRAINED BERT MODEL. I want to fine-tune the BERT model on my dataset and then use that new BERT model to do the feature extraction. I am NOT INTERESTED in using the BERT model for the predictions themselves! Only for the feature extraction. How can I do that? I think I need run_lm_finetuning.py somehow, but I simply can't figure out how to do it.
I could really use some help...
P.S. I have already created a binary classifier that uses the text information to predict the label (0/1), by adding an additional layer. Could I in principle use the output of the previous layers, in evaluation mode, as word embeddings? If so, I am not sure how to get that output in evaluation mode.
The explanation for fine-tuning is in the README https://github.com/huggingface/pytorch-transformers#quick-tour-of-the-fine-tuningusage-scripts.
Thanks, but as far as I understand it, that is about "Fine-tuning on GLUE tasks for sequence classification". I want to do "fine-tuning on my data for word-to-features extraction". I am not interested in building a classifier, just a fine-tuned word-to-features extractor. I am not sure how to get there from the GLUE example. I need to somehow do the fine-tuning and then find a way to extract the output from e.g. the last four layers in evaluation mode for each sentence I want to extract features from. But how do I do that?
You can only fine-tune a model if you have a task, of course; otherwise the model doesn't know whether it is improving over some baseline or not. Since 'feature extraction', as you put it, doesn't come with a predefined correct result, that doesn't make sense. In your case it might be better to fine-tune the masked LM on your dataset. https://github.com/huggingface/pytorch-transformers/blob/master/pytorch_transformers/modeling_bert.py#L713
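A minimal sketch of what fine-tuning the masked LM on your own sentences could look like (this assumes pytorch-transformers and uses a much simplified masking strategy compared to run_lm_finetuning.py; the sentence, masking rate and optimizer are just illustrative):

import torch
from pytorch_transformers import BertTokenizer, BertForMaskedLM

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")
model.train()
optimizer = torch.optim.Adam(model.parameters(), lr=2e-5)

sentence = "My hat is blue"
input_ids = torch.tensor([tokenizer.encode(sentence)])

# labels hold the original ids; -1 marks positions that the loss ignores
labels = input_ids.clone()
mask = torch.rand(input_ids.shape) < 0.15  # mask roughly 15% of the tokens
labels[~mask] = -1
input_ids[mask] = tokenizer.convert_tokens_to_ids("[MASK]")

loss, prediction_scores = model(input_ids, masked_lm_labels=labels)
loss.backward()
optimizer.step()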
But wouldn't it be possible to proceed like this:
1) fine-tune the BERT model on my labelled data by adding a layer with two nodes (for 0 and 1) [ALREADY DONE]
2) Run all my data/sentences through the fine-tuned model in evaluation mode, and use the output of the last layers (before the classification layer) as the word embeddings instead of the predictions? Then I can use that feature vector in my further analysis of my problem, and I have created a feature extractor fine-tuned on my data.
What do you think of that approach?
But what do you wish to use these word representations for? It's a bit odd using word representations from deep learning as features in other kinds of systems.
But, yes, what you describe is theoretically possible. Take into account, though, that what you are extracting are not word embeddings. They are the final task-specific representations of words. In other words, if you fine-tune the model on another task, you'll get other word representations.
I have several columns in my dataset. Most of them contain numerical values, and then I have ONE text column. The idea is to extract features from the text, so I can represent the text field as numerical values too.
Once all my columns have numerical values (after feature extraction), I can use e.g. a neural network or a random forest algorithm to do the predictions based on both the text column and the other columns with numerical values, as sketched below.
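Concretely, something like this (a rough sketch with placeholder data; the shapes and column contents are made up for illustration):

import numpy as np
from sklearn.ensemble import RandomForestClassifier

# placeholders: 100 rows, one 768-dim BERT vector per text plus 5 numeric columns
text_features = np.random.rand(100, 768)     # output of the BERT feature extractor
numeric_features = np.random.rand(100, 5)    # the other numerical columns
labels = np.random.randint(0, 2, size=100)   # the 0/1 label column

X = np.hstack([text_features, numeric_features])
clf = RandomForestClassifier(n_estimators=100)
clf.fit(X, labels)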
By the way, do you know how, after I fine-tune the model, I can get the output from the last four layers in evaluation mode? My model is BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2), but I can only figure out how to get the final predictions (model.eval() -> predictions = model(b_input_ids, token_type_ids=None, attention_mask=b_input_mask)), not the output from all the layers...
If I were you, I would just extend BERT and add the features there, so that everything is optimised in one go. That will give you the cleanest and most reproducible pipeline. But of course you can do what you want. I also once tried Sent2Vec as features in SVR and that worked pretty well. So what I'm saying is, it might _work_, but the pipeline might get messy. So make sure that your code is well structured and easy to follow along. The more broken up your pipeline, the easier it is for errors to sneak in.
I advise you to read through the whole BertModel source code, especially its config counterpart. Down the line you'll find that there's this option that can be used:
When you enable output_hidden_states, all layers' final states will be returned.
from pytorch_transformers import BertModel

bert = BertModel.from_pretrained('bert-base-uncased', output_hidden_states=True)
# input_ids and attention_mask as you already build them for your classifier
out = bert(input_ids=input_ids, attention_mask=attention_mask)
# out is a tuple; the hidden states are the third element (cf. the source code)
hidden_states = out[2]
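If you then want the last four layers specifically, one possibility is to concatenate them along the hidden dimension; a rough sketch, continuing from the call above:

import torch

# hidden_states is a tuple of (embedding output + 12 layer outputs),
# each of shape (batch, seq_len, hidden_size)
last_four = torch.cat(hidden_states[-4:], dim=-1)  # (batch, seq_len, 4 * hidden_size)
# e.g. take the [CLS] token as a fixed-length sentence vector
sentence_vec = last_four[:, 0]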
Thanks a lot! Now my only problem is that, when I do:
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2, output_hidden_states=True)
I get:
TypeError: __init__() got an unexpected keyword argument 'output_hidden_states'
@pvester what version of pytorch-transformers are you using? I'm on 1.2.0 and it seems to be working with output_hidden_states = True.
@cformosa I am using 1.2.0
This is the full output
TypeError                                 Traceback (most recent call last)
----> 2 model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2, output_hidden_states=True)
      3 model.cuda()

/usr/local/lib/python3.6/dist-packages/pytorch_pretrained_bert/modeling.py in from_pretrained(cls, pretrained_model_name_or_path, *inputs, **kwargs)
    598         logger.info("Model config {}".format(config))
    599         # Instantiate model.
--> 600         model = cls(config, *inputs, **kwargs)
    601         if state_dict is None and not from_tf:
    602             weights_path = os.path.join(serialization_dir, WEIGHTS_NAME)

TypeError: __init__() got an unexpected keyword argument 'output_hidden_states'
@pvester perhaps this will help?
#1073
thanks @cformosa
I think I got more confused than before. I hope you guys are able to help me make this work. My latest try is:
config = BertConfig.from_pretrained("bert-base-uncased", output_hidden_states=True)
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2, config=config)
ERROR:
AttributeError: type object 'BertConfig' has no attribute 'from_pretrained'
No, don't do it like that. Your first approach was correct. (You don't need to use config manually when using a pre-trained model.) So
model = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=2, output_hidden_states=True)
is correct. I tested it and it works. I would assume that you are on an older version of pytorch-transformers. Try updating the package to the latest pip release.
EDIT: I just read the reference by cformosa. Apparently there are different ways. But if they don't work, it might indicate a version issue.
Are you sure you have a recent version of pytorch_transformers ?
import pytorch_transformers
pytorch_transformers.__version__
@BramVanroy, @thomwolf
pytorch_transformers.__version__ gives me "1.2.0"
Everything works when I do it without output_hidden_states=True.
I do a pip install of pytorch-transformers right before, with the following output:
Requirement already satisfied: pytorch-transformers in /usr/local/lib/python3.6/dist-packages (1.2.0)
Requirement already satisfied: requests in /usr/local/lib/python3.6/dist-packages (from pytorch-transformers) (2.21.0)
Requirement already satisfied: tqdm in /usr/local/lib/python3.6/dist-packages (from pytorch-transformers) (4.28.1)
Requirement already satisfied: regex in /usr/local/lib/python3.6/dist-packages (from pytorch-transformers) (2019.8.19)
Requirement already satisfied: torch>=1.0.0 in /usr/local/lib/python3.6/dist-packages (from pytorch-transformers) (1.1.0)
Requirement already satisfied: sacremoses in /usr/local/lib/python3.6/dist-packages (from pytorch-transformers) (0.0.34)
Requirement already satisfied: sentencepiece in /usr/local/lib/python3.6/dist-packages (from pytorch-transformers) (0.1.83)
Requirement already satisfied: boto3 in /usr/local/lib/python3.6/dist-packages (from pytorch-transformers) (1.9.224)
Requirement already satisfied: numpy in /usr/local/lib/python3.6/dist-packages (from pytorch-transformers) (1.16.5)
Requirement already satisfied: certifi>=2017.4.17 in /usr/local/lib/python3.6/dist-packages (from requests->pytorch-transformers) (2019.6.16)
Requirement already satisfied: chardet<3.1.0,>=3.0.2 in /usr/local/lib/python3.6/dist-packages (from requests->pytorch-transformers) (3.0.4)
Requirement already satisfied: idna<2.9,>=2.5 in /usr/local/lib/python3.6/dist-packages (from requests->pytorch-transformers) (2.8)
Requirement already satisfied: urllib3<1.25,>=1.21.1 in /usr/local/lib/python3.6/dist-packages (from requests->pytorch-transformers) (1.24.3)
Requirement already satisfied: six in /usr/local/lib/python3.6/dist-packages (from sacremoses->pytorch-transformers) (1.12.0)
Requirement already satisfied: click in /usr/local/lib/python3.6/dist-packages (from sacremoses->pytorch-transformers) (7.0)
Requirement already satisfied: joblib in /usr/local/lib/python3.6/dist-packages (from sacremoses->pytorch-transformers) (0.13.2)
Requirement already satisfied: botocore<1.13.0,>=1.12.224 in /usr/local/lib/python3.6/dist-packages (from boto3->pytorch-transformers) (1.12.224)
Requirement already satisfied: s3transfer<0.3.0,>=0.2.0 in /usr/local/lib/python3.6/dist-packages (from boto3->pytorch-transformers) (0.2.1)
Requirement already satisfied: jmespath<1.0.0,>=0.7.1 in /usr/local/lib/python3.6/dist-packages (from boto3->pytorch-transformers) (0.9.4)
Requirement already satisfied: python-dateutil<3.0.0,>=2.1; python_version >= "2.7" in /usr/local/lib/python3.6/dist-packages (from botocore<1.13.0,>=1.12.224->boto3->pytorch-transformers) (2.5.3)
Requirement already satisfied: docutils<0.16,>=0.10 in /usr/local/lib/python3.6/dist-packages (from botocore<1.13.0,>=1.12.224->boto3->pytorch-transformers) (0.15.2)
I tried with two different Python setups now and always get the same error:
TypeError: __init__() got an unexpected keyword argument 'output_hidden_states'
I can upload a Google Colab notebook, if it helps to find the error?
You're sure that you are passing in the keyword argument after the 'bert-base-uncased' argument, right? Yes, you can try a Colab.
@BramVanroy
Okay, thanks, the Colab link is here:
https://colab.research.google.com/drive/1tIFeHITri6Au8jb4c64XyVH7DhyEOeMU
Scroll down to the end for the error message.
You're loading it from the old pytorch_pretrained_bert, not from the new pytorch_transformers. Why are you importing pytorch_pretrained_bert in the first place? Using both at the same time will definitely lead to mistakes, or at least confusion. Stick to one.
This line
from pytorch_pretrained_bert import BertAdam, BertForSequenceClassification
should be
from pytorch_transformers import BertAdam, BertForSequenceClassification
@BramVanroy
Now i get
ImportError: cannot import name 'BertAdam'
I'm sorry, but this is getting annoying. If you'd just _read_, you'd understand what's wrong. The README states that there have been changes to the optimizers: BertAdam is gone, and you can now use AdamW, which lives in optimization.py. It's not hard to find out why an import goes wrong: just look through the source code here.
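For reference, the replacement looks roughly like this (assuming pytorch-transformers 1.2.0; the schedule and hyperparameters are just placeholders):

from pytorch_transformers import AdamW, WarmupLinearSchedule

# model is the BertForSequenceClassification instance from above
optimizer = AdamW(model.parameters(), lr=2e-5, correct_bias=False)
scheduler = WarmupLinearSchedule(optimizer, warmup_steps=100, t_total=1000)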
@BramVanroy @thomwolf @cformosa
Thanks for your help. I now managed to do my task as intended with quite good performance, and I am very happy with the results.
Thanks to all of you for your valuable help and patience.
I am sorry I did not understand everything in the documentation right away - it has been a learning experience for me as well :) I now feel more at ease with these packages and with manipulating an existing neural network.
No worries. Just remember that reading the documentation and particularly the source code will help you a lot. Not only for your current problem, but also for better understanding the bigger picture.
Glad that your results are as good as you expected.
I'm trying to extract the features from FlaubertForSequenceClassification. My concern is the huge size of the embeddings being extracted. Is there any work you can point me to that involves compressing the embeddings/features extracted from the model?
Thanks in advance!
You can use pooling for this, typically average or max pooling. You'll find a lot of info if you google it.
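For instance, a rough sketch of mean pooling over the token dimension (the names hidden_states and attention_mask are assumptions: the last layer's (batch, seq_len, hidden_size) output and the usual mask of real tokens):

# mask out padding tokens before averaging
mask = attention_mask.unsqueeze(-1).float()   # (batch, seq_len, 1)
summed = (hidden_states * mask).sum(dim=1)    # (batch, hidden_size)
counts = mask.sum(dim=1).clamp(min=1e-9)      # avoid division by zero
mean_pooled = summed / counts                 # one fixed-size vector per sentence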
Hi @BramVanroy, I'm relatively new to neural networks and I'm using 🤗transformers to fine-tune a BERT model for my research thesis.
The major challenge I'm having now happens to be mentioned in your comment here, that is _"extend BERT and add features"_. Is it possible to integrate the fine-tuned BERT model into a bigger network? Something like appending some more features to the output layer of BERT and then continuing forward to the next layer in the bigger network.
I know it's more of an ML question than a question specific to this package, but it would be much appreciated if you could refer me to some material/blog that explains a similar practice. Thanks!
@BenjiTheC I don't have any blog post to link to, but I wrote a small snippet that could help get you started. You just have to make sure the dimensions are correct for the features that you want to include. For more help you may want to get in touch via the forum. You can tag me there as well.
import torch
import torch.nn as nn
from transformers import BertModel


class ExtendedBert(nn.Module):
    def __init__(self, n_extra_feats=256):
        super().__init__()
        self.bert = BertModel.from_pretrained("bert-base-cased")
        # bert-base hidden size is 768; adjust n_extra_feats to your own features
        self.linear = nn.Linear(768 + n_extra_feats, 1024)
        self.act = nn.GELU()
        # regression problem: one label
        self.classifier = nn.Linear(1024, 1)

    def forward(self, encoded, other_feats):
        # get the hidden state of the last layer
        last_hidden = self.bert(**encoded)[0]
        # use the [CLS] token as the sentence representation
        cls_repr = last_hidden[:, 0]
        # concatenate with the other given features
        cat = torch.cat([cls_repr, other_feats], dim=-1)
        # pass through linear layer
        output = self.linear(cat)
        # pass through non-linear activation and final classifier layer
        return self.classifier(self.act(output))
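And a quick usage sketch (the extra-feature tensor and its size of 256 are hypothetical placeholders):

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-cased")
model = ExtendedBert(n_extra_feats=256)

encoded = tokenizer(["My hat is blue"], return_tensors="pt", padding=True)
other_feats = torch.rand(1, 256)          # placeholder for your real features
prediction = model(encoded, other_feats)  # shape: (1, 1)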
Thank you so much for such a timely response!
I'm a TF2 user, but your snippet definitely points me in the right direction: concatenate the last layer's state with the new features and pass that forward. One more follow-up question though: I saw in the previous discussion that, to get the hidden states of the model, you need to set output_hidden_states to True. Do I need this flag to be True to get what I want?
@BenjiTheC That flag is needed if you want the hidden states of _all_ layers. If you just want the last layer's hidden state (as in my example), then you do not need that flag.
Thanks so much! Will stay tuned in the forum and continue the discussion there if needed.
Hi @BramVanroy, I am relatively new to 🤗transformers. I would like to know whether it is possible to retrain/reuse a fine-tuned model on a different set of labels? The new set of labels may be a subset of the old labels, or the old labels plus some additional labels. I already asked this on the forum but got no reply yet. AFAIK it is currently not possible to retrain a fine-tuned model on a new set of labels. A workaround is to fine-tune a pre-trained model on the whole (old + new) data with the superset of old + new labels. Is that true?
I know it's more of an ML question than a question specific to this package, but I would really appreciate it if you could refer me to some reference that explains this. Thank you in advance.