I am experimenting with using transformer embeddings in sentence classification tasks without fine-tuning them. I have used BERT embeddings, and those experiments gave me very good results. Now I want to use GPT-2 embeddings (without fine-tuning), so I have two questions.
GPT-2 and BERT are both transformer networks with very similar architectures. You can use the GPT-2 embeddings the same way you used BERT embeddings.
As you said, GPT-2 only handles left context. You can read the paper where the authors showcase results on several tasks in a zero-shot setting (section 3).
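As a rough sketch of what that looks like with the transformers library (the model name and example sentence here are only illustrative assumptions, not code from this thread):

import torch
from transformers import GPT2Tokenizer, GPT2Model

# pull frozen GPT-2 hidden states, mirroring how BERT features are extracted
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2Model.from_pretrained("gpt2")
model.eval()  # embeddings only, no fine-tuning

input_ids = tokenizer.encode("This movie was surprisingly good.", return_tensors="pt")
with torch.no_grad():
    token_embeddings = model(input_ids)[0]  # (1, seq_len, 768) for the base model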
I recently imported the GPT2Model and built a simple classifier. I think the model is too naive and could use some improvements. If you notice any mistakes, please correct me. :)
import torch.nn as nn
from transformers import GPT2Model


class SimpleGPT2SequenceClassifier(nn.Module):
    def __init__(self, hidden_size: int, num_classes: int, max_seq_len: int,
                 gpt_model_name: str, cache_dir: str):
        super(SimpleGPT2SequenceClassifier, self).__init__()
        self.gpt2model = GPT2Model.from_pretrained(
            gpt_model_name, cache_dir=cache_dir
        )
        # hidden_size must equal max_seq_len * GPT-2's hidden dim (768 for the
        # base model), because forward() flattens all token embeddings into fc1.
        self.fc1 = nn.Linear(hidden_size, num_classes)

    def forward(self, x_in):
        """
        Args:
            x_in: encoded input ids of the sentence, shape (batch_size, max_seq_len)
        """
        gpt_out = self.gpt2model(x_in)[0]  # last hidden states, (batch_size, max_seq_len, 768)
        batch_size = gpt_out.shape[0]
        # flatten all token embeddings and project them to class logits
        prediction_vector = self.fc1(gpt_out.view(batch_size, -1))  # (batch_size, num_classes)
        return prediction_vector
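For context, a hypothetical way to instantiate it (the sequence length, class count, and cache directory are assumptions; 768 is the hidden size of the base gpt2 model):

MAX_SEQ_LEN = 128  # assumed value for illustration
classifier = SimpleGPT2SequenceClassifier(
    hidden_size=MAX_SEQ_LEN * 768,  # fc1 sees the flattened token embeddings
    num_classes=2,
    max_seq_len=MAX_SEQ_LEN,
    gpt_model_name="gpt2",
    cache_dir="./cache",
)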
For preprocessing the text before encoding it with the tokenizer:
import nltk

punkt_sentence_detector = nltk.data.load('tokenizers/punkt/english.pickle')


class GPT2Preprocessor:
    def __init__(self, transformer_tokenizer, sentence_detector):
        self.transformer_tokenizer = transformer_tokenizer
        self.sentence_detector = sentence_detector

    def add_eos_tokens(self, text):
        # split the text into sentences and join them with the tokenizer's EOS
        # token, also appending one at the very end
        eos_token = " " + self.transformer_tokenizer.eos_token + " "
        sentences = self.sentence_detector.tokenize(text)
        eos_added_text = (
            eos_token.join(sentences) + " " + self.transformer_tokenizer.eos_token
        )
        return eos_added_text
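A hypothetical usage sketch, assuming the GPT-2 tokenizer from transformers together with the punkt detector loaded above:

from transformers import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
preprocessor = GPT2Preprocessor(tokenizer, punkt_sentence_detector)

text = "The service was slow. The food was great."
prepared = preprocessor.add_eos_tokens(text)
# -> "The service was slow. <|endoftext|> The food was great. <|endoftext|>"
input_ids = tokenizer.encode(prepared, return_tensors="pt")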
I tried GPT-2 embeddings and compared them with RoBERTa embeddings for the task of sentiment classification (both networks were frozen during training). GPT-2 couldn't outperform RoBERTa.
@cozek From the code, it isn't obvious whether you've frozen the gpt2 layers or not?
Of course, I have not frozen any layers; it is not always necessary to freeze them. If required, you can easily freeze the layers, for example as in the sketch below.
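A minimal sketch of such freezing, reusing the classifier instance from the instantiation snippet above (the optimizer choice and learning rate are assumptions):

import torch

# freeze the GPT-2 backbone so that only the fc1 head is trained
for param in classifier.gpt2model.parameters():
    param.requires_grad = False

# hand the optimizer only the parameters that still require gradients
optimizer = torch.optim.Adam(
    (p for p in classifier.parameters() if p.requires_grad), lr=1e-3
)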
Do you still have the notebooks? I would be interested to see how you implemented a classification head on top of gpt-2.
Here you go: https://github.com/cozek/OffensEval2020-code/blob/master/notebooks/Eng%20Task%20A%20-%20Ensemble%20DistilGPT2.ipynb. I used it for OffensEval 2020 (hate speech detection), with the distilled version; feel free to swap it out for the full GPT-2. We got 0.90 macro F1 with this model.
You can add a CLS token to the vocabulary:
tokenizer.add_special_tokens({'cls_token': '[CLS]'})
model.resize_token_embeddings(len(tokenizer))
Then append this CLS token at the end of your input and use the representation of this token for classification, as is done in BERT (see the sketch below).
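A rough sketch of that idea; the model name, example sentence, and the two-class head are assumptions for illustration, not code from this thread:

import torch
import torch.nn as nn
from transformers import GPT2Tokenizer, GPT2Model

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2Model.from_pretrained("gpt2")

# register a [CLS] token and grow the embedding matrix to make room for it
tokenizer.add_special_tokens({'cls_token': '[CLS]'})
model.resize_token_embeddings(len(tokenizer))

head = nn.Linear(model.config.n_embd, 2)  # assumed 2-class head

input_ids = tokenizer.encode("I really enjoyed this film. [CLS]", return_tensors="pt")
hidden = model(input_ids)[0]   # (1, seq_len, n_embd)
cls_vec = hidden[:, -1, :]     # the last position holds the appended [CLS] token
logits = head(cls_vec)         # (1, 2)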
cc @cozek
Thanks a lot. Very helpful!
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
@cozek I see in your code that you concatenate all the token embeddings together to produce the sentence representation and then pass that through fc1:
gpt_out = self.gpt2model(x_in)[0] #returns tuple
batch_size = gpt_out.shape[0]
prediction_vector = self.fc1(gpt_out.view(batch_size,-1))
Instead of concatenating all the token embeddings, did you try:
1. pooling over all the tokens to get the sentence representation, for example max pooling or mean pooling?
2. using the embedding of the last token?
@AsmirMumin
I did not try 1 or 2. Option 1 seems logical, as it would reduce the size of the FC layer and increase training speed; I am not familiar with option 2.
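For completeness, a rough sketch of both alternatives on top of frozen GPT-2 hidden states (the model name, example sentence, and two-class head are assumptions):

import torch
import torch.nn as nn
from transformers import GPT2Tokenizer, GPT2Model

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2Model.from_pretrained("gpt2")
head = nn.Linear(model.config.n_embd, 2)  # hypothetical 2-class head

input_ids = tokenizer.encode("What a waste of time.", return_tensors="pt")
with torch.no_grad():
    hidden = model(input_ids)[0]  # (1, seq_len, n_embd)

# Option 1: pool over all tokens (mean shown; max pooling works the same way)
mean_vec = hidden.mean(dim=1)     # (1, n_embd)

# Option 2: take the embedding of the last token, which has seen the whole
# left context in a causal model like GPT-2
last_vec = hidden[:, -1, :]       # (1, n_embd)

logits = head(mean_vec)           # or head(last_vec)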