I am experimenting with using transformer embeddings in sentence classification tasks without fine-tuning them. I have used BERT embeddings, and those experiments gave me very good results. Now I want to use GPT-2 embeddings (without fine-tuning), so I have two questions.
GPT-2 and BERT are both transformer networks with very similar architectures. You can use the GPT-2 embeddings the same way you used BERT embeddings.
As you said, GPT-2 only handles left context. You can read the paper where the authors showcase results on several tasks in a zero-shot setting (section 3).
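As a rough sketch of what that looks like with the transformers library (the model name and example sentence here are only illustrative assumptions, not code from this thread):

import torch
from transformers import GPT2Tokenizer, GPT2Model

# pull frozen GPT-2 hidden states, mirroring how BERT features are extracted
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2Model.from_pretrained("gpt2")
model.eval()  # embeddings only, no fine-tuning

input_ids = tokenizer.encode("This movie was surprisingly good.", return_tensors="pt")
with torch.no_grad():
    token_embeddings = model(input_ids)[0]  # (1, seq_len, 768) for the base model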
I recently imported the GPT2Model and built a simple classifier. I think the model is too naive and could use some improvements. If you notice any mistakes, please correct me. :)
import torch.nn as nn
from transformers import GPT2Model


class SimpleGPT2SequenceClassifier(nn.Module):
    def __init__(self, hidden_size: int, num_classes: int, max_seq_len: int,
                 gpt_model_name: str, cache_dir: str):
        super(SimpleGPT2SequenceClassifier, self).__init__()
        self.gpt2model = GPT2Model.from_pretrained(
            gpt_model_name, cache_dir=cache_dir
        )
        # hidden_size must equal max_seq_len * GPT-2's hidden dim (768 for the
        # base model), because forward() flattens all token embeddings into fc1.
        self.fc1 = nn.Linear(hidden_size, num_classes)

    def forward(self, x_in):
        """
        Args:
            x_in: encoded input ids of the sentence, shape (batch_size, max_seq_len)
        """
        gpt_out = self.gpt2model(x_in)[0]  # last hidden states, (batch_size, max_seq_len, 768)
        batch_size = gpt_out.shape[0]
        # flatten all token embeddings and project them to class logits
        prediction_vector = self.fc1(gpt_out.view(batch_size, -1))  # (batch_size, num_classes)
        return prediction_vector
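For context, a hypothetical way to instantiate it (the sequence length, class count, and cache directory are assumptions; 768 is the hidden size of the base gpt2 model):

MAX_SEQ_LEN = 128  # assumed value for illustration
classifier = SimpleGPT2SequenceClassifier(
    hidden_size=MAX_SEQ_LEN * 768,  # fc1 sees the flattened token embeddings
    num_classes=2,
    max_seq_len=MAX_SEQ_LEN,
    gpt_model_name="gpt2",
    cache_dir="./cache",
)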
For preprocessing the text before encoding it with the tokenizer:
import nltk

punkt_sentence_detector = nltk.data.load('tokenizers/punkt/english.pickle')


class GPT2Preprocessor:
    def __init__(self, transformer_tokenizer, sentence_detector):
        self.transformer_tokenizer = transformer_tokenizer
        self.sentence_detector = sentence_detector

    def add_eos_tokens(self, text):
        # split the text into sentences and join them with the tokenizer's EOS
        # token, also appending one at the very end
        eos_token = " " + self.transformer_tokenizer.eos_token + " "
        sentences = self.sentence_detector.tokenize(text)
        eos_added_text = (
            eos_token.join(sentences) + " " + self.transformer_tokenizer.eos_token
        )
        return eos_added_text
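A hypothetical usage sketch, assuming the GPT-2 tokenizer from transformers together with the punkt detector loaded above:

from transformers import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
preprocessor = GPT2Preprocessor(tokenizer, punkt_sentence_detector)

text = "The service was slow. The food was great."
prepared = preprocessor.add_eos_tokens(text)
# -> "The service was slow. <|endoftext|> The food was great. <|endoftext|>"
input_ids = tokenizer.encode(prepared, return_tensors="pt")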
I tried GPT-2 embeddings and compared them with RoBERTa embeddings for the task of sentiment classification (both networks were frozen during training). GPT-2 couldn't outperform RoBERTa.
@cozek From the code, it isn't obvious whether you've frozen the gpt2 layers or not?
Of course, I have not frozen any layers; it is not always necessary to freeze them. If required, you can easily freeze the layers, for example as in the sketch below.
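A minimal sketch of such freezing, reusing the classifier instance from the instantiation snippet above (the optimizer choice and learning rate are assumptions):

import torch

# freeze the GPT-2 backbone so that only the fc1 head is trained
for param in classifier.gpt2model.parameters():
    param.requires_grad = False

# hand the optimizer only the parameters that still require gradients
optimizer = torch.optim.Adam(
    (p for p in classifier.parameters() if p.requires_grad), lr=1e-3
)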
Do you still have the notebooks? I would be interested to see how you implemented a classification head on top of gpt-2.
Here you go: https://github.com/cozek/OffensEval2020-code/blob/master/notebooks/Eng%20Task%20A%20-%20Ensemble%20DistilGPT2.ipynb. I used it for OffensEval 2020 (hate speech detection), with the distilled version; feel free to swap it out for the full GPT-2. We got 0.90 macro F1 with this model.
You can add a CLS token to the vocabulary:
tokenizer.add_special_tokens({'cls_token': '[CLS]'})
model.resize_token_embeddings(len(tokenizer))
Then append this CLS token at the end of your input and use the representation of this token for classification, as is done in BERT (see the sketch below).
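A rough sketch of that idea; the model name, example sentence, and the two-class head are assumptions for illustration, not code from this thread:

import torch
import torch.nn as nn
from transformers import GPT2Tokenizer, GPT2Model

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2Model.from_pretrained("gpt2")

# register a [CLS] token and grow the embedding matrix to make room for it
tokenizer.add_special_tokens({'cls_token': '[CLS]'})
model.resize_token_embeddings(len(tokenizer))

head = nn.Linear(model.config.n_embd, 2)  # assumed 2-class head

input_ids = tokenizer.encode("I really enjoyed this film. [CLS]", return_tensors="pt")
hidden = model(input_ids)[0]   # (1, seq_len, n_embd)
cls_vec = hidden[:, -1, :]     # the last position holds the appended [CLS] token
logits = head(cls_vec)         # (1, 2)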
cc @cozek
Thanks a lot. Very helpful!
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
@cozek I see in your code that you concatenate all the token embeddings together to produce the sentence representation and then pass that through fc1:
gpt_out = self.gpt2model(x_in)[0] #returns tuple
batch_size = gpt_out.shape[0]
prediction_vector = self.fc1(gpt_out.view(batch_size,-1))
Instead of concatenating all the token embeddings, did you try:
1. pooling over all the tokens to get the sentence representation, for example max pooling or mean pooling?
2. using the embedding of the last token?
@AsmirMumin
I did not try 1 or 2. Option 1 seems logical, as it would reduce the size of the FC layer and increase training speed; I am not familiar with option 2.
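For completeness, a rough sketch of both alternatives on top of frozen GPT-2 hidden states (the model name, example sentence, and two-class head are assumptions):

import torch
import torch.nn as nn
from transformers import GPT2Tokenizer, GPT2Model

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2Model.from_pretrained("gpt2")
head = nn.Linear(model.config.n_embd, 2)  # hypothetical 2-class head

input_ids = tokenizer.encode("What a waste of time.", return_tensors="pt")
with torch.no_grad():
    hidden = model(input_ids)[0]  # (1, seq_len, n_embd)

# Option 1: pool over all tokens (mean shown; max pooling works the same way)
mean_vec = hidden.mean(dim=1)     # (1, n_embd)

# Option 2: take the embedding of the last token, which has seen the whole
# left context in a causal model like GPT-2
last_vec = hidden[:, -1, :]       # (1, n_embd)

logits = head(mean_vec)           # or head(last_vec)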