Transformers: Additional layers to BERT

Created on 16 Jul 2020 · 12 Comments · Source: huggingface/transformers

โ“ Questions & Help

Details


I'm currently fine-tuning a BertForSequenceClassification model for a classification task, and I wanted to know whether there are ways to add additional layers before the final classification layer.

wontfix

All 12 comments

Hi @psureshmagadi17 , you can add additional layers easily. Take a look at the source code for BertForSequenceClassification; you can take that code as it is and add the additional layers before the final classifier.

Hi @patil-suraj , thank you for your response. Did you mean that we can just alter the code in the main class? If so, do you have an example?

Hi @psureshmagadi17, if your goal is to add layers to a pretrained model only for fine-tuning BertForSequenceClassification, I think the best option is to modify the BertForSequenceClassification module.

If you want to add attention layers, make sure to use the sequence_output of the BertModel Module and not the pooled_output in the forward function, then use a BertPooler layer before the classifier.
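
To make the distinction concrete, here is a minimal sketch of the two outputs of BertModel (the positional indexing below works both for the older tuple return format and for the newer model-output objects):

from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("Hello world", return_tensors="pt")
outputs = model(**inputs)

sequence_output = outputs[0]  # (batch_size, sequence_length, 768): one hidden state per token, feed this to the extra attention layers
pooled_output = outputs[1]    # (batch_size, 768): the [CLS] hidden state passed through BertPooler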

Hi @nassim-yagoub - thank you for the response! I'm fairly new to this process, i.e., modifying the network structure. Do you have an example or discussion I can follow to help me through this process?

A small example:

import torch.nn as nn
from transformers import BertModel

class CustomBERTModel(nn.Module):
    def __init__(self):
        super(CustomBERTModel, self).__init__()
        self.bert = BertModel.from_pretrained("bert-base-uncased")
        # add your additional layers here, for example a dropout layer followed by a linear classification head
        self.dropout = nn.Dropout(0.3)
        self.out = nn.Linear(768, 2)

    def forward(self, ids, mask, token_type_ids):
        # note: in recent transformers versions the model returns a ModelOutput by default;
        # pass return_dict=False (or index the output) to keep this tuple unpacking working
        sequence_output, pooled_output = self.bert(
            ids,
            attention_mask=mask,
            token_type_ids=token_type_ids
        )

        # we apply dropout to the sequence output, a tensor of shape (batch_size, sequence_length, 768)
        sequence_output = self.dropout(sequence_output)

        # next, we apply the linear layer. It takes as input the hidden states of all tokens
        # (seq_len vectors of size 768, one per input token) and outputs 2 numbers (scores, or logits)
        # for every token, so the logits have shape (batch_size, sequence_length, 2)
        logits = self.out(sequence_output)

        return logits
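
A quick usage sketch for this custom module (the example sentences and variable names here are just for illustration):

import torch
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = CustomBERTModel()

encoding = tokenizer(
    ["a first example sentence", "a second one"],
    padding=True,
    truncation=True,
    return_tensors="pt",
)

with torch.no_grad():
    logits = model(
        ids=encoding["input_ids"],
        mask=encoding["attention_mask"],
        token_type_ids=encoding["token_type_ids"],
    )
print(logits.shape)  # (batch_size, sequence_length, 2)

Note that this particular head returns a pair of logits per token; for a single label per sequence you would typically classify the pooled_output (or the hidden state of the [CLS] token) instead.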

Thank you, @NielsRogge

For example, if you want to add the same kind of layers used in BERT, you may want to modify the module this way (with new_layers_config being the same as the original config, except for the number of layers):

import torch.nn as nn
from torch.nn import CrossEntropyLoss, MSELoss
# BertEncoder and BertPooler are defined in the BERT modeling file; in recent releases the path is
# transformers.models.bert.modeling_bert, in older (3.x) releases it was transformers.modeling_bert
from transformers.models.bert.modeling_bert import BertEncoder, BertModel, BertPooler, BertPreTrainedModel

class BertForSequenceClassification(BertPreTrainedModel):
    def __init__(self, config, new_layers_config):
        super().__init__(config)
        self.num_labels = config.num_labels

        self.bert = BertModel(config)
        self.new_layers = BertEncoder(new_layers_config)
        self.pooler = BertPooler(config)
        self.dropout = nn.Dropout(config.hidden_dropout_prob)
        self.classifier = nn.Linear(config.hidden_size, config.num_labels)

        self.init_weights()

    def forward(
        self,
        input_ids=None,
        attention_mask=None,
        token_type_ids=None,
        position_ids=None,
        head_mask=None,
        inputs_embeds=None,
        labels=None,
        output_attentions=None,
        output_hidden_states=None,
    ):

        outputs = self.bert(
            input_ids,
            attention_mask=attention_mask,
            token_type_ids=token_type_ids,
            position_ids=position_ids,
            head_mask=head_mask,
            inputs_embeds=inputs_embeds,
            output_attentions=output_attentions,
            output_hidden_states=output_hidden_states,
        )

        sequence_output = outputs[0]

        new_layers_output = self.new_layers(sequence_output)[0]

        pooled_output = self.pooler(new_layers_output)

        pooled_output = self.dropout(pooled_output)
        logits = self.classifier(pooled_output)

        outputs = (logits,) + outputs[2:]  # add hidden states and attention if they are here

        if labels is not None:
            if self.num_labels == 1:
                #  We are doing regression
                loss_fct = MSELoss()
                loss = loss_fct(logits.view(-1), labels.view(-1))
            else:
                loss_fct = CrossEntropyLoss()
                loss = loss_fct(logits.view(-1, self.num_labels), labels.view(-1))
            outputs = (loss,) + outputs

        return outputs  # (loss), logits, (hidden_states), (attentions)

We added a BertEncoder and a BertPooler to the base implementation.
You can also retrieve the hidden_states and attentions of the new layers if you want to; I did not do that here.
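
As a rough sketch of how this modified class could be instantiated (assuming from_pretrained forwards extra positional arguments to __init__; the added encoder layers, pooler and classifier start from randomly initialized weights and are trained during fine-tuning):

from copy import deepcopy
from transformers import BertConfig

config = BertConfig.from_pretrained("bert-base-uncased", num_labels=2)
# same config as the original model, except for the number of layers
new_layers_config = deepcopy(config)
new_layers_config.num_hidden_layers = 2

model = BertForSequenceClassification.from_pretrained(
    "bert-base-uncased",
    new_layers_config,  # extra positional argument, passed through to __init__
    config=config,
)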

Thanks @nassim-yagoub !

@nassim-yagoub - I had another question: are the weights of the BertForSequenceClassification model's layers frozen by default?

The weights are not frozen by default when you load them; however, you can manually freeze them by setting .requires_grad = False on the parameters.
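
For example, a minimal sketch that freezes the pretrained BERT encoder so that only the newly added layers and the classifier are updated during training:

# model is an instance of the (custom) BertForSequenceClassification above
for param in model.bert.parameters():
    param.requires_grad = False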

Thank you @nassim-yagoub!

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
