Transformers: Additional layers to BERT

Created on 16 Jul 2020 · 12 Comments · Source: huggingface/transformers

โ“ Questions & Help

Details


I'm currently fine-tuning a BertForSequenceClassification model for a classification task, and I wanted to know whether there are ways to add additional layers before the final classification layer.

wontfix

All 12 comments

Hi @psureshmagadi17 , you can add additional layers easily. Take a look at the source code for BertForSequenceClassification; you can take that code as it is and add the additional layers before the final classifier.

Hi @patil-suraj , thank you for your response. Did you mean that we can just alter the code in the main class? If so, do you have an example?

Hi @psureshmagadi17, if your goal is to add layers to a pretrained model only for fine-tuning BertForSequenceClassification, I think the best option is to modify the BertForSequenceClassification module.

If you want to add attention layers, make sure to use the sequence_output of the BertModel Module and not the pooled_output in the forward function, then use a BertPooler layer before the classifier.
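
To make the distinction concrete, here is a minimal sketch of the two outputs of BertModel (the positional indexing below works both for the older tuple return format and for the newer model-output objects):

from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("Hello world", return_tensors="pt")
outputs = model(**inputs)

sequence_output = outputs[0]  # (batch_size, sequence_length, 768): one hidden state per token, feed this to the extra attention layers
pooled_output = outputs[1]    # (batch_size, 768): the [CLS] hidden state passed through BertPooler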

Hi @nassim-yagoub - thank you for the response! I'm fairly new to this process, i.e., modifying the network structure. Do you have an example or discussion I can follow to help me through this process?

A small example:

import torch.nn as nn
from transformers import BertModel

class CustomBERTModel(nn.Module):
    def __init__(self):
        super(CustomBERTModel, self).__init__()
        self.bert = BertModel.from_pretrained("bert-base-uncased")
        # add your additional layers here, for example a dropout layer followed by a linear classification head
        self.dropout = nn.Dropout(0.3)
        self.out = nn.Linear(768, 2)

    def forward(self, ids, mask, token_type_ids):
        # note: in recent transformers versions the model returns a ModelOutput by default;
        # pass return_dict=False (or index the output) to keep this tuple unpacking working
        sequence_output, pooled_output = self.bert(
            ids,
            attention_mask=mask,
            token_type_ids=token_type_ids
        )

        # we apply dropout to the sequence output, a tensor of shape (batch_size, sequence_length, 768)
        sequence_output = self.dropout(sequence_output)

        # next, we apply the linear layer. It takes as input the hidden states of all tokens
        # (seq_len vectors of size 768, one per input token) and outputs 2 numbers (scores, or logits)
        # for every token, so the logits have shape (batch_size, sequence_length, 2)
        logits = self.out(sequence_output)

        return logits
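
A quick usage sketch for this custom module (the example sentences and variable names here are just for illustration):

import torch
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = CustomBERTModel()

encoding = tokenizer(
    ["a first example sentence", "a second one"],
    padding=True,
    truncation=True,
    return_tensors="pt",
)

with torch.no_grad():
    logits = model(
        ids=encoding["input_ids"],
        mask=encoding["attention_mask"],
        token_type_ids=encoding["token_type_ids"],
    )
print(logits.shape)  # (batch_size, sequence_length, 2)

Note that this particular head returns a pair of logits per token; for a single label per sequence you would typically classify the pooled_output (or the hidden state of the [CLS] token) instead.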

Thank you, @NielsRogge

For example, if you want to add the same kind of layers used in BERT, you may want to modify the module this way (with new_layers_config being the same as the original config, except for the number of layers):

import torch.nn as nn
from torch.nn import CrossEntropyLoss, MSELoss
# BertEncoder and BertPooler are defined in the BERT modeling file; in recent releases the path is
# transformers.models.bert.modeling_bert, in older (3.x) releases it was transformers.modeling_bert
from transformers.models.bert.modeling_bert import BertEncoder, BertModel, BertPooler, BertPreTrainedModel

class BertForSequenceClassification(BertPreTrainedModel):
    def __init__(self, config, new_layers_config):
        super().__init__(config)
        self.num_labels = config.num_labels

        self.bert = BertModel(config)
        self.new_layers = BertEncoder(new_layers_config)
        self.pooler = BertPooler(config)
        self.dropout = nn.Dropout(config.hidden_dropout_prob)
        self.classifier = nn.Linear(config.hidden_size, config.num_labels)

        self.init_weights()

    def forward(
        self,
        input_ids=None,
        attention_mask=None,
        token_type_ids=None,
        position_ids=None,
        head_mask=None,
        inputs_embeds=None,
        labels=None,
        output_attentions=None,
        output_hidden_states=None,
    ):

        outputs = self.bert(
            input_ids,
            attention_mask=attention_mask,
            token_type_ids=token_type_ids,
            position_ids=position_ids,
            head_mask=head_mask,
            inputs_embeds=inputs_embeds,
            output_attentions=output_attentions,
            output_hidden_states=output_hidden_states,
        )

        sequence_output = outputs[0]

        new_layers_output = self.new_layers(sequence_output)[0]

        pooled_output = self.pooler(new_layers_output)

        pooled_output = self.dropout(pooled_output)
        logits = self.classifier(pooled_output)

        outputs = (logits,) + outputs[2:]  # add hidden states and attention if they are here

        if labels is not None:
            if self.num_labels == 1:
                #  We are doing regression
                loss_fct = MSELoss()
                loss = loss_fct(logits.view(-1), labels.view(-1))
            else:
                loss_fct = CrossEntropyLoss()
                loss = loss_fct(logits.view(-1, self.num_labels), labels.view(-1))
            outputs = (loss,) + outputs

        return outputs  # (loss), logits, (hidden_states), (attentions)

We added a BertEncoder and a BertPooler to the base implementation.
You can also retrieve the hidden_states and attentions of the new layers if you want to; I did not do that here.
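
As a rough sketch of how this modified class could be instantiated (assuming from_pretrained forwards extra positional arguments to __init__; the added encoder layers, pooler and classifier start from randomly initialized weights and are trained during fine-tuning):

from copy import deepcopy
from transformers import BertConfig

config = BertConfig.from_pretrained("bert-base-uncased", num_labels=2)
# same config as the original model, except for the number of layers
new_layers_config = deepcopy(config)
new_layers_config.num_hidden_layers = 2

model = BertForSequenceClassification.from_pretrained(
    "bert-base-uncased",
    new_layers_config,  # extra positional argument, passed through to __init__
    config=config,
)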

Thanks @nassim-yagoub !

@nassim-yagoub - I had another question: are the weights of the BertForSequenceClassification model's layers frozen by default?

The weights are not frozen by default when you load them; however, you can manually freeze them by setting .requires_grad = False on the parameters.
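
For example, a minimal sketch that freezes the pretrained BERT encoder so that only the newly added layers and the classifier are updated during training:

# model is an instance of the (custom) BertForSequenceClassification above
for param in model.bert.parameters():
    param.requires_grad = False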

Thank you @nassim-yagoub!

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
