Transformers: How do I put a different classifier on top of BertForSequenceClassification?

Created on 10 Aug 2019 · 12 comments · Source: huggingface/transformers

Hi,

Thanks for providing an efficient and easy-to-use implementation of BERT and other models.

I am working on a project that requires me to do binary classification of sentences. I am using BertForSequenceClassification for that, but I am not getting good results, i.e. my loss function doesn't converge. I noticed that by default there is only a single linear layer as the classifier on top of the BERT model. Is it possible to change that?

Thanks,
Shivin

Most helpful comment

You can also replace self.classifier with your own model.

model = BertForSequenceClassification.from_pretrained("bert-base-multilingual-cased")
model.classifier = new_classifier

where new_classifier is any pytorch model that you want.

All 12 comments

Sure, one way you could go about it would be to create a new class similar to BertForSequenceClassification and implement your own custom final classifier.

The lib is pretty modular so you can usually subclass/extend what you need.
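As a rough sketch of that subclassing approach (the class name and the two-layer head here are made up, and it assumes today's transformers package, where BertModel returns a tuple with the pooled output at index 1; it is not the library's exact implementation):

import torch.nn as nn
from transformers import BertModel, BertPreTrainedModel

class BertWithCustomHead(BertPreTrainedModel):
    def __init__(self, config):
        super().__init__(config)
        self.num_labels = config.num_labels
        self.bert = BertModel(config)
        self.dropout = nn.Dropout(config.hidden_dropout_prob)
        # custom two-layer head instead of the single Linear used by BertForSequenceClassification
        self.classifier = nn.Sequential(
            nn.Linear(config.hidden_size, config.hidden_size),
            nn.Tanh(),
            nn.Linear(config.hidden_size, config.num_labels),
        )
        self.init_weights()

    def forward(self, input_ids, attention_mask=None, token_type_ids=None):
        outputs = self.bert(input_ids,
                            attention_mask=attention_mask,
                            token_type_ids=token_type_ids)
        pooled_output = self.dropout(outputs[1])  # pooled [CLS] representation
        logits = self.classifier(pooled_output)
        return logits

model = BertWithCustomHead.from_pretrained("bert-base-cased", num_labels=2)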

You can also replace self.classifier with your own model.

model = BertForSequenceClassification.from_pretrained("bert-base-multilingual-cased")
model.classifier = new_classifier

where new_classifier is any pytorch model that you want.
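For instance, a minimal sketch (the layer sizes are just an assumption, with 768 matching the hidden size of bert-base and 2 output classes for binary classification):

import torch.nn as nn
from transformers import BertForSequenceClassification

model = BertForSequenceClassification.from_pretrained("bert-base-multilingual-cased")
new_classifier = nn.Sequential(
    nn.Linear(768, 256),
    nn.ReLU(),
    nn.Dropout(0.1),
    nn.Linear(256, 2),  # 2 output classes for binary classification
)
model.classifier = new_classifier
model.num_labels = 2  # keep num_labels in sync with the new head (see the last comment in this thread)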

ok... Thanks a lot. I will try it.

@dhpollack Maybe it's a little unrelated to this issue, but I'll still describe the situation. I am using the BERT model to classify sentences on two different datasets. It works fine on the first dataset but not on the second. Is it possible that BERT has saved its weights according to the first dataset and is loading those for the second one as well, and is therefore not performing well? For example, the model configuration looks like this for BOTH datasets. I wonder whether it should really have the same vocabulary size.

INFO:pytorch_pretrained_bert.modeling:Model config {
  "attention_probs_dropout_prob": 0.1,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "max_position_embeddings": 512,
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "type_vocab_size": 2,
  "vocab_size": 28996
}

It shows the same messages for both datasets:

INFO:pytorch_pretrained_bert.tokenization:loading vocabulary file https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-cased-vocab.txt from cache at /home/pytorch/.pytorch_pretrained_bert/5e8a2b4893d13790ed4150ca1906be5f7a03d6c4ddf62296c383f6db42814db2.e13dbb970cb325137104fb2e5f36fe865f27746c6b526f6352861b1980eb80b1
INFO:pytorch_pretrained_bert.modeling:loading archive file https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-cased.tar.gz from cache at cache/a803ce83ca27fecf74c355673c434e51c265fb8a3e0e57ac62a80e38ba98d384.681017f415dfb33ec8d0e04fe51a619f3f01532ecea04edbfd48c5d160550d9c
INFO:pytorch_pretrained_bert.modeling:extracting archive file cache/a803ce83ca27fecf74c355673c434e51c265fb8a3e0e57ac62a80e38ba98d384.681017f415dfb33ec8d0e04fe51a619f3f01532ecea04edbfd48c5d160550d9c to temp dir /tmp/tmpgummmons

How can I effectively use BERT for two different datasets?

@shivin9 this is definitely not related to the classifier layer. Also, it's a little unclear what you want to do. Are you training on one dataset and then doing inference on another? If that's the case, then you can do something like

# training
model = BertForSequenceClassification.from_pretrained("bert-base-cased")
...
model.save_pretrained("/tmp/trained_model_dir")

# inference
model = BertForSequenceClassification.from_pretrained("/tmp/trained_model_dir")

But as I said, it's unclear. If you are training on both datasets and getting good results on one but not the other, then it probably has to do with your preprocessing. Good luck solving your problem.

Hi, I have a related question. I am experimenting with BERT for a classification task. When I use `BertForSequenceClassification.from_pretrained`, I can get 100% accuracy on a small data set. But if I use a customized classification head as shown below, which is almost identical to BertForSequenceClassification, I get bad accuracy.

here is my customized classification head:

import torch.nn as nn
from torch.nn import CrossEntropyLoss, MSELoss

class Bertclfhead(nn.Module):
    def __init__(self, config, adapt_args, bertmodel):
        super().__init__()
        self.num_labels = adapt_args.num_classes
        self.config = config
        self.bert = bertmodel
        self.dropout = nn.Dropout(config['hidden_dropout_prob'])
        self.classifier = nn.Linear(config['hidden_size'], adapt_args.num_classes)

    def forward(self, input_ids, token_type_ids=None, attention_mask=None, labels=None, position_ids=None, head_mask=None):
        outputs = self.bert(input_ids, position_ids=position_ids, token_type_ids=token_type_ids,
                            attention_mask=attention_mask, head_mask=head_mask)

        pooled_output = outputs[1]  # pooled [CLS] output from the BERT encoder

        pooled_output = self.dropout(pooled_output)
        logits = self.classifier(pooled_output)

        outputs = (logits,) + outputs[2:]  # add hidden states and attention if they are here

        if labels is not None:
            if self.num_labels == 1:
                #  We are doing regression
                loss_fct = MSELoss()
                loss = loss_fct(logits.view(-1), labels.view(-1))
            else:
                loss_fct = CrossEntropyLoss()
                loss = loss_fct(logits.view(-1, self.num_labels), labels.view(-1))
            outputs = (loss,) + outputs

        return outputs  # (loss), logits, (hidden_states), (attentions)

and I initialize my model like this:

model = Bertclfhead(bertconfig, adapt_args, BertModel.from_pretrained('bert-base-uncased'))

Am I missing something?

@dhpollack I am first training on x and then inferring on x. Then I'm training on y and inferring on y.

I am also trying to put a BiLSTM on top of BERT, but it seems that BERT doesn't output the vectors in the required format, i.e. (#batches, seq_len, input_dim). Do you have any idea how it can be solved? Right now BERT is just outputting a (BATCH_SIZE, 768)-sized vector, 768 being the size of the hidden layer.

@shivin9 you should read the docs. You want the output of the hidden layers, but I think an LSTM on top of BERT is overkill. What you are getting now is the output of the pooling layer.
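For reference, a small sketch of feeding the per-token hidden states (rather than the pooled output) into an LSTM, assuming the tuple-style return where index 0 is the sequence output and index 1 is the pooled output; the sentence and layer sizes are only illustrative:

import torch
import torch.nn as nn
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-cased")
bert = BertModel.from_pretrained("bert-base-cased")
lstm = nn.LSTM(input_size=768, hidden_size=256, batch_first=True, bidirectional=True)

input_ids = torch.tensor([tokenizer.encode("An example sentence.")])  # (1, seq_len)
outputs = bert(input_ids)
sequence_output = outputs[0]  # (batch_size, seq_len, 768): one vector per token
pooled_output = outputs[1]    # (batch_size, 768): what was being fed to the classifier before
lstm_out, _ = lstm(sequence_output)  # (batch_size, seq_len, 2 * 256)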

Also you should close this issue since it's clear this is not an issue with the library.

Yeah sure. Thanks for the help.

@mehdimashayekhi Did you solve it? I have the same question! When I directly use BertForSequenceClassification versus a custom classifier similar to BertForSequenceClassification, the results are totally different.

@dhpollack I am first training on x and then inferring on x. Then I'm training on y and inferring on y.

I am also trying to put a BiLSTM on top of BERT, but it seems that BERT doesn't output the vectors in the required format, i.e. (#batches, seq_len, input_dim). Do you have any idea how it can be solved? Right now BERT is just outputting a (BATCH_SIZE, 768)-sized vector, 768 being the size of the hidden layer.

Were you able to resolve this?

Re: dhpollack's August 12 comment. Maybe something changed between then and now, but I found you also have to set the model's number of labels to get that to work.

model.classifier = torch.nn.Linear(768, 8)
model.num_labels = 8
