Transformers: pytorch-transformers returns output of 13 layers?

Created on 25 Sep 2019 · 9 comments · Source: huggingface/transformers

📚 Migration

Model I am using (Bert, XLNet....): BertModel

Language I am using the model on (English, Chinese....): English

The problem arises when using:

  • [x] my own modified scripts: (give details)

The task I am working on is:

  • [x] my own task or dataset: (give details)

Details of the issue:

I am using pytorch-transformers for the rather unconventional task of regression (one output). In my research I use BERT and I'm planning to try out the other transformers as well. When I started, I got good results with pytorch-pretrained-bert. However, running the same code with pytorch-transformers gives me results that are a lot worse.

In the original code, I use the output of the model and concatenate the last four layers, as proposed in the BERT paper. The architecture I used looks like this:

from pytorch_pretrained_bert.modeling import BertModel
import torch
from torch import nn


class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.bert_model = BertModel.from_pretrained('bert-base-uncased')
        self.pre_classifier = nn.Linear(3072, 512)
        self.dropout = nn.Dropout(0.2)
        self.classifier = nn.Linear(512, 1)

    def forward(self, bert_ids, bert_mask):
        all_bert_layers, _ = self.bert_model(bert_ids, attention_mask=bert_mask)
        print('hidden_states', len(all_bert_layers))
        # concat last four layers
        out = torch.cat(tuple([all_bert_layers[i] for i in [-1, -2, -3, -4]]), dim=-1)
        print('output', out.size())

        # Pooling by also setting masked items to zero
        bert_mask = bert_mask.unsqueeze(2)
        # Multiply output with mask to only retain non-padding tokens
        out = torch.mul(out, bert_mask)
        print('output', out.size())

        # First token ([CLS]) is used as the sentence representation
        out = out[:, 0, :]
        print('pooled_output', out.size())

        out = self.pre_classifier(out)
        print('pre_classifier', out.size())

        out = self.dropout(out)
        print('dropout', out.size())

        out = self.classifier(out)
        print('classifier', out.size())

        return out

When porting this to pytorch-transformers, the main change is that we now get a tuple back from the model and have to explicitly ask for all hidden states. The converted code looks like this:

from pytorch_transformers import BertModel
import torch
from torch import nn


class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.bert_model = BertModel.from_pretrained('bert-base-uncased', output_hidden_states=True)
        self.pre_classifier = nn.Linear(3072, 512)
        self.dropout = nn.Dropout(0.2)
        self.classifier = nn.Linear(512, 1)

    def forward(self, bert_ids, bert_mask):
        out, _ = self.bert_model(input_ids=bert_ids, attention_mask=bert_mask)
        hidden_states = out[2]
        print('hidden_states', len(hidden_states))

        out = torch.cat(tuple([hidden_states[i] for i in [-1, -2, -3, -4]]), dim=-1)
        print('output', out.size())

        # Pooling by also setting masked items to zero
        bert_mask = bert_mask.unsqueeze(2)
        # Multiply output with mask to only retain non-padding tokens
        out = torch.mul(out, bert_mask)
        print('output', out.size())

        # First token ([CLS]) is used as the sentence representation
        out = out[:, 0, :]
        print('pooled_output', out.size())

        out = self.pre_classifier(out)
        print('pre_classifier', out.size())

        out = self.dropout(out)
        print('dropout', out.size())

        out = self.classifier(out)
        print('classifier', out.size())

        return out

As I said before, this leads to very different results. Seeding cannot be the issue, since I set all seeds manually in both cases, like this:

import os
import random

import numpy as np
import torch


def set_seed():
    torch.manual_seed(3)
    torch.cuda.manual_seed_all(3)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False
    np.random.seed(3)
    random.seed(3)
    os.environ['PYTHONHASHSEED'] = str(3)

I added the print statements for debugging and quickly found a fundamental difference between the two architectures: the hidden_states print statement yields 12 for pytorch-pretrained-bert but 13 for pytorch-transformers! I am not sure how that relates to the worse results, but I assume this is the place to start looking.

I have compared the created models, and in both cases the encoder consists of 12 layers, so I am not sure why pytorch-transformers returns 13. What is the extra one?

Going through the source code, it seems that the first hidden state (i.e. the output of the embeddings) is included. Is that true?

https://github.com/huggingface/pytorch-transformers/blob/7c0f2d0a6a8937063bb310fceb56ac57ce53811b/pytorch_transformers/modeling_bert.py#L340-L352

Even so, since the embeddings would be the first item in all_hidden_states, the last four layers should still be the same. Therefore, I am not sure why there is such a big difference between the results of the two versions above. If you spot any faults, please advise.
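For reference, here is a quick standalone check (a sketch, assuming bert-base-uncased with output_hidden_states=True) that the extra first entry does not change what the negative indices select:

from pytorch_transformers import BertModel, BertTokenizer
import torch

model = BertModel.from_pretrained('bert-base-uncased', output_hidden_states=True)
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model.eval()

input_ids = torch.tensor([tokenizer.encode("a quick sanity check")])
with torch.no_grad():
    hidden_states = model(input_ids)[-1]  # tuple of all hidden states

print(len(hidden_states))  # 13: embedding output + 12 encoder layers
# Negative indices still pick the last four encoder layers,
# so the concatenated size stays 4 * 768 = 3072.
last_four = torch.cat(tuple(hidden_states[i] for i in [-1, -2, -3, -4]), dim=-1)
print(last_four.size())  # (1, seq_len, 3072)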

Environment

  • OS: Win 10
  • Python version: 3.7
  • PyTorch version: 1.2
  • PyTorch Transformers version (or branch):
  • Using GPU? Yes, CUDA 10
  • Distributed or parallel setup? No

Checklist

  • [x] I have read the migration guide in the readme.

Most helpful comment

Hi Bram,
It's written in the link to the doc that I sent you above and also in the docstring of the model:
[screenshot of the docstring describing the hidden_states output]
I'll see if I can find a way to make it more visible.

There are a few reasons we did that. One is this great paper by Tenney et al. (http://arxiv.org/abs/1905.05950), which uses the output of the embeddings as well as the hidden states to study BERT's performance. Another is to have easy access to the embeddings, as you mention.

All 9 comments

I am looking at this too and I believe (might be wrong) that the embedding layer sits in the last position. So I guess you should do [-2:-5]

Hm, I don't think so. The embedding state is passed to the forward function, and that state is used to initialize the all_hidden_states variable. Then you iterate over all layers and append to the tuple sequentially.

https://github.com/huggingface/pytorch-transformers/blob/7c0f2d0a6a8937063bb310fceb56ac57ce53811b/pytorch_transformers/modeling_bert.py#L337-L359

Hi Bram,

Please read the details of BertModel's outputs in the docstring or the doc here: https://huggingface.co/pytorch-transformers/model_doc/bert.html#pytorch_transformers.BertModel

The first element of the output tuple of Bert is always the last hidden-state and the full list of hidden-states is the last element of the output tuple in your case.

These lines:

out, _ = self.bert_model(input_ids=bert_ids, attention_mask=bert_mask)
hidden_states = out[2]

should be changed to:

model_outputs = self.bert_model(input_ids=bert_ids, attention_mask=bert_mask)
hidden_states = model_outputs[-1]
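
For completeness, with output_hidden_states=True (and output_attentions left at its default of False) the returned tuple is (last_hidden_state, pooler_output, hidden_states), so a minimal sketch of the unpacking inside the forward above, under those assumptions, looks like:

# sketch, assuming the model was built with output_hidden_states=True
model_outputs = self.bert_model(input_ids=bert_ids, attention_mask=bert_mask)
last_hidden_state = model_outputs[0]  # (batch, seq_len, 768)
pooler_output = model_outputs[1]      # (batch, 768)
hidden_states = model_outputs[-1]     # tuple of 13 tensors: embeddings + 12 encoder layers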

Hi Thomas, thank you for your time.

Apparently a mistake crept into my comment on GitHub. In my code, I do have the correct version, i.e.

out = self.bert_model(input_ids=bert_ids, attention_mask=bert_mask)
hidden_states = out[2]

The question I have is: when you then print the length of those hidden states, you get different numbers.

print(len(hidden_states))
# 13 for pytorch_transformers, 12 for pytorch_pretrained_bert

Going through the source code, it seems that the input hidden state (final hidden state of the embeddings) is included when using pytorch_transformers, but not for pytorch_pretrained_bert.

https://github.com/huggingface/pytorch-transformers/blob/7c0f2d0a6a8937063bb310fceb56ac57ce53811b/pytorch_transformers/modeling_bert.py#L337-L352

I couldn't find this documented anywhere, but I am curious about the reasoning behind it: since the embedding state is _not_ an encoder state, it might not be what one expects to get back from the model. On the other hand, it does make it easy for users to get the embeddings.
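
A quick way to verify which entry is the embedding output (a sketch, assuming bert-base-uncased in eval mode so that dropout is disabled):

import torch
from pytorch_transformers import BertModel, BertTokenizer

model = BertModel.from_pretrained('bert-base-uncased', output_hidden_states=True)
model.eval()
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
input_ids = torch.tensor([tokenizer.encode("checking the hidden states")])

with torch.no_grad():
    hidden_states = model(input_ids)[-1]
    embedding_output = model.embeddings(input_ids)

print(len(hidden_states), model.config.num_hidden_layers + 1)  # 13 13
print(torch.allclose(hidden_states[0], embedding_output))      # True: first entry is the embedding output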

Hi Bram,
It's written in the link to the doc that I sent you above and also in the docstring of the model:
[screenshot of the docstring describing the hidden_states output]
I'll see if I can find a way to make it more visible.

There are a few reasons we did that. One is this great paper by Tenney et al. (http://arxiv.org/abs/1905.05950), which uses the output of the embeddings as well as the hidden states to study BERT's performance. Another is to have easy access to the embeddings, as you mention.

Add last layer

 if self.output_hidden_states: 
     all_hidden_states = all_hidden_states + (hidden_states,)

https://github.com/huggingface/transformers/blob/7c0f2d0a6a8937063bb310fceb56ac57ce53811b/pytorch_transformers/modeling_bert.py#L350-L352

But on lines 350-352, it adds the "hidden states" (the last hidden state of the embeddings) to all_hidden_states, so the last item is the embedding output.

No, by that time the initial hidden_states variable has already been reassigned in the for loop. So at each step hidden_states is:

  • on entering the function: the embedding output
  • on each iteration of the loop: `hidden_states = layer_outputs[0]`

Perhaps the not-so-intuitive part is that hidden_states is appended to all_hidden_states as the first thing in the loop. That means that at the end of the first iteration, all_hidden_states consists only of the embeddings, and at the end of the last iteration, it does not yet contain the last hidden state (because appending happens before getting the layer_outputs). Therefore, the hidden states of the last layer (iteration) still have to be added manually, on the lines you mentioned.
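
A minimal sketch of that bookkeeping (simplified, not the actual modeling_bert.py code) may make it clearer why the tuple ends up with the embeddings first and the last layer's output last:

# simplified sketch of the append-then-run pattern described above
def collect_hidden_states(embedding_output, layers):
    all_hidden_states = ()
    hidden_states = embedding_output                               # starts as the embedding output
    for layer in layers:
        all_hidden_states = all_hidden_states + (hidden_states,)   # append BEFORE running the layer
        hidden_states = layer(hidden_states)                       # reassigned to the layer's output
    # the last layer's output was never appended inside the loop, so add it here
    all_hidden_states = all_hidden_states + (hidden_states,)
    return all_hidden_states  # embeddings + one entry per layer = 13 for BERT-base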

You are right, thanks for the clarification!

@thomwolf Thanks for the clarification. I was looking in all the wrong places, it appears. In particular, I had expected this in the README's migration section. If you want, I can do a small doc pull request for that.

Re-opened. Will close after doc change if requested.

