Model I am using (Bert, XLNet....): BertModel
Language I am using the model on (English, Chinese....): English
The problem arises when using:
The task I am working on is:
Details of the issue:
I am using pytorch-transformers for the rather unconventional task of regression (one output). In my research I use BERT and I'm planning to try out the other transformers as well. When I started, I got good results with pytorch-pretrained-bert. However, running the same code with pytorch-transformers gives me results that are a lot worse.
In the original code, I take the output of the model and concatenate the last four layers, as proposed in the BERT paper. The architecture I used looks like this:
from pytorch_pretrained_bert.modeling import BertModel
import torch
from torch import nn


class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.bert_model = BertModel.from_pretrained('bert-base-uncased')
        self.pre_classifier = nn.Linear(3072, 512)
        self.dropout = nn.Dropout(0.2)
        self.classifier = nn.Linear(512, 1)

    def forward(self, bert_ids, bert_mask):
        all_bert_layers, _ = self.bert_model(bert_ids, attention_mask=bert_mask)
        print('hidden_states', len(all_bert_layers))
        # Concatenate the last four layers
        out = torch.cat(tuple([all_bert_layers[i] for i in [-1, -2, -3, -4]]), dim=-1)
        print('output', out.size())
        # Pooling by also setting masked items to zero
        bert_mask = bert_mask.unsqueeze(2)
        # Multiply output with mask to only retain non-padding tokens
        out = torch.mul(out, bert_mask)
        print('output', out.size())
        # First item [CLS] is the sentence representation
        out = out[:, 0, :]
        print('pooled_output', out.size())
        out = self.pre_classifier(out)
        print('pre_classifier', out.size())
        out = self.dropout(out)
        print('dropout', out.size())
        out = self.classifier(out)
        print('classifier', out.size())
        return out
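For context, with pytorch-pretrained-bert the call in forward returns a pair of (encoded layers, pooled output); a rough sketch of the shapes for bert-base-uncased (hidden size 768, 12 layers), just for illustration:

# Illustrative shapes for bert-base-uncased:
all_bert_layers, pooled = self.bert_model(bert_ids, attention_mask=bert_mask)
# len(all_bert_layers) == 12                -> one tensor per encoder layer
# all_bert_layers[i]: (batch, seq_len, 768)
# pooled:             (batch, 768)
# Concatenating the last four layers gives 4 * 768 = 3072 features,
# which matches nn.Linear(3072, 512) above.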
When porting this to pytorch-transformers, the main difference is that the model now returns a tuple, and we have to explicitly ask to get all hidden states back. As such, the converted code looks like this:
from pytorch_transformers import BertModel
import torch
from torch import nn


class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.bert_model = BertModel.from_pretrained('bert-base-uncased', output_hidden_states=True)
        self.pre_classifier = nn.Linear(3072, 512)
        self.dropout = nn.Dropout(0.2)
        self.classifier = nn.Linear(512, 1)

    def forward(self, bert_ids, bert_mask):
        out, _ = self.bert_model(input_ids=bert_ids, attention_mask=bert_mask)
        hidden_states = out[2]
        print('hidden_states', len(hidden_states))
        # Concatenate the last four layers
        out = torch.cat(tuple([hidden_states[i] for i in [-1, -2, -3, -4]]), dim=-1)
        print('output', out.size())
        # Pooling by also setting masked items to zero
        bert_mask = bert_mask.unsqueeze(2)
        # Multiply output with mask to only retain non-padding tokens
        out = torch.mul(out, bert_mask)
        print('output', out.size())
        # First item [CLS] is the sentence representation
        out = out[:, 0, :]
        print('pooled_output', out.size())
        out = self.pre_classifier(out)
        print('pre_classifier', out.size())
        out = self.dropout(out)
        print('dropout', out.size())
        out = self.classifier(out)
        print('classifier', out.size())
        return out
As I said before, this leads to very different results. Seeding cannot be the issue, since I set all seeds manually in both cases, like this:
import os
import random

import numpy as np
import torch

def set_seed():
    torch.manual_seed(3)
    torch.cuda.manual_seed_all(3)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False
    np.random.seed(3)
    random.seed(3)
    os.environ['PYTHONHASHSEED'] = str(3)
I added the print statements as a form of debugging, and I quickly found that there is a fundamental difference between the two architectures: the hidden_states print statement yields 12 for pytorch-pretrained-bert and 13 for pytorch-transformers! I am not sure how this relates to the worse results, but I assume it is a good place to start looking.
I have tried comparing the created models, but in both cases the encoder consists of 12 layers, so I am not sure why pytorch-transformers returns 13 hidden states. What is the extra one?
Going through the source code, it seems that the first hidden state (i.e. the output of the embedding layer) is included. Is that true?
Even so, since the embeddings would be the first item in all_hidden_states, the last four layers should still be the same. Therefore, I am not sure why there is such a big difference between the results of the two versions above. If you spot any faults, please advise.
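If it helps with reproducing, this is roughly the sanity check I would run to compare the two libraries directly (a minimal sketch, assuming both packages are installed in the same environment and `bert_ids`/`bert_mask` are the same encoded batch):

import torch
from pytorch_pretrained_bert.modeling import BertModel as OldBertModel
from pytorch_transformers import BertModel as NewBertModel

old_model = OldBertModel.from_pretrained('bert-base-uncased').eval()
new_model = NewBertModel.from_pretrained('bert-base-uncased', output_hidden_states=True).eval()

with torch.no_grad():
    # pytorch-pretrained-bert: list of 12 encoder layer outputs
    old_layers, _ = old_model(bert_ids, attention_mask=bert_mask)
    # pytorch-transformers: tuple of outputs, hidden states at index 2 here
    new_outputs = new_model(input_ids=bert_ids, attention_mask=bert_mask)
    new_layers = new_outputs[2]

print(len(old_layers), len(new_layers))  # 12 vs. 13
# If the extra entry is only the embedding output, the last four layers should match:
for i in [-1, -2, -3, -4]:
    print(i, torch.allclose(old_layers[i], new_layers[i], atol=1e-5))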
I am looking at this too and I believe (might be wrong) that the embedding layer sits in the last position. So I guess you should do [-2:-5]
Hm, I don't think so. The embedding state is passed to the forward function, and that state is used to initialize the all_hidden_states variable. Then you iterate over all layers and append to the tuple sequentially.
Hi Bram,
Please read the details of BertModel's outputs in the docstring or the doc here: https://huggingface.co/pytorch-transformers/model_doc/bert.html#pytorch_transformers.BertModel
The first element of the output tuple of Bert is always the last hidden-state and the full list of hidden-states is the last element of the output tuple in your case.
These lines:
out, _ = self.bert_model(input_ids=bert_ids, attention_mask=bert_mask)
hidden_states = out[2]
should be changed to:
model_outputs = self.bert_model(input_ids=bert_ids, attention_mask=bert_mask)
hidden_states = model_outputs[-1]
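Something like this for the start of your forward, keeping the rest unchanged (just a sketch):

def forward(self, bert_ids, bert_mask):
    # the model returns a tuple; with output_hidden_states=True the tuple of
    # all hidden states (embeddings + one per layer) is the last element
    model_outputs = self.bert_model(input_ids=bert_ids, attention_mask=bert_mask)
    hidden_states = model_outputs[-1]
    out = torch.cat(tuple([hidden_states[i] for i in [-1, -2, -3, -4]]), dim=-1)
    # ... same pooling / classifier steps as before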
Hi Thomas, thank you for your time.
Apparently a mistake crept into my comment; in my actual code I do have the correct version, i.e.
out = self.bert_model(input_ids=bert_ids, attention_mask=bert_mask)
hidden_states = out[2]
The question I have is why, when you then print the length of those hidden states, you get different numbers:
print(len(hidden_states))
# 13 for pytorch_transformers, 12 for pytorch_pretrained_bert
Going through the source code, it seems that the input hidden state (final hidden state of the embeddings) is included when using pytorch_transformers, but not for pytorch_pretrained_bert.
I couldn't find this documented anywhere, but I am curious about the reasoning behind it: the embedding output is _not_ an encoder state, so it might not be what one expects to get back from the model. On the other hand, it does make it easy for users to get the embeddings.
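For what it's worth, this is how I convinced myself that the extra entry is the embedding output (a quick sketch; I am assuming default token type and position ids, and eval mode so the embedding dropout is a no-op):

import torch
from pytorch_transformers import BertModel

model = BertModel.from_pretrained('bert-base-uncased', output_hidden_states=True).eval()

with torch.no_grad():
    outputs = model(input_ids=bert_ids, attention_mask=bert_mask)
    hidden_states = outputs[-1]
    # call the embedding sub-module directly for comparison
    embedding_output = model.embeddings(bert_ids)

print(len(hidden_states))                                  # 13
print(torch.allclose(hidden_states[0], embedding_output))  # True -> index 0 is the embedding output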
Hi Bram,
It's written in the link to the doc that I sent you above and also in the docstring of the model: the `hidden_states` output contains the hidden states at the output of each layer plus the initial embedding output.
I'll see if I can find a way to make it more visible.
There are a few reasons we did that. One is this great paper by Tenney et al. (http://arxiv.org/abs/1905.05950), which uses the output of the embeddings as well as the hidden states to study BERT's performance. Another is to give easy access to the embeddings, as you mention.
# Add last layer
if self.output_hidden_states:
    all_hidden_states = all_hidden_states + (hidden_states,)

But on lines 350-352, it adds the "hidden states" (the last layer of the embedding) to "all_hidden_states", so the last item is the embedding output.
No, by that time the initial hidden_states variable has already been reassigned in the for loop. So at each step, hidden_states is:

- on entering the function: the embedding output
- on each iteration of the loop: `hidden_states = layer_outputs[0]`

Perhaps the not-so-intuitive part is that hidden_states is appended to all_hidden_states as the first thing in the loop. That means that at the end of the first iteration, all_hidden_states consists only of the embeddings, and at the end of the last iteration it does not yet contain the last hidden state (because appending happens before the layer_outputs are computed). Therefore, the hidden states of the last layer still have to be added manually, on the lines that you mentioned.
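Schematically, the loop looks like this (a simplified sketch of BertEncoder.forward, leaving out head masks and attention outputs):

all_hidden_states = ()
for layer_module in self.layer:
    if self.output_hidden_states:
        # appended *before* the layer runs; on the first iteration this
        # stores the embedding output
        all_hidden_states = all_hidden_states + (hidden_states,)
    layer_outputs = layer_module(hidden_states, attention_mask)
    hidden_states = layer_outputs[0]
# the output of the final layer is only appended after the loop (lines 350-352)
if self.output_hidden_states:
    all_hidden_states = all_hidden_states + (hidden_states,)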
You are right, thanks for the clarification!
@thomwolf Thanks for the clarification. I was looking in all the wrong places, it appears. In particular, I had expected this in the README's migration section. If you want, I can do a small doc pull request for that.
Re-opened. Will close after doc change if requested.