Adding [PAD] symbols to an input sentence changes the output of the model. I put together a small example here:
https://gist.github.com/juditacs/8be068d5f9063ad68e3098a473b497bd
I also noticed that the random seed state affects the output. Resetting it in every run ensures that the output is always the same. Is this because of layernorm?
Hi Judit:
You should pass an attention_mask with the input if the input is smaller than the tensor you are sending in (see the description of BertModel in the README), and put the model in evaluation mode (model.eval()) to disable the dropout layers.
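For example, a minimal sketch of that setup (not the exact code from the gist) could look like this:

import torch
from pytorch_transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')
model.eval()  # disables dropout, so repeated runs give identical outputs

# pad the sentence and mark real tokens with 1s, the [PAD] positions with 0s
tokens = ['[CLS]'] + tokenizer.tokenize('this is a complicated sentence [SEP]') + ['[PAD]'] * 3
input_ids = torch.tensor([tokenizer.convert_tokens_to_ids(tokens)])
attention_mask = torch.ones_like(input_ids)
attention_mask[:, -3:] = 0

with torch.no_grad():
    sequence_output = model(input_ids, attention_mask=attention_mask)[0]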
@thomwolf Despite the attention_mask, the values are still slightly different.
Is it normal that the [PAD] vectors have different values?
import torch
from pytorch_transformers import BertModel, BertTokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased', do_lower_case=False)
model = BertModel.from_pretrained('bert-base-uncased')
model.eval()
torch.manual_seed(0)
sent = "this is a complicated sentence [SEP]"
tokens = ['[CLS]'] + tokenizer.tokenize(sent)
ids = tokenizer.convert_tokens_to_ids(tokens)
t = torch.LongTensor([ids])
with torch.no_grad():
    out = model(t)[0]
torch.manual_seed(0)
sent = "this is a complicated sentence [SEP]"
tokens = ['[CLS]'] + tokenizer.tokenize(sent)
tokens.extend(['[PAD]'] * 3)
ids = torch.tensor(tokenizer.convert_tokens_to_ids(tokens)).unsqueeze(0)
mask = torch.zeros((1, ids.shape[1], ids.shape[1]), dtype=torch.float)
mask[:, :, 0:-3] = 1.0
with torch.no_grad():
    out2 = model(ids, attention_mask=mask[:, 0])[0]
print('------------')
for i in range(out.shape[1]):
    print(i, out[0][0, i].item())
print('------------')
for i in range(out2.shape[1]):
    torch.manual_seed(0)
    print(i, out2[0][0, i].item())
Here is the output:
0 -0.10266201943159103
1 0.11214534193277359
2 -0.1575649380683899
3 -0.3163739740848541
4 -0.4168904423713684
5 -0.4069269001483917
6 0.28849801421165466
------------
0 -0.10266169905662537
1 0.1121453121304512
2 -0.15756472945213318
3 -0.3163738548755646
4 -0.41689014434814453
5 -0.40692687034606934
6 0.288497656583786
7 0.28312715888023376
8 0.08457585424184799
9 -0.3077544569969177
The [PAD] values are different, is that normal?
I am having the same problem and couldn't find a reason or a fix yet.
Due to position embeddings, every token results in a different vector.
You might want to google "How the Embedding Layers in BERT Were Implemented".
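For instance, a rough sketch along these lines (reusing the pytorch_transformers model from the script above; indices 7-9 are just the padded positions from that example) shows that the same [PAD] id gets a different input vector at each position:

import torch
from pytorch_transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')
model.eval()

pad_id = tokenizer.convert_tokens_to_ids(['[PAD]'])[0]
word_emb = model.embeddings.word_embeddings.weight[pad_id]   # identical for every [PAD]
pos_emb = model.embeddings.position_embeddings.weight        # one row per position

# the embeddings fed into the encoder differ between [PAD] at positions 7, 8 and 9
print((pos_emb[7] - pos_emb[8]).abs().max().item())
print(((word_emb + pos_emb[7]) - (word_emb + pos_emb[9])).abs().max().item())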
Due to position embeddings, every token results in a different vector.
Could you be more specific about the source of this numerical instability? Perhaps refer to the exact code? I am still not exactly sure why the output changes slightly when I use an attention mask with differently padded inputs. There should be no self-attention over the padded positions; the self-attention scores are set to a large negative number before the softmax:
attention_scores = attention_scores + attention_mask
Could it be that -10_000 is sometimes not enough to get an exact 0 out of the softmax? I have recorded differences of at most around 2e-6.
Or is it because of arithmetic errors? According to https://en.wikipedia.org/wiki/Machine_epsilon, the upper bound on the relative error in 32-bit floats is about 1.19e-07, which is still an order of magnitude away. Could that be due to error propagation through many FP32 operations?
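As a rough standalone check of both hypotheses (toy code, not the library's actual modeling code), one can look at how much probability mass the softmax leaves behind an additive -10000.0 mask, and how large the float32 error from a different summation order can get:

import torch

# hypothesis 1: does an additive -10000.0 mask still leak probability mass through softmax?
scores = torch.randn(1, 10)           # pretend 7 real tokens + 3 padded positions
scores[:, -3:] += -10000.0            # same trick as attention_scores + attention_mask
probs = torch.softmax(scores, dim=-1)
print(probs[:, -3:])                  # mass assigned to the padded positions

# hypothesis 2: how big is the float32 error from changing the order of additions?
x = torch.randn(100000)
print((x.sum() - x.flip(0).sum()).abs().item())   # same numbers, different summation order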