Transformers: PAD symbols change the output

Created on 14 Feb 2019 · 5 comments · Source: huggingface/transformers

Adding [PAD] symbols to an input sentence changes the output of the model. I put together a small example here:

https://gist.github.com/juditacs/8be068d5f9063ad68e3098a473b497bd

I also noticed that the random seed state affects the output. Resetting it before every run ensures that the output is always the same. Is this because of layernorm?

All 5 comments

Hi Judit:

  • Regarding the padding: you should pass an attention_mask along with the input whenever the actual sequence is shorter than the (padded) tensor you are feeding in (see the description of BertModel in the README); a minimal sketch follows after this list.
  • Regarding the seed: don't forget to put your model in eval mode (model.eval()) to disable the dropout layers.
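
For concreteness, here is a minimal sketch of both points (assuming the same pytorch_transformers BertModel/BertTokenizer used further down; the attention_mask has shape (batch_size, seq_len) with 1 for real tokens and 0 for padding):

import torch
from pytorch_transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')
model.eval()  # disables dropout, so repeated runs give identical outputs

tokens = ['[CLS]'] + tokenizer.tokenize("this is a complicated sentence") + ['[SEP]']
tokens += ['[PAD]'] * 3
ids = torch.tensor([tokenizer.convert_tokens_to_ids(tokens)])

mask = torch.ones_like(ids)  # 1 for real tokens ...
mask[:, -3:] = 0             # ... 0 for the padded positions

with torch.no_grad():
    hidden = model(ids, attention_mask=mask)[0]  # (1, seq_len, 768)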

@thomwolf

Despite the attention_mask, the values are still slightly different.

Is it normal that the [PAD] vectors have different values?

import torch
from pytorch_transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased', do_lower_case=False)
model = BertModel.from_pretrained('bert-base-uncased')
model.eval()

# Run the sentence without padding.
torch.manual_seed(0)
sent = "this is a complicated sentence [SEP]"
tokens = ['[CLS]'] + tokenizer.tokenize(sent)
ids = tokenizer.convert_tokens_to_ids(tokens)
t = torch.LongTensor([ids])

with torch.no_grad():
    out = model(t)[0]  # last hidden states, shape (1, 7, 768)

# Run the same sentence with three [PAD] tokens appended and an attention
# mask that zeroes out the padded positions.
torch.manual_seed(0)
sent = "this is a complicated sentence [SEP]"
tokens = ['[CLS]'] + tokenizer.tokenize(sent)
tokens.extend(['[PAD]'] * 3)
ids = torch.tensor(tokenizer.convert_tokens_to_ids(tokens)).unsqueeze(0)
mask = torch.zeros((1, ids.shape[1], ids.shape[1]), dtype=torch.float)
mask[:, :, 0:-3] = 1.0

with torch.no_grad():
    out2 = model(ids, attention_mask=mask[:, 0])[0]  # mask[:, 0] has shape (1, 10)

# Print the first seq_len hidden dimensions of the [CLS] vector from each run.
print('------------')
for i in range(out.shape[1]):
    print(i, out[0][0, i].item())

print('------------')
for i in range(out2.shape[1]):
    print(i, out2[0][0, i].item())

Here is the output:

0 -0.10266201943159103
1 0.11214534193277359
2 -0.1575649380683899
3 -0.3163739740848541
4 -0.4168904423713684
5 -0.4069269001483917
6 0.28849801421165466
------------
0 -0.10266169905662537
1 0.1121453121304512
2 -0.15756472945213318
3 -0.3163738548755646
4 -0.41689014434814453
5 -0.40692687034606934
6 0.288497656583786
7 0.28312715888023376
8 0.08457585424184799
9 -0.3077544569969177

The [PAD]s are different, is that normal?

7 0.28312715888023376
8 0.08457585424184799
9 -0.3077544569969177

I am having the same problem and couldn't find a reason or a fix yet.

Due to the position embeddings, the same token at different positions results in different vectors.
You might want to google "How the Embedding Layers in BERT Were Implemented".
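
A quick way to see this (my own sketch, not from the thread): the [PAD] token has a single word embedding, but each position adds a different learned position embedding, so the embedding-layer outputs for the padded positions already differ before any attention happens:

import torch
from pytorch_transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')
model.eval()

pad_id = tokenizer.convert_tokens_to_ids(['[PAD]'])[0]
word_emb = model.embeddings.word_embeddings(torch.tensor([pad_id]))      # identical for every [PAD]
pos_emb = model.embeddings.position_embeddings(torch.tensor([7, 8, 9]))  # one row per position

# The summed embeddings (even before LayerNorm) differ between positions 7, 8 and 9.
print(torch.allclose(word_emb + pos_emb[0], word_emb + pos_emb[1]))  # False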

Due to the position embeddings, the same token at different positions results in different vectors.

Could you be more specific about the source of this numerical instability? Perhaps point to the exact code? I am still not sure why the output changes slightly when I use an attention mask with differently padded inputs. There should be no self-attention over the padded positions, since the self-attention scores are set to a large negative number before the softmax:
attention_scores = attention_scores + attention_mask
Could it be that sometimes -10_000 is not enough to get 0 out of the softmax? The differences I have recorded are at most on the order of 2e-6.

Or is it just arithmetic error? According to https://en.wikipedia.org/wiki/Machine_epsilon, the upper bound on the relative error in 32-bit floating point is about 1.19e-07, which is still an order of magnitude away. Could the gap be explained by error propagation through many FP32 operations?
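
For what it's worth, here is a small standalone check of the masking scheme behind that quoted line (toy scores of my own, not the model's): the mask is turned into a -10000 additive bias, and since exp(-10000) underflows to zero in fp32, the padded positions get exactly zero attention weight after the softmax. That suggests the ~2e-6 differences come from floating-point accumulation elsewhere rather than from the bias being "not enough".

import torch
import torch.nn.functional as F

# Toy attention scores for one query over 7 real tokens + 3 padded positions.
scores = torch.randn(1, 10)
mask = torch.ones(1, 10)
mask[:, -3:] = 0

# Same scheme as the quoted line: the mask becomes a large negative additive bias.
bias = (1.0 - mask) * -10000.0
probs = F.softmax(scores + bias, dim=-1)

print(probs[:, -3:])      # tensor([[0., 0., 0.]]) -- exp(-10000) underflows to zero
print(probs.sum(dim=-1))  # tensor([1.]) -- all weight stays on the real tokens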

