When running evaluation, why am i getting slightly different output when running a batch size of 1 compared to batch size greater than 1?
It is possible to get slightly different results. Could you share more details on which evaluation script are you running and for which model/configuration etc?
I'm having the same issue, but with XLM-R:
I decided to write a simple script to demonstrate the difference between encoding individually and encoding with a batch:
```
import torch
from torchnlp.encoders.text import stack_and_pad_tensors
from torchnlp.utils import lengths_to_mask
from transformers import (BertModel, BertTokenizer, XLMRobertaModel,
                          XLMRobertaTokenizer)

torch.set_printoptions(precision=6)


def batch_encoder(samples, tokenizer):
    batch = []
    for sequence in samples:
        batch.append(torch.tensor(tokenizer.encode(sequence)))
    return stack_and_pad_tensors(batch, tokenizer.pad_token_id)


xlm = XLMRobertaModel.from_pretrained(
    'xlm-roberta-base', output_hidden_states=True
)
bert = BertModel.from_pretrained(
    'bert-base-multilingual-cased', output_hidden_states=True
)
xlm.eval()
bert.eval()

with torch.no_grad():
    bert_tokenizer = BertTokenizer.from_pretrained('bert-base-multilingual-cased')
    xlm_tokenizer = XLMRobertaTokenizer.from_pretrained('xlm-roberta-base')

    samples = ["hello world!", "This is a batch and the first sentence will be padded"]

    bert_tokens, bert_lengths = batch_encoder(samples, bert_tokenizer)
    bert_attention_mask = lengths_to_mask(bert_lengths)
    xlm_tokens, xlm_lengths = batch_encoder(samples, bert_tokenizer)
    xlm_attention_mask = lengths_to_mask(xlm_lengths)

    # Forward
    bert_out = bert(input_ids=bert_tokens, attention_mask=bert_attention_mask)
    xlm_out = xlm(input_ids=xlm_tokens, attention_mask=xlm_attention_mask)
    bert_last_hidden_states, bert_pooler_output, bert_all_layers = bert_out
    xlm_last_hidden_states, xlm_pooler_output, xlm_all_layers = xlm_out

    # Testing by comparing pooler_out
    bert_first_sample_tokens = torch.tensor(bert_tokenizer.encode(samples[0])).unsqueeze(0)
    xlm_first_sample_tokens = torch.tensor(xlm_tokenizer.encode(samples[0])).unsqueeze(0)

    bert_out = bert(input_ids=bert_first_sample_tokens)
    xlm_out = xlm(input_ids=xlm_first_sample_tokens)
    _, bert_pooler_output_1, _ = bert_out
    _, xlm_pooler_output_1, _ = xlm_out

    print(bert_pooler_output_1[0][:5])
    print(bert_pooler_output[0][:5])
    print()
    # assert torch.equal(bert_pooler_output_1[0], bert_pooler_output[0])
    print(xlm_pooler_output_1[0][:5])
    print(xlm_pooler_output[0][:5])
    # assert torch.equal(xlm_pooler_output_1[0], xlm_pooler_output[0])
```

Script Output:

```
tensor([ 0.264619,  0.191050,  0.120784, -0.024288, -0.186887])
tensor([ 0.264619,  0.191049,  0.120784, -0.024288, -0.186887])

tensor([-0.114997, -0.025624, -0.171540,  0.725383,  0.318024])
tensor([-0.042580,  0.237069,  0.136827,  0.484221,  0.019779])
```
For BERT the results don't change that much... But for XLM-R the results are shockingly different!
Am I missing something?
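One thing that may help when comparing batched and single-sample outputs is to use a tolerance check rather than `torch.equal`, since batched kernels can accumulate floating-point sums in a different order. A minimal, self-contained sketch with a toy module (not the models above), just to show the comparison pattern:

```
import torch

torch.manual_seed(0)
toy = torch.nn.Linear(32, 32).eval()   # stand-in for a real encoder

x = torch.randn(4, 32)                 # a "batch" of 4 samples
with torch.no_grad():
    batched = toy(x)                   # forward the whole batch at once
    single = toy(x[0:1])               # forward only the first sample

# Exact equality can fail purely because of float32 rounding; a small
# tolerance tells you whether the gap is numerical noise or a real bug.
print(torch.equal(batched[0], single[0]))                # may be False
print(torch.allclose(batched[0], single[0], atol=1e-5))  # should be True
print((batched[0] - single[0]).abs().max())              # typically tiny (~1e-7 or less)
```

Measured that way, the BERT outputs above differ only in the last printed digit, which looks like ordinary float noise, while the XLM-R gap is far larger than any reasonable tolerance and seems worth investigating separately.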
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
unstale
I think I'm getting a similar issue. I'm using DistilBERT in this case, but depending on the batch size, I see different outputs. The differences are slight, but confusing nonetheless. It seems like the difference appears once the batch size goes beyond 3: all batch sizes above 3 give identical outputs, and so do all batch sizes of 3 or below, but the two groups differ from each other. My example:
```
import torch
from transformers import DistilBertModel, DistilBertTokenizer

MODEL_NAME = 'distilbert-base-uncased'
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

distil_model = DistilBertModel.from_pretrained(MODEL_NAME).to(device)
distil_tokenizer = DistilBertTokenizer.from_pretrained(MODEL_NAME)
distil_model.eval()

torch.set_printoptions(precision=6)

samples = ["hello world!",
           "goodbye world!",
           "hello hello!",
           "And so on and so on.",
           "And so on and so forth."]

cond_output = {}
for cond in [2, 3, 5]:
    tokens = distil_tokenizer.batch_encode_plus(
        samples[:cond],
        pad_to_max_length=True,
        return_tensors="pt")
    tokens = tokens.to(device)
    outputs = distil_model(**tokens)
    # just taking the first token of the first sample
    cond_output[cond] = outputs[0][:, 0][0][:10].cpu().detach().numpy()

print(cond_output)
```

which prints:

```
{2: array([-0.18292062, -0.12333887,  0.1573697 , -0.1744302 , -0.25663155,
       -0.20508605,  0.31887087,  0.45650607, -0.21000467, -0.14479966],
      dtype=float32), 3: array([-0.18292062, -0.12333887,  0.1573697 , -0.1744302 , -0.25663155,
       -0.20508605,  0.31887087,  0.45650607, -0.21000467, -0.14479966],
      dtype=float32), 5: array([-0.1829206 , -0.12333884,  0.15736982, -0.1744302 , -0.25663146,
       -0.20508616,  0.318871  ,  0.45650616, -0.21000458, -0.14479981],
      dtype=float32)}
```
Anyone have thoughts here? This causes some confusion when I run an individual sample through the model, as it's not the same as if I run it with 3 other samples.
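For what it's worth, the arrays printed for batch sizes 3 and 5 differ only around the seventh decimal place, which is roughly float32 resolution. A quick tolerance check, assuming `cond_output` from the script above is still in scope:

```
import numpy as np

# cond_output is the dict built in the script above (assumed still in scope).
print(np.allclose(cond_output[2], cond_output[3]))             # True: the printed values are identical
print(np.allclose(cond_output[3], cond_output[5], atol=1e-5))  # True: only float32-level differences
```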
I also ran into the same issue using BertForPreTraining. This doesn't make sense to me: there's no component in BERT that depends on the batch size. Something like BatchNorm in training mode does produce different outputs when the batch size changes, but as far as I know there is no such component in BERT. Is there anything I missed?
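To illustrate the BatchNorm contrast mentioned above (a toy `torch.nn.BatchNorm1d` example, not anything taken from BERT): in training mode the first sample's output really does depend on the rest of the batch, while in eval mode it does not.

```
import torch

torch.manual_seed(0)
bn = torch.nn.BatchNorm1d(8)
x = torch.randn(4, 8)

# Training mode: normalization uses the statistics of the current batch,
# so sample 0's output changes when the batch changes.
bn.train()
print(torch.allclose(bn(x)[0], bn(x[:2])[0]))   # False: genuinely batch-dependent

# Eval mode: normalization uses the running statistics, so the batch
# no longer matters (up to floating-point rounding).
bn.eval()
print(torch.allclose(bn(x)[0], bn(x[:2])[0]))   # True
```

Since BERT in eval mode has no batch-statistics layer like this, any remaining batch-size differences would have to come from numerics rather than from a batch-dependent component.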
Another thing I noticed: if I use FP16, some instances yield quite different embeddings while others have completely identical embeddings across different batch sizes. If I use FP32, all instances have only slightly different embeddings (but none of them are identical).
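That pattern seems consistent with FP16's much coarser resolution: its rounding step is roughly three orders of magnitude larger than FP32's, so a change in accumulation order either rounds to the same representable value (identical embeddings) or to a visibly different one. For reference:

```
import torch

# Machine epsilon, i.e. the spacing of representable values around 1.0.
print(torch.finfo(torch.float32).eps)   # ~1.19e-07
print(torch.finfo(torch.float16).eps)   # ~9.77e-04
```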