When running evaluation, why am i getting slightly different output when running a batch size of 1 compared to batch size greater than 1?
It is possible to get slightly different results. Could you share more details on which evaluation script are you running and for which model/configuration etc?
I'm having the same issue, but with XLM-R:
I decided to write a simple script to demonstrate the difference between encoding individually and encoding with a batch:
```
import torch
from torchnlp.encoders.text import stack_and_pad_tensors
from torchnlp.utils import lengths_to_mask
from transformers import (BertModel, BertTokenizer, XLMRobertaModel,
                          XLMRobertaTokenizer)

torch.set_printoptions(precision=6)


def batch_encoder(samples, tokenizer):
    batch = []
    for sequence in samples:
        batch.append(torch.tensor(tokenizer.encode(sequence)))
    return stack_and_pad_tensors(batch, tokenizer.pad_token_id)


xlm = XLMRobertaModel.from_pretrained(
    'xlm-roberta-base', output_hidden_states=True
)
bert = BertModel.from_pretrained(
    'bert-base-multilingual-cased', output_hidden_states=True
)
xlm.eval()
bert.eval()

with torch.no_grad():
    bert_tokenizer = BertTokenizer.from_pretrained('bert-base-multilingual-cased')
    xlm_tokenizer = XLMRobertaTokenizer.from_pretrained('xlm-roberta-base')

    samples = ["hello world!", "This is a batch and the first sentence will be padded"]

    bert_tokens, bert_lengths = batch_encoder(samples, bert_tokenizer)
    bert_attention_mask = lengths_to_mask(bert_lengths)
    xlm_tokens, xlm_lengths = batch_encoder(samples, bert_tokenizer)
    xlm_attention_mask = lengths_to_mask(xlm_lengths)

    # Forward
    bert_out = bert(input_ids=bert_tokens, attention_mask=bert_attention_mask)
    xlm_out = xlm(input_ids=xlm_tokens, attention_mask=xlm_attention_mask)
    bert_last_hidden_states, bert_pooler_output, bert_all_layers = bert_out
    xlm_last_hidden_states, xlm_pooler_output, xlm_all_layers = xlm_out

    # Testing by comparing pooler_out
    bert_first_sample_tokens = torch.tensor(bert_tokenizer.encode(samples[0])).unsqueeze(0)
    xlm_first_sample_tokens = torch.tensor(xlm_tokenizer.encode(samples[0])).unsqueeze(0)

    bert_out = bert(input_ids=bert_first_sample_tokens)
    xlm_out = xlm(input_ids=xlm_first_sample_tokens)
    _, bert_pooler_output_1, _ = bert_out
    _, xlm_pooler_output_1, _ = xlm_out

    print(bert_pooler_output_1[0][:5])
    print(bert_pooler_output[0][:5])
    print()
    # assert torch.equal(bert_pooler_output_1[0], bert_pooler_output[0])
    print(xlm_pooler_output_1[0][:5])
    print(xlm_pooler_output[0][:5])
    # assert torch.equal(xlm_pooler_output_1[0], xlm_pooler_output[0])
```

Script Output:

```
tensor([ 0.264619,  0.191050,  0.120784, -0.024288, -0.186887])
tensor([ 0.264619,  0.191049,  0.120784, -0.024288, -0.186887])

tensor([-0.114997, -0.025624, -0.171540,  0.725383,  0.318024])
tensor([-0.042580,  0.237069,  0.136827,  0.484221,  0.019779])
```
For BERT the results don't change that much... But for XLM-R the results are shockingly different!
Am I missing something?
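One thing that may help when comparing batched and single-sample outputs is to use a tolerance check rather than `torch.equal`, since batched kernels can accumulate floating-point sums in a different order. A minimal, self-contained sketch with a toy module (not the models above), just to show the comparison pattern:

```
import torch

torch.manual_seed(0)
toy = torch.nn.Linear(32, 32).eval()   # stand-in for a real encoder

x = torch.randn(4, 32)                 # a "batch" of 4 samples
with torch.no_grad():
    batched = toy(x)                   # forward the whole batch at once
    single = toy(x[0:1])               # forward only the first sample

# Exact equality can fail purely because of float32 rounding; a small
# tolerance tells you whether the gap is numerical noise or a real bug.
print(torch.equal(batched[0], single[0]))                # may be False
print(torch.allclose(batched[0], single[0], atol=1e-5))  # should be True
print((batched[0] - single[0]).abs().max())              # typically tiny (~1e-7 or less)
```

Measured that way, the BERT outputs above differ only in the last printed digit, which looks like ordinary float noise, while the XLM-R gap is far larger than any reasonable tolerance and seems worth investigating separately.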
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
unstale
I think I'm getting a similar issue. I'm using DistilBERT in this case, but depending on the batch size, I see different outputs. The differences are slight, but confusing nonetheless. It seems like the difference appears once the batch size goes beyond 3: all batch sizes above 3 give identical outputs, and so do all batch sizes of 3 or below, but the two groups differ from each other. My example:
```
import torch
from transformers import DistilBertModel, DistilBertTokenizer

MODEL_NAME = 'distilbert-base-uncased'
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

distil_model = DistilBertModel.from_pretrained(MODEL_NAME).to(device)
distil_tokenizer = DistilBertTokenizer.from_pretrained(MODEL_NAME)
distil_model.eval()

torch.set_printoptions(precision=6)

samples = ["hello world!",
           "goodbye world!",
           "hello hello!",
           "And so on and so on.",
           "And so on and so forth."]

cond_output = {}
for cond in [2, 3, 5]:
    tokens = distil_tokenizer.batch_encode_plus(
        samples[:cond],
        pad_to_max_length=True,
        return_tensors="pt")
    tokens = tokens.to(device)
    outputs = distil_model(**tokens)
    # just taking the first token of the first sample
    cond_output[cond] = outputs[0][:, 0][0][:10].cpu().detach().numpy()

print(cond_output)
```

which prints:

```
{2: array([-0.18292062, -0.12333887,  0.1573697 , -0.1744302 , -0.25663155,
       -0.20508605,  0.31887087,  0.45650607, -0.21000467, -0.14479966],
      dtype=float32), 3: array([-0.18292062, -0.12333887,  0.1573697 , -0.1744302 , -0.25663155,
       -0.20508605,  0.31887087,  0.45650607, -0.21000467, -0.14479966],
      dtype=float32), 5: array([-0.1829206 , -0.12333884,  0.15736982, -0.1744302 , -0.25663146,
       -0.20508616,  0.318871  ,  0.45650616, -0.21000458, -0.14479981],
      dtype=float32)}
```
Anyone have thoughts here? This causes some confusion when I run an individual sample through the model, as it's not the same as if I run it with 3 other samples.
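For what it's worth, the arrays printed for batch sizes 3 and 5 differ only around the seventh decimal place, which is roughly float32 resolution. A quick tolerance check, assuming `cond_output` from the script above is still in scope:

```
import numpy as np

# cond_output is the dict built in the script above (assumed still in scope).
print(np.allclose(cond_output[2], cond_output[3]))             # True: the printed values are identical
print(np.allclose(cond_output[3], cond_output[5], atol=1e-5))  # True: only float32-level differences
```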
I also ran into the same issue using BertForPreTraining. This doesn't make sense to me: there's no component in BERT that depends on the batch size. Something like BatchNorm in training mode does produce different outputs when the batch size changes, but as far as I know there is no such component in BERT. Is there anything I missed?
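To illustrate the BatchNorm contrast mentioned above (a toy `torch.nn.BatchNorm1d` example, not anything taken from BERT): in training mode the first sample's output really does depend on the rest of the batch, while in eval mode it does not.

```
import torch

torch.manual_seed(0)
bn = torch.nn.BatchNorm1d(8)
x = torch.randn(4, 8)

# Training mode: normalization uses the statistics of the current batch,
# so sample 0's output changes when the batch changes.
bn.train()
print(torch.allclose(bn(x)[0], bn(x[:2])[0]))   # False: genuinely batch-dependent

# Eval mode: normalization uses the running statistics, so the batch
# no longer matters (up to floating-point rounding).
bn.eval()
print(torch.allclose(bn(x)[0], bn(x[:2])[0]))   # True
```

Since BERT in eval mode has no batch-statistics layer like this, any remaining batch-size differences would have to come from numerics rather than from a batch-dependent component.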
Another thing I noticed: if I use FP16, some instances yield quite different embeddings while others have completely identical embeddings across different batch sizes. If I use FP32, all instances have only slightly different embeddings (but none of them are identical).
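That pattern seems consistent with FP16's much coarser resolution: its rounding step is roughly three orders of magnitude larger than FP32's, so a change in accumulation order either rounds to the same representable value (identical embeddings) or to a visibly different one. For reference:

```
import torch

# Machine epsilon, i.e. the spacing of representable values around 1.0.
print(torch.finfo(torch.float32).eps)   # ~1.19e-07
print(torch.finfo(torch.float16).eps)   # ~9.77e-04
```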