Transformers: encode_plus not returning attention_mask and not padding

Created on 11 Dec 2019  ·  16 Comments  ·  Source: huggingface/transformers

🐛 Bug

Tested with the RoBERTa and BERT tokenizers on the master branch: the encode_plus method does not return an attention mask. The documentation states that an attention_mask is returned by default, but I only get back the input_ids and the token_type_ids. Even when explicitly specifying return_attention_mask=True, I don't get one back.

If these specific tokenizers (RoBERTa/BERT) don't support this functionality (which would seem odd), it might be useful to also put that in the documentation.

As a small note, there's also a typo in the documentation:

return_attention_mask – (optional) Set to False to avoir returning attention mask (default True)

Finally, it seems that pad_to_max_length isn't padding my input (see the example below). I also tried True instead of an integer, hoping it would automatically pad up to the maximum sequence length in the batch, but to no avail.

from transformers import BertTokenizer

if __name__ == '__main__':
    tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

    orig_text = ['I like bananas.', 'Yesterday the mailman came by!', 'Do you enjoy cookies?']
    edit_text = ['Do you?', 'He delivered a mystery package.', 'My grandma just baked some!']

    # orig_sents and edit_sents are single sentences (orig_text and edit_text are the lists)
    for orig_sents, edit_sents in zip(orig_text, edit_text):
        orig_tokens = tokenizer.tokenize(orig_sents)
        edit_tokens = tokenizer.tokenize(edit_sents)

        seqs = tokenizer.encode_plus(orig_tokens,
                                     edit_tokens,
                                     return_attention_mask=True,
                                     return_tensors='pt',
                                     pad_to_max_length=120)
        print(seqs)

Output:

{'input_ids': tensor([[  101,  1045,  2066, 26191,  1012,   102,  2079,  2017,  1029,   102]]),
 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 1, 1, 1, 1]])}
{'input_ids': tensor([[ 101, 7483, 1996, 5653, 2386, 2234, 2011,  999,  102, 2002, 5359, 1037, 6547, 7427, 1012,  102]]),
 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1]])}
{'input_ids': tensor([[  101,  2079,  2017,  5959, 16324,  1029,   102,  2026, 13055,  2074, 17776,  2070,   999,   102]]),
 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1]])}

All 16 comments

Hi, thanks for raising this issue!

When running this code on the master branch, I do get the attention mask as output, but only when removing the return_tensors argument. When running with this argument, it crashes because a list is being concatenated to a tensor. I'm fixing this in #2148.
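
For example, a minimal sketch of the behaviour described above (the sentence pair is my own, not taken from the snippet):

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

# Dropping return_tensors avoids the crash; the attention mask is then part of the output.
seqs = tokenizer.encode_plus('I like bananas.', 'Do you?', return_attention_mask=True)
print('attention_mask' in seqs)  # expected: True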

It's weird that you didn't get an error when running this line. Which commit are you on? encode and encode_plus accept **kwargs, so no error would be raised if one of your arguments (pad_to_max_length) wasn't supposed to be there (e.g. if you're running an old version of transformers).

pad_to_max_length is a boolean flag: if set to True with no max_length specified, it will pad the sequence up to the maximum sequence length the model can handle. If a max_length is specified, it will pad the sequence up to that number.
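
For example, a minimal sketch of both cases (the sentence is made up; the 512 assumes bert-base-uncased, whose maximum sequence length is 512):

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

# With max_length: pad (and truncate) to exactly 16 tokens.
enc = tokenizer.encode_plus('I like bananas.', max_length=16, pad_to_max_length=True)
print(len(enc['input_ids']))  # 16

# Without max_length: pad up to the model's maximum sequence length.
enc = tokenizer.encode_plus('I like bananas.', pad_to_max_length=True)
print(len(enc['input_ids']))  # 512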

Hey!
For me, setting pad_to_max_length results in an error being thrown. I just tried it on the master branch, but it resulted in the same error.
The code I'm executing:

titles = [['allround developer', 'Visual Studio Code'],
 ['allround developer', 'IntelliJ IDEA / PyCharm'],
 ['allround developer', 'Version Control']]
enc_titles = [[tokenizer.encode_plus(title[0], max_length=13, pad_to_max_length=True), tokenizer.encode_plus(title[1], max_length=13, pad_to_max_length=True)] for title in titles]

The error that I am getting:
```
TypeError                                 Traceback (most recent call last)
in
4 # titles = [' '.join(title) for title in titles]
5 print(titles)
----> 6 enc_titles = [[tokenizer.encode_plus(title[0], max_length=4, pad_to_max_length=True), tokenizer.encode_plus(title[1], max_length=4)] for title in titles]

in (.0)
4 # titles = [' '.join(title) for title in titles]
5 print(titles)
----> 6 enc_titles = [[tokenizer.encode_plus(title[0], max_length=4, pad_to_max_length=True), tokenizer.encode_plus(title[1], max_length=4)] for title in titles]

/usr/local/lib/python3.7/site-packages/transformers/tokenization_utils.py in encode_plus(self, text, text_pair, add_special_tokens, max_length, stride, truncation_strategy, return_tensors, return_token_type_ids, return_overflowing_tokens, return_special_tokens_mask, **kwargs)
816 If there are overflowing tokens, those will be added to the returned dictionary
817 stride: if set to a number along with max_length, the overflowing tokens returned will contain some tokens
--> 818 from the main sequence returned. The value of this argument defines the number of additional tokens.
819 truncation_strategy: string selected in the following options:
820 - 'longest_first' (default) Iteratively reduce the inputs sequence until the input is under max_length

/usr/local/lib/python3.7/site-packages/transformers/tokenization_utils.py in get_input_ids(text)
808 the tokenize method) or a list of integers (tokenized string ids using the convert_tokens_to_ids
809 method)
--> 810 text_pair: Optional second sequence to be encoded. This can be a string, a list of strings (tokenized
811 string using the tokenize method) or a list of integers (tokenized string ids using the
812 convert_tokens_to_ids method)

/usr/local/lib/python3.7/site-packages/transformers/tokenization_utils.py in tokenize(self, text, **kwargs)
657 sub_text = sub_text.strip()
658 if i == 0 and not sub_text:
--> 659 result += [tok]
660 elif i == len(split_text) - 1:
661 if sub_text:

/usr/local/lib/python3.7/site-packages/transformers/tokenization_utils.py in split_on_tokens(tok_list, text)
654 result = []
655 split_text = text.split(tok)
--> 656 for i, sub_text in enumerate(split_text):
657 sub_text = sub_text.strip()
658 if i == 0 and not sub_text:

/usr/local/lib/python3.7/site-packages/transformers/tokenization_utils.py in (.0)
654 result = []
655 split_text = text.split(tok)
--> 656 for i, sub_text in enumerate(split_text):
657 sub_text = sub_text.strip()
658 if i == 0 and not sub_text:

TypeError: _tokenize() got an unexpected keyword argument 'pad_to_max_length'
```

Hm, you're right. I think it was (again) an issue with the notebook I was testing in, where stale values from previous cells were being used or something like that.

Thanks for the fix!

Now that we're on the topic, though, it might be nice to have a convenience method for batch processing. Something along these lines, where pad_to_batch_length pads up to the longest sequence in the batch (rather than the model's max_seq_length) to save computation/memory:

from collections import defaultdict

# `tokenizer` is assumed to be in scope, e.g. a BertTokenizer instance
def encode_batch_plus(batch, batch_pair=None, pad_to_batch_length=False, return_tensors=None, **kwargs):
    def merge_dicts(list_of_ds):
        # there's probably a better way of doing this
        d = defaultdict(list)
        for _d in list_of_ds:
            for _k, _v in _d.items():
                d[_k].append(_v)

        return dict(d)

    encoded_inputs = []
    batch_pair = [None] * len(batch) if batch_pair is None else batch_pair
    for first_sent, second_sent in zip(batch, batch_pair):
        encoded_inputs.append(tokenizer.encode_plus(first_sent,
                                                    second_sent,
                                                    **kwargs))

    encoded_inputs = merge_dicts(encoded_inputs)

    if pad_to_batch_length:
        max_batch_len = max([len(l) for l in encoded_inputs['input_ids']])
        # pad up to max_batch_len, similar to how it's done in prepare_for_model()

    if return_tensors:
        # convert to tensors, similar to how it's done in prepare_for_model()
        pass

    return encoded_inputs

@Jarvanerp I cannot reproduce your issue, though. Your code works for me.

# output
[[{'input_ids': [101, 2035, 22494, 4859, 9722, 102, 0, 0, 0, 0, 0, 0, 0], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0]}, {'input_ids': [101, 5107, 2996, 3642, 102, 0, 0, 0, 0, 0, 0, 0, 0], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]}], [{'input_ids': [101, 2035, 22494, 4859, 9722, 102, 0, 0, 0, 0, 0, 0, 0], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0]}, {'input_ids': [101, 13420, 3669, 3501, 2801, 1013, 1052, 17994, 27292, 102, 0, 0, 0], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0]}], [{'input_ids': [101, 2035, 22494, 4859, 9722, 102, 0, 0, 0, 0, 0, 0, 0], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0]}, {'input_ids': [101, 2544, 2491, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0]}]]

@BramVanroy Thanks for your comment! It made me try it out in just a plain Python file instead of a Jupyter notebook and it worked... 😄

@BramVanroy Indeed, batch processing would be a cool feature, especially when padding's involved. We're thinking about it cc @mfuntowicz @thomwolf

@LysandreJik That's good news! Looking forward to it; it will help get rid of boilerplate in our code.

@LysandreJik Just to keep you updated, this is what I am using now. (The padding and tensor-conversion steps are modified versions of those in prepare_for_model.) I think it covers most, if not all, of the functionality of encode_plus. If you want, I can look at brushing it up, adding tests similar to those for encode_plus, adding an encode_batch method, and so on, and open a PR.

from collections import defaultdict

import torch
# tensorflow is only needed when return_tensors='tf'
# import tensorflow as tf

# TOKENIZER is assumed to be a module-level tokenizer instance, e.g.
# TOKENIZER = BertTokenizer.from_pretrained('bert-base-uncased')
def encode_batch_plus(batch,
                      batch_pair=None,
                      pad_to_batch_length=False,
                      return_tensors=None,
                      return_token_type_ids=True,
                      return_attention_mask=True,
                      return_special_tokens_mask=False,
                      **kwargs):

    if pad_to_batch_length and 'pad_to_max_length' in kwargs and kwargs['pad_to_max_length']:
        raise ValueError("'pad_to_batch_length' and 'pad_to_max_length' cannot be used simultaneously.")

    def merge_dicts(list_of_ds):
        d = defaultdict(list)
        for _d in list_of_ds:
            for _k, _v in _d.items():
                d[_k].append(_v)

        return dict(d)

    # gather all encoded inputs in a list of dicts
    encoded = []
    batch_pair = [None] * len(batch) if batch_pair is None else batch_pair
    for first_sent, second_sent in zip(batch, batch_pair):
        # return_tensors=None: don't convert to tensors yet; do that manually as the last step
        encoded.append(TOKENIZER.encode_plus(first_sent,
                                             second_sent,
                                             return_tensors=None,
                                             return_token_type_ids=return_token_type_ids,
                                             return_attention_mask=return_attention_mask,
                                             return_special_tokens_mask=return_special_tokens_mask,
                                             **kwargs))

    # convert list of dicts in a single merged dict
    encoded = merge_dicts(encoded)

    if pad_to_batch_length:
        max_batch_len = max([len(l) for l in encoded['input_ids']])

        if TOKENIZER.padding_side == 'right':
            if return_attention_mask:
                encoded['attention_mask'] = [mask + [0] * (max_batch_len - len(mask)) for mask in encoded['attention_mask']]
            if return_token_type_ids:
                encoded["token_type_ids"] = [ttis + [TOKENIZER.pad_token_type_id] * (max_batch_len - len(ttis)) for ttis in encoded['token_type_ids']]
            if return_special_tokens_mask:
                encoded['special_tokens_mask'] = [stm + [1] * (max_batch_len - len(stm)) for stm in encoded['special_tokens_mask']]
            encoded['input_ids'] = [ii + [TOKENIZER.pad_token_id] * (max_batch_len - len(ii)) for ii in encoded['input_ids']]
        elif TOKENIZER.padding_side == 'left':
            if return_attention_mask:
                encoded['attention_mask'] = [[0] * (max_batch_len - len(mask)) + mask for mask in encoded['attention_mask']]
            if return_token_type_ids:
                encoded['token_type_ids'] = [[TOKENIZER.pad_token_type_id] * (max_batch_len - len(ttis)) + ttis for ttis in encoded['token_type_ids']]
            if return_special_tokens_mask:
                encoded['special_tokens_mask'] = [[1] * (max_batch_len - len(stm)) + stm for stm in encoded['special_tokens_mask']]
            encoded['input_ids'] = [[TOKENIZER.pad_token_id] * (max_batch_len - len(ii)) + ii for ii in encoded['input_ids']]
        else:
            raise ValueError(f"Invalid padding strategy: {TOKENIZER.padding_side}")

    if return_tensors is not None:
        if return_tensors in {'pt', 'tf'}:
            encoded['input_ids'] = tf.constant(encoded['input_ids']) if return_tensors == 'tf' \
                else torch.tensor(encoded['input_ids'])
            if 'attention_mask' in encoded:
                encoded['attention_mask'] = tf.constant(encoded['attention_mask']) if return_tensors == 'tf' \
                    else torch.tensor(encoded['attention_mask'])
            if 'token_type_ids' in encoded:
                encoded['token_type_ids'] = tf.constant(encoded['token_type_ids']) if return_tensors == 'tf' \
                    else torch.tensor(encoded['token_type_ids'])
            if 'special_tokens_mask' in encoded:
                encoded['special_tokens_mask'] = tf.constant(encoded['special_tokens_mask']) if return_tensors == 'tf' \
                    else torch.tensor(encoded['special_tokens_mask'])
            # should num_truncated_tokens, overflowing_tokens also be converted to tensors?
            # if yes then this could be generalised in a for loop/dict comprehension converting all k,v to k,tensor(v)
        else:
            raise ValueError(f"Cannot return tensors with value '{return_tensors}'")

    return encoded
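
A quick usage sketch, purely illustrative (the tokenizer choice and sentences are my own assumptions, not part of the snippet above):

from transformers import BertTokenizer

TOKENIZER = BertTokenizer.from_pretrained('bert-base-uncased')

batch = ['I like bananas.', 'Yesterday the mailman came by!']
batch_pair = ['Do you?', 'He delivered a mystery package.']

encoded = encode_batch_plus(batch, batch_pair, pad_to_batch_length=True, return_tensors='pt')
print(encoded['input_ids'].shape)       # (2, length of the longest sequence in the batch)
print(encoded['attention_mask'].shape)  # same shape as input_ids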

Hi @BramVanroy, thank you for sharing! I believe @mfuntowicz is working on a similar implementation on the cli branch

Aha, great. I couldn't wait because I needed it for a shared task, but nice to see it's taking form. Almost there!

@BramVanroy @LysandreJik I don't think the padding issue is resolved yet.

Can you give more information? A minimal example that we can copy-and-paste as well as your expected output would be nice.

Hello, I confirm that the padding issue is not resolved yet.

It works with return_overflowing_tokens=False but not with return_overflowing_tokens=True for some reason; see the sample code below:

>>> from transformers import BertTokenizer
>>> tokenizer=BertTokenizer.from_pretrained('bert-base-cased')
>>> fake_batch = ["foo "*100, "foo "*42] 

>>> text_encoded_plus=tokenizer.batch_encode_plus(fake_batch,
                                              add_special_tokens=False,
                                              max_length=10,
                                              pad_to_max_length=True,
                                              return_tensors='pt',
                                              return_attention_mask=True,
                                              return_overflowing_tokens=False)
>>> print(text_encoded_plus['input_ids'].shape, text_encoded_plus['attention_mask'].shape)
torch.Size([2, 10]) torch.Size([2, 10])
>>> text_encoded_plus=tokenizer.batch_encode_plus(fake_batch,
                                                  add_special_tokens=False,
                                                  max_length=10,
                                                  pad_to_max_length=True,
                                                  return_tensors='pt',
                                                  return_attention_mask=True,
                                                  return_overflowing_tokens=True)
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
~/anaconda3/envs/pyannote/lib/python3.7/site-packages/transformers/tokenization_utils.py in convert_to_tensors_(self, batch_outputs, return_tensors)
   1801                 try:
-> 1802                     batch_outputs[key] = torch.tensor(value)
   1803                 except ValueError:

ValueError: expected sequence of length 190 at dim 1 (got 74)

During handling of the above exception, another exception occurred:

ValueError                                Traceback (most recent call last)
<ipython-input-249-da5ce1e175a8> in <module>
      7                                               return_tensors='pt',
      8                                               return_attention_mask=mask,
----> 9                                               return_overflowing_tokens=True)
     10 print(text_encoded_plus['input_ids'].shape)

~/anaconda3/envs/pyannote/lib/python3.7/site-packages/transformers/tokenization_utils.py in batch_encode_plus(self, batch_text_or_text_pairs, add_special_tokens, max_length, stride, truncation_strategy, pad_to_max_length, is_pretokenized, return_tensors, return_token_type_ids, return_attention_masks, return_overflowing_tokens, return_special_tokens_masks, return_offsets_mapping, return_lengths, **kwargs)
   1784         if return_tensors is not None:
   1785 
-> 1786             self.convert_to_tensors_(batch_outputs, return_tensors)
   1787         return BatchEncoding(batch_outputs)
   1788 

~/anaconda3/envs/pyannote/lib/python3.7/site-packages/transformers/tokenization_utils.py in convert_to_tensors_(self, batch_outputs, return_tensors)
   1802                     batch_outputs[key] = torch.tensor(value)
   1803                 except ValueError:
-> 1804                     raise ValueError(self.UNEVEN_SEQUENCES_FOR_BATCH_MSG)
   1805                 except RuntimeError:
   1806                     if None in [item for sequence in value for item in sequence]:

ValueError: The sequences building the batch are not of the same size, no tensor can be built. Set `pad_to_max_length=True` to pad the smaller sequencesup to the larger sequence's length.

Indeed, I can reproduce. Looking into it now.

The issue with this is that slow tokenizers cannot convert the overflowing_tokens to tensors as these have mismatching dimensions. This was never handled, unfortunately, so I added a better error message in #5633.

The good news is that fast tokenizers do handle this feature! Simply replacing BertTokenizer with BertTokenizerFast should do the job.
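
A minimal sketch of that workaround, reusing the fake_batch from above (the final shapes are my expectation rather than verified output; with a fast tokenizer the overflowing chunks should come back as extra rows instead of breaking the tensor conversion):

>>> from transformers import BertTokenizerFast
>>> fast_tokenizer = BertTokenizerFast.from_pretrained('bert-base-cased')
>>> encoded = fast_tokenizer.batch_encode_plus(fake_batch,
                                               add_special_tokens=False,
                                               max_length=10,
                                               pad_to_max_length=True,
                                               return_tensors='pt',
                                               return_attention_mask=True,
                                               return_overflowing_tokens=True)
>>> encoded['input_ids'].shape[1], encoded['attention_mask'].shape[1]
(10, 10)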

Thanks for letting us know of this issue.

Oh okay, thank you!
I thought that the regular, kept tokens were not being padded :)
