The fast tokenizer behaves differently from the normal (Python) tokenizer when padding is requested:
from transformers import BertTokenizer, BertTokenizerFast

# The normal (Python) tokenizer accepts this call:
BertTokenizer.from_pretrained("bert-base-uncased").encode("hello world", max_length=128, pad_to_max_length="right")
# succeeds

# The fast (Rust-backed) tokenizer raises on the same call:
BertTokenizerFast.from_pretrained("bert-base-uncased").encode("hello world", max_length=128, pad_to_max_length="right")
# TypeError: enable_padding() got an unexpected keyword argument 'max_length'
transformers version: 2.11.0
tokenizers version: 0.8.0rc3

Hi @jarednielsen, if you installed from source then padding is handled in a different way. You'll need to use the newly added `padding` argument. According to the docs:
padding (`Union[bool, str]`, optional, defaults to `False`):
Activate and control padding. Accepts the following values:
* `True` or `'longest'`: pad to the longest sequence in the batch (or no padding if only a single sequence is provided),
* `'max_length'`: pad to a max length specified in `max_length` or to the max acceptable input length for the model if no length is provided (`max_length=None`)
* `False` or `'do_not_pad'` (default): No padding (i.e. can output batch with sequences of uneven lengths)
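For reference, here is a minimal sketch of the failing call from the top of the thread rewritten against the new argument (assuming a source install of the master branch, post-#4510; the argument names below follow the quoted docs):

from transformers import BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
# Pad to a fixed length via the new `padding` argument instead of `pad_to_max_length`.
ids = tokenizer.encode("hello world", padding="max_length", max_length=128)
print(len(ids))  # 128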
Yes, this works on master (both the old and new tokenizer API) and should work in the new release that will be out very soon.
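As a quick sanity check of that (again only a sketch, assuming a source install of master), the normal and the fast tokenizer should return the same 128-token padded sequence through the new argument:

from transformers import BertTokenizer, BertTokenizerFast

slow = BertTokenizer.from_pretrained("bert-base-uncased")
fast = BertTokenizerFast.from_pretrained("bert-base-uncased")

slow_ids = slow.encode("hello world", padding="max_length", max_length=128)
fast_ids = fast.encode("hello world", padding="max_length", max_length=128)
assert slow_ids == fast_ids and len(slow_ids) == 128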
Thank you for the quick response! Reading https://github.com/huggingface/transformers/pull/4510 makes it much clearer.
Yes, we even have a nice tutorial on the new tokenizer API now thanks to the amazing @sgugger:
https://huggingface.co/transformers/master/preprocessing.html