The fast tokenizer behaves differently from the normal (Python) tokenizer when padding is requested:
from transformers import BertTokenizer, BertTokenizerFast

# The normal (Python) tokenizer accepts this call:
BertTokenizer.from_pretrained("bert-base-uncased").encode("hello world", max_length=128, pad_to_max_length="right")
# succeeds

# The fast (Rust-backed) tokenizer raises on the same call:
BertTokenizerFast.from_pretrained("bert-base-uncased").encode("hello world", max_length=128, pad_to_max_length="right")
# TypeError: enable_padding() got an unexpected keyword argument 'max_length'
transformers version: 2.11.0
tokenizers version: 0.8.0rc3

Hi @jarednielsen, if you installed from source then padding is handled in a different way. You'll need to use the newly added `padding` argument. According to the docs:
padding (`Union[bool, str]`, optional, defaults to `False`):
Activate and control padding. Accepts the following values:
* `True` or `'longest'`: pad to the longest sequence in the batch (or no padding if only a single sequence is provided),
* `'max_length'`: pad to a max length specified in `max_length` or to the max acceptable input length for the model if no length is provided (`max_length=None`)
* `False` or `'do_not_pad'` (default): No padding (i.e. can output batch with sequences of uneven lengths)
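For reference, here is a minimal sketch of the failing call from the top of the thread rewritten against the new argument (assuming a source install of the master branch, post-#4510; the argument names below follow the quoted docs):

from transformers import BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
# Pad to a fixed length via the new `padding` argument instead of `pad_to_max_length`.
ids = tokenizer.encode("hello world", padding="max_length", max_length=128)
print(len(ids))  # 128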
Yes, this works on master (both the old and new tokenizer API) and should work in the new release that will be out very soon.
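As a quick sanity check of that (again only a sketch, assuming a source install of master), the normal and the fast tokenizer should return the same 128-token padded sequence through the new argument:

from transformers import BertTokenizer, BertTokenizerFast

slow = BertTokenizer.from_pretrained("bert-base-uncased")
fast = BertTokenizerFast.from_pretrained("bert-base-uncased")

slow_ids = slow.encode("hello world", padding="max_length", max_length=128)
fast_ids = fast.encode("hello world", padding="max_length", max_length=128)
assert slow_ids == fast_ids and len(slow_ids) == 128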
Thank you for the quick response! Reading https://github.com/huggingface/transformers/pull/4510 makes it much clearer.
Yes, we even have a nice tutorial on the new tokenizer API now thanks to the amazing @sgugger:
https://huggingface.co/transformers/master/preprocessing.html