Recently, while experimenting, BertTokenizer started throwing this warning:
Truncation was not explicitely activated but `max_length` is provided a specific value, please use `truncation=True` to explicitely truncate examples to max length. Defaulting to 'only_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you may want to check this is the right behavior.
I know the warning is asking me to provide a truncation value; I'm asking here because it only started appearing this morning.
This is because we recently upgraded the library to version v3.0.0, which has an improved tokenizers API. You can either disable warnings or pass truncation=True to remove the warning (as indicated in the warning itself).
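For example, a minimal sketch (the model name and input text are just placeholders):
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
# Passing truncation=True together with max_length silences the warning.
ids = tokenizer.encode("some long input text", max_length=128, truncation=True)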
How do you disable the warnings for this? I'm encountering the same issue, but I don't want to set truncation=True.
You can disable the warnings with:
import logging
logging.basicConfig(level=logging.ERROR)
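If you are on a more recent release, transformers also ships its own logging helper (a sketch; this helper exists from v3.1.0 onwards):
import transformers

# Show only errors from transformers' own loggers
transformers.logging.set_verbosity_error()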
I've changed the logging level and removed max_length, but am still getting this warning:
WARNING:transformers.tokenization_utils_base:Truncation was not explicitely activated but `max_length` is provided a specific value, please use `truncation=True` to explicitely truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.
Which version are you running? Can you try installing v3.0.2 to see if it fixes this issue?
I've tried with v3.0.2 and I'm getting the same warning messages even when I changed the logging level with the code snippet above.
@tutmoses @wise-east can you give us a self-contained code example reproducing the behavior?
I have the same question.
Update the transformers library to v3 and explicitly provide truncation=True when encoding text with the tokenizer.
Could reproduce the error with this code:
from transformers import CamembertTokenizer
from transformers.data.processors.utils import SingleSentenceClassificationProcessor

tokenizer = CamembertTokenizer.from_pretrained("camembert-base")
texts = ["hi", "hello", "salut", "bonjour"]
labels = [0, 0, 1, 1]
# create_from_examples is a classmethod, so call it on the class directly
processor = SingleSentenceClassificationProcessor.create_from_examples(texts, labels)
dataset = processor.get_features(tokenizer=tokenizer)
Hello,
Using the following command solved the problem:
import logging
logging.basicConfig(level = logging.ERROR)
However, since 15:40 today (Paris time), it no longer works, and the following warning keeps popping up until it crashes Google Colab:
Truncation was not explicitely activated but `max_length` is provided a specific value, please use `truncation=True` to explicitely truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.
Could you please tell me how to solve it? I also tried deactivating truncation in the encode_plus call:
encoded_dict = tokenizer.encode_plus(
    sent,                        # Sentence to encode.
    add_special_tokens=True,     # Add '[CLS]' and '[SEP]'.
    max_length=128,              # Pad & truncate all sentences.
    pad_to_max_length=True,
    return_attention_mask=True,  # Construct attention masks.
    return_tensors='pt',         # Return PyTorch tensors.
    truncation=False
)
But it did not work.
Thanks for your help/replies,
----------EDIT---------------
I modified my code as follows, setting truncation=True as suggested in this post, and it worked perfectly! From what I understand, this makes the tokenizer respect the max_length I'm passing and prevents the warning from coming up.
encoded_dict = tokenizer.encode_plus(
    sent,                        # Sentence to encode.
    add_special_tokens=True,     # Add '[CLS]' and '[SEP]'.
    max_length=128,              # Pad & truncate all sentences.
    pad_to_max_length=True,
    return_attention_mask=True,  # Construct attention masks.
    return_tensors='pt',         # Return PyTorch tensors.
    truncation=True
)
J.
'truncation=True' solves the problem.
tokenizer = BertTokenizer.from_pretrained(cfg.text_model.pretrain)
# +2 accounts for the [CLS] and [SEP] tokens that encode() adds
lengths = [len(tokenizer.tokenize(c)) + 2 for c in captions]
captions_ids = [torch.LongTensor(tokenizer.encode(c, max_length=max_len, pad_to_max_length=True, truncation=True))
                for c in captions]
A less elegant solution: modify the transformers source code (~/python/site-packages/transformers/tokenization_utils_base.py, line 1751) to avoid this warning:
if False:  # was: if verbose:
    logger.warning(
        "Truncation was not explicitely activated but `max_length` is provided a specific value, "
        "please use `truncation=True` to explicitely truncate examples to max length. "
        "Defaulting to 'longest_first' truncation strategy. "
        "If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy "
        "more precisely by providing a specific strategy to `truncation`."
    )
truncation = "longest_first"
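A less invasive alternative (my sketch, not from the thread) is to raise the level of just that module's logger; its name is visible in the WARNING line quoted above:
import logging

# Silence only the tokenization_utils_base module's logger instead of patching the source
logging.getLogger("transformers.tokenization_utils_base").setLevel(logging.ERROR)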