Recently, while experimenting, BertTokenizer started throwing this warning:
Truncation was not explicitely activated but `max_length` is provided a specific value, please use `truncation=True` to explicitely truncate examples to max length. Defaulting to 'only_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you may want to check this is the right behavior.
I know the warning is asking me to provide a truncation value; I'm asking here because it only started appearing this morning.
This is because we recently upgraded the library to version v3.0.0, which has an improved tokenizers API. You can either disable warnings or pass truncation=True to remove the warning (as indicated in the warning itself).
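For example, a minimal sketch (the model name and input text are just placeholders):
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
# Passing truncation=True together with max_length silences the warning.
ids = tokenizer.encode("some long input text", max_length=128, truncation=True)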
How do you disable the warnings for this? I'm encountering the same issue, but I don't want to set truncation=True.
You can disable the warnings with:
import logging
logging.basicConfig(level=logging.ERROR)
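If you are on a more recent release, transformers also ships its own logging helper (a sketch; this helper exists from v3.1.0 onwards):
import transformers

# Show only errors from transformers' own loggers
transformers.logging.set_verbosity_error()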
I've changed the logging level and removed max_length, but am still getting this warning:
WARNING:transformers.tokenization_utils_base:Truncation was not explicitely activated but `max_length` is provided a specific value, please use `truncation=True` to explicitely truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.
Which version are you running? Can you try installing v3.0.2 to see if it fixes this issue?
I've tried with v3.0.2 and I'm getting the same warning messages even when I changed the logging level with the code snippet above.
@tutmoses @wise-east can you give us a self-contained code example reproducing the behavior?
I have the same question.
Update the transformers library to v3 and explicitly provide truncation=True when encoding text with the tokenizer.
Could reproduce the error with this code:
from transformers import CamembertTokenizer
from transformers.data.processors.utils import SingleSentenceClassificationProcessor

tokenizer = CamembertTokenizer.from_pretrained("camembert-base")
texts = ["hi", "hello", "salut", "bonjour"]
labels = [0, 0, 1, 1]
# create_from_examples is a classmethod, so call it on the class directly
processor = SingleSentenceClassificationProcessor.create_from_examples(texts, labels)
dataset = processor.get_features(tokenizer=tokenizer)
Hello,
Using the following command solved the problem:
import logging
logging.basicConfig(level = logging.ERROR)
However, since 15:40 today (Paris time), it no longer works, and the following warning keeps popping up until it crashes Google Colab:
Truncation was not explicitely activated but `max_length` is provided a specific value, please use `truncation=True` to explicitely truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.
Could you please tell me how to solve it? I also tried deactivating truncation in the encode_plus call:
encoded_dict = tokenizer.encode_plus(
    sent,                        # Sentence to encode.
    add_special_tokens=True,     # Add '[CLS]' and '[SEP]'.
    max_length=128,              # Pad & truncate all sentences.
    pad_to_max_length=True,
    return_attention_mask=True,  # Construct attention masks.
    return_tensors='pt',         # Return PyTorch tensors.
    truncation=False
)
But it did not work.
Thanks for your help/replies,
----------EDIT---------------
I modified my code as follows, setting truncation=True as suggested in this post, and it worked perfectly! From what I understand, this makes the tokenizer respect the max_length I'm passing and prevents the warning from coming up.
encoded_dict = tokenizer.encode_plus(
    sent,                        # Sentence to encode.
    add_special_tokens=True,     # Add '[CLS]' and '[SEP]'.
    max_length=128,              # Pad & truncate all sentences.
    pad_to_max_length=True,
    return_attention_mask=True,  # Construct attention masks.
    return_tensors='pt',         # Return PyTorch tensors.
    truncation=True
)
J.
'truncation=True' solves the problem.
tokenizer = BertTokenizer.from_pretrained(cfg.text_model.pretrain)
# +2 accounts for the [CLS] and [SEP] tokens that encode() adds
lengths = [len(tokenizer.tokenize(c)) + 2 for c in captions]
captions_ids = [torch.LongTensor(tokenizer.encode(c, max_length=max_len, pad_to_max_length=True, truncation=True))
                for c in captions]
A less elegant solution: modify the transformers source code (~/python/site-packages/transformers/tokenization_utils_base.py, line 1751) to avoid this warning:
if False:  # was: if verbose:
    logger.warning(
        "Truncation was not explicitely activated but `max_length` is provided a specific value, "
        "please use `truncation=True` to explicitely truncate examples to max length. "
        "Defaulting to 'longest_first' truncation strategy. "
        "If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy "
        "more precisely by providing a specific strategy to `truncation`."
    )
truncation = "longest_first"
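A less invasive alternative (my sketch, not from the thread) is to raise the level of just that module's logger; its name is visible in the WARNING line quoted above:
import logging

# Silence only the tokenization_utils_base module's logger instead of patching the source
logging.getLogger("transformers.tokenization_utils_base").setLevel(logging.ERROR)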