I know this warning appears because the transformers library was updated to 3.x.
I know the warning says to set TOKENIZERS_PARALLELISM=true or false.
My question is: where should I set TOKENIZERS_PARALLELISM=true/false?
Is this when defining the tokenizer, like
tok = Tokenizer.from_pretrained('xyz', TOKENIZERS_PARALLELISM=True)  # this doesn't work
or when encoding text, like
tok.encode_plus(text_string, some=some, some=some, TOKENIZERS_PARALLELISM=True)  # this also didn't work
Suggestions anyone?
I suspect this may be caused by data loading. In my case, it happens when my DataLoader starts running.
This happens whenever you use multiprocessing (often used by data loaders). The way to disable this warning is to set the TOKENIZERS_PARALLELISM environment variable to the value that makes the most sense for you. By default, we disable the parallelism to avoid any hidden deadlock that would be hard to debug, but you might be totally fine keeping it enabled in your specific use case.
You can try setting it to true; if your process seems to be stuck and doing nothing, then you should use false.
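For example, here is a minimal sketch of setting it from Python; setting it before the tokenizer does any work in a forked process is what matters, and setting it before the import is the safest. The model name 'xyz' is just a placeholder:

import os

# Set before tokenizers run in any forked worker process;
# "false" silences the warning by disabling tokenizer parallelism.
os.environ["TOKENIZERS_PARALLELISM"] = "false"  # or "true"

from tokenizers import Tokenizer

tok = Tokenizer.from_pretrained("xyz")  # placeholder model name

Equivalently, you can export it in your shell before launching the script: export TOKENIZERS_PARALLELISM=false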
We'll improve this message to help avoid any confusion (cf. https://github.com/huggingface/tokenizers/issues/328).
I may be a rookie, but it seems like it would be useful for the warning message to indicate that this is an environment variable.
You are totally right! In the latest version 3.0.2, the warning message should be a lot better, and it will trigger only when necessary.