transformers version: current master@mfuntowicz
It seems that passing pre-tokenized input to the tokenizer and setting is_pretokenized=True doesn't prevent the tokenizer from tokenizing the input further. This issue already came up in #6046, and the cause seems to be #6573. A workaround is to set is_pretokenized=False.
What hasn't been reported yet is that the same behavior also occurs with fast tokenizers, and for those there is no workaround (or at least I haven't found one): setting is_pretokenized=False raises a ValueError.
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-german-cased")
fast_tokenizer = AutoTokenizer.from_pretrained("bert-base-german-cased", use_fast=True)
text = "Schartau sagte dem Tagesspiegel, dass Fischer ein Idiot ist"
pretokenized_text = ['Schar', '##tau', 'sagte', 'dem', 'Tages', '##spiegel', ',', 'dass', 'Fischer', 'ein', 'Id', '##iot', 'ist']
tokenized = tokenizer.encode(text)
# returns list of len 15 -> 13 tokens + 2 special tokens
pretokenized_tok = tokenizer.encode(pretokenized_text, is_pretokenized=True)
# returns list of len 23 -> too long; the sub-word tokens were tokenized again
pretokenized_tok_2 = tokenizer.encode(pretokenized_text, is_pretokenized=False)
# returns list of len 15 -> 13 tokens + 2 special tokens
fast_tokenized = fast_tokenizer.encode(text)
# returns list of len 15 -> 13 tokens + 2 special tokens
fast_pretokenized_tok = fast_tokenizer.encode(pretokenized_text, is_pretokenized=True)
# returns list of len 23 -> too long, same re-tokenization as above
# fast_pretokenizer_tok2 = fast_tokenizer.encode(pretokenized_text, is_pretokenized=False)
# would raise: 'ValueError: TextInputSequence must be str'
tokenized_decoded = tokenizer.decode(tokenized)
# returns '[CLS] Schartau sagte dem Tagesspiegel, dass Fischer ein Idiot ist [SEP]'
pretokenized_tok_decoded = tokenizer.decode(pretokenized_tok)
# returns '[CLS] Schar # # tau sagte dem Tages # # spiegel, dass Fischer ein Id # # iot ist [SEP]'
pretokenized_tok_2_decoded = tokenizer.decode(pretokenized_tok_2)
# returns '[CLS] Schartau sagte dem Tagesspiegel, dass Fischer ein Idiot ist [SEP]'
fast_tokenized_decoded = fast_tokenizer.decode(fast_tokenized)
# returns '[CLS] Schartau sagte dem Tagesspiegel, dass Fischer ein Idiot ist [SEP]'
fast_pretokenized_tok_decoded = fast_tokenizer.decode(fast_pretokenized_tok)
# returns '[CLS] Schar # # tau sagte dem Tages # # spiegel, dass Fischer ein Id # # iot ist [SEP]'
Hi,
is_pretokenized=True actually means that you are providing a list of words as strings (instead of a full sentence or paragraph), not a list of sub-words. The step that is skipped in this case is the pre-tokenization step, not the tokenization step.
This is useful for NER or token classification, for instance, but I understand that the wording can be confusing. We will try to make it clearer in the docstring and on the doc page (here). cc @sgugger and @LysandreJik
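For illustration, a minimal sketch of the intended usage with the same checkpoint as above (the exact word split shown is an assumption; any list of words works the same way):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-german-cased")
fast_tokenizer = AutoTokenizer.from_pretrained("bert-base-german-cased", use_fast=True)

# Words, not sub-word tokens: is_pretokenized=True skips only the
# pre-tokenization step (whitespace/punctuation splitting); WordPiece
# still runs on each word.
words = ['Schartau', 'sagte', 'dem', 'Tagesspiegel', ',', 'dass', 'Fischer', 'ein', 'Idiot', 'ist']
ids = tokenizer.encode(words, is_pretokenized=True)
fast_ids = fast_tokenizer.encode(words, is_pretokenized=True)
# both should return a list of len 15 -> 13 sub-word tokens + 2 special tokens,
# the same ids as encoding the full sentence directly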
Adding this to my TODO.
Thanks for making this clear! :)