Transformers: tokenizer "is_split_into_words" seems not to work

Created on 1 Nov 2020 · 6 comments · Source: huggingface/transformers

I pass in a pre-tokenized list of tokens, but the tokenizer returns a different result (not counting pad tokens). It seems to tokenize the pre-tokenized tokens again, ignoring is_split_into_words. Please refer to the code below:

sent = "the latest investigation was authorized after the supreme court in 2007 found dcc and its founder , jim flavin , guilty of selling dcc 's ( euro ) 106 million ( then $ 130 million ) stake in fyffes after flavin -- also a fyffes director at the time -- received inside information about bad fyffes news in the pipeline ."

encoded_dict = tokenizer.encode_plus(
                sent,       # Sentence to encode.
                add_special_tokens = False, # Do not add '[CLS]' and '[SEP]'.
                max_length = 314,           # Pad all sentences to this length.
                padding = 'max_length',
                return_attention_mask = True,   # Construct attn. masks.
                return_tensors = 'pt',      # Return PyTorch tensors.
                return_token_type_ids = False,
                truncation=False,
                is_split_into_words=False)
input_ids = encoded_dict['input_ids']
tokenized = tokenizer.convert_ids_to_tokens([i.item() for i in input_ids.squeeze() if i > 1])
len(tokenized)
>> 79

print(tokenized)
>> ['the', 'latest', 'investigation', 'was', 'authorized', 'after', 'the', 'supreme', 'court', 'in', '2007', 'found', 'dc', '##c', 'and', 'its', 'founder', ',', 'jim', 'fl', '##avi', '##n', ',', 'guilty', 'of', 'selling', 'dc', '##c', "'", 's', '(', 'euro', ')', '106', 'million', '(', 'then', '$', '130', 'million', ')', 'stake', 'in', 'f', '##y', '##ffe', '##s', 'after', 'fl', '##avi', '##n', '-', '-', 'also', 'a', 'f', '##y', '##ffe', '##s', 'director', 'at', 'the', 'time', '-', '-', 'received', 'inside', 'information', 'about', 'bad', 'f', '##y', '##ffe', '##s', 'news', 'in', 'the', 'pipeline', '.']

###### tokenizing the pre-tokenized tokens as a list
encoded_dict = tokenizer.encode_plus(
                tokenized,       # Pre-tokenized list of tokens to encode.
                add_special_tokens = False, # Do not add '[CLS]' and '[SEP]'.
                max_length = 314,           # Pad all sentences to this length.
                padding = 'max_length',
                return_attention_mask = True,   # Construct attn. masks.
                return_tensors = 'pt',      # Return PyTorch tensors.
                return_token_type_ids = False,
                truncation=False,
                is_split_into_words=True)

input_ids = encoded_dict['input_ids']
tokenized = tokenizer.convert_ids_to_tokens([i.item() for i in input_ids.squeeze() if i > 1])
len(tokenized)
>> 114 # it should be 79

print(tokenized)
>> ['the', 'latest', 'investigation', 'was', 'authorized', 'after', 'the', 'supreme', 'court', 'in', '2007', 'found', 'dc', '#', '#', 'c', 'and', 'its', 'founder', ',', 'jim', 'fl', '#', '#', 'av', '##i', '#', '#', 'n', ',', 'guilty', 'of', 'selling', 'dc', '#', '#', 'c', "'", 's', '(', 'euro', ')', '106', 'million', '(', 'then', '$', '130', 'million', ')', 'stake', 'in', 'f', '#', '#', 'y', '#', '#', 'ff', '##e', '#', '#', 's', 'after', 'fl', '#', '#', 'av', '##i', '#', '#', 'n', '-', '-', 'also', 'a', 'f', '#', '#', 'y', '#', '#', 'ff', '##e', '#', '#', 's', 'director', 'at', 'the', 'time', '-', '-', 'received', 'inside', 'information', 'about', 'bad', 'f', '#', '#', 'y', '#', '#', 'ff', '##e', '#', '#', 's', 'news', 'in', 'the', 'pipeline', '.']

All 6 comments

Same issue here. Is there any workaround?

Same issue. I think there is a bug in the PreTrainedTokenizer class:

def get_input_ids(text):
    print(text)
    if isinstance(text, str):
        tokens = self.tokenize(text, **kwargs)
        return self.convert_tokens_to_ids(tokens)
    elif isinstance(text, (list, tuple)) and len(text) > 0 and isinstance(text[0], str):
        if is_split_into_words:
            tokens = list(
                itertools.chain(*(self.tokenize(t, is_split_into_words=True, **kwargs) for t in text))
            )
            return self.convert_tokens_to_ids(tokens)
        else:
            return self.convert_tokens_to_ids(text)

In the is_split_into_words branch (where the input is pre-tokenized words), the tokenizer should directly return the ids.

Hello! I think all of the confusion here may be because you're expecting is_split_into_words to understand that the text was already pre-tokenized. This is not the case; it means that the string was split into words (not tokens), i.e., split on spaces.
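
For illustration, a minimal sketch of the intended usage (bert-base-uncased is assumed here purely for the example): the input is a list of words, e.g. from a whitespace split, and WordPiece is still applied to each word.

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

words = "stake in fyffes".split()   # a list of *words*, not tokens

encoded = tokenizer.encode_plus(words,
                                add_special_tokens=False,
                                is_split_into_words=True)
print(tokenizer.convert_ids_to_tokens(encoded['input_ids']))
# ['stake', 'in', 'f', '##y', '##ffe', '##s']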

@HenryPaik1, in your example, your list of words is the following:

['the', 'latest', 'investigation', 'was', 'authorized', 'after', 'the', 'supreme', 'court', 'in', '2007', 'found', 'dc', '##c', 'and', 'its', 'founder', ',', 'jim', 'fl', '##avi', '##n', ',', 'guilty', 'of', 'selling', 'dc', '##c', "'", 's', '(', 'euro', ')', '106', 'million', '(', 'then', '$', '130', 'million', ')', 'stake', 'in', 'f', '##y', '##ffe', '##s', 'after', 'fl', '##avi', '##n', '-', '-', 'also', 'a', 'f', '##y', '##ffe', '##s', 'director', 'at', 'the', 'time', '-', '-', 'received', 'inside', 'information', 'about', 'bad', 'f', '##y', '##ffe', '##s', 'news', 'in', 'the', 'pipeline', '.']

Some of these strings are tokens, but not words. Running the encoding method on it once again means that you're re-tokenizing some of these tokens.

You can see it is the case, as the following token:

 [..., '##c', ...]

became:

[..., '#', '#', 'c', ...]
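
You can reproduce this in isolation; a quick check (again assuming bert-base-uncased):

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
print(tokenizer.tokenize("##c"))
# ['#', '#', 'c']: the '##' continuation marker is treated as plain punctuation by the basic tokenizer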

I think in your case you're looking for the method convert_tokens_to_ids: your sequence is already tokenized, you only need the IDs. If you're looking to use encode_plus because you need padding/truncation/conversion to tensors, etc., then you can simply use it without specifying that the sequence is split into words. Please be aware that the following code only works on Python tokenizers, i.e., slow tokenizers.

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-cased")

sent = "the latest investigation was authorized after the supreme court in 2007 found dcc and its founder , jim flavin , guilty of selling dcc 's ( euro ) 106 million ( then $ 130 million ) stake in fyffes after flavin -- also a fyffes director at the time -- received inside information about bad fyffes news in the pipeline ."

encoded_dict = tokenizer.encode_plus(
                sent,       # Sentence to encode.
                add_special_tokens = False, # Do not add '[CLS]' and '[SEP]'.
                max_length = 314,           # Pad all sentences to this length.
                padding = 'max_length',
                return_attention_mask = True,   # Construct attn. masks.
                return_tensors = 'pt',
                truncation=False,
                is_split_into_words=False)
input_ids = encoded_dict['input_ids']
tokenized = tokenizer.convert_ids_to_tokens([i.item() for i in input_ids.squeeze() if i > 1])
print(len(tokenized))
#80 

###### tokenizing the already-tokenized tokens as a list
encoded_dict = tokenizer.encode_plus(
                tokenized,       # Already-tokenized list of tokens to encode.
                add_special_tokens = False, # Do not add '[CLS]' and '[SEP]'.
                max_length = 314,           # Pad all sentences to this length.
                padding = 'max_length',
                return_attention_mask = True,   # Construct attn. masks.
                return_tensors = 'pt',
                truncation=False,
               )

input_ids = encoded_dict['input_ids']
tokenized = tokenizer.convert_ids_to_tokens([i.item() for i in input_ids.squeeze() if i > 1])
print(len(tokenized))
# 80
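
If you prefer the convert_tokens_to_ids route, here is a sketch continuing from the snippet above; using tokenizer.prepare_for_model for the padding/tensor step is just one option, not the only way to do it.

# The sequence is already tokenized, so look the ids up directly,
# then let prepare_for_model handle padding and tensor conversion.
ids = tokenizer.convert_tokens_to_ids(tokenized)
encoded = tokenizer.prepare_for_model(
    ids,
    add_special_tokens=False,
    max_length=314,
    padding='max_length',
    return_attention_mask=True,
    return_tensors='pt',
)
print(encoded['input_ids'].shape)  # padded out to max_length=314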

@LysandreJik Thanks for your explanation. Yes, I want to use encode_plus for padding/truncation. It looks like I understood the argument is_split_into_words the other way around: is_split_into_words=True is for a sentence that has been split into words but not yet tokenized.
And if I understand correctly, you mean the part below is what the Python (slow) tokenizer executes:

def get_input_ids(text):
    if isinstance(text, str):
        tokens = self.tokenize(text, **kwargs)
        return self.convert_tokens_to_ids(tokens)
    elif isinstance(text, (list, tuple)) and len(text) > 0 and isinstance(text[0], str):
        if is_split_into_words:
            ####### this part ########
            tokens = list(
                itertools.chain(*(self.tokenize(t, is_split_into_words=True, **kwargs) for t in text))
            )
            ####### this part ########
            return self.convert_tokens_to_ids(tokens)
        else:
            return self.convert_tokens_to_ids(text)

The part you've highlighted is performing tokenization on each individual word (not token!). You can see here that if it was already tokenized, then applying a second tokenization would be incorrect.

@LysandreJik Understood, thanks. I'll close the issue.
