Transformers: is_pretokenized seems to work incorrectly

Created on 27 Jul 2020 · 5 Comments · Source: huggingface/transformers

🐛 Bug

Information

Model I am using (Bert, XLNet ...): roberta

Language I am using the model on (English, Chinese ...): English

The problem arises when using:

  • [ ] the official example scripts: (give details below)
  • [x] my own modified scripts: (give details below)

I use RobertaTokenizerFast on pretokenized text, but the problem also arises when I switch to the slow version.

The task I am working on is:

  • [ ] an official GLUE/SQuAD task: (give the name)
  • [x] my own task or dataset: (give details below)

I am trying to implement a sliding window for RoBERTa.

To reproduce

I use the tokenizer.tokenize(text) method to tokenize a whole text (1-3 sentences), then I split the tokens into chunks and pass them to the __call__ method (I also tried encode) with the is_pretokenized=True argument, but this creates additional tokens (roughly 3 times more than it should). I worked around this with a tokenize -> convert_tokens_to_ids -> prepare_for_model -> pad pipeline, but I believe the batch methods should be faster and more memory efficient.
Steps to reproduce the behavior:

  1. tokenizer = AutoTokenizer.from_pretrained('roberta-base', add_prefix_space=True, use_fast=True)
  2. ex_text = 'long text'
  3. tokens = tokenizer.tokenize(ex_text)
  4. examples = [tokens[i:i+126] for i in range(0, len(tokens), 100)]
  5. print(len(tokenizer(examples, is_pretokenized=True)['input_ids'][0])) # this prints more than 128
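For reference, here is a self-contained sketch of the two paths described above, assuming roberta-base and a placeholder text (any string long enough to yield 126+ tokens):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('roberta-base', add_prefix_space=True, use_fast=True)

ex_text = 'long text'  # placeholder: substitute any sufficiently long text
tokens = tokenizer.tokenize(ex_text)
chunk = tokens[:126]

# Path 1: feed the already-tokenized chunk back through __call__ with is_pretokenized=True
via_call = tokenizer(chunk, is_pretokenized=True)['input_ids']

# Path 2: convert the same chunk to ids and add special tokens with prepare_for_model
via_prepare = tokenizer.prepare_for_model(tokenizer.convert_tokens_to_ids(chunk))['input_ids']

print(len(via_call), len(via_prepare))  # reported issue: the first length exceeds the second (126 + 2 special tokens)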

Expected behavior

I would expect to get a result similar to the one I get when I use:

tokens = tokenizer.tokenize(ex_text)
inputs = tokenizer.convert_tokens_to_ids(tokens)
inputs = [inputs[i:i+126] for i in range(0, len(tokens), 100)]
inputs = [tokenizer.prepare_for_model(example) for example in inputs] 
inputs = tokenizer.pad(inputs, padding='longest')

Am I doing something wrong, or is this unexpected behaviour?

Environment info

  • transformers version: 3.0.2
  • Platform: macOS
  • Python version: 3.8.3
  • PyTorch version (GPU?): 1.5.1 (no GPU)
  • Tensorflow version (GPU?): NO
  • Using GPU in script?: NO
  • Using distributed or parallel set-up in script?: NO

EDIT:
I see that when I use __call__ it actually treats the single token as 2 tokens:
tokenizer(tokenizer.tokenize('How'), is_pretokenized=True)['input_ids']
out: [0, 4236, 21402, 6179, 2] where 4236, 21402 is
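
A quick sketch (assuming roberta-base, as above) to inspect what those extra ids map back to in the vocabulary:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('roberta-base', add_prefix_space=True, use_fast=True)

# re-encoding the output of tokenize() with is_pretokenized=True adds extra ids
print(tokenizer(tokenizer.tokenize('How'), is_pretokenized=True)['input_ids'])

# look up what ids 4236 and 21402 correspond to
print(tokenizer.convert_ids_to_tokens([4236, 21402]))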

All 5 comments

We face a similar issue with the distilbert tokenizer.

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-german-cased")
tokens = ['1980', 'kam', 'der', 'Crow', '##n', 'von', 'Toy', '##ota']
result = tokenizer.encode_plus(text=tokens,
                               text_pair=None,
                               add_special_tokens=True,
                               truncation=False,
                               return_special_tokens_mask=True,
                               return_token_type_ids=True,
                               is_pretokenized=True
                               )
result["input_ids"]
# returns:
[102, 3827, 1396, 125, 28177, 1634, 1634, 151, 195, 25840, 1634, 1634, 23957, 30887, 103]

tokenizer.decode(result["input_ids"])
# returns:
'[CLS] 1980 kam der Crow # # n von Toy # # ota [SEP]'

It seems that subword tokens (here ##n and ##ota) get split into further tokens even though we set is_pretokenized=True. This seems unexpected to me but maybe I am missing something?
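
A minimal sketch (assuming distilbert-base-german-cased, as above) of where the extra pieces seem to come from: each pretokenized item is apparently run through the tokenizer again as if it were a plain word:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-german-cased")

# An already-split subword such as '##n' gets tokenized again; judging by the
# decoded output above ('# # n'), the '#' characters are treated as punctuation.
print(tokenizer.tokenize("##n"))

# A plain word item, by contrast, maps to its usual pieces.
print(tokenizer.tokenize("Crow"))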

As I mentioned before, we used is_pretokenized to create a sliding window, but we recently discovered that this can be achieved with:

stride = max_seq_length - 2 - int(max_seq_length*stride)
tokenized_examples = tokenizer(examples, return_overflowing_tokens=True, 
                               max_length=max_seq_length, stride=stride, truncation=True)

this returns a dict with input_ids, attention_mask and overflow_to_sample_mapping (the latter helps map each window back to its example, but you should check for its presence; if you pass a single short example it might not be there).
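
Here is a hedged, self-contained sketch of this approach (model name, texts and stride value are placeholders; as I understand it, stride here is the number of tokens shared between consecutive windows):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('roberta-base', use_fast=True)  # assumed model

examples = ["first long text ...", "second long text ..."]  # placeholders
max_seq_length = 128
stride = 32  # assumed value: overlap in tokens between consecutive windows

encoded = tokenizer(examples,
                    return_overflowing_tokens=True,
                    max_length=max_seq_length,
                    stride=stride,
                    truncation=True)

# Every window becomes its own row in input_ids / attention_mask.
# overflow_to_sample_mapping maps each window back to its source example,
# but check for its presence first: with a single short example it may be absent.
mapping = encoded.get("overflow_to_sample_mapping")
print(len(encoded["input_ids"]), mapping)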

Hope this will help someone!

I have the same issue as @tholor - there seem to be some nasty differences between slow and fast tokenizer implementations.

Just got the same issue with bert-base-uncased. However, when is_pretokenized=False it seems to be OK. Is this expected behaviour?

from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
text  = "huggingface transformers"
tok = tokenizer.tokenize(text)
print(tok)
# ['hugging', '##face', 'transformers']

output = tokenizer.encode_plus(tok, is_pretokenized=True)
tokenizer.convert_ids_to_tokens(output["input_ids"])
# ['[CLS]', 'hugging', '#', '#', 'face', 'transformers', '[SEP]']

when is_pretokenized=False

output2 = tokenizer.encode_plus(tok, is_pretokenized=False)
tokenizer.convert_ids_to_tokens(output2["input_ids"])
# ['[CLS]', 'hugging', '##face', 'transformers', '[SEP]']

I believe that this issue can be closed because of the explanation in #6575, which states that is_pretokenized expects just a list of words split by whitespace, not actual tokens. So this is "kind of expected" behaviour :)
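
For completeness, a small sketch of that intended usage (whitespace-split words, not subword tokens), assuming bert-base-uncased as in the comment above:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
text = "huggingface transformers"

# is_pretokenized expects whitespace-split *words*; each word still goes
# through WordPiece internally, so subword splitting happens exactly once.
words = text.split()
pretok_ids = tokenizer(words, is_pretokenized=True)["input_ids"]
plain_ids = tokenizer(text)["input_ids"]

print(tokenizer.convert_ids_to_tokens(pretok_ids))
print(pretok_ids == plain_ids)  # expected to match for simple whitespace-separated text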
