Model I am using (Bert, XLNet ...): roberta
Language I am using the model on (English, Chinese ...): English
The problem arises when using:
I use RobertaTokenizerFast on pretokenized text; the problem also arises when I switch to the slow version.
The tasks I am working on is:
I am trying to implement a sliding window for RoBERTa.
I use the tokenizer.tokenize(text) method to tokenize the whole text (1-3 sentences), then I divide the tokens into chunks and try to use the __call__ method (I also tried encode) with the is_pretokenized=True argument, but this creates additional tokens (about 3 times more than it should). I worked around this by using a tokenize -> convert_tokens_to_ids -> prepare_for_model -> pad pipeline, but I believe the batch methods should be faster and more memory efficient.
Steps to reproduce the behavior:
tokenizer = AutoTokenizer.from_pretrained('roberta-base', add_prefix_space=True, use_fast=True)
ex_text = 'long text'
tokens = tokenizer.tokenize(ex_text)
examples = [tokens[i:i+126] for i in range(0, len(tokens), 100)]
print(len(tokenizer(examples, is_pretokenized=True)['input_ids'][0]))  # this prints more than 128
I would expect to get a result similar to the one I get when I use:
tokens = tokenizer.tokenize(ex_text)
inputs = tokenizer.convert_tokens_to_ids(tokens)
inputs = [inputs[i:i+126] for i in range(0, len(tokens), 100)]
inputs = [tokenizer.prepare_for_model(example) for example in inputs]
inputs = tokenizer.pad(inputs, padding='longest')
Am I doing something wrong, or is this unexpected behaviour?
transformers version: 3.0.2
EDIT:
I see that when I use __call__ it actually treats Ġ as 2 tokens:
tokenizer(tokenizer.tokenize('How'), is_pretokenized=True)['input_ids']
out: [0, 4236, 21402, 6179, 2], where 4236 and 21402 together represent Ġ
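A minimal way to double-check this (a sketch, assuming roberta-base and transformers 3.0.2) is to convert the resulting IDs back to token strings:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('roberta-base', add_prefix_space=True, use_fast=True)
ids = tokenizer(tokenizer.tokenize('How'), is_pretokenized=True)['input_ids']
# inspect what the extra IDs correspond to; the leading Ġ of 'ĠHow' is re-encoded as two byte-level pieces
print(tokenizer.convert_ids_to_tokens(ids))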
We face a similar issue with the distilbert tokenizer.
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-german-cased")
tokens = ['1980', 'kam', 'der', 'Crow', '##n', 'von', 'Toy', '##ota']
result = tokenizer.encode_plus(text=tokens,
text_pair=None,
add_special_tokens=True,
truncation=False,
return_special_tokens_mask=True,
return_token_type_ids=True,
is_pretokenized=True
)
result["input_ids"]
# returns:
[102,
3827,
1396,
125,
28177,
1634,
1634,
151,
195,
25840,
1634,
1634,
23957,
30887,
103]
tokenizer.decode(result["input_ids"])
# returns:
'[CLS] 1980 kam der Crow # # n von Toy # # ota [SEP]'
It seems that subword tokens (here ##n and ##ota) get split into further tokens even though we set is_pretokenized=True. This seems unexpected to me but maybe I am missing something?
As I mentioned before, we used is_pretokenized to create a sliding window, but I recently discovered that this can be achieved using:
stride = max_seq_length - 2 - int(max_seq_length*stride)
tokenized_examples = tokenizer(examples, return_overflowing_tokens=True,
max_length=max_seq_length, stride=stride, truncation=True)
This returns a dict with input_ids, attention_mask and overflow_to_sample_mapping (the latter helps to map between windows and examples, but you should check for its presence; if you pass a single short example it might not be there).
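For completeness, here is a minimal self-contained sketch of that approach (roberta-base and the concrete max_seq_length / stride values are just placeholders I picked for illustration):
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('roberta-base', use_fast=True)
max_seq_length = 128
stride = 28  # number of overlapping tokens between consecutive windows

examples = ['first long text ...', 'second long text ...']
tokenized_examples = tokenizer(examples,
                               return_overflowing_tokens=True,
                               max_length=max_seq_length,
                               stride=stride,
                               truncation=True)

# each window becomes its own row in input_ids; overflow_to_sample_mapping
# (when present) tells you which original example a window came from
print(len(tokenized_examples['input_ids']))
print(tokenized_examples.get('overflow_to_sample_mapping'))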
Hope this will help someone 🤗
I have the same issue as @tholor - there seem to be some nasty differences between slow and fast tokenizer implementations.
Just got the same issue with bert-base-uncased. However, when is_pretokenized=False it seems to be OK. Is this expected behaviour?
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
text = "huggingface transformers"
tok = tokenizer.tokenize(text)
print(tok)
# ['hugging', '##face', 'transformers']
output = tokenizer.encode_plus(tok, is_pretokenized=True)
tokenizer.convert_ids_to_tokens(output["input_ids"])
# ['[CLS]', 'hugging', '#', '#', 'face', 'transformers', '[SEP]']
when is_pretokenized=False
output2 = tokenizer.encode_plus(tok, is_pretokenized=False)
tokenizer.convert_ids_to_tokens(output2["input_ids"])
# ['[CLS]', 'hugging', '##face', 'transformers', '[SEP]']
I believe this issue can be closed because of the explanation in #6575 stating that is_pretokenized expects a list of words split by whitespace, not actual tokens. So this is "kind of expected" behaviour :)
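To make that concrete, here is a small sketch with bert-base-uncased (same model as above) contrasting whitespace-split words with wordpiece tokens as input; the expected outputs are the ones already shown in this thread:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
text = "huggingface transformers"

# is_pretokenized expects words split on whitespace, not wordpiece tokens
words = text.split()
good = tokenizer(words, is_pretokenized=True)
print(tokenizer.convert_ids_to_tokens(good["input_ids"]))
# ['[CLS]', 'hugging', '##face', 'transformers', '[SEP]']

# passing the output of tokenize() instead re-splits the '##' pieces
tokens = tokenizer.tokenize(text)
bad = tokenizer(tokens, is_pretokenized=True)
print(tokenizer.convert_ids_to_tokens(bad["input_ids"]))
# ['[CLS]', 'hugging', '#', '#', 'face', 'transformers', '[SEP]']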