I'm using the bert-base-multilingual-cased tokenizer and model for creating another model. However, batch_encode_plus is adding an extra [SEP] token id in the middle.
The problem arises when using strings such as 16., 3., or 10.: the bert-base-multilingual-cased tokenizer is used beforehand to tokenize these strings, and batch_encode_plus is then used to convert the tokenized strings. In fact, batch_encode_plus will generate an input_ids list containing two [SEP], such as [101, 10250, 102, 119, 102].
I have seen similar issues, but they don't indicate the version of transformers:
https://github.com/huggingface/transformers/issues/2658
https://github.com/huggingface/transformers/issues/3037
Thus, I'm not sure whether this is related to transformers version 2.6.0.
Steps to reproduce the behavior (simplified steps):
Take a string such as 16. or 6., tokenize it with tokens = bert_tokenizer.tokenize("16."), and then call bert_tokenizer.batch_encode_plus([tokens]). You can reproduce the error with this code:
from transformers import BertTokenizer
import unittest


class TestListElements(unittest.TestCase):

    def setUp(self):
        bert_tokenizer = BertTokenizer.from_pretrained('bert-base-multilingual-cased')
        problematic_string = "16."
        tokens = bert_tokenizer.tokenize(problematic_string)
        # Batch containing one list of tokens, i.e. list[list[str]]
        self.encoded_batch_1 = bert_tokenizer.batch_encode_plus([tokens])
        # Batch containing one raw string, i.e. list[str]
        self.encoded_batch_2 = bert_tokenizer.batch_encode_plus([problematic_string])
        self.encoded_tokens_1 = bert_tokenizer.encode_plus(problematic_string)
        self.encoded_tokens_2 = bert_tokenizer.encode_plus(tokens)

    def test_tokens_vs_tokens(self):
        self.assertListEqual(self.encoded_tokens_1["input_ids"], self.encoded_tokens_2["input_ids"])

    def test_tokens_vs_batch_string(self):
        self.assertListEqual(self.encoded_tokens_1["input_ids"], self.encoded_batch_2["input_ids"][0])

    def test_tokens_vs_batch_list_tokens(self):
        self.assertListEqual(self.encoded_tokens_1["input_ids"], self.encoded_batch_1["input_ids"][0])


if __name__ == "__main__":
    unittest.main(verbosity=2)
The code fails at test_tokens_vs_batch_list_tokens, with the following summarized output:
- [101, 10250, 119, 102]
+ [101, 10250, 102, 119, 102]
batch_encode_plus should always produce the same input_ids, no matter whether we pass it a list of tokens or a list of strings.
For instance, for the string 16. we should always get [101, 10250, 119, 102]. However, using batch_encode_plus we get [101, 10250, 102, 119, 102] if we pass it an already tokenized input.
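A minimal sketch of the same comparison outside the test harness (assuming transformers 2.6.0, as above; the printed ids are the ones quoted in this issue):

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-multilingual-cased")

# Batch of raw strings: the expected [101, 10250, 119, 102]
print(tokenizer.batch_encode_plus(["16."])["input_ids"][0])

# Batch of pre-tokenized input: [101, 10250, 102, 119, 102], with an extra [SEP]
tokens = tokenizer.tokenize("16.")
print(tokenizer.batch_encode_plus([tokens])["input_ids"][0])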
transformers version: 2.6.0

Hi @creat89,
Thanks for posting this issue!
You are correct, there is some inconsistent behavior here.
batch_encode_plus() is not really meant for encoding a simple string; for that, the encode_plus() function should be used. Still, there is an inconsistency between encode_plus([string]) and encode_plus(string). This should probably be fixed.

Well, the issue not only happens with a simple string. In my actual code I was using a batch of size 2; however, I just used a simple example to demonstrate the issue.
I didn't find any inconsistency between encode_plus([string]) and encode_plus(string), but rather between batch_encode_plus([strings]) and batch_encode_plus([[tokens]]).
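For completeness, a sketch of the single-example path under the same assumptions (transformers 2.6.0): encode_plus gives identical ids for the raw string and for the token list, which is why only test_tokens_vs_batch_list_tokens fails above:

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-multilingual-cased")
tokens = tokenizer.tokenize("16.")

# Both calls produce [101, 10250, 119, 102]; the mismatch only appears
# in batch_encode_plus when a batch of token lists is passed.
print(tokenizer.encode_plus("16.")["input_ids"])
print(tokenizer.encode_plus(tokens)["input_ids"])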
Sorry, I was skimming through your problem too quickly - I see what you mean now.
I will take a closer look at this.
Created a PR that fixes this behavior. Thanks for pointing this out @creat89 :-)
There has been a big change in the tokenizers recently :-) which adds an is_pretokenized flag to the input and makes everything much easier. It should then be used as follows:
bert_tokenizer.batch_encode_plus([tokens], is_pretokenized=True)
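Concretely, a sketch of the full call (assuming a transformers release newer than 2.6.0 that ships the is_pretokenized flag):

from transformers import BertTokenizer

bert_tokenizer = BertTokenizer.from_pretrained("bert-base-multilingual-cased")
tokens = bert_tokenizer.tokenize("16.")

# With is_pretokenized=True the inner list is treated as one already
# tokenized example, so no extra [SEP] is inserted.
encoded = bert_tokenizer.batch_encode_plus([tokens], is_pretokenized=True)
print(encoded["input_ids"][0])  # expected: [101, 10250, 119, 102]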
Cool, that's awesome and yes, I'm sure that makes everything easier. Cheers!