I think the XLMRobertaTokenizer vocab_size is off: it currently double counts '<unk>', '<s>', and '</s>'.
Maybe change it to:
@property
def vocab_size(self):
    return len(self.sp_model) + self.fairseq_offset
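For anyone who wants to try that counting without editing the library, a rough monkeypatch sketch like the one below should do it (it assumes the internal sp_model and fairseq_offset attributes of the slow XLMRobertaTokenizer; treat it as an illustration of the suggestion, not a tested patch):

from transformers import XLMRobertaTokenizer

# Sketch: override the vocab_size property on the class with the suggested count.
# sp_model and fairseq_offset are internal attributes of the slow tokenizer.
def _suggested_vocab_size(self):
    return len(self.sp_model) + self.fairseq_offset

XLMRobertaTokenizer.vocab_size = property(_suggested_vocab_size)

tokenizer = XLMRobertaTokenizer.from_pretrained("xlm-roberta-base")
print(tokenizer.vocab_size)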
Running the following code caused an error for me:
import transformers
tokenizer = transformers.AutoTokenizer.from_pretrained("xlm-roberta-base")
tokenizer.convert_ids_to_tokens(range(tokenizer.vocab_size))
Actually, tokenizer.vocab_size is 250005 and the last id, 250004, is <mask>, but the ids from 250001 to 250003 do not exist.
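If it helps, those numbers can be checked with a small probe like this (just a sketch; the commented output assumes the affected transformers release described above):

from transformers import XLMRobertaTokenizer

tokenizer = XLMRobertaTokenizer.from_pretrained("xlm-roberta-base")

print(tokenizer.vocab_size)  # 250005 on the affected version
print(tokenizer.mask_token, tokenizer.convert_tokens_to_ids(tokenizer.mask_token))  # '<mask>' 250004

# Probe the last few ids one at a time; 250001-250003 have no token behind them.
for i in (250000, 250001, 250002, 250003, 250004):
    try:
        print(i, tokenizer.convert_ids_to_tokens(i))
    except Exception as exc:
        print(i, "-> error:", type(exc).__name__)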
Yeah, OK, this is definitely the problem. Either way, it's an issue for the current implementation of get_vocab, which will crash at id 250001:
def get_vocab(self):
    vocab = {self.convert_ids_to_tokens(i): i for i in range(self.vocab_size)}
    vocab.update(self.added_tokens_encoder)
    return vocab
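Until the fix lands, something like the defensive variant below can serve as a stopgap; safe_get_vocab is not part of the library, just a sketch that mirrors get_vocab while skipping the ids that fail to convert:

from transformers import XLMRobertaTokenizer

def safe_get_vocab(tokenizer):
    # Same shape as get_vocab above, but skip ids with no token behind them.
    vocab = {}
    for i in range(tokenizer.vocab_size):
        try:
            vocab[tokenizer.convert_ids_to_tokens(i)] = i
        except Exception:
            continue  # e.g. ids 250001-250003 on the affected release
    vocab.update(tokenizer.added_tokens_encoder)
    return vocab

tokenizer = XLMRobertaTokenizer.from_pretrained("xlm-roberta-base")
print(len(safe_get_vocab(tokenizer)))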
I wonder if this issue will be fixed? It currently is not...
This issue is known and will be fixed.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
This should have been fixed with https://github.com/huggingface/transformers/pull/3198