I think the XLMRobertaTokenizer vocab_size is off: it currently double counts '<unk>', '<s>', and '</s>'.
Maybe change it to:
@property
def vocab_size(self):
    return len(self.sp_model) + self.fairseq_offset
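For anyone who wants to try that counting without editing the library, a rough monkeypatch sketch like the one below should do it (it assumes the internal sp_model and fairseq_offset attributes of the slow XLMRobertaTokenizer; treat it as an illustration of the suggestion, not a tested patch):

from transformers import XLMRobertaTokenizer

# Sketch: override the vocab_size property on the class with the suggested count.
# sp_model and fairseq_offset are internal attributes of the slow tokenizer.
def _suggested_vocab_size(self):
    return len(self.sp_model) + self.fairseq_offset

XLMRobertaTokenizer.vocab_size = property(_suggested_vocab_size)

tokenizer = XLMRobertaTokenizer.from_pretrained("xlm-roberta-base")
print(tokenizer.vocab_size)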
Running the following code caused an error for me:
import transformers
tokenizer = transformers.AutoTokenizer.from_pretrained("xlm-roberta-base")
tokenizer.convert_ids_to_tokens(range(tokenizer.vocab_size))
Actually, tokenizer.vocab_size is 250005 and the last id, 250004, is <mask>, but the ids from 250001 to 250003 do not exist.
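If it helps, those numbers can be checked with a small probe like this (just a sketch; the commented output assumes the affected transformers release described above):

from transformers import XLMRobertaTokenizer

tokenizer = XLMRobertaTokenizer.from_pretrained("xlm-roberta-base")

print(tokenizer.vocab_size)  # 250005 on the affected version
print(tokenizer.mask_token, tokenizer.convert_tokens_to_ids(tokenizer.mask_token))  # '<mask>' 250004

# Probe the last few ids one at a time; 250001-250003 have no token behind them.
for i in (250000, 250001, 250002, 250003, 250004):
    try:
        print(i, tokenizer.convert_ids_to_tokens(i))
    except Exception as exc:
        print(i, "-> error:", type(exc).__name__)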
Yeah, OK, this is definitely the problem. Either way, it's an issue for the current implementation of get_vocab, which will crash at id 250001:
def get_vocab(self):
    vocab = {self.convert_ids_to_tokens(i): i for i in range(self.vocab_size)}
    vocab.update(self.added_tokens_encoder)
    return vocab
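Until the fix lands, something like the defensive variant below can serve as a stopgap; safe_get_vocab is not part of the library, just a sketch that mirrors get_vocab while skipping the ids that fail to convert:

from transformers import XLMRobertaTokenizer

def safe_get_vocab(tokenizer):
    # Same shape as get_vocab above, but skip ids with no token behind them.
    vocab = {}
    for i in range(tokenizer.vocab_size):
        try:
            vocab[tokenizer.convert_ids_to_tokens(i)] = i
        except Exception:
            continue  # e.g. ids 250001-250003 on the affected release
    vocab.update(tokenizer.added_tokens_encoder)
    return vocab

tokenizer = XLMRobertaTokenizer.from_pretrained("xlm-roberta-base")
print(len(safe_get_vocab(tokenizer)))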
I wonder if this issue will be fixed? It currently is not...
This issue is known and will be fixed.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
This should have been fixed with https://github.com/huggingface/transformers/pull/3198