Fairseq: Using sentencepiece tokenizer in RoBERTa pretraining regimen

Created on 3 Sep 2019 · 2 comments · Source: pytorch/fairseq

Would it be possible to use the sentencepiece tokenizer when preprocessing the data?


All 2 comments

Hey @nitinnairk, it's definitely possible if you are pretraining from scratch.
The released pretrained models use the GPT-2 BPE dictionary, so unfortunately you can't use a sentencepiece tokenizer with the released models.
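A minimal sketch of what the preprocessing could look like when pretraining from scratch, using the sentencepiece Python API directly (the file names, vocab size, and model prefix below are placeholders, not from this thread):

```python
import sentencepiece as spm

# Train a sentencepiece BPE model on the raw pretraining corpus
# (one sample per line in corpus.txt).
spm.SentencePieceTrainer.train(
    input="corpus.txt",
    model_prefix="spm_roberta",  # writes spm_roberta.model / spm_roberta.vocab
    vocab_size=32000,
    model_type="bpe",
)

# Encode the corpus into space-separated subword pieces so that
# fairseq-preprocess can binarize it afterwards.
sp = spm.SentencePieceProcessor(model_file="spm_roberta.model")
with open("corpus.txt") as fin, open("corpus.spm.txt", "w") as fout:
    for line in fin:
        pieces = sp.encode(line.strip(), out_type=str)
        fout.write(" ".join(pieces) + "\n")
```

The encoded text can then be binarized with `fairseq-preprocess --only-source` before launching RoBERTa pretraining; fairseq also ships `scripts/spm_train.py` and `scripts/spm_encode.py` wrappers that cover the same steps.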

Is there an example of using the sentencepiece tokenizer to preprocess the data?
