Fairseq: Using sentencepiece tokenizer in RoBERTa pretraining regimen

Created on 3 Sep 2019 · 2 comments · Source: pytorch/fairseq

Would it be possible to use the sentencepiece tokenizer when preprocessing the data?


All 2 comments

Hey @nitinnairk, it's definitely possible if you are pretraining from scratch.
The released pretrained models use the GPT-2 BPE dictionary, so unfortunately you can't use a sentencepiece tokenizer with the released models.
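A minimal sketch of what the preprocessing could look like when pretraining from scratch, using the sentencepiece Python API directly (the file names, vocab size, and model prefix below are placeholders, not from this thread):

```python
import sentencepiece as spm

# Train a sentencepiece BPE model on the raw pretraining corpus
# (one sample per line in corpus.txt).
spm.SentencePieceTrainer.train(
    input="corpus.txt",
    model_prefix="spm_roberta",  # writes spm_roberta.model / spm_roberta.vocab
    vocab_size=32000,
    model_type="bpe",
)

# Encode the corpus into space-separated subword pieces so that
# fairseq-preprocess can binarize it afterwards.
sp = spm.SentencePieceProcessor(model_file="spm_roberta.model")
with open("corpus.txt") as fin, open("corpus.spm.txt", "w") as fout:
    for line in fin:
        pieces = sp.encode(line.strip(), out_type=str)
        fout.write(" ".join(pieces) + "\n")
```

The encoded text can then be binarized with `fairseq-preprocess --only-source` before launching RoBERTa pretraining; fairseq also ships `scripts/spm_train.py` and `scripts/spm_encode.py` wrappers that cover the same steps.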

Is there an example of using the sentencepiece tokenizer to preprocess the data?
