I am training RoBERTa on a different language. I found how to build vocab.bpe using other BPE methods, but I am not able to figure out how to get dict.txt and encoder.json.
Please suggest how to do this.
What BPE are you using (sentencepiece, fastbpe, something else)? The encoder.json file is specific to GPT-2's BPE. The dict.txt file will be created when you preprocess your data with fairseq-preprocess.
I am using sentencepiece BPE @lematt1991 . So can I copy-paste encoder.json directly?
You shouldn't need encoder.json at all. Follow these instructions, and skip the "Next encode it with the GPT-2 BPE" section, and encode using your sentencepiece BPE. The rest should be the same.
But the preprocessing step has a --srcdict argument and needs train.bpe, valid.bpe, and test.bpe files, whereas I only got one model file and a vocab file from the sentencepiece BPE.
fairseq-preprocess \
--only-source \
--srcdict gpt2_bpe/dict.txt \
--trainpref wikitext-103-raw/wiki.train.bpe \
--validpref wikitext-103-raw/wiki.valid.bpe \
--testpref wikitext-103-raw/wiki.test.bpe \
--destdir data-bin/wikitext-103 \
--workers 60
@lematt1991
The *.bpe files are the names of the BPE encoded files. You would do something like:
for SPLIT in train valid test; do
  cat wikitext-103-raw/wiki.${SPLIT}.raw \
    | spm_encode --model=<model_file> --output_format=piece \
    > wikitext-103-raw/wiki.${SPLIT}.bpe
done
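For reference, --output_format=piece writes each line as whitespace-delimited subword pieces, with ▁ marking where a word began in the raw text. A rough pure-Python illustration of inverting that format (a sketch of spm_decode's behavior, not the actual sentencepiece implementation):

```python
def decode_pieces(piece_line):
    # Join the pieces, then turn the ▁ word-boundary marker
    # back into spaces to recover the original text.
    return piece_line.replace(" ", "").replace("▁", " ").strip()

print(decode_pieces("▁the ▁qu ick ▁fox"))  # -> "the quick fox"
```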
And then:
fairseq-preprocess \
--only-source \
--trainpref wikitext-103-raw/wiki.train.bpe \
--validpref wikitext-103-raw/wiki.valid.bpe \
--testpref wikitext-103-raw/wiki.test.bpe \
--destdir data-bin/wikitext-103 \
--workers 60
By not specifying --srcdict, fairseq-preprocess will generate a dictionary (dict.txt) for you.
Does this solve your problem? If so, do you mind closing this issue? Thanks!
Closing due to inactivity
What if dict.txt is not present and I have multiple data-bins (data-bin1:data-bin2:data-bin3, etc.)? How can I create a single dict.txt that is valid for every data-bin, instead of creating a new one for each of them, which would also cause problems?
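One possible approach (an assumption, not confirmed in this thread) is to build one joint dictionary over all corpora first and pass it to every fairseq-preprocess run via --srcdict, so all data-bins share the same token IDs. Since each dict.txt line is a "token count" pair, merging per-corpus dictionaries can be sketched like this (merge_dicts is a hypothetical helper, not part of fairseq):

```python
from collections import Counter

def merge_dicts(dict_texts):
    # Each dict.txt line is "token count"; sum the counts across
    # corpora so one shared dict.txt can serve every --srcdict.
    total = Counter()
    for text in dict_texts:
        for line in text.splitlines():
            tok, n = line.rsplit(" ", 1)
            total[tok] += int(n)
    return [f"{tok} {n}" for tok, n in total.most_common()]

print(merge_dicts(["▁a 3\n▁b 1", "▁b 2\n▁c 1"]))
```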
@lematt1991 please guide me on how to create encoder.json using the GPT-2 BPE for another language.