Fairseq: How to build encoder.json and dict.txt

Created on 26 Sep 2019 · 9 comments · Source: pytorch/fairseq

I am training RoBERTa on a different language. I figured out how to build vocab.bpe using other BPE methods, but I am not able to figure out how to get dict.txt and encoder.json.
Please suggest how to do this.

All 9 comments

What BPE are you using (sentencepiece, fastbpe, something else)? The encoder.json file is specific to GPT-2's BPE. The dict.txt file will be created when you preprocess your data with fairseq-preprocess.
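
For reference, the dict.txt that fairseq-preprocess writes is plain text with one token and its corpus frequency per line; the tokens and counts below are only illustrative:

head -n 3 data-bin/wikitext-103/dict.txt
# ▁the 123456
# ▁in 98765
# ▁of 87654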

I am using sentencepiece BPE, @lematt1991. So can I copy-paste encoder.json directly?

You shouldn't need encoder.json at all. Follow these instructions, skip the "Next encode it with the GPT-2 BPE" section, and encode with your sentencepiece BPE instead. The rest should be the same.
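
If you have not trained the sentencepiece model yet, a minimal sketch (the vocab size and file names here are assumptions, not from the thread):

# Train a BPE sentencepiece model on the raw training text
spm_train \
    --input=wikitext-103-raw/wiki.train.raw \
    --model_prefix=sentencepiece.bpe \
    --vocab_size=32000 \
    --model_type=bpe \
    --character_coverage=1.0

This writes sentencepiece.bpe.model (usable as the <model_file> in the encoding step below) and sentencepiece.bpe.vocab.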

But in the preprocessing step there is a --srcdict argument, and train.bpe, valid.bpe, and test.bpe files are needed, whereas I only got one model file and one vocab file from sentencepiece BPE.

fairseq-preprocess \
    --only-source \
    --srcdict gpt2_bpe/dict.txt \
    --trainpref wikitext-103-raw/wiki.train.bpe \
    --validpref wikitext-103-raw/wiki.valid.bpe \
    --testpref wikitext-103-raw/wiki.test.bpe \
    --destdir data-bin/wikitext-103 \
    --workers 60

@lematt1991

The *.bpe files are the BPE-encoded versions of your data files. You would do something like:

# Encode each split with your trained sentencepiece model (<model_file> is the .model from spm_train)
for SPLIT in train valid test; do
    cat wikitext-103-raw/wiki.${SPLIT}.raw \
        | spm_encode --model=<model_file> --output_format=piece \
        > wikitext-103-raw/wiki.${SPLIT}.bpe
done

And then:

# Binarize the encoded splits; with no --srcdict, fairseq-preprocess builds dict.txt from the data
fairseq-preprocess \
    --only-source \
    --trainpref wikitext-103-raw/wiki.train.bpe \
    --validpref wikitext-103-raw/wiki.valid.bpe \
    --testpref wikitext-103-raw/wiki.test.bpe \
    --destdir data-bin/wikitext-103 \
    --workers 60

By not specifying --srcdict, it will generate a dictionary for you.
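
Alternatively, if you want the fairseq dictionary to line up with the sentencepiece vocabulary itself, a common recipe (a sketch: it assumes the .vocab file written by spm_train, whose first three rows are the reserved <unk>, <s>, and </s>, and relies on fairseq only needing some count per token):

# Convert sentencepiece's tab-separated .vocab into a fairseq dict.txt:
# skip the three reserved tokens, keep the token column, append a dummy count
tail -n +4 sentencepiece.bpe.vocab | cut -f1 | sed 's/$/ 100/' > dict.txt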

Does this solve your problem? If so, do you mind closing this issue? Thanks!

Closing due to inactivity

What if dict.txt is not present and I have multiple data-bins (data-bin1:data-bin2:data-bin3, etc.)? How can I create a general dict.txt that is valid for every data-bin, rather than creating a new one for each of them, which would also cause problems?
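
One way to do this (a sketch; the corpus file names below are placeholders, not from the thread) is to build the dictionary once, then pass it to every later fairseq-preprocess run with --srcdict, so all data-bins share one set of token ids:

# First run: no --srcdict, so fairseq-preprocess writes data-bin1/dict.txt
fairseq-preprocess --only-source \
    --trainpref corpus1.train.bpe \
    --destdir data-bin1 --workers 60

# Later runs: reuse that dictionary so ids stay consistent across data-bins
for i in 2 3; do
    fairseq-preprocess --only-source \
        --srcdict data-bin1/dict.txt \
        --trainpref corpus${i}.train.bpe \
        --destdir data-bin${i} --workers 60
done

Note that tokens absent from the shared dict.txt will map to <unk>, so the dictionary should ideally be built over the combined corpora.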

@lematt1991:

What BPE are you using (sentencepiece, fastbpe, something else)? The encoder.json file is specific to GPT-2's BPE. The dict.txt file will be created when you preprocess your data with fairseq-preprocess.
@lematt1991 please guide me on how to create encoder.json using GPT-2's BPE for another language.
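
One possible approach (a sketch, not an answer given in the thread): train your own byte-level BPE with HuggingFace's tokenizers package, which writes vocab.json and merges.txt in the same formats as GPT-2's encoder.json and vocab.bpe. The corpus file and vocab size below are placeholders.

pip install tokenizers  # assumption: using HuggingFace's tokenizers package
python - <<'EOF'
from tokenizers import ByteLevelBPETokenizer

# Train a GPT-2-style byte-level BPE on your own corpus (corpus.txt is a placeholder)
tokenizer = ByteLevelBPETokenizer()
tokenizer.train(files=["corpus.txt"], vocab_size=50000, min_frequency=2)

# Writes vocab.json (token -> id map, like encoder.json) and merges.txt (like vocab.bpe)
tokenizer.save_model(".")
EOF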
