Fairseq: How to build encoder.json and dict.txt

Created on 26 Sep 2019 · 9 comments · Source: pytorch/fairseq

I am training RoBERTa on a different language. I figured out how to build vocab.bpe using other BPE methods, but I am not able to figure out how to get dict.txt and encoder.json.
Please suggest how to do this.

All 9 comments

What BPE are you using (sentencepiece, fastbpe, something else)? The encoder.json file is specific to GPT-2's BPE. The dict.txt file will be created when you preprocess your data with fairseq-preprocess.
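
For reference, the dict.txt that fairseq-preprocess writes is plain text with one token and its corpus frequency per line; the tokens and counts below are only illustrative:

head -n 3 data-bin/wikitext-103/dict.txt
# ▁the 123456
# ▁in 98765
# ▁of 87654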

I am using sentencepiece BPE, @lematt1991. So can I copy-paste encoder.json directly?

You shouldn't need encoder.json at all. Follow these instructions, skip the "Next encode it with the GPT-2 BPE" section, and encode with your sentencepiece BPE instead. The rest should be the same.
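
If you have not trained the sentencepiece model yet, a minimal sketch (the vocab size and file names here are assumptions, not from the thread):

# Train a BPE sentencepiece model on the raw training text
spm_train \
    --input=wikitext-103-raw/wiki.train.raw \
    --model_prefix=sentencepiece.bpe \
    --vocab_size=32000 \
    --model_type=bpe \
    --character_coverage=1.0

This writes sentencepiece.bpe.model (usable as the <model_file> in the encoding step below) and sentencepiece.bpe.vocab.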

But in the preprocessing step there is a --srcdict argument, and train.bpe, valid.bpe, and test.bpe files are needed, whereas I only got one model file and one vocab file from sentencepiece BPE.

fairseq-preprocess \
    --only-source \
    --srcdict gpt2_bpe/dict.txt \
    --trainpref wikitext-103-raw/wiki.train.bpe \
    --validpref wikitext-103-raw/wiki.valid.bpe \
    --testpref wikitext-103-raw/wiki.test.bpe \
    --destdir data-bin/wikitext-103 \
    --workers 60

@lematt1991

The *.bpe files are the BPE-encoded versions of your data files. You would do something like:

# Encode each split with your trained sentencepiece model (<model_file> is the .model from spm_train)
for SPLIT in train valid test; do
    cat wikitext-103-raw/wiki.${SPLIT}.raw \
        | spm_encode --model=<model_file> --output_format=piece \
        > wikitext-103-raw/wiki.${SPLIT}.bpe
done

And then:

# Binarize the encoded splits; with no --srcdict, fairseq-preprocess builds dict.txt from the data
fairseq-preprocess \
    --only-source \
    --trainpref wikitext-103-raw/wiki.train.bpe \
    --validpref wikitext-103-raw/wiki.valid.bpe \
    --testpref wikitext-103-raw/wiki.test.bpe \
    --destdir data-bin/wikitext-103 \
    --workers 60

By not specifying --srcdict, it will generate a dictionary for you.
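
Alternatively, if you want the fairseq dictionary to line up with the sentencepiece vocabulary itself, a common recipe (a sketch: it assumes the .vocab file written by spm_train, whose first three rows are the reserved <unk>, <s>, and </s>, and relies on fairseq only needing some count per token):

# Convert sentencepiece's tab-separated .vocab into a fairseq dict.txt:
# skip the three reserved tokens, keep the token column, append a dummy count
tail -n +4 sentencepiece.bpe.vocab | cut -f1 | sed 's/$/ 100/' > dict.txt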

Does this solve your problem? If so, do you mind closing this issue? Thanks!

Closing due to inactivity

What if dict.txt is not present and I have multiple data-bins (data-bin1:data-bin2:data-bin3, etc.)? How can I create a general dict.txt that is valid for every data-bin, rather than creating a new one for each of them, which would also cause problems?
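
One way to do this (a sketch; the corpus file names below are placeholders, not from the thread) is to build the dictionary once, then pass it to every later fairseq-preprocess run with --srcdict, so all data-bins share one set of token ids:

# First run: no --srcdict, so fairseq-preprocess writes data-bin1/dict.txt
fairseq-preprocess --only-source \
    --trainpref corpus1.train.bpe \
    --destdir data-bin1 --workers 60

# Later runs: reuse that dictionary so ids stay consistent across data-bins
for i in 2 3; do
    fairseq-preprocess --only-source \
        --srcdict data-bin1/dict.txt \
        --trainpref corpus${i}.train.bpe \
        --destdir data-bin${i} --workers 60
done

Note that tokens absent from the shared dict.txt will map to <unk>, so the dictionary should ideally be built over the combined corpora.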

@lematt1991:

What BPE are you using (sentencepiece, fastbpe, something else)? The encoder.json file is specific to GPT-2's BPE. The dict.txt file will be created when you preprocess your data with fairseq-preprocess.
@lematt1991 please guide me on how to create encoder.json using GPT-2's BPE for another language.
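
One possible approach (a sketch, not an answer given in the thread): train your own byte-level BPE with HuggingFace's tokenizers package, which writes vocab.json and merges.txt in the same formats as GPT-2's encoder.json and vocab.bpe. The corpus file and vocab size below are placeholders.

pip install tokenizers  # assumption: using HuggingFace's tokenizers package
python - <<'EOF'
from tokenizers import ByteLevelBPETokenizer

# Train a GPT-2-style byte-level BPE on your own corpus (corpus.txt is a placeholder)
tokenizer = ByteLevelBPETokenizer()
tokenizer.train(files=["corpus.txt"], vocab_size=50000, min_frequency=2)

# Writes vocab.json (token -> id map, like encoder.json) and merges.txt (like vocab.bpe)
tokenizer.save_model(".")
EOF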
