Fairseq: Why do we use fairseq-preprocess after BPE multiprocessing encoding?

Created on 7 Oct 2020 · 4 comments · Source: pytorch/fairseq

I also have another question: what if we use a Hugging Face BPE tokenizer and then proceed with the fairseq data processing pipeline? How can I complete the fairseq-preprocess step?

question

All 4 comments

fairseq-preprocess assumes that the input is already subword-encoded. If you have a BPE tokenizer from huggingface, you should just need to encode your corpus and preprocess as normal. If that doesn't work, I recommend re-opening with your specific issue.
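As a concrete illustration, here is a minimal sketch of that encoding step in Python, assuming the `tokenizers` package and illustrative file names (a trained tokenizer saved as tokenizer.json, raw text in aclImdb/train.input0). Each output line is the space-separated subword tokens of the corresponding input line, which is the whitespace-tokenized format fairseq-preprocess expects:

from tokenizers import Tokenizer

tokenizer = Tokenizer.from_file("tokenizer.json")  # your trained HF tokenizer (path is illustrative)

with open("aclImdb/train.input0") as fin, open("aclImdb/train.input0.bpe", "w") as fout:
    for line in fin:
        # encode one line and write its subword tokens separated by spaces
        tokens = tokenizer.encode(line.rstrip("\n")).tokens
        fout.write(" ".join(tokens) + "\n")

Run the same loop over the dev file to produce aclImdb/dev.input0.bpe, then feed both .bpe files to fairseq-preprocess as in the command below.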

fairseq-preprocess \
    --only-source \
    --trainpref "aclImdb/train.input0.bpe" \
    --validpref "aclImdb/dev.input0.bpe" \
    --destdir "IMDB-bin/input0" \
    --workers 60 \
    --srcdict dict.txt

So in the above case, what would --srcdict dict.txt be?

What dictionary should I use?

You can omit it and fairseq-preprocess will generate it for you.

Yep, @erip is exactly right. fairseq-preprocess converts the BPE text input to tensors and generates dict.txt.

In the example you linked we manually specify the dictionary, but if you don't provide one then fairseq-preprocess will generate it.
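If you want to sanity-check the dictionary that fairseq-preprocess wrote, here is a quick sketch, assuming fairseq is installed and using the --destdir from the command above. dict.txt stores one "<symbol> <count>" pair per line, and fairseq prepends its special symbols when loading:

from fairseq.data import Dictionary

d = Dictionary.load("IMDB-bin/input0/dict.txt")
print(len(d))          # vocabulary size, including the special symbols
print(d.symbols[:8])   # <s>, <pad>, </s>, <unk>, then the most frequent BPE symbols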
