I also have another question: what if we use a huggingface BPE tokenizer and then proceed with the fairseq data processing pipeline? How can I complete the fairseq-preprocess step?
fairseq-preprocess assumes that the input is already subword-encoded. If you have a BPE tokenizer from huggingface, you just need to encode your corpus with it and then preprocess as normal. If that doesn't work, I recommend re-opening with your specific issue.
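If it helps, here's a minimal sketch of that encoding step using the huggingface tokenizers library. It assumes a trained BPE model saved as tokenizer.json; the file paths match the command below, but substitute your own.

from tokenizers import Tokenizer

# Load a trained huggingface BPE tokenizer (path is an assumption).
tokenizer = Tokenizer.from_file("tokenizer.json")

# fairseq-preprocess expects one space-separated subword sequence per line.
with open("aclImdb/train.input0") as fin, \
     open("aclImdb/train.input0.bpe", "w") as fout:
    for line in fin:
        # encode() returns an Encoding; .tokens holds the subword strings.
        tokens = tokenizer.encode(line.rstrip("\n")).tokens
        fout.write(" ".join(tokens) + "\n")

Repeat the same loop for the dev split, then run fairseq-preprocess on the resulting .bpe files: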
fairseq-preprocess \
--only-source \
--trainpref "aclImdb/train.input0.bpe" \
--validpref "aclImdb/dev.input0.bpe" \
--destdir "IMDB-bin/input0" \
--workers 60 \
--srcdict dict.txt
So in the above case, what would --srcdict dict.txt be?
What dictionary should I use?
You can omit it and fairseq-preprocess will generate it for you.
Yep, @erip is exactly right. fairseq-preprocess converts the BPE text input to tensors and generates dict.txt.
In the example you linked we manually specify the dictionary, but if you don't provide one then fairseq-preprocess will generate it.
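Concretely, that means running the earlier command with the --srcdict flag dropped; fairseq-preprocess then builds the dictionary from your BPE-encoded training data and writes dict.txt into the --destdir:

fairseq-preprocess \
--only-source \
--trainpref "aclImdb/train.input0.bpe" \
--validpref "aclImdb/dev.input0.bpe" \
--destdir "IMDB-bin/input0" \
--workers 60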