I also have another question: what if we use a huggingface BPE tokenizer and then proceed with the fairseq data processing pipeline? How can I complete the fairseq-preprocess step?
fairseq-preprocess assumes that the input is already subword-encoded. If you have a BPE tokenizer from huggingface, you just need to encode your corpus with it and then preprocess as normal. If that doesn't work, I recommend re-opening with your specific issue.
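If it helps, here's a minimal sketch of that encoding step using the huggingface tokenizers library. It assumes a trained BPE model saved as tokenizer.json; the file paths match the command below, but substitute your own.

from tokenizers import Tokenizer

# Load a trained huggingface BPE tokenizer (path is an assumption).
tokenizer = Tokenizer.from_file("tokenizer.json")

# fairseq-preprocess expects one space-separated subword sequence per line.
with open("aclImdb/train.input0") as fin, \
     open("aclImdb/train.input0.bpe", "w") as fout:
    for line in fin:
        # encode() returns an Encoding; .tokens holds the subword strings.
        tokens = tokenizer.encode(line.rstrip("\n")).tokens
        fout.write(" ".join(tokens) + "\n")

Repeat the same loop for the dev split, then run fairseq-preprocess on the resulting .bpe files: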
fairseq-preprocess \
--only-source \
--trainpref "aclImdb/train.input0.bpe" \
--validpref "aclImdb/dev.input0.bpe" \
--destdir "IMDB-bin/input0" \
--workers 60 \
--srcdict dict.txt
So in the above case, what would --srcdict dict.txt be?
What dictionary should I use?
You can omit it and fairseq-preprocess will generate it for you.
Yep, @erip is exactly right. fairseq-preprocess converts the BPE text input to tensors and generates dict.txt.
In the example you linked we manually specify the dictionary, but if you don't provide one then fairseq-preprocess will generate it.
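Concretely, that means running the earlier command with the --srcdict flag dropped; fairseq-preprocess then builds the dictionary from your BPE-encoded training data and writes dict.txt into the --destdir:

fairseq-preprocess \
--only-source \
--trainpref "aclImdb/train.input0.bpe" \
--validpref "aclImdb/dev.input0.bpe" \
--destdir "IMDB-bin/input0" \
--workers 60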