Hi,
I used t2t-datagen to download and process the enfr_wmt32k dataset, but the resulting dataset is about 40M, not the expected 36M. Does this mean I still need to preprocess it further, for example by filtering the data? If so, how should I do that? Thanks.
Hey, that's an excellent question. I'd be happy to hear the answers from the experts.
For my part, I can only mention that some time ago I found a reply to this same question in the FairSeq issue queue: https://github.com/facebookresearch/fairseq/issues/59
Hope this helps.
@gsoul Thanks. As far as I can tell, the fairseq preprocessing would yield 34M rather than 36M, so it doesn't match either. I think we still need to wait for the experts' answers.
We don't do any further pre-processing in our experiments. Note that things like "remove sentences longer than 175 words" may happen during training if you set max_len to that value. But the results, e.g. in Attention Is All You Need (https://arxiv.org/abs/1706.03762), were obtained without any other pre-processing.
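To make the length-filtering point concrete, here is a minimal sketch of what such a filter amounts to. It is not the tensor2tensor implementation: `filter_by_length` and its `max_len` argument are hypothetical names used only for illustration, and the sketch counts whitespace-separated words, which may differ from how the training code measures length.

```python
def filter_by_length(pairs, max_len=175):
    """Keep only (source, target) pairs where both sides have at most
    max_len whitespace-separated words.

    Hypothetical helper for illustration; not part of tensor2tensor.
    """
    kept = []
    for source, target in pairs:
        # Count words on each side by splitting on whitespace.
        if len(source.split()) <= max_len and len(target.split()) <= max_len:
            kept.append((source, target))
    return kept


# Toy example with a very small limit so the effect is visible.
pairs = [
    ("a short sentence", "une phrase courte"),
    ("this one is much too long for the limit", "celle-ci est trop longue"),
]
filtered = filter_by_length(pairs, max_len=4)
```

With `max_len=4`, only the first pair survives, because the second source sentence has more than four words. A filter like this reduces the number of examples seen in training without changing the files on disk.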