Tensor2tensor: data size of enfr_wmt32k dataset

Created on 9 Jan 2018 · 3 Comments · Source: tensorflow/tensor2tensor

Hi,

I used t2t-data_gen to download and process the enfr_wmt32k dataset, but the resulting dataset is about 40M, not 36M. Does this mean I still need to preprocess it further, for example by filtering the data? If so, how should I do that? Thanks.

question

All 3 comments

Hey, that's an excellent question. I'd be happy to hear the answers from the experts.

For my part, I can only mention that some time ago I found a reply to this question in the FairSeq issue queue: https://github.com/facebookresearch/fairseq/issues/59

Hope this helps.

@gsoul Thanks. I think that with the fairseq preprocessing the dataset would be 34M rather than 36M, so that doesn't match either. I think we still need to wait for the experts' answers.

We don't do any further pre-processing in our experiments. Note that things like "remove sentences longer than 175 words" may happen during training if you set max_len to that value. But the results, e.g. in Attention is All You Need (https://arxiv.org/abs/1706.03762), are without any other pre-processing.
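
To make the length-filtering idea above concrete, here is a minimal sketch in plain Python. It is not tensor2tensor's own code: the function name `filter_by_length` is hypothetical, and counting whitespace-separated words on both sides of a pair is an assumption made for illustration (the library may measure length differently).

```python
def filter_by_length(pairs, max_len=175):
    """Keep only sentence pairs where both sides have at most max_len words.

    Assumption: length is counted in whitespace-separated words. This is an
    illustrative stand-in, not tensor2tensor's actual filtering logic.
    """
    kept = []
    for source, target in pairs:
        if len(source.split()) <= max_len and len(target.split()) <= max_len:
            kept.append((source, target))
    return kept


# Toy data: the second pair has a 200-word source side and is dropped.
example_pairs = [
    ("the cat sat", "le chat s'est assis"),
    (" ".join(["word"] * 200), "mot"),
]
filtered_pairs = filter_by_length(example_pairs, max_len=175)
```

A filter like this would shrink the number of sentence pairs, which is one way two copies of the same corpus can end up with different sizes.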
