Tensor2tensor: data size of enfr_wmt32k dataset

Created on 9 Jan 2018 · 3 Comments · Source: tensorflow/tensor2tensor

Hi,

I used t2t-data_gen to download and process the enfr_wmt32k dataset, but the resulting dataset is about 40M, not 36M. Does this mean I still need to do further preprocessing, such as filtering the data? If so, how should I do it? Thanks.

question

apeterswu

Most helpful comment

We don't do any further pre-processing in our experiments. Note that things like "remove sentences longer than 175 words" may happen during training if you set max_len to that value. But the results, e.g. in Attention Is All You Need (https://arxiv.org/abs/1706.03762), were obtained without any other pre-processing.
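
To illustrate the kind of length filter mentioned above, here is a minimal sketch in plain Python. It is not tensor2tensor's code: the function name `filter_by_length` and the whitespace-based word count are assumptions made for the example, and the real training pipeline may measure length differently (for example in subword tokens).

```python
def filter_by_length(pairs, max_len=175):
    """Keep only (source, target) pairs where both sides have at most max_len words.

    Hypothetical helper for illustration; words are counted by splitting on whitespace.
    """
    kept = []
    for src, tgt in pairs:
        if len(src.split()) <= max_len and len(tgt.split()) <= max_len:
            kept.append((src, tgt))
    return kept


# Toy data: one short pair and one pair whose source side has 200 words.
pairs = [
    ("a short sentence", "une phrase courte"),
    (" ".join(["word"] * 200), "une phrase courte"),
]
filtered = filter_by_length(pairs, max_len=175)
print(len(pairs), len(filtered))
```

A filter like this changes how many examples are actually seen during training without changing the size of the generated dataset on disk, which is one way the two counts can differ.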

lukaszkaiser on 10 Feb 2018

👍3

All 3 comments

Hey, that's an excellent question. I'd be happy to hear answers from the experts.

For my part, I can only mention that some time ago I found a reply to this question in the FairSeq issue queue: https://github.com/facebookresearch/fairseq/issues/59

Hope this helps.

gsoul on 9 Jan 2018

@gsoul Thanks. I think that with the fairseq preprocessing the dataset would be 34M rather than 36M, so it doesn't match either. I think we still need to wait for the experts' answers.

apeterswu on 10 Jan 2018

lukaszkaiser on 10 Feb 2018

👍3
