Tensor2tensor: can not reproduce the result of wmt_enfr32k

Created on 19 Jan 2018 · 6 Comments · Source: tensorflow/tensor2tensor

Hi,

I have tried several times to train the wmt_enfr32k problem with the transformer_big and transformer_enfr_big settings, but I only get a BLEU score of 32.x, which is far below the 41.x reported in the paper.
I am confused by this result and can't figure out the cause. It is very hard to reproduce the wmt_enfr32k result. I am using t2t v1.2.9.
For evaluation, I feed the raw en-fr.en test file to the decoder and score the output against the raw en-fr.fr file with multi-bleu.perl. Is there anything wrong with this?
Could anyone give a detailed description of how to reproduce the wmt_enfr32k result: the hparams setting, the batch size, and so on? I am struggling to reproduce the enfr result.
@lukaszkaiser Could you please help a little?

Much appreciated.

question

apeterswu

👍7

Most helpful comment

@lukaszkaiser I have run into the same problem as @apeterswu.

BTW, I want to confirm how we should set the hyperparameters (e.g., hparam_set, learning rate, algorithm, batch size, dropout). Also, do we need to filter out some of the training data? In your paper you mention using 36M sentences, but WMT14 seems to have 40M bilingual sentence pairs.

Thanks a lot in advance.

xyc1207 on 19 Jan 2018

👍3

All 6 comments

Hi,

If you are using the latest version of the code, your low scores may be due to a bug.

I have also hit a problem with the new version that I did not have with the old one.

See #525.

prajdabre on 19 Jan 2018

@prajdabre Sorry, I forgot to mention the version. I am using v1.2.9, which gives reasonable results when training on other datasets. But wmt_enfr32k is hard to reproduce.

apeterswu on 19 Jan 2018

👍1

@lukaszkaiser @rsepassi I have run into the same problem described by @apeterswu: wmt_enfr32k is hard to reproduce. I previously reproduced wmt_ende32k successfully, following the tips that the raw (untokenized) test data should be fed to the decoder and that the data should be tokenized when calculating BLEU. Even with these tips, I still get a BLEU score far below the 41.x reported in the paper.

Could you help check this?
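
To illustrate why the tokenization tip matters, here is a minimal Python sketch of corpus-level BLEU with a single reference. It is not the code used for the paper and not a replacement for multi-bleu.perl. The names `simple_tokenize` and `corpus_bleu` are hypothetical, and the regex tokenizer is only a rough stand-in for a real tokenizer such as the Moses one. As far as I know, multi-bleu.perl itself only splits on whitespace, so it expects input that is already tokenized.

```python
import math
import re
from collections import Counter


def simple_tokenize(line):
    """Hypothetical stand-in for a real tokenizer: separate punctuation from words."""
    return re.findall(r"\w+|[^\w\s]", line, flags=re.UNICODE)


def ngram_counts(tokens, n):
    """Count all n-grams of order n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hypotheses, references, max_n=4, tokenize=simple_tokenize):
    """Corpus-level BLEU in percent, one reference per hypothesis."""
    matches = [0] * max_n
    totals = [0] * max_n
    hyp_len = ref_len = 0
    for hyp_line, ref_line in zip(hypotheses, references):
        hyp, ref = tokenize(hyp_line), tokenize(ref_line)
        hyp_len += len(hyp)
        ref_len += len(ref)
        for n in range(1, max_n + 1):
            hyp_ngrams = ngram_counts(hyp, n)
            ref_ngrams = ngram_counts(ref, n)
            # Counter intersection keeps the minimum count, i.e. clipped matches.
            matches[n - 1] += sum((hyp_ngrams & ref_ngrams).values())
            totals[n - 1] += sum(hyp_ngrams.values())
    if hyp_len == 0 or min(matches) == 0:
        return 0.0
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    brevity_penalty = min(1.0, math.exp(1.0 - ref_len / hyp_len))
    return 100.0 * brevity_penalty * math.exp(log_precision)


# Toy data: the hypothesis is raw decoder-style output, the reference is tokenized.
hyp = ["Le chat est sur le tapis."]
ref = ["Le chat est sur le tapis ."]
tokenized_score = corpus_bleu(hyp, ref)
whitespace_score = corpus_bleu(hyp, ref, tokenize=str.split)
print(tokenized_score, whitespace_score)
```

On this toy pair the score drops when the two sides are only split on whitespace, because "tapis." no longer matches "tapis" followed by ".". Scoring raw text on both sides, or mixing raw and tokenized text, could lower BLEU in a similar way, though this sketch alone does not show that tokenization explains the 32.x versus 41.x gap.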