Models: Cannot Replicate Transformer Base BLEU Scores

Created on 12 Dec 2018  ·  12 Comments  ·  Source: tensorflow/models

System information

  • What is the top-level directory of the model you are using: /models/official/transformer
  • Have I written custom code : No
  • OS Platform and Distribution :Ubuntu 16.04.5 LTS
  • TensorFlow installed from (source or binary): Binary
  • TensorFlow version : v1.12.0-0-ga6d8ffae09 1.12.0
  • CUDA/cuDNN version: release 9.0, V9.0.176
  • GPU model and memory: Tesla K40m/12GB
  • Exact command to reproduce: Official Instructions
  • Bazel version: N/A

Describe the problem

I have been trying to replicate the Transformer Base model in models/official/transformer. I followed the official instructions but ran into two problems:
1- The generated vocabulary was larger than the size defined in /models/official/transformer/model/model_params.py, so the program would not run. I fixed this by changing the value in model_params.py to the actual vocabulary size of 33945.
2- The second problem, and the one I could not solve, is that the BLEU scores after 10 epochs are not consistent with what is reported in models/official/transformer. Running compute_bleu.py as instructed gives a case-insensitive BLEU score of 26.04, far below the "promised" 27.7 BLEU for the Base Transformer.

I have trained 3 models. The BLEU scores fluctuate between runs, but only minimally: the largest difference is 0.1 BLEU.
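The vocabulary mismatch in point 1 can be checked before editing model_params.py. The following is a minimal sketch, assuming the generated vocabulary file is plain text with one subtoken per line. The helper names and the demo file are hypothetical and not taken from the repository.

```python
import os
import tempfile


def count_vocab_entries(vocab_path):
    """Count non-empty lines; assumes one subtoken per line."""
    with open(vocab_path, encoding="utf-8") as f:
        return sum(1 for line in f if line.rstrip("\n"))


def check_vocab_size(vocab_path, configured_size):
    """Return (actual_size, matches) for a configured vocabulary size."""
    actual = count_vocab_entries(vocab_path)
    return actual, actual == configured_size


# Demo on a tiny throwaway file standing in for the real vocabulary file.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo_vocab.txt")
    with open(demo_path, "w", encoding="utf-8") as f:
        f.write("'<pad>'\n'<EOS>'\n'the_'\n")
    demo_actual, demo_matches = check_vocab_size(demo_path, configured_size=4)

print(demo_actual, demo_matches)
```

Pointing `check_vocab_size` at the real vocabulary file with the configured value would show whether the two agree (33945 was the actual size in this report).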

Source code / logs

python compute_bleu.py --translation=translation.en --reference=test_data/newstest2014.de
I1212 11:03:35.484694 140343540246272 tf_logging.py:115] Case-insensitive results: 26.038009
I1212 11:03:40.145630 140343540246272 tf_logging.py:115] Case-sensitive results: 25.506699
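To compare several runs, the scores can be pulled out of log lines like the two above. This is only a sketch: the regular expression is an assumption based on those two lines, and 27.7 is the reported Base figure quoted in the post.

```python
import re

# The two log lines from the report, copied verbatim.
LOG_LINES = [
    "I1212 11:03:35.484694 140343540246272 tf_logging.py:115] Case-insensitive results: 26.038009",
    "I1212 11:03:40.145630 140343540246272 tf_logging.py:115] Case-sensitive results: 25.506699",
]

# Assumed pattern: "<Case-insensitive|Case-sensitive> results: <float>".
SCORE_RE = re.compile(r"(Case-(?:in)?sensitive) results: ([0-9.]+)")


def parse_bleu(lines):
    """Map each score label found in the log lines to its float value."""
    scores = {}
    for line in lines:
        match = SCORE_RE.search(line)
        if match:
            scores[match.group(1)] = float(match.group(2))
    return scores


scores = parse_bleu(LOG_LINES)
REPORTED_BASE_BLEU = 27.7  # figure quoted in the post
gap = REPORTED_BASE_BLEU - scores["Case-insensitive"]
print(scores, round(gap, 2))
```

For this run the case-insensitive score falls roughly 1.66 BLEU short of the reported figure.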

bug

Most helpful comment

I'm having the same problem too.
I've tried TF version 1.10 and 1.12, with tensorflow models repo branch 1.10 and master.
Here's my environment.

What is the top-level directory of the model you are using: /models/official/transformer
Have I written custom code : No
OS Platform and Distribution :Ubuntu 16.04.5 LTS
TensorFlow installed from (source or binary): Source
TensorFlow version : 'v1.12.0-0-ga6d8ffa' 1.12.0
CUDA/cuDNN version: release 9.2, 7.2.1
GPU model and memory: 1080 Ti, 11GB
Exact command to reproduce: Official Instructions
Bazel version: 0.19.2

I've also noticed that the downloaded data size doesn't match the official instructions.
The raw files are 7.8GB (the official instructions say 8.4GB), and the TFRecord files are 689MB (the instructions say 722MB). The vocabulary size needed to be changed too (to 33945), as the OP mentioned.
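One thing worth ruling out is a units difference. The sketch below converts the documented sizes from decimal units (powers of 1000) to binary units (powers of 1024), and includes a hypothetical helper for totalling a directory on disk. Whether the instructions and the measuring tool actually used different units is an assumption, not something confirmed in this thread.

```python
import os


def decimal_to_binary_units(value, power):
    """Convert decimal units to binary units, e.g. GB to GiB with power=3."""
    return value * (1000 ** power) / (1024 ** power)


def directory_size_bytes(root):
    """Sum the sizes of all regular files under root (hypothetical helper)."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.isfile(path):
                total += os.path.getsize(path)
    return total


raw_gib = decimal_to_binary_units(8.4, 3)       # documented raw size, GB -> GiB
tfrecord_mib = decimal_to_binary_units(722, 2)  # documented TFRecord size, MB -> MiB
print(round(raw_gib, 2), round(tfrecord_mib, 1))
```

The conversion gives about 7.82 GiB and 688.6 MiB, which is close to the observed 7.8GB and 689MB. If that is the cause, the downloaded data may be complete after all.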

It also seems like the base model quickly overfits, around the 6th epoch.
At epoch 5, I'm getting 24.66 case-insensitive BLEU, but at epoch 10 I get 22.85.
(#5573)
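If BLEU is evaluated per epoch, the best checkpoint can be picked by score instead of taking the last one. This is a minimal sketch that uses only the two data points reported above. The function name is hypothetical, and it does not explain why the score degrades.

```python
def best_epoch(bleu_by_epoch):
    """Return the epoch with the highest BLEU score."""
    return max(bleu_by_epoch, key=bleu_by_epoch.get)


# Case-insensitive BLEU per epoch, as reported in this comment.
bleu_by_epoch = {5: 24.66, 10: 22.85}

best = best_epoch(bleu_by_epoch)
final_epoch = max(bleu_by_epoch)
drop = bleu_by_epoch[best] - bleu_by_epoch[final_epoch]
print(best, round(drop, 2))
```

With these numbers the epoch 5 checkpoint is about 1.81 BLEU ahead of the epoch 10 checkpoint.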

Any advice on where to look, or a working combination of TF version and models repo branch would be really appreciated.

image

All 12 comments

Thank you for your post. We noticed you have not filled out the following field in the issue template. Could you update them if they are relevant in your case, or leave them as N/A? Thanks.
Bazel version
GPU model and memory

Is there any update on this issue? I am having the same problem:
Case-insensitive results: 25.97
Case-sensitive results: 25.42

I'm still running into this issue with the latest release, and I'm seeing similar BLEU scores and plots to those mentioned above. Is there any update on this?

Any updates on this issue? I am running into the same problem with branch r1.12.0.

Having the same issue with "latest stable release" v1.11
Case-insensitive results: 23.499498
Case-sensitive results: 23.019947

Any update from the previous commenters?

I gave up on it. From what I can tell, the scores they report are not reproducible!

I gave up as well and switched to tensor2tensor.

I used the code from mlperf. That matched pretty well.

Do you mean that the mlperf code also can't replicate the BLEU score and runs into the overfitting problem too?

I came across the same problem, although I am not using the dataset from the example. I use my own dataset with both tensor2tensor and the tensorflow/models code.

Tensor2tensor produces much better results (BLEU or loss), while tensorflow/models runs into the overfitting problem mentioned above.

The mlperf code did replicate the BLEU score and did not have the overfitting problem.
