Tensor2tensor: Not getting the claimed performance on English-to-German translation task.

Created on 10 Jul 2017 · 11 Comments · Source: tensorflow/tensor2tensor

After following the walkthrough example for training the English-to-German translation task (the wmt_ende_tokens_32k problem), I am getting a quite low score of 14.20 BLEU.

I trained the model for 250k steps on a single Nvidia TITAN X GPU (12GB) with the transformer_base_single_gpu hparams and the default batch_size.

I got the following results at the end of training.

INFO:tensorflow:Validation (step 250000): loss = 1.33851, metrics-wmt_ende_tokens_32k/accuracy = 0.698445, metrics/neg_log_perplexity = -1.52589, metrics-wmt_ende_tokens_32k/approx_bleu_score = 0.399907, metrics-wmt_ende_tokens_32k/accuracy_top5 = 0.870514, metrics/accuracy = 0.698445, metrics-wmt_ende_tokens_32k/neg_log_perplexity = -1.52589, metrics/approx_bleu_score = 0.399907, metrics/accuracy_per_sequence = 0.0, metrics/accuracy_top5 = 0.870514, metrics-wmt_ende_tokens_32k/accuracy_per_sequence = 0.0, global_step = 249505

I decoded the newstest2013.en file with the model and ran get_ende_bleu.sh on the generated .decodes file to calculate BLEU.
Before running avg_checkpoints.py, I got a BLEU score of 13.98.
After averaging the last 20 checkpoints, I got the following result:
BLEU = 14.20, 39.9/19.1/10.1/5.3 (BP=1.000, ratio=1.138, hyp_len=63189, ref_len=55538)
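
The score line above can be sanity-checked by hand: BLEU is the brevity penalty (BP) multiplied by the geometric mean of the four n-gram precisions. The following is a minimal sketch that uses only the numbers printed in that line; the function name is illustrative and not part of any tool mentioned in this thread.

```python
import math

def bleu_from_precisions(precisions, hyp_len, ref_len):
    """Recombine n-gram precisions (in percent) into a corpus-level BLEU score."""
    # Brevity penalty: 1 when the hypothesis is at least as long as the reference.
    bp = 1.0 if hyp_len >= ref_len else math.exp(1.0 - ref_len / hyp_len)
    # Geometric mean of the precisions, computed in log space.
    log_mean = sum(math.log(p) for p in precisions) / len(precisions)
    return bp * math.exp(log_mean)

# Numbers copied from the score line above.
score = bleu_from_precisions([39.9, 19.1, 10.1, 5.3], hyp_len=63189, ref_len=55538)
ratio = 63189 / 55538
print(f"BLEU ~ {score:.2f}, ratio = {ratio:.3f}")
```

Because the printed precisions are rounded to one decimal, the recomputed value can differ from the reported 14.20 in the last digit. The ratio above 1 means the hypothesis is longer than the reference, which is why BP is 1.000.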

I don't know where the problem lies. Tensor2tensor version 1.0.11 was used for training.

All 11 comments

I think in the "single_gpu" setting you need to train for more steps. Take into account that each step is 8x smaller in this setting than in an 8-gpu setting. Could you continue training for another 250K steps and report back? If you have tensorboard, then pasting the metrics would also be helpful. We'll get to the bottom of this, thanks for reporting!
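
As a rough worked example of the "8x smaller" remark: if each GPU processes the same batch size in both settings (an assumption, not something stated in this thread), single-GPU steps can be converted into a comparable number of 8-GPU steps by dividing by the number of GPUs. The helper name below is hypothetical.

```python
def equivalent_multi_gpu_steps(single_gpu_steps, num_gpus=8):
    """Steps on num_gpus GPUs that see the same amount of data.

    Assumes the per-GPU batch size is identical in both settings.
    """
    return single_gpu_steps / num_gpus

for steps in (250_000, 500_000):
    print(steps, "single-GPU steps ~", equivalent_multi_gpu_steps(steps), "8-GPU steps")
```

Under that assumption, 250k single-GPU steps correspond to roughly 31k steps in the 8-GPU setting, which is the motivation for training longer.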

@lukaszkaiser Not much improvement after 500k steps.

INFO:tensorflow:Saving dict for global step 500000: global_step = 500000, loss = 1.3043, metrics-wmt_ende_tokens_32k/accuracy = 0.703695, metrics-wmt_ende_tokens_32k/accuracy_per_sequence = 0.0, metrics-wmt_ende_tokens_32k/accuracy_top5 = 0.876566, metrics-wmt_ende_tokens_32k/approx_bleu_score = 0.404646, metrics-wmt_ende_tokens_32k/neg_log_perplexity = -1.48773, metrics/accuracy = 0.703695, metrics/accuracy_per_sequence = 0.0, metrics/accuracy_top5 = 0.876566, metrics/approx_bleu_score = 0.404646, metrics/neg_log_perplexity = -1.48773

BLEU without and with averaging: 14.15 and 14.47.

First, note there may be small differences in BLEU between newstest2014 and newstest2013. The former is used as the final test set (Table 2), the latter as dev set (Table 3) in the paper. But these differences should be smaller than one point.

I used transformer_base_single_gpu, batch_size=3072 and after 350k steps got approx_bleu_score(wmt13)=0.378, BLEU(wmt13)=24.23, BLEU(wmt14)=24.27, without averaging, evaluating with the official case-insensitive BLEU script (the same one as used in http://matrix.statmt.org/).

@ujjwal9895: your BLEU=14 is very low (while approx_bleu is high). I guess that you scored the whole file xy.transformer.transformer_base_single_gpu.beam4.alpha0.6.decodes, which contains both the translations and the source sentences (cf. #112). If so, add cut -f1.
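
For readers who prefer to do this step in a script, here is a minimal Python sketch equivalent to `cut -f1`. It assumes, as the `cut -f1` advice implies, that each line of the decodes file holds the translation in the first tab-separated field and the source sentence after it. The sample lines are made up for illustration.

```python
def extract_translations(lines):
    """Keep only the first tab-separated field of each line, like `cut -f1`."""
    return [line.rstrip("\n").split("\t", 1)[0] for line in lines]

# Hypothetical decodes lines: translation, a tab, then the source sentence.
sample = [
    "Das ist ein Test.\tThis is a test.\n",
    "Guten Morgen!\tGood morning!\n",
]
print(extract_translations(sample))
```

Lines without a tab are returned unchanged, so the function is safe to run on a file that already contains translations only.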

@martinpopel I was unaware that the decodes file contains both input and output. But even after removing the input part, I am still getting 14.47 BLEU on newstest2013. I am attaching the decodes file generated with the model trained for 500k steps. If possible, please check whether everything is correct. (Remove .txt from the end of the filename.)

newstest2013.en.transformer.transformer_base_single_gpu.beam4.alpha0.6.decodes.txt

I just ran the scorer on this set and got 25.6 BLEU. Did you try our get_ende_bleu script from utils?
https://github.com/tensorflow/tensor2tensor/blob/master/tensor2tensor/utils/get_ende_bleu.sh

I'll close for now as it looks like your model is quite good :). (In fact you could claim state-of-the-art with these decodes just a few months ago!)

As for explanation: I think you forgot that you need to tokenize the input and golden file before calculating BLEU. The script runs the tokenizer. Still, if you cannot reproduce it, please re-open this issue.
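
To illustrate why inconsistent tokenization depresses BLEU so strongly, here is a small self-contained sketch. It uses a crude regex tokenizer as a stand-in (it is not the Moses tokenizer) and computes only clipped unigram precision, the 1-gram component of BLEU. The sentence is a made-up example.

```python
import re
from collections import Counter

def simple_tokenize(text):
    """Crude stand-in for a real tokenizer: split punctuation off words."""
    return re.findall(r"\w+|[^\w\s]", text)

def unigram_precision(hyp_tokens, ref_tokens):
    """Clipped unigram precision, the 1-gram component of BLEU."""
    hyp_counts, ref_counts = Counter(hyp_tokens), Counter(ref_tokens)
    matched = sum(min(c, ref_counts[t]) for t, c in hyp_counts.items())
    return matched / len(hyp_tokens)

hyp = "Hallo, Welt!"
ref = "Hallo, Welt!"  # identical to the hypothesis on purpose

# Hypothesis tokenized, reference left raw (split on whitespace only).
mismatched = unigram_precision(simple_tokenize(hyp), ref.split())
# Both sides tokenized the same way.
consistent = unigram_precision(simple_tokenize(hyp), simple_tokenize(ref))
print(mismatched, consistent)
```

Even though the two strings are identical, the mismatched setup finds no matching tokens, because "Hallo," and "Hallo" are different tokens. Every word adjacent to punctuation is lost in this way, and the damage compounds for higher-order n-grams.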

Oh yes, I did not tokenize the gold_target file before computing BLEU. Thanks for the help. :)

The main issue (BLEU 14 vs 25) is solved, but still there are some differences in the BLEU scores.
I used the official BLEU script (which includes the tokenization, and is used e.g. at http://matrix.statmt.org and http://wmt.ufal.cz), both case-sensitive and case-insensitive BLEU.
See http://mt-compareval.ufal.cz/tasks/?experimentId=46

@ujjwal9895's output has BLEU=26.57, my own output (with averaging 20 checkpoints) has BLEU=25.15.
Both models were trained with the transformer_base_single_gpu model for 500k steps on a single GPU.
The only difference I am aware of is that I used batch_size=3072, while I expect @ujjwal9895 used batch_size=8192, which used to be the default for transformer_base_single_gpu until July 20.
This could mean that 500k is not enough to converge with batch_size=3072 (although the loss and approx_bleu curves in TensorBoard were quite flat in the last 100k steps).
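
A back-of-the-envelope comparison of the two runs, assuming that batch_size is the number of tokens processed per step (the thread does not state the unit, so treat this as an assumption) and ignoring padding:

```python
def tokens_seen(batch_size, steps):
    """Total tokens processed, assuming batch_size counts tokens per step."""
    return batch_size * steps

small = tokens_seen(3072, 500_000)
large = tokens_seen(8192, 500_000)
print(small, large, round(large / small, 2))
```

Under that assumption, the batch_size=8192 run processes about 2.67 times as much data in the same 500k steps, which would be consistent with the smaller-batch run not having converged yet.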

@lukaszkaiser: Can you please publish your WMT en-de decoded outputs (wmt13 transformer-base with reported BLEU 25.8 after 100k steps, and possibly also those of the transformer-big with 26.4)?

BTW: anyone can upload their translation outputs to http://mt-compareval.ufal.cz and inspect the differences (which sentences/phrases are translated better by one system relative to another system and to the reference translation).

I'm sorry, but two things still seem unclear to me:

  1. in what format should the English test file be fed into the decoder in the en-de task? should it be raw text, tokenized, or even BPE-ed?
  2. the decoder output seems to be untokenized, as also suggested by the conversation above. And this leads to the use of https://github.com/tensorflow/tensor2tensor/blob/master/tensor2tensor/utils/get_ende_bleu.sh instead of the Moses BLEU scorer to evaluate the BLEU score. For evaluation of translation of other languages, does this BLEU scorer that comes with tensor2tensor still apply?

in what format should the English test file be fed into the decoder in the en-de task?

Raw text (not tokenized), one sentence per line. That is in the same format as the raw text used for training.
The one-sentence-per-line property is to make it possible/easier to align it with the reference translation and compute the BLEU score.
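
Since scoring relies on line-by-line alignment, it can help to verify that the source, decoded and reference files have the same number of lines before computing BLEU. The helper below is a hypothetical sketch operating on lists of lines; the sample sentences are made up.

```python
def assert_line_aligned(**named_lines):
    """Raise ValueError unless all given line lists have the same length.

    Returns the common line count when everything is aligned.
    """
    counts = {name: len(lines) for name, lines in named_lines.items()}
    if len(set(counts.values())) > 1:
        raise ValueError(f"line counts differ: {counts}")
    return next(iter(counts.values()), 0)

source = ["This is a test.", "Good morning!"]
decoded = ["Das ist ein Test.", "Guten Morgen!"]
reference = ["Dies ist ein Test.", "Guten Morgen!"]
print(assert_line_aligned(source=source, decoded=decoded, reference=reference))
```

In practice the lists would come from reading each file with `open(path, encoding="utf-8").read().splitlines()`.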

the decoder output seems to be untokenized

Yes.

this leads to the use of https://github.com/tensorflow/tensor2tensor/blob/master/tensor2tensor/utils/get_ende_bleu.sh instead of the Moses BLEU scorer

It is not "instead": get_ende_bleu.sh itself uses the Moses tokenizer and the Moses scorer multi-bleu.perl.

For evaluation of translation of other languages, does this BLEU scorer that comes with tensor2tensor still apply?

You can use the Moses tokenizer.perl and multi-bleu.perl for languages which use spaces to separate words (so excluding e.g. Chinese).
The tokenization is not ideal, but at least replicable (with the same version of tokenizer.perl).
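
As a sketch of how the same two Moses scripts could be wired up for another language pair, the helper below only builds the command lines and does not run them. The script locations under `scripts/tokenizer/` and `scripts/generic/` are an assumption about the layout of a local Moses checkout, and the function name, directory and file names are hypothetical.

```python
from pathlib import PurePosixPath

def moses_bleu_commands(moses_dir, lang, reference_tok):
    """Build argument lists for tokenizing and scoring with Moses scripts.

    Both commands are meant to read their input from standard input:
    raw text for the tokenizer, tokenized hypotheses for the scorer.
    """
    scripts = PurePosixPath(moses_dir) / "scripts"  # assumed checkout layout
    tokenize = ["perl", str(scripts / "tokenizer" / "tokenizer.perl"), "-l", lang]
    score = ["perl", str(scripts / "generic" / "multi-bleu.perl"), reference_tok]
    return tokenize, score

tokenize_cmd, score_cmd = moses_bleu_commands("mosesdecoder", "fr", "newstest.tok.fr")
print(" ".join(tokenize_cmd))
print(" ".join(score_cmd))
```

The important point is that the hypothesis and the reference must both pass through the same tokenizer, with the same language setting and script version, before they reach multi-bleu.perl.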

@martinpopel Good to know. Thanks!
