Tensor2tensor: I trained with translate_enzh_wmt32k and BLEU is only 1.25. What's the reason? I'd appreciate any help.

Created on 21 Jun 2018 · 16 Comments · Source: tensorflow/tensor2tensor

Description

I trained a model with translate_enzh_wmt32k, but the BLEU score is only 1.25. What's the reason? I'd appreciate it if you could tell me. Thank you!

Environment information

OS: linux

Steps to reproduce:

Just follow the commands in the docs.

Something like this:
$ PROBLEM=translate_enzh_wmt32k
$ MODEL=transformer
$ HPARAMS=transformer_base_single_gpu
$ DATA_DIR=$HOME/t2t_data
$ TMP_DIR=/tmp/t2t_datagen
$ TRAIN_DIR=$HOME/t2t_train/$PROBLEM/$MODEL-$HPARAMS
$ T2T_USR_DIR=$HOME/t2t_usr_dir

$ t2t-trainer \
  --data_dir=$DATA_DIR \
  --problem=$PROBLEM \
  --model=$MODEL \
  --hparams_set=$HPARAMS \
  --output_dir=$TRAIN_DIR

Source

hpulfc

Most helpful comment

The solution is pretty simple and straightforward.

1. Extract all unique characters in the corpus, including symbols.
2. Use the characters as the dictionary and just use the text token encoder.
3. Train the model using the transformer_base configuration.

It took us 1.5 million steps to get a BLEU score of 22~23.

Our belief is that for unigram (character-level) languages like Chinese and Japanese, word segmentation is no longer necessary, because a deep neural network will learn to combine characters into words better than any word segmentation library.
The translation model suggests my guess might be right.

yynil on 3 Nov 2018

👍4

All 16 comments

Check your BLEU script.
The transformer_base default parameter set can easily reach 40+ test BLEU(4) with 8000 tokens per training batch on the LDC zh-en datasets. There is no problem with your dataset or parameters, so why not have a look at your dev results?

vergilus on 21 Jun 2018

Yes, @vergilus, I found that I was using the wrong vocab file. I will try again and post the result here. Thank you!

hpulfc on 21 Jun 2018

This is the result of my test. The final BLEU is only 0.067. Is that normal?

Saving dict for global step 250001: 
global_step = 250001, loss = 5.2350197, 
metrics-translate_enzh_wmt32k/accuracy = 0.31111678,
metrics-translate_enzh_wmt32k/accuracy_per_sequence = 0.006454816,
metrics-translate_enzh_wmt32k/accuracy_top5 = 0.46391085, 
metrics-translate_enzh_wmt32k/approx_bleu_score = 0.06787159,
metrics-translate_enzh_wmt32k/neg_log_perplexity = -5.9543014, 
metrics-translate_enzh_wmt32k/rouge_2_fscore = 0.12633628, 
metrics-translate_enzh_wmt32k/rouge_L_fscore = 0.2820667

Do the default parameters need to be adjusted? @vergilus
The contents of the hparams file in the training folder are as follows:

{
"activation_dtype": "float32",
"attention_dropout": 0.1,
"attention_dropout_broadcast_dims": "",
"attention_key_channels": 0,
"attention_value_channels": 0,
"batch_size": 1024,
"clip_grad_norm": 0.0,
"compress_steps": 0,
"daisy_chain_variables": true,
"data_dir": "/home/kratos/t2t_data",
"dropout": 0.2,
"eval_drop_long_sequences": false,
"eval_run_autoregressive": false,
"eval_steps": 100,
"factored_logits": false,
"ffn_layer": "dense_relu_dense",
"filter_size": 2048,
"force_full_predict": false,
"grad_noise_scale": 0.0,
"hidden_size": 512,
"initializer": "uniform_unit_scaling",
"initializer_gain": 1.0,
"input_modalities": "default",
"kernel_height": 3,
"kernel_width": 1,
"label_smoothing": 0.1,
"layer_postprocess_sequence": "da",
"layer_prepostprocess_dropout": 0.1,
"layer_prepostprocess_dropout_broadcast_dims": "",
"layer_preprocess_sequence": "n",
"learning_rate": 0.2,
"learning_rate_constant": 2.0,
"learning_rate_cosine_cycle_steps": 250000,
"learning_rate_decay_rate": 1.0,
"learning_rate_decay_scheme": "noam",
"learning_rate_decay_staircase": false,
"learning_rate_decay_steps": 5000,
"learning_rate_minimum": null,
"learning_rate_schedule": "constant*linear_warmup*rsqrt_decay*rsqrt_hidden_size",
"learning_rate_warmup_steps": 16000,
"length_bucket_step": 1.1,
"max_input_seq_length": 0,
"max_length": 256,
"max_relative_position": 0,
"max_target_seq_length": 0,
"min_length": 0,
"min_length_bucket": 8,
"model_dir": "/home/kratos/t2t_train/translate_enzh_wmt32k/transformer-transformer_base_single_gpu",
"moe_hidden_sizes": "2048",
"moe_k": 2,
"moe_loss_coef": 0.01,
"moe_num_experts": 64,
"multiply_embedding_mode": "sqrt_depth",
"nbr_decoder_problems": 1,
"no_data_parallelism": false,
"norm_epsilon": 1e-06,
"norm_type": "layer",
"num_decoder_layers": 0,
"num_encoder_layers": 0,
"num_heads": 8,
"num_hidden_layers": 6,
"optimizer": "Adam",
"optimizer_adafactor_beta1": 0.0,
"optimizer_adafactor_beta2": 0.999,
"optimizer_adafactor_clipping_threshold": 1.0,
"optimizer_adafactor_decay_type": "pow",
"optimizer_adafactor_factored": true,
"optimizer_adafactor_memory_exponent": 0.8,
"optimizer_adafactor_multiply_by_parameter_scale": true,
"optimizer_adam_beta1": 0.9,
"optimizer_adam_beta2": 0.997,
"optimizer_adam_epsilon": 1e-09,
"optimizer_momentum_momentum": 0.9,
"optimizer_momentum_nesterov": false,
"parameter_attention_key_channels": 0,
"parameter_attention_value_channels": 0,
"pos": "timing",
"prepend_mode": "none",
"proximity_bias": false,
"relu_dropout": 0.1,
"relu_dropout_broadcast_dims": "",
"sampling_method": "argmax",
"sampling_temp": 1.0,
"scheduled_sampling_gold_mixin_prob": 0.5,
"scheduled_sampling_prob": 0.0,
"scheduled_sampling_warmup_steps": 50000,
"self_attention_type": "dot_product",
"shared_embedding_and_softmax_weights": true,
"split_to_length": 0,
"summarize_grads": false,
"summarize_vars": false,
"symbol_dropout": 0.0,
"symbol_modality_num_shards": 16,
"symbol_modality_skip_top": false,
"target_modality": "default",
"train_steps": 250000,
"use_fixed_batch_size": false,
"use_pad_remover": true,
"weight_decay": 0.0,
"weight_dtype": "float32",
"weight_noise": 0.0
}

hpulfc on 22 Jun 2018

@hpulfc
Your batch_size is too small; try 12800. The batch size here is counted in tokens, not sentences (check the docs for "use_fixed_batch_size" and "batch_size").
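
As a rough illustration of why a token-based batch_size of 1024 is small, the sketch below estimates how many sentences fit in one batch. The average of 30 tokens per sentence is an assumed figure for illustration only, not a number from this thread, and `approx_sentences_per_batch` is a hypothetical helper, not part of tensor2tensor.

```python
def approx_sentences_per_batch(batch_size_tokens, avg_tokens_per_sentence):
    """Rough number of sentences that fit in a batch whose size is counted in tokens."""
    return batch_size_tokens // avg_tokens_per_sentence


# Assumed average number of subword tokens per sentence (illustrative only).
AVG_TOKENS = 30

for batch_size in (1024, 12800):
    n = approx_sentences_per_batch(batch_size, AVG_TOKENS)
    print(f"batch_size={batch_size} tokens -> about {n} sentences per batch")
```

Under that assumption, 1024 tokens is only a few dozen sentences per step, while 12800 tokens is several hundred.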

vergilus on 1 Jul 2018

@hpulfc
I ran into the same issue. Have you found a solution?

xuekun90 on 15 Jul 2018

@xuekun90
I ran into the same issue. Have you found a solution? :)

gushuheng on 19 Jul 2018

I vaguely remember, from a long time ago, that the enzh WMT dataset is a toy dataset and extremely small. To obtain the full one, you have to download it from other sources. If that is still the case, your low BLEU is probably due to insufficient data.

colmantse on 19 Jul 2018

@colmantse Yeah, I will give it a try and then post the result here.
@xuekun90 @gushuheng Maybe try using the bigger dataset.

hpulfc on 20 Jul 2018

👍1

@hpulfc, I ran into a similar problem when training the EN-ZH problem. The BLEU was only 6.x with the default sample training set (which is quite small). After I manually downloaded additional training data from other sources, the BLEU easily increased to about 18+ (it is still running and has reached about 20 at 200K steps with two 1080 GPUs).
Also, your batch size seems too small; try increasing it to 4096 (or use the transformer_base defaults).

SimonGu2018 on 21 Jul 2018

👍1

Finally: you should apply word segmentation to the Chinese text first; then it will be suitable for the SubTokenEncoder.

hpulfc on 11 Sep 2018

❤1

I don't think word segmentation is good for Chinese. Try using single characters as tokens; the translation results may surprise you and help you discover a lot of unseen words.

yynil on 11 Sep 2018

@SimonGu2018 Is the BLEU score you reported coming from the training metric approx_bleu_score, or did you use a separate BLEU script? I too manually downloaded the 32k dataset, but the approx_bleu_score is still very poor: 0.14150444 at 100K steps. It's hard to imagine the BLEU score would jump to 20 at 200K, so I'm wondering if I am comparing against the wrong number...

@hpulfc Did you ever achieve the desired target BLEU score (which is..?)? Are you able to do so with the current t2t implementation? We would really appreciate it if you could share the steps you took to correct the reported issue, as we are facing a similar one.
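
One thing worth checking when comparing these numbers, offered as an assumption rather than a confirmed diagnosis: the approx_bleu_score values logged in this thread (0.067, 0.14) look like fractions on a 0 to 1 scale, while figures such as "18+" and "20" look like the 0 to 100 scale that standalone BLEU scripts usually print. The sketch below only converts between the two scales. It does not account for tokenization differences (subword vs. word or character), which can also make the two metrics incomparable. `to_percent_scale` is a hypothetical helper.

```python
def to_percent_scale(fraction_score):
    """Convert a 0-1 score to the 0-100 scale (assumes the input really is a fraction)."""
    return 100.0 * fraction_score


# Values copied from the training logs quoted in this thread.
logged = [
    ("hpulfc, step 250001", 0.06787159),
    ("connectdotz, 100K steps", 0.14150444),
]

for name, score in logged:
    print(f"{name}: approx_bleu_score={score} -> {to_percent_scale(score):.2f} on a 0-100 scale")
```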

connectdotz on 3 Nov 2018

I used characters as the dictionary instead of word pieces. The BLEU I got is around 22.

yynil on 3 Nov 2018

@yynil Thanks. Did you just use a simple one-hot encoding, then? Is there an example (or a parameter?) showing how to plug in our own tokenizer? Would you care to share more detail? Also, if the current implementation cannot reliably produce a reasonable result(?), would you consider contributing your solution?

connectdotz on 3 Nov 2018

The solution is pretty simple and straightforward.

1. Extract all unique characters in the corpus, including symbols.
2. Use the characters as the dictionary and just use the text token encoder.
3. Train the model using the transformer_base configuration.

It took us 1.5 million steps to get a BLEU score of 22~23.
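
The first two steps can be sketched in plain Python as follows. This is a minimal illustration and not yynil's actual code: `build_char_vocab`, `encode`, the reserved token names, and the two-line corpus are all hypothetical, and wiring the resulting vocabulary into a tensor2tensor problem is not shown.

```python
from collections import Counter


def build_char_vocab(lines, reserved=("<pad>", "<EOS>", "<UNK>")):
    """Collect every unique character (symbols included) into a vocabulary list."""
    counts = Counter(ch for line in lines for ch in line.rstrip("\n"))
    # Most frequent characters first; ties broken by code point for determinism.
    chars = sorted(counts, key=lambda c: (-counts[c], c))
    return list(reserved) + chars


def encode(text, vocab_index, unk="<UNK>"):
    """Map each character to its vocabulary id, falling back to the unknown id."""
    return [vocab_index.get(ch, vocab_index[unk]) for ch in text]


# Tiny stand-in corpus; a real run would stream the training files instead.
corpus = ["我爱你。", "你好，世界！"]
vocab = build_char_vocab(corpus)
vocab_index = {token: i for i, token in enumerate(vocab)}

print(len(vocab), "entries")
print(encode("你好", vocab_index))
```

Writing `vocab` out one entry per line gives a plain-text vocabulary file, which is the usual input format for a token-level encoder.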

yynil on 3 Nov 2018

👍4

Segmentation features give better results in Named Entity Recognition (NER) work, even when the model is based on Google's BERT. @yynil

yanwii on 8 Mar 2019
