Tensor2tensor: I trained with translate_enzh_wmt32k and BLEU is only 1.25. What's the reason?

Created on 21 Jun 2018  ·  16 Comments  ·  Source: tensorflow/tensor2tensor

Description

I trained a model with translate_enzh_wmt32k, but the BLEU is only 1.25. What's the reason? I would appreciate it if you could tell me. Thank you!

Environment information

OS: linux

Steps to reproduce:

Just follow the commands from the docs, something like this:
$ PROBLEM=translate_enzh_wmt32k
$ MODEL=transformer
$ HPARAMS=transformer_base_single_gpu
$ DATA_DIR=$HOME/t2t_data
$ TMP_DIR=/tmp/t2t_datagen
$ TRAIN_DIR=$HOME/t2t_train/$PROBLEM/$MODEL-$HPARAMS
$ T2T_USR_DIR=$HOME/t2t_usr_dir

$ t2t-trainer \
  --data_dir=$DATA_DIR \
  --problem=$PROBLEM \
  --model=$MODEL \
  --hparams_set=$HPARAMS \
  --output_dir=$TRAIN_DIR

All 16 comments

Check your BLEU script.
The transformer_base default parameter set can easily reach 40+ test BLEU-4 with 8000 tokens per training batch on the LDC zh-en datasets. There is no problem with your dataset or parameters, so why not have a look at your dev results?
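
For anyone who wants to sanity-check a BLEU script, here is a minimal sketch of corpus-level BLEU-4 computed over characters. It is not the scorer used by tensor2tensor or by anyone in this thread; the function names (`char_tokens`, `ngram_counts`, `corpus_bleu`) are made up for illustration, it assumes one reference per hypothesis, and it applies no smoothing. The main point is that the hypothesis and the reference must be tokenized the same way, otherwise the score collapses.

```python
import math
from collections import Counter


def char_tokens(sentence):
    """Split a sentence into characters, dropping whitespace (an illustrative choice)."""
    return [ch for ch in sentence if not ch.isspace()]


def ngram_counts(tokens, n):
    """Count all n-grams of order n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(references, hypotheses, max_n=4):
    """Unsmoothed corpus BLEU on a 0-100 scale, one reference per hypothesis."""
    matches = [0] * max_n
    totals = [0] * max_n
    ref_len = hyp_len = 0
    for ref, hyp in zip(references, hypotheses):
        ref_len += len(ref)
        hyp_len += len(hyp)
        for n in range(1, max_n + 1):
            ref_counts = ngram_counts(ref, n)
            hyp_counts = ngram_counts(hyp, n)
            # Counter intersection keeps the minimum count, i.e. clipped matches.
            matches[n - 1] += sum((hyp_counts & ref_counts).values())
            totals[n - 1] += sum(hyp_counts.values())
    if min(matches) == 0:
        # Without smoothing, a missing n-gram order gives a score of zero.
        return 0.0
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    # Brevity penalty for hypotheses shorter than the references.
    brevity = 1.0 if hyp_len > ref_len else math.exp(1.0 - ref_len / hyp_len)
    return 100.0 * brevity * math.exp(log_precision)


references = [char_tokens("我爱北京天安门")]
hypotheses = [char_tokens("我爱北京")]
score = corpus_bleu(references, hypotheses)
perfect = corpus_bleu(references, references)
print(round(score, 2), round(perfect, 2))
```

Note that this returns a value on the 0-100 scale, so check which scale your own script reports before comparing numbers.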

Yes, @vergilus, I found that I was using the wrong vocab file. I will try again and post the result here. Thank you!

This is the result of my test. The final BLEU is only 0.067. Is that normal?

Saving dict for global step 250001: 
global_step = 250001, loss = 5.2350197, 
metrics-translate_enzh_wmt32k/accuracy = 0.31111678,
metrics-translate_enzh_wmt32k/accuracy_per_sequence = 0.006454816,
metrics-translate_enzh_wmt32k/accuracy_top5 = 0.46391085, 
metrics-translate_enzh_wmt32k/approx_bleu_score = 0.06787159,
metrics-translate_enzh_wmt32k/neg_log_perplexity = -5.9543014, 
metrics-translate_enzh_wmt32k/rouge_2_fscore = 0.12633628, 
metrics-translate_enzh_wmt32k/rouge_L_fscore = 0.2820667

Do the default parameters need to be adjusted? @vergilus
The content of the hparams file in the trainer folder is as follows:

{
"activation_dtype": "float32",
"attention_dropout": 0.1,
"attention_dropout_broadcast_dims": "",
"attention_key_channels": 0,
"attention_value_channels": 0,
"batch_size": 1024,
"clip_grad_norm": 0.0,
"compress_steps": 0,
"daisy_chain_variables": true,
"data_dir": "/home/kratos/t2t_data",
"dropout": 0.2,
"eval_drop_long_sequences": false,
"eval_run_autoregressive": false,
"eval_steps": 100,
"factored_logits": false,
"ffn_layer": "dense_relu_dense",
"filter_size": 2048,
"force_full_predict": false,
"grad_noise_scale": 0.0,
"hidden_size": 512,
"initializer": "uniform_unit_scaling",
"initializer_gain": 1.0,
"input_modalities": "default",
"kernel_height": 3,
"kernel_width": 1,
"label_smoothing": 0.1,
"layer_postprocess_sequence": "da",
"layer_prepostprocess_dropout": 0.1,
"layer_prepostprocess_dropout_broadcast_dims": "",
"layer_preprocess_sequence": "n",
"learning_rate": 0.2,
"learning_rate_constant": 2.0,
"learning_rate_cosine_cycle_steps": 250000,
"learning_rate_decay_rate": 1.0,
"learning_rate_decay_scheme": "noam",
"learning_rate_decay_staircase": false,
"learning_rate_decay_steps": 5000,
"learning_rate_minimum": null,
"learning_rate_schedule": "constant*linear_warmup*rsqrt_decay*rsqrt_hidden_size",
"learning_rate_warmup_steps": 16000,
"length_bucket_step": 1.1,
"max_input_seq_length": 0,
"max_length": 256,
"max_relative_position": 0,
"max_target_seq_length": 0,
"min_length": 0,
"min_length_bucket": 8,
"model_dir": "/home/kratos/t2t_train/translate_enzh_wmt32k/transformer-transformer_base_single_gpu",
"moe_hidden_sizes": "2048",
"moe_k": 2,
"moe_loss_coef": 0.01,
"moe_num_experts": 64,
"multiply_embedding_mode": "sqrt_depth",
"nbr_decoder_problems": 1,
"no_data_parallelism": false,
"norm_epsilon": 1e-06,
"norm_type": "layer",
"num_decoder_layers": 0,
"num_encoder_layers": 0,
"num_heads": 8,
"num_hidden_layers": 6,
"optimizer": "Adam",
"optimizer_adafactor_beta1": 0.0,
"optimizer_adafactor_beta2": 0.999,
"optimizer_adafactor_clipping_threshold": 1.0,
"optimizer_adafactor_decay_type": "pow",
"optimizer_adafactor_factored": true,
"optimizer_adafactor_memory_exponent": 0.8,
"optimizer_adafactor_multiply_by_parameter_scale": true,
"optimizer_adam_beta1": 0.9,
"optimizer_adam_beta2": 0.997,
"optimizer_adam_epsilon": 1e-09,
"optimizer_momentum_momentum": 0.9,
"optimizer_momentum_nesterov": false,
"parameter_attention_key_channels": 0,
"parameter_attention_value_channels": 0,
"pos": "timing",
"prepend_mode": "none",
"proximity_bias": false,
"relu_dropout": 0.1,
"relu_dropout_broadcast_dims": "",
"sampling_method": "argmax",
"sampling_temp": 1.0,
"scheduled_sampling_gold_mixin_prob": 0.5,
"scheduled_sampling_prob": 0.0,
"scheduled_sampling_warmup_steps": 50000,
"self_attention_type": "dot_product",
"shared_embedding_and_softmax_weights": true,
"split_to_length": 0,
"summarize_grads": false,
"summarize_vars": false,
"symbol_dropout": 0.0,
"symbol_modality_num_shards": 16,
"symbol_modality_skip_top": false,
"target_modality": "default",
"train_steps": 250000,
"use_fixed_batch_size": false,
"use_pad_remover": true,
"weight_decay": 0.0,
"weight_dtype": "float32",
"weight_noise": 0.0
}

@hpulfc
batch_size is too small. Try 12800. Your batches are measured in tokens, not sentences (check the docs for "use_fixed_batch_size" and "batch_size").
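
To make the token-based batch size concrete, here is a rough back-of-the-envelope sketch. The average of 30 subword tokens per sentence is an assumption for illustration, not a measurement from this dataset, and `approx_sentences_per_batch` is a hypothetical helper.

```python
def approx_sentences_per_batch(batch_size_tokens, avg_tokens_per_sentence):
    """Estimate how many sentences fit in a batch whose size is counted in tokens."""
    return batch_size_tokens // avg_tokens_per_sentence


# Assumed average sentence length in subword tokens (illustrative only).
AVG_TOKENS = 30

for batch_size in (1024, 4096, 12800):
    print(batch_size, approx_sentences_per_batch(batch_size, AVG_TOKENS))
```

Under that assumption, batch_size=1024 holds only a few dozen sentences per step, which is why the larger values suggested in this thread make a noticeable difference.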

@hplfc
I met the same issue. Have you found a solution to it?

@xuekun90
I met the same issue. Have you found a solution to it? :)

I vaguely remember from long ago that the enzh WMT dataset is a toy dataset and extremely small. To obtain the full one, you have to download it from other sources. If that remains the case, your low BLEU is probably due to insufficient data.

@colmantse Yeah, I will give it a try and then post the result here.
@xuekun90 @gushuheng Maybe use the bigger dataset.

@hpulfc, I met a similar problem when training the EN-ZH problem. The BLEU was only 6.x when using the default sample training set (quite small). I manually downloaded additional training data from other sources, and the BLEU easily increased to about 18+ (it is still running and has reached about 20 at 200K steps with two 1080 GPUs).
Also, your batch size seems too small. Try increasing it to 4096 (or just use the default transformer_base).

Finally, you should apply word segmentation to the Chinese text; then it will be suitable for the SubTokenEncoder.

I don't think word segmentation is good for Chinese. Try using a character as a token. The translation results may surprise you, and it helps the model handle a lot of unseen words.

@SimonGu2018 Is the BLEU score you reported from the training metric approx_bleu_score, or did you use a separate BLEU script? I also downloaded the 32k dataset manually, but the approx_bleu_score is still very poor: 0.14150444 at 100K steps. It's hard to imagine the BLEU score jumping to 20 at 200K, so I wonder whether I am comparing against the wrong number...

@hpulfc Did you ever achieve the desired target BLEU score (which is..?)? Were you able to do so with the current t2t implementation? It would be much appreciated if you could share the steps you took to fix the reported issue, as we are facing a similar one.

I've used characters as the dictionary instead of word pieces. The BLEU I got is around 22.

@yynil Thanks. Did you just use a simple one-hot encoding then? Is there an example (or a parameter?) showing how to plug in our own tokenizer? Would you care to share more detail? Actually, if the current implementation cannot reliably produce a reasonable result(?), would you consider contributing your solution?

The solution is pretty simple and straightforward.

  1. Extract all unique characters in the corpus, including symbols.
  2. Use the characters as the dictionary and just use the text token encoder.
  3. Train the model using the transformer_base configuration.
  4. It took us 1.5 million steps to get a BLEU score of 22~23.

Our belief is that for unigram languages like Chinese and Japanese, word segmentation is no longer necessary, because a deep neural network will learn how to join characters into words better than any word segmentation library.
The translation model suggests my guess might be right.
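
Steps 1 and 2 above can be sketched roughly as follows. This is a plain-Python illustration, not the commenter's actual code and not a tensor2tensor API: `build_char_vocab` and `encode` are hypothetical helpers, the tiny `corpus` is made up, and the `<UNK>` symbol is an assumption (`<pad>` and `<EOS>` follow the reserved tokens that tensor2tensor's text encoders place at ids 0 and 1). Wiring the resulting vocabulary into a t2t problem is not shown.

```python
from collections import Counter

# Reserved symbols placed at the start of the vocabulary.
# "<UNK>" is an assumption added for out-of-vocabulary characters.
RESERVED = ["<pad>", "<EOS>", "<UNK>"]


def build_char_vocab(lines):
    """Return the reserved symbols followed by every unique character in the corpus."""
    counts = Counter(ch for line in lines for ch in line.rstrip("\n"))
    # Most frequent first, ties broken by code point, so the order is deterministic.
    chars = sorted(counts, key=lambda ch: (-counts[ch], ch))
    return RESERVED + chars


def encode(text, vocab_index):
    """Map each character to its id, falling back to the <UNK> id."""
    unk = vocab_index["<UNK>"]
    return [vocab_index.get(ch, unk) for ch in text.rstrip("\n")]


# Made-up two-line corpus; in practice, iterate over the training files.
corpus = ["我爱北京。", "I love Beijing."]
vocab = build_char_vocab(corpus)
vocab_index = {ch: i for i, ch in enumerate(vocab)}
ids = encode("我爱你", vocab_index)
print(len(vocab), ids)
```

The vocabulary could then be written one symbol per line to a file. Whether a given encoder class accepts that file directly, and how it splits input into tokens, should be checked against the library source.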

Segmentation features give better results in Named Entity Recognition (NER) work, even when the model is based on Google's BERT model. @yynil
