When I train with t2t-trainer using the universal_transformer model together with worker_gpu=2, training fails immediately with the error below and exits. (BTW: training with the default of 1 GPU works.)
INFO:tensorflow:Cannot use 'Identity_122' as input to 'Identity_33' because they are in different while loops.
Identity_122 while context: universal_transformer/parallel_1_5/universal_transformer/universal_transformer/body/encoder/universal_transformer_basic/foldl/while/while_context
Identity_33 while context: universal_transformer/parallel_0_5/universal_transformer/universal_transformer/body/encoder/universal_transformer_basic/foldl/while/while_context
Traceback for Identity_122:
.....
OS:
Linux 8d9c9f85bad0 4.4.0-131-generic
$ pip freeze | grep tensor
tensor2tensor==1.7.0
tensorboard==1.9.0
tensorflow==1.9.0
$ python -V
Python 3.6.5 :: Anaconda, Inc.
# Steps to reproduce:
t2t-trainer \
--data_dir=$DATA_DIR \
--problem=translate_enzh_wmt32k \
--model=universal_transformer \
--hparams_set=universal_transformer_small \
--hparams='batch_size=5120' \
--train_steps=800000 \
--random_seed=33 \
--worker_gpu=2 \
--output_dir=$TRAIN_DIR
INFO:tensorflow:Cannot use 'Identity_122' as input to 'Identity_33' because they are in different while loops.
Identity_122 while context: universal_transformer/parallel_1_5/universal_transformer/universal_transformer/body/encoder/universal_transformer_basic/foldl/while/while_context
Identity_33 while context: universal_transformer/parallel_0_5/universal_transformer/universal_transformer/body/encoder/universal_transformer_basic/foldl/while/while_context
Traceback for Identity_122:
.....
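One way to read the log above: the two while contexts differ only in the `parallel_1_5` versus `parallel_0_5` scope component. The sketch below pulls that component out of each line. It assumes (the log does not state this) that `parallel_<i>` scopes correspond to per-GPU replicas of the model; the helper name `replica_by_op` is made up for illustration.

```python
import re

# The two "while context" lines, copied verbatim from the log above.
LOG_LINES = [
    "Identity_122 while context: universal_transformer/parallel_1_5/"
    "universal_transformer/universal_transformer/body/encoder/"
    "universal_transformer_basic/foldl/while/while_context",
    "Identity_33 while context: universal_transformer/parallel_0_5/"
    "universal_transformer/universal_transformer/body/encoder/"
    "universal_transformer_basic/foldl/while/while_context",
]

# Matches "<op name> while context: <scope path>".
CONTEXT_RE = re.compile(r"^(\S+) while context: (\S+)$")
# Matches a "parallel_<i>" scope component.
# Assumption: <i> is the index of the per-GPU replica.
REPLICA_RE = re.compile(r"(?:^|/)parallel_(\d+)")


def replica_by_op(lines):
    """Map each op name to the replica index found in its while context."""
    result = {}
    for line in lines:
        match = CONTEXT_RE.match(line.strip())
        if not match:
            continue  # not a "while context" line
        op_name, context = match.groups()
        replica = REPLICA_RE.search(context)
        result[op_name] = int(replica.group(1)) if replica else None
    return result


replicas = replica_by_op(LOG_LINES)
print(replicas)
```

If that assumption holds, the op from one replica's `foldl` while loop is being used as an input inside the other replica's loop, which is what the error message complains about.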
Hello @gushuheng, can you tell me whether the universal_transformer model runs correctly for you? Could you also post your parameters? My tensorflow-gpu version is 1.8, but I can't run it with the default parameters; it reports a type error.
@zxqchat I use the default parameter set "universal_transformer_small" with "batch_size=5120".
BTW: I'm using TF 1.9.
t2t-trainer \
--data_dir=$DATA_DIR \
--problem=translate_enzh_wmt32k \
--model=universal_transformer \
--hparams_set=universal_transformer_small \
--hparams='batch_size=5120' \
--train_steps=800000 \
--random_seed=33 \
--worker_gpu=2 \
--output_dir=$TRAIN_DIR
@gushuheng thank you for your kind help. Your parameters are similar to mine, so I think the cause might be the TF version. My tensor2tensor version is the same as yours.
I also cannot run the Universal Transformer with multiple GPUs.
Hope someone can help fix it.
They have fixed it in the latest code. Great!
I have a new issue. When I use universal_transformer_big to train a model, the BLEU score is very low:
approx_bleu_score = 0.01985711
INFO:tensorflow:loss = 4.3538547, step = 22000 (82.838 sec)
- Make sure that you are on TF1.10 (or at least 1.9) and you're using the latest version of T2T.
- Also, make sure that hparams.daisy_chain_variables is set to False:
(https://github.com/tensorflow/tensor2tensor/blob/master/tensor2tensor/models/research/universal_transformer.py#L360)
Then it should work.
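A small sketch for the version check suggested above. Comparing version strings directly is misleading here, because "1.10.1" sorts before "1.9" as text even though it is the newer release. The helper names `version_tuple` and `meets_minimum` are made up for illustration; in practice the installed version would come from `tf.__version__`.

```python
def version_tuple(version):
    """Turn '1.10.1' into (1, 10, 1); a suffix such as 'rc0' is ignored."""
    parts = []
    for piece in version.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if digits == "":
            break  # stop at the first non-numeric component
        parts.append(int(digits))
    return tuple(parts)


def meets_minimum(installed, minimum):
    """Return True if `installed` is at least `minimum`, compared numerically."""
    return version_tuple(installed) >= version_tuple(minimum)


# Versions reported in this thread, checked against the suggested minimum of 1.9.
for installed in ["1.8.0", "1.9.0", "1.10.1"]:
    print(installed, meets_minimum(installed, "1.9"))

# Plain string comparison gets the last case wrong:
print("1.10.1" >= "1.9")
```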
Thank you, but I still can't solve this problem. Here is the information about my setup:
hparams.daisy_chain_variables = False # Breaks multi-gpu in while loops.
OS:
$ pip freeze | grep tensor
tensor2tensor==1.9.0
tensorboard==1.10.0
tensorflow==1.10.1
tensorflow-gpu==1.10.1
$ python -V
Python 3.6.4 :: Anaconda, Inc.
@kudou1994 The low BLEU score indicates that tensor2tensor is running correctly; the poor performance is caused by your settings, e.g. batch_size and training data.
I tried different datasets, for example WMT and AI Challenger. I tried both small and large datasets, with batch_size=1024 or 2048. BLEU is normal when I use transformer_base and only abnormal when I use universal_transformer.
I have the same problem: the training loss is about 4~5 after 8k steps. Have you solved it yet?
@kudou1994 my wechat id is lijingx-, hope we can help each other~~
The convergence problem of the Universal Transformer is solved in #1194.
(Really sorry for the delay in fixing this issue!)