Tensor2tensor: universal_transformer model doesn't work with worker_gpu=2?

Created on 18 Aug 2018 · 14 Comments · Source: tensorflow/tensor2tensor

Description

When I train with t2t-trainer using the universal_transformer model together with worker_gpu=2, training fails immediately with the error below and exits (BTW: training with the default of 1 GPU works).
INFO:tensorflow:Cannot use 'Identity_122' as input to 'Identity_33' because they are in different while loops.
Identity_122 while context: universal_transformer/parallel_1_5/universal_transformer/universal_transformer/body/encoder/universal_transformer_basic/foldl/while/while_context
Identity_33 while context: universal_transformer/parallel_0_5/universal_transformer/universal_transformer/body/encoder/universal_transformer_basic/foldl/while/while_context

Traceback for Identity_122:
.....

Environment information

OS:
Linux 8d9c9f85bad0 4.4.0-131-generic

$ pip freeze | grep tensor
tensor2tensor==1.7.0
tensorboard==1.9.0
tensorflow==1.9.0

$ python -V
Python 3.6.5 :: Anaconda, Inc.

For bugs: reproduction and error logs

# Steps to reproduce:
t2t-trainer \
  --data_dir=$DATA_DIR \
  --problem=translate_enzh_wmt32k \
  --model=universal_transformer \
  --hparams_set=universal_transformer_small \
  --hparams='batch_size=5120' \
  --train_steps=800000 \
  --random_seed=33 \
  --worker_gpu=2 \
  --output_dir=$TRAIN_DIR

Error logs:

INFO:tensorflow:Cannot use 'Identity_122' as input to 'Identity_33' because they are in different while loops.
Identity_122 while context: universal_transformer/parallel_1_5/universal_transformer/universal_transformer/body/encoder/universal_transformer_basic/foldl/while/while_context
Identity_33 while context: universal_transformer/parallel_0_5/universal_transformer/universal_transformer/body/encoder/universal_transformer_basic/foldl/while/while_context

Traceback for Identity_122:
.....

All 14 comments

Hello @gushuheng, can you tell me whether the universal_transformer model runs correctly for you? And can you post your parameters? My tensorflow-gpu version is 1.8, but I can't run it with the default parameters; it reports a type error.

@zxqchat I use the default "universal_transformer_small" hparams set with "batch_size=5120".
BTW: I'm using TF 1.9.
t2t-trainer \
--data_dir=$DATA_DIR \
--problem=translate_enzh_wmt32k \
--model=universal_transformer \
--hparams_set=universal_transformer_small \
--hparams='batch_size=5120' \
--train_steps=800000 \
--random_seed=33 \
--worker_gpu=2 \
--output_dir=$TRAIN_DIR

@gushuheng Thank you for your kind help. Your parameters are similar to mine, so I think the cause may be the TF version. My tensor2tensor version is the same as yours.

I also cannot run the Universal Transformer with multiple GPUs.
I hope someone can help fix it.

They have fixed it in the latest code. Great!

I have a new issue. When I use universal_transformer_big to train a model, the BLEU score is very low:
approx_bleu_score = 0.01985711,INFO:tensorflow:loss = 4.3538547, step = 22000 (82.838 sec)
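
To see whether loss and approximate BLEU actually move over training, log lines like the one above can be parsed and compared across steps. This is only a minimal sketch, assuming the log format is exactly what is pasted in this comment; `parse_metrics` and the two regex names are hypothetical helpers, not part of tensor2tensor.

```python
import re

# Patterns assume the log format pasted in this thread; adjust if yours differs.
LOSS_RE = re.compile(r"loss = ([0-9.]+), step = (\d+)")
BLEU_RE = re.compile(r"approx_bleu_score = ([0-9.]+)")


def parse_metrics(log_text):
    """Extract loss, step and approx BLEU from a chunk of trainer log output.

    Returns a dict containing only the metrics that were found.
    """
    metrics = {}
    loss_match = LOSS_RE.search(log_text)
    if loss_match:
        metrics["loss"] = float(loss_match.group(1))
        metrics["step"] = int(loss_match.group(2))
    bleu_match = BLEU_RE.search(log_text)
    if bleu_match:
        metrics["approx_bleu"] = float(bleu_match.group(1))
    return metrics


line = ("approx_bleu_score = 0.01985711,"
        "INFO:tensorflow:loss = 4.3538547, step = 22000 (82.838 sec)")
print(parse_metrics(line))
```

Collecting these dicts at several steps makes it easier to tell a model that is slowly converging from one whose loss has stalled.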

  1. Make sure that you are on TF1.10 (or at least 1.9) and you're using the latest version of T2T.
  2. Also, make sure that hparams.daisy_chain_variables is set to False:
    (https://github.com/tensorflow/tensor2tensor/blob/master/tensor2tensor/models/research/universal_transformer.py#L360)
    Then it should work.
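
The two checks above can be sketched in Python. This is a rough illustration, not part of tensor2tensor: `parse_version`, `meets_minimum` and `hparams_override` are hypothetical helpers, and it is an assumption that `daisy_chain_variables` can be overridden through `--hparams` in the same way `batch_size` is in the command earlier in the thread.

```python
import re


def parse_version(text):
    """Turn a version string such as '1.10.1' into a tuple like (1, 10, 1).

    Parsing stops at the first component that does not start with a digit.
    """
    parts = []
    for piece in text.split("."):
        match = re.match(r"\d+", piece)
        if not match:
            break
        parts.append(int(match.group()))
    return tuple(parts)


def meets_minimum(installed, minimum):
    """Compare versions numerically, so that 1.10 counts as newer than 1.9."""
    return parse_version(installed) >= parse_version(minimum)


def hparams_override(overrides):
    """Build a comma-separated string for t2t-trainer's --hparams flag."""
    return ",".join("{}={}".format(key, value) for key, value in overrides.items())


# Versions reported in this thread, checked against the suggested minimum of 1.9.
print(meets_minimum("1.10.1", "1.9"))
print(meets_minimum("1.8.0", "1.9"))

# Assumed override string; the flag value itself comes from the advice above.
print(hparams_override({"batch_size": 5120, "daisy_chain_variables": False}))
```

Comparing tuples of integers rather than raw strings matters here, because a plain string comparison would rank "1.10" below "1.9".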

Thank you, but this did not solve the problem. Here is the information about my setup:

hparams.daisy_chain_variables = False # Breaks multi-gpu in while loops.

OS:

$ pip freeze | grep tensor
tensor2tensor==1.9.0
tensorboard==1.10.0
tensorflow==1.10.1
tensorflow-gpu==1.10.1

$ python -V
Python 3.6.4 :: Anaconda, Inc.

@kudou1994 A low BLEU score indicates that tensor2tensor itself is running correctly. The poor performance is caused by your settings, e.g. batch_size and training data.

I have tried different datasets, for example WMT and AI Challenger, both small and large, with batch_size=1024 or 2048. BLEU is normal when I use transformer_base; it is only abnormal when I use universal_transformer.

I have the same problem: the training loss is about 4~5 after 8k steps. Have you solved it yet?

@kudou1994 My WeChat ID is lijingx-; I hope we can help each other.

The convergence problem of the Universal Transformer is solved in #1194.
(Really sorry for the delay in fixing this issue!)
