Tensor2tensor: universal_transformer model does not work with worker_gpu=2

Created on 18 Aug 2018 · 14 Comments · Source: tensorflow/tensor2tensor

Description

When I train with t2t-trainer using the universal_transformer model together with worker_gpu=2, training fails immediately with the error below and exits. (BTW: training with the default of 1 GPU works.)
INFO:tensorflow:Cannot use 'Identity_122' as input to 'Identity_33' because they are in different while loops.
Identity_122 while context: universal_transformer/parallel_1_5/universal_transformer/universal_transformer/body/encoder/universal_transformer_basic/foldl/while/while_context
Identity_33 while context: universal_transformer/parallel_0_5/universal_transformer/universal_transformer/body/encoder/universal_transformer_basic/foldl/while/while_context

Traceback for Identity_122:
.....

Environment information

OS:
Linux 8d9c9f85bad0 4.4.0-131-generic

$ pip freeze | grep tensor
tensor2tensor==1.7.0
tensorboard==1.9.0
tensorflow==1.9.0

$ python -V
Python 3.6.5 :: Anaconda, Inc.

For bugs: reproduction and error logs

# Steps to reproduce:
t2t-trainer \
  --data_dir=$DATA_DIR \
  --problem=translate_enzh_wmt32k \
  --model=universal_transformer \
  --hparams_set=universal_transformer_small \
  --hparams='batch_size=5120' \
  --train_steps=800000 \
  --random_seed=33 \
  --worker_gpu=2 \
  --output_dir=$TRAIN_DIR

Error logs:

INFO:tensorflow:Cannot use 'Identity_122' as input to 'Identity_33' because they are in different while loops.
Identity_122 while context: universal_transformer/parallel_1_5/universal_transformer/universal_transformer/body/encoder/universal_transformer_basic/foldl/while/while_context
Identity_33 while context: universal_transformer/parallel_0_5/universal_transformer/universal_transformer/body/encoder/universal_transformer_basic/foldl/while/while_context

Traceback for Identity_122:
.....

Source

gushuheng

All 14 comments

Hello @gushuheng, when you run the universal_transformer model, does it run correctly? Could you post your parameters? My tensorflow-gpu version is 1.8, but I can't run it with the default parameters; it reports a type error.

Qnlp on 22 Aug 2018

@zxqchat I use the default hparams set "universal_transformer_small" with "batch_size=5120".
BTW: I'm using TF 1.9.
t2t-trainer \
--data_dir=$DATA_DIR \
--problem=translate_enzh_wmt32k \
--model=universal_transformer \
--hparams_set=universal_transformer_small \
--hparams='batch_size=5120' \
--train_steps=800000 \
--random_seed=33 \
--worker_gpu=2 \
--output_dir=$TRAIN_DIR

gushuheng on 22 Aug 2018

@gushuheng Thank you for your kind help. Your parameters are similar to mine, so I think the cause might be the TF version. My tensor2tensor version is the same as yours.

Qnlp on 22 Aug 2018

I also cannot run the Universal Transformer with multiple GPUs.
I hope someone can help fix it.

zherowolf on 24 Aug 2018

They have fixed it in the latest code. Great!

zherowolf on 24 Aug 2018

👍3

I have a new issue. When I use universal_transformer_big to train a model, the BLEU score is very low:
approx_bleu_score = 0.01985711
INFO:tensorflow:loss = 4.3538547, step = 22000 (82.838 sec)

kudou1994 on 19 Sep 2018

Make sure that you are on TF1.10 (or at least 1.9) and you're using the latest version of T2T.
Also, make sure that hparams.daisy_chain_variables is set to False:
(https://github.com/tensorflow/tensor2tensor/blob/master/tensor2tensor/models/research/universal_transformer.py#L360)
Then it should work.
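
As an illustration of the two checks above (not code from this thread), here is a minimal Python sketch. The helper names `version_tuple`, `meets_minimum`, and `hparams_override` are hypothetical and not part of T2T. It also assumes that `daisy_chain_variables` can be overridden through the `--hparams` flag like other hyperparameters; the thread itself only shows it being set in universal_transformer.py.

```python
def version_tuple(version):
    """Turn '1.10.1' into (1, 10, 1) so versions compare numerically.

    Plain string comparison would rank '1.10' below '1.9'.
    """
    return tuple(int(part) for part in version.split("."))


def meets_minimum(installed, minimum):
    """Return True if the installed version is at least the minimum."""
    return version_tuple(installed) >= version_tuple(minimum)


def hparams_override(overrides):
    """Format a dict as a comma-separated string for the --hparams flag.

    Hypothetical helper: it only builds the string, it does not call T2T.
    """
    return ",".join("{}={}".format(name, value)
                    for name, value in overrides.items())


if __name__ == "__main__":
    # Versions reported in this thread, checked against "at least 1.9".
    for installed in ("1.8.0", "1.9.0", "1.10.1"):
        print(installed, meets_minimum(installed, "1.9"))

    # Assumption: daisy_chain_variables is accepted as an --hparams override.
    print(hparams_override({"batch_size": 5120,
                            "daisy_chain_variables": False}))
```

Under that assumption, the resulting string could replace the `--hparams` value in the t2t-trainer command from the issue description.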

MostafaDehghani on 11 Oct 2018

Manumanu199719 on 11 Oct 2018

Thank you, but this did not solve the problem for me. Here is the information about my setup:

hparams.daisy_chain_variables = False # Breaks multi-gpu in while loops.

OS:

$ pip freeze | grep tensor
tensor2tensor==1.9.0
tensorboard==1.10.0
tensorflow==1.10.1
tensorflow-gpu==1.10.1

$ python -V
Python 3.6.4 :: Anaconda, Inc.

kudou1994 on 12 Oct 2018

@kudou1994 The low BLEU score indicates that tensor2tensor is running correctly. The poor performance is caused by your settings, e.g. batch_size and training data.

Qnlp on 12 Oct 2018

I tried different datasets, for example WMT and AI Challenger. Both small and large datasets have been tried, with batch_size=1024 or 2048. BLEU is normal when I use transformer_base and only abnormal when I use universal_transformer.

kudou1994 on 12 Oct 2018

I got the same problem: the training loss is about 4~5 after 8k steps. Have you solved it yet?

Bournet on 18 Oct 2018

@kudou1994 my wechat id is lijingx-, hope we can help each other~~

li10141110 on 22 Oct 2018

The convergence problem of the Universal Transformer is solved in #1194.
(Really sorry for the delay in fixing this issue!)

MostafaDehghani on 1 Nov 2018
