Models: I think the error occurs because eval processes the source-language file and then fails when concatenating the target-language outputs.

Created on 19 May 2020 · 10 Comments · Source: tensorflow/models

Prerequisites

Please answer the following questions for yourself before submitting an issue.

[1 ] I am using the latest TensorFlow Model Garden release and TensorFlow 2.2.0
[ 2] I am reporting the issue to the correct repository. (Model Garden official or research directory)
[ 3] I checked to make sure that this issue has not been filed already.

1. The entire URL of the file you are using

https://github.com/tensorflow/models/tree/master/official/nlp/transformer

2. Describe the bug

I use the following command to run the translation model.
Training itself works, but during training the script evaluates on the newstest2014.en file, fails to concatenate the translation results, and raises the error below:

!python3 transformer_main.py --data_dir=data_v2 \
--model_dir=model \
--vocab_file=data_v2/vocab.ende.32768 \
--param_set=big \
--train_steps=100000 \
--steps_between_evals=5000 \
--batch_size=4096 --max_length=64 \
--bleu_source=data_v2/newstest2014.en \
--bleu_ref=data_v2/newstest2014.de \
--num_gpus=4 \
--enable_time_history=False

64/64 [==============================] - 432s 7s/step
Traceback (most recent call last):
File "transformer_main.py", line 510, in <module>
app.run(main)
File "/home/dell/anaconda3/envs/tf2.2/lib/python3.6/site-packages/absl/app.py", line 299, in run
_run_main(main, args)
File "/home/dell/anaconda3/envs/tf2.2/lib/python3.6/site-packages/absl/app.py", line 250, in _run_main
sys.exit(main(argv))
File "transformer_main.py", line 498, in main
task.train()
File "transformer_main.py", line 371, in train
uncased_score, cased_score = self.eval()
File "transformer_main.py", line 404, in eval
distribution_strategy)
File "transformer_main.py", line 122, in evaluate_and_log_bleu
model, params, subtokenizer, bleu_source, bleu_ref, distribution_strategy)
File "transformer_main.py", line 89, in translate_and_compute_bleu
distribution_strategy=distribution_strategy)
File "/home/dell/workspace/cps_seq/models-master_origin_20200427/official/nlp/transformer/translate.py", line 174, in translate_file
val_outputs, _ = model.predict(text,verbose=1)
File "/home/dell/anaconda3/envs/tf2.2/lib/python3.6/site-packages/tensorflow/python/keras/engine/training.py", line 94, in _method_wrapper
return method(self, *args, **kwargs)
File "/home/dell/anaconda3/envs/tf2.2/lib/python3.6/site-packages/tensorflow/python/keras/engine/training.py", line 1396, in predict
all_outputs = nest.map_structure_up_to(batch_outputs, concat, outputs)
File "/home/dell/anaconda3/envs/tf2.2/lib/python3.6/site-packages/tensorflow/python/util/nest.py", line 1131, in map_structure_up_to
**kwargs)
File "/home/dell/anaconda3/envs/tf2.2/lib/python3.6/site-packages/tensorflow/python/util/nest.py", line 1227, in map_structure_with_tuple_paths_up_to
*flat_value_lists)]
File "/home/dell/anaconda3/envs/tf2.2/lib/python3.6/site-packages/tensorflow/python/util/nest.py", line 1226, in <listcomp>
results = [func(*args, **kwargs) for args in zip(flat_path_list,
File "/home/dell/anaconda3/envs/tf2.2/lib/python3.6/site-packages/tensorflow/python/util/nest.py", line 1129, in <lambda>
lambda _, *values: func(*values), # Discards the path arg.
File "/home/dell/anaconda3/envs/tf2.2/lib/python3.6/site-packages/tensorflow/python/keras/engine/training.py", line 1884, in concat
return array_ops.concat(tensors, axis=axis)
File "/home/dell/anaconda3/envs/tf2.2/lib/python3.6/site-packages/tensorflow/python/util/dispatch.py", line 180, in wrapper
return target(*args, **kwargs)
File "/home/dell/anaconda3/envs/tf2.2/lib/python3.6/site-packages/tensorflow/python/ops/array_ops.py", line 1630, in concat
return gen_array_ops.concat_v2(values=values, axis=axis, name=name)
File "/home/dell/anaconda3/envs/tf2.2/lib/python3.6/site-packages/tensorflow/python/ops/gen_array_ops.py", line 1198, in concat_v2
_ops.raise_from_not_ok_status(e, name)
File "/home/dell/anaconda3/envs/tf2.2/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 6816, in raise_from_not_ok_status
six.raise_from(core._status_to_exception(e.code, message), None)
File "<string>", line 3, in raise_from
tensorflow.python.framework.errors_impl.InvalidArgumentError: ConcatOp : Dimensions of inputs should match: shape[0] = [32,178] vs. shape[1] = [32,175] [Op:ConcatV2] name: concat

3. Steps to reproduce

4. Expected behavior

During training, the script evaluates the translation file and should complete that step; instead, the eval process fails with the concat error above.

5. Additional context

I think the cause is that eval translates the source-language file in batches, and the target-language results of those batches have shapes that do not match, so concatenating them fails.
I do not know how to fix it.

6. System information

OS: Linux Ubuntu 16.04
TensorFlow installed from source
TensorFlow version: 2.2.0
Python version: 3.6
Bazel version (if compiling from source):
GCC/Compiler version (if compiling from source):
CUDA/cuDNN version: 10.2.1
GPU model and memory: 2080 Ti × 4

official bug

paulrich1234

All 10 comments

Hi, can anybody help me?
I think the cause is that eval processes the source-language file and the per-batch target-language results have mismatched shapes, so concatenating them fails. It seems to be a bug in the tf.keras.Model.predict() API,
but I do not know how to fix it.

paulrich1234 on 19 May 2020

@omalleyt12 Hi Tom, is this some regression for keras in TF 2.2? I remember we encountered something similar.

@paulrich1234 --steps_between_evals=5000 should train 5000 steps before eval. However, your logs show it just runs 64 steps?

saberkun on 19 May 2020

Hi saberkun, I do not think it ran only 64 steps, because training ran for about 30 minutes. I will test it again.

Thank you for your attention .

paulrich1234 on 20 May 2020

The current master head should work with 2.2 as well. If you run with the current head, do you still see this error? This script has a nightly regression test, so I am a bit surprised to see it.
I do remember this error, as there was a bug caused by the Keras predict rewrite. So let us know your findings.

saberkun on 20 May 2020

This could be due to the Model.predict rewrite

Q: are the expected return values here Tensors or RaggedTensors? It seems to be trying to concat two Tensors of shape [32, 178] and [32, 175] along the first dimension, which will fail because the second dimensions are not equal

Are these shapes expected?

omalleyt12 on 20 May 2020

Hi omalleyt12, I think I have found the problem: tf.keras.Model.predict() cannot handle batches of different shapes. If every batch from the data generator has the same shape, such as [124, 36], predict works. But if the generator yields batches with different shapes, such as [124, 36] and [78, 56], it causes the problem.

paulrich1234 on 20 May 2020

@paulrich1234 Different batch sizes are ok, but different sizes for the second dimension (36 and 56 in this example) won't work because Model.predict concatenates batches into one Tensor along the batch dimension.

To support outputs from Model.predict that have different sizes in non-batch dimensions, the Model outputs would have to be RaggedTensors

omalleyt12 on 20 May 2020
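
To illustrate the rule in omalleyt12's comment above, here is a minimal sketch. It uses NumPy's `np.concatenate` as a stand-in for the concatenation that `Model.predict` performs internally (an assumption made so the example runs without TensorFlow; the traceback shows the real call is `array_ops.concat`). The helper name `can_concat_along_batch` is hypothetical, and the shapes are the ones from paulrich1234's example.

```python
import numpy as np


def can_concat_along_batch(batches):
    """Return True if all arrays can be joined along axis 0 (the batch axis)."""
    try:
        np.concatenate(batches, axis=0)
        return True
    except ValueError:
        # NumPy raises ValueError when any non-batch dimension differs.
        return False


# Different batch sizes (124 vs. 78) but the same sequence length (36).
same_length = [np.zeros((124, 36), dtype=np.int32),
               np.zeros((78, 36), dtype=np.int32)]

# Different sequence lengths (36 vs. 56) in the second dimension.
mixed_length = [np.zeros((124, 36), dtype=np.int32),
                np.zeros((78, 56), dtype=np.int32)]

print(can_concat_along_batch(same_length))   # batch sizes may differ
print(can_concat_along_batch(mixed_length))  # sequence lengths may not
```

The first call succeeds because only the batch dimension differs; the second fails for the same reason as the `[32,178]` vs. `[32,175]` error in the traceback.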

Hi @omalleyt12, thanks.

paulrich1234 on 21 May 2020

@paulrich1234 The second dimension is the sequence dimension. If you see this, I doubt it is caused by the dynamic batch size, and the input data is padded: https://github.com/tensorflow/models/blob/master/official/nlp/transformer/translate.py#L116
The only reason I can think of for the outputs having different lengths is something related to beam search.
I think using "padded_decode" here,
https://github.com/tensorflow/models/blob/master/official/nlp/transformer/transformer.py#L292
will make sure the outputs have the same length.
If it works, then for the case where "padded_decode" is false, we should pad the output to make model.predict happy.

saberkun on 21 May 2020
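
One way the padding idea in saberkun's comment might look, sketched with NumPy outside the model rather than as the repository's actual fix: each batch of decoded IDs is right-padded to the longest sequence length before the batches are concatenated. The helper name `pad_and_concat` and the use of `0` as the padding ID are assumptions for illustration only.

```python
import numpy as np


def pad_and_concat(batches, pad_id=0):
    """Right-pad 2-D [batch, length] arrays to a common length, then join them."""
    max_len = max(b.shape[1] for b in batches)
    padded = [
        # Pad only the end of the sequence axis; leave the batch axis alone.
        np.pad(b, ((0, 0), (0, max_len - b.shape[1])),
               mode="constant", constant_values=pad_id)
        for b in batches
    ]
    return np.concatenate(padded, axis=0)


# Two batches with the shapes reported in the traceback.
outputs = [np.ones((32, 178), dtype=np.int32),
           np.ones((32, 175), dtype=np.int32)]

merged = pad_and_concat(outputs)
print(merged.shape)
```

With the shapes from the traceback, the shorter batch gains three trailing padding columns, so the concatenation no longer fails. Whether trailing padding IDs are harmless depends on how the downstream decoding step treats them, which this sketch does not cover.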

Thank you @saberkun

paulrich1234 on 22 May 2020
