https://github.com/tensorflow/models/tree/master/official/nlp/transformer
I use the following command to run the translation training.
Training itself runs fine, but during training the script evaluates on the newstest2014.en file, fails to concatenate the translated batches, and raises the error shown below.
!python3 transformer_main.py --data_dir=data_v2 \
--model_dir=model \
--vocab_file=data_v2/vocab.ende.32768 \
--param_set=big \
--train_steps=100000 \
--steps_between_evals=5000 \
--batch_size=4096 --max_length=64 \
--bleu_source=data_v2/newstest2014.en \
--bleu_ref=data_v2/newstest2014.de \
--num_gpus=4 \
--enable_time_history=False
64/64 [==============================] - 432s 7s/step
Traceback (most recent call last):
File "transformer_main.py", line 510, in <module>
app.run(main)
File "/home/dell/anaconda3/envs/tf2.2/lib/python3.6/site-packages/absl/app.py", line 299, in run
_run_main(main, args)
File "/home/dell/anaconda3/envs/tf2.2/lib/python3.6/site-packages/absl/app.py", line 250, in _run_main
sys.exit(main(argv))
File "transformer_main.py", line 498, in main
task.train()
File "transformer_main.py", line 371, in train
uncased_score, cased_score = self.eval()
File "transformer_main.py", line 404, in eval
distribution_strategy)
File "transformer_main.py", line 122, in evaluate_and_log_bleu
model, params, subtokenizer, bleu_source, bleu_ref, distribution_strategy)
File "transformer_main.py", line 89, in translate_and_compute_bleu
distribution_strategy=distribution_strategy)
File "/home/dell/workspace/cps_seq/models-master_origin_20200427/official/nlp/transformer/translate.py", line 174, in translate_file
val_outputs, _ = model.predict(text,verbose=1)
File "/home/dell/anaconda3/envs/tf2.2/lib/python3.6/site-packages/tensorflow/python/keras/engine/training.py", line 94, in _method_wrapper
return method(self, *args, **kwargs)
File "/home/dell/anaconda3/envs/tf2.2/lib/python3.6/site-packages/tensorflow/python/keras/engine/training.py", line 1396, in predict
all_outputs = nest.map_structure_up_to(batch_outputs, concat, outputs)
File "/home/dell/anaconda3/envs/tf2.2/lib/python3.6/site-packages/tensorflow/python/util/nest.py", line 1131, in map_structure_up_to
**kwargs)
File "/home/dell/anaconda3/envs/tf2.2/lib/python3.6/site-packages/tensorflow/python/util/nest.py", line 1227, in map_structure_with_tuple_paths_up_to
*flat_value_lists)]
File "/home/dell/anaconda3/envs/tf2.2/lib/python3.6/site-packages/tensorflow/python/util/nest.py", line 1226, in <listcomp>
results = [func(
File "/home/dell/anaconda3/envs/tf2.2/lib/python3.6/site-packages/tensorflow/python/util/nest.py", line 1129, in <lambda>
lambda _, *values: func(
File "/home/dell/anaconda3/envs/tf2.2/lib/python3.6/site-packages/tensorflow/python/keras/engine/training.py", line 1884, in concat
return array_ops.concat(tensors, axis=axis)
File "/home/dell/anaconda3/envs/tf2.2/lib/python3.6/site-packages/tensorflow/python/util/dispatch.py", line 180, in wrapper
return target(*args, **kwargs)
File "/home/dell/anaconda3/envs/tf2.2/lib/python3.6/site-packages/tensorflow/python/ops/array_ops.py", line 1630, in concat
return gen_array_ops.concat_v2(values=values, axis=axis, name=name)
File "/home/dell/anaconda3/envs/tf2.2/lib/python3.6/site-packages/tensorflow/python/ops/gen_array_ops.py", line 1198, in concat_v2
_ops.raise_from_not_ok_status(e, name)
File "/home/dell/anaconda3/envs/tf2.2/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 6816, in raise_from_not_ok_status
six.raise_from(core._status_to_exception(e.code, message), None)
File "
tensorflow.python.framework.errors_impl.InvalidArgumentError: ConcatOp : Dimensions of inputs should match: shape[0] = [32,178] vs. shape[1] = [32,175] [Op:ConcatV2] name: concat
During training the script evaluates the translation file, but the evaluation step fails with a concat error.
I think this happens because the evaluation translates the source-language file batch by batch, and the target-language batch results have shapes that do not match when they are concatenated.
I do not know how to fix it.
Hi, can anybody help me?
I still think the batch results produced during evaluation have mismatched shapes when they are concatenated. It looks like a bug in the tf.keras.Model.predict() API,
but I do not know how to fix it.
@omalleyt12 Hi Tom, is this some regression for keras in TF 2.2? I remember we encountered something similar.
@paulrich1234 --steps_between_evals=5000 should train 5000 steps before eval. However, your logs show it just runs 64 steps?
Hi saberkun, I don't think it ran only 64 steps, because the run took about 30 minutes. I will test it again.
Thank you for your attention.
The current master head should work with TF 2.2 as well. If you run with the current head, do you still see this error? This script has a nightly regression test, so I am a bit surprised to see it.
I do remember this error, as there was a bug caused by the Keras predict rewrite. So let us know your findings.
This could be due to the Model.predict rewrite
Q: are the expected return values here Tensors or RaggedTensors? It seems to be trying to concat two Tensors of shape [32, 178] and [32, 175] along the first dimension, which will fail because the second dimensions are not equal
Are these shapes expected?
Hi omalleyt12: I think I have found the problem. tf.keras.Model.predict() seems to handle only batches of a single shape; if the data generator yields batches with two different shapes, it fails. For example, a single batch of shape [124, 36] as the input to predict is fine, but a data generator that yields batches with shapes like [124, 36] and [78, 56] causes the problem.
@paulrich1234 Different batch sizes are ok, but different sizes for the second dimension (36 and 56 in this example) won't work because Model.predict concatenates batches into one Tensor along the batch dimension.
To support outputs from Model.predict that have different sizes in non-batch dimensions, the Model outputs would have to be RaggedTensors
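To make the shape rule above concrete, here is a minimal sketch. It uses NumPy as a stand-in for the concat that Model.predict performs internally (an assumption made for illustration; the real code path goes through tf.concat), and the helper name `can_concat_along_batch` is hypothetical.

```python
import numpy as np


def can_concat_along_batch(shapes):
    """Return True if arrays with these shapes can be joined along axis 0."""
    arrays = [np.zeros(shape, dtype=np.int32) for shape in shapes]
    try:
        np.concatenate(arrays, axis=0)
        return True
    except ValueError:
        # Raised when any non-batch dimension differs between the arrays.
        return False


# Different batch sizes, same sequence length: concatenation works.
different_batch_sizes = can_concat_along_batch([(124, 36), (78, 36)])

# Same batch size, different sequence lengths (the shapes from the traceback):
# concatenation fails.
different_seq_lengths = can_concat_along_batch([(32, 178), (32, 175)])

print(different_batch_sizes, different_seq_lengths)
```

Only the batch dimension may vary between batches; every other dimension has to agree for the per-batch outputs to be merged into one Tensor.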
hi @omalleyt12 Thanks
@paulrich1234 The second dimension is the sequence dimension. I doubt the mismatch is caused by the dynamic batch size, because the data is padded: https://github.com/tensorflow/models/blob/master/official/nlp/transformer/translate.py#L116
The reason I can think of for the outputs having different lengths is something related to beam search.
I think that using "padded_decode" here,
https://github.com/tensorflow/models/blob/master/official/nlp/transformer/transformer.py#L292
will make sure the outputs have the same length.
If that works, then for the case where "padded_decode" is false, we should pad the output to make model.predict happy.
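As a rough illustration of the "pad the output" idea, the sketch below right-pads each batch of decoded ids to a common length before concatenating. It is not the fix used in the repository: it uses NumPy instead of TensorFlow, the helper `pad_to_length` is hypothetical, and `pad_id=0` is an assumed padding id.

```python
import numpy as np


def pad_to_length(batch, target_length, pad_id=0):
    """Right-pad a [batch, seq] array with pad_id up to target_length columns."""
    extra = target_length - batch.shape[1]
    if extra < 0:
        raise ValueError("batch is longer than target_length")
    # Pad nothing on the batch axis, and `extra` columns at the end of the sequence axis.
    return np.pad(batch, ((0, 0), (0, extra)), constant_values=pad_id)


# Two per-batch outputs with the shapes reported in the error message.
batch_outputs = [
    np.ones((32, 178), dtype=np.int32),
    np.ones((32, 175), dtype=np.int32),
]

max_len = max(b.shape[1] for b in batch_outputs)
merged = np.concatenate([pad_to_length(b, max_len) for b in batch_outputs], axis=0)

print(merged.shape)  # (64, 178)
```

Inside a model the target length would have to be known before all batches are seen, so in practice the padding would go up to a fixed maximum decode length rather than the largest length observed.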
Thank you @saberkun