I0819 12:00:50.283545 140343560021824 tf_logging.py:115] Saving checkpoints for 0 into /home/rudy/git/models/official/transformer/model_base/model.ckpt.
Traceback (most recent call last):
File "transformer_main.py", line 636, in
absl_app.run(main)
File "/home/rudy/anaconda2/envs/tf1.12py2.7/lib/python2.7/site-packages/absl/app.py", line 300, in run
_run_main(main, args)
File "/home/rudy/anaconda2/envs/tf1.12py2.7/lib/python2.7/site-packages/absl/app.py", line 251, in _run_main
sys.exit(main(argv))
File "transformer_main.py", line 630, in main
run_transformer(flags.FLAGS)
File "transformer_main.py", line 611, in run_transformer
vocab_file=flags_obj.vocab_file)
File "transformer_main.py", line 332, in run_loop
hooks=train_hooks)
File "/home/rudy/anaconda2/envs/tf1.12py2.7/lib/python2.7/site-packages/tensorflow/python/estimator/estimator.py", line 354, in train
loss = self._train_model(input_fn, hooks, saving_listeners)
File "/home/rudy/anaconda2/envs/tf1.12py2.7/lib/python2.7/site-packages/tensorflow/python/estimator/estimator.py", line 1205, in _train_model
return self._train_model_distributed(input_fn, hooks, saving_listeners)
File "/home/rudy/anaconda2/envs/tf1.12py2.7/lib/python2.7/site-packages/tensorflow/python/estimator/estimator.py", line 1352, in _train_model_distributed
saving_listeners)
File "/home/rudy/anaconda2/envs/tf1.12py2.7/lib/python2.7/site-packages/tensorflow/python/estimator/estimator.py", line 1471, in _train_with_estimator_spec
_, loss = mon_sess.run([estimator_spec.train_op, estimator_spec.loss])
File "/home/rudy/anaconda2/envs/tf1.12py2.7/lib/python2.7/site-packages/tensorflow/python/training/monitored_session.py", line 671, in run
run_metadata=run_metadata)
File "/home/rudy/anaconda2/envs/tf1.12py2.7/lib/python2.7/site-packages/tensorflow/python/training/monitored_session.py", line 1156, in run
run_metadata=run_metadata)
File "/home/rudy/anaconda2/envs/tf1.12py2.7/lib/python2.7/site-packages/tensorflow/python/training/monitored_session.py", line 1255, in run
raise six.reraise(*original_exc_info)
File "/home/rudy/anaconda2/envs/tf1.12py2.7/lib/python2.7/site-packages/tensorflow/python/training/monitored_session.py", line 1240, in run
return self._sess.run(*args, **kwargs)
File "/home/rudy/anaconda2/envs/tf1.12py2.7/lib/python2.7/site-packages/tensorflow/python/training/monitored_session.py", line 1312, in run
run_metadata=run_metadata)
File "/home/rudy/anaconda2/envs/tf1.12py2.7/lib/python2.7/site-packages/tensorflow/python/training/monitored_session.py", line 1076, in run
return self._sess.run(*args, **kwargs)
File "/home/rudy/anaconda2/envs/tf1.12py2.7/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 929, in run
run_metadata_ptr)
File "/home/rudy/anaconda2/envs/tf1.12py2.7/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1152, in _run
feed_dict_tensor, options, run_metadata)
File "/home/rudy/anaconda2/envs/tf1.12py2.7/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1328, in _do_run
run_metadata)
File "/home/rudy/anaconda2/envs/tf1.12py2.7/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1348, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Window size must be greater than zero, but got 0.
[[{{node IteratorGetNext}} = IteratorGetNext[output_shapes=[[?,?], [?,?]], output_types=[DT_INT64, DT_INT64]](IteratorFromStringHandleV2)]]
[[node ExperimentalFunctionBufferingResourceGetNext_2 (defined at /home/rudy/anaconda2/envs/tf1.12py2.7/lib/python2.7/site-packages/tensorflow/contrib/distribute/python/prefetching_ops_v2.py:124) = ExperimentalFunctionBufferingResourceGetNext[output_types=[DT_INT64, DT_INT64], _device="/job:localhost/replica:0/task:0/device:GPU:2"](ExperimentalFunctionBufferingResource_2)]]
What's the problem? Any suggestions would be appreciated.
** I edited this comment because I was completely wrong before
I had the same issue, and I fixed it by increasing the 'batch_size' parameter.
What the program does is create buckets of sequences of similar length, based purely on the 'batch_size' and 'max_length' parameters. Each bucket has an individual batch size (implemented as the window size you see in the error above), tuned so that the maximum sequence length allowed in the bucket multiplied by the bucket's batch size is approximately equal to the 'batch_size' parameter. This way, roughly the same amount of data goes into every batch.
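As a rough sketch of that arithmetic (my own illustration; the variable names and bucket boundaries below are made up, not the exact code in data_pipeline.py):

```python
# Hypothetical illustration of the per-bucket window size described above.
batch_size = 4096   # token budget per batch
boundaries = [8, 16, 32, 64, 128, 256]  # example bucket max lengths

for bucket_max_length in boundaries:
    # Sequences per batch for this bucket, chosen so that
    # window_size * bucket_max_length is roughly batch_size.
    window_size = batch_size // bucket_max_length
    print(bucket_max_length, window_size)
# -> 8 512, 16 256, ..., 256 16: every batch carries about the same
#    number of tokens regardless of sequence length.
```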
The problem occurs when 'batch_size' is less than 'max_length': the buckets at the upper end of the length range cannot have a window size/individual batch size small enough to satisfy (window_size) * (bucket sequence length) <= 'batch_size'. The window size therefore gets set to 0, which TensorFlow doesn't like, causing the error above.
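To make that concrete (same hypothetical arithmetic as the sketch above):

```python
# With a token budget smaller than the longest bucket, the integer
# division bottoms out at zero.
batch_size = 200   # token budget
max_length = 256   # longest allowed sequence length
window_size = batch_size // max_length  # 200 // 256 == 0
# The group-by-window transformation that consumes this value then
# raises "Window size must be greater than zero, but got 0."
```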
It would be nice if they checked for this in the code and threw a more informative error. None of this logic depends on the data, so the check could run before any data is loaded; a sketch of such a check is below. FYI, the file all this code lives in is official/transformer/v2/data_pipeline.py.
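For instance, a guard along these lines (my own sketch; no such function exists in the repo) run before building the input pipeline would fail fast with a clearer message:

```python
def validate_batch_size(batch_size, max_length):
    # batch_size is a token budget, so it must cover at least one
    # sequence of the maximum allowed length.
    if batch_size <= max_length:
        raise ValueError(
            "batch_size (%d) must be greater than max_length (%d); "
            "otherwise the longest bucket's window size is 0."
            % (batch_size, max_length))
```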
TL;DR: change your 'batch_size' param to be at least 1 greater than your 'max_length' param.
Ah, yes. The 'batch_size' is actually the token budget, not the number of sequences per batch (e.g. with batch_size=4096, a bucket of length-128 sequences is batched 32 at a time). It is quite confusing.