I get the following error when attempting to train with a batch size of 64. Any help is much appreciated!
W tensorflow/core/framework/op_kernel.cc:968] Out of range: FIFOQueue '_1_batch_join/fifo_queue' is closed and has insufficient elements (requested 64, current size 0)
[[Node: batch_join = QueueDequeueUpTo[_class=["loc:@batch_join/fifo_queue"], component_types=[DT_FLOAT, DT_INT32], timeout_ms=-1, _device="/job:localhost/replica:0/task:0/cpu:0"](batch_join/fifo_queue, batch_join/n)]]
Traceback (most recent call last):
File "facenet_train_classifier.py", line 347, in <module>
main(parse_arguments(sys.argv[1:]))
File "facenet_train_classifier.py", line 175, in main
update_centers)
File "facenet_train_classifier.py", line 203, in train
err, _, _, step, reg_loss = sess.run([loss, train_op, update_centers, global_step, regularization_losses], feed_dict=feed_dict)
File "/home/tangwt/venvs/py27/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 717, in run
run_metadata_ptr)
File "/home/tangwt/venvs/py27/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 915, in _run
feed_dict_string, options, run_metadata)
File "/home/tangwt/venvs/py27/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 965, in _do_run
target_list, options, run_metadata)
File "/home/tangwt/venvs/py27/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 985, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors.OutOfRangeError: FIFOQueue '_1_batch_join/fifo_queue' is closed and has insufficient elements (requested 64, current size 0)
[[Node: batch_join = QueueDequeueUpTo[_class=["loc:@batch_join/fifo_queue"], component_types=[DT_FLOAT, DT_INT32], timeout_ms=-1, _device="/job:localhost/replica:0/task:0/cpu:0"](batch_join/fifo_queue, batch_join/n)]]
Caused by op u'batch_join', defined at:
File "facenet_train_classifier.py", line 347, in <module>
main(parse_arguments(sys.argv[1:]))
File "facenet_train_classifier.py", line 87, in main
args.batch_size, args.max_nrof_epochs, args.random_crop, args.random_flip, args.nrof_preprocess_threads)
File "/home/tangwt/facenet/src/facenet.py", line 138, in read_and_augument_data
allow_smaller_final_batch=True)
File "/home/tangwt/venvs/py27/lib/python2.7/site-packages/tensorflow/python/training/input.py", line 708, in batch_join
dequeued = queue.dequeue_up_to(batch_size, name=name)
File "/home/tangwt/venvs/py27/lib/python2.7/site-packages/tensorflow/python/ops/data_flow_ops.py", line 499, in dequeue_up_to
self._queue_ref, n=n, component_types=self._dtypes, name=name)
File "/home/tangwt/venvs/py27/lib/python2.7/site-packages/tensorflow/python/ops/gen_data_flow_ops.py", line 951, in _queue_dequeue_up_to
timeout_ms=timeout_ms, name=name)
File "/home/tangwt/venvs/py27/lib/python2.7/site-packages/tensorflow/python/framework/op_def_library.py", line 749, in apply_op
op_def=op_def)
File "/home/tangwt/venvs/py27/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 2380, in create_op
original_op=self._default_original_op, op_def=op_def)
File "/home/tangwt/venvs/py27/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 1298, in __init__
self._traceback = _extract_stack()
OutOfRangeError (see above for traceback): FIFOQueue '_1_batch_join/fifo_queue' is closed and has insufficient elements (requested 64, current size 0)
[[Node: batch_join = QueueDequeueUpTo[_class=["loc:@batch_join/fifo_queue"], component_types=[DT_FLOAT, DT_INT32], timeout_ms=-1, _device="/job:localhost/replica:0/task:0/cpu:0"](batch_join/fifo_queue, batch_join/n)]]
Maybe your input data is insufficient.
I was trying to train using the LFW dataset without validation. There are 5749 classes and 13233 images. The above error occurred during the 5th epoch.
Could you post your executed command so that we can more easily guess what happened?
Did you set --max_nrof_epochs to a low value?
The executed command is:
python facenet_train_classifier.py --logs_base_dir ~/facenet_logs/ --models_base_dir ~/facenet_models/ --data_dir ~/datasets/lfw_mtcnnpy_182 --image_size 160 --model_def models.inception_resnet_v1 --weight_decay 2e-4 --optimizer RMSPROP --learning_rate -1 --max_nrof_epochs 20 --keep_probability 0.8 --random_crop --random_flip --learning_rate_schedule_file ../data/learning_rate_schedule_classifier_long.txt --center_loss_factor 2e-4 --batch_size 64
Try setting epoch_size to a small value (such that batch_size * epoch_size < 13233). I think the problem is caused by the slice_input_producer stopping too early.
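As a rough check of the constraint above, the largest epoch_size that satisfies batch_size * epoch_size < 13233 can be computed in a couple of lines. The helper name `max_epoch_size` is made up for illustration and is not part of facenet.

```python
def max_epoch_size(nrof_images, batch_size):
    """Largest epoch_size such that batch_size * epoch_size < nrof_images."""
    return (nrof_images - 1) // batch_size


# Numbers from this thread: 13233 LFW images, batch size 64.
print(max_epoch_size(13233, 64))  # prints 206
```

With these numbers, any epoch_size up to 206 keeps one training epoch within a single pass over the dataset.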
There it is!
The problem is caused by
input_queue = tf.train.slice_input_producer([images, labels],
num_epochs=max_nrof_epochs, shuffle=shuffle)
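The queue reaches its epoch limit, and the next attempt to dequeue examples raises a tf.errors.OutOfRangeError.
In your example the producer supplies 20 * 13233 = 264660 examples in total, while each training epoch consumes 64 * 1000 = 64000 (batch_size * epoch_size). Since 64 * 1000 * 4 < 20 * 13233 < 64 * 1000 * 5, the error occurs in the 5th epoch.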
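The arithmetic above can be reproduced with a few lines of Python. This is only a back-of-the-envelope sketch: `first_failing_epoch` is a hypothetical helper, the epoch_size of 1000 is taken from the numbers in the comment above, and prefetching and smaller final batches are ignored.

```python
def first_failing_epoch(nrof_images, max_nrof_epochs, batch_size, epoch_size):
    """Return the 1-based training epoch in which the input queue runs dry,
    or None if the producer supplies enough examples for every epoch."""
    total_examples = nrof_images * max_nrof_epochs  # what the producer emits in total
    per_epoch = batch_size * epoch_size             # what one training epoch consumes
    full_epochs = total_examples // per_epoch       # training epochs that can complete
    if full_epochs >= max_nrof_epochs:
        return None
    return full_epochs + 1


# Settings from the failing run in this thread.
print(first_failing_epoch(13233, 20, 64, 1000))  # prints 5
# A smaller epoch_size that satisfies batch_size * epoch_size < 13233.
print(first_failing_epoch(13233, 20, 64, 206))   # prints None
```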
Thanks @davidsandberg and @ugtony, problem is solved by reducing epoch_size!
Great! Will remove the epochs limit in slice_input_producer some time. Closing this for now...
Can someone please explain the meaning of epoch_size, batch_size and max_nrof_epochs in the context of this project?
I understand that epoch_size is the number of batches that counts as one epoch. So if epoch_size is 10, then 10 mini-batches make up one epoch. Is that right?
I understand that batch_size is the number of images that make up a mini-batch and are passed through the GPU. So if batch_size is 120, are the triplets sampled from this batch, or does it mean that the number of triplets passed through the GPU is always 40 (because 120 images make 40 triplets)?
I understand that max_nrof_epochs is the maximum number of epochs to train for.
Can you please correct my understanding?
I have 2 images in each folder (each folder represents one person) and I have 10000 folders, so that is 20000 images in total.
max_nrof_epochs is 500
batch_size is 120
people_per_batch is 120
images_per_person is 2
epoch_size is 60
Why does my training crash as follows at the 25th minibatch?
`2018-04-16 08:27:14.280098: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1356] Found device 0 with properties:
name: Tesla V100-SXM2-16GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0000:00:1e.0
totalMemory: 15.77GiB freeMemory: 15.36GiB
2018-04-16 08:27:14.280122: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1435] Adding visible gpu devices: 0
2018-04-16 08:27:14.601180: I tensorflow/core/common_runtime/gpu/gpu_device.cc:923] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-04-16 08:27:14.601235: I tensorflow/core/common_runtime/gpu/gpu_device.cc:929] 0
2018-04-16 08:27:14.601255: I tensorflow/core/common_runtime/gpu/gpu_device.cc:942] 0: N
2018-04-16 08:27:14.601626: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1053] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 12921 MB memory) -> physical GPU (device: 0, name: Tesla V100-SXM2-16GB, pci bus id: 0000:00:1e.0, compute capability: 7.0)
Running forward pass on sampled images: 7.075
Selecting suitable triplets for training
src/train_tripletloss.py:297: RuntimeWarning: invalid value encountered in less
all_neg = np.where(neg_dists_sqr-pos_dist_sqr
Epoch: [0][1/60] Time 14.344 Loss 9.802
Epoch: [0][2/60] Time 1.198 Loss 9.642
Epoch: [0][3/60] Time 1.634 Loss 10.009
Running forward pass on sampled images: 2.805
Selecting suitable triplets for training
(nrof_random_negs, nrof_triplets) = (120, 120): time=2.819 seconds
Epoch: [0][4/60] Time 1.415 Loss 9.801
Epoch: [0][5/60] Time 1.332 Loss 9.986
Epoch: [0][6/60] Time 1.569 Loss 9.615
Running forward pass on sampled images: 3.394
Selecting suitable triplets for training
(nrof_random_negs, nrof_triplets) = (120, 120): time=3.408 seconds
Epoch: [0][7/60] Time 1.726 Loss 9.718
Epoch: [0][8/60] Time 1.450 Loss 9.737
Epoch: [0][9/60] Time 1.488 Loss 9.761
Running forward pass on sampled images: 3.120
Selecting suitable triplets for training
(nrof_random_negs, nrof_triplets) = (120, 120): time=3.134 seconds
Epoch: [0][10/60] Time 1.566 Loss 9.788
Epoch: [0][11/60] Time 1.477 Loss 9.589
Epoch: [0][12/60] Time 1.614 Loss 9.735
Running forward pass on sampled images: 3.190
Selecting suitable triplets for training
(nrof_random_negs, nrof_triplets) = (120, 120): time=3.204 seconds
Epoch: [0][13/60] Time 1.683 Loss 9.721
Epoch: [0][14/60] Time 1.444 Loss 9.739
Epoch: [0][15/60] Time 1.462 Loss 9.743
Running forward pass on sampled images: 2.961
Selecting suitable triplets for training
(nrof_random_negs, nrof_triplets) = (120, 120): time=2.975 seconds
Epoch: [0][16/60] Time 1.536 Loss 9.385
Epoch: [0][17/60] Time 1.374 Loss 9.386
Epoch: [0][18/60] Time 1.458 Loss 9.532
Running forward pass on sampled images: 3.354
Selecting suitable triplets for training
(nrof_random_negs, nrof_triplets) = (120, 120): time=3.368 seconds
Epoch: [0][19/60] Time 1.578 Loss 9.299
Epoch: [0][20/60] Time 1.528 Loss 9.363
Epoch: [0][21/60] Time 1.609 Loss 9.402
Running forward pass on sampled images: 3.286
Selecting suitable triplets for training
(nrof_random_negs, nrof_triplets) = (120, 120): time=3.300 seconds
Epoch: [0][22/60] Time 1.651 Loss 9.648
Epoch: [0][23/60] Time 1.486 Loss 9.345
Epoch: [0][24/60] Time 1.494 Loss 9.301
Running forward pass on sampled images: Traceback (most recent call last):
File "src/train_tripletloss.py", line 489, in <module>
main(parse_arguments(sys.argv[1:]))
File "src/train_tripletloss.py", line 188, in main
args.embedding_size, anchor, positive, negative, triplet_loss)
File "src/train_tripletloss.py", line 227, in train
learning_rate_placeholder: lr, phase_train_placeholder: True})
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 900, in run
run_metadata_ptr)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1135, in _run
feed_dict_tensor, options, run_metadata)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1316, in _do_run
run_metadata)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1335, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.OutOfRangeError: FIFOQueue '_1_batch_join/fifo_queue' is closed and has insufficient elements (requested 120, current size 0)
[[Node: batch_join = QueueDequeueUpToV2[component_types=[DT_FLOAT, DT_INT64], timeout_ms=-1, _device="/job:localhost/replica:0/task:0/device:CPU:0"](batch_join/fifo_queue, _arg_batch_size_0_0)]]
`
I also got this error, but I fixed it by deleting all the PNG images.
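To see whether a dataset contains PNG files before deleting anything, a minimal sketch using only the standard library could look like the following. The function name `find_png_files` is made up for illustration, and the thread does not establish why PNG files would trigger the error, so treat this purely as an inspection aid.

```python
import os


def find_png_files(data_dir):
    """Return sorted paths of files under data_dir with a .png extension (case-insensitive)."""
    matches = []
    for root, _dirs, files in os.walk(data_dir):
        for name in files:
            if os.path.splitext(name)[1].lower() == ".png":
                matches.append(os.path.join(root, name))
    return sorted(matches)
```

Calling `find_png_files` on the dataset directory lists the candidates, which can then be reviewed by hand.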