Facenet: OutOfRangeError occurred during training

Created on 6 Jan 2017 · 11 Comments · Source: davidsandberg/facenet

I get the following error when attempting to train with a batch size of 64. Any help is much appreciated!

W tensorflow/core/framework/op_kernel.cc:968] Out of range: FIFOQueue '_1_batch_join/fifo_queue' is closed and has insufficient elements (requested 64, current size 0)
[[Node: batch_join = QueueDequeueUpTo[_class=["loc:@batch_join/fifo_queue"], component_types=[DT_FLOAT, DT_INT32], timeout_ms=-1, _device="/job:localhost/replica:0/task:0/cpu:0"](batch_join/fifo_queue, batch_join/n)]]
Traceback (most recent call last):
File "facenet_train_classifier.py", line 347, in <module>
main(parse_arguments(sys.argv[1:]))
File "facenet_train_classifier.py", line 175, in main
update_centers)
File "facenet_train_classifier.py", line 203, in train
err, _, _, step, reg_loss = sess.run([loss, train_op, update_centers, global_step, regularization_losses], feed_dict=feed_dict)
File "/home/tangwt/venvs/py27/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 717, in run
run_metadata_ptr)
File "/home/tangwt/venvs/py27/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 915, in _run
feed_dict_string, options, run_metadata)
File "/home/tangwt/venvs/py27/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 965, in _do_run
target_list, options, run_metadata)
File "/home/tangwt/venvs/py27/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 985, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors.OutOfRangeError: FIFOQueue '_1_batch_join/fifo_queue' is closed and has insufficient elements (requested 64, current size 0)
[[Node: batch_join = QueueDequeueUpTo[_class=["loc:@batch_join/fifo_queue"], component_types=[DT_FLOAT, DT_INT32], timeout_ms=-1, _device="/job:localhost/replica:0/task:0/cpu:0"](batch_join/fifo_queue, batch_join/n)]]

Caused by op u'batch_join', defined at:
File "facenet_train_classifier.py", line 347, in <module>
main(parse_arguments(sys.argv[1:]))
File "facenet_train_classifier.py", line 87, in main
args.batch_size, args.max_nrof_epochs, args.random_crop, args.random_flip, args.nrof_preprocess_threads)
File "/home/tangwt/facenet/src/facenet.py", line 138, in read_and_augument_data
allow_smaller_final_batch=True)
File "/home/tangwt/venvs/py27/lib/python2.7/site-packages/tensorflow/python/training/input.py", line 708, in batch_join
dequeued = queue.dequeue_up_to(batch_size, name=name)
File "/home/tangwt/venvs/py27/lib/python2.7/site-packages/tensorflow/python/ops/data_flow_ops.py", line 499, in dequeue_up_to
self._queue_ref, n=n, component_types=self._dtypes, name=name)
File "/home/tangwt/venvs/py27/lib/python2.7/site-packages/tensorflow/python/ops/gen_data_flow_ops.py", line 951, in _queue_dequeue_up_to
timeout_ms=timeout_ms, name=name)
File "/home/tangwt/venvs/py27/lib/python2.7/site-packages/tensorflow/python/framework/op_def_library.py", line 749, in apply_op
op_def=op_def)
File "/home/tangwt/venvs/py27/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 2380, in create_op
original_op=self._default_original_op, op_def=op_def)
File "/home/tangwt/venvs/py27/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 1298, in __init__
self._traceback = _extract_stack()

OutOfRangeError (see above for traceback): FIFOQueue '_1_batch_join/fifo_queue' is closed and has insufficient elements (requested 64, current size 0)
[[Node: batch_join = QueueDequeueUpTo[_class=["loc:@batch_join/fifo_queue"], component_types=[DT_FLOAT, DT_INT32], timeout_ms=-1, _device="/job:localhost/replica:0/task:0/cpu:0"](batch_join/fifo_queue, batch_join/n)]]

wttangc

All 11 comments

Maybe your input data is insufficient.

stringmoon on 6 Jan 2017

I was trying to train using the LFW dataset without validation. There are 5749 classes and 13233 images. The above error occurred during the 5th epoch.

wttangc on 6 Jan 2017

Could you post your executed command so that we can more easily guess what happened?
Did you set --max_nrof_epochs to a low value?

ugtony on 6 Jan 2017

The executed command is:

python facenet_train_classifier.py --logs_base_dir ~/facenet_logs/ --models_base_dir ~/facenet_models/ --data_dir ~/datasets/lfw_mtcnnpy_182 --image_size 160 --model_def models.inception_resnet_v1 --weight_decay 2e-4 --optimizer RMSPROP --learning_rate -1 --max_nrof_epochs 20 --keep_probability 0.8 --random_crop --random_flip --learning_rate_schedule_file ../data/learning_rate_schedule_classifier_long.txt --center_loss_factor 2e-4 --batch_size 64

wttangc on 6 Jan 2017

Try setting epoch_size to a small value (such that batch_size * epoch_size < 13233). I think the problem is caused by the slice_input_producer stopping too early.

davidsandberg on 6 Jan 2017
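
A rough way to pick such a value, taking the inequality above at face value. `largest_safe_epoch_size` is a hypothetical helper written for this illustration and is not part of facenet:

```python
def largest_safe_epoch_size(nrof_images, batch_size):
    """Largest epoch_size with batch_size * epoch_size < nrof_images."""
    if nrof_images <= batch_size:
        raise ValueError("need more than one batch worth of images")
    # (nrof_images - 1) // batch_size is the largest integer e
    # that still satisfies batch_size * e < nrof_images.
    return (nrof_images - 1) // batch_size


# LFW as described in this thread: 13233 images, batch size 64.
print(largest_safe_epoch_size(13233, 64))
```

Any epoch_size at or below the returned value keeps one training epoch within a single pass over the images.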

There it is!
The problem is caused by

 input_queue = tf.train.slice_input_producer([images, labels],
        num_epochs=max_nrof_epochs, shuffle=shuffle)

The queue reaches its epoch limit and closes, so the next attempt to dequeue examples raises a tf.errors.OutOfRangeError.
In your example epoch_size is 1000, and 64 * 1000 * 4 < 20 * 13233 < 64 * 1000 * 5, so the error occurs in the 5th epoch.

ugtony on 6 Jan 2017
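
The arithmetic in the comment above can be checked with a small sketch. It is a simplified model that assumes the producer emits exactly nrof_images * max_nrof_epochs examples before the queue closes and that each training epoch dequeues batch_size * epoch_size of them. `first_failing_epoch` is a hypothetical name used only for this illustration:

```python
def first_failing_epoch(nrof_images, max_nrof_epochs, batch_size, epoch_size):
    """Return the 1-based training epoch in which the queue runs dry, or None."""
    # Examples the producer emits before it closes the queue.
    total_examples = nrof_images * max_nrof_epochs
    # Examples one training epoch tries to dequeue.
    per_training_epoch = batch_size * epoch_size
    if per_training_epoch * max_nrof_epochs <= total_examples:
        return None  # enough examples for the whole run
    # Epochs that can be served in full, plus the one that fails.
    return total_examples // per_training_epoch + 1


# Values from this thread: 13233 images, 20 epochs, batch size 64, epoch_size 1000.
print(first_failing_epoch(13233, 20, 64, 1000))
# With a smaller epoch_size the producer outlasts the training loop.
print(first_failing_epoch(13233, 20, 64, 200))
```

Under these assumptions the first call reports epoch 5, matching the inequality, and the second reports no failure.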

Thanks @davidsandberg and @ugtony, problem is solved by reducing epoch_size!

wttangc on 8 Jan 2017

Great! Will remove the epochs limit in slice_input_producer some time. Closing this for now...

davidsandberg on 8 Jan 2017

Can someone please explain the meaning of epoch_size, batch_size and max_nrof_epochs in the context of this project?

I understand that epoch_size is the number of batches that make up one epoch. So if we set epoch_size to 10, then 10 mini-batches are one epoch. Is that right?

I understand that batch_size is the number of images in a mini-batch passed through the GPU. So if I set batch_size to 120, does it mean that the triplets are sampled from this batch, or that the number of triplets passed through the GPU is always 40 (because 120 images make 40 triplets)?

I understand max_nrof_epochs is the maximum number of epochs to be trained for.

Can you please correct my understanding?

abhisheksgumadi on 16 Apr 2018
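
The reading described in the question above can be written out as arithmetic. This is only a sketch of that reading, which the thread does not confirm: it assumes each triplet occupies three images of a batch and that an epoch is simply epoch_size batches. `training_budget` and the example value of 5 epochs are hypothetical:

```python
def training_budget(batch_size, epoch_size, max_nrof_epochs):
    """Quantities implied by the three parameters under the assumed reading."""
    if batch_size % 3 != 0:
        raise ValueError("batch_size must be a multiple of 3 to hold whole triplets")
    return {
        # Three images (anchor, positive, negative) per triplet.
        "triplets_per_batch": batch_size // 3,
        # Images pushed through the network in one epoch.
        "images_per_epoch": batch_size * epoch_size,
        # Upper bound on the number of training steps.
        "total_batches": epoch_size * max_nrof_epochs,
    }


print(training_budget(batch_size=120, epoch_size=10, max_nrof_epochs=5))
```

With batch_size 120 and epoch_size 10, this gives 40 triplets per batch and 1200 images per epoch.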

I have 2 images in each folder (each folder represents one person) and I have 10000 folders, so that is 20000 images in total.

max_nrof_epochs is 500
batch_size is 120
people_per_batch is 120
images_per_person is 2
epoch_size is 60

Why does my training crash as follows at the 25th minibatch?

`2018-04-16 08:27:14.280098: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1356] Found device 0 with properties:
name: Tesla V100-SXM2-16GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0000:00:1e.0
totalMemory: 15.77GiB freeMemory: 15.36GiB
2018-04-16 08:27:14.280122: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1435] Adding visible gpu devices: 0
2018-04-16 08:27:14.601180: I tensorflow/core/common_runtime/gpu/gpu_device.cc:923] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-04-16 08:27:14.601235: I tensorflow/core/common_runtime/gpu/gpu_device.cc:929] 0
2018-04-16 08:27:14.601255: I tensorflow/core/common_runtime/gpu/gpu_device.cc:942] 0: N
2018-04-16 08:27:14.601626: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1053] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 12921 MB memory) -> physical GPU (device: 0, name: Tesla V100-SXM2-16GB, pci bus id: 0000:00:1e.0, compute capability: 7.0)
Running forward pass on sampled images: 7.075
Selecting suitable triplets for training
src/train_tripletloss.py:297: RuntimeWarning: invalid value encountered in less
all_neg = np.where(neg_dists_sqr-pos_dist_sqr
(nrof_random_negs, nrof_triplets) = (120, 120): time=7.090 seconds
Epoch: [0][1/60] Time 14.344 Loss 9.802
Epoch: [0][2/60] Time 1.198 Loss 9.642
Epoch: [0][3/60] Time 1.634 Loss 10.009
Running forward pass on sampled images: 2.805
Selecting suitable triplets for training
(nrof_random_negs, nrof_triplets) = (120, 120): time=2.819 seconds
Epoch: [0][4/60] Time 1.415 Loss 9.801
Epoch: [0][5/60] Time 1.332 Loss 9.986
Epoch: [0][6/60] Time 1.569 Loss 9.615
Running forward pass on sampled images: 3.394
Selecting suitable triplets for training
(nrof_random_negs, nrof_triplets) = (120, 120): time=3.408 seconds
Epoch: [0][7/60] Time 1.726 Loss 9.718
Epoch: [0][8/60] Time 1.450 Loss 9.737
Epoch: [0][9/60] Time 1.488 Loss 9.761
Running forward pass on sampled images: 3.120
Selecting suitable triplets for training
(nrof_random_negs, nrof_triplets) = (120, 120): time=3.134 seconds
Epoch: [0][10/60] Time 1.566 Loss 9.788
Epoch: [0][11/60] Time 1.477 Loss 9.589
Epoch: [0][12/60] Time 1.614 Loss 9.735
Running forward pass on sampled images: 3.190
Selecting suitable triplets for training
(nrof_random_negs, nrof_triplets) = (120, 120): time=3.204 seconds
Epoch: [0][13/60] Time 1.683 Loss 9.721
Epoch: [0][14/60] Time 1.444 Loss 9.739
Epoch: [0][15/60] Time 1.462 Loss 9.743
Running forward pass on sampled images: 2.961
Selecting suitable triplets for training
(nrof_random_negs, nrof_triplets) = (120, 120): time=2.975 seconds
Epoch: [0][16/60] Time 1.536 Loss 9.385
Epoch: [0][17/60] Time 1.374 Loss 9.386
Epoch: [0][18/60] Time 1.458 Loss 9.532
Running forward pass on sampled images: 3.354
Selecting suitable triplets for training
(nrof_random_negs, nrof_triplets) = (120, 120): time=3.368 seconds
Epoch: [0][19/60] Time 1.578 Loss 9.299
Epoch: [0][20/60] Time 1.528 Loss 9.363
Epoch: [0][21/60] Time 1.609 Loss 9.402
Running forward pass on sampled images: 3.286
Selecting suitable triplets for training
(nrof_random_negs, nrof_triplets) = (120, 120): time=3.300 seconds
Epoch: [0][22/60] Time 1.651 Loss 9.648
Epoch: [0][23/60] Time 1.486 Loss 9.345
Epoch: [0][24/60] Time 1.494 Loss 9.301
Running forward pass on sampled images: Traceback (most recent call last):
File "src/train_tripletloss.py", line 489, in <module>
main(parse_arguments(sys.argv[1:]))
File "src/train_tripletloss.py", line 188, in main
args.embedding_size, anchor, positive, negative, triplet_loss)
File "src/train_tripletloss.py", line 227, in train
learning_rate_placeholder: lr, phase_train_placeholder: True})
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 900, in run
run_metadata_ptr)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1135, in _run
feed_dict_tensor, options, run_metadata)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1316, in _do_run
run_metadata)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1335, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.OutOfRangeError: FIFOQueue '_1_batch_join/fifo_queue' is closed and has insufficient elements (requested 120, current size 0)
[[Node: batch_join = QueueDequeueUpToV2[component_types=[DT_FLOAT, DT_INT64], timeout_ms=-1, _device="/job:localhost/replica:0/task:0/device:CPU:0"](batch_join/fifo_queue, _arg_batch_size_0_0)]]