Facenet: large dataset OOM

Created on 18 Jan 2017 · 2 Comments · Source: davidsandberg/facenet

I have a dataset of 10M pictures.
When I run a small subset (1M) as a test, it works fine.
But when I use the whole dataset (10M), it fails.
My system is CentOS 7 with 256 GB of memory; the GPU is a Titan X.
Below is the error output:

I tensorflow/core/common_runtime/bfc_allocator.cc:696] 4 Chunks of size 1728090112 totalling 6.44GiB
I tensorflow/core/common_runtime/bfc_allocator.cc:696] 1 Chunks of size 3325791744 totalling 3.10GiB
I tensorflow/core/common_runtime/bfc_allocator.cc:700] Sum Total of in-use chunks: 9.83GiB
I tensorflow/core/common_runtime/bfc_allocator.cc:702] Stats:
Limit: 12287591055
InUse: 10557622784
MaxInUse: 12280709632
NumAllocs: 2135
MaxAllocSize: 3325791744

W tensorflow/core/common_runtime/bfc_allocator.cc:274] *********_____________***************xxxxxxxxxxxxx
W tensorflow/core/common_runtime/bfc_allocator.cc:275] Ran out of memory trying to allocate 1.61GiB. See logs for memory state.
W tensorflow/core/framework/op_kernel.cc:975] Resource exhausted: OOM when allocating tensor with shape[1792,241084]
Traceback (most recent call last):
File "facenet_train_classifier.py", line 346, in
main(parse_arguments(sys.argv[1:]))
File "facenet_train_classifier.py", line 153, in main
sess.run(tf.global_variables_initializer())
File "/usr/local/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 766, in run
run_metadata_ptr)
File "/usr/local/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 964, in _run
feed_dict_string, options, run_metadata)
File "/usr/local/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1014, in _do_run
target_list, options, run_metadata)
File "/usr/local/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1034, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[1792,241084]
[[Node: Logits/weights/Initializer/truncated_normal = Add[T=DT_FLOAT, _class=["loc:@Logits/weights"], _device="/job: localhost/replica:0/task:0/gpu:0"](Logits/weights/Initializer/truncated_normal/mul, Logits/weights/Initializer/truncated_normal/ mean)]]

Caused by op u'Logits/weights/Initializer/truncated_normal', defined at:
File "facenet_train_classifier.py", line 346, in
main(parse_arguments(sys.argv[1:]))
File "facenet_train_classifier.py", line 101, in main
scope='Logits', reuse=False)
File "/usr/local/lib/python2.7/site-packages/tensorflow/contrib/framework/python/ops/arg_scope.py", line 177, in func_with_args
return func(args, *current_args)
File "/usr/local/lib/python2.7/site-packages/tensorflow/contrib/layers/python/layers/layers.py", line 1346, in fully_connected
trainable=trainable)
File "/usr/local/lib/python2.7/site-packages/tensorflow/contrib/framework/python/ops/arg_scope.py", line 177, in func_with_args
return func(args, *current_args)
File "/usr/local/lib/python2.7/site-packages/tensorflow/contrib/framework/python/ops/variables.py", line 244, in model_variable
caching_device=caching_device, device=device)
File "/usr/local/lib/python2.7/site-packages/tensorflow/contrib/framework/python/ops/arg_scope.py", line 177, in func_with_args
return func(args, *current_args)
File "/usr/local/lib/python2.7/site-packages/tensorflow/contrib/framework/python/ops/variables.py", line 208, in variable
caching_device=caching_device)
File "/usr/local/lib/python2.7/site-packages/tensorflow/python/ops/variable_scope.py", line 1024, in get_variable
custom_getter=custom_getter)
File "/usr/local/lib/python2.7/site-packages/tensorflow/python/ops/variable_scope.py", line 850, in get_variable
custom_getter=custom_getter)
File "/usr/local/lib/python2.7/site-packages/tensorflow/python/ops/variable_scope.py", line 346, in get_variable
validate_shape=validate_shape)
File "/usr/local/lib/python2.7/site-packages/tensorflow/python/ops/variable_scope.py", line 331, in _true_getter
caching_device=caching_device, validate_shape=validate_shape)
File "/usr/local/lib/python2.7/site-packages/tensorflow/python/ops/variable_scope.py", line 677, in _get_single_variable
expected_shape=shape)
File "/usr/local/lib/python2.7/site-packages/tensorflow/python/ops/variables.py", line 224, in __init__
expected_shape=expected_shape)
File "/usr/local/lib/python2.7/site-packages/tensorflow/python/ops/variables.py", line 327, in _init_from_args
initial_value(), name="initial_value", dtype=dtype)
File "/usr/local/lib/python2.7/site-packages/tensorflow/python/ops/variable_scope.py", line 665, in
shape.as_list(), dtype=dtype, partition_info=partition_info)
File "/usr/local/lib/python2.7/site-packages/tensorflow/python/ops/init_ops.py", line 229, in _initializer
return random_ops.truncated_normal(shape, mean, stddev, dtype, seed=seed)
File "/usr/local/lib/python2.7/site-packages/tensorflow/python/ops/random_ops.py", line 176, in truncated_normal
value = math_ops.add(mul, mean_tensor, name=name)
File "/usr/local/lib/python2.7/site-packages/tensorflow/python/ops/gen_math_ops.py", line 73, in add
result = _op_def_lib.apply_op("Add", x=x, y=y, name=name)
File "/usr/local/lib/python2.7/site-packages/tensorflow/python/framework/op_def_library.py", line 759, in apply_op
op_def=op_def)
File "/usr/local/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 2240, in create_op
original_op=self._default_original_op, op_def=op_def)
File "/usr/local/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 1128, in __init__
self._traceback = _extract_stack()

ResourceExhaustedError (see above for traceback): OOM when allocating tensor with shape[1792,241084]
[[Node: Logits/weights/Initializer/truncated_normal = Add[T=DT_FLOAT, _class=["loc:@Logits/weights"], _device="/job: localhost/replica:0/task:0/gpu:0"](Logits/weights/Initializer/truncated_normal/mul, Logits/weights/Initializer/truncated_normal/ mean)]]

mgy89

Most helpful comment

Hi @mgy89!
You have too many classes, so the final fully connected layer becomes too large. This is a known problem when training as a classifier, and it is the main reason people use triplet loss and similar approaches. It's not an easy problem to solve, though. The options I see are:

- Select a subset of the classes in your dataset to use for training.
- Use a smaller model such that the final fully connected layer fits on your GPU.
  - For example, if the size of the prelogits tensor can be reduced, the number of weights in the final FC layer will also be reduced.
- Train using triplet loss.
- Use float16 instead of float32, which reduces the memory footprint significantly (see the TensorFlow CIFAR example), though I haven't gotten it to work.
Good luck!

davidsandberg on 18 Jan 2017
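
To see why the class count matters, the arithmetic behind the failed allocation can be checked with a short sketch. It assumes float32 weights (4 bytes each, matching T=DT_FLOAT in the log) and takes the shape [1792, 241084] from the error message; the helper name fc_weight_bytes is made up for illustration and is not part of facenet or TensorFlow.

```python
GIB = 1024 ** 3  # bytes per GiB


def fc_weight_bytes(in_dim, n_classes, bytes_per_elem=4):
    """Bytes needed for one dense weight matrix of shape [in_dim, n_classes].

    Hypothetical helper for illustration; bytes_per_elem=4 assumes float32.
    """
    return in_dim * n_classes * bytes_per_elem


# Shape reported in the OOM message: [1792, 241084].
weights = fc_weight_bytes(1792, 241084)
print(weights)                  # 1728090112, the chunk size in the allocator log
print(round(weights / GIB, 2))  # 1.61

# The same matrix stored as float16 (2 bytes per element) is half the size.
half = fc_weight_bytes(1792, 241084, bytes_per_elem=2)
print(half)                     # 864045056
```

One copy of the logits weights is about 1.61 GiB, which is exactly the chunk size the allocator reports. The log lists four chunks of that size already in use when the next 1.61 GiB request fails, so the cost of this layer is several times the size of the weight matrix itself. Halving the element size or the number of classes halves each of those buffers.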


All 2 comments

Hi @davidsandberg,
Thank you very much.
Earlier I simply followed your tutorial.
I have now read your code facenet_train_classifier.py carefully and found that it is not the triplet loss version.

So I did the following experiment:

- facenet_train_classifier.py supports around 100,000 classes (person identities) on my machine, because of the final fully connected layer: logits = slim.fully_connected(prelogits, len(train_set) ...)
- With the triplet loss version, facenet_train.py, training works fine on the large dataset.

I will compare the results of the two versions in the next few days.

mgy89 on 18 Jan 2017
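
For readers wondering why the triplet loss version avoids the problem, here is a minimal pure-Python sketch of the general idea. It is not the code in facenet_train.py: the squared Euclidean distance, the margin alpha=0.2, and the function names are assumptions chosen for illustration. The point is that the loss only compares embeddings with each other, so nothing in it has a dimension equal to the number of identities.

```python
def squared_distance(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def triplet_loss(anchor, positive, negative, alpha=0.2):
    """Hinge-style triplet loss for a single triplet of embeddings.

    alpha is the margin; 0.2 is an illustrative default, not a value
    taken from the thread.
    """
    pos_dist = squared_distance(anchor, positive)  # same identity
    neg_dist = squared_distance(anchor, negative)  # different identity
    return max(pos_dist - neg_dist + alpha, 0.0)


# Toy 2-D embeddings (made-up values).
anchor = [0.0, 1.0]
same_person = [0.0, 1.0]
other_person = [1.0, 0.0]

easy = triplet_loss(anchor, same_person, other_person)
hard = triplet_loss(anchor, other_person, same_person)
print(easy)  # 0.0, the negative is already farther away than the margin
print(hard)  # 2.2, the margin is violated
```

Memory use here depends on the embedding size and the batch size, not on how many classes the dataset contains, which is consistent with facenet_train.py working on the full dataset.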
