Facenet: large dataset OOM

Created on 18 Jan 2017 · 2 Comments · Source: davidsandberg/facenet

I have a dataset of 10M pictures.
When I run a small subset (1M) as a test, it works fine.
But when I use the whole dataset (10M), it fails.
My system is CentOS 7 with 256 GB of memory, and the GPU is a Titan X.
Below is the error info:

I tensorflow/core/common_runtime/bfc_allocator.cc:696] 4 Chunks of size 1728090112 totalling 6.44GiB
I tensorflow/core/common_runtime/bfc_allocator.cc:696] 1 Chunks of size 3325791744 totalling 3.10GiB
I tensorflow/core/common_runtime/bfc_allocator.cc:700] Sum Total of in-use chunks: 9.83GiB
I tensorflow/core/common_runtime/bfc_allocator.cc:702] Stats:
Limit: 12287591055
InUse: 10557622784
MaxInUse: 12280709632
NumAllocs: 2135
MaxAllocSize: 3325791744

W tensorflow/core/common_runtime/bfc_allocator.cc:274] *********_____________***************xxxxxxxxxxxxx
W tensorflow/core/common_runtime/bfc_allocator.cc:275] Ran out of memory trying to allocate 1.61GiB. See logs for memory state.
W tensorflow/core/framework/op_kernel.cc:975] Resource exhausted: OOM when allocating tensor with shape[1792,241084]
Traceback (most recent call last):
File "facenet_train_classifier.py", line 346, in <module>
main(parse_arguments(sys.argv[1:]))
File "facenet_train_classifier.py", line 153, in main
sess.run(tf.global_variables_initializer())
File "/usr/local/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 766, in run
run_metadata_ptr)
File "/usr/local/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 964, in _run
feed_dict_string, options, run_metadata)
File "/usr/local/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1014, in _do_run
target_list, options, run_metadata)
File "/usr/local/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1034, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[1792,241084]
[[Node: Logits/weights/Initializer/truncated_normal = Add[T=DT_FLOAT, _class=["loc:@Logits/weights"], _device="/job:localhost/replica:0/task:0/gpu:0"](Logits/weights/Initializer/truncated_normal/mul, Logits/weights/Initializer/truncated_normal/mean)]]

Caused by op u'Logits/weights/Initializer/truncated_normal', defined at:
File "facenet_train_classifier.py", line 346, in <module>
main(parse_arguments(sys.argv[1:]))
File "facenet_train_classifier.py", line 101, in main
scope='Logits', reuse=False)
File "/usr/local/lib/python2.7/site-packages/tensorflow/contrib/framework/python/ops/arg_scope.py", line 177, in func_with_args
return func(*args, **current_args)
File "/usr/local/lib/python2.7/site-packages/tensorflow/contrib/layers/python/layers/layers.py", line 1346, in fully_connected
trainable=trainable)
File "/usr/local/lib/python2.7/site-packages/tensorflow/contrib/framework/python/ops/arg_scope.py", line 177, in func_with_args
return func(*args, **current_args)
File "/usr/local/lib/python2.7/site-packages/tensorflow/contrib/framework/python/ops/variables.py", line 244, in model_variable
caching_device=caching_device, device=device)
File "/usr/local/lib/python2.7/site-packages/tensorflow/contrib/framework/python/ops/arg_scope.py", line 177, in func_with_args
return func(*args, **current_args)
File "/usr/local/lib/python2.7/site-packages/tensorflow/contrib/framework/python/ops/variables.py", line 208, in variable
caching_device=caching_device)
File "/usr/local/lib/python2.7/site-packages/tensorflow/python/ops/variable_scope.py", line 1024, in get_variable
custom_getter=custom_getter)
File "/usr/local/lib/python2.7/site-packages/tensorflow/python/ops/variable_scope.py", line 850, in get_variable
custom_getter=custom_getter)
File "/usr/local/lib/python2.7/site-packages/tensorflow/python/ops/variable_scope.py", line 346, in get_variable
validate_shape=validate_shape)
File "/usr/local/lib/python2.7/site-packages/tensorflow/python/ops/variable_scope.py", line 331, in _true_getter
caching_device=caching_device, validate_shape=validate_shape)
File "/usr/local/lib/python2.7/site-packages/tensorflow/python/ops/variable_scope.py", line 677, in _get_single_variable
expected_shape=shape)
File "/usr/local/lib/python2.7/site-packages/tensorflow/python/ops/variables.py", line 224, in __init__
expected_shape=expected_shape)
File "/usr/local/lib/python2.7/site-packages/tensorflow/python/ops/variables.py", line 327, in _init_from_args
initial_value(), name="initial_value", dtype=dtype)
File "/usr/local/lib/python2.7/site-packages/tensorflow/python/ops/variable_scope.py", line 665, in <lambda>
shape.as_list(), dtype=dtype, partition_info=partition_info)
File "/usr/local/lib/python2.7/site-packages/tensorflow/python/ops/init_ops.py", line 229, in _initializer
return random_ops.truncated_normal(shape, mean, stddev, dtype, seed=seed)
File "/usr/local/lib/python2.7/site-packages/tensorflow/python/ops/random_ops.py", line 176, in truncated_normal
value = math_ops.add(mul, mean_tensor, name=name)
File "/usr/local/lib/python2.7/site-packages/tensorflow/python/ops/gen_math_ops.py", line 73, in add
result = _op_def_lib.apply_op("Add", x=x, y=y, name=name)
File "/usr/local/lib/python2.7/site-packages/tensorflow/python/framework/op_def_library.py", line 759, in apply_op
op_def=op_def)
File "/usr/local/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 2240, in create_op
original_op=self._default_original_op, op_def=op_def)
File "/usr/local/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 1128, in __init__
self._traceback = _extract_stack()

ResourceExhaustedError (see above for traceback): OOM when allocating tensor with shape[1792,241084]
[[Node: Logits/weights/Initializer/truncated_normal = Add[T=DT_FLOAT, _class=["loc:@Logits/weights"], _device="/job:localhost/replica:0/task:0/gpu:0"](Logits/weights/Initializer/truncated_normal/mul, Logits/weights/Initializer/truncated_normal/mean)]]

Most helpful comment

Hi @mgy89!
You have too many classes, so the final fully connected layer becomes too large to fit in GPU memory. This is a known problem when training as a classifier, and it is the main reason people use triplet loss and similar approaches. It's not an easy problem to solve, though. The options I see are:

  • Select a subset of the classes in your dataset to use for training.
  • Use a smaller model so that the final fully connected layer fits on your GPU.

    • For example, if the size of the prelogits tensor can be reduced, the number of weights in the final FC layer is reduced as well.

  • Train using triplet loss.
  • Use float16 instead of float32. This will reduce the memory footprint significantly (see the TensorFlow CIFAR example), but I haven't gotten it to work.

Good luck!
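
To make the size of that layer concrete, here is a rough back-of-the-envelope sketch in Python. It assumes the shape [1792, 241084] from the error message is the Logits/weights matrix (prelogits size by number of classes) and that it is stored as float32. The helper name fc_weight_bytes is made up for illustration and is not part of facenet.

```python
def fc_weight_bytes(prelogits_size, num_classes, bytes_per_value=4):
    """Bytes needed for one dense weight matrix of shape [prelogits_size, num_classes]."""
    return prelogits_size * num_classes * bytes_per_value


GIB = 2.0 ** 30  # float so the divisions below also work on Python 2

# Shape taken from the OOM message: [1792, 241084]
float32_bytes = fc_weight_bytes(1792, 241084)                     # 4 bytes per value
float16_bytes = fc_weight_bytes(1792, 241084, bytes_per_value=2)  # 2 bytes per value

print(float32_bytes)  # 1728090112, the chunk size reported by bfc_allocator
print("float32: %.2f GiB" % (float32_bytes / GIB))
print("float16: %.2f GiB" % (float16_bytes / GIB))
```

The allocator log lists four in-use chunks of exactly this size, so a single copy of the matrix is only part of what has to fit under the reported limit of 12287591055 bytes.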

Hi, @davidsandberg
Thank you very much.
Earlier I simply followed your tutorial.
I have now read your code in facenet_train_classifier.py carefully and found that it is not the triplet loss version.

So I did the following experiment:

  1. facenet_train_classifier.py supports around 100000 classes (person identities) on my machine, because of the final fully connected layer: logits = slim.fully_connected(prelogits, len(train_set) ...)
  2. With the triplet loss version, facenet_train.py, training works fine on the large dataset.

I will compare the results of the two versions in the next few days.
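
For readers wondering why the triplet loss version avoids this limit: as far as the discussion above suggests, it computes the loss directly on embeddings and so needs no output layer with one column per identity. Below is a minimal pure-Python sketch of the triplet loss as defined in the FaceNet paper. It is not the code from facenet_train.py; the function names and the margin alpha=0.2 are assumptions chosen for illustration.

```python
def squared_distance(u, v):
    """Squared Euclidean distance between two equal-length embedding vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def triplet_loss(anchor, positive, negative, alpha=0.2):
    """Hinge-style triplet loss for a single (anchor, positive, negative) triplet.

    The loss is zero once the negative is at least `alpha` farther from the
    anchor (in squared distance) than the positive is.
    """
    pos_dist = squared_distance(anchor, positive)
    neg_dist = squared_distance(anchor, negative)
    return max(pos_dist - neg_dist + alpha, 0.0)


# Toy 2-D embeddings (hypothetical values, not real face embeddings)
anchor = [0.0, 0.0]
same_person = [0.1, 0.0]
other_person = [1.0, 0.0]

print(triplet_loss(anchor, same_person, other_person))  # well separated
print(triplet_loss(anchor, other_person, same_person))  # roles swapped, so penalized
```

Nothing in this computation depends on the number of identities in the dataset, only on the embedding size and on how many triplets are in a batch.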
