This works fine, but TensorFlow uses all of my GPU's memory, so I can't run evaluation at the same time. In trainer.py I added this line:
session_config.gpu_options.per_process_gpu_memory_fraction = 0.2
just after:
session_config = tf.ConfigProto(allow_soft_placement=True,
                                log_device_placement=False)
But this doesn't work. I don't know if there are other options to limit GPU memory usage...
Hi @madekwe, the way you should specify per_process_gpu_memory_fraction is like this:
session_config = tf.ConfigProto(
    gpu_options=tf.GPUOptions(per_process_gpu_memory_fraction=0.2))
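For what it's worth, a minimal self-contained sketch of that approach (the 0.2 fraction and the plain tf.Session are illustrative only; if I remember correctly, trainer.py hands the config to slim.learning.train through its session_config argument):

import tensorflow as tf

# Cap this process at roughly 20% of the GPU's memory (illustrative value).
gpu_options = tf.GPUOptions(per_process_gpu_memory_fraction=0.2)
session_config = tf.ConfigProto(
    allow_soft_placement=True,
    log_device_placement=False,
    gpu_options=gpu_options)

# The options only take effect if this config is actually passed to the
# session that ends up running the graph.
sess = tf.Session(config=session_config)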
I finally found a solution: I set batch_queue_capacity to 50 in train_config and also resized the images to 500*500, and it works properly.
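For reference, those settings live in the pipeline .config file; roughly like this (a sketch based on the object_detection proto field names, adjust to your own config and model):

# In train_config (smaller prefetch queue means fewer decoded images held in memory):
train_config {
  batch_queue_capacity: 50
}

# In the model's image_resizer (a fixed 500x500 resize; many Faster R-CNN
# configs use keep_aspect_ratio_resizer here instead):
image_resizer {
  fixed_shape_resizer {
    height: 500
    width: 500
  }
}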
System information
What is the top-level directory of the model you are using: object_detection
Have I written custom code (as opposed to using a stock example script provided in TensorFlow): paths changed
OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Linux Ubuntu 16.04
TensorFlow installed from (source or binary): source
TensorFlow version (use command below): 1.3.0
CUDA/cuDNN version: 8.0/ 6.0
GPU model and memory: Nvidia Tesla K80 12GB
Exact command to reproduce:
python object_detection/train.py --logtostderr \
--pipeline_config_path=/datadrive/try_faster_rcnn_resnet101_cokefull/models/model/faster_rcnn_resnet101_coke.config \
--train_dir=/datadrive/try_faster_rcnn_resnet101_cokefull/models/model/train
Describe the problem
The training process for faster-rcnn resnet101 always consumes all of the GPU memory, so the evaluation process can't run. When the train.py process starts, no matter how many GPUs the VM has, the Python process occupies all of them. The memory of these GPUs is fully used, but 'Volatile GPU-Util' is 100% for the first GPU and 0% for the others.
Besides, whichever model I select from the object_detection repo, the training process always consumes roughly the same amount of GPU memory, which is almost the whole GPU memory. I tried to modify several parameters, but it doesn't help.
With a K80 GPU the process consumes about 11 of 12 GB of memory; with an M60 it consumes about 7 of 8 GB. Whatever GPU and model I select, the training process always occupies almost all of the memory.
I tried two ways. Neither of them works.
1. I added these lines in trainer.py:
session_config = tf.ConfigProto(allow_soft_placement=True, log_device_placement=False)
session_config.gpu_options.allow_growth=True
session_config.gpu_options.per_process_gpu_memory_fraction = 0.8
2. I also tried:
session_config = tf.ConfigProto(
    gpu_options=tf.GPUOptions(per_process_gpu_memory_fraction=0.2))
It still doesn't work
Can someone help me? Many thanks!
I'm seeing the same problem. I tried both of the ways suggested in this comment, and the process still consumes all of the free memory on the GPU.
In fact, other users on the same machine use less memory, but they can't explain why the same command works for them and not for me.
I am also having this problem.
I realized that my code had a call to an undocumented method, [device_lib.list_local_devices](https://github.com/tensorflow/tensorflow/blob/d42facc3cc9611f0c9722c81551a7404a0bd3f6b/tensorflow/python/client/device_lib.py#L27), which was creating a default session. Since this call came before the call that creates the session with my options, all of the GPU memory was being allocated. The stackoverflow discussion here talks about this issue.
I removed the call to this function and it started working as expected.
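To make the ordering pitfall concrete, a rough sketch (not the original code, and the 0.2 fraction is just an example):

import tensorflow as tf
from tensorflow.python.client import device_lib

# BAD ordering: this undocumented helper spins up a session with default
# GPUOptions, so the whole GPU is claimed before our own config is applied.
#   device_lib.list_local_devices()

# GOOD ordering: create the session with restricted GPU options first, and
# drop (or defer) any device-listing calls that would run before it.
config = tf.ConfigProto(
    gpu_options=tf.GPUOptions(per_process_gpu_memory_fraction=0.2))
sess = tf.Session(config=config)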
On Mon, Jan 1, 2018 at 2:59 PM, Mark Sonnenschein wrote:
> Same here @mharshe :/ Did you find a solution?
@mharshe How did you remove that call? Have you tried this with the object detection API?
@tuobay Did you find any solutions?
@madekwe Did you solve the problem?
I am facing the same issue.
@joydeepmedhi I wasn't using the object detection API. I had my own (completely different) model that included a call to device_lib.list_local_devices(). I just removed that line. The per_process_gpu_memory_fraction option works perfectly well in all my code and all the models I have worked on.
I had a similar issue where TensorFlow was taking up the whole GPU despite including
_gpu_options = tf.GPUOptions(per_process_gpu_memory_fraction=0.3, allow_growth=True)_
After reading @mharshe's comment, I saw that my code was calling
_tf.test.gpu_device_name()_
Removing it resolved the issue.
Thanks for sharing, saved my day!
Does not work in v2 using v1 compat.
tf_config = tf.compat.v1.ConfigProto(
    log_device_placement=True,
    gpu_options=tf.compat.v1.GPUOptions(per_process_gpu_memory_fraction=0.5))
TestSession = tf.compat.v1.Session(config=tf_config)
The GPU was still about 90 percent allocated.
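In case it helps, in TF 2.x the native way to limit GPU memory (instead of the v1 compat ConfigProto) is the tf.config API. A minimal sketch, assuming a single visible GPU and an illustrative 4096 MB cap; note these calls must run before the GPU is first initialized:

import tensorflow as tf

gpus = tf.config.list_physical_devices('GPU')
if gpus:
    # Option 1: allocate memory on demand instead of grabbing it all up front.
    tf.config.experimental.set_memory_growth(gpus[0], True)
    # Option 2 (alternative; cannot be combined with Option 1): hard per-process cap.
    # tf.config.set_logical_device_configuration(
    #     gpus[0],
    #     [tf.config.LogicalDeviceConfiguration(memory_limit=4096)])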