This works fine, but TensorFlow uses all of my GPU's memory, so I can't run evaluation at the same time. In trainer.py I added this line:
session_config.gpu_options.per_process_gpu_memory_fraction = 0.2
just after:
session_config = tf.ConfigProto(allow_soft_placement=True,
                                log_device_placement=False)
But this doesn't work. I don't know if there are other options to limit GPU memory usage...
Hi @madekwe, the way you should specify per_process_gpu_memory_fraction is like this:
session_config = tf.ConfigProto(
    gpu_options=tf.GPUOptions(per_process_gpu_memory_fraction=0.2))
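For what it's worth, a minimal self-contained sketch of that approach (the 0.2 fraction and the plain tf.Session are illustrative only; if I remember correctly, trainer.py hands the config to slim.learning.train through its session_config argument):

import tensorflow as tf

# Cap this process at roughly 20% of the GPU's memory (illustrative value).
gpu_options = tf.GPUOptions(per_process_gpu_memory_fraction=0.2)
session_config = tf.ConfigProto(
    allow_soft_placement=True,
    log_device_placement=False,
    gpu_options=gpu_options)

# The options only take effect if this config is actually passed to the
# session that ends up running the graph.
sess = tf.Session(config=session_config)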
I finally found a solution: I set batch_queue_capacity to 50 in train_config and also resized the images to 500*500, and it works properly.
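For reference, those settings live in the pipeline .config file; roughly like this (a sketch based on the object_detection proto field names, adjust to your own config and model):

# In train_config (smaller prefetch queue means fewer decoded images held in memory):
train_config {
  batch_queue_capacity: 50
}

# In the model's image_resizer (a fixed 500x500 resize; many Faster R-CNN
# configs use keep_aspect_ratio_resizer here instead):
image_resizer {
  fixed_shape_resizer {
    height: 500
    width: 500
  }
}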
System information
What is the top-level directory of the model you are using: object_detection
Have I written custom code (as opposed to using a stock example script provided in TensorFlow): paths changed
OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Linux Ubuntu 16.04
TensorFlow installed from (source or binary): source
TensorFlow version (use command below): 1.3.0
CUDA/cuDNN version: 8.0/ 6.0
GPU model and memory: Nvidia Tesla K80 12GB
Exact command to reproduce:
python object_detection/train.py --logtostderr \
--pipeline_config_path=/datadrive/try_faster_rcnn_resnet101_cokefull/models/model/faster_rcnn_resnet101_coke.config \
--train_dir=/datadrive/try_faster_rcnn_resnet101_cokefull/models/model/train
Describe the problem
The training process for faster-rcnn resnet101 always consumes all of the GPU memory, so the evaluation process can't run. When the train.py process starts, no matter how many GPUs the VM has, the Python process occupies all of them. The memory of these GPUs is fully used, but 'Volatile GPU-Util' is 100% for the first GPU and 0% for the others.
Besides, whichever model I select from the object_detection repo, the training process always consumes roughly the same amount of GPU memory, which is almost the whole GPU memory. I tried to modify several parameters, but it doesn't help.
With a K80 GPU the process consumes about 11 of 12 GB of memory; with an M60 it consumes about 7 of 8 GB. Whatever GPU and model I select, the training process always occupies almost all of the memory.
I tried two ways. Neither of them works.
1. I added these lines in trainer.py:
session_config = tf.ConfigProto(allow_soft_placement=True, log_device_placement=False)
session_config.gpu_options.allow_growth=True
session_config.gpu_options.per_process_gpu_memory_fraction = 0.8
2. I also tried:
session_config = tf.ConfigProto(
    gpu_options=tf.GPUOptions(per_process_gpu_memory_fraction=0.2))
It still doesn't work
Can someone help me? Many thanks!
I'm seeing the same problem. I tried both of the ways suggested in this comment, and the process still consumes all of the free memory on the GPU.
In fact, other users on the same machine use less memory, but they can't explain why the same command works for them and not for me.
I am also having this problem.
I realized that my code had a call to an undocumented method, [device_lib.list_local_devices](https://github.com/tensorflow/tensorflow/blob/d42facc3cc9611f0c9722c81551a7404a0bd3f6b/tensorflow/python/client/device_lib.py#L27), which was creating a default session. Since this call came before the call that creates the session with my options, all of the GPU memory was being allocated. The stackoverflow discussion here talks about this issue.
I removed the call to this function and it started working as expected.
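To make the ordering pitfall concrete, a rough sketch (not the original code, and the 0.2 fraction is just an example):

import tensorflow as tf
from tensorflow.python.client import device_lib

# BAD ordering: this undocumented helper spins up a session with default
# GPUOptions, so the whole GPU is claimed before our own config is applied.
#   device_lib.list_local_devices()

# GOOD ordering: create the session with restricted GPU options first, and
# drop (or defer) any device-listing calls that would run before it.
config = tf.ConfigProto(
    gpu_options=tf.GPUOptions(per_process_gpu_memory_fraction=0.2))
sess = tf.Session(config=config)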
On Mon, Jan 1, 2018 at 2:59 PM, Mark Sonnenschein wrote:
> Same here @mharshe :/ Did you find a solution?
@mharshe How did you remove that call? Have you tried this with the object detection API?
@tuobay Did you find any solutions?
@madekwe Did you solve the problem?
I am facing the same issue.
@joydeepmedhi I wasn't using the object detection API. I had my own (completely different) model that included a call to device_lib.list_local_devices(). I just removed that line. The per_process_gpu_memory_fraction option works perfectly well in all my code and all the models I have worked on.
I had a similar issue where TensorFlow was taking up the whole GPU despite including
_gpu_options = tf.GPUOptions(per_process_gpu_memory_fraction=0.3, allow_growth=True)_
After reading @mharshe's comment, I saw that my code was calling
_tf.test.gpu_device_name()_
Removing it resolved the issue.
Thanks for sharing, saved my day!
Does not work in v2 using v1 compat.
tf_config = tf.compat.v1.ConfigProto(
    log_device_placement=True,
    gpu_options=tf.compat.v1.GPUOptions(per_process_gpu_memory_fraction=0.5))
TestSession = tf.compat.v1.Session(config=tf_config)
The GPU was still about 90 percent allocated.
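In case it helps, in TF 2.x the native way to limit GPU memory (instead of the v1 compat ConfigProto) is the tf.config API. A minimal sketch, assuming a single visible GPU and an illustrative 4096 MB cap; note these calls must run before the GPU is first initialized:

import tensorflow as tf

gpus = tf.config.list_physical_devices('GPU')
if gpus:
    # Option 1: allocate memory on demand instead of grabbing it all up front.
    tf.config.experimental.set_memory_growth(gpus[0], True)
    # Option 2 (alternative; cannot be combined with Option 1): hard per-process cap.
    # tf.config.set_logical_device_configuration(
    #     gpus[0],
    #     [tf.config.LogicalDeviceConfiguration(memory_limit=4096)])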