python object_detection/eval.py --logtostderr \
--pipeline_config_path=/home/ubuntu/tensorflow_inception/models/faster_rcnn_inception_resnet_v2.config \
--checkpoint_dir=/home/ubuntu/tensorflow_inception/models/train/ \
--eval_dir=/home/ubuntu/tensorflow_inception/models/eval/
Training works fine, and TensorBoard shows the training information correctly. But when I run eval.py, it prints the messages below, and TensorBoard shows nothing for the validation run. I have 8 GPUs, and I use CUDA_VISIBLE_DEVICES to assign separate GPUs to training and validation.
GPU status during training and validation:
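Since the setup above keys everything off CUDA_VISIBLE_DEVICES, here is a minimal sketch of pinning the eval job to a single idle card. The GPU index 2 is an assumption for illustration; pick whichever card nvidia-smi reports as free:

```shell
# Restrict this shell (and any process launched from it) to one GPU.
# The index refers to the ordering shown by nvidia-smi.
export CUDA_VISIBLE_DEVICES=2
echo "eval restricted to GPU: $CUDA_VISIBLE_DEVICES"

# The eval invocation itself is commented out here because it needs the
# TF models repo and the config/checkpoint paths from the report above:
# python object_detection/eval.py --logtostderr \
#   --pipeline_config_path=/home/ubuntu/tensorflow_inception/models/faster_rcnn_inception_resnet_v2.config \
#   --checkpoint_dir=/home/ubuntu/tensorflow_inception/models/train/ \
#   --eval_dir=/home/ubuntu/tensorflow_inception/models/eval/
```

Inside the process, TensorFlow then sees that card as `/device:GPU:0`, which matches the "Found device 0" line in the log below.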
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 384.81 Driver Version: 384.81 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla K80 Off | 00000000:00:17.0 Off | 0 |
| N/A 78C P0 63W / 149W | 0MiB / 11439MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 1 Tesla K80 Off | 00000000:00:18.0 Off | 0 |
| N/A 76C P0 83W / 149W | 10982MiB / 11439MiB | 3% Default |
+-------------------------------+----------------------+----------------------+
| 2 Tesla K80 Off | 00000000:00:19.0 Off | 0 |
| N/A 52C P0 59W / 149W | 0MiB / 11439MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 3 Tesla K80 Off | 00000000:00:1A.0 Off | 0 |
| N/A 44C P0 72W / 149W | 0MiB / 11439MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 4 Tesla K80 Off | 00000000:00:1B.0 Off | 0 |
| N/A 53C P0 57W / 149W | 0MiB / 11439MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 5 Tesla K80 Off | 00000000:00:1C.0 Off | 0 |
| N/A 42C P0 72W / 149W | 0MiB / 11439MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 6 Tesla K80 Off | 00000000:00:1D.0 Off | 0 |
| N/A 59C P0 60W / 149W | 0MiB / 11439MiB | 84% Default |
+-------------------------------+----------------------+----------------------+
| 7 Tesla K80 Off | 00000000:00:1E.0 Off | 0 |
| N/A 67C P0 145W / 149W | 10980MiB / 11439MiB | 90% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 1 65814 C python 10967MiB |
| 7 79907 C python 10967MiB |
+-----------------------------------------------------------------------------+
validation log:
INFO:tensorflow:Scale of 0 disables regularizer.
INFO:tensorflow:Scale of 0 disables regularizer.
INFO:tensorflow:Scale of 0 disables regularizer.
INFO:tensorflow:Scale of 0 disables regularizer.
INFO:tensorflow:Scale of 0 disables regularizer.
INFO:tensorflow:Scale of 0 disables regularizer.
WARNING:tensorflow:From /home/ubuntu/tensorflow/models/research/object_detection/evaluator.py:184: get_or_create_global_step (from tensorflow.contrib.framework.python.ops.variables) is deprecated and will be removed in a future version.
Instructions for updating:
Please switch to tf.train.get_or_create_global_step
2017-10-15 15:33:13.495886: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:892] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2017-10-15 15:33:13.496256: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Found device 0 with properties:
name: Tesla K80 major: 3 minor: 7 memoryClockRate(GHz): 0.8235
pciBusID: 0000:00:1e.0
totalMemory: 11.17GiB freeMemory: 11.10GiB
2017-10-15 15:33:13.496278: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1120] Creating TensorFlow device (/device:GPU:0) -> (device: 0, name: Tesla K80, pci bus id: 0000:00:1e.0, compute capability: 3.7)
INFO:tensorflow:Restoring parameters from /home/ubuntu/tensorflow_inception/models/train/model.ckpt-0
INFO:tensorflow:Restoring parameters from /home/ubuntu/tensorflow_inception/models/train/model.ckpt-0
2017-10-15 15:33:28.878541: E tensorflow/core/common_runtime/bfc_allocator.cc:244] tried to allocate 0 bytes
2017-10-15 15:33:28.878585: W tensorflow/core/common_runtime/allocator_retry.cc:32] Request to allocate 0 bytes
2017-10-15 15:33:28.878592: E tensorflow/core/common_runtime/bfc_allocator.cc:244] tried to allocate 0 bytes
2017-10-15 15:33:28.878596: W tensorflow/core/common_runtime/allocator_retry.cc:32] Request to allocate 0 bytes
2017-10-15 15:33:28.878667: E tensorflow/core/common_runtime/bfc_allocator.cc:378] tried to deallocate nullptr
2017-10-15 15:33:28.878683: E tensorflow/core/common_runtime/bfc_allocator.cc:378] tried to deallocate nullptr
WARNING:tensorflow:From /home/ubuntu/tensorflow/models/research/object_detection/evaluator.py:166: get_global_step (from tensorflow.contrib.framework.python.ops.variables) is deprecated and will be removed in a future version.
Instructions for updating:
Please switch to tf.train.get_global_step
WARNING:tensorflow:From /home/ubuntu/tensorflow/models/research/object_detection/evaluator.py:166: get_global_step (from tensorflow.contrib.framework.python.ops.variables) is deprecated and will be removed in a future version.
Instructions for updating:
Please switch to tf.train.get_global_step
2017-10-15 15:33:31.265735: E tensorflow/core/common_runtime/bfc_allocator.cc:244] tried to allocate 0 bytes
2017-10-15 15:33:31.265788: W tensorflow/core/common_runtime/allocator_retry.cc:32] Request to allocate 0 bytes
2017-10-15 15:33:31.265799: E tensorflow/core/common_runtime/bfc_allocator.cc:244] tried to allocate 0 bytes
2017-10-15 15:33:31.265803: W tensorflow/core/common_runtime/allocator_retry.cc:32] Request to allocate 0 bytes
2017-10-15 15:33:31.265833: E tensorflow/core/common_runtime/bfc_allocator.cc:378] tried to deallocate nullptr
2017-10-15 15:33:31.265841: E tensorflow/core/common_runtime/bfc_allocator.cc:378] tried to deallocate nullptr
2017-10-15 15:33:32.904984: E tensorflow/core/common_runtime/bfc_allocator.cc:244] tried to allocate 0 bytes
2017-10-15 15:33:32.905022: W tensorflow/core/common_runtime/allocator_retry.cc:32] Request to allocate 0 bytes
2017-10-15 15:33:32.905030: E tensorflow/core/common_runtime/bfc_allocator.cc:244] tried to allocate 0 bytes
2017-10-15 15:33:32.905034: W tensorflow/core/common_runtime/allocator_retry.cc:32] Request to allocate 0 bytes
2017-10-15 15:33:32.905070: E tensorflow/core/common_runtime/bfc_allocator.cc:378] tried to deallocate nullptr
2017-10-15 15:33:32.905088: E tensorflow/core/common_runtime/bfc_allocator.cc:378] tried to deallocate nullptr
2017-10-15 15:33:34.423554: E tensorflow/core/common_runtime/bfc_allocator.cc:244] tried to allocate 0 bytes
2017-10-15 15:33:34.423591: W tensorflow/core/common_runtime/allocator_retry.cc:32] Request to allocate 0 bytes
2017-10-15 15:33:34.423599: E tensorflow/core/common_runtime/bfc_allocator.cc:244] tried to allocate 0 bytes
2017-10-15 15:33:34.423603: W tensorflow/core/common_runtime/allocator_retry.cc:32] Request to allocate 0 bytes
2017-10-15 15:33:34.423629: E tensorflow/core/common_runtime/bfc_allocator.cc:378] tried to deallocate nullptr
2017-10-15 15:33:34.423638: E tensorflow/core/common_runtime/bfc_allocator.cc:378] tried to deallocate nullptr
2017-10-15 15:33:36.569055: E tensorflow/core/common_runtime/bfc_allocator.cc:244] tried to allocate 0 bytes
2017-10-15 15:33:36.569088: W tensorflow/core/common_runtime/allocator_retry.cc:32] Request to allocate 0 bytes
2017-10-15 15:33:36.569097: E tensorflow/core/common_runtime/bfc_allocator.cc:244] tried to allocate 0 bytes
2017-10-15 15:33:36.569101: W tensorflow/core/common_runtime/allocator_retry.cc:32] Request to allocate 0 bytes
2017-10-15 15:33:36.569128: E tensorflow/core/common_runtime/bfc_allocator.cc:378] tried to deallocate nullptr
2017-10-15 15:33:36.569148: E tensorflow/core/common_runtime/bfc_allocator.cc:378] tried to deallocate nullptr
I'm seeing this as well with v1.4.0-rc1.
@sguada
BTW I only saw the issue with MKL and XLA enabled. I just recompiled with those disabled and the issue went away.
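For anyone wanting to reproduce that rebuild: in a TF 1.x source build, MKL and XLA are opt-in, so omitting them is a matter of leaving out the corresponding configure answer and bazel flag. A hedged sketch (commands shown as comments; not verified against every 1.x release):

```shell
# Building a TF 1.x wheel without MKL and without XLA:
#
#   ./configure        # answer "n" when asked about XLA JIT support
#   bazel build //tensorflow/tools/pip_package:build_pip_package
#
# The key point is what is absent: no --config=mkl on the build line
# (and no XLA enabled in configure).
BUILD_FLAGS=""   # deliberately empty: no --config=mkl, no --config=xla
echo "extra build flags: '${BUILD_FLAGS}'"
```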
I also hit this problem, and my CUDA version is 9.0 too.
I also hit this problem, even though I disabled the GPU and let the eval.py script run on the CPU. Any progress so far?
I have also hit this problem when using wheels from TinyMind built with MKL (https://github.com/mind/wheels). I tried the 1.4 wheel with CUDA + MKL, 1.4.1 with CUDA + MKL, and 1.4.1 CPU-only with MKL; with all of these I got the same error (tons of zero-allocation complaints in the log).
Using the generic build from pip (i.e., pip install tensorflow-gpu) or TinyMind's wheel without MKL support seems to have no issues.
I am working with the TF Object Detection API, a faster_rcnn_resnet101 net retrained on my data. I use CUDA 8 / cuDNN 6.
Yet another comment on this. These weird errors only appear in eval.
However, training also seems broken in some way. With MKL, running training on the CPU gives me 11 sec/step; with the official TF build (pip install tensorflow-gpu) executed on the CPU only, I get 7--8 sec/step. Using TinyMind's optimized build (without MKL), the times shrink to 3--4 sec/step. Executing on the GPU (no matter whether using the official build or TinyMind's with MKL) ends up at 0.33 sec/step.
This seems wrong; I expected the MKL build to be the fastest (except for the GPU). Maybe it is the same bug, manifesting as inferior training performance and tons of zero-allocation complaints in eval.
I also hit this problem, with Ubuntu 16.04.4, TF 1.4.1, CUDA 8.0.61, cuDNN 6.0, and MKL enabled.
The problem went away when using the pip-installed TF 1.4.0.
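Spelled out as commands, that workaround amounts to swapping the MKL build for the stock PyPI wheel. The 1.4.0 pin comes from this comment, so treat it as an example rather than a recommendation; the actual pip commands are commented out so the sketch is safe to run:

```shell
# Replace an MKL-enabled wheel with the stock PyPI build:
#
#   pip uninstall -y tensorflow tensorflow-gpu
#   pip install tensorflow-gpu==1.4.0
#
# Afterwards, confirm which wheel is active:
#   python -c "import tensorflow as tf; print(tf.__version__)"
TF_PIN="1.4.0"
echo "reinstall stock wheel: pip install tensorflow-gpu==${TF_PIN}"
```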
I am having the exact same problem - has anyone found a solution? Running train.py is fine; running eval.py gives the same issue even though I set CUDA_VISIBLE_DEVICES to a fully available GPU. I have noticed that it still produces the image portion on TensorBoard but will not give the precision/recall scalars.
For reference, I am using tensorflow-gpu 1.5.0 and cuDNN 7.0.5.
I also hit this problem. Running train.py without the GPU is fine, but when I run eval.py, even though I first run the command export CUDA_VISIBLE_DEVICE=-1, I still get the error.
My environment is: CUDA 9.1, TensorFlow 1.8, Ubuntu 16.04.
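One detail worth flagging in the comment above: the exported variable is spelled CUDA_VISIBLE_DEVICE, without the trailing "S". CUDA ignores unrecognized variable names, so that export changes nothing and the job still grabs a GPU. A sketch of genuinely forcing CPU-only execution (note that an earlier comment reports the error even on the CPU, so this may not be a fix by itself):

```shell
# The variable must be spelled exactly CUDA_VISIBLE_DEVICES.
# -1 is not a valid device index, so CUDA hides every GPU.
export CUDA_VISIBLE_DEVICES=-1
echo "GPUs visible to CUDA: ${CUDA_VISIBLE_DEVICES}"

# python object_detection/eval.py ...   # would now run without any GPU
```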
Sadly, me too.
I also met this problem!
Hi There,
We are checking to see if you still need help on this, as this seems to be a considerably old issue. Please update this issue with the latest information, a code snippet to reproduce your issue, and the error you are seeing.
If we don't hear from you in the next 7 days, this issue will be closed automatically. If you don't need help on this issue any more, please consider closing this.