python object_detection/eval.py --logtostderr \
--pipeline_config_path=/home/ubuntu/tensorflow_inception/models/faster_rcnn_inception_resnet_v2.config \
--checkpoint_dir=/home/ubuntu/tensorflow_inception/models/train/ \
--eval_dir=/home/ubuntu/tensorflow_inception/models/eval/
Training works fine, and TensorBoard shows the training information correctly. But when I run eval.py, it prints the messages below, and TensorBoard shows nothing for the validation run. I have 8 GPUs, and I use CUDA_VISIBLE_DEVICES to assign separate GPUs to training and validation.
GPU status during training and validation:
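Since the setup above keys everything off CUDA_VISIBLE_DEVICES, here is a minimal sketch of pinning the eval job to a single idle card. The GPU index 2 is an assumption for illustration; pick whichever card nvidia-smi reports as free:

```shell
# Restrict this shell (and any process launched from it) to one GPU.
# The index refers to the ordering shown by nvidia-smi.
export CUDA_VISIBLE_DEVICES=2
echo "eval restricted to GPU: $CUDA_VISIBLE_DEVICES"

# The eval invocation itself is commented out here because it needs the
# TF models repo and the config/checkpoint paths from the report above:
# python object_detection/eval.py --logtostderr \
#   --pipeline_config_path=/home/ubuntu/tensorflow_inception/models/faster_rcnn_inception_resnet_v2.config \
#   --checkpoint_dir=/home/ubuntu/tensorflow_inception/models/train/ \
#   --eval_dir=/home/ubuntu/tensorflow_inception/models/eval/
```

Inside the process, TensorFlow then sees that card as `/device:GPU:0`, which matches the "Found device 0" line in the log below.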
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 384.81 Driver Version: 384.81 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla K80 Off | 00000000:00:17.0 Off | 0 |
| N/A 78C P0 63W / 149W | 0MiB / 11439MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 1 Tesla K80 Off | 00000000:00:18.0 Off | 0 |
| N/A 76C P0 83W / 149W | 10982MiB / 11439MiB | 3% Default |
+-------------------------------+----------------------+----------------------+
| 2 Tesla K80 Off | 00000000:00:19.0 Off | 0 |
| N/A 52C P0 59W / 149W | 0MiB / 11439MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 3 Tesla K80 Off | 00000000:00:1A.0 Off | 0 |
| N/A 44C P0 72W / 149W | 0MiB / 11439MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 4 Tesla K80 Off | 00000000:00:1B.0 Off | 0 |
| N/A 53C P0 57W / 149W | 0MiB / 11439MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 5 Tesla K80 Off | 00000000:00:1C.0 Off | 0 |
| N/A 42C P0 72W / 149W | 0MiB / 11439MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 6 Tesla K80 Off | 00000000:00:1D.0 Off | 0 |
| N/A 59C P0 60W / 149W | 0MiB / 11439MiB | 84% Default |
+-------------------------------+----------------------+----------------------+
| 7 Tesla K80 Off | 00000000:00:1E.0 Off | 0 |
| N/A 67C P0 145W / 149W | 10980MiB / 11439MiB | 90% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 1 65814 C python 10967MiB |
| 7 79907 C python 10967MiB |
+-----------------------------------------------------------------------------+
validation log:
INFO:tensorflow:Scale of 0 disables regularizer.
INFO:tensorflow:Scale of 0 disables regularizer.
INFO:tensorflow:Scale of 0 disables regularizer.
INFO:tensorflow:Scale of 0 disables regularizer.
INFO:tensorflow:Scale of 0 disables regularizer.
INFO:tensorflow:Scale of 0 disables regularizer.
WARNING:tensorflow:From /home/ubuntu/tensorflow/models/research/object_detection/evaluator.py:184: get_or_create_global_step (from tensorflow.contrib.framework.python.ops.variables) is deprecated and will be removed in a future version.
Instructions for updating:
Please switch to tf.train.get_or_create_global_step
2017-10-15 15:33:13.495886: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:892] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2017-10-15 15:33:13.496256: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Found device 0 with properties:
name: Tesla K80 major: 3 minor: 7 memoryClockRate(GHz): 0.8235
pciBusID: 0000:00:1e.0
totalMemory: 11.17GiB freeMemory: 11.10GiB
2017-10-15 15:33:13.496278: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1120] Creating TensorFlow device (/device:GPU:0) -> (device: 0, name: Tesla K80, pci bus id: 0000:00:1e.0, compute capability: 3.7)
INFO:tensorflow:Restoring parameters from /home/ubuntu/tensorflow_inception/models/train/model.ckpt-0
INFO:tensorflow:Restoring parameters from /home/ubuntu/tensorflow_inception/models/train/model.ckpt-0
2017-10-15 15:33:28.878541: E tensorflow/core/common_runtime/bfc_allocator.cc:244] tried to allocate 0 bytes
2017-10-15 15:33:28.878585: W tensorflow/core/common_runtime/allocator_retry.cc:32] Request to allocate 0 bytes
2017-10-15 15:33:28.878592: E tensorflow/core/common_runtime/bfc_allocator.cc:244] tried to allocate 0 bytes
2017-10-15 15:33:28.878596: W tensorflow/core/common_runtime/allocator_retry.cc:32] Request to allocate 0 bytes
2017-10-15 15:33:28.878667: E tensorflow/core/common_runtime/bfc_allocator.cc:378] tried to deallocate nullptr
2017-10-15 15:33:28.878683: E tensorflow/core/common_runtime/bfc_allocator.cc:378] tried to deallocate nullptr
WARNING:tensorflow:From /home/ubuntu/tensorflow/models/research/object_detection/evaluator.py:166: get_global_step (from tensorflow.contrib.framework.python.ops.variables) is deprecated and will be removed in a future version.
Instructions for updating:
Please switch to tf.train.get_global_step
WARNING:tensorflow:From /home/ubuntu/tensorflow/models/research/object_detection/evaluator.py:166: get_global_step (from tensorflow.contrib.framework.python.ops.variables) is deprecated and will be removed in a future version.
Instructions for updating:
Please switch to tf.train.get_global_step
2017-10-15 15:33:31.265735: E tensorflow/core/common_runtime/bfc_allocator.cc:244] tried to allocate 0 bytes
2017-10-15 15:33:31.265788: W tensorflow/core/common_runtime/allocator_retry.cc:32] Request to allocate 0 bytes
2017-10-15 15:33:31.265799: E tensorflow/core/common_runtime/bfc_allocator.cc:244] tried to allocate 0 bytes
2017-10-15 15:33:31.265803: W tensorflow/core/common_runtime/allocator_retry.cc:32] Request to allocate 0 bytes
2017-10-15 15:33:31.265833: E tensorflow/core/common_runtime/bfc_allocator.cc:378] tried to deallocate nullptr
2017-10-15 15:33:31.265841: E tensorflow/core/common_runtime/bfc_allocator.cc:378] tried to deallocate nullptr
2017-10-15 15:33:32.904984: E tensorflow/core/common_runtime/bfc_allocator.cc:244] tried to allocate 0 bytes
2017-10-15 15:33:32.905022: W tensorflow/core/common_runtime/allocator_retry.cc:32] Request to allocate 0 bytes
2017-10-15 15:33:32.905030: E tensorflow/core/common_runtime/bfc_allocator.cc:244] tried to allocate 0 bytes
2017-10-15 15:33:32.905034: W tensorflow/core/common_runtime/allocator_retry.cc:32] Request to allocate 0 bytes
2017-10-15 15:33:32.905070: E tensorflow/core/common_runtime/bfc_allocator.cc:378] tried to deallocate nullptr
2017-10-15 15:33:32.905088: E tensorflow/core/common_runtime/bfc_allocator.cc:378] tried to deallocate nullptr
2017-10-15 15:33:34.423554: E tensorflow/core/common_runtime/bfc_allocator.cc:244] tried to allocate 0 bytes
2017-10-15 15:33:34.423591: W tensorflow/core/common_runtime/allocator_retry.cc:32] Request to allocate 0 bytes
2017-10-15 15:33:34.423599: E tensorflow/core/common_runtime/bfc_allocator.cc:244] tried to allocate 0 bytes
2017-10-15 15:33:34.423603: W tensorflow/core/common_runtime/allocator_retry.cc:32] Request to allocate 0 bytes
2017-10-15 15:33:34.423629: E tensorflow/core/common_runtime/bfc_allocator.cc:378] tried to deallocate nullptr
2017-10-15 15:33:34.423638: E tensorflow/core/common_runtime/bfc_allocator.cc:378] tried to deallocate nullptr
2017-10-15 15:33:36.569055: E tensorflow/core/common_runtime/bfc_allocator.cc:244] tried to allocate 0 bytes
2017-10-15 15:33:36.569088: W tensorflow/core/common_runtime/allocator_retry.cc:32] Request to allocate 0 bytes
2017-10-15 15:33:36.569097: E tensorflow/core/common_runtime/bfc_allocator.cc:244] tried to allocate 0 bytes
2017-10-15 15:33:36.569101: W tensorflow/core/common_runtime/allocator_retry.cc:32] Request to allocate 0 bytes
2017-10-15 15:33:36.569128: E tensorflow/core/common_runtime/bfc_allocator.cc:378] tried to deallocate nullptr
2017-10-15 15:33:36.569148: E tensorflow/core/common_runtime/bfc_allocator.cc:378] tried to deallocate nullptr
I'm seeing this as well with v1.4.0-rc1.
@sguada
BTW I only saw the issue with MKL and XLA enabled. I just recompiled with those disabled and the issue went away.
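For anyone wanting to reproduce that rebuild: in a TF 1.x source build, MKL and XLA are opt-in, so omitting them is a matter of leaving out the corresponding configure answer and bazel flag. A hedged sketch (commands shown as comments; not verified against every 1.x release):

```shell
# Building a TF 1.x wheel without MKL and without XLA:
#
#   ./configure        # answer "n" when asked about XLA JIT support
#   bazel build //tensorflow/tools/pip_package:build_pip_package
#
# The key point is what is absent: no --config=mkl on the build line
# (and no XLA enabled in configure).
BUILD_FLAGS=""   # deliberately empty: no --config=mkl, no --config=xla
echo "extra build flags: '${BUILD_FLAGS}'"
```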
I also hit this problem, and my CUDA version is 9.0 too.
I also hit this problem, even though I disabled the GPU and let the eval.py script run on the CPU. Any progress so far?
I have also hit this problem when using wheels from TinyMind built with MKL (https://github.com/mind/wheels). I tried the 1.4 wheel with CUDA + MKL, 1.4.1 with CUDA + MKL, and 1.4.1 CPU-only with MKL; with all of these I got the same error (tons of zero-allocation complaints in the log).
Using the generic build from pip (i.e., pip install tensorflow-gpu) or TinyMind's wheel without MKL support seems to have no issues.
I am working with the TF Object Detection API, a faster_rcnn_resnet101 net retrained on my data. I use CUDA 8 / cuDNN 6.
Yet another comment on this. These weird errors only appear in eval.
However, training also seems broken in some way. With MKL, running training on the CPU gives me 11 sec/step; with the official TF build (pip install tensorflow-gpu) executed on the CPU only, I get 7--8 sec/step. Using TinyMind's optimized build (without MKL), the times shrink to 3--4 sec/step. Executing on the GPU (no matter whether using the official build or TinyMind's with MKL) ends up at 0.33 sec/step.
This seems wrong; I expected the MKL build to be the fastest (except for the GPU). Maybe it is the same bug, manifesting as inferior training performance and tons of zero-allocation complaints in eval.
I also hit this problem, with Ubuntu 16.04.4, TF 1.4.1, CUDA 8.0.61, cuDNN 6.0, and MKL enabled.
The problem went away when using the pip-installed TF 1.4.0.
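Spelled out as commands, that workaround amounts to swapping the MKL build for the stock PyPI wheel. The 1.4.0 pin comes from this comment, so treat it as an example rather than a recommendation; the actual pip commands are commented out so the sketch is safe to run:

```shell
# Replace an MKL-enabled wheel with the stock PyPI build:
#
#   pip uninstall -y tensorflow tensorflow-gpu
#   pip install tensorflow-gpu==1.4.0
#
# Afterwards, confirm which wheel is active:
#   python -c "import tensorflow as tf; print(tf.__version__)"
TF_PIN="1.4.0"
echo "reinstall stock wheel: pip install tensorflow-gpu==${TF_PIN}"
```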
I am having the exact same problem - has anyone found a solution? Running train.py is fine; running eval.py gives the same issue even though I set CUDA_VISIBLE_DEVICES to a fully available GPU. I have noticed that it still produces the image portion on TensorBoard but will not give the precision/recall scalars.
For reference, I am using tensorflow-gpu 1.5.0 and cuDNN 7.0.5.
I also hit this problem. Running train.py without the GPU is fine, but when I run eval.py, even though I first run the command export CUDA_VISIBLE_DEVICE=-1, I still get the error.
My environment is: CUDA 9.1, TensorFlow 1.8, Ubuntu 16.04.
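One detail worth flagging in the comment above: the exported variable is spelled CUDA_VISIBLE_DEVICE, without the trailing "S". CUDA ignores unrecognized variable names, so that export changes nothing and the job still grabs a GPU. A sketch of genuinely forcing CPU-only execution (note that an earlier comment reports the error even on the CPU, so this may not be a fix by itself):

```shell
# The variable must be spelled exactly CUDA_VISIBLE_DEVICES.
# -1 is not a valid device index, so CUDA hides every GPU.
export CUDA_VISIBLE_DEVICES=-1
echo "GPUs visible to CUDA: ${CUDA_VISIBLE_DEVICES}"

# python object_detection/eval.py ...   # would now run without any GPU
```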
Sadly, me too.
I also met this problem!
Hi There,
We are checking to see if you still need help on this, as this seems to be a considerably old issue. Please update this issue with the latest information, a code snippet to reproduce your issue, and the error you are seeing.
If we don't hear from you in the next 7 days, this issue will be closed automatically. If you don't need help on this issue any more, please consider closing this.