Keras-retinanet: evaluation takes a long time

Created on 30 Sep 2018 · 8 Comments · Source: fizyr/keras-retinanet

I recently downloaded the master branch of this repo. Every time I run evaluate.py, the first batch takes about 30 minutes. I trained on the VOC dataset with two GPUs and evaluate with one. Debugging evaluate.py shows that it gets stuck on the "predict_on_batch" call. I also watched the GPU memory with nvidia-smi: the usage is 713 MiB before the prediction really starts.

VCBE123

All 8 comments

Is the evaluation during training slow, or is evaluate.py slow?

What TensorFlow / Keras versions are you using? What backbone? Can you check in nvidia-smi if the GPU is being utilized? Can you post the output of evaluate.py?

Also, what do you mean by first batch? Does it take 30 minutes for one image? Otherwise, what is your batch size?

hgaiser on 30 Sep 2018

tensorflow-gpu 1.5.0 and Keras 2.2.2 with a resnet101 backbone. When I run evaluate.py, the GPU memory usage is 713 MiB at first. The output is below:

/home/wen/anaconda3/bin/python /home/wen/pycharm-2017.2.4/helpers/pydev/pydevd.py --multiproc --qt-support=auto --client 127.0.0.1 --port 41388 --file /home/wen/net_project/keras-retinanet-master/keras_retinanet/bin/evaluate.py pascal /home/wen/data/test922/voc ./snapshots/resnet101_pascal_120.h5 --gpu=3
pydev debugger: process 24228 is connecting

Connected to pydev debugger (build 172.4343.24)
Using TensorFlow backend.
2018-09-30 14:49:49.053546: I tensorflow/core/platform/cpu_feature_guard.cc:137] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA
2018-09-30 14:49:49.531490: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1105] Found device 0 with properties: 
name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.582
pciBusID: 0000:09:00.0
totalMemory: 10.91GiB freeMemory: 10.75GiB
2018-09-30 14:49:49.531563: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1195] Creating TensorFlow device (/device:GPU:0) -> (device: 0, name: GeForce GTX 1080 Ti, pci bus id: 0000:09:00.0, compute capability: 6.1)
Loading model, this may take a second...
QXcbConnection: Failed to initialize XRandr
Qt: XKEYBOARD extension not present on the X server.
Backend Qt5Agg is interactive backend. Turning interactive mode on.
/home/wen/anaconda3/lib/python3.6/site-packages/keras/engine/saving.py:268: UserWarning: No training configuration found in save file: the model was *not* compiled. Compile it manually.
  warnings.warn('No training configuration found in save file: '
----------------------------------------------------------------------------------------------------------------------------------
The program is stuck on this line:
        # run network
        boxes, scores, labels = model.predict_on_batch(np.expand_dims(image, axis=0))[:3] 
---------------------------------------------------------------------------------------------------------------------------------

The first prediction takes a long time.

VCBE123 on 30 Sep 2018
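To tell a one-off start-up cost apart from slow inference in general, it can help to time the first call separately from the following ones. This is only a minimal sketch: `time_calls` and `summarize` are hypothetical helper names, not part of keras-retinanet, and `predict` stands for any callable such as `model.predict_on_batch`.

```python
import time


def time_calls(predict, batch, repeats=3):
    """Call `predict(batch)` `repeats` times and return each duration in seconds."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        predict(batch)
        durations.append(time.perf_counter() - start)
    return durations


def summarize(durations):
    """Separate the first call from the mean of the remaining calls."""
    first, rest = durations[0], durations[1:]
    steady = sum(rest) / len(rest) if rest else None
    return {"first_call_s": first, "steady_state_s": steady}
```

With the model from the thread this would be called roughly as `summarize(time_calls(model.predict_on_batch, np.expand_dims(image, axis=0)))`. A large gap between `first_call_s` and `steady_state_s` would point at initialization rather than at the network itself.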

It should use way more than 713 MB for resnet101. Also, it seems you are running under a Python debugger, which may slow it down further. Could you check in nvidia-smi whether the GPU is utilized, i.e. whether it receives any workload? (I forgot the name of the column in nvidia-smi, but it shows a percentage of workload.)

TensorFlow 1.5 is also quite old; consider updating to at least 1.8.

hgaiser on 30 Sep 2018
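The workload percentage asked about is the GPU-Util column of nvidia-smi. For repeated checks while evaluate.py runs, the same numbers can be read in CSV form. The sketch below assumes `nvidia-smi` is on the PATH and that the driver reports plain integers for these fields (some GPUs report `[N/A]` instead, which this parser does not handle); `parse_gpu_csv` and `query_gpus` are hypothetical helper names.

```python
import subprocess

# Query index, utilization (%) and used memory (MiB) as bare CSV values.
QUERY = [
    "nvidia-smi",
    "--query-gpu=index,utilization.gpu,memory.used",
    "--format=csv,noheader,nounits",
]


def parse_gpu_csv(text):
    """Parse lines such as '3, 0, 5694' into one dict per GPU."""
    rows = []
    for line in text.strip().splitlines():
        if not line.strip():
            continue
        index, util, mem = [field.strip() for field in line.split(",")]
        rows.append({
            "index": int(index),
            "util_percent": int(util),
            "memory_used_mib": int(mem),
        })
    return rows


def query_gpus():
    """Run nvidia-smi once and return the parsed rows."""
    result = subprocess.run(
        QUERY, stdout=subprocess.PIPE, universal_newlines=True, check=True
    )
    return parse_gpu_csv(result.stdout)
```

Calling `query_gpus()` every few seconds during the first prediction shows whether the selected GPU ever leaves 0% utilization.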

When I run evaluate.py directly, I get this output:

/home/wen/anaconda3/bin/python /home/wen/net_project/keras-retinanet-master/keras_retinanet/bin/evaluate.py pascal /home/wen/data/test922/voc ./snapshots/resnet101_pascal_120.h5 --gpu=3
Using TensorFlow backend.
2018-09-30 15:03:32.741131: I tensorflow/core/platform/cpu_feature_guard.cc:137] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA
2018-09-30 15:03:33.211905: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1105] Found device 0 with properties: 
name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.582
pciBusID: 0000:09:00.0
totalMemory: 10.91GiB freeMemory: 10.06GiB
2018-09-30 15:03:33.211954: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1195] Creating TensorFlow device (/device:GPU:0) -> (device: 0, name: GeForce GTX 1080 Ti, pci bus id: 0000:09:00.0, compute capability: 6.1)
Loading model, this may take a second...
/home/wen/anaconda3/lib/python3.6/site-packages/keras/engine/saving.py:268: UserWarning: No training configuration found in save file: the model was *not* compiled. Compile it manually.
  warnings.warn('No training configuration found in save file: '



Sun Sep 30 15:06:46 2018       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 384.130                Driver Version: 384.130                   |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 108...  Off  | 00000000:04:00.0 Off |                  N/A |
| 25%   33C    P8    16W / 250W |      0MiB / 11172MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX 108...  Off  | 00000000:05:00.0 Off |                  N/A |
| 25%   36C    P8    16W / 250W |      0MiB / 11172MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  GeForce GTX 108...  Off  | 00000000:08:00.0 Off |                  N/A |
| 25%   32C    P8    16W / 250W |      0MiB / 11172MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  GeForce GTX 108...  Off  | 00000000:09:00.0 Off |                  N/A |
| 25%   36C    P2    54W / 250W |   5694MiB / 11172MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   4  GeForce GTX 108...  Off  | 00000000:85:00.0 Off |                  N/A |
| 50%   83C    P2   211W / 250W |  10613MiB / 11172MiB |     97%      Default |
+-------------------------------+----------------------+----------------------+
|   5  GeForce GTX 108...  Off  | 00000000:86:00.0 Off |                  N/A |
| 25%   35C    P8    16W / 250W |      0MiB / 11172MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   6  GeForce GTX 108...  Off  | 00000000:89:00.0 Off |                  N/A |
| 25%   30C    P8    15W / 250W |      0MiB / 11172MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   7  GeForce GTX 108...  Off  | 00000000:8A:00.0 Off |                  N/A |
| 50%   83C    P2   115W / 250W |   9599MiB / 11172MiB |     84%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    3     24228      C   /home/wen/anaconda3/bin/python              4971MiB |
|    3     24825      C   /home/wen/anaconda3/bin/python               713MiB |


--------------------------------------------------------------------------------------------------------------------

VCBE123 on 30 Sep 2018

Doesn't that show nearly 5 GB of memory usage though? That's more in line with what I expect it to be.

Still though, update tensorflow to 1.10 if possible, I've heard people say it makes a lot of difference.

hgaiser on 2 Oct 2018
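When checking whether an installed TensorFlow meets a suggested minimum, note that comparing version strings directly is misleading: as text, "1.5.0" sorts after "1.10.0". A minimal sketch of a numeric comparison follows; `parse_version` and `is_at_least` are hypothetical helpers written for illustration, not TensorFlow or Keras functions.

```python
def parse_version(version):
    """Turn '1.10.0' into (1, 10, 0); stop at the first non-numeric part."""
    parts = []
    for piece in version.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def is_at_least(installed, minimum):
    """Compare versions numerically, padding the shorter one with zeros."""
    a, b = parse_version(installed), parse_version(minimum)
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b
```

In practice the installed version would come from `tf.__version__` after `import tensorflow as tf`, for example `is_at_least(tf.__version__, "1.8")`.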

Thank you and your team so much! I solved this problem by upgrading TensorFlow from 1.5 to 1.10.

VCBE123 on 2 Oct 2018

🎉3 👍2

It is very helpful to know that older versions of TensorFlow slow the program down.

I had tensorflow-gpu==1.5 and the initial predict was taking 300 s;
after installing tensorflow-gpu==1.10 it went down to 6 s.