Openpose: Performance on server GPUs (K80, P100, etc.)?

Created on 4 Dec 2017 · 21 Comments · Source: CMU-Perceptual-Computing-Lab/openpose

We're running OpenPose on Google Cloud and are getting surprisingly poor performance on K80s and P100s. On the K80 we're only getting about 4.5 FPS, which is much worse than on some of the gaming GPUs.

What's the expected performance on K80s, P100s, V100s, etc.? Has anyone tried them out?
Thanks.

help wanted, question

All 21 comments

Unfortunately I do not have any of those models, so I do not know. However, the best way to measure the real OpenPose FPS on remote servers is to run with --no_display and time two runs, e.g. one on 110 images and one on 10 images, then divide the difference in total time by the difference in image count (100 in this case). Subtracting the two runs cancels the fixed startup cost. If anyone has FPS results (either the FPS value displayed in the GUI or this time measurement), please feel free to post them at:
https://docs.google.com/spreadsheets/d/1-DynFGvoScvfWDA1P4jDInCkbD4lg0IKOYbXgEq0sK0/edit#gid=0

@gineshidalgo99 ok thank you! could you clarify the math to calculate time - not sure I follow?

@gineshidalgo99 @megashigger Hi, I ran into some problems when using this app on a remote server.
First I tried:
./build/examples/openpose/openpose.bin --video examples/media/video.avi
And the error is:
(OpenPose 1.2.0:26569): Gtk-WARNING **: cannot open display:
I think that's because I am running it on a remote server, so the GUI cannot open, as the error says. So I added a "--no_display" flag to the command above, but then I get the error below:

===============================================================================
root@1918d92c278c:/code/openpose# ./build/examples/openpose/openpose.bin --video examples/media/video.avi -no_display
Starting pose estimation demo.

Error:
No output is selected (no_display) and no results are generated (no write_X flags enabled). Thus, no output would be generated. You could also set mThreadManagerMode = mThreadManagerMode::Asynchronous(Out) and/or add your own output worker class before calling this function.

Coming from:

  • ./include/openpose/wrapper/wrapper.hpp:configure():502
  • ./include/openpose/wrapper/wrapper.hpp:configure():937
    terminate called after throwing an instance of 'std::runtime_error'
    what():
    Error:
    No output is selected (no_display) and no results are generated (no write_X flags enabled). Thus, no output would be generated. You could also set mThreadManagerMode = mThreadManagerMode::Asynchronous(Out) and/or add your own output worker class before calling this function.

Coming from:

Aborted (core dumped)

I am running it in a Docker container, but I am sure that the GPU and the environment are set up correctly.

Would you tell me how to fix this?
Thank you.

@megashigger When OpenPose closes, it reports the total run time. So create one folder with 10 images and another with 110 images, run OpenPose on each, and compute:

FPS = 1 / t

where t = 1/100 x (t_total_110 - t_total_10)
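The measurement above can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, the regular expression assumes the "Total time: ... seconds." summary line quoted elsewhere in this thread, and the timings in the example are made up to show the arithmetic.

```python
import re

# Matches the summary line OpenPose prints on exit, as quoted in this thread:
# "Real-time pose estimation demo successfully finished. Total time: 15.498089 seconds."
_TOTAL_TIME = re.compile(r"Total time:\s*([0-9]*\.?[0-9]+)\s*seconds")


def parse_total_time(log_text: str) -> float:
    """Extract the total run time in seconds from captured OpenPose output."""
    match = _TOTAL_TIME.search(log_text)
    if match is None:
        raise ValueError("no 'Total time: ... seconds' line found")
    return float(match.group(1))


def fps_from_totals(t_total_small: float, t_total_large: float,
                    n_small: int = 10, n_large: int = 110) -> float:
    """FPS = 1 / t, with t = (t_total_large - t_total_small) / (n_large - n_small).

    Taking the difference of the two runs cancels the fixed startup cost.
    """
    extra_frames = n_large - n_small
    extra_time = t_total_large - t_total_small
    if extra_frames <= 0 or extra_time <= 0:
        raise ValueError("the larger run must have more frames and take longer")
    return extra_frames / extra_time


# Made-up timings, only to show the arithmetic (not measured results).
example_fps = fps_from_totals(t_total_small=6.0, t_total_large=16.0)
print(f"{example_fps:.2f} FPS")
```

With the made-up timings, the 100 extra images take 10 extra seconds, so the estimate is 10 FPS regardless of how long startup took.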

@HoracceFeng Error: No output is selected (no_display) and no results are generated (no write_X flags enabled). Thus, no output would be generated. --> as it says, use e.g. --write_keypoint_json output_temporary to solve it.
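For anyone scripting headless runs, here is a small Python sketch that assembles the command line and refuses to build one that would hit this error. The helper name and its parameters are hypothetical; the flags are only the ones that appear in this thread, and the check is a local convenience, not OpenPose's own validation.

```python
from typing import List, Optional


def build_openpose_command(
    binary: str,
    video: Optional[str] = None,
    image_dir: Optional[str] = None,
    no_display: bool = False,
    write_keypoint_json: Optional[str] = None,
    write_video: Optional[str] = None,
    write_images: Optional[str] = None,
) -> List[str]:
    """Assemble an argument list, rejecting a headless run with no output."""
    outputs = [
        ("--write_keypoint_json", write_keypoint_json),
        ("--write_video", write_video),
        ("--write_images", write_images),
    ]
    # Mirrors the error above: without a display, something must be written.
    if no_display and not any(value for _, value in outputs):
        raise ValueError("--no_display needs at least one write_* flag")

    cmd = [binary]
    if video:
        cmd += ["--video", video]
    if image_dir:
        cmd += ["--image_dir", image_dir]
    if no_display:
        cmd.append("--no_display")
    for flag, value in outputs:
        if value:
            cmd += [flag, value]
    return cmd


cmd = build_openpose_command(
    "./build/examples/openpose/openpose.bin",
    video="examples/media/video.avi",
    no_display=True,
    write_keypoint_json="output_temporary",
)
print(" ".join(cmd))
```

The resulting list can be passed to whatever process launcher you already use.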

@gineshidalgo99 Thx. BTW, if I set "num_gpu" to 0, the code runs but the output video/images don't have the skeleton. Is that because the model can only run on a GPU?

@gineshidalgo99 Also, could you tell me where the code that draws the skeleton is? As I said above, when I set num_gpu = 0, the code runs but the output shows nothing (even when I use write_keypoint_json). But when I use num_gpu = -1, something goes wrong:

Error:
OpenPose must be compiled with the USE_CUDA macro definition in order to use this functionality.

Coming from:

what():

I followed your manual to compile the code and set USE_CUDA := 1. [Maybe this is because I run the code in a container?]

So, here are three questions:

  1. how to fix the USE_CUDA macro error?
  2. Can I just use CPU to get results?
  3. Would you tell me where is the code to draw results?

Thank you

  1. (Duplicated) CUDA is required, you must compile with CUDA
  2. (Duplicated) No
  3. (Duplicated) RenderPose.cu or something like that

Got it. Thank you so much! Have a good day.

Actually I have CUDA, not sure if it is because of the docker container. Will update if I figure it out

I've tried using CoreML with GPU acceleration
and I'm getting 6.4 sec on an iPad and 3 sec on an iPhone 7 to process a 400x256 image with the model pose_iter_440000.caffemodel.
Is that the best achievable time?
Is there a way to simplify the network to reach at least 0.1 sec per frame?

I'm getting slow speeds on my P100 as well. I'm running Ubuntu 16.04 on top of Xen hypervisor.
CPU: Intel Xeon Gold 24 cores allocated
RAM: 256 GB allocated
GPU: Tesla P100

I'm happy to run any benchmarks you suggest, but I'm fairly new to OpenPose and not sure how to extract the stats listed in the linked spreadsheet.

Update:
I'm sorry, I failed to notice that you'd already written how to take the benchmarks. I'll take them and get back to you.

@jhorowitz Once you are done running the benchmark, please let me know the FPS and the accuracy you got with your system configuration.

I did some benchmarking with server GPUs.
k80: 3.44 FPS
p100: 11.0999 FPS
v100: 11.1425 FPS
I verified I was calculating FPS correctly by also running the same test on a 1080ti and I got the same results as the performance doc you shared.

The V100 is a much more powerful GPU than the P100, yet there is barely any improvement in FPS.
@gineshidalgo99 Do you know if there are any limitations in OpenPose that prevent it from fully utilizing the V100? Can it be compiled to use higher versions of CUDA, cuDNN, TensorRT, or even a different floating point precision (ideally FP16)?
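To put numbers on "barely any improvement", here is a quick Python calculation of the relative speedups, using only the three FPS figures reported above. The helper name is just for illustration.

```python
# FPS figures exactly as reported in the benchmark comment above.
fps_by_gpu = {"K80": 3.44, "P100": 11.0999, "V100": 11.1425}


def speedup(fps: dict, gpu: str, baseline: str) -> float:
    """Ratio of one GPU's FPS to a baseline GPU's FPS."""
    return fps[gpu] / fps[baseline]


v100_vs_p100 = speedup(fps_by_gpu, "V100", "P100")
p100_vs_k80 = speedup(fps_by_gpu, "P100", "K80")

print(f"V100 vs P100: {v100_vs_p100:.3f}x")
print(f"P100 vs K80:  {p100_vs_k80:.2f}x")
```

The V100 comes out roughly 0.4% faster than the P100 in this test, while the P100 is about 3.2 times faster than the K80. Keep in mind that the two setups used different CUDA and cuDNN versions, so this is not a like-for-like comparison.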

@leodays Thank you very much for the very useful benchmark! Added to the official one ( https://docs.google.com/spreadsheets/d/1-DynFGvoScvfWDA1P4jDInCkbD4lg0IKOYbXgEq0sK0/edit#gid=0 ).

I am also surprised by the results. I understand that the V100 would require CUDA 9 to fully use the Volta architecture. But even with the exact same CUDA version, it should be better than the P100. Could you let me know the CUDA/cuDNN versions for the P100 and V100? Or whether there is any change that you are aware of in OpenCV or Caffe (these 2 should not make any difference though)? Thanks!

For both GPUs I compiled caffe using the default install scripts that point to the 3rdparty directory.

P100
CUDA: 8.0
cuDNN: 5.1
OpenCV: 2.0

V100 (I tried using same configuration as the p100 but openpose was not able to run at all with that setup)
CUDA: 9.0
cuDNN: 7.0
OpenCV: 3.0

P.S. I ran the tests using openpose.bin, so it is using all the default resolution and scale values.

I found cuDNN 6 around 15% slower than cuDNN 5.1 (with same everything else). Maybe cuDNN 7 is also slower than 5.1... my intuition tells me that the issue might be the speed difference... Could you try to measure the FPS in the P100 with cuDNN 7 (maintaining CUDA 8)? (I think 5.1 cannot be used with CUDA9)

@atrbx5 Could you kindly share your CoreML version so that we can test it? Thanks!

Hi everyone,

I am really new to OpenPose and am also looking for advice on diagnosing and improving performance. I am running OpenPose on my Ubuntu 16 cloud server (Red Hat OpenStack) via a Docker image (last updated 6 months ago).

My Docker settings are the following:
CUDA: 8
cuDNN: 8.0
OpenCV: 2.4

I am running on a system with 2 Tesla P100 GPUs (NVIDIA-SMI 384.111 Driver Version: 384.111 ). The Docker Image is detecting the 2 Tesla P100 GPUs .

My host's settings are the following:
CUDA: 7.5
cuDNN: 9.1
OpenCV: 2.0

When I run the demo video, it is really fast. Completed in 15 secs. Much faster than what I tried on the AWS server with the p2.xlarge instance.

./build/examples/openpose/openpose.bin --video examples/media/video.avi --write_video ../data/output/result.avi --write_keypoint_json ../data/output/poses --no_display

Starting pose estimation demo.
Auto-detecting GPUs... Detected 2 GPU(s), using them all.
Starting thread(s)
Real-time pose estimation demo successfully finished. Total time: 15.498089 seconds.

HOWEVER

When I run it on just 1 image (.JPEG, resolution 360 × 640, 250 KB), it takes about 4 seconds, which is twice as slow as what I got on the p2.xlarge instance.

./build/examples/openpose/openpose.bin --image_dir /data/input/images/ --write_images ../data/output/images --no_display --hand --disable_blending --alpha_pose 1 --resolution 360x640

Starting pose estimation demo.
Auto-detecting GPUs... Detected 2 GPU(s), using them all.
Starting thread(s)
Real-time pose estimation demo successfully finished. Total time: 4.043151 seconds.

Any suggestions on whether I have bumped into a particular issue? I basically just need to process 1 image at a time in the cloud. Is there any preprocessing time spent on image batch processing?
What is the best practice for running on just 1 image?

Any suggestions would be greatly appreciated.

Thank you,
Heng

OpenPose has to load ~1.5 GB into GPU memory every time it starts, which accounts for those 4 seconds for 1 image. If you only want to process a single image from time to time, keep OpenPose running in the background by implementing your own code (see the tutorial_wrapper/ examples).
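The load-once idea can be illustrated with a minimal Python sketch. Everything in it is hypothetical: PoseWorker, fake_load_model and the returned dictionary are stand-ins for illustration, not the OpenPose wrapper API. The point is only that the expensive load runs once while many images reuse it.

```python
class PoseWorker:
    """Holds a loaded model for the lifetime of the process."""

    def __init__(self, load_model):
        # The expensive step (for OpenPose, loading ~1.5 GB onto the GPU)
        # is paid once here instead of once per image.
        self._model = load_model()

    def process(self, image):
        # Every call reuses the model that is already loaded.
        return self._model(image)


# Stand-in loader that counts how often it runs (not real OpenPose code).
load_calls = {"count": 0}


def fake_load_model():
    load_calls["count"] += 1

    def fake_model(image):
        return {"image": image, "keypoints": []}

    return fake_model


worker = PoseWorker(fake_load_model)
results = [worker.process(name) for name in ["a.jpg", "b.jpg", "c.jpg"]]
print(f"images processed: {len(results)}, model loads: {load_calls['count']}")
```

In a real setup the worker would sit behind a queue, socket or watched folder so that new images can arrive while the process stays alive.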

Thank you so much, Gines, for the reply. This is a good insight. After looking at another post, I will try to follow @appleweed's wrapper mechanism.
https://github.com/appleweed/OpenPose-Background-Process

I am fairly new to this, so I may either ask @appleweed for advice or come back to this post.
