Xgboost: Python predict() does not work with multiprocessing

Created on 11 Mar 2019 · 11 comments · Source: dmlc/xgboost

Related:

It has been reported that the predict() function in the Python interface does not work well with multiprocessing. We should find a way to allow multiple processes to predict with the same model simultaneously.

known-issue bug

Most helpful comment

Is there any update on this? It seems that this is a complete stopper for using xgb in production...?

All 11 comments

Is there any update on this? It seems that this is a complete stopper for using xgb in production...?

Any update? I am just discovering this now. This is indeed a problem...

It has been reported that the predict() function in the Python interface does not work well with multiprocessing. We should find a way to allow multiple processes to predict with the same model simultaneously.

What do you mean exactly?

In my context, I have a pool of processes that each load a pickled model and then try to make predictions, which is where I get the dmlc::Error.
Note that I also tried with a single process in the pool and still got the same error.
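
A minimal sketch of this kind of setup (the file name, data, and pool size here are illustrative placeholders, not the exact code):

```python
# Rough sketch only: a pool of workers, each unpickling the model and calling predict().
# "model.pkl" and the random data are placeholders for illustration.
import pickle
from multiprocessing import Pool

import numpy as np
import xgboost as xgb

def predict_chunk(rows):
    with open("model.pkl", "rb") as f:          # each worker loads the pickled Booster
        booster = pickle.load(f)
    return booster.predict(xgb.DMatrix(rows))   # this is where the dmlc::Error shows up

if __name__ == "__main__":
    data = np.random.rand(1000, 10).astype(np.float32)
    with Pool(processes=4) as pool:             # a pool with a single process fails the same way
        preds = pool.map(predict_chunk, np.array_split(data, 4))
```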

Here is the error stack:

terminate called after throwing an instance of 'dmlc::Error'
  what():  [13:08:08] /workspace/include/xgboost/./../../src/common/common.h:41: /workspace/src/common/host_device_vector.cu: 150: initialization error

Stack trace returned 10 entries:
[bt] (0) /home/.../.local/lib/python3.6/site-packages/xgboost/./lib/libxgboost.so(dmlc::StackTrace(unsigned long)+0x47) [0x7f14b4c0ffc7]
[bt] (1) /home/.../.local/lib/python3.6/site-packages/xgboost/./lib/libxgboost.so(dmlc::LogMessageFatal::~LogMessageFatal()+0x1d) [0x7f14b4c1042d]
[bt] (2) /home/.../.local/lib/python3.6/site-packages/xgboost/./lib/libxgboost.so(dh::ThrowOnCudaError(cudaError, char const*, int)+0x123) [0x7f14b4de2153]
[bt] (3) /home/.../.local/lib/python3.6/site-packages/xgboost/./lib/libxgboost.so(xgboost::HostDeviceVectorImpl<float>::DeviceShard::Init(xgboost::HostDeviceVectorImpl<float>*, int)+0x278) [0x7f14b4e3fb68]
[bt] (4) /home/.../.local/lib/python3.6/site-packages/xgboost/./lib/libxgboost.so(+0x33b261) [0x7f14b4e17261]
[bt] (5) /home/.../.local/lib/python3.6/site-packages/xgboost/./lib/libxgboost.so(xgboost::HostDeviceVectorImpl<float>::Reshard(xgboost::GPUDistribution const&)+0x1b6) [0x7f14b4e40d26]
[bt] (6) /home/.../.local/lib/python3.6/site-packages/xgboost/./lib/libxgboost.so(xgboost::obj::RegLossObj<xgboost::obj::LinearSquareLoss>::PredTransform(xgboost::HostDeviceVector<float>*)+0xf9) [0x7f14b4e0d239]
[bt] (7) /home/.../.local/lib/python3.6/site-packages/xgboost/./lib/libxgboost.so(XGBoosterPredict+0x107) [0x7f14b4c08be7]
[bt] (8) /usr/lib/x86_64-linux-gnu/libffi.so.6(ffi_call_unix64+0x4c) [0x7f14f3b21dae]
[bt] (9) /usr/lib/x86_64-linux-gnu/libffi.so.6(ffi_call+0x22f) [0x7f14f3b2171f]

It seems that CUDA is somehow involved in this. If that helps, I have CUDA v10.0.130 installed on my machine.

I tried to run it on a machine in the cloud that doesn't have any GPU and it seems to work as intended.

I ran into the same problem recently.

I noticed that if you use an older version of xgboost (0.72.1), the problem of "it hangs and doesn't do anything" seems to disappear, but the process takes way too long.

Just for comparison, I used multithreading (which is slower than multiprocessing) on the latest version (0.90).
Results:
- Multiprocessing on v0.72.1: 672 sec
- Multithreading on v0.90: 164 sec

Some related thoughts: nthread is a runtime parameter, so pickling (which is what Python does when spawning a new process) cannot include nthread in the pickle. This can be resolved once #4855 materializes.
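
As a small illustration of what that means for pickled models (the path and thread count below are placeholders), the parameter can simply be set again after loading:

```python
# Sketch: nthread does not survive the pickle, so re-apply it after unpickling.
import pickle

with open("model.pkl", "rb") as f:     # hypothetical pickled Booster
    booster = pickle.load(f)
booster.set_param({"nthread": 4})      # restore the runtime parameter in the new process
```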

I had the same problem when I tried to run it on a machine that has GPUs

Any update on this? I have the same issue here

Thanks for the reminder. Let's see if I can get to this over the weekend.

I implemented a workaround using a ZMQ load balancer.

I cut out the code where the XGBoost models are initialized and loaded in my master script, moved it into an independent Python script, and implemented a worker routine that uses ZMQ load-balancing techniques to serve the XGBoost models in the backend.

Due to the system memory limit, I only started 4 workers, so there are 4 independent XGBoost models serving as backend workers. The frontend is still the multiprocessing part of the original master script, but instead of using the XGBoost models to make predictions directly, it now sends requests to the backend XGBoost workers and receives the predictions from them. No more dmlc errors.

Still, it would be awesome if XGBoost eventually made predict() work with multiprocessing.
Link to the ZMQ load balancer pattern that inspired my workaround

Hi, I implemented a demo that shows how a ZMQ load balancer can help with this issue:
Link to the demo
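
For readers who just want the shape of this workaround without the full demo, here is a simplified sketch (a plain REQ/REP socket with a single worker rather than the full ZMQ load-balancing pattern; the endpoint and model path are placeholders):

```python
# Simplified sketch of the ZMQ workaround: one backend worker owns the model and
# answers prediction requests; the multiprocessing frontend only talks to the socket.
import pickle
import zmq
import xgboost as xgb

ENDPOINT = "tcp://127.0.0.1:5555"   # placeholder endpoint

def prediction_worker(model_path="model.pkl"):
    with open(model_path, "rb") as f:
        booster = pickle.load(f)            # the model lives only in this worker process
    sock = zmq.Context.instance().socket(zmq.REP)
    sock.bind(ENDPOINT)
    while True:
        rows = pickle.loads(sock.recv())    # receive a feature array from a frontend process
        sock.send(pickle.dumps(booster.predict(xgb.DMatrix(rows))))

def request_prediction(rows):
    # Called from the multiprocessing frontend instead of booster.predict().
    sock = zmq.Context.instance().socket(zmq.REQ)
    sock.connect(ENDPOINT)
    sock.send(pickle.dumps(rows))
    return pickle.loads(sock.recv())
```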

Right now another workaround is to avoid initializing XGBoost before forking (e.g. load the pickle only after the fork). Maybe we could use a low-level driver API to maintain the CUDA context ourselves, but simply using a distributed framework like Dask seems much simpler.
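
A minimal sketch of that workaround (the model path and data are placeholders): the parent never touches the model, and each pool worker unpickles it only after the fork via the Pool initializer, so no CUDA context exists before forking.

```python
# Sketch: keep XGBoost/CUDA uninitialized in the parent; load the pickle per worker after the fork.
import pickle
from multiprocessing import Pool

import numpy as np
import xgboost as xgb

_booster = None                      # per-worker global, filled in after the fork

def _load_model(model_path):
    global _booster
    with open(model_path, "rb") as f:
        _booster = pickle.load(f)

def predict_chunk(rows):
    return _booster.predict(xgb.DMatrix(rows))

if __name__ == "__main__":
    data = np.random.rand(1000, 10).astype(np.float32)
    with Pool(processes=4, initializer=_load_model, initargs=("model.pkl",)) as pool:
        preds = pool.map(predict_chunk, np.array_split(data, 4))
```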

A quick update on this: thread-safe prediction and in-place prediction are now supported.
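
A minimal sketch of using that from Python threads, assuming an xgboost version recent enough to have Booster.inplace_predict (the model and data below are synthetic placeholders):

```python
# Sketch: predict from multiple threads on the same Booster using inplace_predict.
from concurrent.futures import ThreadPoolExecutor

import numpy as np
import xgboost as xgb

X = np.random.rand(10_000, 10).astype(np.float32)
y = np.random.rand(10_000).astype(np.float32)
booster = xgb.train({"objective": "reg:squarederror"},
                    xgb.DMatrix(X, label=y), num_boost_round=10)

chunks = np.array_split(X, 8)
with ThreadPoolExecutor(max_workers=8) as ex:
    preds = list(ex.map(booster.inplace_predict, chunks))   # thread-safe per the update above
```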
