Xgboost: Python predict() does not work with multiprocessing

Created on 11 Mar 2019 · 11 comments · Source: dmlc/xgboost

Related:

It has been reported that the predict() function in the Python interface does not work well with multiprocessing. We should find a way to allow multiple processes to predict with the same model simultaneously.

known-issue bug

Most helpful comment

Is there any update on this? It seems that this is a complete stopper for using xgb in production...?

All 11 comments

Is there any update on this? It seems that this is a complete stopper for using xgb in production...?

Any update? I am just discovering this now. This is indeed a problem...

It has been reported that the predict() function in the Python interface does not work well with multiprocessing. We should find a way to allow multiple processes to predict with the same model simultaneously.

What do you mean exactly?

In my context, I have a pool of processes that each load a pickled model and then try to make predictions, which is where I get the dmlc::Error.
Note that I also tried with a single process in the pool and still got the same error.
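
A minimal sketch of this kind of setup (the file name, data, and pool size here are illustrative placeholders, not the exact code):

```python
# Rough sketch only: a pool of workers, each unpickling the model and calling predict().
# "model.pkl" and the random data are placeholders for illustration.
import pickle
from multiprocessing import Pool

import numpy as np
import xgboost as xgb

def predict_chunk(rows):
    with open("model.pkl", "rb") as f:          # each worker loads the pickled Booster
        booster = pickle.load(f)
    return booster.predict(xgb.DMatrix(rows))   # this is where the dmlc::Error shows up

if __name__ == "__main__":
    data = np.random.rand(1000, 10).astype(np.float32)
    with Pool(processes=4) as pool:             # a pool with a single process fails the same way
        preds = pool.map(predict_chunk, np.array_split(data, 4))
```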

Here is the error stack:

terminate called after throwing an instance of 'dmlc::Error'
  what():  [13:08:08] /workspace/include/xgboost/./../../src/common/common.h:41: /workspace/src/common/host_device_vector.cu: 150: initialization error

Stack trace returned 10 entries:
[bt] (0) /home/.../.local/lib/python3.6/site-packages/xgboost/./lib/libxgboost.so(dmlc::StackTrace(unsigned long)+0x47) [0x7f14b4c0ffc7]
[bt] (1) /home/.../.local/lib/python3.6/site-packages/xgboost/./lib/libxgboost.so(dmlc::LogMessageFatal::~LogMessageFatal()+0x1d) [0x7f14b4c1042d]
[bt] (2) /home/.../.local/lib/python3.6/site-packages/xgboost/./lib/libxgboost.so(dh::ThrowOnCudaError(cudaError, char const*, int)+0x123) [0x7f14b4de2153]
[bt] (3) /home/.../.local/lib/python3.6/site-packages/xgboost/./lib/libxgboost.so(xgboost::HostDeviceVectorImpl<float>::DeviceShard::Init(xgboost::HostDeviceVectorImpl<float>*, int)+0x278) [0x7f14b4e3fb68]
[bt] (4) /home/.../.local/lib/python3.6/site-packages/xgboost/./lib/libxgboost.so(+0x33b261) [0x7f14b4e17261]
[bt] (5) /home/.../.local/lib/python3.6/site-packages/xgboost/./lib/libxgboost.so(xgboost::HostDeviceVectorImpl<float>::Reshard(xgboost::GPUDistribution const&)+0x1b6) [0x7f14b4e40d26]
[bt] (6) /home/.../.local/lib/python3.6/site-packages/xgboost/./lib/libxgboost.so(xgboost::obj::RegLossObj<xgboost::obj::LinearSquareLoss>::PredTransform(xgboost::HostDeviceVector<float>*)+0xf9) [0x7f14b4e0d239]
[bt] (7) /home/.../.local/lib/python3.6/site-packages/xgboost/./lib/libxgboost.so(XGBoosterPredict+0x107) [0x7f14b4c08be7]
[bt] (8) /usr/lib/x86_64-linux-gnu/libffi.so.6(ffi_call_unix64+0x4c) [0x7f14f3b21dae]
[bt] (9) /usr/lib/x86_64-linux-gnu/libffi.so.6(ffi_call+0x22f) [0x7f14f3b2171f]

It seems that CUDA is somehow involved in this. If that helps, I have CUDA v10.0.130 installed on my machine.

I tried to run it on a machine in the cloud that doesn't have any GPU and it seems to work as intended.

I ran into the same problem recently.

I noticed that if you use an older version of xgboost (0.72.1), the problem of "it hangs and doesn't do anything" seems to disappear, but the process takes way too long.

Just for comparison, I used multithreading (which is slower than multiprocessing) on the latest version (0.90).
Results:
- Multiprocessing on v0.72.1: 672 sec
- Multithreading on v0.90: 164 sec

Some related thoughts: nthread is a runtime parameter, so pickling (which is what Python does when spawning a new process) cannot include nthread in the pickle. This can be resolved once #4855 materializes.
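
As a small illustration of what that means for pickled models (the path and thread count below are placeholders), the parameter can simply be set again after loading:

```python
# Sketch: nthread does not survive the pickle, so re-apply it after unpickling.
import pickle

with open("model.pkl", "rb") as f:     # hypothetical pickled Booster
    booster = pickle.load(f)
booster.set_param({"nthread": 4})      # restore the runtime parameter in the new process
```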

I had the same problem when I tried to run it on a machine that has GPUs

Any update on this? I have the same issue here

Thanks for the reminder. Let's see if I can get to this over the weekend.

I implemented a workaround using a ZMQ load balancer.

I cut out the code where the XGBoost models are initialized and loaded in my master script, moved it into an independent Python script, and implemented a worker routine that uses ZMQ load-balancing techniques to serve the XGBoost models in the backend.

Due to the system memory limit, I only started 4 workers, so there are 4 independent XGBoost models serving as backend workers. The frontend is still the multiprocessing part of the original master script, but instead of using the XGBoost models to make predictions directly, it now sends requests to the backend XGBoost workers and receives the predictions from them. No more dmlc errors.

Still, it would be awesome if XGBoost eventually made predict() work with multiprocessing.
Link to the ZMQ load balancer pattern that inspired my workaround

Hi, I implemented a demo that shows how a ZMQ load balancer can help with this issue:
Link to the demo
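
For readers who just want the shape of this workaround without the full demo, here is a simplified sketch (a plain REQ/REP socket with a single worker rather than the full ZMQ load-balancing pattern; the endpoint and model path are placeholders):

```python
# Simplified sketch of the ZMQ workaround: one backend worker owns the model and
# answers prediction requests; the multiprocessing frontend only talks to the socket.
import pickle
import zmq
import xgboost as xgb

ENDPOINT = "tcp://127.0.0.1:5555"   # placeholder endpoint

def prediction_worker(model_path="model.pkl"):
    with open(model_path, "rb") as f:
        booster = pickle.load(f)            # the model lives only in this worker process
    sock = zmq.Context.instance().socket(zmq.REP)
    sock.bind(ENDPOINT)
    while True:
        rows = pickle.loads(sock.recv())    # receive a feature array from a frontend process
        sock.send(pickle.dumps(booster.predict(xgb.DMatrix(rows))))

def request_prediction(rows):
    # Called from the multiprocessing frontend instead of booster.predict().
    sock = zmq.Context.instance().socket(zmq.REQ)
    sock.connect(ENDPOINT)
    sock.send(pickle.dumps(rows))
    return pickle.loads(sock.recv())
```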

Right now another workaround is to avoid initializing XGBoost before forking (e.g. load the pickle only after the fork). Maybe we could use a low-level driver API to maintain the CUDA context ourselves, but simply using a distributed framework like Dask seems much simpler.
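
A minimal sketch of that workaround (the model path and data are placeholders): the parent never touches the model, and each pool worker unpickles it only after the fork via the Pool initializer, so no CUDA context exists before forking.

```python
# Sketch: keep XGBoost/CUDA uninitialized in the parent; load the pickle per worker after the fork.
import pickle
from multiprocessing import Pool

import numpy as np
import xgboost as xgb

_booster = None                      # per-worker global, filled in after the fork

def _load_model(model_path):
    global _booster
    with open(model_path, "rb") as f:
        _booster = pickle.load(f)

def predict_chunk(rows):
    return _booster.predict(xgb.DMatrix(rows))

if __name__ == "__main__":
    data = np.random.rand(1000, 10).astype(np.float32)
    with Pool(processes=4, initializer=_load_model, initargs=("model.pkl",)) as pool:
        preds = pool.map(predict_chunk, np.array_split(data, 4))
```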

A quick update on this: thread-safe prediction and in-place prediction are now supported.
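
A minimal sketch of using that from Python threads, assuming an xgboost version recent enough to have Booster.inplace_predict (the model and data below are synthetic placeholders):

```python
# Sketch: predict from multiple threads on the same Booster using inplace_predict.
from concurrent.futures import ThreadPoolExecutor

import numpy as np
import xgboost as xgb

X = np.random.rand(10_000, 10).astype(np.float32)
y = np.random.rand(10_000).astype(np.float32)
booster = xgb.train({"objective": "reg:squarederror"},
                    xgb.DMatrix(X, label=y), num_boost_round=10)

chunks = np.array_split(X, 8)
with ThreadPoolExecutor(max_workers=8) as ex:
    preds = list(ex.map(booster.inplace_predict, chunks))   # thread-safe per the update above
```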
