I'm experiencing hard locks when trying to predict labels in parallel using joblib. I tried using multiprocessing directly instead of joblib, and the same thing happens. The function that runs in parallel and calls the Keras model (trained using the TensorFlow backend) just locks up: no prediction is made and the process hangs forever. This happens on both Mac and Linux.
The example in the gist I'm referencing below can't be run without the trained model, but it illustrates the kind of problem I'm talking about. Following the example should be enough to reproduce this issue.
https://gist.github.com/paulomalvar/4457018d4833dd9fd452f46788ef55a1
I tried retraining the models using Theano as the backend and this solved the issue for the code I shared above. However I'm using more models in my project. Retraining using Theano for those models didn't solve the issue.
I tried retraining with a lower output dimensionality for the first layer of neurons and, weirdly enough, that solved the issue. But this is not a solution, just a patch.
How is it possible that a model with more dimensions hangs my code when trying to parallelize it?
I did another test on a machine that has 256GB of RAM and the same issue happens. And the machine is only using around 1% of all the available memory so this is not a memory issue.
How can this be solved? Thanks.
Same problem here. I would like to load a model from json, load the weights in the parent process. Then run some predictions of this model in different child processes. I do not know if it is possible. I tried with multiprocessing module and had the same troubles.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 30 days if no further activity occurs, but feel free to re-open a closed issue if needed.
Same issue here
Same issue, and I found some related info here:
https://stackoverflow.com/questions/42504669/keras-tensorflow-and-multiprocessing-in-python
@JordanPeltier Did you find any solution for that?
I have a whole pipeline set up; one of the steps is to predict something. I load everything in the parent and then spawn a bunch of child processes, but it hangs at the predict step.
I can't seem to find a solution for that.
Not possible. You need to load the model in the child processes (in the run method).
Switching to Theano as the backend solved the issue
Won't it then replicate the memory used?
I mean, my idea is to load the model once and parallelize the predict method, with several child processes that just call predict. Otherwise it becomes memory-consuming, which it already is :(
@paulomalvar Hi, have you solved your problem? I encountered the same one. I loaded the trained model in the main process and tried to use model.predict in child processes, but it hangs forever...
solution from https://github.com/fchollet/keras/issues/6124
model = Model(inputs=[l_input], outputs=[out_actions, out_value])
model._make_predict_function() # have to initialize before threading
@yhcharles thanks, but it doesn't work. I don't know why model.predict works when I use Python threading but hangs when using multiprocessing.
I'm having the same problem. First, I tried passing the loaded model object into the child process. The call to model.predict() hangs. Then, I tried passing the model's path in and loading the model in the child before using it, but the call to keras.models.load_model() hangs in the child process too!
Has anyone gotten this to work?
I solved this by using process-based parallelism instead of thread-based parallelism. I think the trained model cannot be loaded by several threads at the same time.
I am having this exact same issue when trying to load the model in child processes.
The hanging seems to occur when the weights are getting set on the model.
The weights themselves are stored in an array, which the child processes have access to.
Attempting to load the weights from file inside each of the child processes also causes the hang.
I have tried multiple different ways of resolving this but at the moment it seems there's no way for Keras and Tensorflow to run prediction in child processes - it simply doesn't seem to be written in a manner that supports this.
This may be a Keras problem but I suspect it could be due to Tensorflow requiring two threads to work:
https://github.com/tensorflow/tensorflow/issues/11066
If anyone makes any progress on this I'd be interested to hear it.
@kebwi Loading the model in the run() of the child process works, but it's very slow.
And remember to call K.clear_session() at the end of run(), which manually releases resources.
This issue is still a thing in 2019. However, I circumvented the halting by loading the model in the subprocess and let the processes communicate via queues. Here is a minimal example using the webcam and VGG16 for feature extraction: https://stackoverflow.com/a/54881298/2084944
Load the trained model once, then apply it in multiple processes to make predictions. I tried a couple of different ways, but no success. Does anyone have good examples of this?
I am facing the same issue. I saved the model in a process and when trying to load the model using model = tf.keras.models.load_model('model.h5') in the child process, it hangs forever.
Any solution to this?
Haha... facing this issue in 2020. I guess model.predict runs multithreaded internally, so a lock is needed. The child process clones the lock's state, which may cause a deadlock.
@y18zhou I think you're on the right track. Setting tf.config.threading.set_intra_op_parallelism_threads(1) seems to work.
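For reference, a configuration fragment (untested sketch, assuming TF 2.x; the model path is hypothetical) showing where those calls have to go: before any op or model exists, otherwise TensorFlow raises a RuntimeError.

```python
import tensorflow as tf

# Must be called before any op, session, or model is created.
tf.config.threading.set_intra_op_parallelism_threads(1)
tf.config.threading.set_inter_op_parallelism_threads(1)

# Only now load the model; children forked after this point should
# no longer inherit a lock held by a TF background thread.
model = tf.keras.models.load_model("model.h5")  # hypothetical path
```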
It looks like loading a model into memory (i.e. via load_model) runs on multiple threads, and if you create a child process before this has finished, the child process will inherit the lock and hang. Setting mp.set_start_method('spawn') seems to make no difference.