See the discussion on #2515 for details. In summary, if a user uses the Client object or multiprocessing in an unexpected way (especially when first starting out), they can run into this error:
RuntimeError:
An attempt has been made to start a new process before the
current process has finished its bootstrapping phase.
This probably means that you are not using fork to start your
child processes and you have forgotten to use the proper idiom
in the main module:
    if __name__ == '__main__':
        freeze_support()
        ...
The "freeze_support()" line can be omitted if the program
is not going to be frozen to produce an executable.
The usual fix is to make sure the script's startup code runs inside an if __name__ == '__main__': block. If that doesn't apply to you, you usually have to do some fancier handling. Is there a way for distributed/dask to catch the above exception and provide a simpler or clearer message?
To be clear, this fails if run within a script (it works fine in an interpreter):
from dask.distributed import Client
client = Client()
# user code follows
The solution is this:
from dask.distributed import Client
if __name__ == '__main__':
    client = Client()
    # user code follows
This is exactly the same problem that exists with anything in Python that spins up processes, such as multiprocessing.Pool().
Another alternative is to avoid processes altogether with Client(processes=False), but that has other performance implications.
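As a rough standard-library analogy for the thread-based alternative (this is not dask code; `square` and `run_in_threads` are hypothetical names used only for illustration), a thread pool can be created at module level without the `if __name__ == '__main__':` guard, because no new process has to import the script:

```python
from concurrent.futures import ThreadPoolExecutor


def square(x):
    """Toy task standing in for real user code."""
    return x * x


def run_in_threads(values):
    """Map square() over values using worker threads inside this process."""
    # Threads live in the current process, so nothing re-imports the script.
    with ThreadPoolExecutor(max_workers=2) as executor:
        return list(executor.map(square, values))


# No __main__ guard is needed here: no child process is started.
results = run_in_threads(range(5))
print(results)
```

The trade-off is the one hinted at above: in a standard CPython build, threads share one interpreter lock, so CPU-bound pure-Python work generally does not run in parallel this way.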
This works for me. I am curious, though: why does this fix the issue?
@rgoggins This has to do with how the additional/child processes are created. Python has to "import" your script(s) in every child process. If you don't put initialization code (code that should only be run once) into the if __name__ == "__main__": block then it gets run for every child process (at "import" time). This can end up with an infinite recursion as each process creates child processes that create more child processes and so on.
This is how I understand it at least.
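The re-import behaviour described above can be observed with a small sketch, assuming the "spawn" start method. The script name `demo_spawn.py` and the helper `run_demo` are made up for this illustration. The helper writes a tiny script to a temporary directory, runs it in a fresh interpreter, and returns what it printed:

```python
import subprocess
import sys
import tempfile
import textwrap
from pathlib import Path

# Hypothetical demo script: prints its module name when its top-level code
# runs, then starts one child process using the "spawn" start method.
SCRIPT = textwrap.dedent(
    """
    import multiprocessing as mp

    # Module-level code: runs in the parent and again in the spawned child.
    print("module level, __name__ =", __name__, flush=True)

    def work():
        print("child work done", flush=True)

    if __name__ == "__main__":
        # Guarded code: runs only in the process the user started.
        ctx = mp.get_context("spawn")
        p = ctx.Process(target=work)
        p.start()
        p.join()
    """
)


def run_demo():
    """Run the demo script in a fresh interpreter and return its output lines."""
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "demo_spawn.py"
        path.write_text(SCRIPT)
        result = subprocess.run(
            [sys.executable, str(path)],
            capture_output=True,
            text=True,
            check=True,
        )
    return result.stdout.splitlines()


if __name__ == "__main__":
    for line in run_demo():
        print(line)
```

The module-level line is printed twice: once with `__main__` in the parent and once with `__mp_main__` in the child. Because the child sees a different `__name__`, the guarded block is skipped there, which is what stops it from starting further children.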