Starting with Python 3.8 the default start method for multiprocessing on macOS is spawn instead of fork - more info here and here. This means that some features that used to work will not continue to work on macOS. I specifically ran into this when using a DensityDist where the function was defined in place as follows because the function is not picklable:
import pymc3 as pm
with pm.Model() as model:
def val(x):
return -0.5 * pm.math.sum(x ** 2)
x = pm.Flat("x")
pm.DensityDist("obs", val)
pm.sample(chains=2, cores=2)
This fails with the following error that does not actually get caught. I was running in a Jupyter notebook and the notebook just hangs and this gets written to the command line:
Traceback (most recent call last):
File "<string>", line 1, in <module>
File "/Users/dforeman/research/projects/exoplanet-dev/case-studies/env/lib/python3.8/multiprocessing/spawn.py", line 116, in spawn_main
exitcode = _main(fd, parent_sentinel)
File "/Users/dforeman/research/projects/exoplanet-dev/case-studies/env/lib/python3.8/multiprocessing/spawn.py", line 126, in _main
self = reduction.pickle.load(from_parent)
AttributeError: Can't get attribute 'val' on <module '__main__' (built-in)>
A temporary workaround that I found was to execute
import multiprocessing as mp
mp.set_start_method("fork")
before importing PyMC3, but my understanding is that this won't be a long term solution. It doesn't seem like there's anything obvious to do about this, but I wanted to bring it up here in case folks had clever ideas!
Wow -- haven't played around with this yet, but maybe we add mp.set_start_method to either the context manager (it'll fail if you don't use it in context mode, but will exit gracefully) or to _mp_sample.
And thus Mac now enters the glorious realm of broken pipes, which was exclusive to windows before py38!
As you said, this can be prevented by trying to set the fork start method, but sticking to the default spawn will get us some nasty theano bugs that we never managed to properly diagnose and fix.
The only thing that we did was to catch the broken pipe error on Windows and change the message to instruct users on where to look for the true error's traceback.
Hi @dfm (and thanks for black_nbconvert 😉 )
Thanks for reporting! I'm closing since I think this is fixed by #3919, but feel free to reopen if needed.
Unfortunately I can't seem to re-open the issue, but I can confirm that #3919 does not fix this issue: https://gist.github.com/dfm/413938425166b0f30fba47a5d7639da7
I can confirm that "fork" rather than "forkserver" does work - what was the argument for using "forkserver" instead of "fork"?
Note that the changes in #3919 have been superseded by a recent PR by @aseyboldt. Does this seem to solve the issue you're having? Also, which version of PyMC3 are you running?
For context, the argument for using forkserver was given here.
Sorry - no I'm running the master branch and the changes made in that PR don't help - the problem is fork vs. forkserver, not where it's defined!
I don't really understand enough of the underlying issues to see the relevant differences. Perhaps this doesn't matter if people don't regularly use DensityDist in notebooks on macOS (+ maybe Windows?) because that seems to be the only place where this matters, so no stress. I just wanted to make sure that it was known that this issue is not fixed! I think it makes sense to re-open, but I can't since that option seems to only be available to maintainers.
Yeah this issue is definitely weird. From my experience, sometimes it works with forkserver and not with fork, other times it's the other way around. So this is really puzzling -- maybe there even are several issues mixed up in here 🤷♂️ In the little digging I did at the time though, I always saw the recommendation to use forkserver as the default, not fork.
Does the new kwarg in sample help you with that (you can specify the method you want instead of sticking to the default)?
I'm also curious about what @aseyboldt and @lucianopaz think before reopening.
I think for the most part this issue is just inherent weirdness of multiprocessing/pickle. If the logp func can't be pickled, and the multiprocessing context is one of spawn or forkserver, then we can not expect this to work. We do expect an error message that is shown in the notebook, that mentions pickle however. If that doesn't happen, then it really is a bug.
@dfm For your example I get this:
import pymc3 as pm
with pm.Model() as model:
def val(x):
return -0.5 * pm.math.sum(x ** 2)
x = pm.Normal("x")
pm.DensityDist("obs", val)
pm.sample(chains=2, cores=2, mp_ctx="forkserver")
``` During handling of the above exception, another exception occurred: Traceback (most recent call last): The above exception was the direct cause of the following exception: ValueError Traceback (most recent call last) The above exception was the direct cause of the following exception: Just to show that this isn't specific to pymc3 in any way: In the notebook this doesn't work either. The child process dies and prints the unpickling error to the terminal where the notebook was started.
RemoteTraceback Traceback (most recent call last)
RemoteTraceback:
"""
Traceback (most recent call last):
File "/home/adr/git/pymc3/pymc3/parallel_sampling.py", line 114, in _unpickle_step_method
self._step_method = pickle.loads(self._step_method)
AttributeError: Can't get attribute 'val' on
File "/home/adr/git/pymc3/pymc3/parallel_sampling.py", line 135, in run
self._unpickle_step_method()
File "/home/adr/git/pymc3/pymc3/parallel_sampling.py", line 116, in _unpickle_step_method
raise ValueError(unpickle_error)
ValueError: The model could not be unpickled. This is required for sampling with more than one core and multiprocessing context spawn or forkserver.
"""
ValueError: The model could not be unpickled. This is required for sampling with more than one core and multiprocessing context spawn or forkserver.
...If you specify the multiprocessing context (please use `mp_ctx`, not `multiprocessing.set_start_method`) the behaviour should also be the same on all platforms (except for windows, which will error out on any context other than `spawn`).
We can avoid this problem at least in this case by using dill instead of pickle (assuming it is installed):
```python
import pymc3 as pm
with pm.Model() as model:
def val(x):
return -0.5 * pm.math.sum(x ** 2)
x = pm.Normal("x")
pm.DensityDist("obs", val)
pm.sample(chains=2, cores=2, mp_ctx="forkserver", pickle_backend='dill')
import multiprocessing
ctx = multiprocessing.get_context("forkserver")
def foo(x):
print(2 * x)
proc = ctx.Process(target=foo, args=(2.,))
proc.start()
I guess we could actually work around this by depending on dill, and overwrite __setstate__ in DensityDist, to always use dill to serialize that function. This would not change how it behaves with ctx=fork, but would fix densitydist in notebooks on windows and mac.
I guess there is a chance that it could break something that currently works in a script running on windows or mac though...
Somthing like this:
class DensityDist(Distribution):
...
def __getstate__(self):
logp = dill.dumps(self.logp)
vals = self.__dict__.copy()
vals['logp'] = logp
return vals
def __setstate__(self, vals):
vals['logp'] = dill.loads(vals['logp'])
self.__dict__ = vals'
@aseyboldt: I think that the suggestion to use dill for DensityDist is a great idea!
I guess there is a chance that it could break something that currently works in a script running on windows or mac though...
I haven't used dill much, are there cases where it can't pickle things that pickle can? Or are you worried about something else?
@AlexAndorra: thanks for the response! I was suggesting that it be reopened because the bug still exists even if it's not PyMC3 specific. If it's not re-opened there should probably at least be a wontfix label or something to let people know who might run into the same issue :D
I haven't used dill much, are there cases where it can't pickle things that pickle can?
I haven't used it much either, and I really do not know. That is why I'm a bit worried.
But then, just using dill to serialize the functions of DensityDist is a pretty confined use case, I can't think of much that this might break that isn't broken already...
I'll also reopen the issue.