Hi! I'm trying to speed up MCMC sampling for Bayesian multinomial regression on a GPU; the code is below. On CPU, pymc3 uses as many cores as it can while sampling 4 chains at once. Multicore sampling fails on GPU (pm.sample with cores=4): it raises “RuntimeError: Chain 2 fails”. Is there any way to parallelize the computation on GPU (by using more cores or more GPUs)? At the moment, sampling on a CPU with 32 cores is faster than sampling on 1 core with 1 GPU.
import numpy as np
import pymc3 as pm
import theano
import theano.tensor as tt

# n_classes, n_features, X_tr and y_tr are defined elsewhere.
def make_model(X, y):
    with pm.Model() as model:
        # Hyperpriors on the scale of the intercepts and coefficients
        sd_alpha = pm.HalfCauchy('sd_alpha', beta=10)
        sd_beta = pm.HalfCauchy('sd_beta', beta=10)
        alpha = pm.Normal('alpha', mu=0, sd=sd_alpha, shape=n_classes)
        beta = pm.Normal('beta', mu=0, sd=sd_beta, shape=(n_features, n_classes))
        # Linear predictor and class probabilities
        mu = tt.dot(X, beta) + alpha
        p = pm.Deterministic('p', tt.nnet.softmax(mu))
        label = pm.Categorical('label', p=p, observed=y)
    return model

X_shared = theano.shared(np.asarray(X_tr, theano.config.floatX))
y_shared = theano.shared(np.asarray(y_tr, theano.config.floatX))
model = make_model(X_shared, y_shared)

with model:
    niter = 500
    tune = 500
    step = pm.NUTS()
    trace = pm.sample(niter, tune=tune, chains=4, cores=4, init='jitter+adapt_diag', step=step)
The full log is attached as a file.
log.txt
Versions and components:
GPU sampling and multi-GPU are not supported, so I'm afraid you are largely on your own in figuring this out (of course great if someone else could chime in here). E.g. https://github.com/pymc-devs/pymc3/issues/2040 or @Spaak.
Thanks for the answer.
For now, the workaround is to launch sampling on different GPUs independently and merge the chains afterwards.
@adelkhafizova Do you see a speed-up with GPU sampling?
@adelkhafizova and I performed a few tests in different configurations. For our model, we observed about a 9x speed-up of NUTS sampling when run on 4 GPUs in parallel, compared to 32 CPU cores.
The model is a hierarchical multinomial regression with some additional terms.
@adelkhafizova and @dsvolk, thanks for reporting back on this! I am wondering if you have sample code for how you launched sampling on independent GPUs, and code for how you merged chains? Did you have to launch different Python processes for this? (I'm asking mostly out of curiosity, with also the hope that I might be able to implement this myself.)
I am very interested as well - would be great if you can share your code.
it would also make for a killer blog post.
In short, the main idea is to launch one process per CUDA device in parallel, dump the traces independently, and use the merge_traces function from pymc3.backends.base to get a MultiTrace object with several chains. There is a small bug in merge_traces (it sets a property that has no setter method defined). It is also important to set a different chain_idx in each process while sampling. I hope to prepare a more thorough example in a few days.
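For readers who want something concrete, here is a minimal sketch of the worker and merge steps described above. It is not the original authors' code (that was never posted). The helper names `sample_one_chain`, `load_pickles` and `merge_pickled_traces`, the seed offset, and the pickle file layout are all hypothetical; only `pm.sample(..., chain_idx=...)` and `pymc3.backends.base.merge_traces` come from pymc3 itself. It assumes a pymc3 version that includes the merge_traces fix mentioned in this thread.

```python
import pickle


def sample_one_chain(model, chain_idx, out_path, draws=500, tune=500):
    """Sample one chain in the current process and pickle the MultiTrace.

    `model` is a pm.Model built in this process. Each process must pass a
    different chain_idx so that chain numbers do not collide when merging.
    """
    import pymc3 as pm  # imported lazily, after the Theano device is configured

    with model:
        trace = pm.sample(
            draws,
            tune=tune,
            chains=1,
            cores=1,
            chain_idx=chain_idx,
            random_seed=1000 + chain_idx,  # hypothetical: a distinct seed per chain
        )
    with open(out_path, "wb") as f:
        pickle.dump(trace, f)


def load_pickles(paths):
    """Load pickled objects from `paths`, preserving the given order."""
    loaded = []
    for path in paths:
        with open(path, "rb") as f:
            loaded.append(pickle.load(f))
    return loaded


def merge_pickled_traces(paths):
    """Merge single-chain MultiTrace pickles into one multi-chain MultiTrace."""
    from pymc3.backends.base import merge_traces

    # merge_traces raises ValueError if two traces share a chain number.
    return merge_traces(load_pickles(paths))
```

Each GPU process would call `sample_one_chain` with its own `chain_idx`, and a separate script would call `merge_pickled_traces` on the resulting files once all processes have finished.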
A fix for the bug in merge_traces():
https://github.com/pymc-devs/pymc3/pull/3374
@dsvolk @adelkhafizova Were you able to post sample code? I'm running into a similar issue with a similar model. Your example would be very helpful. Additionally, I'm even having trouble getting one GPU to run more efficiently than an 8-core CPU.
I no longer have access to the code, since @adelkhafizova and I do not work for that company anymore. But the idea is to launch a separate Linux process for each GPU from a bash script. For each process you explicitly specify a different GPU for Theano, and save (pickle) the trace at the end. Then another Python script merges these traces into a MultiTrace.
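The launch step can be sketched roughly as follows. The original used a bash script; this is a Python equivalent written for illustration, not the authors' code. It assumes Theano's gpuarray backend, where devices are selected with `THEANO_FLAGS=device=cuda0`, `device=cuda1`, and so on. The helper names, the `--chain-idx` and `--out` arguments, and the `trace_<i>.pkl` file names are hypothetical; the worker script is expected to parse them itself.

```python
import os
import subprocess
import sys


def worker_env(gpu_index, base_env=None):
    """Return a copy of the environment that pins Theano to one GPU.

    Note: this overwrites any THEANO_FLAGS already set in the environment.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["THEANO_FLAGS"] = "device=cuda{},floatX=float32".format(gpu_index)
    return env


def worker_command(script, gpu_index, out_dir):
    """Build the command line for one worker (argument names are hypothetical)."""
    out_path = os.path.join(out_dir, "trace_{}.pkl".format(gpu_index))
    return [sys.executable, script,
            "--chain-idx", str(gpu_index),
            "--out", out_path]


def launch_workers(script, n_gpus, out_dir):
    """Start one process per GPU, wait for all of them, return the exit codes."""
    procs = [
        subprocess.Popen(worker_command(script, i, out_dir), env=worker_env(i))
        for i in range(n_gpus)
    ]
    return [p.wait() for p in procs]
```

With four GPUs, `launch_workers("sample_worker.py", 4, "traces")` would start four processes, each seeing its own device and using its GPU index as the chain index. The device has to be set through the environment before Theano is imported in the worker, which is why each chain runs in its own process.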
@dsvolk Thanks! Quick follow-up: did you do anything additional to optimize for GPU? Performance on a single GPU seems to be the same as or worse than on CPU. Before introducing multiple GPUs, I want to make sure I can get one to run correctly.
Nothing special, as far as I can remember. We were using four NVIDIA GeForce GTX 1080 cards. I do not remember the CPU specs, though.
See #1246