Pymc3: Multicore GPU/MultiGPU sampling support

Created on 16 Jan 2019 · 14 Comments · Source: pymc-devs/pymc3

Hi! I’m trying to speed up MCMC sampling for a Bayesian multinomial regression by running it on a GPU; the code is below. On CPU, PyMC3 uses as many cores as it can while sampling 4 chains at once. On GPU, multicore sampling fails: calling pm.sample with cores=4 raises “RuntimeError: Chain 2 fails”. Is there any way to parallelize the computation on GPU, using either more cores or more GPUs? At the moment, sampling on a CPU with 32 cores is faster than sampling on 1 core with 1 GPU.

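import numpy as np
import pymc3 as pm
import theano
import theano.tensor as tt

# X_tr (features) and y_tr (class labels) are the training arrays, loaded elsewhere.
# Assumed here: the problem sizes are taken from the training data.
n_features = X_tr.shape[1]
n_classes = len(np.unique(y_tr))

def make_model(X, y):
    with pm.Model() as model:
        # Hierarchical priors on the scale of the intercepts and coefficients
        sd_alpha = pm.HalfCauchy('sd_alpha', beta=10)
        sd_beta = pm.HalfCauchy('sd_beta', beta=10)
        alpha = pm.Normal('alpha', mu=0, sd=sd_alpha, shape=n_classes)
        beta = pm.Normal('beta', mu=0, sd=sd_beta, shape=(n_features, n_classes))
        # Linear predictor and per-class probabilities
        mu = tt.dot(X, beta) + alpha
        p = pm.Deterministic('p', tt.nnet.softmax(mu))
        label = pm.Categorical('label', p=p, observed=y)
    return model

X_shared = theano.shared(np.asarray(X_tr, theano.config.floatX))
y_shared = theano.shared(np.asarray(y_tr, theano.config.floatX))
model = make_model(X_shared, y_shared)

with model:
    niter = 500
    tune = 500
    step = pm.NUTS()
    trace = pm.sample(niter, tune=tune, chains=4, cores=4, init='jitter+adapt_diag', step=step)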

The full log is attached as a file.
log.txt

Versions and components:

PyMC3 Version: 3.5
Theano Version: 1.0.2, CUDA 9.0, cuDNN 7005
Python Version: 3.7
Operating system: Ubuntu 16.04.3
How did you install PyMC3: conda

gpu

adelkhafizova

All 14 comments

GPU sampling and multi-GPU are not supported, so I'm afraid you are largely on your own in figuring this out (of course great if someone else could chime in here). E.g. https://github.com/pymc-devs/pymc3/issues/2040 or @Spaak.

twiecki on 16 Jan 2019

Thanks for the answer.
For now the workaround is to launch sampling on different GPUs independently and merge the chains afterwards.

adelkhafizova on 25 Jan 2019

@adelkhafizova Do you see a speed-up with GPU sampling?

twiecki on 25 Jan 2019

@adelkhafizova and I ran a few tests in different configurations. For our model, we observed about a 9x speed-up of NUTS sampling when run on 4 GPUs in parallel, compared to 32 CPU cores.

The model is a hierarchical multinomial regression with some additional terms.

dsvolk on 30 Jan 2019

👀2 🚀2

@adelkhafizova and @dsvolk, thanks for reporting back on this! I am wondering if you have sample code for how you launched sampling on independent GPUs, and code for how you merged chains? Did you have to launch different Python processes for this? (I'm asking mostly out of curiosity, with also the hope that I might be able to implement this myself.)

ericmjl on 30 Jan 2019

I am very interested as well - would be great if you can share your code.

junpenglao on 30 Jan 2019

it would also make for a killer blog post.

twiecki on 30 Jan 2019

👍3 😄1

In short, the main idea is to launch one process per CUDA device in parallel, dump the traces independently, and use the merge_traces function from pymc3.backends.base to get a MultiTrace object with several chains. There is a small bug in merge_traces (it sets a property that has no setter method). It is also important to set a different chain_idx in each process when sampling. I hope to prepare a more thorough example in a few days.
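
As a rough illustration of the per-device step described above, each process might look something like the sketch below. This is not the original code from the thread: run_worker, trace_path, build_model, and the chain_<idx>.pkl file layout are hypothetical names chosen for the example, and reusing the GPU index as chain_idx is an assumption. Theano reads THEANO_FLAGS when it is first imported, so the sketch selects the device before importing PyMC3.

```python
import os
import pickle


def trace_path(out_dir, chain_idx):
    """Hypothetical file layout: one pickle per chain."""
    return os.path.join(out_dir, "chain_{}.pkl".format(chain_idx))


def run_worker(gpu_index, build_model, out_dir, draws=500, tune=500):
    """Sample one chain on one GPU and pickle the resulting trace.

    Meant to be called once per process, with a different gpu_index each time.
    """
    # Theano picks its device at import time, so set the flag first.
    # Note: this overwrites any THEANO_FLAGS already set in the environment.
    os.environ["THEANO_FLAGS"] = "device=cuda{},floatX=float32".format(gpu_index)
    import pymc3 as pm  # imported late on purpose, after the flag is set

    model = build_model()
    with model:
        trace = pm.sample(
            draws,
            tune=tune,
            chains=1,
            cores=1,
            chain_idx=gpu_index,  # must differ between processes
            init="jitter+adapt_diag",
        )

    os.makedirs(out_dir, exist_ok=True)
    with open(trace_path(out_dir, gpu_index), "wb") as f:
        pickle.dump(trace, f)
```

Here build_model is expected to create any Theano shared variables itself, so that nothing imports Theano before the device flag is set.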

adelkhafizova on 30 Jan 2019

👍3

A fix for the bug in merge_traces():
https://github.com/pymc-devs/pymc3/pull/3374
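
With that fix in place, the merging step could be sketched roughly as follows. The helper names (find_trace_files, merge_pickled_traces) and the chain_*.pkl naming are hypothetical, chosen only for illustration; merge_traces is the function from pymc3.backends.base mentioned in this thread. The sketch assumes each pickle holds a MultiTrace sampled with a distinct chain_idx.

```python
import glob
import os
import pickle


def find_trace_files(out_dir):
    """Return the pickled per-chain trace files in out_dir (assumed naming)."""
    return sorted(glob.glob(os.path.join(out_dir, "chain_*.pkl")))


def merge_pickled_traces(out_dir):
    """Load every pickled trace in out_dir and merge them into one MultiTrace."""
    paths = find_trace_files(out_dir)
    if not paths:
        raise FileNotFoundError("no chain_*.pkl files found in {!r}".format(out_dir))

    # Imported here so the file-handling helpers work without PyMC3 installed.
    from pymc3.backends.base import merge_traces

    mtraces = []
    for path in paths:
        with open(path, "rb") as f:
            mtraces.append(pickle.load(f))

    # Chains from all traces end up in a single MultiTrace object.
    return merge_traces(mtraces)
```

Unpickling a trace needs PyMC3 importable in the merging process, and pickles should only be loaded from a trusted source.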

dsvolk on 13 Feb 2019

@dsvolk @adelkhafizova Were you able to post sample code? I'm running into a similar issue with a similar model, so your example would be very helpful. I'm also having trouble getting even one GPU to run more efficiently than an 8-core CPU.

muunetheus on 26 Jun 2020

I no longer have access to the code, since @adelkhafizova and I do not work for that company anymore. But the idea is to launch a separate Linux process for each GPU straight from a bash script. For each process you explicitly specify a different GPU for Theano, and you save (pickle) the traces at the end. Then another Python script merges these traces into a MultiTrace.
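
A launcher along those lines could be sketched as below, written in Python with subprocess rather than bash to keep the examples in one language. It is only an illustration under assumptions: sample_worker.py and its --chain-idx argument are hypothetical, and the device=cudaN flag assumes Theano's gpuarray backend.

```python
import os
import subprocess
import sys


def worker_env(gpu_index, base_env=None):
    """Copy the environment and pin Theano to one GPU via THEANO_FLAGS."""
    env = dict(os.environ if base_env is None else base_env)
    # Overwrites any existing THEANO_FLAGS in the copied environment.
    env["THEANO_FLAGS"] = "device=cuda{},floatX=float32".format(gpu_index)
    return env


def launch_workers(n_gpus, script="sample_worker.py"):
    """Start one worker process per GPU and wait for all of them.

    The script name and its --chain-idx argument are hypothetical.
    Returns the list of exit codes, one per worker.
    """
    procs = [
        subprocess.Popen(
            [sys.executable, script, "--chain-idx", str(i)],
            env=worker_env(i),
        )
        for i in range(n_gpus)
    ]
    return [p.wait() for p in procs]
```

Calling launch_workers(4) would start four processes, each seeing a different device and passing a different chain index to the worker script.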

dsvolk on 28 Jun 2020

@dsvolk Thanks! Quick follow-up: did you do anything additional to optimize for GPU? Performance on a single GPU seems to be the same as or worse than on CPU. Before introducing multiple GPUs, I want to make sure I can get one to run correctly.

muunetheus on 30 Jun 2020

Nothing special, as far as I can remember. We were using four NVIDIA GeForce GTX 1080 cards. I do not remember the CPU specs, though.

dsvolk on 30 Jun 2020

See #1246