Turing.jl: GPU support

Created on 17 Jun 2019 · 9 Comments · Source: TuringLang/Turing.jl

It would be nice to have full-fledged GPU support in Turing for cases where parts of the model are embarrassingly parallel. Essentially, this can be achieved using GPU arrays for parameters and/or data. Since we allow arbitrary Julia code in Turing models, this is already largely possible for all the bits except the ~ lines, i.e. VarInfo, observe and assume. Figuring out ways to adapt these components for the GPU is not trivial but may not be too hard.

For a start, GPU parallelism can be allowed in the ~ lines using GPU multivariate distributions. Complex kernels formed by the user may also be possible with things like map-do blocks but we may run into issues with closing over other arrays. I will need to try it out with CuArrays and see how things work. I think data-parallelism may be easier to begin with since the data is not coupled with the complicated VarInfo.

Another mode of GPU parallelism that can be exploited in Turing is in the sampling itself, so that each GPU thread runs its own small MCMC chain. This may also be possible by adapting the VarInfo and data input to work on the GPU. It may be a bigger effort, though, since it may require implementing the MCMC algorithms in a GPU-friendly way using GPU kernels.
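
As a rough illustration of the sampling-level parallelism described above, here is a minimal CPU-only sketch in plain Julia. It is not Turing code and not a GPU kernel: it only shows independent chains running as separate tasks with no shared state. The names `logtarget` and `rw_metropolis`, the standard normal target, and the step size are all assumptions made for the example. A GPU version would additionally need the VarInfo and data adapted to the device.

```julia
using Random
using Base.Threads: @spawn

# Unnormalised log-density of a standard normal target (assumed for illustration).
logtarget(x) = -0.5 * x^2

# Hypothetical helper: one random-walk Metropolis chain driven by its own RNG.
function rw_metropolis(rng, nsamples; x0 = 0.0, step = 1.0)
    samples = Vector{Float64}(undef, nsamples)
    x = x0
    for i in 1:nsamples
        proposal = x + step * randn(rng)
        # Accept with probability min(1, p(proposal) / p(x)), done in log space.
        if log(rand(rng)) < logtarget(proposal) - logtarget(x)
            x = proposal
        end
        samples[i] = x
    end
    return samples
end

# Each chain is an independent task with its own seeded RNG.
nchains, nsamples = 4, 2_000
tasks = [@spawn rw_metropolis(MersenneTwister(seed), nsamples) for seed in 1:nchains]
chains = fetch.(tasks)
```

The tasks only run in parallel if Julia is started with more than one thread (for example by setting `JULIA_NUM_THREADS`). Because the chains never communicate, the only synchronisation point is the final `fetch`.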

This is a brainstorming issue on GPU support for Turing. So papers, ideas, comments and use cases are welcome.

discussion new-feature


mohamed82008

All 9 comments

This paper mentions GPU-accelerated HMC and compares against Stan and PyMC, reporting a huge performance boost, among other relevant topics.

https://arxiv.org/pdf/1701.03757.pdf

Edit: @mohamed82008 It would also be good to check out TensorFlow Probability and the associated PyMC 4.

datnamer on 17 Jun 2019

👍1

Also relevant: https://github.com/vchuravy/GPUifyLoops.jl

datnamer on 17 Jun 2019

Sounds good. Kai and I already talked about GPU support for the HMC adaptation phase. I also talked to the Pyro team and they said they could not obtain any speed improvements when using GPU support for HMC sampling: too few matrix operations.

But I agree we should explore this more so that models that rely heavily on matrix operations can use the GPU. We might be able to use the GPU for PG, but it might easily be that the memory transfer overhead is again too large.

trappmartin on 17 Jun 2019

For sampling parallelism, I think that if the model itself is heavy enough that each thread has enough work to amortize the data transfer and syncing cost, then we could see some speedup.

For model parallelism, I think that if the model is heavy enough, e.g. many data points and/or parameters, so that the computation of logp is expensive and GPU-parallelizable, and we avoid unnecessary syncing, then we could also see some speedup.

I think shortlisting models that fit this description is the first step.

mohamed82008 on 17 Jun 2019

Maybe models with big vectorised observe statements?

cpfiffer on 18 Jun 2019

Yes, but unless computing the distribution's logp at a data point is somewhat complicated, there won't be any speedup from using the GPU; the CPU is really fast! We can still multi-thread it on the CPU, though, using KissThreading.tmapreduce, which will give a speedup anyway.
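
To make the CPU multi-threading idea concrete, here is a minimal sketch of a threaded map-reduce over a pointwise log-density. It uses only Base Julia tasks rather than `KissThreading.tmapreduce`, whose exact signature is not assumed here. The helper names `normal_logpdf` and `threaded_logp`, the normal likelihood, and the chunking scheme are assumptions for illustration, and the sketch expects non-empty data.

```julia
using Base.Threads: @spawn, nthreads

# Log-density of Normal(mu, sigma) at x, written by hand so no packages are needed.
normal_logpdf(x, mu, sigma) = -0.5 * log(2π) - log(sigma) - 0.5 * ((x - mu) / sigma)^2

# Hypothetical helper: split the data into chunks, sum the pointwise logp of each
# chunk in its own task, then add up the partial sums.
function threaded_logp(data, mu, sigma; ntasks = nthreads())
    chunks = Iterators.partition(data, cld(length(data), ntasks))
    tasks = [@spawn sum(x -> normal_logpdf(x, mu, sigma), chunk) for chunk in chunks]
    return sum(fetch, tasks)
end

data = collect(range(-1.0, 1.0; length = 1_000))
serial = sum(x -> normal_logpdf(x, 0.0, 1.0), data)
parallel = threaded_logp(data, 0.0, 1.0)
```

With a per-point computation this cheap, the task overhead may well outweigh the gain for small data sets, which is the same trade-off raised above for the GPU.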

mohamed82008 on 18 Jun 2019

I agree with the above. As a side note, the paper by Dustin Tran evaluates a model that requires matrix operations, so it makes sense that he gets speedups. But I'm sceptical this has anything to do with HMC; it is probably because his computation costs are dominated by evaluating the model logjoint, which can be done effectively on the GPU for this specific model.

trappmartin on 18 Jun 2019

This may be an interesting direction to pursue after I fix #665.

mohamed82008 on 18 Jun 2019

👍1

> But I’m sceptical this has anything to do with HMC but is probably because his computation costs are dominated by evaluating the model logjoint.

This is almost certainly the case, unless you've got some O(> N) stuff, where N is the number of parameters, going on in the HMC adaptation (e.g. Riemannian Manifold HMC). Particularly since the stuff they're doing is deep, as you point out @trappmartin.

willtebbutt on 18 Jun 2019

👍1
