It would be nice to have full-fledged GPU support in Turing for cases where parts of the model are embarrassingly parallel. Essentially, this can be achieved using GPU arrays for parameters and/or data. Since we allow arbitrary Julia code in Turing models, this is already largely possible for all the bits except the ~ lines, i.e. VarInfo, observe and assume. Figuring out ways to adapt these components for the GPU is not trivial but may not be too hard.
For a start, GPU parallelism can be allowed in the ~ lines using GPU multivariate distributions. Complex kernels formed by the user may also be possible with things like map-do blocks but we may run into issues with closing over other arrays. I will need to try it out with CuArrays and see how things work. I think data-parallelism may be easier to begin with since the data is not coupled with the complicated VarInfo.
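To make the data-parallel idea concrete, here is a minimal sketch of a log-likelihood written only with broadcasting and a reduction, so that it is generic over the array type. The function name `normal_loglik` is hypothetical and not part of Turing. The example runs on a plain CPU `Vector`. Whether the same code runs efficiently on a GPU array type such as `CuArray` is an assumption that would need to be tested.

```julia
# Hypothetical helper (not a Turing API): log-density of i.i.d.
# Normal(mu, sigma) observations, written with a broadcast followed by a
# reduction so that it does not depend on the concrete array type.
function normal_loglik(y::AbstractVector{T}, mu::T, sigma::T) where {T<:AbstractFloat}
    n = length(y)
    # Sum of squared standardised residuals.
    sq = sum(abs2, (y .- mu) ./ sigma)
    return -T(0.5) * sq - n * log(sigma) - T(0.5) * n * log(T(2) * T(pi))
end

# CPU usage; swapping `y` for a GPU array is the untested assumption here.
y = Float32[0.5, -1.0, 2.0]
ll = normal_loglik(y, 0.0f0, 1.0f0)
```

Keeping the data in its own array and reducing it to a scalar means the result can be added to the accumulated logp without touching the VarInfo internals.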
Another mode of GPU parallelism that can be exploited in Turing is in the sampling itself, so each GPU thread can do its own little MCMC sampling. This can also be possible by adapting the VarInfo and data input to work on the GPU. This may be a bigger effort though since it may require implementing the MCMC algorithms in a GPU-friendly way using GPU kernels.
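As a rough CPU analogue of "one chain per thread", the sketch below runs several independent chains with `Threads.@threads`. It uses a scalar random-walk Metropolis sampler as a stand-in for HMC and a standard normal as a stand-in for a model's logp. All names (`logp`, `rw_metropolis`, `run_chains`) are hypothetical and nothing here is Turing's implementation. A GPU version would need the sampler written as a kernel, which this does not attempt.

```julia
using Random

# Stand-in for a model's log-density: unnormalised standard normal.
logp(x) = -0.5 * x^2

# Random-walk Metropolis for a scalar parameter (a simple stand-in, not HMC).
function rw_metropolis(rng::AbstractRNG, nsamples::Int; step::Float64 = 1.0)
    x = 0.0
    samples = Vector{Float64}(undef, nsamples)
    for i in 1:nsamples
        proposal = x + step * randn(rng)
        # Accept with probability min(1, p(proposal) / p(x)).
        if log(rand(rng)) < logp(proposal) - logp(x)
            x = proposal
        end
        samples[i] = x
    end
    return samples
end

# Run independent chains in parallel, each with its own seeded RNG so that
# no state is shared between threads.
function run_chains(nchains::Int, nsamples::Int)
    chains = Vector{Vector{Float64}}(undef, nchains)
    Threads.@threads for c in 1:nchains
        chains[c] = rw_metropolis(MersenneTwister(c), nsamples)
    end
    return chains
end

chains = run_chains(4, 1_000)
```

Each chain only writes to its own slot in `chains`, so no locking is needed. The per-chain state (here a single `Float64`) is what would have to live on the device in a GPU version.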
This is a brainstorming issue on GPU support for Turing. So papers, ideas, comments and use cases are welcome.
This paper mentions GPU-accelerated HMC and compares it to Stan and PyMC, reporting a huge performance boost, among other relevant topics.
https://arxiv.org/pdf/1701.03757.pdf
Edit: @mohamed82008 It would also be good to check out TensorFlow Probability and the associated PyMC 4.
Also relevant: https://github.com/vchuravy/GPUifyLoops.jl
Sounds good. Kai and I already talked about GPU support for the HMC adaptation phase. I also talked to the Pyro team, and they said they could not obtain any speed improvements when using GPU support for HMC sampling: too few matrix operations.
But I agree we should explore this more so that models that heavily use matrix operations can use the GPU. We might be able to use the GPU for PG, but it could easily be that the memory transfer overhead is again too large.
For sampling parallelism, I think that if the model itself is heavy enough that each thread has enough work to amortize the data transfer and syncing cost, then we could see some speedup.
For model parallelism, I think that if the model is heavy (e.g. many data points and/or parameters), the computation of logp is expensive and GPU-parallelizable, and we avoid unnecessary syncing, then we could also see some speedup.
I think shortlisting models that fit this description is the first step.
Maybe models with big vectorised observe statements?
Yes but unless computing the distribution's logp at a data point is somewhat complicated, there won't be any speedup from using the GPU; the CPU is really fast! We can still multi-thread it on the CPU though using KissThreading.tmapreduce, which will give a speedup anyways.
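For reference, a chunked multi-threaded reduction of this kind can be sketched with Base Julia alone. This is a stand-in for the idea behind `KissThreading.tmapreduce`, not that package's API. The names `threaded_sum` and `normal_logpdf` are hypothetical.

```julia
# Chunked, multi-threaded sum of f over xs using only Base (Julia >= 1.3).
function threaded_sum(f, xs::AbstractVector; ntasks::Int = Threads.nthreads())
    isempty(xs) && return 0.0
    chunk = cld(length(xs), ntasks)
    # One task per contiguous chunk of indices; each task reduces its chunk.
    tasks = map(Iterators.partition(eachindex(xs), chunk)) do idxs
        Threads.@spawn sum(f, view(xs, idxs))
    end
    # Combine the per-chunk partial sums.
    return sum(fetch, tasks)
end

# Per-observation log-density of a standard normal.
normal_logpdf(y) = -0.5 * y^2 - 0.5 * log(2pi)

ys = collect(range(-1.0, stop = 1.0, length = 1_001))
total = threaded_sum(normal_logpdf, ys)
```

With a cheap `f` like this one, task overhead can dominate for small inputs, which is the same amortization argument as for the GPU.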
I agree with the above. As a side note, the paper by Dustin Tran evaluates on a model that requires matrix operations, so it makes sense that he gets speedups. But I'm sceptical this has anything to do with HMC; it is probably because his computation costs are dominated by evaluating the model logjoint, which can be done effectively on the GPU for this specific model.
This may be an interesting direction to pursue after I fix #665.
> But I'm sceptical this has anything to do with HMC but is probably because his computation costs are dominated by evaluating the model logjoint.
This is almost certainly the case, unless you've got some O(> N) stuff, where N is the number of parameters, going on in the HMC adaptation (e.g. Riemannian Manifold HMC). Particularly since the stuff they're doing is deep, as you point out @trappmartin.