Hello,
As an exercise I was doing some comparisons of an extremely simple EM shower simulation with both Alpaka and CUDA and noticed the Alpaka version was slower, which seemed to be related to calls to random number generation. So then I made the simplest examples I could think of which only initialise and then generate 1 random number in the kernels:
and the same via Alpaka:
https://github.com/shefmarkh/AdePT/blob/mhodgkin_cuda_alpaka/examples/FisherPrice_Alpaka/randGen.cu
and I see the same effect. The Alpaka GPU kernel takes 11 ms or so, whilst the CUDA kernel takes 0.1 ms or so. With a lot of random number throwing the difference added up to a large amount of time in the simple simulation example I was looking at.
I was wondering if this difference is known about and expected? Or am I misunderstanding something about random number generation in Alpaka, and should I use a different, more optimal approach? (E.g. the code may not do exactly the same thing, even though I think it does.)
Cheers,
Mark
There is a hidden difference between the plain CUDA version and the alpaka CUDA version.
Alpaka internally also uses curand_uniform to generate the random numbers. For the plain C++ backends, alpaka uses std::uniform_real_distribution. The tricky part is that alpaka tries to provide a uniform interface on top of those functions.
curand_uniform produces values in the range (0.f, 1.0f]; see: "Output range excludes 0.0f but includes 1.0f."
std::uniform_real_distribution produces values in the range [0.f, 1.0f); see: "uniformly distributed on the interval [a, b)".
We had to decide which of the two versions to adapt to the other in order to get consistent behaviour across multiple backends.
The decision was to adapt the CUDA backend; I do not know why.
Internally alpaka does:
float const fUniformRand = curand_uniform(&state);
float const result = fUniformRand * static_cast<float>( fUniformRand != 1.0f );
I would be interested to know whether you see the same performance characteristics if you adapt your CUDA code accordingly.
Furthermore, I would be interested in the performance of a second CUDA version.
The alpaka code says
NOTE: (1.0f - curand_uniform) does not work, because curand_uniform seems to return denormalized floats around 0.f.
but newer versions of the CUDA documentation explicitly state:
Denormalized floating point outputs are never returned.
Maybe this has been changed since this was implemented.
Could you please try out the performance of the following version?
float const result = 1.0f - curand_uniform(&state);
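The two mappings from (0.f, 1.0f] to [0.f, 1.0f) discussed above can be compared at their endpoints with a small host-side sketch. The helper names maskOne and reflect are hypothetical, and the input 2^-32 is only an illustrative small value, not a documented curand_uniform minimum. One thing worth checking independently of denormals: in IEEE single precision, 1.0f minus a value far below 2^-25 rounds back to 1.0f, so the reflected version can still return 1.0f if the generator produces such small values.

```cuda
#include <cmath>
#include <cstdio>

// Hypothetical helper: zero out the single value 1.0f, as in the
// alpaka snippet quoted above.
__host__ __device__ inline float maskOne(float u)
{
    return u * static_cast<float>(u != 1.0f);
}

// Hypothetical helper: reflect the interval, as in the proposed
// alternative.
__host__ __device__ inline float reflect(float u)
{
    return 1.0f - u;
}

int main()
{
    // Illustrative small input (2^-32); an assumption, not a value
    // taken from the cuRAND documentation.
    float const tiny = std::ldexp(1.0f, -32);

    std::printf("maskOne(1.0f) = %g\n", maskOne(1.0f));
    std::printf("reflect(1.0f) = %g\n", reflect(1.0f));
    std::printf("maskOne(tiny) = %g\n", maskOne(tiny));
    std::printf("reflect(tiny) == 1.0f: %s\n",
                reflect(tiny) == 1.0f ? "yes" : "no");
    return 0;
}
```

Both helpers map 1.0f to 0.0f; they differ only in how they treat very small inputs, which is the case to look at when deciding whether the reflected version gives a strictly half-open range.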
@shefmarkh I have not tested your examples yet, but I saw that you call curand_init per thread and then create one random number.
If possible, I suggest using a separate kernel in which you initialize N RNG states and store them in global memory. In the second kernel, use the pre-created states to get random numbers.
curand_init spills a lot of registers and will reduce the occupancy of the kernel in a real application. By splitting the problem into two kernels you can avoid this, with the drawback that you need some global memory to store all the RNG states.
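A minimal plain-CUDA sketch of this two-kernel pattern might look as follows. The names initStates, drawKernel and drawUniform are hypothetical, the launch configuration is arbitrary, and error checking is omitted for brevity; it is meant as an illustration of the idea, not as the code from either linked example.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>
#include <curand_kernel.h>

// Kernel 1: initialise one RNG state per thread and store it in global memory.
__global__ void initStates(curandState* states, unsigned long long seed, int n)
{
    int const i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        curand_init(seed, i, 0, &states[i]); // subsequence = thread index
}

// Kernel 2: draw one number per thread from the pre-initialised states.
__global__ void drawKernel(curandState* states, float* out, int n)
{
    int const i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
    {
        curandState local = states[i];   // work on a local copy
        out[i] = curand_uniform(&local); // value in (0.f, 1.0f]
        states[i] = local;               // save the advanced state for later draws
    }
}

// Host helper (hypothetical): runs both kernels and returns n samples.
// Error checking is omitted to keep the sketch short.
std::vector<float> drawUniform(int n, unsigned long long seed)
{
    int const threads = 256;
    int const blocks = (n + threads - 1) / threads;

    curandState* states = nullptr;
    float* out = nullptr;
    cudaMalloc(&states, n * sizeof(curandState));
    cudaMalloc(&out, n * sizeof(float));

    initStates<<<blocks, threads>>>(states, seed, n);
    drawKernel<<<blocks, threads>>>(states, out, n);

    std::vector<float> host(n);
    cudaMemcpy(host.data(), out, n * sizeof(float), cudaMemcpyDeviceToHost);

    cudaFree(out);
    cudaFree(states);
    return host;
}

int main()
{
    std::vector<float> const samples = drawUniform(1024, 1984ULL);
    std::printf("first sample: %f\n", samples[0]);
    return 0;
}
```

In a real application initStates would run once, and drawKernel (or the physics kernel that replaces it) would be launched repeatedly against the same states buffer, so the cost of curand_init is paid only once per thread.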
Hello @BenjaminW3
I tried both of your suggestions in the CUDA version and neither significantly affected the timing reported by nvprof.
I played around a bit more and it is the initialization that is slow (not the generation of the number itself), i.e.
auto generator = rand::generator::createDefault(acc, 1984, iTh);
is much slower than:
curand_init(1984, iTh, 0, &local_rand_state);
(So in that sense it is less of an issue, given that in principle one only calls this once per thread. It may have impacted our simple simulation example because the example does not do much yet, and hence in a far more complex example this would not be a major impediment timing-wise.)
Thanks @psychocoderHPC, we will bear that in mind (we did do that in our CUDA example simulation, though not in the alpaka version).
Cheers,
Mark