I hit an error with a Kokkos kernel when executing with CUDA. I tracked the issue down to using a view that is a class member directly inside the kernel. If I make a local copy of that view and use the copy instead, the error goes away. I wrote a simple example program that reproduces the error; it is below. The error can be reproduced with the latest version of Kokkos and cuda/8.0.
I have a parallel_for with a lambda. Using the class member "seq" directly compiles fine but produces a CUDA error at execution. When the view is copied to a local variable "lseq" and that copy is used, it executes properly. Does anyone know what is causing this error?
Thanks
Ramanan
#include <cstdio>
#include <Kokkos_Core.hpp>

class MyClass {
public:
  typedef Kokkos::View<int *> IndexType;
  IndexType seq;

public:
  void set(int NP) {
    seq = IndexType("seq", NP);
    IndexType lseq = seq;  // local copy of the view (shares the same allocation)
    Kokkos::parallel_for(NP, KOKKOS_LAMBDA(const size_t& n) {
      seq(n) = n;     // This compiles fine, but will have a cuda error at runtime
      //lseq(n) = n;  // This compiles and executes correctly
    });
    int sum;
    Kokkos::parallel_reduce(NP, KOKKOS_LAMBDA(const size_t& n, int& lsum) { lsum += lseq(n); }, sum);
    printf("rsa set %2d\n", sum);
  }
};

int main(int argc, char *argv[]) {
  Kokkos::initialize();
  MyClass A;
  A.set(4);
  Kokkos::finalize();
}
I have also run into this issue multiple times. As far as I know, it is a defect (?) in the CUDA compiler, not really a problem with Kokkos. I don't know of any better short-term solution than what you pointed out, i.e. make local copies first.
My guess is that seq(n) gets expanded to this->seq(n), so the pointer 'this' is what gets copied by value to the GPU rather than the view 'seq', and the GPU then attempts to dereference 'this', which is a host pointer. This is speculation on my part as to exactly what the failure mechanism is.
I hope that this can be fixed by NVIDIA and I wonder if it has been brought to their attention. It may also be the case that the C++ standard mandates this behavior, and that CUDA cannot be simultaneously convenient and compliant. Hopefully other Kokkos developers can confirm or deny my speculations here.
I recommend you rename this issue to "CUDA Error when capturing a class member in a lambda".
Dan is exactly right, actually in all respects ;-). This expansion to this-> is exactly what happens, and thus you get an invalid access on the device because 'this' points to host memory. This is also a defect in the C++ standard because a) this behaviour is mandated and b) it breaks asynchronous dispatch. Imagine a member function which returns a future for an asynchronously dispatched lambda. If the class instance goes out of scope before one waits for the future, the captured 'this' pointer points to invalid data.
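The capture behaviour described above can be seen on the host alone, without Kokkos or CUDA. The following is a minimal sketch, assuming a hypothetical struct `Counter` that is not part of Kokkos. A `[=]` lambda in a member function does not copy the member; it captures 'this', so it observes later changes to the object, while a local copy is captured by value. (Note that C++20 deprecates the implicit capture of 'this' through `[=]`, so newer compilers may warn here.)

```cpp
#include <functional>

// Hypothetical class for illustration only; not part of Kokkos.
struct Counter {
  int value = 1;

  // [=] does not copy `value`. It captures `this`, and `value` inside
  // the lambda means `this->value`.
  std::function<int()> capture_member() {
    return [=]() { return value; };
  }

  // Copying the member into a local variable first makes [=] capture
  // that local by value, independent of the object.
  std::function<int()> capture_local_copy() {
    int local = value;
    return [=]() { return local; };
  }
};
```

If `value` is changed after both lambdas are created, the first lambda returns the new value and the second still returns the old one. On a GPU the first form is the problematic one, because the captured pointer refers to host memory.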
To fix this we (i.e. the Kokkos team) actually initiated an addition to the C++ standard which made it into the C++17 standard: you can use the capture clause [=,*this]. The *this part means that a copy of the class instance is captured instead of the 'this' pointer. This fixes both the C++ internal async issue and the deep_copy issue for CUDA.
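As a host-only illustration of the [=,*this] capture, here is a small sketch that needs a C++17 compiler. The struct `Widget` and its member functions are hypothetical names chosen for this example. The lambda created with `[=, *this]` owns its own copy of the object, so it stays valid after the original instance is gone, which is the asynchronous case described above.

```cpp
#include <functional>

// Hypothetical class for illustration only; not part of Kokkos.
struct Widget {
  int value = 1;

  // Captures the pointer: the lambda is only valid while the object lives.
  std::function<int()> capture_this_pointer() {
    return [this]() { return value; };
  }

  // C++17: captures a copy of the whole object, so `value` inside the
  // lambda refers to the copy, not to the original instance.
  std::function<int()> capture_object_copy() {
    return [=, *this]() { return value; };
  }
};
```

Calling the lambda from `capture_object_copy()` after the `Widget` has been destroyed is well defined and returns the value the object had at capture time. Doing the same with the lambda from `capture_this_pointer()` would dereference a dangling pointer.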
CUDA 8.0 supports this feature, and you can use KOKKOS_CLASS_LAMBDA to enable it (even without requesting C++17). Clang 3.9 also supports it if you request the C++17 standard (-std=c++1z). Unfortunately, you are back to device lambdas with CUDA, since there is no CUDA/host compiler combination where both sides support the feature.