PIConGPU triggers an invalid memory access with the latest dev branch.
The error is only visible if PIConGPU is started with cuda-memcheck --report-api-errors no picongpu ...
I can reproduce the error on hemera with an NVIDIA P100.
========= Error: process didn't terminate successfully
========= Fatal UVM GPU fault of type invalid pde due to invalid address
========= during atomic access to address 0x20be00000
Normally cuda-memcheck shows the line with the invalid memory access, but somehow not in this case.
The latest master (release 0.4.3) is also affected.
Some additional information from version 0.4.3, example KelvinHelmholtz:
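One common reason cuda-memcheck cannot attribute a fault to a source line is that the binary was built without device line information. A build-configuration sketch (the CMake invocation is illustrative and has to be adapted to the actual PIConGPU build setup):

```shell
# Rebuild with device line info so cuda-memcheck can map faults to source
# lines; -lineinfo keeps optimizations, unlike full device debug (-G).
cmake -DCMAKE_CUDA_FLAGS="-lineinfo" ..
make
# then rerun the failing case under cuda-memcheck as before
mpiexec -n 1 cuda-memcheck --report-api-errors no ./picongpu -g 24 24 12 -d 1 1 1 --periodic 1 1 1 -s 0
```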
mpiexec -n 1 cuda-memcheck --report-api-errors no ./picongpu -g 24 24 12 -d 1 1 1 --periodic 1 1 1 -s 0
========= CUDA-MEMCHECK
PIConGPUVerbose PHYSICS(1) | Sliding Window is OFF
terminate called after throwing an instance of 'std::runtime_error'
what(): /bigdata/hplsim/scratch/widera/dev/thirdParty/alpaka/include/alpaka/event/EventCudaRt.hpp(195) 'ret = cudaEventQuery( event.m_spEventImpl->m_CudaEvent)' returned error : 'cudaErrorIllegalAddress': 'an illegal memory access was encountered'!
[gp004:71056] *** Process received signal ***
[gp004:71056] Signal: Aborted (6)
[gp004:71056] Signal code: (-6)
[gp004:71056] [ 0] /usr/lib64/libpthread.so.0(+0xf6d0)[0x2aaaaacde6d0]
[gp004:71056] [ 1] /usr/lib64/libc.so.6(gsignal+0x37)[0x2aaaae257277]
[gp004:71056] [ 2] /usr/lib64/libc.so.6(abort+0x148)[0x2aaaae258968]
[gp004:71056] [ 3] /trinity/shared/pkg/compiler/gcc/7.3.0/lib64/libstdc++.so.6(_ZN9__gnu_cxx27__verbose_terminate_handlerEv+0x125)[0x2aaaadaeaea5]
[gp004:71056] [ 4] /trinity/shared/pkg/compiler/gcc/7.3.0/lib64/libstdc++.so.6(+0x8ec96)[0x2aaaadae8c96]
[gp004:71056] [ 5] /trinity/shared/pkg/compiler/gcc/7.3.0/lib64/libstdc++.so.6(+0x8ece1)[0x2aaaadae8ce1]
[gp004:71056] [ 6] /trinity/shared/pkg/compiler/gcc/7.3.0/lib64/libstdc++.so.6(+0x8ef23)[0x2aaaadae8f23]
[gp004:71056] [ 7] ./picongpu(_ZN6alpaka4cuda6detail11cudaRtCheckERK9cudaErrorPKcS6_RKi+0x2ae)[0x96a37e]
[gp004:71056] [ 8] ./picongpu(_Z15cuplaEventQueryPv+0x2c2)[0xbd7322]
[gp004:71056] [ 9] ./picongpu(_ZN5pmacc9CudaEvent10isFinishedEv+0x2b)[0x9209ab]
[gp004:71056] [10] ./picongpu(_ZN5pmacc16TaskSetValueBaseINS_4math6VectorIfLi3ENS1_16StandardAccessorENS1_17StandardNavigatorENS1_6detail17Vector_componentsEEELj3EE13executeInternEv+0x32)[0x967f72]
[gp004:71056] [11] ./picongpu[0x9223ee]
[gp004:71056] [12] ./picongpu(_ZN5pmacc16HostBufferInternINS_4math6VectorIfLi3ENS1_16StandardAccessorENS1_17StandardNavigatorENS1_6detail17Vector_componentsEEELj3EEC2ENS_9DataSpaceILj3EEE+0x8a)[0x9c06da]
[gp004:71056] [13] ./picongpu(_ZN5pmacc14ExchangeInternINS_4math6VectorIfLi3ENS1_16StandardAccessorENS1_17StandardNavigatorENS1_6detail17Vector_componentsEEELj3EEC1ERNS_12DeviceBufferIS7_Lj3EEENS_10GridLayoutILj3EEENS_9DataSpaceILj3EEEjjjb+0x368)[0xa69648]
[gp004:71056] [14] ./picongpu(_ZN8picongpu6FieldEC2EN5pmacc18MappingDescriptionILj3ENS1_4math2CT6VectorIN4mpl_10integral_cIiLi8EEES8_NS7_IiLi4EEEEEEE+0xfea)[0x93828a]
[gp004:71056] [15] ./picongpu(_ZN8picongpu12MySimulation4initEv+0x145)[0xb55675]
[gp004:71056] [16] ./picongpu(_ZN5pmacc16SimulationHelperILj3EE15startSimulationEv+0x1a)[0xb43c0a]
[gp004:71056] [17] ./picongpu(_ZN8picongpu17SimulationStarterINS_21InitialiserControllerENS_16PluginControllerENS_12MySimulationEE5startEv+0xa0)[0xb44410]
[gp004:71056] [18] ./picongpu(main+0x9b)[0x91bfeb]
[gp004:71056] [19] /usr/lib64/libc.so.6(__libc_start_main+0xf5)[0x2aaaae243445]
[gp004:71056] [20] ./picongpu[0x91d51f]
A K80 with the dev branch does not show any errors.
After a week of debugging and hacking, I would say the alpaka functions for pinning and unpinning memory are the root of all evil.
Based on this new knowledge I found https://devtalk.nvidia.com/default/topic/1031862/cuda-programming-and-performance/calling-cudahostunregister-on-the-same-4kb-page-twice-cuda-9-1-/
I am currently not able to create a mini app with native CUDA which reproduces the cuda-memcheck issue.
At the current state it is not clear whether this is a CUDA tools issue or a PIConGPU/cupla or alpaka issue.
I opened an alpaka issue https://github.com/ComputationalRadiationPhysics/alpaka/issues/873 since I think we are using the pinning in alpaka wrongly. While working with HIP I saw some documentation snippets which may describe why we have issues on Power9.
Solved with https://github.com/ComputationalRadiationPhysics/alpaka/issues/873 and merged into PIConGPU with #3175.