PIConGPU triggers an invalid memory access with the latest dev branch.
The error is only visible if PIConGPU is started with cuda-memcheck --report-api-errors no picongpu ...
I can reproduce the error on hemera with an NVIDIA P100.
========= Error: process didn't terminate successfully
========= Fatal UVM GPU fault of type invalid pde due to invalid address
========= during atomic access to address 0x20be00000
Normally cuda-memcheck shows the line with the invalid memory access, but somehow not in this case.
The latest master (release 0.4.3) is also affected.
Some additional information from version 0.4.3, example KelvinHelmholtz:
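One common reason cuda-memcheck cannot attribute a fault to a source line is that the binary was built without device line information. A build-configuration sketch (the CMake invocation is illustrative and has to be adapted to the actual PIConGPU build setup):

```shell
# Rebuild with device line info so cuda-memcheck can map faults to source
# lines; -lineinfo keeps optimizations, unlike full device debug (-G).
cmake -DCMAKE_CUDA_FLAGS="-lineinfo" ..
make
# then rerun the failing case under cuda-memcheck as before
mpiexec -n 1 cuda-memcheck --report-api-errors no ./picongpu -g 24 24 12 -d 1 1 1 --periodic 1 1 1 -s 0
```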
mpiexec -n 1 cuda-memcheck --report-api-errors no ./picongpu -g 24 24 12 -d 1 1 1 --periodic 1 1 1 -s 0
========= CUDA-MEMCHECK
PIConGPUVerbose PHYSICS(1) | Sliding Window is OFF
terminate called after throwing an instance of 'std::runtime_error'
what(): /bigdata/hplsim/scratch/widera/dev/thirdParty/alpaka/include/alpaka/event/EventCudaRt.hpp(195) 'ret = cudaEventQuery( event.m_spEventImpl->m_CudaEvent)' returned error : 'cudaErrorIllegalAddress': 'an illegal memory access was encountered'!
[gp004:71056] *** Process received signal ***
[gp004:71056] Signal: Aborted (6)
[gp004:71056] Signal code: (-6)
[gp004:71056] [ 0] /usr/lib64/libpthread.so.0(+0xf6d0)[0x2aaaaacde6d0]
[gp004:71056] [ 1] /usr/lib64/libc.so.6(gsignal+0x37)[0x2aaaae257277]
[gp004:71056] [ 2] /usr/lib64/libc.so.6(abort+0x148)[0x2aaaae258968]
[gp004:71056] [ 3] /trinity/shared/pkg/compiler/gcc/7.3.0/lib64/libstdc++.so.6(_ZN9__gnu_cxx27__verbose_terminate_handlerEv+0x125)[0x2aaaadaeaea5]
[gp004:71056] [ 4] /trinity/shared/pkg/compiler/gcc/7.3.0/lib64/libstdc++.so.6(+0x8ec96)[0x2aaaadae8c96]
[gp004:71056] [ 5] /trinity/shared/pkg/compiler/gcc/7.3.0/lib64/libstdc++.so.6(+0x8ece1)[0x2aaaadae8ce1]
[gp004:71056] [ 6] /trinity/shared/pkg/compiler/gcc/7.3.0/lib64/libstdc++.so.6(+0x8ef23)[0x2aaaadae8f23]
[gp004:71056] [ 7] ./picongpu(_ZN6alpaka4cuda6detail11cudaRtCheckERK9cudaErrorPKcS6_RKi+0x2ae)[0x96a37e]
[gp004:71056] [ 8] ./picongpu(_Z15cuplaEventQueryPv+0x2c2)[0xbd7322]
[gp004:71056] [ 9] ./picongpu(_ZN5pmacc9CudaEvent10isFinishedEv+0x2b)[0x9209ab]
[gp004:71056] [10] ./picongpu(_ZN5pmacc16TaskSetValueBaseINS_4math6VectorIfLi3ENS1_16StandardAccessorENS1_17StandardNavigatorENS1_6detail17Vector_componentsEEELj3EE13executeInternEv+0x32)[0x967f72]
[gp004:71056] [11] ./picongpu[0x9223ee]
[gp004:71056] [12] ./picongpu(_ZN5pmacc16HostBufferInternINS_4math6VectorIfLi3ENS1_16StandardAccessorENS1_17StandardNavigatorENS1_6detail17Vector_componentsEEELj3EEC2ENS_9DataSpaceILj3EEE+0x8a)[0x9c06da]
[gp004:71056] [13] ./picongpu(_ZN5pmacc14ExchangeInternINS_4math6VectorIfLi3ENS1_16StandardAccessorENS1_17StandardNavigatorENS1_6detail17Vector_componentsEEELj3EEC1ERNS_12DeviceBufferIS7_Lj3EEENS_10GridLayoutILj3EEENS_9DataSpaceILj3EEEjjjb+0x368)[0xa69648]
[gp004:71056] [14] ./picongpu(_ZN8picongpu6FieldEC2EN5pmacc18MappingDescriptionILj3ENS1_4math2CT6VectorIN4mpl_10integral_cIiLi8EEES8_NS7_IiLi4EEEEEEE+0xfea)[0x93828a]
[gp004:71056] [15] ./picongpu(_ZN8picongpu12MySimulation4initEv+0x145)[0xb55675]
[gp004:71056] [16] ./picongpu(_ZN5pmacc16SimulationHelperILj3EE15startSimulationEv+0x1a)[0xb43c0a]
[gp004:71056] [17] ./picongpu(_ZN8picongpu17SimulationStarterINS_21InitialiserControllerENS_16PluginControllerENS_12MySimulationEE5startEv+0xa0)[0xb44410]
[gp004:71056] [18] ./picongpu(main+0x9b)[0x91bfeb]
[gp004:71056] [19] /usr/lib64/libc.so.6(__libc_start_main+0xf5)[0x2aaaae243445]
[gp004:71056] [20] ./picongpu[0x91d51f]
A K80 with the dev branch does not show any errors.
After a week of debugging and hacking, I would say the alpaka functions for pinning and unpinning memory are the root of all evil.
Based on this new knowledge I found https://devtalk.nvidia.com/default/topic/1031862/cuda-programming-and-performance/calling-cudahostunregister-on-the-same-4kb-page-twice-cuda-9-1-/
I am currently not able to create a mini app with native CUDA which reproduces the cuda-memcheck issue.
At the current state it is not clear whether this is a CUDA tools issue or a PIConGPU/cupla or alpaka issue.
I opened an alpaka issue https://github.com/ComputationalRadiationPhysics/alpaka/issues/873 since I think we are using the pinning in alpaka wrongly. While working with HIP I saw some documentation snippets which may describe why we have issues on Power9.
Solved with https://github.com/ComputationalRadiationPhysics/alpaka/issues/873 and merged into PIConGPU with #3175.