PIConGPU: Double Free for Ionization (RNG-using) Simulations

Created on 6 Apr 2017 · 28 comments · Source: ComputationalRadiationPhysics/picongpu

While investigating a PIConGPU crash on taurus using ADK as the ionization method, I stumbled upon a segmentation fault at the end of the default Laser Wakefield example using the dev version.
In the default setup with ions and ADK, the simulation finishes but runs into a segmentation fault right at the end. With the blocking-kernel option, the segmentation fault happens right at the start.

I am currently investigating the cause for this crash with cuda-gdb.

Could one of you (@ax3l, @psychocoderHPC) please verify this bug?
(To make sure it is not just a bad module combination I use on hypnos and taurus.)

Update: (2017-04-07)
It turns out there are two issues:

  • If using sm_20 instead of sm_35, both cuda_memtest and picongpu produce errors. Switching to sm_35 solves this issue (now tracked in issue #1954).
  • If using atomic hydrogen, we still see an out-of-memory error even for very small simulation volumes.

Thus I will rename this issue to only cover ionization. Please see #1954 for the sm_20 vs. sm_35 issue.

Update: (2017-04-12)
Modules used:

  1) gcc/4.9.2
  2) cmake/3.3.0  
  3) boost/1.60.0   
  4) cuda/8.0  
  5) openmpi/1.8.6.kepler.cuda80   
  6) pngwriter/0.5.6
  7) hdf5-parallel/1.8.15

and our own libSplash (the current master) at 4aa0c039f98295aa75a490ed4fc4df93ae3c9dac.

Labels: bug, core

All 28 comments

With the "hack" in #1951 I get the following backtrace:

#0  0x00007ffff48a4c37 in raise () from /lib/x86_64-linux-gnu/libc.so.6
#1  0x00007ffff48a8028 in abort () from /lib/x86_64-linux-gnu/libc.so.6
#2  0x00007ffff50be9ed in __gnu_cxx::__verbose_terminate_handler() () from /opt/pkg/compiler/gnu/gcc/4.9.2/lib64/libstdc++.so.6
#3  0x00007ffff50bc986 in __cxxabiv1::__terminate(void (*)()) () from /opt/pkg/compiler/gnu/gcc/4.9.2/lib64/libstdc++.so.6
#4  0x00007ffff50bc9d1 in std::terminate() () from /opt/pkg/compiler/gnu/gcc/4.9.2/lib64/libstdc++.so.6
#5  0x00007ffff50bcc18 in __cxa_throw () from /opt/pkg/compiler/gnu/gcc/4.9.2/lib64/libstdc++.so.6
#6  0x00000000009e1b76 in PMacc::exec::KernelStarter<PMacc::exec::Kernel<PMacc::random::kernel::InitRNGProvider<PMacc::random::methods::XorMin> >, unsigned int, unsigned int>::operator()<PMacc::DataBox<PMacc::PitchedBox<PMacc::random::methods::XorMin::StateType, 3u> >, unsigned int, PMacc::DataSpace<3u> > (this=this@entry=0x7fffffffc9c0)
    at ___/src/libPMacc/include/eventSystem/events/kernelEvents.hpp:217
#7  0x00000000009e24f9 in operator()<PMacc::DataBox<PMacc::PitchedBox<PMacc::random::methods::XorMin::StateType, 3u> >, unsigned int, PMacc::DataSpace<3u> > (this=0x7fffffffc990)
    at ___/src/libPMacc/include/eventSystem/events/kernelEvents.hpp:241
#8  PMacc::random::RNGProvider<3u, PMacc::random::methods::XorMin>::init (this=0x7fffe8179520, seed=0)
    at ___/src/libPMacc/include/random/RNGProvider.tpp:76
#9  0x0000000000a1296d in picongpu::MySimulation::init (this=0x217f0f0)
    at ___/src/picongpu/include/simulationControl/MySimulation.hpp:326
#10 0x00000000009b36d8 in PMacc::SimulationHelper<3u>::startSimulation (this=0x217f0f0)
    at ___/src/libPMacc/include/simulationControl/SimulationHelper.hpp:215
#11 0x00000000009b3f04 in picongpu::SimulationStarter<picongpu::InitialiserController, picongpu::PluginController, picongpu::MySimulation>::start (this=this@entry=0x7fffffffd1c0)
    at ___/src/picongpu/include/simulationControl/SimulationStarter.hpp:81
#12 0x00000000008b81d9 in main (argc=11, argv=0x7fffffffd2f8)
    at ___/src/picongpu/main.cu:56

with ___ being the path to the source

Please run with the blocking kernel and cuda-memtest.

Could you please also provide the error message? Something like invalid memory access or similar.

The function void RNGProvider<T_dim, T_RNGMethod>::init(uint32_t seed) is called several times (though according to gdb, not all calls start a kernel).
The value of gridSize is optimized out until the crash; right before the crash it is set to 16384.

Thus from the kernel call:

PMACC_KERNEL(kernel::InitRNGProvider<RNGMethod>{})
    (gridSize, blockSize)
    (bufferBox, seed, m_size);

with:

  • gridSize = 16384
  • blockSize = 256
  • bufferBox = {<PMacc::private_Box::Box<3u, PMacc::PitchedBox<PMacc::random::methods::XorMin::StateType, 3u> >> = {<PMacc::PitchedBox<PMacc::random::methods::XorMin::StateType, 3u>> = {pitch = 3072, pitch2D = 786432, fixedPointer = 0x131f980000}, <No data fields>}, <No data fields>}
  • seed = 0
  • m_size = {<PMacc::math::Vector<int, 3, PMacc::math::StandardAccessor, PMacc::math::StandardNavigator, PMacc::math::detail::Vector_components>> = {<PMacc::math::detail::Vector_components<int, 3>> = {static isConst = <optimized out>, static dim = 3, v = {128, 256, 128}}, <PMacc::math::StandardAccessor> = {<No data fields>}, <PMacc::math::StandardNavigator> = {<No data fields>}, static dim = 3}, static Dim = <optimized out>}

Surprisingly, given the first two values and the previously executed line of code

const uint32_t gridSize = (m_size.productOfComponents() + blockSize - 1u) / blockSize; // Round up

m_size.productOfComponents() returns 4194049,
a number with the prime factors 53 * 79133, which looks odd for the typical data structures used in PIConGPU.
Furthermore, it makes no sense as a product of more than two components (and m_size is 3-dimensional).

@psychocoderHPC

When running cuda_memtest, there is an error (on various K20/hypnos nodes):

mpiexec --prefix $MPIHOME -tag-output --display-map -x LIBRARY_PATH -x LD_LIBRARY_PATH -am  .../002_test_1GPU/tbg/openib.conf --mca mpi_leave_pinned 0 -npernode 1 -n 1 .../002_test_1GPU/picongpu/bin/cuda_memtest.sh
 Data for JOB [59174,1] offset 0

 ========================   JOB MAP   ========================

 Data for node: kepler004   Num slots: 4    Max slots: 0    Num procs: 1
    Process OMPI jobid: [59174,1] App: 0 Process rank: 0

 =============================================================
[1,0]<stderr>:[04/06/2017 21:55:21][kepler004][0]:ERROR: CUDA error: invalid device function, line 312, file ___/thirdParty/cuda_memtest/tests.cu
[1,0]<stderr>:cuda_memtest crash: see file ___/002_test_1GPU/simOutput/cuda_memtest_kepler004_0.err
-------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code.. Per user-direction, the job has been aborted.
-------------------------------------------------------
--------------------------------------------------------------------------
mpiexec detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

  Process name: [[59174,1],0]
  Exit code:    1
--------------------------------------------------------------------------

When running picongpu with the blocking kernel enabled, the following error message is given:

mpiexec --prefix $MPIHOME -x LIBRARY_PATH -x LD_LIBRARY_PATH -tag-output --display-map -am  .../002_test_1GPU/tbg/openib.conf --mca mpi_leave_pinned 0 -npernode 1 -n 1 .../002_test_1GPU/picongpu/bin/picongpu  -d 1 1 1                         -g 128 256 128                        -s 10
 Data for JOB [58224,1] offset 0

 ========================   JOB MAP   ========================

 Data for node: kepler004   Num slots: 4    Max slots: 0    Num procs: 1
    Process OMPI jobid: [58224,1] App: 0 Process rank: 0

 =============================================================
[1,0]<stderr>:vsetenv LD_LIBRARY_PATH failed
[1,0]<stderr>:vsetenv LD_LIBRARY_PATH failed
[1,0]<stdout>:PIConGPUVerbose PHYSICS(1) | Sliding Window is OFF
[1,0]<stderr>:[CUDA] Error: < ___/src/libPMacc/include/eventSystem/events/kernelEvents.hpp>:220 Last error after kernel launch N5PMacc6random6kernel15InitRNGProviderINS0_7methods6XorMinEEE [ ___/src/libPMacc/include/random/RNGProvider.tpp:76 ]
[1,0]<stderr>:terminate called after throwing an instance of 'std::runtime_error'
[1,0]<stderr>:  what():  [CUDA] Error: invalid device function
[1,0]<stderr>:[kepler004:25014] *** Process received signal ***
[1,0]<stderr>:[kepler004:25014] Signal: Aborted (6)
[1,0]<stderr>:[kepler004:25014] Signal code:  (-6)
[1,0]<stderr>:[kepler004:25014] [ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0x10330)[0x7ff4adbfc330]
[1,0]<stderr>:[kepler004:25014] [ 1] [1,0]<stderr>:/lib/x86_64-linux-gnu/libc.so.6(gsignal+0x37)[0x7ff4aa8d4c37]
[1,0]<stderr>:[kepler004:25014] [ 2] [1,0]<stderr>:/lib/x86_64-linux-gnu/libc.so.6(abort+0x148)[0x7ff4aa8d8028]
[1,0]<stderr>:[kepler004:25014] [ 3] [1,0]<stderr>:/opt/pkg/compiler/gnu/gcc/4.9.2/lib64/libstdc++.so.6(_ZN9__gnu_cxx27__verbose_terminate_handlerEv+0x15d)[0x7ff4ab0ee9ed]
[1,0]<stderr>:[kepler004:25014] [ 4] /opt/pkg/compiler/gnu/gcc/4.9.2/lib64/libstdc++.so.6(+0x5d986)[0x7ff4ab0ec986]
[1,0]<stderr>:[kepler004:25014] [ 5] /opt/pkg/compiler/gnu/gcc/4.9.2/lib64/libstdc++.so.6(+0x5d9d1)[0x7ff4ab0ec9d1]
[1,0]<stderr>:[kepler004:25014] [ 6] [1,0]<stderr>:/opt/pkg/compiler/gnu/gcc/4.9.2/lib64/libstdc++.so.6(+0x5dc18)[0x7ff4ab0ecc18]
[1,0]<stderr>:[kepler004:25014] [ 7] [1,0]<stderr>: .../002_test_1GPU/picongpu/bin/picongpu(_ZNK5PMacc4exec13KernelStarterINS0_6KernelINS_6random6kernel15InitRNGProviderINS3_7methods6XorMinEEEEEjjEclIINS_7DataBoxINS_10PitchedBoxINS7_9StateTypeELj3EEEEEjNS_9DataSpaceILj3EEEEEEvDpRKT_+0x706)[0x9e1b76]
[1,0]<stderr>:[kepler004:25014] [ 8] [1,0]<stderr>: .../002_test_1GPU/picongpu/bin/picongpu(_ZN5PMacc6random11RNGProviderILj3ENS0_7methods6XorMinEE4initEj+0x109)[0x9e24f9]
[1,0]<stderr>:[kepler004:25014] [ 9] [1,0]<stderr>: .../002_test_1GPU/picongpu/bin/picongpu(_ZN8picongpu12MySimulation4initEv+0x5ad)[0xa1296d]
[1,0]<stderr>:[kepler004:25014] [10] [1,0]<stderr>: .../002_test_1GPU/picongpu/bin/picongpu(_ZN5PMacc16SimulationHelperILj3EE15startSimulationEv+0x18)[0x9b36d8]
[1,0]<stderr>:[kepler004:25014] [11] [1,0]<stderr>: .../002_test_1GPU/picongpu/bin/picongpu(_ZN8picongpu17SimulationStarterINS_21InitialiserControllerENS_16PluginControllerENS_12MySimulationEE5startEv+0xc4)[0x9b3f04]
[1,0]<stderr>:[kepler004:25014] [12] [1,0]<stderr>: .../002_test_1GPU/picongpu/bin/picongpu(main+0x99)[0x8b81d9]
[1,0]<stderr>:[kepler004:25014] [13] [1,0]<stderr>:/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf5)[0x7ff4aa8bff45]
[1,0]<stderr>:[kepler004:25014] [14] [1,0]<stderr>: .../002_test_1GPU/picongpu/bin/picongpu[0x8b856f]
[1,0]<stderr>:[kepler004:25014] *** End of error message ***
--------------------------------------------------------------------------
mpiexec noticed that process rank 0 with PID 25014 on node kepler004 exited on signal 6 (Aborted).
--------------------------------------------------------------------------

Without the blocking kernel the error message differs:

mpiexec --prefix $MPIHOME -x LIBRARY_PATH -x LD_LIBRARY_PATH -tag-output --display-map -am .../002_test_1GPU/tbg/openib.conf --mca mpi_leave_pinned 0 -npernode 1 -n 1 .../002_test_1GPU/picongpu/bin/picongpu  -d 1 1 1                         -g 128 256 128                        -s 10
 Data for JOB [63601,1] offset 0

 ========================   JOB MAP   ========================

 Data for node: kepler004   Num slots: 4    Max slots: 0    Num procs: 1
    Process OMPI jobid: [63601,1] App: 0 Process rank: 0

 =============================================================
[1,0]<stderr>:vsetenv LD_LIBRARY_PATH failed
[1,0]<stderr>:vsetenv LD_LIBRARY_PATH failed
[1,0]<stdout>:PIConGPUVerbose PHYSICS(1) | Sliding Window is OFF
[1,0]<stdout>:PIConGPUVerbose PHYSICS(1) | Courant c*dt <= 1.00229 ? 1
[1,0]<stdout>:PIConGPUVerbose PHYSICS(1) | species e: omega_p * dt <= 0.1 ? 0.0247974
[1,0]<stdout>:PIConGPUVerbose PHYSICS(1) | species i: omega_p * dt <= 0.1 ? 0.000578698
[1,0]<stdout>:PIConGPUVerbose PHYSICS(1) | y-cells per wavelength: 18.0587
[1,0]<stdout>:PIConGPUVerbose PHYSICS(1) | macro particles per gpu: 16777216
[1,0]<stdout>:PIConGPUVerbose PHYSICS(1) | typical macro particle weighting: 6955.06
[1,0]<stdout>:PIConGPUVerbose PHYSICS(1) | UNIT_SPEED 2.99792e+08
[1,0]<stdout>:PIConGPUVerbose PHYSICS(1) | UNIT_TIME 1.39e-16
[1,0]<stdout>:PIConGPUVerbose PHYSICS(1) | UNIT_LENGTH 4.16712e-08
[1,0]<stdout>:PIConGPUVerbose PHYSICS(1) | UNIT_MASS 6.33563e-27
[1,0]<stdout>:PIConGPUVerbose PHYSICS(1) | UNIT_CHARGE 1.11432e-15
[1,0]<stdout>:PIConGPUVerbose PHYSICS(1) | UNIT_EFIELD 1.22627e+13
[1,0]<stdout>:PIConGPUVerbose PHYSICS(1) | UNIT_BFIELD 40903.8
[1,0]<stdout>:PIConGPUVerbose PHYSICS(1) | UNIT_ENERGY 5.69418e-10
[1,0]<stdout>:initialization time:  1sec 934msec = 1 sec
[1,0]<stdout>:  0 % =        0 | time elapsed:                    0msec | avg time per step:   0msec
[1,0]<stdout>: 10 % =        1 | time elapsed:                    4msec | avg time per step:   4msec
[1,0]<stdout>: 20 % =        2 | time elapsed:                    6msec | avg time per step:   2msec
[1,0]<stdout>: 30 % =        3 | time elapsed:                    8msec | avg time per step:   2msec
[1,0]<stdout>: 40 % =        4 | time elapsed:                   11msec | avg time per step:   2msec
[1,0]<stdout>: 50 % =        5 | time elapsed:                   13msec | avg time per step:   2msec
[1,0]<stdout>: 60 % =        6 | time elapsed:                   15msec | avg time per step:   2msec
[1,0]<stdout>: 70 % =        7 | time elapsed:                   18msec | avg time per step:   2msec
[1,0]<stdout>: 80 % =        8 | time elapsed:                   20msec | avg time per step:   2msec
[1,0]<stdout>: 90 % =        9 | time elapsed:                   22msec | avg time per step:   2msec
[1,0]<stdout>:100 % =       10 | time elapsed:                   25msec | avg time per step:   2msec
[1,0]<stdout>:calculation  simulation time:  25msec = 0 sec
[1,0]<stderr>:[kepler004:31414] *** Process received signal ***
[1,0]<stderr>:[kepler004:31414] Signal: Segmentation fault (11)
[1,0]<stderr>:[kepler004:31414] Signal code: Address not mapped (1)
[1,0]<stderr>:[kepler004:31414] Failing at address: 0x31
--------------------------------------------------------------------------
mpiexec noticed that process rank 0 with PID 31414 on node kepler004 exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------

Oh sorry, my fault, I meant cuda-memcheck. This should show the wrong line within the GPU kernel.

Maybe it is an issue triggered by our new DataConnector or the change within the Environment. We should check it next week.

Result from cuda-memcheck:

cuda-memcheck  .../002_test_1GPU/picongpu/bin/picongpu -d 1 1 1   -g 128 256 128    -s 10
========= CUDA-MEMCHECK
========= Program hit cudaErrorSetOnActiveProcess (error 36) due to "cannot set while device is active in this process" on CUDA API call to cudaSetDeviceFlags. 
=========     Saved host backtrace up to driver entry point at error
=========     Host Frame:/usr/lib/x86_64-linux-gnu/libcuda.so.1 [0x2eea03]
=========     Host Frame: .../002_test_1GPU/picongpu/bin/picongpu [0x6ca610]
=========     Host Frame: .../002_test_1GPU/picongpu/bin/picongpu (_ZN5PMacc6detail18EnvironmentContext9setDeviceEi + 0x1fd) [0x4d68dd]
=========     Host Frame: .../002_test_1GPU/picongpu/bin/picongpu (_ZN5PMacc11EnvironmentILj3EE11initDevicesENS_9DataSpaceILj3EEES3_ + 0x9c) [0x5edd8c]
=========     Host Frame: .../002_test_1GPU/picongpu/bin/picongpu (_ZN8picongpu12MySimulation10pluginLoadEv + 0x1bc) [0x5f501c]
=========     Host Frame: .../002_test_1GPU/picongpu/bin/picongpu (_ZN8picongpu17SimulationStarterINS_21InitialiserControllerENS_16PluginControllerENS_12MySimulationEE10pluginLoadEv + 0x2b) [0x4fc4fb]
=========     Host Frame: .../002_test_1GPU/picongpu/bin/picongpu (main + 0x8c) [0x4b81cc]
=========     Host Frame:/lib/x86_64-linux-gnu/libc.so.6 (__libc_start_main + 0xf5) [0x21f45]
=========     Host Frame: .../002_test_1GPU/picongpu/bin/picongpu [0x4b856f]
=========
========= Program hit cudaErrorSetOnActiveProcess (error 36) due to "cannot set while device is active in this process" on CUDA API call to cudaGetLastError. 
=========     Saved host backtrace up to driver entry point at error
=========     Host Frame:/usr/lib/x86_64-linux-gnu/libcuda.so.1 [0x2eea03]
=========     Host Frame: .../002_test_1GPU/picongpu/bin/picongpu [0x6ba703]
=========     Host Frame: .../002_test_1GPU/picongpu/bin/picongpu (_ZN5PMacc6detail18EnvironmentContext9setDeviceEi + 0x267) [0x4d6947]
=========     Host Frame: .../002_test_1GPU/picongpu/bin/picongpu (_ZN5PMacc11EnvironmentILj3EE11initDevicesENS_9DataSpaceILj3EEES3_ + 0x9c) [0x5edd8c]
=========     Host Frame: .../002_test_1GPU/picongpu/bin/picongpu (_ZN8picongpu12MySimulation10pluginLoadEv + 0x1bc) [0x5f501c]
=========     Host Frame: .../002_test_1GPU/picongpu/bin/picongpu (_ZN8picongpu17SimulationStarterINS_21InitialiserControllerENS_16PluginControllerENS_12MySimulationEE10pluginLoadEv + 0x2b) [0x4fc4fb]
=========     Host Frame: .../002_test_1GPU/picongpu/bin/picongpu (main + 0x8c) [0x4b81cc]
=========     Host Frame:/lib/x86_64-linux-gnu/libc.so.6 (__libc_start_main + 0xf5) [0x21f45]
=========     Host Frame: .../002_test_1GPU/picongpu/bin/picongpu [0x4b856f]
=========
PIConGPUVerbose PHYSICS(1) | Sliding Window is OFF
========= Program hit cudaErrorInvalidDeviceFunction (error 8) due to "invalid device function" on CUDA API call to cudaLaunch. 
=========     Saved host backtrace up to driver entry point at error
=========     Host Frame:/usr/lib/x86_64-linux-gnu/libcuda.so.1 [0x2eea03]
=========     Host Frame: .../002_test_1GPU/picongpu/bin/picongpu [0x6bcfde]
=========     Host Frame: .../002_test_1GPU/picongpu/bin/picongpu (_ZN5PMacc6nvidia16gpuEntryFunctionINS_6random6kernel15InitRNGProviderINS2_7methods6XorMinEEEINS_7DataBoxINS_10PitchedBoxINS6_9StateTypeELj3EEEEEjNS_9DataSpaceILj3EEEEEEvT_DpT0_ + 0x8b) [0x4e7b6b]
=========     Host Frame: .../002_test_1GPU/picongpu/bin/picongpu (_ZNK5PMacc4exec13KernelStarterINS0_6KernelINS_6random6kernel15InitRNGProviderINS3_7methods6XorMinEEEEEjjEclIINS_7DataBoxINS_10PitchedBoxINS7_9StateTypeELj3EEEEEjNS_9DataSpaceILj3EEEEEEvDpRKT_ + 0x34f) [0x5e17bf]
=========     Host Frame: .../002_test_1GPU/picongpu/bin/picongpu (_ZN5PMacc6random11RNGProviderILj3ENS0_7methods6XorMinEE4initEj + 0x109) [0x5e24f9]
=========     Host Frame: .../002_test_1GPU/picongpu/bin/picongpu (_ZN8picongpu12MySimulation4initEv + 0x5ad) [0x61296d]
=========     Host Frame: .../002_test_1GPU/picongpu/bin/picongpu (_ZN5PMacc16SimulationHelperILj3EE15startSimulationEv + 0x18) [0x5b36d8]
=========     Host Frame: .../002_test_1GPU/picongpu/bin/picongpu (_ZN8picongpu17SimulationStarterINS_21InitialiserControllerENS_16PluginControllerENS_12MySimulationEE5startEv + 0xc4) [0x5b3f04]
=========     Host Frame: .../002_test_1GPU/picongpu/bin/picongpu (main + 0x99) [0x4b81d9]
=========     Host Frame:/lib/x86_64-linux-gnu/libc.so.6 (__libc_start_main + 0xf5) [0x21f45]
=========     Host Frame: .../002_test_1GPU/picongpu/bin/picongpu [0x4b856f]
=========
========= Program hit cudaErrorInvalidDeviceFunction (error 8) due to "invalid device function" on CUDA API call to cudaGetLastError. 
=========     Saved host backtrace up to driver entry point at error
=========     Host Frame:/usr/lib/x86_64-linux-gnu/libcuda.so.1 [0x2eea03]
=========     Host Frame: .../002_test_1GPU/picongpu/bin/picongpu [0x6ba703]
=========     Host Frame: .../002_test_1GPU/picongpu/bin/picongpu (_ZNK5PMacc4exec13KernelStarterINS0_6KernelINS_6random6kernel15InitRNGProviderINS3_7methods6XorMinEEEEEjjEclIINS_7DataBoxINS_10PitchedBoxINS7_9StateTypeELj3EEEEEjNS_9DataSpaceILj3EEEEEEvDpRKT_ + 0x29f) [0x5e170f]
=========     Host Frame: .../002_test_1GPU/picongpu/bin/picongpu (_ZN5PMacc6random11RNGProviderILj3ENS0_7methods6XorMinEE4initEj + 0x109) [0x5e24f9]
=========     Host Frame: .../002_test_1GPU/picongpu/bin/picongpu (_ZN8picongpu12MySimulation4initEv + 0x5ad) [0x61296d]
=========     Host Frame: .../002_test_1GPU/picongpu/bin/picongpu (_ZN5PMacc16SimulationHelperILj3EE15startSimulationEv + 0x18) [0x5b36d8]
=========     Host Frame: .../002_test_1GPU/picongpu/bin/picongpu (_ZN8picongpu17SimulationStarterINS_21InitialiserControllerENS_16PluginControllerENS_12MySimulationEE5startEv + 0xc4) [0x5b3f04]
[CUDA] Error: <=========     Host Frame: .../002_test_1GPU/picongpu/bin/picongpu (main + 0x99) [0x4b81d9]
=========     Host Frame:/lib/x86_64-linux-gnu/libc.so.6 (__libc_start_main + 0xf5) [0x21f45]
=========     Host Frame: .../002_test_1GPU/picongpu/bin/picongpu [0x4b856f]
=========
 ___/src/libPMacc/include/eventSystem/events/kernelEvents.hpp>:220 Last error after kernel launch N5PMacc6random6kernel15InitRNGProviderINS0_7methods6XorMinEEE [ ___/src/libPMacc/include/random/RNGProvider.tpp:76 ]
terminate called after throwing an instance of 'std::runtime_error'
  what():  [CUDA] Error: invalid device function
[kepler004:03512] *** Process received signal ***
[kepler004:03512] Signal: Aborted (6)
[kepler004:03512] Signal code:  (-6)
[kepler004:03512] [ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0x10330)[0x7f31e092c330]
[kepler004:03512] [ 1] /lib/x86_64-linux-gnu/libc.so.6(gsignal+0x37)[0x7f31dd604c37]
[kepler004:03512] [ 2] /lib/x86_64-linux-gnu/libc.so.6(abort+0x148)[0x7f31dd608028]
[kepler004:03512] [ 3] /opt/pkg/compiler/gnu/gcc/4.9.2/lib64/libstdc++.so.6(_ZN9__gnu_cxx27__verbose_terminate_handlerEv+0x15d)[0x7f31dde1e9ed]
[kepler004:03512] [ 4] /opt/pkg/compiler/gnu/gcc/4.9.2/lib64/libstdc++.so.6(+0x5d986)[0x7f31dde1c986]
[kepler004:03512] [ 5] /opt/pkg/compiler/gnu/gcc/4.9.2/lib64/libstdc++.so.6(+0x5d9d1)[0x7f31dde1c9d1]
[kepler004:03512] [ 6] /opt/pkg/compiler/gnu/gcc/4.9.2/lib64/libstdc++.so.6(+0x5dc18)[0x7f31dde1cc18]
[kepler004:03512] [ 7]  .../002_test_1GPU/picongpu/bin/picongpu(_ZNK5PMacc4exec13KernelStarterINS0_6KernelINS_6random6kernel15InitRNGProviderINS3_7methods6XorMinEEEEEjjEclIINS_7DataBoxINS_10PitchedBoxINS7_9StateTypeELj3EEEEEjNS_9DataSpaceILj3EEEEEEvDpRKT_+0x706)[0x9e1b76]
[kepler004:03512] [ 8]  .../002_test_1GPU/picongpu/bin/picongpu(_ZN5PMacc6random11RNGProviderILj3ENS0_7methods6XorMinEE4initEj+0x109)[0x9e24f9]
[kepler004:03512] [ 9]  .../002_test_1GPU/picongpu/bin/picongpu(_ZN8picongpu12MySimulation4initEv+0x5ad)[0xa1296d]
[kepler004:03512] [10]  .../002_test_1GPU/picongpu/bin/picongpu(_ZN5PMacc16SimulationHelperILj3EE15startSimulationEv+0x18)[0x9b36d8]
[kepler004:03512] [11]  .../002_test_1GPU/picongpu/bin/picongpu(_ZN8picongpu17SimulationStarterINS_21InitialiserControllerENS_16PluginControllerENS_12MySimulationEE5startEv+0xc4)[0x9b3f04]
[kepler004:03512] [12]  .../002_test_1GPU/picongpu/bin/picongpu(main+0x99)[0x8b81d9]
[kepler004:03512] [13] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf5)[0x7f31dd5eff45]
[kepler004:03512] [14]  .../002_test_1GPU/picongpu/bin/picongpu[0x8b856f]
[kepler004:03512] *** End of error message ***
========= Error: process didn't terminate successfully
========= Internal error (20)
========= No CUDA-MEMCHECK results found

Have you enabled the compile flag "show codeline"? If not, please enable it or do not delete the binary you used; we can extract the line number from it.

Did you change the architecture to sm_35? If not, then this is the error. There is a bug in the CMake file: the PTX code is not embedded, so using the wrong architecture can trigger an error like this.
I will fix the CMake file next week.

Setting sm_35 causes an error:

  0 % =        0 | time elapsed:                    0msec | avg time per step:   0msec
[CUDA] Error: < ___/src/libPMacc/include/eventSystem/events/kernelEvents.hpp>:220 Last error after kernel launch N8picongpu9particles10ionization21KernelIonizeParticlesE [ ___/src/picongpu/include/particles/ParticlesFunctors.hpp:313 ]
terminate called after throwing an instance of 'std::runtime_error'
  what():  [CUDA] Error: out of memory

and is extremely slow.

However, the cuda_memtest.sh runs now successfully.

But even .../002_test_1GPU/picongpu/bin/picongpu -d 1 1 1 -g 32 32 32 -s 1 crashes with the above memory error.

Using sm_35 solves the issue with running the examples with electrons only and with electrons + ions (pre-ionized). But with ionizable ions, both BSI and ADK crash.

Since this is an issue with the ionization, I would like to mention @n01r.

I just did a short test on a K80 node.
I configured a LaserWakefield example from dev with -t 10, which activates ions and ionization. I used BSIHydrogenLike and removed the effectiveAtomicNumbers for this test.
I ran mpiexec -n 8 picongpu -d 2 2 2 -g 64 64 64 -s 2000 and received no errors.

@n01r How did you activate ionization in your simulation?
with PARAM_IONIZATION == 1 ?

$PICSRC/configure -t 10 ~/paramSets/088_Issue1953BugOutOfMemoryWithIonization
whereas the cmake flag set 10 contains
flags[10]="-DCUDA_ARCH=35 -DPARAM_OVERWRITES:LIST=-DPARAM_IONS=1;-DPARAM_IONIZATION=1"

edit by @ax3l: and manually setting BSIEffectiveZ to ADKLinPol: https://github.com/ComputationalRadiationPhysics/picongpu/pull/1960#issuecomment-293562609

I repeated the test with ADK w/o PMACC_BLOCKING_KERNEL.
This time I get an error:

 95 % =     1900 | time elapsed:            14sec 637msec | avg time per step:   7msec
100 % =     2000 | time elapsed:            15sec 390msec | avg time per step:   7msec
calculation  simulation time: 15sec 391msec = 15 sec
[kepler020:22371] *** Process received signal ***
[kepler020:22371] Signal: Segmentation fault (11)
[kepler020:22371] Signal code: Address not mapped (1)
[kepler020:22371] Failing at address: 0x30
--------------------------------------------------------------------------
mpiexec noticed that process rank 0 with PID 22371 on node kepler020 exited on signal 11 (Segmentation fault).

Please run on one GPU and try to reproduce it. After that, please run with cuda-gdb and print the backtrace.

Yeah, I know the drill; already at it.
I only posted the 8-GPU test, but I already did 1-GPU tests for both, and now I am compiling with the debug flags.

@psychocoderHPC Do you remember that I came to your office a couple of weeks ago to tell you the same thing, that a simulation would crash after completing? You told me it is a known issue with freeing memory in the cleanup step.

I did not try with sm_20 so far, though. Only with sm_35.

@n01r The cleanup issue you mean was fixed with #1886. It happened during the time @ax3l refactored the PMacc::DataConnector.

@PrometheusPi: I can reproduce the error with the current dev (LWFA plane), and it is fixed in my test in #1960 (see this test).

This issue mixes two different bugs:

  1. Compiling LWFA with sm_20 and running on sm_3x results in [CUDA] Error: invalid device function.
  2. Compiling LWFA+ADK with sm_35 and running on sm_35 results in an invalid memory access.

Point 1 is also addressed in #1954 and fixed with #1960.

From this point on we use this issue only to discuss the ADK error.

The ADK problem looks a bit like a double free or a use-after-free of the RNG. I will check the output of the new DataConnector in verbose mode:

[...]
calculation  simulation time: 102msec = 0 sec
PMaccVerbose MEMORY(1) | DataConnector: unshared 'MallocMCBuffer' (0 uses left)
PMaccVerbose MEMORY(1) | DataConnector: being cleaned (7 datasets left to unshare)
PMaccVerbose MEMORY(1) | DataConnector: unshared 'i' (0 uses left)
PMaccVerbose MEMORY(1) | DataConnector: unshared 'e' (0 uses left)
PMaccVerbose MEMORY(1) | DataConnector: unshared 'RNGProvider3XorMin' (0 uses left)
PMaccVerbose MEMORY(1) | DataConnector: unshared 'FieldTmp0' (0 uses left)
PMaccVerbose MEMORY(1) | DataConnector: unshared 'J' (0 uses left)
PMaccVerbose MEMORY(1) | DataConnector: unshared 'E' (0 uses left)
PMaccVerbose MEMORY(1) | DataConnector: unshared 'B' (0 uses left)
[kepler020:08170] *** Process received signal ***
[kepler020:08170] Signal: Segmentation fault (11)
[kepler020:08170] Signal code: Address not mapped (1)
[kepler020:08170] Failing at address: 0x35
--------------------------------------------------------------------------
mpiexec noticed that process rank 0 with PID 8170 on node kepler020 exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------

@ax3l Yes, the issue is triggered by the DataConnector because of wrong ownership.

The plain pointer to the RNGProvider is deleted here in MySimulation, and later on the shared pointer is also freed by the DataConnector.

Solution

We need to hold the RNGProvider in MySimulation as a shared pointer and must trigger the share function from MySimulation, not from the class itself.

Note: I am currently sick and cannot address this issue within the next two weeks.

Yes, I just posted the same thing a minute ago above :D Currently testing...

Should be fixed with #1963.

