PIConGPU: N6splash11DCExceptionE and continuation of an existing simulation

Created on 3 Feb 2020 · 5 comments · Source: ComputationalRadiationPhysics/picongpu

I got the error shown below on a model that had run fine for two hours. I am using the @develop version. What could it be? Also, in such situations (or in general), how can I continue the simulation for the remaining unfinished steps, or for new ones?
Thank you

initialization time:  1min  5sec 360msec = 65 sec
  0 % =        0 | time elapsed:             9sec 608msec | avg time per step:   0msec
  4 % =     7186 | time elapsed:      10min 43sec 695msec | avg time per step:  57msec
  9 % =    14372 | time elapsed:      21min 42sec 810msec | avg time per step:  59msec
 14 % =    21558 | time elapsed:      33min  2sec 930msec | avg time per step:  59msec
 19 % =    28744 | time elapsed:      45min 10sec 726msec | avg time per step:  59msec
 24 % =    35930 | time elapsed:      57min 20sec  72msec | avg time per step:  60msec
 29 % =    43116 | time elapsed:   1h  8min 25sec 481msec | avg time per step:  60msec
 34 % =    50302 | time elapsed:   1h 19min 33sec 901msec | avg time per step:  60msec
 39 % =    57488 | time elapsed:   1h 30min 45sec 312msec | avg time per step:  61msec
 44 % =    64674 | time elapsed:   1h 42min  1sec 485msec | avg time per step:  61msec
 49 % =    71860 | time elapsed:   1h 53min 21sec 481msec | avg time per step:  62msec
 54 % =    79046 | time elapsed:   2h  4min 44sec 430msec | avg time per step:  63msec
Unhandled exception of type 'N6splash11DCExceptionE' with message 'Exception for DCDataSet [z] create: Failed to create dataset', terminating
full simulation time:  2h 13min 47sec 226msec = 8027 sec
[quasar:12271:0:12271] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x558)
==== backtrace (tid:  12271) ====
 0  /usr/local/ucx/lib/libucs.so.0(ucs_handle_error+0x124) [0x7fd634ef0d24]
 1  /usr/local/ucx/lib/libucs.so.0(+0x2414c) [0x7fd634ef114c]
 2  /usr/local/ucx/lib/libucs.so.0(+0x243c4) [0x7fd634ef13c4]
 3  /home/quasar/src/spack/opt/spack/linux-linuxmint19-skylake/gcc-7.4.0/hdf5-1.10.6-fy2osaqrk4g2ibqlqhq2gjaz7rtl6c5n/lib/libhdf5.so.103(H5F__close_cb+0x39) [0x7fd65e429f29]
 4  /home/quasar/src/spack/opt/spack/linux-linuxmint19-skylake/gcc-7.4.0/hdf5-1.10.6-fy2osaqrk4g2ibqlqhq2gjaz7rtl6c5n/lib/libhdf5.so.103(+0x188850) [0x7fd65e4a6850]
 5  /home/quasar/src/spack/opt/spack/linux-linuxmint19-skylake/gcc-7.4.0/hdf5-1.10.6-fy2osaqrk4g2ibqlqhq2gjaz7rtl6c5n/lib/libhdf5.so.103(H5SL_try_free_safe+0x63) [0x7fd65e575733]
 6  /home/quasar/src/spack/opt/spack/linux-linuxmint19-skylake/gcc-7.4.0/hdf5-1.10.6-fy2osaqrk4g2ibqlqhq2gjaz7rtl6c5n/lib/libhdf5.so.103(H5I_clear_type+0xa0) [0x7fd65e4a72a0]
 7  /home/quasar/src/spack/opt/spack/linux-linuxmint19-skylake/gcc-7.4.0/hdf5-1.10.6-fy2osaqrk4g2ibqlqhq2gjaz7rtl6c5n/lib/libhdf5.so.103(H5F_term_package+0x34) [0x7fd65e41d264]
 8  /home/quasar/src/spack/opt/spack/linux-linuxmint19-skylake/gcc-7.4.0/hdf5-1.10.6-fy2osaqrk4g2ibqlqhq2gjaz7rtl6c5n/lib/libhdf5.so.103(+0x4bc1e) [0x7fd65e369c1e]
 9  /lib/x86_64-linux-gnu/libc.so.6(+0x43041) [0x7fd65d2c3041]
10  /lib/x86_64-linux-gnu/libc.so.6(+0x4313a) [0x7fd65d2c313a]
11  /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xee) [0x7fd65d2a1b9e]
12  /media/quasar/Storage/IONS_lam_75_A0_0.005_pulse_10fs_phase_0_linX_thick_x10_tubes_1_units_0.1nm_0.2as/input/bin/picongpu(_start+0x2a) [0x5623dbed39ca]
=================================
[quasar:12271] *** Process received signal ***
[quasar:12271] Signal: Segmentation fault (11)
[quasar:12271] Signal code:  (-6)
[quasar:12271] Failing at address: 0x3e800002fef
[quasar:12271] [ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0x12890)[0x7fd65fee5890]
[quasar:12271] [ 1] /home/quasar/src/spack/opt/spack/linux-linuxmint19-skylake/gcc-7.4.0/hdf5-1.10.6-fy2osaqrk4g2ibqlqhq2gjaz7rtl6c5n/lib/libhdf5.so.103(H5F__close_cb+0x39)[0x7fd65e429f29]
[quasar:12271] [ 2] /home/quasar/src/spack/opt/spack/linux-linuxmint19-skylake/gcc-7.4.0/hdf5-1.10.6-fy2osaqrk4g2ibqlqhq2gjaz7rtl6c5n/lib/libhdf5.so.103(+0x188850)[0x7fd65e4a6850]
[quasar:12271] [ 3] /home/quasar/src/spack/opt/spack/linux-linuxmint19-skylake/gcc-7.4.0/hdf5-1.10.6-fy2osaqrk4g2ibqlqhq2gjaz7rtl6c5n/lib/libhdf5.so.103(H5SL_try_free_safe+0x63)[0x7fd65e575733]
[quasar:12271] [ 4] /home/quasar/src/spack/opt/spack/linux-linuxmint19-skylake/gcc-7.4.0/hdf5-1.10.6-fy2osaqrk4g2ibqlqhq2gjaz7rtl6c5n/lib/libhdf5.so.103(H5I_clear_type+0xa0)[0x7fd65e4a72a0]
[quasar:12271] [ 5] /home/quasar/src/spack/opt/spack/linux-linuxmint19-skylake/gcc-7.4.0/hdf5-1.10.6-fy2osaqrk4g2ibqlqhq2gjaz7rtl6c5n/lib/libhdf5.so.103(H5F_term_package+0x34)[0x7fd65e41d264]
[quasar:12271] [ 6] /home/quasar/src/spack/opt/spack/linux-linuxmint19-skylake/gcc-7.4.0/hdf5-1.10.6-fy2osaqrk4g2ibqlqhq2gjaz7rtl6c5n/lib/libhdf5.so.103(+0x4bc1e)[0x7fd65e369c1e]
[quasar:12271] [ 7] /lib/x86_64-linux-gnu/libc.so.6(+0x43041)[0x7fd65d2c3041]
[quasar:12271] [ 8] /lib/x86_64-linux-gnu/libc.so.6(+0x4313a)[0x7fd65d2c313a]
[quasar:12271] [ 9] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xee)[0x7fd65d2a1b9e]
[quasar:12271] [10] /media/quasar/Storage/IONS_lam_75_A0_0.005_pulse_10fs_phase_0_linX_thick_x10_tubes_1_units_0.1nm_0.2as/input/bin/picongpu(_start+0x2a)[0x5623dbed39ca]
[quasar:12271] *** End of error message ***
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpiexec noticed that process rank 0 with PID 0 on node quasar exited on signal 11 (Segmentation fault).
Labels: plugin, question

All 5 comments

Hello @cbontoiu. Unfortunately, we have little idea without being able to reproduce it. Does it happen consistently for you on this setup? The only thing to try off the top of my head is this, in case you are using OpenMPI + HDF5.

Generally, checkpointing might help. Just beware that the size of each checkpoint file is approximately equal to your simulation data size.
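
As a hedged sketch of what that could look like: PIConGPU exposes checkpointing through command-line flags, which are usually collected in the `.cfg` file. The period value and the `TBG_checkpoint`/`TBG_restart` variable names below are illustrative; check the flag spellings against the manual for your PIConGPU version:

```bash
# In your *.cfg: write a checkpoint every 5000 steps (period is illustrative)
TBG_checkpoint="--checkpoint.period 5000"

# To continue an aborted run, resubmit with the restart flag;
# by default the most recent checkpoint should be picked up
TBG_restart="--checkpoint.restart"

# Hook the variable(s) into the flag list your .cfg already assembles
TBG_plugins="!TBG_checkpoint"
```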

Thank you,
It happens only with lengthy and larger-memory simulations. Indeed I use openMPI version 3.1.5 and HDF5. What are the alternatives? Do you suggest adding export OMPI_MCA_io=^ompio in my cfg file?

The combination should work if you have export OMPI_MCA_io=^ompio in your .profile, not your .cfg (re-source the profile for the change to take effect). Or, in case you run PIConGPU locally, just set it in the terminal before running it. This is a workaround for HDF5 + OpenMPI failing for certain sizes of output. In PIConGPU, particle output is of variable size, so a "bad size" could occur partway through a simulation. Your case might be hitting this issue, but of course this is just one of many things that could have gone wrong.
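
Concretely, the workaround looks like this (the export line is the one from this thread; the leading ^ is standard OpenMPI MCA syntax for excluding a component, here the ompio MPI-IO component, so MPI-IO falls back to the ROMIO implementation):

```bash
# add this line to ~/.profile, then re-source so the current shell
# picks up the change before launching PIConGPU
export OMPI_MCA_io=^ompio
source ~/.profile
```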

I have to admit that this error was due to a lack of free space on the disk. Shame on me again.

Glad it is resolved, @cbontoiu. Anyway, please do not hesitate to report questions and issues, even if something might be wrong on your side. Once this is on GitHub, we and other users can search through it.
